This module provides lazy stream encoding/decoding facilities for UTF-8,
the Unicode Transformation Format with 8-bit words.

Created 2002-09-02 by Sven Moritz Hallberg <pesco@gmx.de>

> {-# LANGUAGE DeriveDataTypeable #-}
> module UTF8
>   ( Codepoint
>   , Error(..)
>   , Error_Type(..)
>   , encode             -- :: String -> [Word8]
>   , decode             -- :: [Word8] -> String
>   , soft_decode        -- :: [Word8] -> [Either Error Char]
>   , encode_char        -- :: Char -> [Word8]
>   , decode_char        -- :: [Word8] -> (Either Error Char, [Word8])
>   , decode_codepoint   -- :: [Word8] -> (Either Error Codepoint, [Word8])
>   , decode_single_cp   -- :: [Word8] -> Codepoint
>   , bytes_expected     -- :: [Word8] -> Int
>   , codepoint_enclen   -- :: Codepoint -> Int
>   , char_enclen        -- :: Char -> Int
>   ) where

> import Data.Char (ord, chr, isAscii)
> import Data.Word (Word8, Word16, Word32)
> import Data.Bits (Bits, bitSize, testBit, shiftL, shiftR, (.&.), (.|.))
> import Data.Typeable (Typeable)
> import Control.OldException (throwDyn)


///- UTF-8 in General -///

Adapted from the Unicode standard, version 3.2,
Table 3.1 "UTF-8 Bit Distribution" (excluded are UTF-16 encodings):

  Scalar                    1st Byte  2nd Byte  3rd Byte  4th Byte
          000000000xxxxxxx  0xxxxxxx
          00000yyyyyxxxxxx  110yyyyy  10xxxxxx
          zzzzyyyyyyxxxxxx  1110zzzz  10yyyyyy  10xxxxxx
  000uuuzzzzzzyyyyyyxxxxxx  11110uuu  10zzzzzz  10yyyyyy  10xxxxxx

Also from the Unicode standard, version 3.2,
Table 3.1B "Legal UTF-8 Byte Sequences":

  Code Points         1st Byte  2nd Byte  3rd Byte  4th Byte
    U+0000..U+007F    00..7F
    U+0080..U+07FF    C2..DF    80..BF
    U+0800..U+0FFF    E0        A0..BF    80..BF
    U+1000..U+CFFF    E1..EC    80..BF    80..BF
    U+D000..U+D7FF    ED        80..9F    80..BF
    U+D800..U+DFFF    ill-formed
    U+E000..U+FFFF    EE..EF    80..BF    80..BF
   U+10000..U+3FFFF   F0        90..BF    80..BF    80..BF
   U+40000..U+FFFFF   F1..F3    80..BF    80..BF    80..BF
  U+100000..U+10FFFF  F4        80..8F    80..BF    80..BF


///- Encoding Functions -///

Must the encoder ensure that no illegal byte sequences are output or
can we trust the Haskell system to supply only legal values?
For now I include error cases for the surrogate values U+D800..U+DFFF and
out-of-range scalars.

The function is pretty much a transscript of table 3.1B with error checks.
It dispatches the actual encoding to functions specific to the number of
required bytes.

> type Codepoint = Int

> encode_char :: Char -> [Word8]
> encode_char = map fromIntegral . encode_cp_int . ord

> encode_cp_int :: Codepoint -> [Int]
> encode_cp_int n
>-- The report guarantees in (6.1.2) that this won't happen:
>--   | n < 0       = error "ord returned a negative value"
>     | n < 0x0080  = encode_cp_onebyte n
>     | n < 0x0800  = encode_cp_twobyte n
>     | n < 0xD800  = encode_cp_threebyte n
>     | n < 0xE000  = error "ord returned a surrogate value"
>     | n < 0x10000 = encode_cp_threebyte n
>-- Haskell 98 only talks about 16 bit characters, so below here is
>-- currently extension-land.
>     | n < 0x10FFFF      = encode_cp_fourbyte n
>     | otherwise  = error "ord returned a value above 0x10FFFF"


With the above, a stream decoder is trivial:

> encode :: [Char] -> [Word8]
> encode = concatMap encode_char


Now follow the individual encoders for certain numbers of bytes...
          _
         / |  __  ___  __ __
        / ^| //  /__/ // //
       /.==| \\ //_  // //
It's  //  || // \_/_//_//_  and it's here to stay!

> encode_cp_onebyte :: Codepoint -> [Int]
> encode_cp_onebyte cp = [cp]


00000yyyyyxxxxxx -> 110yyyyy 10xxxxxx

> encode_cp_twobyte :: Codepoint -> [Int]
> encode_cp_twobyte cp = [(0xC0 .|. ys), (0x80 .|. xs)]
>     where
>     ys = shiftR cp 6
>     xs = cp .&. 0x3F


zzzzyyyyyyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx

> encode_cp_threebyte :: Codepoint -> [Int]
> encode_cp_threebyte cp = [(0xE0.|.zs), (0x80.|.ys), (0x80.|.xs)]
>     where
>     xs = cp .&. 0x3F
>     ys = (shiftR cp 6) .&. 0x3F
>     zs = shiftR cp 12


000uuuzzzzzzyyyyyyxxxxxx -> 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx

> encode_cp_fourbyte :: Codepoint -> [Int]
> encode_cp_fourbyte cp = [0xF0.|.us, 0x80.|.zs, 0x80.|.ys, 0x80.|.xs]
>     where
>     xs = cp .&. 0x3F
>     ys = (shiftR cp 6) .&. 0x3F
>     zs = (shiftR cp 12) .&. 0x3F
>     us = shiftR cp 18


///- Decoding -///

The decoding is a bit more involved. The byte sequence could contain all
sorts of corruptions. The user must be able to either notice or ignore these
errors.

I will first look at the decoding of a single character. The process
consumes a certain number of bytes from the input. It returns the
remaining input and either an error or the decoded character. In case of
error, the erroneous byte sequence is skipped so decoding can still be
attempted on the rest of the stream. Errors are represented by the following
data type and always include the skipped byte sequence for inspection. There
is also a field to hold the position of the error in the overall byte stream
being decoded. The position is an index into the stream.

> data Error = Error Error_Type Int [Word8]  deriving (Eq, Show, Typeable)
> data Error_Type

The first byte in a sequence starts with either zero, two, three, or four
ones and one zero to indicate the length of the sequence. If it doesn't,
it is invalid. It is skipped and the next byte interpreted as the start
of a new sequence.

>     = Invalid_First_Byte

All bytes in the sequence except the first match the bit pattern 10xxxxxx.
If one doesn't, it is invalid. The sequence up to that point is skipped
and the "invalid" byte interpreted as the start of a new sequence. The
erroneous byte sequence field in type Error will include the inappropriate last
byte.

>     | Invalid_Later_Byte   -- last byte does not match 10xxxxxx

If a sequence ends prematurely, it has been truncated. It is skipped.

>     | Truncated            -- only n of m expected bytes were present

Some sequences would represent code points which would be encoded as a
shorter sequence by a conformant encoder. Such non-shortest sequences are
considered erroneous and skipped.

>     | Non_Shortest         -- n instead of m bytes were used

Unicode code points are in the range of [0..0x10FFFF]. Any values outside
of those bounds are simply invalid.

>     | Value_Out_Of_Bounds

The notion of "surrogate pairs" was removed from UTF-8. The corresponding code
points now form illegal byte sequences.

>     | Surrogate
>       deriving (Show, Eq, Typeable)


Second, third, and fourth bytes share the common requirement to start
with the bit sequence 10. So, here's the function to check that property.

> first_bits_not_10 :: Word8 -> Bool
> first_bits_not_10 b
>     | (b.&.0xC0) /= 0x80  = True
>     | otherwise           = False


The single-character decoding function's return type is a pair:
  - The first component contains the decoded character or an error.
  - The second component contains the remaining bytes of input.
The index field of the errors returned by the single-character functions is
always 0.
 
> decode_char :: [Word8] -> (Either Error Char, [Word8])
> decode_char = f . decode_codepoint
>   where f (e, bs) = (either Left (Right . chr) e, bs)

> decode_codepoint :: [Word8] -> (Either Error Codepoint, [Word8])
> decode_codepoint bs@(b1:rest)
>     | b1 < 0x80  = decode_cp_onebyte bs
>     | b1 < 0xC0  = (Left (Error Invalid_First_Byte 0 [b1]), rest)
>     | b1 < 0xE0  = decode_cp_twobyte bs
>     | b1 < 0xF0  = decode_cp_threebyte bs
>     | b1 < 0xF8  = decode_cp_fourbyte bs
>     | otherwise  = (Left (Error Invalid_First_Byte 0 [b1]), rest)


0xxxxxxx -> 000000000xxxxxxx

> decode_cp_onebyte :: [Word8] -> (Either Error Codepoint, [Word8])
> decode_cp_onebyte (b:bs) = (Right (fromIntegral b), bs)


110yyyyy 10xxxxxx -> 00000yyyyyxxxxxx

> decode_cp_twobyte :: [Word8] -> (Either Error Codepoint, [Word8])
> decode_cp_twobyte (b1:[])
>     = (Left (Error Truncated 0 [b1]), [])
> decode_cp_twobyte (b1:b2:bs)
>     | b1 < 0xC2
>         = (Left (Error Non_Shortest 0 [b1,b2]), bs)
>     | first_bits_not_10 b2
>         = (Left (Error Invalid_Later_Byte 0 [b1,b2]), (b2:bs))
>     | otherwise
>         = (Right result, bs)
>     where
>     xs = fromIntegral (b2.&.0x3F)
>     ys = fromIntegral (b1.&.0x1F)
>     result = shiftL ys 6 .|. xs


1110zzzz 10yyyyyy 10xxxxxx -> zzzzyyyyyyxxxxxx

> decode_cp_threebyte :: [Word8] -> (Either Error Codepoint, [Word8])
> decode_cp_threebyte (b1:[])   = (Left (Error Truncated 0 [b1]), [])
> decode_cp_threebyte (b1:b2:[]) = (Left (Error Truncated 0 [b1,b2]), [])
> decode_cp_threebyte bs@(b1:b2:b3:rest)
>     | first_bits_not_10 b2
>         = (Left (Error Invalid_Later_Byte 0 [b1,b2]), b2:b3:rest)
>     | first_bits_not_10 b3
>         = (Left (Error Invalid_Later_Byte 0 [b1,b2,b3]), b3:rest)
>     | result < 0x0080
>         = (Left (Error Non_Shortest 0 [b1,b2,b3]), rest)
>     | result < 0x0800
>         = (Left (Error Non_Shortest 0 [b1,b2,b3]), rest)
>     | result >= 0xD800 && result < 0xE000
>         = (Left (Error Surrogate 0 [b1,b2,b3]), rest)
>     | otherwise
>         = (Right result, rest)
>     where
>     xs = fromIntegral (b3.&.0x3F)
>     ys = fromIntegral (b2.&.0x3F)
>     zs = fromIntegral (b1.&.0x0F)
>     result = shiftL zs 12 .|. shiftL ys 6 .|. xs


11110uuu 10zzzzzz 10yyyyyy 10xxxxxx -> 000uuuzzzzzzyyyyyyxxxxxx

> decode_cp_fourbyte :: [Word8] -> (Either Error Codepoint, [Word8])
> decode_cp_fourbyte (b1:[])       = (Left (Error Truncated 0 [b1]), [])
> decode_cp_fourbyte (b1:b2:[])    = (Left (Error Truncated 0 [b1,b2]), [])
> decode_cp_fourbyte (b1:b2:b3:[]) = (Left (Error Truncated 0 [b1,b2,b3]), [])
> decode_cp_fourbyte (b1:b2:b3:b4:rest)
>     | first_bits_not_10 b2
>         = (Left (Error Invalid_Later_Byte 0 [b1,b2]), b2:b3:b4:rest)
>     | first_bits_not_10 b3
>         = (Left (Error Invalid_Later_Byte 0 [b1,b2,b3]), b3:b4:rest)
>     | first_bits_not_10 b4
>         = (Left (Error Invalid_Later_Byte 0 [b1,b2,b3,b4]), b4:rest)
>     | result < 0x10000
>         = (Left (Error Non_Shortest 0 [b1,b2,b3,b4]), rest)
>     | result > 0x10FFFF
>         = (Left (Error Value_Out_Of_Bounds 0 [b1,b2,b3,b4]), rest)
>     | otherwise
>         = (Right result, rest)
>     where
>     xs = fromIntegral (b4 .&. 0x3F)
>     ys = fromIntegral (b3 .&. 0x3F)
>     zs = fromIntegral (b2 .&. 0x3F)
>     us = fromIntegral (b1 .&. 0x07)
>     result = xs .|. shiftL ys 6 .|. shiftL zs 12 .|. shiftL us 18


The decode function takes a byte sequence and turns it into a string. If it
encounters an error, it throws it with the index field set appropriately.

> decode :: [Word8] -> String
> decode = throw_first_error . soft_decode

> soft_decode :: [Word8] -> [Either Error Char]

The function soft_decode is like decode but does not throw exceptions. It
reports errors in-line with the successfully decoded characters.

> soft_decode = index_errors . foo
>     where
>     foo [] = []
>     foo bs = let (x,bs') = decode_char bs in (x : foo bs')

Index_errors takes the list of Eithers inside soft_decode and sets each error's
index relative to the overall sequence.

> index_errors :: [Either Error Char] -> [Either Error Char]
> index_errors = zipWith (\i -> either (Left . update_index i) Right) [0..]
>     where update_index i (Error t _ bs) = Error t i bs

Throw_first_error takes the list of Eithers from index_errors and throws the
first index-error pair it encounters. If there are no errors, it returns
the correctly decoded String.

> throw_first_error :: [Either Error Char] -> String
> throw_first_error = map (either throwDyn id)


///- Properties for QuickCheck -///

> {-

> import Test.QuickCheck
> import Control.Monad (liftM)

> instance Arbitrary Char where
>     arbitrary = do i <- choose (0,max_char_code)
>                    if not (isSurrogate i || isNoncharacter i)
>                       then return (chr i)
>                       else arbitrary
>     coarbitrary = variant . ord

> max_char_code :: Int
> max_char_code = 0x10FFFF
>-- max_char_code = 0xFF                  -- Hugs doesn't know Unicode. :(

> ascii_char :: Gen Char
> ascii_char = liftM chr (choose (0,0x0080))

The simplest property of the codec is that decoding an encoded string
yields the original without errors.

> prop_DecEnc s = (decode.encode) s == s

An important characteristic of UTF-8 is that ASCII codes are encoded
to themselves.

> prop_Ascii = forAll ascii_char $
>              \c -> encode [c] == [fromIntegral (ord c)]

Concatenation of UTF-8 sequences is equivalent to concatenation of the
strings they represent.

> prop_Concat s t = encode s ++ encode t == encode (s ++ t)

This function checks whether a given Unicode scalar value represents a
noncharacter.

> isNoncharacter :: Int -> Bool
> isNoncharacter cp = cp `elem` noncharacters

> noncharacters :: [Int]
> noncharacters = [0xFDD0..0xFDEF] ++
>                 [n + v | n <- [0x0,0x10000..0x100000],
>                          v <- [0xFFFE, 0xFFFF] ]

And here one to see if it's a surrogate.

> isSurrogate :: Int -> Bool
> isSurrogate cp = (cp >= 0xD800 && cp < 0xE000)

> -}


///- Utility Functions -///

> char_enclen :: Char -> Int
> char_enclen = codepoint_enclen . ord

> codepoint_enclen :: Codepoint -> Int
> codepoint_enclen n
>     | n < 0x0080   = 1
>     | n < 0x0800   = 2
>     | n < 0xD800   = 3
>     | n < 0xE000   = error "ord returned a surrogate value"
>     | n < 0x10000  = 3
>     | n < 0x10FFFF = 4
>     | otherwise    = error "ord returned a value above 0x10FFFF"

> bytes_expected (b:_)
>     | b < 0x80  = 1
>     | b < 0xC0  = throwDyn (Error Invalid_First_Byte 0 [b])
>     | b < 0xE0  = 2
>     | b < 0xF0  = 3
>     | b < 0xF8  = 4
>     | otherwise = throwDyn (Error Invalid_First_Byte 0 [b])

> decode_single_cp :: [Word8] -> Codepoint
> decode_single_cp = either throwDyn id . fst . decode_codepoint


///- Some Random Convenience Functions -///

> showBits :: Bits a => a -> String
> showBits w = map showBit (bits w)

> bits :: Bits a => a -> [Bool]
> bits w = map (testBit w) (reverse [0..bitSize w - 1])

> showBit True = '1'
> showBit False = '0'