Unicode
Character Encoding Systems
- ASCII (American Standard Code for Information Interchange)
- Developed from telegraph codes
- 7-bit code
- 128 characters
- 95 printable characters
- EBCDIC (Extended Binary Coded Decimal Interchange Code)
- Developed from punch card codes
- Used on IBM mainframes
- 8-bit code
- Lowercase before uppercase
- Alphabet not contiguous
- Unicode
- 136755 characters
- UTF-8 encoding: 8-bit code units, 1 to 4 per character
- UTF-16 encoding: 16-bit code units, 1 or 2 per character
- UTF-32 encoding: one 32-bit code unit per character
- Basic Multilingual Plane: characters for modern languages and symbols
- Supplementary Multilingual Plane: historic scripts, symbols for some fields (music, mathematics), most emoji
- Supplementary Ideographic Plane: CJK ideographs
- Other supplementary planes: special-purpose characters and private use areas
- UCS-2, UCS-4
- UCS-4 is essentially the same as UTF-32; UCS-2 is like UTF-16 but fixed at one 16-bit unit, so it covers only the BMP
UTF-8
- Backwards compatible with ASCII
- Can encode all Unicode characters
- Uses 1-4 bytes
- One-byte characters (U+0000 to U+007F): 0xxxxxxx
- Two-byte characters (U+0080 to U+07FF): 110xxxxx 10xxxxxx
- Three-byte characters (U+0800 to U+FFFF): 1110xxxx 10xxxxxx 10xxxxxx
- Four-byte characters (U+10000 to U+10FFFF): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- Used by most web sites
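The bit patterns above can be turned into a small encoder. The following is a minimal sketch, not taken from the course files: the function name utf8_encode and its return convention (0 for values that cannot be encoded) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * (U+D800..U+DFFF) or lies above U+10FFFF.
 * The name and return convention are illustrative assumptions. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For example, U+0041 (A) stays the single byte 41, which is the ASCII compatibility noted above, while U+20AC (the euro sign) becomes the three bytes E2 82 AC.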
UTF-16
- Can encode all BMP characters with one 16-bit integer
- Characters outside the BMP are encoded as a pair of 16-bit integers (a surrogate pair)
- Used by Java
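The split between single units and pairs can be sketched in C. This is an illustrative sketch only: the function name utf16_encode is an assumption, and the arithmetic is the standard surrogate-pair scheme (subtract 0x10000, then put the high 10 bits in a unit starting at 0xD800 and the low 10 bits in a unit starting at 0xDC00).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 code units into out (room for 2 units).
 * Returns the number of units written, or 0 if cp is a surrogate value
 * or lies above U+10FFFF. The name is an illustrative assumption. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* reserved for surrogate pairs */
    if (cp < 0x10000) {                    /* BMP: one 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                  /* supplementary planes: a pair */
        cp -= 0x10000;                     /* leaves a 20-bit value */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

With this sketch, U+20AC fits in the single unit 20AC, while U+1F600 becomes the pair D83D DE00.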
Byte Ordering
- Some machines are big-endian (most significant byte first)
- Others are little-endian (least significant byte first)
- IBM mainframes, Motorola 68000 series are big-endian
- Intel architecture is little-endian
- Network byte order (for IP addresses, etc.) is big-endian
- ARM, SPARC, MIPS, IA-64 are bi-endian (they can run in either mode)
- BOM (byte order mark) U+FEFF at the start of a stream indicates endianness: in UTF-16, bytes FE FF mean big-endian and FF FE mean little-endian
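Both ideas in this list can be checked from C. The sketch below is hypothetical helper code, not part of the course files: host_is_little_endian inspects the first byte of a 16-bit integer in memory, and utf16_bom looks at the first two bytes of a buffer for a UTF-16 byte order mark. All names, including the bom_order enum, are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the host stores the least significant byte first,
 * 0 if it stores the most significant byte first. */
int host_is_little_endian(void)
{
    uint16_t probe = 1;
    /* Reading an object through unsigned char is always permitted. */
    return *(const unsigned char *)&probe == 1;
}

enum bom_order { BOM_NONE, BOM_BIG_ENDIAN, BOM_LITTLE_ENDIAN };

/* Inspect the start of a UTF-16 byte stream for the BOM U+FEFF. */
enum bom_order utf16_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return BOM_BIG_ENDIAN;      /* FE FF: most significant byte first */
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return BOM_LITTLE_ENDIAN;   /* FF FE: least significant byte first */
    }
    return BOM_NONE;                    /* no mark: order must be known otherwise */
}
```

A reader of UTF-16 data would typically call utf16_bom on the first bytes of the stream and swap the bytes of each unit when the result differs from the host order.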
Wide Characters in C
- wchar_t
- Can be 16 bits or 32 bits
- Use UTF-16 (UCS-2) or UTF-32 (UCS-4)
- Actual implementation is system dependent
- The GNU C library uses UTF-32 on all implementations
- wint_t replaces int when reading characters (to accommodate WEOF, the wide counterpart of EOF)
- Wide string and character functions: wcslen, wcscmp, wcstof, iswalpha, towlower, etc.
- In strings, use \uxxxx (four hex digits) or \Uxxxxxxxx (eight hex digits)
- Wide character string: L"..."
- Wide character constant: L'x'
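A short sketch of how these pieces fit together follows. The helper wide_lower is a hypothetical name, not one of the course examples. It walks a wchar_t string, classifies each character with iswalpha, and lowercases it with towlower. Which non-ASCII characters count as alphabetic depends on the current locale, so the sketch is only guaranteed to behave as described for ASCII letters.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Lowercase a wide string in place.
 * Returns the number of characters that iswalpha classified as letters.
 * The name is an illustrative assumption. */
size_t wide_lower(wchar_t *s)
{
    size_t letters = 0;

    for (; *s != L'\0'; s++) {
        if (iswalpha((wint_t)*s)) {
            letters++;
            *s = (wchar_t)towlower((wint_t)*s);
        }
    }
    return letters;
}
```

Called on a modifiable array initialized from L"Hello, World", the function returns 10 and leaves L"hello, world" in the array. A literal such as L"caf\u00e9" has wcslen 4, because each wide character occupies one wchar_t regardless of how many bytes it would take in UTF-8.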
Multibyte Characters in C
- External representations (like UTF-8)
- Characters are not all the same length
- Multibyte character and string functions: mblen, mbtowc, wctomb, etc.
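Because multibyte characters vary in length, counting characters is not the same as counting bytes. The sketch below uses mbrlen, the restartable relative of mblen, to step through a string one character at a time. The function name mb_count is an assumption, and the result depends on the encoding of the current locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in a multibyte string, using the encoding of the
 * current locale. Returns (size_t)-1 if the string contains an invalid
 * or incomplete sequence. The name is an illustrative assumption. */
size_t mb_count(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);   /* bytes left, excluding the terminator */
    size_t count = 0;

    memset(&state, 0, sizeof state);          /* initial conversion state */
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;                /* invalid or truncated sequence */
        if (n == 0)
            break;                            /* null character reached */
        s += n;                               /* skip the whole character */
        remaining -= n;
        count++;
    }
    return count;
}
```

In the default C locale an ASCII string such as "hello" gives 5. After a successful setlocale(LC_ALL, "en_US.UTF-8"), assuming that locale is installed, a UTF-8 string containing accented letters would give fewer characters than strlen reports bytes.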
Locales
- Determine how input streams are interpreted (which character set and number conventions)
- Determine how output is printed (date format, number format, etc.)
- The C locale is the default locale (it's "minimal")
- Example locales: en_US.UTF-8 ko_KR.UTF-8 pt_BR.UTF-8
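A program starts in the C locale and has to ask for another one with setlocale. Since a given locale may not be installed on every system, one possible approach is to try several names in order. The sketch below does this. The helper name select_locale is an assumption, while setlocale and LC_ALL are standard C.

```c
#include <locale.h>
#include <stddef.h>

/* Try each locale name in order and switch to the first one the system
 * accepts. Returns the accepted name, or NULL if none could be set, in
 * which case the locale is left unchanged.
 * The name is an illustrative assumption. */
const char *select_locale(const char *const names[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (setlocale(LC_ALL, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}
```

A caller might pass the list { "en_US.UTF-8", "C" }. Passing the empty string "" as a name asks setlocale to use the locale configured in the user's environment, which is often the right first choice.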
nonascii.c
wcat.c