Unicode

Character Encoding Systems

ASCII (American Standard Code for Information Interchange)
- Developed from telegraph codes
- 7 bit code
- 128 characters
- 95 printable characters
EBCDIC (Extended Binary Coded Decimal Interchange Code)
- Developed from punch card codes
- Used on IBM mainframes
- 8 bit code
- Lowercase before uppercase
- Alphabet not contiguous
Unicode
- 136755 characters (the repertoire grows with each Unicode version)
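- UTF-8 encoding: 8-bit code units, 1 to 4 bytes per character
- UTF-16 encoding: 16-bit code units, 1 or 2 units per character
- UTF-32 encoding: one 32-bit code unit per character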
- Basic Multilingual Plane: characters for modern languages and symbols
- Supplementary Multilingual Plane: historic scripts, symbols for some fields, most emoji
- Supplementary Ideographic Plane: CJK ideographs
- Other supplementary planes: special-purpose characters and private use areas
UCS-2, UCS-4
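- UCS-2 is fixed-width 16-bit and covers only the Basic Multilingual Plane; UTF-16 extends it with surrogate pairs
- UCS-4 is essentially the same as UTF-32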

wchar_t
Can be 16 bits or 32 bits
Typically holds UTF-16 (or UCS-2) code units, or UTF-32 (UCS-4) code points
Actual implementation is system dependent
The GNU C library uses UTF-32 on all implementations
wint_t replaces int when reading characters (to accommodate WEOF, a value distinct from every valid wide character)
Wide string and character functions: wcslen, wcscmp, wcstof, iswalpha, towlower, etc.
In strings, write a character by its code point with \uxxxx (four hex digits) or \Uxxxxxxxx (eight hex digits)
Wide character string: L"..."
Wide character constant: L'x'

Locales
Determine how input streams are interpreted (character set, number conventions)
Determine how output is printed (date format, number format, etc.)
The C locale is the default locale at program startup (it's "minimal")
Example locales: en_US.UTF-8 ko_KR.UTF-8 pt_BR.UTF-8