Character Conversions and Mapping Tables
Presented by: Markus Scherer, George Rhoten, Raghuram (Ram) Viswanadha
Globalization Center of Competency, Cupertino, CA

Agenda
• Introduction
• Terminology & Concepts
• Problems
• Solutions & Tools
• Summary

Introduction
• Text data used to be contained on a single computer system
• Now text data is exchanged among different systems
• Each type of system used different ways to encode text
• Exchanging this text data requires a conversion
• Text data is increasingly machine processed
• Main emphasis on Unicode text processing

Terminology
• System
• Character set
• Code point
• Encoding/Charset
• Character mapping
• Alias

Concept of Character Mapping
[Diagram: mappings between Unicode and a character set, labeled "roundtrip" and "fallback"; characters shown: A, A, Á]

Character Mapping (continued)
[Diagram: mappings between Unicode and a character set, labeled "roundtrip" and "reverse fallback"; characters shown: V, V]

Doing a Conversion
[Diagram: repertoires as superset/subset; Unicode and ISO-8859-1]

Doing a Conversion (continued)
[Diagram: IBM SJIS and Sun SJIS each mapped to Unicode; the two mappings are 99.8% the same]

Text Data Exchange Problems
• Unable to read text from another system
• Unable to write correct text for other processes
• Loss of text data because of mistakes
  – Maybe partial loss of data due to rare and obscure details
  – Happens more often to multibyte and stateful encodings
• New Unicode character added and mapping changes
  – Character was mapped to PUA
  – Character is now mapped to a new Unicode character

Problems (continued)
• Support of different repertoires of characters
• Different text encoding models
  – Different bidi text models
    • Visual order
    • Logical order
    • Explicit embedding
  – Composed and decomposed characters
  – Shaping (Arabic)
  – Reordering (Indic, Thai, etc.)
  – Different ligatures

Examples
• µ vs. μ: Micro symbol (U+00B5) vs. Greek Mu (U+03BC)
• - vs. –: Hyphen-Minus (U+002D) vs. En Dash (U+2013)
• \ vs. ¥: Backslash (U+005C) vs. Yen symbol (U+00A5)
• ~ vs. ¯: Tilde (U+007E) vs. Overline (U+00AF)
• NUL→☺: Graphical display of control characters
• NL↔LF: Newline swapped with Linefeed
• 0x1C, 0x1A, 0x7F: ISO Control rotation

Reasons For Problems
• Different mapping tables
• Fallbacks supported inconsistently
• Mapping tables were not shared
• Mapping tables were not published in a machine-readable format
• Aliases
  – Existing registries (IANA, MIME, …) do not specify precise mappings
  – Different mapping tables for the same name (CP943, SJIS)
  – Different names for the same character set

Solutions
• Use precise names
• Use precise mapping tables
• Avoid fallbacks (controllable e.g. with ICU)
• Share the character set mappings
  – e.g. format: http://www.unicode.org/unicode/reports/tr22/

Solutions (continued)
• Do safe conversions
  – Exact subsets and supersets
  – Use precise replacements for unavailable characters (NCRs and escapes)
  – Algorithmic
    • JIS X 0208: SJIS ↔ EUC-JP ↔ ISO 2022-JP
    • All Unicode encodings among each other

Tools
• ICU (International Components for Unicode)
  – Feature-rich converter API
  – Can match the conversion behavior of most other systems
  – http://oss.software.ibm.com/icu/
• Unicode mapping table repositories
  – http://www.unicode.org/Public/MAPPINGS/
  – http://oss.software.ibm.com/icu/charset/
• iconv() and other platform converters

Summary
• Text data exchange can result in loss of data
• Using Unicode is safe because it needs no conversion
• Conversion mapping tables are unsafe
• Use ICU ☺

Thank you for listening
Are there any questions?
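
The "avoid fallbacks" and "precise replacements (NCRs and escapes)" advice from the Solutions slides can be illustrated with a small sketch. It uses Python's built-in codecs rather than ICU, which is an assumption made only to keep the example self-contained; the helper names `encode_strict`, `encode_with_ncr`, and `roundtrips` are hypothetical and do not come from the slides. The sample characters are the Micro symbol and Greek Mu from the Examples slide.

```python
# Minimal sketch, assuming Python's standard codecs stand in for a converter
# such as ICU. All function names here are illustrative, not from the slides.


def encode_strict(text: str, charset: str) -> bytes:
    """Encode with no fallback: raise UnicodeEncodeError if any character
    is outside the target character set's repertoire."""
    return text.encode(charset, errors="strict")


def encode_with_ncr(text: str, charset: str) -> bytes:
    """Encode, replacing each unavailable character with a numeric
    character reference (NCR) such as &#956; instead of a look-alike."""
    return text.encode(charset, errors="xmlcharrefreplace")


def roundtrips(text: str, charset: str) -> bool:
    """Return True if text survives Unicode -> charset -> Unicode unchanged."""
    try:
        encoded = text.encode(charset, errors="strict")
        return encoded.decode(charset, errors="strict") == text
    except UnicodeError:
        return False


if __name__ == "__main__":
    micro = "\u00b5"  # Micro symbol, present in ISO-8859-1
    mu = "\u03bc"     # Greek Mu, absent from ISO-8859-1

    print(encode_strict(micro, "iso-8859-1"))  # b'\xb5'
    print(encode_with_ncr(mu, "iso-8859-1"))   # b'&#956;'
    print(roundtrips(mu, "iso-8859-1"))        # False: Mu is not in the subset
    print(roundtrips(mu, "utf-16"))            # True: Unicode encodings are safe
```

Strict mode makes data loss visible as an error instead of silently substituting a similar-looking character, and the NCR form keeps the original code point recoverable by the receiver.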