conversions and mappings iuc19

Character Conversions and Mapping Tables
Presented By:
Markus Scherer
George Rhoten
Raghuram (Ram) Viswanadha
Globalization Center of Competency, Cupertino, CA
Agenda
• Introduction
• Terminology & Concepts
• Problems
• Solutions & Tools
• Summary
Introduction
• Text data used to be contained on a single computer system
• Now text data is exchanged among different systems
• Each type of system uses a different way to encode text
• Exchanging this text data requires a conversion
• Text data is increasingly machine processed
• Main emphasis on Unicode text processing
Terminology
• System
• Character set
• Code point
• Encoding/Charset
• Character mapping
• Alias
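A few of these terms can be made concrete with a short sketch. It uses Python's built-in codec registry purely as an illustration (the slides do not use Python); the sample characters and alias names are chosen here and are not taken from the talk.

```python
import codecs

# Code point: the number a character set assigns to a character.
code_point = ord("A")  # U+0041 in Unicode notation

# Encoding/charset: the byte representation of code points. The same
# character becomes different byte sequences in different encodings.
utf8_bytes = "\u00e9".encode("utf-8")          # é as two bytes
latin1_bytes = "\u00e9".encode("iso-8859-1")   # é as one byte

# Alias: another name for the same encoding. Python's registry resolves
# each of these names to one canonical codec name.
aliases = ["iso-8859-1", "latin-1", "cp819"]
canonical = {codecs.lookup(name).name for name in aliases}

print(hex(code_point), utf8_bytes, latin1_bytes, canonical)
```

Note that an alias only names a codec; it says nothing about whether two systems that use the same name also use the same mapping table.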
Concept of Character Mapping
[Figure: Unicode "A" and character-set "A" joined by a roundtrip mapping; Unicode "Á" maps to character-set "A" by a fallback mapping]
Character Mapping (continued)
[Figure: Unicode "V" and character-set "V" joined by a roundtrip mapping, plus a reverse fallback mapping from the character set to Unicode]
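The three kinds of mapping named on these two slides (roundtrip, fallback, reverse fallback) can be sketched with a toy table. Everything here is hypothetical: the character set, the byte values, and the helper names `from_unicode` and `to_unicode` are invented for illustration and do not come from any real mapping table.

```python
# Toy mapping table for a hypothetical single-byte character set.
# All byte values here are made up for illustration.
ROUNDTRIP = {"A": 0x41, "V": 0x56}      # valid in both directions
FALLBACK = {"\u00C1": 0x41}             # Unicode -> charset only: Á becomes A
REVERSE_FALLBACK = {0xD6: "V"}          # charset -> Unicode only

TO_UNICODE = {byte: ch for ch, byte in ROUNDTRIP.items()}
TO_UNICODE.update(REVERSE_FALLBACK)


def from_unicode(ch, allow_fallback=True):
    """Map one Unicode character to a byte of the toy character set."""
    if ch in ROUNDTRIP:
        return ROUNDTRIP[ch]
    if allow_fallback and ch in FALLBACK:
        return FALLBACK[ch]
    raise ValueError(f"no mapping for U+{ord(ch):04X}")


def to_unicode(byte):
    """Map one byte of the toy character set to a Unicode character."""
    if byte in TO_UNICODE:
        return TO_UNICODE[byte]
    raise ValueError(f"no mapping for byte 0x{byte:02X}")


print(to_unicode(from_unicode("A")))        # A survives the round trip
print(to_unicode(from_unicode("\u00C1")))   # Á comes back as plain A
print(hex(from_unicode(to_unicode(0xD6))))  # 0xD6 comes back as 0x56
```

Only the roundtrip entries preserve the original data in both directions. Passing `allow_fallback=False` turns the silent Á to A substitution into an error, which is the kind of control the later slides recommend.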
Doing a Conversion
[Figure: Repertoires as superset/subset, with ISO-8859-1 contained in Unicode]
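The superset/subset relation can be checked directly. The sketch below uses Python's `iso-8859-1` codec as a stand-in for a converter; the euro sign is just one example, picked here, of a Unicode character outside the ISO-8859-1 repertoire.

```python
# Subset -> superset: every ISO-8859-1 byte has a Unicode code point
# with the same numeric value, and the conversion round-trips.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")
code_points_match = all(ord(ch) == b for ch, b in zip(text, all_bytes))
roundtrips = text.encode("iso-8859-1") == all_bytes

# Superset -> subset: a character outside the repertoire cannot be
# converted without loss, so a strict conversion fails.
try:
    "\u20ac".encode("iso-8859-1")  # EURO SIGN
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(code_points_match, roundtrips, euro_fits)
```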
Doing a Conversion (continued)
[Figure: IBM SJIS and Sun SJIS each mapped to Unicode; the two mapping tables are 99.8% the same]
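The effect of two nearly identical tables can be reproduced with a stand-in. The IBM and Sun tables from the slide are not available in Python, so this sketch assumes Python's `shift_jis` and `cp932` codecs as an analogous pair of Shift-JIS variants; the two sample byte sequences are chosen here.

```python
# Two Shift-JIS variants shipped with Python, used as a stand-in for
# the IBM/Sun pair on the slide (an assumption, not the same tables).
common = b"\x82\xa0"   # HIRAGANA LETTER A in both variants
disputed = b"\x81\x60"  # a code on which the two tables disagree

same = common.decode("shift_jis") == common.decode("cp932")
as_shift_jis = disputed.decode("shift_jis")   # WAVE DASH, U+301C
as_cp932 = disputed.decode("cp932")           # FULLWIDTH TILDE, U+FF5E

print(same, f"U+{ord(as_shift_jis):04X}", f"U+{ord(as_cp932):04X}")
```

Text that passes through one variant on the way in and the other on the way out is mostly unharmed, which is why the few differing codes tend to go unnoticed until data is lost.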
Text Data Exchange Problems
• Unable to read text from another system
• Unable to write correct text for other processes
• Loss of text data because of mistakes
– Maybe partial loss of data due to rare and obscure details
– Happens more often to multibyte and stateful encodings
• New Unicode character added and mapping changes
– Character was mapped to PUA
– Character is now mapped to a new Unicode character
Problems (continued)
• Support of different repertoires of characters
• Different text encoding models
– Different bidi text models
• Visual order
• Logical order
• Explicit embedding
– Composed and decomposed characters
– Shaping (Arabic)
– Reordering (Indic, Thai, etc.)
– Different ligatures
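The composed versus decomposed difference in the list above is easy to demonstrate. This is a minimal sketch using Python's `unicodedata` module; the letter é and the ISO-8859-1 target are examples chosen here, not taken from the slides.

```python
import unicodedata

composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings look the same but are different code point sequences.
equal_raw = composed == decomposed

# Normalization converts between the two models.
to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", composed)

# A legacy character set that only has the composed form cannot take
# the decomposed sequence unless it is normalized first.
try:
    decomposed.encode("iso-8859-1")
    decomposed_fits = True
except UnicodeEncodeError:
    decomposed_fits = False
composed_bytes = to_nfc.encode("iso-8859-1")

print(equal_raw, to_nfc == composed, to_nfd == decomposed,
      decomposed_fits, composed_bytes)
```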
Examples
• Micro symbol µ (U+00B5) vs. Greek Mu μ (U+03BC)
• Hyphen-Minus - (U+002D) vs. En Dash – (U+2013)
• Backslash \ (U+005C) vs. Yen symbol ¥ (U+00A5)
• Tilde ~ (U+007E) vs. Overline ¯ (U+00AF)
• Graphical display of control characters: NUL→☺
• Newline swapped with Linefeed: NL↔LF
• ISO Control rotation: 0x1C, 0x1A, 0x7F
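The look-alike pairs on this slide are distinct code points even where they render almost identically. The sketch below lists them with their formal Unicode names using Python's `unicodedata` module; note that the formal names differ from the slide's labels in places (U+00AF is formally MACRON, U+005C is REVERSE SOLIDUS).

```python
import unicodedata

# The four pairs from the slide, written as escapes to avoid confusion.
pairs = [
    ("\u00b5", "\u03bc"),  # micro symbol vs. Greek mu
    ("\u002d", "\u2013"),  # hyphen-minus vs. en dash
    ("\u005c", "\u00a5"),  # backslash vs. yen symbol
    ("\u007e", "\u00af"),  # tilde vs. overline
]

for a, b in pairs:
    print(f"U+{ord(a):04X} {unicodedata.name(a)}"
          f" vs. U+{ord(b):04X} {unicodedata.name(b)}")

# A strict conversion keeps the two apart: ISO-8859-1 contains the
# micro symbol but not the Greek letter.
micro_bytes = "\u00b5".encode("iso-8859-1")
try:
    "\u03bc".encode("iso-8859-1")
    mu_fits = True
except UnicodeEncodeError:
    mu_fits = False

# Compatibility normalization folds the micro symbol into Greek mu.
folded = unicodedata.normalize("NFKC", "\u00b5")
```

A mapping table that sends one member of a pair to the code of the other produces text that looks right on screen but no longer compares equal to the original.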
Reasons For Problems
• Different mapping tables
• Fallbacks supported inconsistently
• Mapping tables were not shared
• Mapping tables were not published in machine-readable format
• Aliases
– Existing registries (IANA, MIME, …) do not specify precise mappings
– Different mapping tables for the same name (CP943, SJIS)
– Different names for the same character set
Solutions
• Use precise names
• Use precise mapping tables
• Avoid fallbacks (controllable e.g. with ICU)
• Share the character set mappings
– e.g. format: http://www.unicode.org/unicode/reports/tr22/
Solutions (continued)
• Do safe conversions
– Exact subsets and supersets
– Use precise replacements for unavailable characters (NCRs and escapes)
– Algorithmic
• JIS X 0208: SJIS ↔ EUC-JP ↔ ISO 2022-JP
• All Unicode encodings among each other
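Two of the safe conversions listed above can be sketched briefly: precise replacements (NCRs and escapes) for characters the target lacks, and lossless conversion between Unicode encodings. The example uses Python's standard codec error handlers rather than ICU, and the sample string is made up.

```python
import html

text = "price: 5 \u20ac"  # the euro sign is not in ISO-8859-1

# Precise replacements: the unavailable character is written as a
# numeric character reference or an escape, so it can be recovered.
ncr = text.encode("iso-8859-1", errors="xmlcharrefreplace")
esc = text.encode("iso-8859-1", errors="backslashreplace")

# Imprecise replacement: the character is replaced by "?" and is lost.
lossy = text.encode("iso-8859-1", errors="replace")

# The NCR form can be turned back into the original text.
recovered = html.unescape(ncr.decode("iso-8859-1"))

# Unicode encodings convert among each other without any loss.
utf16 = text.encode("utf-16-le")
utf8_via_utf16 = utf16.decode("utf-16-le").encode("utf-8")

print(ncr, esc, lossy, recovered == text,
      utf8_via_utf16 == text.encode("utf-8"))
```

The recovery step only works if the receiver knows which replacement convention was used, so the convention has to be agreed on just like the mapping table itself.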
Tools
• ICU (International Components for Unicode)
– Feature-rich converter API
– Can match the conversion behavior of most other systems
– http://oss.software.ibm.com/icu/
• Unicode mapping table repositories
– http://www.unicode.org/Public/MAPPINGS/
– http://oss.software.ibm.com/icu/charset/
• iconv() and other platform converters
Summary
• Text data exchange can result in loss of data
• Using Unicode throughout is safe because no conversion is needed
• Conversions that rely on mapping tables are unsafe
• Use ICU ☺
Thank you for listening
Are there any questions?