js-codepage/misc/codepage.md.utf7

191 lines
7.0 KiB
Plaintext

+ACM Getting Codepages
The fields of the +AGA-pages.csv+AGA manifest are +AGA-codepage,url,bytes+AGA (SBCS+AD0-1, DBCS+AD0-2)
Note that the Windows rendering is used for the Mac code pages. The primary
difference is the use of the private +AGA-0xF8FF+AGA code (which renders as an Apple
logo on macs but as garbage on other operating systems). It may be desirable
to fall back to the behavior, in which case the files are under APPLE and not
MICSFT. This affects codepages 10000, 10006, 10007, 10029, 10079, 10081
The numbering scheme for the +AGA-ISO-8859-X+AGA series is +AGA-28590 +- X+AGA:
+ACMAIw Generated Codepages
The following codepages are available in .NET on Windows:
- 708 Arabic (ASMO 708)
- 720 Arabic (Transparent ASMO)+ADs Arabic (DOS)
- 858 OEM Multilingual Latin 1 +- Euro symbol
- 870 IBM EBCDIC Multilingual/ROECE (Latin 2)+ADs IBM EBCDIC Multilingual Latin 2
- 1047 IBM EBCDIC Latin 1/Open System
- 1140 IBM EBCDIC US-Canada (037 +- Euro symbol)+ADs IBM EBCDIC (US-Canada-Euro)
- 1141 IBM EBCDIC Germany (20273 +- Euro symbol)+ADs IBM EBCDIC (Germany-Euro)
- 1142 IBM EBCDIC Denmark-Norway (20277 +- Euro symbol)+ADs IBM EBCDIC (Denmark-Norway-Euro)
- 1143 IBM EBCDIC Finland-Sweden (20278 +- Euro symbol)+ADs IBM EBCDIC (Finland-Sweden-Euro)
- 1144 IBM EBCDIC Italy (20280 +- Euro symbol)+ADs IBM EBCDIC (Italy-Euro)
- 1145 IBM EBCDIC Latin America-Spain (20284 +- Euro symbol)+ADs IBM EBCDIC (Spain-Euro)
- 1146 IBM EBCDIC United Kingdom (20285 +- Euro symbol)+ADs IBM EBCDIC (UK-Euro)
- 1147 IBM EBCDIC France (20297 +- Euro symbol)+ADs IBM EBCDIC (France-Euro)
- 1148 IBM EBCDIC International (500 +- Euro symbol)+ADs IBM EBCDIC (International-Euro)
- 1149 IBM EBCDIC Icelandic (20871 +- Euro symbol)+ADs IBM EBCDIC (Icelandic-Euro)
- 1361 Korean (Johab)
- 10001 Japanese (Mac)
- 10002 MAC Traditional Chinese (Big5)+ADs Chinese Traditional (Mac)
- 10003 Korean (Mac)
- 10004 Arabic (Mac)
- 10005 Hebrew (Mac)
- 10008 MAC Simplified Chinese (GB 2312)+ADs Chinese Simplified (Mac)
- 10010 Romanian (Mac)
- 10017 Ukrainian (Mac)
- 10021 Thai (Mac)
- 10082 Croatian (Mac)
- 20000 CNS Taiwan+ADs Chinese Traditional (CNS)
- 20001 TCA Taiwan
- 20002 ETEN Taiwan+ADs Chinese Traditional (ETEN)
- 20003 IBM5550 Taiwan
- 20004 TeleText Taiwan
- 20005 Wang Taiwan
- 20105 IA5 (IRV International Alphabet No. 5, 7-bit)+ADs Western European (IA5)
- 20106 IA5 German (7-bit)
- 20107 IA5 Swedish (7-bit)
- 20108 IA5 Norwegian (7-bit)
- 20261 T.61
- 20269 ISO 6937 Non-Spacing Accent
- 20273 IBM EBCDIC Germany
- 20277 IBM EBCDIC Denmark-Norway
- 20278 IBM EBCDIC Finland-Sweden
- 20280 IBM EBCDIC Italy
- 20284 IBM EBCDIC Latin America-Spain
- 20285 IBM EBCDIC United Kingdom
- 20290 IBM EBCDIC Japanese Katakana Extended
- 20297 IBM EBCDIC France
- 20420 IBM EBCDIC Arabic
- 20423 IBM EBCDIC Greek
- 20424 IBM EBCDIC Hebrew
- 20833 IBM EBCDIC Korean Extended
- 20838 IBM EBCDIC Thai
- 20866 Russian (KOI8-R)+ADs Cyrillic (KOI8-R)
- 20871 IBM EBCDIC Icelandic
- 20880 IBM EBCDIC Cyrillic Russian
- 20905 IBM EBCDIC Turkish
- 20924 IBM EBCDIC Latin 1/Open System (1047 +- Euro symbol)
- 20932 Japanese (JIS 0208-1990 and 0212-1990)
- 20936 Simplified Chinese (GB2312)+ADs Chinese Simplified (GB2312-80)
- 20949 Korean Wansung
- 21025 IBM EBCDIC Cyrillic Serbian-Bulgarian
- 21027 Extended/Ext Alpha Lowercase
- 21866 Ukrainian (KOI8-U)+ADs Cyrillic (KOI8-U)
- 29001 Europa 3
- 38598 ISO 8859-8 Hebrew+ADs Hebrew (ISO-Logical)
- 51932 EUC Japanese
- 51936 EUC Simplified Chinese+ADs Chinese Simplified (EUC)
- 51949 EUC Korean
- 52936 HZ-GB2312 Simplified Chinese+ADs Chinese Simplified (HZ)
- 54936 Windows XP and later: GB18030 Simplified Chinese (4 byte)+ADs Chinese Simplified (GB18030)
- 57002 ISCII Devanagari
- 57003 ISCII Bengali
- 57004 ISCII Tamil
- 57005 ISCII Telugu
- 57006 ISCII Assamese
- 57007 ISCII Oriya
- 57008 ISCII Kannada
- 57009 ISCII Malayalam
- 57010 ISCII Gujarati
- 57011 ISCII Punjabi
The following codepages are dependencies for Visual FoxPro:
- 620 Mazovia (Polish) MS-DOS
- 895 Kamenick+AP0 (Czech) MS-DOS
+ACMAIw Building Notes
The script +AGA-make.sh+AGA (described later) will get these files and massage the data
(printing code-Unicode pairs). The eventual tables are dropped in the paths
+AGA./codepages/+ADw-CODEPAGE+AD4.TBL+AGA. For example, the last 10 lines of +AGA-10000.TBL+AGA are
+AGAAYABgAD4
0xF6 0x02C6
0xF7 0x02DC
0xF8 0x00AF
0xF9 0x02D8
0xFA 0x02D9
0xFB 0x02DA
0xFC 0x00B8
0xFD 0x02DD
0xFE 0x02DB
0xFF 0x02C7
+AGAAYABg
which implies that code +AGA-0xF6+AGA is +AGA-String.fromCharCode(0x02C6)+AGA and vice versa.
+ACMAIw Windows-dependent build step
To build the sources on windows, consult +AGA-dotnet/MakeEncoding.cs+AGA.
After saving standard output to +AGA-out+AGA, the +AGA-dotnet.sh+AGA script processes results.
+ACM Building the script
+AGA-make.njs+AGA takes a codepage argument, reads the corresponding table file and
generates JS code for encoding and decoding:
+ACMAIw Raw Codepages
The DBCS and SBCS code generation strategies are different. The maximum code is
used to distinguish (max +AGA-0xFF+AGA for SBCS).
The Unicode character +AGA-0xFFFD+AGA (REPLACEMENT CHARACTER) is used as a placeholder
for characters that are not specified in the map (for example, +AGA-0xF0+AGA is not in
code page 10000).
For SBCS, the idea is to embed a raw string with the contents of the 256 codes.
The +AGA-dec+AGA field is merely a split of the string, and +AGA-enc+AGA is an eversion:
DBCS is similar, except that the space is sliced in chunks of 256 bytes (strings
are only generated for those high-bytes represented in the codepage).
The strategy is to construct an array-of-arrays so that +AGA-dd+AFs-high+AF0AWw-low+AF0AYA is the
character associated with the code. This array is combined at runtime to yield
the complete decoding object (and the encoding object is an eversion):
+AGA-make.sh+AGA generates the tables used by +AGA-make.njs+AGA. The raw Unicode TXT files
are columnar: +AGA-code unicode +ACM-comments+AGA. For example, the last 10 lines of the
text file +AGA-ROMAN.TXT+AGA (for CP 10000) are:
+AGAAYABgAD4
0xF6 0x02C6 +ACM-MODIFIER LETTER CIRCUMFLEX ACCENT
0xF7 0x02DC +ACM-SMALL TILDE
0xF8 0x00AF +ACM-MACRON
0xF9 0x02D8 +ACM-BREVE
0xFA 0x02D9 +ACM-DOT ABOVE
0xFB 0x02DA +ACM-RING ABOVE
0xFC 0x00B8 +ACM-CEDILLA
0xFD 0x02DD +ACM-DOUBLE ACUTE ACCENT
0xFE 0x02DB +ACM-OGONEK
0xFF 0x02C7 +ACM-CARON
+AGAAYABg
In processing the data, the comments (after the +AGAAIwBg) are stripped and undefined
elements (like +AGA-0x7F+AGA for CP 10000) are removed.
+ACMAIw Utilities
The encode and decode functions are kept in a separate script (+AGA-cputils.js+AGA).
Both encode and decode deal with data represented as:
- String (encode expects JS string, decode interprets UCS2 chars as codes)
- Array (encode expects array of JS String characters, decode expects numbers)
- Buffer (encode expects UTF-8 string, decode expects codepoints/bytes).
The +AGA-ofmt+AGA variable controls +AGA-encode+AGA output (+AGA-str+AGA, +AGA-arr+AGA respectively)
while the input format is automatically determined.
+ACM Nitty Gritty
+AGAAYABgAD4.vocrc
+AHs +ACI-post+ACI: +ACI-make js+ACI +AH0
+AGAAYABg