# Getting Codepages

The fields of the `pages.csv` manifest are `codepage,url,bytes` (SBCS=1, DBCS=2).

Note that the Windows rendering is used for the Mac code pages. The primary
difference is the use of the private `0xF8FF` code (which renders as an Apple
logo on Macs but as garbage on other operating systems). It may be desirable
to fall back to the Apple behavior, in which case the files are under APPLE and
not MICSFT. This affects codepages 10000, 10006, 10007, 10029, 10079, and 10081.

The numbering scheme for the `ISO-8859-X` series is `28590 + X`.

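As a minimal sketch of that numbering scheme, the helper below maps an ISO-8859 part number to a codepage number. The function name `iso8859ToCodepage` is hypothetical and not part of this project, and not every number it returns is necessarily present in the manifest.

```javascript
// Hypothetical helper (an assumption, not project code): apply the
// 28590 + X rule to an ISO-8859 part number X.
function iso8859ToCodepage(x) {
  if (!Number.isInteger(x) || x < 1 || x > 16) {
    throw new RangeError("ISO-8859 part must be an integer from 1 to 16");
  }
  return 28590 + x;
}

// ISO-8859-1 and ISO-8859-15 under this scheme
const latin1 = iso8859ToCodepage(1);
const latin9 = iso8859ToCodepage(15);
```
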
## Generated Codepages

The following codepages are available in .NET on Windows:

- 708 Arabic (ASMO 708)
- 720 Arabic (Transparent ASMO); Arabic (DOS)
- 858 OEM Multilingual Latin 1 + Euro symbol
- 870 IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2
- 1047 IBM EBCDIC Latin 1/Open System
- 1140 IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)
- 1141 IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)
- 1142 IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)
- 1143 IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)
- 1144 IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)
- 1145 IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)
- 1146 IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)
- 1147 IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)
- 1148 IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)
- 1149 IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)
- 1361 Korean (Johab)
- 10001 Japanese (Mac)
- 10002 MAC Traditional Chinese (Big5); Chinese Traditional (Mac)
- 10003 Korean (Mac)
- 10004 Arabic (Mac)
- 10005 Hebrew (Mac)
- 10008 MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)
- 10010 Romanian (Mac)
- 10017 Ukrainian (Mac)
- 10021 Thai (Mac)
- 10082 Croatian (Mac)
- 20000 CNS Taiwan; Chinese Traditional (CNS)
- 20001 TCA Taiwan
- 20002 ETEN Taiwan; Chinese Traditional (ETEN)
- 20003 IBM5550 Taiwan
- 20004 TeleText Taiwan
- 20005 Wang Taiwan
- 20105 IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)
- 20106 IA5 German (7-bit)
- 20107 IA5 Swedish (7-bit)
- 20108 IA5 Norwegian (7-bit)
- 20261 T.61
- 20269 ISO 6937 Non-Spacing Accent
- 20273 IBM EBCDIC Germany
- 20277 IBM EBCDIC Denmark-Norway
- 20278 IBM EBCDIC Finland-Sweden
- 20280 IBM EBCDIC Italy
- 20284 IBM EBCDIC Latin America-Spain
- 20285 IBM EBCDIC United Kingdom
- 20290 IBM EBCDIC Japanese Katakana Extended
- 20297 IBM EBCDIC France
- 20420 IBM EBCDIC Arabic
- 20423 IBM EBCDIC Greek
- 20424 IBM EBCDIC Hebrew
- 20833 IBM EBCDIC Korean Extended
- 20838 IBM EBCDIC Thai
- 20866 Russian (KOI8-R); Cyrillic (KOI8-R)
- 20871 IBM EBCDIC Icelandic
- 20880 IBM EBCDIC Cyrillic Russian
- 20905 IBM EBCDIC Turkish
- 20924 IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
- 20932 Japanese (JIS 0208-1990 and 0212-1990)
- 20936 Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)
- 20949 Korean Wansung
- 21025 IBM EBCDIC Cyrillic Serbian-Bulgarian
- 21027 Extended/Ext Alpha Lowercase
- 21866 Ukrainian (KOI8-U); Cyrillic (KOI8-U)
- 29001 Europa 3
- 38598 ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
- 51932 EUC Japanese
- 51936 EUC Simplified Chinese; Chinese Simplified (EUC)
- 51949 EUC Korean
- 52936 HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)
- 54936 Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
- 57002 ISCII Devanagari
- 57003 ISCII Bengali
- 57004 ISCII Tamil
- 57005 ISCII Telugu
- 57006 ISCII Assamese
- 57007 ISCII Oriya
- 57008 ISCII Kannada
- 57009 ISCII Malayalam
- 57010 ISCII Gujarati
- 57011 ISCII Punjabi

The following codepages are dependencies for Visual FoxPro:

- 620 Mazovia (Polish) MS-DOS
- 895 Kamenický (Czech) MS-DOS

## Building Notes

The script `make.sh` (described later) downloads these files and massages the
data into code-Unicode pairs. The resulting tables are written to
`./codepages/<CODEPAGE>.TBL`. For example, the last 10 lines of `10000.TBL` are:

```>
0xF6 0x02C6
0xF7 0x02DC
0xF8 0x00AF
0xF9 0x02D8
0xFA 0x02D9
0xFB 0x02DA
0xFC 0x00B8
0xFD 0x02DD
0xFE 0x02DB
0xFF 0x02C7
```

which implies that code `0xF6` decodes to `String.fromCharCode(0x02C6)` and that
character encodes back to `0xF6`.

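To make the two-way relationship concrete, here is a minimal sketch that reads table lines of this shape into a decode lookup and an encode lookup. The function name `parseTable` and the returned object shape are assumptions for illustration, not the project's actual generator.

```javascript
// Sketch (hypothetical helper): turn "code unicode" lines into two lookups.
// dec maps a codepage code to a character; enc maps the character back.
function parseTable(text) {
  const dec = {};
  const enc = {};
  for (const line of text.split(/\r?\n/)) {
    const fields = line.trim().split(/\s+/);
    if (fields.length < 2) continue;            // skip blank or partial lines
    const code = Number(fields[0]);             // Number() accepts the 0x prefix
    const unicode = Number(fields[1]);
    if (Number.isNaN(code) || Number.isNaN(unicode)) continue;
    const ch = String.fromCharCode(unicode);
    dec[code] = ch;
    enc[ch] = code;
  }
  return { dec, enc };
}

// Two lines taken from the 10000.TBL excerpt above
const table = parseTable("0xF6 0x02C6\n0xF7 0x02DC\n");
```

With this sample, `table.dec[0xF6]` is the character U+02C6 and `table.enc` maps that character back to `0xF6`.
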
## Windows-dependent build step

To build the sources on Windows, consult `dotnet/MakeEncoding.cs`.

After saving standard output to `out`, the `dotnet.sh` script processes the results.

# Building the script

`make.njs` takes a codepage argument, reads the corresponding table file, and
generates JS code for encoding and decoding.

## Raw Codepages

The DBCS and SBCS code generation strategies are different. The maximum code
distinguishes the two cases: a codepage whose codes never exceed `0xFF` is SBCS.

The Unicode character `0xFFFD` (REPLACEMENT CHARACTER) is used as a placeholder
for characters that are not specified in the map (for example, `0xF0` is not in
code page 10000).

For SBCS, the idea is to embed a raw string with the contents of the 256 codes.
The `dec` field is merely a split of the string, and `enc` is an eversion (the
inverse mapping from character back to code).

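The SBCS idea can be sketched as follows. This is an illustration under stated assumptions, not the code that `make.njs` emits: the name `buildSbcs`, the `[code, unicode]` input format, and the choice to leave placeholder characters out of `enc` are all hypothetical.

```javascript
// Sketch (hypothetical helper): build the 256-character raw string, then
// derive dec by splitting it and enc by inverting it.
function buildSbcs(pairs) {
  const placeholder = String.fromCharCode(0xFFFD);
  const chars = new Array(256).fill(placeholder);   // unspecified codes stay U+FFFD
  for (const [code, unicode] of pairs) {
    chars[code] = String.fromCharCode(unicode);
  }
  const raw = chars.join("");     // the string that would be embedded in the output
  const dec = raw.split("");      // dec[code] is the character for that code
  const enc = {};
  for (let i = 0; i < dec.length; ++i) {
    if (dec[i] !== placeholder) enc[dec[i]] = i;    // assumption: skip placeholders
  }
  return { raw, dec, enc };
}

// Two sample pairs; 0xF6 -> U+02C6 comes from the 10000.TBL excerpt
const sbcs = buildSbcs([[0x41, 0x0041], [0xF6, 0x02C6]]);
```

Because every character here is in the Basic Multilingual Plane, splitting the string on the empty separator yields exactly one element per code.
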
DBCS is similar, except that the space is sliced into chunks of 256 bytes (strings
are only generated for those high bytes represented in the codepage).

The strategy is to construct an array-of-arrays so that `dd[high][low]` is the
character associated with the code. This array is combined at runtime to yield
the complete decoding object (and the encoding object is an eversion).

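A minimal sketch of the array-of-arrays strategy is shown below. Only the name `dd` comes from the text above; `buildDbcs`, the input format, and the sample pairs are assumptions made for illustration, and the generated code may be organized differently.

```javascript
// Sketch (hypothetical helper): dd[high][low] holds the character for the
// two-byte code (high << 8) | low. Rows exist only for high bytes in use.
function buildDbcs(pairs) {
  const placeholder = String.fromCharCode(0xFFFD);
  const dd = [];
  for (const [code, unicode] of pairs) {
    const high = code >> 8;
    const low = code & 0xFF;
    if (!dd[high]) dd[high] = new Array(256).fill(placeholder);
    dd[high][low] = String.fromCharCode(unicode);
  }
  // Combine the rows at runtime into the full decode and encode objects.
  const dec = {};
  const enc = {};
  dd.forEach((row, high) => {          // forEach skips the unused high bytes
    row.forEach((ch, low) => {
      if (ch === placeholder) return;
      const code = (high << 8) | low;
      dec[code] = ch;
      enc[ch] = code;
    });
  });
  return { dd, dec, enc };
}

// Sample pairs chosen for illustration: one single-byte and one double-byte code
const dbcs = buildDbcs([[0x41, 0x0041], [0x8140, 0x3000]]);
```

Single-byte codes land in row `dd[0]`, so the same structure covers both code lengths.
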
`make.sh` generates the tables used by `make.njs`. The raw Unicode TXT files
are columnar: `code unicode #comments`. For example, the last 10 lines of the
text file `ROMAN.TXT` (for CP 10000) are:

```>
0xF6 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT
0xF7 0x02DC #SMALL TILDE
0xF8 0x00AF #MACRON
0xF9 0x02D8 #BREVE
0xFA 0x02D9 #DOT ABOVE
0xFB 0x02DA #RING ABOVE
0xFC 0x00B8 #CEDILLA
0xFD 0x02DD #DOUBLE ACUTE ACCENT
0xFE 0x02DB #OGONEK
0xFF 0x02C7 #CARON
```

In processing the data, the comments (after the `#`) are stripped and undefined
elements (like `0x7F` for CP 10000) are removed.

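The cleanup step can be sketched in JavaScript as follows. The real work is done by `make.sh`, so this is only an illustration: the name `cleanMapping` is hypothetical, and it assumes that an undefined element shows up as a line whose Unicode column is missing or is not a hex literal.

```javascript
// Sketch (hypothetical helper): strip "#" comments and drop lines that do
// not carry both a hex code and a hex Unicode value.
function cleanMapping(text) {
  const hex = /^0x[0-9A-Fa-f]+$/;
  const out = [];
  for (const line of text.split(/\r?\n/)) {
    const body = line.split("#")[0].trim();   // everything before the comment
    if (body === "") continue;                 // comment-only or blank line
    const fields = body.split(/\s+/);
    if (fields.length < 2) continue;           // assumption: undefined element
    if (!hex.test(fields[0]) || !hex.test(fields[1])) continue;
    out.push(fields[0] + " " + fields[1]);
  }
  return out.join("\n");
}

const cleaned = cleanMapping(
  "# header comment\n" +
  "0x7F  #UNDEFINED\n" +
  "0xF6 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT\n"
);
```

For this sample input, only the `0xF6 0x02C6` pair survives.
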
## Utilities

The encode and decode functions are kept in a separate script (`cputils.js`).

Both encode and decode deal with data represented as:

- String (encode expects JS string, decode interprets UCS2 chars as codes)
- Array (encode expects array of JS String characters, decode expects numbers)
- Buffer (encode expects UTF-8 string, decode expects codepoints/bytes).

The `ofmt` variable controls `encode` output (`str` and `arr` for String and
Array, respectively), while the input format is automatically determined.

# Nitty Gritty

```>.vocrc
{ "post": "make js" }
```