# Getting Codepages

The fields of the `pages.csv` manifest are `codepage,url,bytes`, where `bytes`
is 1 for SBCS and 2 for DBCS codepages.

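As a minimal sketch of how one manifest line could be read, assuming plain
comma-separated fields with no quoting: the helper name `parseManifestLine` and
the example URL are hypothetical and are not part of the build scripts.

```javascript
// Hypothetical helper: parse one `codepage,url,bytes` line of the manifest.
// Assumes no field contains a comma (no CSV quoting is handled).
function parseManifestLine(line) {
  const fields = line.trim().split(",");
  if (fields.length !== 3) throw new Error("expected 3 fields: " + line);
  const [codepage, url, bytes] = fields;
  if (bytes !== "1" && bytes !== "2") throw new Error("bytes must be 1 or 2");
  return {
    codepage: Number(codepage),
    url: url,
    // 1 marks a single-byte codepage, 2 marks a double-byte codepage
    type: bytes === "1" ? "SBCS" : "DBCS"
  };
}

// Placeholder URL, for illustration only
const entry = parseManifestLine("10000,https://example.com/ROMAN.TXT,1");
console.log(entry.codepage, entry.type);
```
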
Note that the Windows rendering is used for the Mac code pages. The primary
difference is the use of the private `0xF8FF` code (which renders as an Apple
logo on Macs but as garbage on other operating systems). It may be desirable
to fall back to the Apple behavior, in which case the files are under APPLE and
not MICSFT. This affects codepages 10000, 10006, 10007, 10029, 10079, and 10081.

The numbering scheme for the `ISO-8859-X` series is `28590 + X` (for example,
`ISO-8859-1` is codepage 28591).

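The arithmetic can be sketched as a small function. The name
`iso8859ToCodepage` is hypothetical, and the function does not check whether a
given part is actually listed in the manifest.

```javascript
// Hypothetical helper: map the X in ISO-8859-X to a codepage number.
function iso8859ToCodepage(part) {
  if (!Number.isInteger(part) || part < 1) {
    throw new RangeError("part must be a positive integer");
  }
  return 28590 + part;
}

console.log(iso8859ToCodepage(1));  // ISO-8859-1
console.log(iso8859ToCodepage(15)); // ISO-8859-15
```
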
## Generated Codepages

The following codepages are available in .NET on Windows:

- 708 Arabic (ASMO 708)
- 720 Arabic (Transparent ASMO); Arabic (DOS)
- 858 OEM Multilingual Latin 1 + Euro symbol
- 870 IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2
- 1047 IBM EBCDIC Latin 1/Open System
- 1140 IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)
- 1141 IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)
- 1142 IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)
- 1143 IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)
- 1144 IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)
- 1145 IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)
- 1146 IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)
- 1147 IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)
- 1148 IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)
- 1149 IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)
- 1361 Korean (Johab)
- 10001 Japanese (Mac)
- 10002 MAC Traditional Chinese (Big5); Chinese Traditional (Mac)
- 10003 Korean (Mac)
- 10004 Arabic (Mac)
- 10005 Hebrew (Mac)
- 10008 MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)
- 10010 Romanian (Mac)
- 10017 Ukrainian (Mac)
- 10021 Thai (Mac)
- 10082 Croatian (Mac)
- 20000 CNS Taiwan; Chinese Traditional (CNS)
- 20001 TCA Taiwan
- 20002 ETEN Taiwan; Chinese Traditional (ETEN)
- 20003 IBM5550 Taiwan
- 20004 TeleText Taiwan
- 20005 Wang Taiwan
- 20105 IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)
- 20106 IA5 German (7-bit)
- 20107 IA5 Swedish (7-bit)
- 20108 IA5 Norwegian (7-bit)
- 20261 T.61
- 20269 ISO 6937 Non-Spacing Accent
- 20273 IBM EBCDIC Germany
- 20277 IBM EBCDIC Denmark-Norway
- 20278 IBM EBCDIC Finland-Sweden
- 20280 IBM EBCDIC Italy
- 20284 IBM EBCDIC Latin America-Spain
- 20285 IBM EBCDIC United Kingdom
- 20290 IBM EBCDIC Japanese Katakana Extended
- 20297 IBM EBCDIC France
- 20420 IBM EBCDIC Arabic
- 20423 IBM EBCDIC Greek
- 20424 IBM EBCDIC Hebrew
- 20833 IBM EBCDIC Korean Extended
- 20838 IBM EBCDIC Thai
- 20866 Russian (KOI8-R); Cyrillic (KOI8-R)
- 20871 IBM EBCDIC Icelandic
- 20880 IBM EBCDIC Cyrillic Russian
- 20905 IBM EBCDIC Turkish
- 20924 IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
- 20932 Japanese (JIS 0208-1990 and 0212-1990)
- 20936 Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)
- 20949 Korean Wansung
- 21025 IBM EBCDIC Cyrillic Serbian-Bulgarian
- 21027 Extended/Ext Alpha Lowercase
- 21866 Ukrainian (KOI8-U); Cyrillic (KOI8-U)
- 29001 Europa 3
- 38598 ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
- 51932 EUC Japanese
- 51936 EUC Simplified Chinese; Chinese Simplified (EUC)
- 51949 EUC Korean
- 52936 HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)
- 54936 Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
- 57002 ISCII Devanagari
- 57003 ISCII Bengali
- 57004 ISCII Tamil
- 57005 ISCII Telugu
- 57006 ISCII Assamese
- 57007 ISCII Oriya
- 57008 ISCII Kannada
- 57009 ISCII Malayalam
- 57010 ISCII Gujarati
- 57011 ISCII Punjabi

The following codepages are dependencies for Visual FoxPro:

- 620 Mazovia (Polish) MS-DOS
- 895 Kamenick (Czech) MS-DOS

## Building Notes

The script `make.sh` (described later) will get these files and massage the data
(printing code-Unicode pairs). The eventual tables are dropped in the paths
`./codepages/<CODEPAGE>.TBL`. For example, the last 10 lines of `10000.TBL` are:

```>
0xF6 0x02C6
0xF7 0x02DC
0xF8 0x00AF
0xF9 0x02D8
0xFA 0x02D9
0xFB 0x02DA
0xFC 0x00B8
0xFD 0x02DD
0xFE 0x02DB
0xFF 0x02C7
```

which implies that code `0xF6` is `String.fromCharCode(0x02C6)` and vice versa.

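To make the "and vice versa" concrete, here is a rough sketch that turns a few
of the table lines above into a decode map and an encode map. The function name
`buildMaps` is hypothetical, and whitespace-separated columns are assumed.

```javascript
// Three of the table lines shown above, joined into one string
const tbl = ["0xF6 0x02C6", "0xF7 0x02DC", "0xFF 0x02C7"].join("\n");

// Hypothetical helper: build code -> char and char -> code maps.
function buildMaps(text) {
  const dec = {}, enc = {};
  for (const line of text.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    // Number() understands the 0x prefix, so "0xF6" becomes 246
    const [code, unicode] = line.trim().split(/\s+/).map(Number);
    const ch = String.fromCharCode(unicode);
    dec[code] = ch;
    enc[ch] = code;
  }
  return { dec, enc };
}

const maps = buildMaps(tbl);
console.log(maps.dec[0xF6] === String.fromCharCode(0x02C6));
console.log(maps.enc[String.fromCharCode(0x02C6)] === 0xF6);
```
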
## Windows-dependent build step

To build the sources on Windows, consult `dotnet/MakeEncoding.cs`.

After saving standard output to `out`, the `dotnet.sh` script processes the
results.

# Building the script

`make.njs` takes a codepage argument, reads the corresponding table file and
generates JS code for encoding and decoding.

## Raw Codepages

The DBCS and SBCS code generation strategies are different. The maximum code is
used to distinguish (max `0xFF` for SBCS).

The Unicode character `0xFFFD` (REPLACEMENT CHARACTER) is used as a placeholder
for characters that are not specified in the map (for example, `0xF0` is not in
code page 10000).

For SBCS, the idea is to embed a raw string with the contents of the 256 codes.
The `dec` field is merely a split of the string, and `enc` is an eversion (the
inverse mapping from character to code):

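A minimal sketch of the SBCS idea, not the generated code itself: the three
sample pairs are taken for illustration, and leaving the `0xFFFD` placeholder
out of `enc` is an assumption of this sketch.

```javascript
// Illustrative sparse table: [code, Unicode code point]
const pairs = [[0x41, 0x0041], [0xF6, 0x02C6], [0xFF, 0x02C7]];

// Build the 256-character raw string; unspecified codes get U+FFFD
const codes = new Array(256).fill(0xFFFD);
for (const [code, unicode] of pairs) codes[code] = unicode;
const raw = String.fromCharCode(...codes);

// `dec` is a split of the string: dec[code] is the character
const dec = raw.split("");

// `enc` inverts `dec`: enc[character] is the code
const enc = {};
dec.forEach((ch, code) => {
  if (ch !== "\uFFFD") enc[ch] = code; // assumption: skip placeholders
});

console.log(raw.length, dec[0xF6] === "\u02C6", enc["\u02C6"] === 0xF6);
```
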
DBCS is similar, except that the space is sliced in chunks of 256 bytes (strings
are only generated for those high-bytes represented in the codepage).

The strategy is to construct an array-of-arrays so that `dd[high][low]` is the
character associated with the code. This array is combined at runtime to yield
the complete decoding object (and the encoding object is an eversion):

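The array-of-arrays strategy could look roughly like the following. The sample
pairs are illustrative, the name `dd` follows the text above, and everything
else (including skipping `0xFFFD` when combining) is an assumption of this
sketch.

```javascript
// Illustrative DBCS pairs: [two-byte code, Unicode code point]
const pairs = [[0x8140, 0x3000], [0x8141, 0x3001], [0x82A0, 0x3042]];

// One 256-entry chunk per high byte that actually appears in the table
const chunks = {};
for (const [code, unicode] of pairs) {
  const high = code >> 8, low = code & 0xFF;
  if (!chunks[high]) chunks[high] = new Array(256).fill(0xFFFD);
  chunks[high][low] = unicode;
}

// dd[high][low] is the character for the code (high << 8) | low
const dd = [];
for (const high of Object.keys(chunks)) {
  dd[high] = String.fromCharCode(...chunks[high]).split("");
}

// Combine at runtime into the decoding object and its inverse
const dec = {}, enc = {};
dd.forEach((row, high) => {
  row.forEach((ch, low) => {
    if (ch === "\uFFFD") return; // placeholder: no mapping
    const code = (high << 8) | low;
    dec[code] = ch;
    enc[ch] = code;
  });
});

console.log(dd[0x81][0x40] === "\u3000", enc["\u3042"] === 0x82A0);
```

Only the high bytes `0x81` and `0x82` get a row here; every other entry of `dd`
stays empty, which mirrors the "only for those high-bytes represented" remark.
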
`make.sh` generates the tables used by `make.njs`. The raw Unicode TXT files
are columnar: `code unicode #comments`. For example, the last 10 lines of the
text file `ROMAN.TXT` (for CP 10000) are:

```>
0xF6 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT
0xF7 0x02DC #SMALL TILDE
0xF8 0x00AF #MACRON
0xF9 0x02D8 #BREVE
0xFA 0x02D9 #DOT ABOVE
0xFB 0x02DA #RING ABOVE
0xFC 0x00B8 #CEDILLA
0xFD 0x02DD #DOUBLE ACUTE ACCENT
0xFE 0x02DB #OGONEK
0xFF 0x02C7 #CARON
```

In processing the data, the comments (after the `#`) are stripped and undefined
elements (like `0x7F` for CP 10000) are removed.

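The stripping step can be approximated in a few lines of JavaScript. This is a
sketch rather than what `make.sh` does: the function name `toTable`, the shape
of the undefined entry (a code with no Unicode column), and the single-space
output separator are all assumptions.

```javascript
// Two real-looking lines plus an assumed "undefined" entry (no Unicode column)
const txt = [
  "0xFE\t0x02DB\t#OGONEK",
  "0xFF\t0x02C7\t#CARON",
  "0x7F\t\t#UNDEFINED"
].join("\n");

// Hypothetical helper: strip comments and drop undefined elements.
function toTable(text) {
  const out = [];
  for (const line of text.split("\n")) {
    const data = line.split("#")[0].trim(); // drop everything after '#'
    if (!data) continue;                    // comment-only or blank line
    const fields = data.split(/\s+/);
    if (fields.length < 2) continue;        // no Unicode value: undefined
    out.push(fields[0] + " " + fields[1]);
  }
  return out.join("\n");
}

console.log(toTable(txt));
```
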
## Utilities

The encode and decode functions are kept in a separate script (`cputils.js`).

Both encode and decode deal with data represented as:

- String (encode expects JS string, decode interprets UCS2 chars as codes)
- Array (encode expects array of JS String characters, decode expects numbers)
- Buffer (encode expects UTF-8 string, decode expects codepoints/bytes)

The `ofmt` variable controls `encode` output (`str` and `arr` select String and
Array output, respectively), while the input format is automatically determined.

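One way the automatic input detection could work is a simple type check. This
is a hedged sketch, not the code in `cputils.js`: the function name
`detectFormat` and the `buf` label are assumptions made here for illustration.

```javascript
// Hypothetical helper: classify input as String, Buffer, or Array.
function detectFormat(data) {
  if (typeof data === "string") return "str";
  // Buffer only exists in Node.js, so guard the check
  if (typeof Buffer !== "undefined" && Buffer.isBuffer(data)) return "buf";
  if (Array.isArray(data)) return "arr";
  throw new TypeError("unsupported input type");
}

console.log(detectFormat("abc"), detectFormat([0x61, 0x62]));
```
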
# Nitty Gritty

```>.vocrc
{ "post": "make js" }
```