# Codepages for JS

[Codepages](https://en.wikipedia.org/wiki/Codepage) are character encodings. In
many contexts, single-byte character sets are used in lieu of standard multibyte
Unicode encodings. Each one maps at most 256 byte values to characters through a
simple table.

[unicode.org](http://www.unicode.org/Public/MAPPINGS/) hosts lists of mappings.
The build script automatically downloads and parses the mappings in order to
generate the full script. The `pages.csv` description in `codepage.md` controls
which codepages are used.

## Setup

In the browser:

    <script src="cptable.js"></script>
    <script src="cputils.js"></script>

The complete set of codepages is large due to some Double Byte Character Set
(DBCS) encodings. A much smaller file that includes only the single-byte (SBCS)
codepages is provided in this repo (`sbcs.js`), as well as a file for other
projects (`cpexcel.js`).

If you know which codepages you need, you can include individual scripts for
each codepage. The individual files are provided in the `bits/` directory.
For example, to include only the Mac codepages:

    <script src="bits/10000.js"></script>
    <script src="bits/10006.js"></script>
    <script src="bits/10007.js"></script>
    <script src="bits/10029.js"></script>
    <script src="bits/10079.js"></script>
    <script src="bits/10081.js"></script>

All of the browser scripts define and append to the `cptable` object. To rename
the object, edit the `JSVAR` shell variable in `make.sh` and run the script.

The utility functions are contained in `cputils.js`, which assumes that the
appropriate codepage scripts were loaded.

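
Because `cputils.js` assumes the codepage scripts are already loaded, it can
help to verify that before calling the utilities. The following is a minimal
sketch, not part of the library: `missingCodepages` is a hypothetical helper,
and the `cptable` literal is a stand-in for the real global so that the snippet
runs on its own.

```javascript
// Stand-in for the global object that the codepage scripts append to.
// In a real page this would be defined by cptable.js or the bits/ scripts.
var cptable = {
  10000: { dec: {}, enc: {} }, // pretend bits/10000.js was loaded
  10006: { dec: {}, enc: {} }  // pretend bits/10006.js was loaded
};

// Hypothetical helper: return the requested codepage numbers that are absent.
function missingCodepages(table, needed) {
  return needed.filter(function (cp) {
    return !Object.prototype.hasOwnProperty.call(table, cp);
  });
}

var missing = missingCodepages(cptable, [10000, 10006, 10007]);
if (missing.length > 0) {
  console.log("codepages not loaded: " + missing.join(", "));
}
```
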
In node:

    var cptable = require('codepage');

## Usage

The codepages are indexed by number. To get the Unicode character for a given
codepoint, use the `dec` property:

    var unicode_cp10000_255 = cptable[10000].dec[255]; // ˇ

To get the codepoint for a given character, use the `enc` property:

    var cp10000_711 = cptable[10000].enc[String.fromCharCode(711)]; // 255

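
The `dec` and `enc` properties are inverse lookups of each other. As a rough
illustration of that relationship, the sketch below builds both maps from a
handful of byte/character pairs. `buildTable` and the `macRomanSubset` data are
hypothetical and cover only three entries; the real tables are generated by the
build script.

```javascript
// Hypothetical helper: build `dec` (byte -> character) and `enc`
// (character -> byte) lookups from an array of [byte, character] pairs.
function buildTable(pairs) {
  var dec = {}, enc = {};
  pairs.forEach(function (pair) {
    dec[pair[0]] = pair[1];
    enc[pair[1]] = pair[0];
  });
  return { dec: dec, enc: enc };
}

// Three entries only; 255 -> U+02C7 matches the codepage 10000 example.
var macRomanSubset = buildTable([
  [0x41, "A"],
  [0x61, "a"],
  [0xFF, String.fromCharCode(711)]
]);

var ch = macRomanSubset.dec[255];                         // "\u02C7"
var byte = macRomanSubset.enc[String.fromCharCode(711)];  // 255
console.log(ch.charCodeAt(0), byte);
```
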
There are a few utilities that deal with strings and buffers:

    var 汇总 = cptable.utils.decode(936, [0xbb,0xe3,0xd7,0xdc]);
    var buf = cptable.utils.encode(936, 汇总);

`cptable.utils.encode(CP, data, ofmt)` accepts a String or Array of characters
and returns a representation controlled by `ofmt`:

- Default output is a Buffer (or Array) of bytes (integers between 0 and 255).
- If `ofmt == 'str'`, return a String where `o.charCodeAt(i)` is the ith byte.
- If `ofmt == 'arr'`, return an Array of bytes.

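
To make the three output shapes concrete, here is a minimal sketch for a
single-byte table. `encodeBytes` is a hypothetical function written for this
illustration, not the library's implementation, and the `table` literal is a
stand-in for a real `cptable` entry. Unmapped characters are not handled.

```javascript
// Stand-in single-byte table with three entries (hypothetical data).
var table = { enc: { "A": 0x41, "a": 0x61, "\u02C7": 0xFF } };

// Hypothetical sketch of the output formats described above.
function encodeBytes(table, data, ofmt) {
  // Accept either a String or an Array of characters.
  var chars = typeof data === "string" ? data.split("") : data;
  var bytes = chars.map(function (c) { return table.enc[c]; });
  if (ofmt === "str") {
    // String whose char codes are the encoded bytes.
    return bytes.map(function (b) { return String.fromCharCode(b); }).join("");
  }
  if (ofmt === "arr") return bytes;
  // Default: Buffer where available (node), plain Array otherwise.
  return typeof Buffer !== "undefined" ? Buffer.from(bytes) : bytes;
}

var asArr = encodeBytes(table, "Aa\u02C7", "arr");  // [65, 97, 255]
var asStr = encodeBytes(table, ["A", "a"], "str");  // "Aa"
var asBuf = encodeBytes(table, "A");                // Buffer or Array holding 65
console.log(asArr, asStr.charCodeAt(1), asBuf[0]);
```
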
## Building the script

This script uses [voc](https://npm.im/voc). The script to build the codepage
tables and the JS source is `codepage.md`, so building is as simple as
`voc codepage.md`.

## Generated Codepages

The complete list of hardcoded codepages can be found in the file `pages.csv`.

Some codepages are easier to implement algorithmically. Since these are
hardcoded in utils, there is no corresponding entry (they are "magic").

| CP# | Information | Description |
| --: | ----------- | ----------- |
| 37 | unicode.org | IBM EBCDIC US-Canada |
| 437 | unicode.org | OEM United States |
| 500 | unicode.org | IBM EBCDIC International |
| 708 | MakeEncoding.cs | Arabic (ASMO 708) |
| 720 | MakeEncoding.cs | Arabic (Transparent ASMO); Arabic (DOS) |
| 737 | unicode.org | OEM Greek (formerly 437G); Greek (DOS) |
| 775 | unicode.org | OEM Baltic; Baltic (DOS) |
| 850 | unicode.org | OEM Multilingual Latin 1; Western European (DOS) |
| 852 | unicode.org | OEM Latin 2; Central European (DOS) |
| 855 | unicode.org | OEM Cyrillic (primarily Russian) |
| 857 | unicode.org | OEM Turkish; Turkish (DOS) |
| 858 | MakeEncoding.cs | OEM Multilingual Latin 1 + Euro symbol |
| 860 | unicode.org | OEM Portuguese; Portuguese (DOS) |
| 861 | unicode.org | OEM Icelandic; Icelandic (DOS) |
| 862 | unicode.org | OEM Hebrew; Hebrew (DOS) |
| 863 | unicode.org | OEM French Canadian; French Canadian (DOS) |
| 864 | unicode.org | OEM Arabic; Arabic (864) |
| 865 | unicode.org | OEM Nordic; Nordic (DOS) |
| 866 | unicode.org | OEM Russian; Cyrillic (DOS) |
| 869 | unicode.org | OEM Modern Greek; Greek, Modern (DOS) |
| 870 | MakeEncoding.cs | IBM EBCDIC Multilingual/ROECE (Latin 2) |
| 874 | unicode.org | Windows Thai |
| 875 | unicode.org | IBM EBCDIC Greek Modern |
| 932 | unicode.org | Japanese Shift-JIS |
| 936 | unicode.org | Simplified Chinese GBK |
| 949 | unicode.org | Korean |
| 950 | unicode.org | Traditional Chinese Big5 |
| 1026 | unicode.org | IBM EBCDIC Turkish (Latin 5) |
| 1047 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System |
| 1140 | MakeEncoding.cs | IBM EBCDIC US-Canada (037 + Euro symbol) |
| 1141 | MakeEncoding.cs | IBM EBCDIC Germany (20273 + Euro symbol) |
| 1142 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway (20277 + Euro symbol) |
| 1143 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden (20278 + Euro symbol) |
| 1144 | MakeEncoding.cs | IBM EBCDIC Italy (20280 + Euro symbol) |
| 1145 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain (20284 + Euro symbol) |
| 1146 | MakeEncoding.cs | IBM EBCDIC United Kingdom (20285 + Euro symbol) |
| 1147 | MakeEncoding.cs | IBM EBCDIC France (20297 + Euro symbol) |
| 1148 | MakeEncoding.cs | IBM EBCDIC International (500 + Euro symbol) |
| 1149 | MakeEncoding.cs | IBM EBCDIC Icelandic (20871 + Euro symbol) |
| 1200 | magic | Unicode UTF-16, little endian (BMP of ISO 10646) |
| 1201 | magic | Unicode UTF-16, big endian |
| 1250 | unicode.org | Windows Central Europe |
| 1251 | unicode.org | Windows Cyrillic |
| 1252 | unicode.org | Windows Latin I |
| 1253 | unicode.org | Windows Greek |
| 1254 | unicode.org | Windows Turkish |
| 1255 | unicode.org | Windows Hebrew |
| 1256 | unicode.org | Windows Arabic |
| 1257 | unicode.org | Windows Baltic |
| 1258 | unicode.org | Windows Vietnam |
| 1361 | MakeEncoding.cs | Korean (Johab) |
| 10000 | unicode.org | MAC Roman |
| 10001 | MakeEncoding.cs | Japanese (Mac) |
| 10002 | MakeEncoding.cs | MAC Traditional Chinese (Big5) |
| 10003 | MakeEncoding.cs | Korean (Mac) |
| 10004 | MakeEncoding.cs | Arabic (Mac) |
| 10005 | MakeEncoding.cs | Hebrew (Mac) |
| 10006 | unicode.org | Greek (Mac) |
| 10007 | unicode.org | Cyrillic (Mac) |
| 10008 | MakeEncoding.cs | MAC Simplified Chinese (GB 2312) |
| 10010 | MakeEncoding.cs | Romanian (Mac) |
| 10017 | MakeEncoding.cs | Ukrainian (Mac) |
| 10021 | MakeEncoding.cs | Thai (Mac) |
| 10029 | unicode.org | MAC Latin 2 (Central European) |
| 10079 | unicode.org | Icelandic (Mac) |
| 10081 | unicode.org | Turkish (Mac) |
| 10082 | MakeEncoding.cs | Croatian (Mac) |
| 12000 | magic | Unicode UTF-32, little endian byte order |
| 12001 | magic | Unicode UTF-32, big endian byte order |
| 20000 | MakeEncoding.cs | CNS Taiwan (Chinese Traditional) |
| 20001 | MakeEncoding.cs | TCA Taiwan |
| 20002 | MakeEncoding.cs | Eten Taiwan (Chinese Traditional) |
| 20003 | MakeEncoding.cs | IBM5550 Taiwan |
| 20004 | MakeEncoding.cs | TeleText Taiwan |
| 20005 | MakeEncoding.cs | Wang Taiwan |
| 20105 | MakeEncoding.cs | Western European IA5 (IRV International Alphabet 5) 7-bit |
| 20106 | MakeEncoding.cs | IA5 German (7-bit) |
| 20107 | MakeEncoding.cs | IA5 Swedish (7-bit) |
| 20108 | MakeEncoding.cs | IA5 Norwegian (7-bit) |
| 20127 | magic | US-ASCII (7-bit) |
| 20261 | MakeEncoding.cs | T.61 |
| 20269 | MakeEncoding.cs | ISO 6937 Non-Spacing Accent |
| 20273 | MakeEncoding.cs | IBM EBCDIC Germany |
| 20277 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway |
| 20278 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden |
| 20280 | MakeEncoding.cs | IBM EBCDIC Italy |
| 20284 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain |
| 20285 | MakeEncoding.cs | IBM EBCDIC United Kingdom |
| 20290 | MakeEncoding.cs | IBM EBCDIC Japanese Katakana Extended |
| 20297 | MakeEncoding.cs | IBM EBCDIC France |
| 20420 | MakeEncoding.cs | IBM EBCDIC Arabic |
| 20423 | MakeEncoding.cs | IBM EBCDIC Greek |
| 20424 | MakeEncoding.cs | IBM EBCDIC Hebrew |
| 20833 | MakeEncoding.cs | IBM EBCDIC Korean Extended |
| 20838 | MakeEncoding.cs | IBM EBCDIC Thai |
| 20866 | MakeEncoding.cs | Russian Cyrillic (KOI8-R) |
| 20871 | MakeEncoding.cs | IBM EBCDIC Icelandic |
| 20880 | MakeEncoding.cs | IBM EBCDIC Cyrillic Russian |
| 20905 | MakeEncoding.cs | IBM EBCDIC Turkish |
| 20924 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System (1047 + Euro symbol) |
| 20932 | MakeEncoding.cs | Japanese (JIS 0208-1990 and 0212-1990) |
| 20936 | MakeEncoding.cs | Simplified Chinese (GB2312-80) |
| 20949 | MakeEncoding.cs | Korean Wansung |
| 21025 | MakeEncoding.cs | IBM EBCDIC Cyrillic Serbian-Bulgarian |
| 21866 | MakeEncoding.cs | Ukrainian Cyrillic (KOI8-U) |
| 28591 | unicode.org | ISO 8859-1 Latin 1 (Western European) |
| 28592 | unicode.org | ISO 8859-2 Latin 2 (Central European) |
| 28593 | unicode.org | ISO 8859-3 Latin 3 |
| 28594 | unicode.org | ISO 8859-4 Baltic |
| 28595 | unicode.org | ISO 8859-5 Cyrillic |
| 28596 | unicode.org | ISO 8859-6 Arabic |
| 28597 | unicode.org | ISO 8859-7 Greek |
| 28598 | unicode.org | ISO 8859-8 Hebrew (ISO-Visual) |
| 28599 | unicode.org | ISO 8859-9 Turkish |
| 28600 | unicode.org | ISO 8859-10 Latin 6 |
| 28601 | unicode.org | ISO 8859-11 Latin (Thai) |
| 28603 | unicode.org | ISO 8859-13 Latin 7 (Estonian) |
| 28604 | unicode.org | ISO 8859-14 Latin 8 (Celtic) |
| 28605 | unicode.org | ISO 8859-15 Latin 9 |
| 28606 | unicode.org | ISO 8859-16 Latin 10 |
| 29001 | MakeEncoding.cs | Europa 3 |
| 38598 | MakeEncoding.cs | ISO 8859-8 Hebrew (ISO-Logical) |
| 50220 | MakeEncoding.cs | ISO 2022 JIS Japanese with no halfwidth Katakana |
| 50221 | MakeEncoding.cs | ISO 2022 JIS Japanese with halfwidth Katakana |
| 50222 | MakeEncoding.cs | ISO 2022 Japanese JIS X 0201-1989 (1 byte Kana-SO/SI) |
| 50225 | MakeEncoding.cs | ISO 2022 Korean |
| 50227 | MakeEncoding.cs | ISO 2022 Simplified Chinese |
| 51932 | MakeEncoding.cs | EUC Japanese |
| 51936 | MakeEncoding.cs | EUC Simplified Chinese |
| 51949 | MakeEncoding.cs | EUC Korean |
| 52936 | MakeEncoding.cs | HZ-GB2312 Simplified Chinese |
| 54936 | MakeEncoding.cs | GB18030 Simplified Chinese (4 byte) |
| 57002 | MakeEncoding.cs | ISCII Devanagari |
| 57003 | MakeEncoding.cs | ISCII Bengali |
| 57004 | MakeEncoding.cs | ISCII Tamil |
| 57005 | MakeEncoding.cs | ISCII Telugu |
| 57006 | MakeEncoding.cs | ISCII Assamese |
| 57007 | MakeEncoding.cs | ISCII Oriya |
| 57008 | MakeEncoding.cs | ISCII Kannada |
| 57009 | MakeEncoding.cs | ISCII Malayalam |
| 57010 | MakeEncoding.cs | ISCII Gujarati |
| 57011 | MakeEncoding.cs | ISCII Punjabi |
| 65000 | magic | Unicode (UTF-7) |
| 65001 | magic | Unicode (UTF-8) |

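
The entries marked `magic` in the table have no lookup table because the byte
sequence can be computed directly. As an illustration of what "algorithmic"
means here, the following sketch handles codepage 1200 (UTF-16, little endian).
The function names are hypothetical and this is not the code in `cputils.js`.

```javascript
// Hypothetical sketch: JS strings are sequences of UTF-16 code units, so
// each unit is split into its low byte followed by its high byte.
function encodeUTF16LE(str) {
  var out = [];
  for (var i = 0; i < str.length; ++i) {
    var unit = str.charCodeAt(i);
    out.push(unit & 0xFF, unit >> 8);
  }
  return out;
}

// Inverse: recombine each pair of bytes into one code unit.
// A trailing odd byte is ignored in this sketch.
function decodeUTF16LE(bytes) {
  var units = [];
  for (var i = 0; i + 1 < bytes.length; i += 2) {
    units.push(String.fromCharCode(bytes[i] | (bytes[i + 1] << 8)));
  }
  return units.join("");
}

var bytes = encodeUTF16LE("A\u02C7"); // [0x41, 0x00, 0xC7, 0x02]
console.log(bytes, decodeUTF16LE(bytes) === "A\u02C7");
```
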
Note that MakeEncoding.cs deviates from unicode.org for some codepages. In the
case of direct conflicts, unicode.org takes precedence. In cases where the
unicode.org listing does not prescribe a value, the MakeEncoding.cs value is
used.

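
The precedence rule can be pictured as a two-step merge. This is an
illustrative sketch only: `mergeMappings` is a hypothetical name, the sample
byte values are invented for the example, and the actual build logic lives in
`codepage.md`.

```javascript
// Hypothetical sketch: combine two byte -> character maps so that
// unicode.org wins on conflicts and MakeEncoding.cs fills the gaps.
function mergeMappings(unicodeOrg, makeEncoding) {
  var merged = {};
  Object.keys(makeEncoding).forEach(function (k) { merged[k] = makeEncoding[k]; });
  Object.keys(unicodeOrg).forEach(function (k) { merged[k] = unicodeOrg[k]; });
  return merged;
}

// Invented sample data, not taken from any real codepage.
var fromUnicodeOrg   = { 0x80: "\u00C7", 0x81: "\u00FC" };
var fromMakeEncoding = { 0x80: "\u20AC", 0x82: "\u00E9" };

var merged = mergeMappings(fromUnicodeOrg, fromMakeEncoding);
// merged[0x80] is "\u00C7" (conflict: the unicode.org value is kept)
// merged[0x82] is "\u00E9" (gap: the MakeEncoding.cs value is used)
console.log(merged);
```
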
## Missing Codepages

The following codepages are not implemented. Normative references may not be
available in all cases. Furthermore, other software packages are known to hack
certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic
ISO-8859-6 when in fact there are many differences), so all implementations
*should* be cleanroom when possible.

- 709 Arabic (ASMO-449+, BCON V4)
- 710 Arabic - Transparent Arabic
- 21027 (deprecated) <-- is this necessary?
- 50229 ISO 2022 Traditional Chinese
- 50930 EBCDIC Japanese (Katakana) Extended
- 50931 EBCDIC US-Canada and Japanese
- 50933 EBCDIC Korean Extended and Korean
- 50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
- 50936 EBCDIC Simplified Chinese
- 50937 EBCDIC US-Canada and Traditional Chinese
- 50939 EBCDIC Japanese (Latin) Extended and Japanese
- 51950 EUC Traditional Chinese

## Sources

- [Unicode Consortium Public Mappings](http://www.unicode.org/Public/MAPPINGS/)
- [Code Page Enumeration](http://msdn.microsoft.com/en-us/library/cc195051.aspx)
- [Code Page Identifiers](http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756.aspx)

## Badges

[![githalytics.com alpha](https://cruel-carlota.pagodabox.com/afa29a5e8495a01059ee5b353f9042cb "githalytics.com")](http://githalytics.com/SheetJS/js-codepage)
[![Build Status](https://travis-ci.org/SheetJS/js-codepage.svg?branch=master)](https://travis-ci.org/SheetJS/js-codepage)
[![Coverage Status](https://coveralls.io/repos/SheetJS/js-codepage/badge.png)](https://coveralls.io/r/SheetJS/js-codepage)