💱 Codepages for JS


xu.chenhui f2ffe97f3c Fix the function of judging the commonJS environment		2014-04-05 04:57:55 -04:00
bits	Initial commit	2013-12-06 11:21:34 -05:00
codepages	cleaning up tables 708,720,858	2013-12-06 17:10:05 -05:00
.travis.yml	Initial commit	2013-12-06 11:21:34 -05:00
codepage.md	Initial commit	2013-12-06 11:21:34 -05:00
cptable.js	Fix the function of judging the commonJS environment	2014-04-05 04:57:55 -04:00
cputils.js	Fix the function of judging the commonJS environment	2014-04-05 04:57:55 -04:00
LICENSE	Initial commit	2013-12-06 11:21:34 -05:00
package.json	Initial commit	2013-12-06 11:21:34 -05:00
README.md	Initial commit	2013-12-06 11:21:34 -05:00
sbcs.js	Fix the function of judging the commonJS environment	2014-04-05 04:57:55 -04:00
test.js	Initial commit	2013-12-06 11:21:34 -05:00

README.md

Codepages for JS

Codepages are character encodings. In many contexts, single-byte character sets (SBCS) are used in lieu of standard multibyte Unicode encodings. A single-byte codepage maps each of the 256 possible byte values to at most one character through a simple lookup table.
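As a rough illustration of the lookup-table idea, a single-byte codepage can be pictured as a 256-entry array indexed by byte value. The table below is a toy built only for this sketch: the low half mirrors ASCII, the high half is filled with U+FFFD placeholders, and one entry (byte 255 of codepage 10000) is taken from the Usage section. `toyTable` and `decodeSBCS` are hypothetical names, not part of this library.

```javascript
// Toy 256-entry table: NOT a real codepage, just an illustration.
var toyTable = [];
for (var i = 0; i < 256; ++i) {
  // Low half mirrors ASCII; high half is a placeholder (U+FFFD).
  toyTable[i] = i < 128 ? String.fromCharCode(i) : "\uFFFD";
}
// One real entry: byte 255 in codepage 10000 maps to U+02C7 (caron).
toyTable[255] = String.fromCharCode(711);

// Decode an array of byte values by looking each one up in the table.
function decodeSBCS(table, bytes) {
  var out = "";
  for (var j = 0; j < bytes.length; ++j) out += table[bytes[j] & 0xFF];
  return out;
}

var text = decodeSBCS(toyTable, [0x48, 0x69, 0xFF]); // "Hi" followed by U+02C7
```

Double-byte codepages need more than one lookup per character, which is why their tables are much larger.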

unicode.org hosts lists of mappings. The build script automatically downloads and parses the mappings in order to generate the full script. The pages.csv description in codepage.md controls which codepages are used.

Setup

In the browser:

<script src="cptable.js"></script>
<script src="cputils.js"></script>

The complete set of codepages is large because of some Double Byte Character Set encodings. A much smaller file that just includes SBCS codepages is provided in this repo (sbcs.js).

If you know which codepages you need, you can include individual scripts for each codepage. The individual files are provided in the bits/ directory. For example, to include only the Mac codepages:

<script src="bits/10000.js"></script>
<script src="bits/10006.js"></script>
<script src="bits/10007.js"></script>
<script src="bits/10029.js"></script>
<script src="bits/10079.js"></script>
<script src="bits/10081.js"></script>

All of the browser scripts define and append to the cptable object. To rename the object, edit the JSVAR shell variable in make.sh and run the script.
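The "define and append" behaviour can be pictured with a small sketch. This is an assumption about the general pattern, not a copy of the generated files: `appendCodepage`, `shared`, and the empty tables are hypothetical stand-ins.

```javascript
// Hypothetical helper: create the shared object on first use, then add
// one codepage to it without disturbing tables loaded by earlier scripts.
function appendCodepage(root, number, table) {
  root = root || {};
  root[number] = table;
  return root;
}

var shared; // starts out undefined, as it would before any script has run
shared = appendCodepage(shared, 10000, { dec: [], enc: {} });
shared = appendCodepage(shared, 10006, { dec: [], enc: {} });
// shared now holds both tables, keyed by codepage number
```

Under this pattern the order in which the individual scripts are included should not matter, since each one only adds its own entry.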

The utility functions are contained in cputils.js, which assumes that the appropriate codepage scripts have already been loaded.

In node:

var cptable = require('codepage');

Usage

The codepages are indexed by number. To get the Unicode character for a given codepoint, use the dec property:

var unicode_cp10000_255 = cptable[10000].dec[255]; // ˇ

To get the codepoint for a given character, use the enc property:

var cp10000_711 = cptable[10000].enc[String.fromCharCode(711)]; // 255
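The `dec` and `enc` properties can be read as two views of one mapping: `dec` goes from codepoint to character, and `enc` goes back. The sketch below derives an `enc`-style object from a `dec`-style array to show that relationship. The table is made up for illustration (apart from the 255 entry quoted above), and `buildEnc` is a hypothetical helper, not a library function.

```javascript
// Build a character -> codepoint object from a codepoint -> character array.
function buildEnc(dec) {
  var enc = {};
  for (var code = 0; code < dec.length; ++code) {
    var ch = dec[code];
    if (ch === undefined) continue; // unassigned codepoint
    // If a character appears twice, keep the lowest codepoint.
    if (!Object.prototype.hasOwnProperty.call(enc, ch)) enc[ch] = code;
  }
  return enc;
}

var dec = [];
dec[65] = "A";
dec[255] = String.fromCharCode(711); // same entry as cptable[10000].dec[255]
var enc = buildEnc(dec);

// Encoding and then decoding returns the original character.
var roundTrip = dec[enc[String.fromCharCode(711)]] === String.fromCharCode(711);
```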

There are a few utilities that deal with strings and buffers:

var 汇总 = cptable.utils.decode(936, [0xbb,0xe3,0xd7,0xdc]);
var buf =  cptable.utils.encode(936,  汇总);
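For a double-byte codepage such as 936, a decoder has to decide whether a byte stands alone or starts a two-byte sequence. The following is a simplified sketch of that idea, not the logic in cputils.js: it assumes that bytes below 0x80 are single-byte and that every other byte is a lead byte, and its table holds only the two characters from the example above. `toyDBCS` and `decodeDBCS` are hypothetical names.

```javascript
// Toy table keyed by the 16-bit value (lead byte << 8 | trail byte).
var toyDBCS = {};
toyDBCS[0xBBE3] = "汇";
toyDBCS[0xD7DC] = "总";

function decodeDBCS(table, bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; ++i) {
    var b = bytes[i] & 0xFF;
    if (b < 0x80) {
      // Assumed single-byte range: pass through as ASCII.
      out += String.fromCharCode(b);
      continue;
    }
    // Assumed lead byte: combine it with the next byte and skip that byte.
    var code = (b << 8) | (bytes[i + 1] & 0xFF);
    ++i;
    // Unknown or truncated sequences become U+FFFD.
    out += table[code] !== undefined ? table[code] : "\uFFFD";
  }
  return out;
}

var decoded = decodeDBCS(toyDBCS, [0x41, 0xBB, 0xE3, 0xD7, 0xDC]); // "A汇总"
```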

Building the script

The build uses voc. codepage.md contains the script that builds the codepage tables and the JS source, so building is as simple as running voc codepage.md.

Supported Codepages

The standard Windows codepages are supported:

1250 Windows Central Europe
1251 Windows Cyrillic
1252 Windows Latin I
1253 Windows Greek
1254 Windows Turkish
1255 Windows Hebrew
1256 Windows Arabic
1257 Windows Baltic
1258 Windows Vietnam
874 Windows Thai

The full collection of ISO-8859 codepages is also supported, as are the East Asian Double Byte Character Sets:

932 Japanese Shift-JIS
936 Simplified Chinese GBK
949 Korean
950 Traditional Chinese Big5

The complete list of supported codepages can be found in the file pages.csv.

Missing Codepages

The following codepages are not implemented. Normative references may not be available in all cases. Furthermore, other software packages are known to hack certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic ISO-8859-6 when in fact there are many differences), so all implementations should be clean-room when possible.

709 Arabic (ASMO-449+, BCON V4)
710 Arabic - Transparent Arabic
870 IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2
1047 IBM EBCDIC Latin 1/Open System
1140 IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)
1141 IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)
1142 IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)
1143 IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)
1144 IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)
1145 IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)
1146 IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)
1147 IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)
1148 IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)
1149 IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)
1200 Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
1201 Unicode UTF-16, big endian byte order; available only to managed applications
1361 Korean (Johab)
10001 Japanese (Mac)
10002 MAC Traditional Chinese (Big5); Chinese Traditional (Mac)
10003 Korean (Mac)
10004 Arabic (Mac)
10005 Hebrew (Mac)
10008 MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)
10010 Romanian (Mac)
10017 Ukrainian (Mac)
10021 Thai (Mac)
10082 Croatian (Mac)
12000 Unicode UTF-32, little endian byte order; available only to managed applications
12001 Unicode UTF-32, big endian byte order; available only to managed applications
20000 CNS Taiwan; Chinese Traditional (CNS)
20001 TCA Taiwan
20002 Eten Taiwan; Chinese Traditional (Eten)
20003 IBM5550 Taiwan
20004 TeleText Taiwan
20005 Wang Taiwan
20105 IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)
20106 IA5 German (7-bit)
20107 IA5 Swedish (7-bit)
20108 IA5 Norwegian (7-bit)
20127 US-ASCII (7-bit)
20261 T.61
20269 ISO 6937 Non-Spacing Accent
20273 IBM EBCDIC Germany
20277 IBM EBCDIC Denmark-Norway
20278 IBM EBCDIC Finland-Sweden
20280 IBM EBCDIC Italy
20284 IBM EBCDIC Latin America-Spain
20285 IBM EBCDIC United Kingdom
20290 IBM EBCDIC Japanese Katakana Extended
20297 IBM EBCDIC France
20420 IBM EBCDIC Arabic
20423 IBM EBCDIC Greek
20424 IBM EBCDIC Hebrew
20833 IBM EBCDIC Korean Extended
20838 IBM EBCDIC Thai
20866 Russian (KOI8-R); Cyrillic (KOI8-R)
20871 IBM EBCDIC Icelandic
20880 IBM EBCDIC Cyrillic Russian
20905 IBM EBCDIC Turkish
20924 IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
20932 Japanese (JIS 0208-1990 and 0212-1990)
20936 Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)
20949 Korean Wansung
21025 IBM EBCDIC Cyrillic Serbian-Bulgarian
21027 (deprecated) <-- is this necessary?
21866 Ukrainian (KOI8-U); Cyrillic (KOI8-U)
29001 Europa 3
38598 ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
50220 ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)
50221 ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)
50222 ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)
50225 ISO 2022 Korean
50227 ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)
50229 ISO 2022 Traditional Chinese
50930 EBCDIC Japanese (Katakana) Extended
50931 EBCDIC US-Canada and Japanese
50933 EBCDIC Korean Extended and Korean
50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
50936 EBCDIC Simplified Chinese
50937 EBCDIC US-Canada and Traditional Chinese
50939 EBCDIC Japanese (Latin) Extended and Japanese
51932 EUC Japanese
51936 EUC Simplified Chinese; Chinese Simplified (EUC)
51949 EUC Korean
51950 EUC Traditional Chinese
52936 HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)
54936 Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
57002 ISCII Devanagari
57003 ISCII Bengali
57004 ISCII Tamil
57005 ISCII Telugu
57006 ISCII Assamese
57007 ISCII Oriya
57008 ISCII Kannada
57009 ISCII Malayalam
57010 ISCII Gujarati
57011 ISCII Punjabi
65000 Unicode (UTF-7)
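
Since the codepages listed above have no table, a lookup by number presumably finds nothing for them. A defensive lookup along the following lines may help callers fail early with a clear message. This is a sketch that uses a stand-in object rather than the real `cptable`, and `getCodepage` is a hypothetical helper, not part of the library.

```javascript
// Return the table for a codepage number, or throw if it is missing.
function getCodepage(tables, number) {
  var page = tables[number];
  if (!page || !page.dec || !page.enc) {
    throw new Error("Codepage " + number + " is not loaded or not supported");
  }
  return page;
}

// Stand-in for the shared object, with a single (empty) codepage entry.
var standIn = { 10000: { dec: [], enc: {} } };

var ok = getCodepage(standIn, 10000); // returns the stand-in entry
var message = "";
try {
  getCodepage(standIn, 1361); // 1361 (Johab) is in the missing list
} catch (e) {
  message = e.message;
}
```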