Codepages for JS
Codepages are character encodings. In many contexts, single-byte character sets are used in lieu of standard multibyte Unicode encodings. A single-byte character set covers at most 256 characters, each mapped to one byte value.
unicode.org hosts lists of mappings.
The build script automatically downloads and parses the mappings in order to
generate the full script. The pages.csv description in codepage.md controls
which codepages are used.
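To make the parsing step concrete, here is a minimal sketch of turning mapping text into decode and encode tables. It is not the actual build script: the function name parseMapping and the sample lines are assumptions for illustration, and the input format assumed is the tab-separated "0xBYTE 0xUNICODE #NAME" layout used by single-byte mapping files on unicode.org.

```javascript
// Hypothetical helper (not part of this repo): parse mapping text whose lines
// look like "0xBB<TAB>0xUUUU<TAB>#NAME" into dec (byte -> char) and enc (char -> byte).
function parseMapping(text) {
  var dec = {}, enc = {};
  text.split(/\r?\n/).forEach(function (line) {
    var body = line.split('#')[0].trim();  // drop trailing comments
    if (!body) return;                     // blank or comment-only line
    var fields = body.split(/\s+/);
    if (fields.length < 2) return;         // byte with no Unicode column (undefined)
    var code = parseInt(fields[0], 16);
    var ch = String.fromCharCode(parseInt(fields[1], 16));
    dec[code] = ch;
    enc[ch] = code;
  });
  return { dec: dec, enc: enc };
}

// Sample input written for this sketch, in the style of a mapping file.
var sample = [
  '# sample mapping lines',
  '0x41\t0x0041\t#LATIN CAPITAL LETTER A',
  '0x80\t0x20AC\t#EURO SIGN',
  '0x81\t      \t#UNDEFINED'
].join('\n');

var table = parseMapping(sample);
console.log(table.dec[0x80]); // €
console.log(table.enc['A']);  // 65
```

The resulting object mirrors the dec and enc access pattern described under Usage, but the real generated tables may be stored differently.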
Setup
In the browser:
<script src="cptable.js"></script>
<script src="cputils.js"></script>
The complete set of codepages is large because of some Double Byte Character Set
encodings. A much smaller file that just includes SBCS codepages is provided in
this repo (sbcs.js).
If you know which codepages you need, you can include individual scripts for
each codepage. The individual files are provided in the bits/ directory.
For example, to include only the Mac codepages:
<script src="bits/10000.js"></script>
<script src="bits/10006.js"></script>
<script src="bits/10007.js"></script>
<script src="bits/10029.js"></script>
<script src="bits/10079.js"></script>
<script src="bits/10081.js"></script>
All of the browser scripts define and append to the cptable object. To rename
the object, edit the JSVAR shell variable in make.sh and run the script.
The utility functions are contained in cputils.js, which assumes that the
appropriate codepage scripts were loaded.
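Because the utilities assume the tables are already present, it can help to fail early when a codepage is missing. The following is only a sketch: requireCodepage is a hypothetical helper, and the loaded object is a stand-in for the global table object after including a single script from bits/.

```javascript
// Hypothetical guard (not part of cputils.js): throw when a table is missing.
function requireCodepage(tables, cp) {
  if (!tables || !tables[cp]) {
    throw new Error('codepage ' + cp + ' is not loaded');
  }
  return tables[cp];
}

// Stand-in for the table object after loading only bits/10000.js.
var loaded = { 10000: { dec: {}, enc: {} } };

var mac = requireCodepage(loaded, 10000); // returns the stand-in table
var message = null;
try {
  requireCodepage(loaded, 936);           // not loaded in this stand-in
} catch (e) {
  message = e.message;
}
console.log(message); // codepage 936 is not loaded
```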
In node:
var cptable = require('codepage');
Usage
The codepages are indexed by number. To get the unicode character for a given
codepoint, use the dec property:
var unicode_cp10000_255 = cptable[10000].dec[255]; // ˇ
To get the codepoint for a given character, use the enc property:
var cp10000_711 = cptable[10000].enc[String.fromCharCode(711)]; // 255
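A character that is not part of a codepage has no entry in enc, so a lookup yields undefined. The sketch below shows one way to handle that case. The macRomanStub object is a stand-in holding only the single mapping quoted above, and encodeChar with its fallback parameter is a hypothetical helper, not a function provided by this package.

```javascript
// Stand-in for cptable[10000]; only the mapping quoted above is filled in.
var macRomanStub = { dec: {}, enc: {} };
macRomanStub.dec[255] = String.fromCharCode(711);
macRomanStub.enc[String.fromCharCode(711)] = 255;

// Hypothetical helper: look up a character, using a fallback code when unmapped.
function encodeChar(table, ch, fallback) {
  var code = table.enc[ch];
  return code === undefined ? fallback : code;
}

var QUESTION_MARK = 0x3F;
console.log(encodeChar(macRomanStub, '\u02C7', QUESTION_MARK)); // 255
console.log(encodeChar(macRomanStub, '\u6C47', QUESTION_MARK)); // 63 (unmapped in the stub)

// Round trip: decoding the encoded value gives back the original character.
var roundTrip = macRomanStub.dec[encodeChar(macRomanStub, '\u02C7', QUESTION_MARK)];
console.log(roundTrip === '\u02C7'); // true
```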
There are a few utilities that deal with strings and buffers:
var 汇总 = cptable.utils.decode(936, [0xbb,0xe3,0xd7,0xdc]);
var buf = cptable.utils.encode(936, 汇总);
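For intuition about what decoding a double-byte codepage involves, here is a simplified sketch. It is not the implementation in cputils.js. It assumes a table keyed by single bytes for ASCII and by the 16-bit value (lead byte << 8 | trail byte) for two-byte sequences, and it treats any byte from 0x81 upward as a lead byte, which matches GBK but not every encoding. The stub table contains only the two pairs from the example above plus one ASCII letter.

```javascript
// Stub table for this sketch; the keying scheme is an assumption.
var stub = { dec: {} };
stub.dec[0x41] = 'A';
stub.dec[0xBBE3] = '\u6C47'; // 汇
stub.dec[0xD7DC] = '\u603B'; // 总

// Hypothetical decoder: combine a lead byte with the following byte,
// then look the key up in the table. Unmapped keys become U+FFFD.
function decodeDBCS(table, bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; ++i) {
    var key = bytes[i];
    if (key >= 0x81 && i + 1 < bytes.length) {
      key = (key << 8) | bytes[++i];
    }
    var ch = table.dec[key];
    out += ch === undefined ? '\uFFFD' : ch;
  }
  return out;
}

console.log(decodeDBCS(stub, [0xbb, 0xe3, 0xd7, 0xdc])); // 汇总
console.log(decodeDBCS(stub, [0x41, 0xbb, 0xe3]));       // A汇
```

Encoding is the reverse walk: look each character up in enc and emit one or two bytes depending on whether the code exceeds 0xFF.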
Building the script
This script uses voc. The script to build the codepage tables and
the JS source is codepage.md, so building is as simple as voc codepage.md.
Supported Codepages
The standard Windows codepages are supported:
- 1250 Windows Central Europe
- 1251 Windows Cyrillic
- 1252 Windows Latin I
- 1253 Windows Greek
- 1254 Windows Turkish
- 1255 Windows Hebrew
- 1256 Windows Arabic
- 1257 Windows Baltic
- 1258 Windows Vietnam
- 874 Windows Thai
The full collection of ISO-8859 codepages is also supported. The East-Asian
Double Byte Character Sets are also supported:
- 932 Japanese Shift-JIS
- 936 Simplified Chinese GBK
- 949 Korean
- 950 Traditional Chinese Big5
The complete list of supported codepages can be found in the file pages.csv.
Missing Codepages
The following codepages are not implemented. Normative references may not be available in all cases. Furthermore, other software packages are known to take shortcuts with certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic ISO-8859-6 when in fact there are many differences), so all implementations should be clean-room when possible.
- 709 Arabic (ASMO-449+, BCON V4)
- 710 Arabic - Transparent Arabic
- 870 IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2
- 1047 IBM EBCDIC Latin 1/Open System
- 1140 IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)
- 1141 IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)
- 1142 IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)
- 1143 IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)
- 1144 IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)
- 1145 IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)
- 1146 IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)
- 1147 IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)
- 1148 IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)
- 1149 IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)
- 1200 Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
- 1201 Unicode UTF-16, big endian byte order; available only to managed applications
- 1361 Korean (Johab)
- 10001 Japanese (Mac)
- 10002 MAC Traditional Chinese (Big5); Chinese Traditional (Mac)
- 10003 Korean (Mac)
- 10004 Arabic (Mac)
- 10005 Hebrew (Mac)
- 10008 MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)
- 10010 Romanian (Mac)
- 10017 Ukrainian (Mac)
- 10021 Thai (Mac)
- 10082 Croatian (Mac)
- 12000 Unicode UTF-32, little endian byte order; available only to managed applications
- 12001 Unicode UTF-32, big endian byte order; available only to managed applications
- 20000 CNS Taiwan; Chinese Traditional (CNS)
- 20001 TCA Taiwan
- 20002 Eten Taiwan; Chinese Traditional (Eten)
- 20003 IBM5550 Taiwan
- 20004 TeleText Taiwan
- 20005 Wang Taiwan
- 20105 IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)
- 20106 IA5 German (7-bit)
- 20107 IA5 Swedish (7-bit)
- 20108 IA5 Norwegian (7-bit)
- 20127 US-ASCII (7-bit)
- 20261 T.61
- 20269 ISO 6937 Non-Spacing Accent
- 20273 IBM EBCDIC Germany
- 20277 IBM EBCDIC Denmark-Norway
- 20278 IBM EBCDIC Finland-Sweden
- 20280 IBM EBCDIC Italy
- 20284 IBM EBCDIC Latin America-Spain
- 20285 IBM EBCDIC United Kingdom
- 20290 IBM EBCDIC Japanese Katakana Extended
- 20297 IBM EBCDIC France
- 20420 IBM EBCDIC Arabic
- 20423 IBM EBCDIC Greek
- 20424 IBM EBCDIC Hebrew
- 20833 IBM EBCDIC Korean Extended
- 20838 IBM EBCDIC Thai
- 20866 Russian (KOI8-R); Cyrillic (KOI8-R)
- 20871 IBM EBCDIC Icelandic
- 20880 IBM EBCDIC Cyrillic Russian
- 20905 IBM EBCDIC Turkish
- 20924 IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
- 20932 Japanese (JIS 0208-1990 and 0212-1990)
- 20936 Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)
- 20949 Korean Wansung
- 21025 IBM EBCDIC Cyrillic Serbian-Bulgarian
- 21027 (deprecated) <-- is this necessary?
- 21866 Ukrainian (KOI8-U); Cyrillic (KOI8-U)
- 29001 Europa 3
- 38598 ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
- 50220 ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)
- 50221 ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)
- 50222 ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)
- 50225 ISO 2022 Korean
- 50227 ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)
- 50229 ISO 2022 Traditional Chinese
- 50930 EBCDIC Japanese (Katakana) Extended
- 50931 EBCDIC US-Canada and Japanese
- 50933 EBCDIC Korean Extended and Korean
- 50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
- 50936 EBCDIC Simplified Chinese
- 50937 EBCDIC US-Canada and Traditional Chinese
- 50939 EBCDIC Japanese (Latin) Extended and Japanese
- 51932 EUC Japanese
- 51936 EUC Simplified Chinese; Chinese Simplified (EUC)
- 51949 EUC Korean
- 51950 EUC Traditional Chinese
- 52936 HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)
- 54936 Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
- 57002 ISCII Devanagari
- 57003 ISCII Bengali
- 57004 ISCII Tamil
- 57005 ISCII Telugu
- 57006 ISCII Assamese
- 57007 ISCII Oriya
- 57008 ISCII Kannada
- 57009 ISCII Malayalam
- 57010 ISCII Gujarati
- 57011 ISCII Punjabi
- 65000 Unicode (UTF-7)