js-codepage/misc/README.md.utf7
SheetJS 93513b6e52 version bump 1.3.0: performance
- more specializations in cptable
- removed functional badnesses in cptable
- bits reworked to minimize functional impact (which caused deopts)

some loss in coverage due to standard codepages missing astral characters
2014-06-26 01:54:13 -04:00


# Codepages for JS

[Codepages](https://en.wikipedia.org/wiki/Codepage) are character encodings. In
many contexts, single-byte character sets are used in lieu of standard multibyte
Unicode encodings. They use 256 characters with a simple mapping.

[unicode.org](http://www.unicode.org/Public/MAPPINGS/) hosts lists of mappings.
The build script automatically downloads and parses the mappings in order to
generate the full script. The `pages.csv` description in `codepage.md` controls
which codepages are used.
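
As a rough illustration of the parsing step, the sketch below reads text in the three-column layout used by many unicode.org mapping files (byte value, Unicode code point, `#` comment) and builds both lookup directions. `parseMapping` and the sample text are assumptions made for this example; this is not the code in `codepage.md`.

```javascript
// Sample rows: codepage byte, Unicode code point, then a "#" comment.
var sample = [
  "# Excerpt assembled for illustration",
  "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
  "0x80\t0x00C4\t# LATIN CAPITAL LETTER A WITH DIAERESIS",
  "0xFF\t0x02C7\t# CARON",
  "0x81\t\t# UNDEFINED"
].join("\n");

// parseMapping is a hypothetical helper, not part of the codepage library.
function parseMapping(text) {
  var dec = {}, enc = {};
  text.split(/\r?\n/).forEach(function(line) {
    var body = line.split("#")[0].trim();     // drop comments
    if(!body) return;                         // comment-only or blank line
    var cols = body.split(/\s+/);
    if(cols.length < 2) return;               // byte with no Unicode value
    var code = parseInt(cols[0], 16);         // radix 16 accepts the 0x prefix
    var ch = String.fromCharCode(parseInt(cols[1], 16)); // BMP values only
    dec[code] = ch;
    enc[ch] = code;
  });
  return { dec: dec, enc: enc };
}

var table = parseMapping(sample);
console.log(table.dec[0xFF] === "\u02C7"); // true
console.log(table.enc["\u00C4"]);          // 128
```

Rows without a Unicode value are skipped, so the byte stays unmapped in both directions.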

## Setup

In node:

    var cptable = require('codepage');

In the browser:

    <script src="cptable.js"></script>
    <script src="cputils.js"></script>

Alternatively, use the full version in the dist folder:

    <script src="cptable.full.js"></script>

The complete set of codepages is large due to some Double Byte Character Set
encodings. A much smaller file that just includes SBCS codepages is provided in
this repo (`sbcs.js`), as well as a file for other projects (`cpexcel.js`).

If you know which codepages you need, you can include individual scripts for
each codepage. The individual files are provided in the `bits/` directory.
For example, to include only the Mac codepages:

    <script src="bits/10000.js"></script>
    <script src="bits/10006.js"></script>
    <script src="bits/10007.js"></script>
    <script src="bits/10029.js"></script>
    <script src="bits/10079.js"></script>
    <script src="bits/10081.js"></script>

All of the browser scripts define and append to the `cptable` object. To rename
the object, edit the `JSVAR` shell variable in `make.sh` and run the script.

The utility functions are contained in `cputils.js`, which assumes that the
appropriate codepage scripts were loaded.
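
Because the utilities assume the tables are already present, it can help to check before calling them. The sketch below is one possible check; `missingCodepages` is a hypothetical helper and `loaded` is a stand-in for the global object, neither is provided by `cputils.js`.

```javascript
// missingCodepages is a hypothetical helper, not part of cputils.js.
// It returns the requested codepage numbers that have no enc/dec tables.
function missingCodepages(table, needed) {
  return needed.filter(function(cp) {
    var entry = table && table[cp];
    return !(entry && entry.enc && entry.dec);
  });
}

// Stand-in for the object that the browser scripts append to.
var loaded = { 10000: { enc: {}, dec: [] }, 10006: { enc: {}, dec: [] } };

console.log(missingCodepages(loaded, [10000, 10006, 10007]).join(",")); // 10007
```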

## Usage

The codepages are indexed by number. To get the Unicode character for a given
codepoint, use the `dec` property:

    var unicode_cp10000_255 = cptable[10000].dec[255]; // ˇ

To get the codepoint for a given character, use the `enc` property:

    var cp10000_711 = cptable[10000].enc[String.fromCharCode(711)]; // 255
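
To make the relationship between the two properties concrete, the sketch below builds a tiny stand-in table by hand: `dec` is indexed by byte value and `enc` is its inverse, keyed by character. The names `demo` and `lookup` are assumptions for this example, and only the 255 ↔ 711 pair comes from the text above.

```javascript
// Stand-in for a single-byte table; real tables come from the generated scripts.
var dec = [];
for(var i = 0; i < 128; ++i) dec[i] = String.fromCharCode(i); // ASCII range
dec[255] = String.fromCharCode(711);  // the pair quoted in this section

// enc is the inverse lookup, keyed by character (unset slots are skipped).
var enc = {};
dec.forEach(function(ch, code) { enc[ch] = code; });

var demo = { 10000: { dec: dec, enc: enc } }; // hypothetical container

// lookup is a hypothetical guard for characters the table does not contain.
function lookup(table, ch) {
  var code = table.enc[ch];
  if(code === undefined) throw new Error("unmapped character: " + ch);
  return code;
}

console.log(lookup(demo[10000], String.fromCharCode(711))); // 255
console.log(demo[10000].dec[65]);                            // A
```

In this stand-in, a character missing from the table yields `undefined` from the plain property lookup, which is why the guard checks for it explicitly.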

There are a few utilities that deal with strings and buffers:

    var 汇总 = cptable.utils.decode(936, [0xbb,0xe3,0xd7,0xdc]);
    var buf = cptable.utils.encode(936, 汇总);
    var sushi= cptable.utils.decode(65001, [0xf0,0x9f,0x8d,0xa3]); // 🍣
    var sbuf = cptable.utils.encode(65001, sushi);

`cptable.utils.encode(CP, data, ofmt)` accepts a String or Array of characters
and returns a representation controlled by `ofmt`:

- Default output is a Buffer (or Array) of bytes (integers between 0 and 255).
- If `ofmt == 'str'`, the output is a String where `o.charCodeAt(i)` is the ith byte.
- If `ofmt == 'arr'`, the output is an Array of bytes.
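
The three output shapes can be sketched for the single-byte case as follows. `encodeSbcs` is a hypothetical stand-in written for this example, not the library's implementation, and the two-entry `enc` table is illustrative only.

```javascript
// encodeSbcs is a hypothetical stand-in, not the library's implementation.
// enc maps a single character to its byte value.
function encodeSbcs(enc, data, ofmt) {
  var chars = typeof data === "string" ? data.split("") : data;
  var bytes = chars.map(function(ch) {
    if(!Object.prototype.hasOwnProperty.call(enc, ch))
      throw new Error("unmapped character: " + ch);
    return enc[ch];
  });
  if(ofmt === "str") return String.fromCharCode.apply(null, bytes);
  if(ofmt === "arr") return bytes;
  // Default: a Buffer where one exists (node), otherwise the plain Array.
  return typeof Buffer !== "undefined" ? Buffer.from(bytes) : bytes;
}

var enc = { "A": 0x41, "\u02C7": 0xFF }; // illustrative two-entry table

console.log(encodeSbcs(enc, "A\u02C7", "arr").join(" "));         // 65 255
console.log(encodeSbcs(enc, "A\u02C7", "str").charCodeAt(1));     // 255
console.log(encodeSbcs(enc, ["A", "\u02C7"]).length);             // 2
```

The `'str'` form is convenient when an API expects a binary string; each character of the result carries one byte in its character code.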

## Known Excel Codepages

A much smaller script, including only the codepages known to be used in Excel,
is available under the name `cpexcel`. It exposes the same variable `cptable`
and is suitable as a drop-in replacement when the full codepage tables are not
needed.

In node:

    var cptable = require('codepage/dist/cpexcel.full');

## Building the script

This script uses [voc](npm.im/voc). The script to build the codepage tables and
the JS source is `codepage.md`, so building is as simple as `voc codepage.md`.

## Generated Codepages

The complete list of hardcoded codepages can be found in the file `pages.csv`.

Some codepages are easier to implement algorithmically. Since these are
hardcoded in utils, there is no corresponding entry (they are "magic").

| CP# | Information | Description |
| --: | ----------- | ----------- |
| 37 | unicode.org | IBM EBCDIC US-Canada |
| 437 | unicode.org | OEM United States |
| 500 | unicode.org | IBM EBCDIC International |
| 708 | MakeEncoding.cs | Arabic (ASMO 708) |
| 720 | MakeEncoding.cs | Arabic (Transparent ASMO); Arabic (DOS) |
| 737 | unicode.org | OEM Greek (formerly 437G); Greek (DOS) |
| 775 | unicode.org | OEM Baltic; Baltic (DOS) |
| 850 | unicode.org | OEM Multilingual Latin 1; Western European (DOS) |
| 852 | unicode.org | OEM Latin 2; Central European (DOS) |
| 855 | unicode.org | OEM Cyrillic (primarily Russian) |
| 857 | unicode.org | OEM Turkish; Turkish (DOS) |
| 858 | MakeEncoding.cs | OEM Multilingual Latin 1 + Euro symbol |
| 860 | unicode.org | OEM Portuguese; Portuguese (DOS) |
| 861 | unicode.org | OEM Icelandic; Icelandic (DOS) |
| 862 | unicode.org | OEM Hebrew; Hebrew (DOS) |
| 863 | unicode.org | OEM French Canadian; French Canadian (DOS) |
| 864 | unicode.org | OEM Arabic; Arabic (864) |
| 865 | unicode.org | OEM Nordic; Nordic (DOS) |
| 866 | unicode.org | OEM Russian; Cyrillic (DOS) |
| 869 | unicode.org | OEM Modern Greek; Greek, Modern (DOS) |
| 870 | MakeEncoding.cs | IBM EBCDIC Multilingual/ROECE (Latin 2) |
| 874 | unicode.org | Windows Thai |
| 875 | unicode.org | IBM EBCDIC Greek Modern |
| 932 | unicode.org | Japanese Shift-JIS |
| 936 | unicode.org | Simplified Chinese GBK |
| 949 | unicode.org | Korean |
| 950 | unicode.org | Traditional Chinese Big5 |
| 1026 | unicode.org | IBM EBCDIC Turkish (Latin 5) |
| 1047 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System |
| 1140 | MakeEncoding.cs | IBM EBCDIC US-Canada (037 + Euro symbol) |
| 1141 | MakeEncoding.cs | IBM EBCDIC Germany (20273 + Euro symbol) |
| 1142 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway (20277 + Euro symbol) |
| 1143 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden (20278 + Euro symbol) |
| 1144 | MakeEncoding.cs | IBM EBCDIC Italy (20280 + Euro symbol) |
| 1145 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain (20284 + Euro symbol) |
| 1146 | MakeEncoding.cs | IBM EBCDIC United Kingdom (20285 + Euro symbol) |
| 1147 | MakeEncoding.cs | IBM EBCDIC France (20297 + Euro symbol) |
| 1148 | MakeEncoding.cs | IBM EBCDIC International (500 + Euro symbol) |
| 1149 | MakeEncoding.cs | IBM EBCDIC Icelandic (20871 + Euro symbol) |
| 1200 | magic | Unicode UTF-16, little endian (BMP of ISO 10646) |
| 1201 | magic | Unicode UTF-16, big endian |
| 1250 | unicode.org | Windows Central Europe |
| 1251 | unicode.org | Windows Cyrillic |
| 1252 | unicode.org | Windows Latin I |
| 1253 | unicode.org | Windows Greek |
| 1254 | unicode.org | Windows Turkish |
| 1255 | unicode.org | Windows Hebrew |
| 1256 | unicode.org | Windows Arabic |
| 1257 | unicode.org | Windows Baltic |
| 1258 | unicode.org | Windows Vietnam |
| 1361 | MakeEncoding.cs | Korean (Johab) |
| 10000 | unicode.org | MAC Roman |
| 10001 | MakeEncoding.cs | Japanese (Mac) |
| 10002 | MakeEncoding.cs | MAC Traditional Chinese (Big5) |
| 10003 | MakeEncoding.cs | Korean (Mac) |
| 10004 | MakeEncoding.cs | Arabic (Mac) |
| 10005 | MakeEncoding.cs | Hebrew (Mac) |
| 10006 | unicode.org | Greek (Mac) |
| 10007 | unicode.org | Cyrillic (Mac) |
| 10008 | MakeEncoding.cs | MAC Simplified Chinese (GB 2312) |
| 10010 | MakeEncoding.cs | Romanian (Mac) |
| 10017 | MakeEncoding.cs | Ukrainian (Mac) |
| 10021 | MakeEncoding.cs | Thai (Mac) |
| 10029 | unicode.org | MAC Latin 2 (Central European) |
| 10079 | unicode.org | Icelandic (Mac) |
| 10081 | unicode.org | Turkish (Mac) |
| 10082 | MakeEncoding.cs | Croatian (Mac) |
| 12000 | magic | Unicode UTF-32, little endian byte order |
| 12001 | magic | Unicode UTF-32, big endian byte order |
| 20000 | MakeEncoding.cs | CNS Taiwan (Chinese Traditional) |
| 20001 | MakeEncoding.cs | TCA Taiwan |
| 20002 | MakeEncoding.cs | Eten Taiwan (Chinese Traditional) |
| 20003 | MakeEncoding.cs | IBM5550 Taiwan |
| 20004 | MakeEncoding.cs | TeleText Taiwan |
| 20005 | MakeEncoding.cs | Wang Taiwan |
| 20105 | MakeEncoding.cs | Western European IA5 (IRV International Alphabet 5) 7-bit |
| 20106 | MakeEncoding.cs | IA5 German (7-bit) |
| 20107 | MakeEncoding.cs | IA5 Swedish (7-bit) |
| 20108 | MakeEncoding.cs | IA5 Norwegian (7-bit) |
| 20127 | magic | US-ASCII (7-bit) |
| 20261 | MakeEncoding.cs | T.61 |
| 20269 | MakeEncoding.cs | ISO 6937 Non-Spacing Accent |
| 20273 | MakeEncoding.cs | IBM EBCDIC Germany |
| 20277 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway |
| 20278 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden |
| 20280 | MakeEncoding.cs | IBM EBCDIC Italy |
| 20284 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain |
| 20285 | MakeEncoding.cs | IBM EBCDIC United Kingdom |
| 20290 | MakeEncoding.cs | IBM EBCDIC Japanese Katakana Extended |
| 20297 | MakeEncoding.cs | IBM EBCDIC France |
| 20420 | MakeEncoding.cs | IBM EBCDIC Arabic |
| 20423 | MakeEncoding.cs | IBM EBCDIC Greek |
| 20424 | MakeEncoding.cs | IBM EBCDIC Hebrew |
| 20833 | MakeEncoding.cs | IBM EBCDIC Korean Extended |
| 20838 | MakeEncoding.cs | IBM EBCDIC Thai |
| 20866 | MakeEncoding.cs | Russian Cyrillic (KOI8-R) |
| 20871 | MakeEncoding.cs | IBM EBCDIC Icelandic |
| 20880 | MakeEncoding.cs | IBM EBCDIC Cyrillic Russian |
| 20905 | MakeEncoding.cs | IBM EBCDIC Turkish |
| 20924 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System (1047 + Euro symbol) |
| 20932 | MakeEncoding.cs | Japanese (JIS 0208-1990 and 0212-1990) |
| 20936 | MakeEncoding.cs | Simplified Chinese (GB2312-80) |
| 20949 | MakeEncoding.cs | Korean Wansung |
| 21025 | MakeEncoding.cs | IBM EBCDIC Cyrillic Serbian-Bulgarian |
| 21866 | MakeEncoding.cs | Ukrainian Cyrillic (KOI8-U) |
| 28591 | unicode.org | ISO 8859-1 Latin 1 (Western European) |
| 28592 | unicode.org | ISO 8859-2 Latin 2 (Central European) |
| 28593 | unicode.org | ISO 8859-3 Latin 3 |
| 28594 | unicode.org | ISO 8859-4 Baltic |
| 28595 | unicode.org | ISO 8859-5 Cyrillic |
| 28596 | unicode.org | ISO 8859-6 Arabic |
| 28597 | unicode.org | ISO 8859-7 Greek |
| 28598 | unicode.org | ISO 8859-8 Hebrew (ISO-Visual) |
| 28599 | unicode.org | ISO 8859-9 Turkish |
| 28600 | unicode.org | ISO 8859-10 Latin 6 |
| 28601 | unicode.org | ISO 8859-11 Latin (Thai) |
| 28603 | unicode.org | ISO 8859-13 Latin 7 (Estonian) |
| 28604 | unicode.org | ISO 8859-14 Latin 8 (Celtic) |
| 28605 | unicode.org | ISO 8859-15 Latin 9 |
| 28606 | unicode.org | ISO 8859-16 Latin 10 |
| 29001 | MakeEncoding.cs | Europa 3 |
| 38598 | MakeEncoding.cs | ISO 8859-8 Hebrew (ISO-Logical) |
| 50220 | MakeEncoding.cs | ISO 2022 JIS Japanese with no halfwidth Katakana |
| 50221 | MakeEncoding.cs | ISO 2022 JIS Japanese with halfwidth Katakana |
| 50222 | MakeEncoding.cs | ISO 2022 Japanese JIS X 0201-1989 (1 byte Kana-SO/SI) |
| 50225 | MakeEncoding.cs | ISO 2022 Korean |
| 50227 | MakeEncoding.cs | ISO 2022 Simplified Chinese |
| 51932 | MakeEncoding.cs | EUC Japanese |
| 51936 | MakeEncoding.cs | EUC Simplified Chinese |
| 51949 | MakeEncoding.cs | EUC Korean |
| 52936 | MakeEncoding.cs | HZ-GB2312 Simplified Chinese |
| 54936 | MakeEncoding.cs | GB18030 Simplified Chinese (4 byte) |
| 57002 | MakeEncoding.cs | ISCII Devanagari |
| 57003 | MakeEncoding.cs | ISCII Bengali |
| 57004 | MakeEncoding.cs | ISCII Tamil |
| 57005 | MakeEncoding.cs | ISCII Telugu |
| 57006 | MakeEncoding.cs | ISCII Assamese |
| 57007 | MakeEncoding.cs | ISCII Oriya |
| 57008 | MakeEncoding.cs | ISCII Kannada |
| 57009 | MakeEncoding.cs | ISCII Malayalam |
| 57010 | MakeEncoding.cs | ISCII Gujarati |
| 57011 | MakeEncoding.cs | ISCII Punjabi |
| 65000 | magic | Unicode (UTF-7) |
| 65001 | magic | Unicode (UTF-8) |
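
To show why a "magic" codepage needs no lookup table, here is a minimal sketch of codepage 1200 (UTF-16, little endian) written directly as arithmetic on code units. `utf16leEncode` and `utf16leDecode` are hypothetical names for this example and are not the functions used in utils.

```javascript
// Hypothetical helpers illustrating an algorithmic ("magic") codepage.
function utf16leEncode(str) {
  var out = [];
  for(var i = 0; i < str.length; ++i) {
    var unit = str.charCodeAt(i);       // one UTF-16 code unit
    out.push(unit & 0xFF, unit >>> 8);  // low byte first, then high byte
  }
  return out;
}

function utf16leDecode(bytes) {
  var out = "";
  // A trailing odd byte, if any, is ignored in this sketch.
  for(var i = 0; i + 1 < bytes.length; i += 2) {
    out += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return out;
}

var bytes = utf16leEncode("A\uD83C\uDF63"); // "A" followed by U+1F363
console.log(bytes.join(" "));                          // 65 0 60 216 99 223
console.log(utf16leDecode(bytes) === "A\uD83C\uDF63"); // true
```

Since JavaScript strings are already sequences of UTF-16 code units, surrogate pairs pass through unchanged and astral characters need no special handling here.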

Note that MakeEncoding.cs deviates from unicode.org for some codepages. In the
case of direct conflicts, unicode.org takes precedence. In cases where the
unicode.org listing does not prescribe a value, the MakeEncoding.cs value is used.
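
The precedence rule can be sketched as a two-pass merge: copy the fallback source first, then let the primary source overwrite it. `mergeMappings` is a hypothetical helper and the sample values are made up for illustration; they are not real codepage data.

```javascript
// mergeMappings is a hypothetical helper; inputs map byte values to characters.
function mergeMappings(primary, fallback) {
  var merged = {};
  Object.keys(fallback).forEach(function(k) { merged[k] = fallback[k]; });
  Object.keys(primary).forEach(function(k) { merged[k] = primary[k]; });
  return merged;
}

// Made-up values for illustration, not real codepage data.
var fromUnicodeOrg = { 0x41: "A", 0x80: "\u20AC" };
var fromMakeEncoding = { 0x41: "A", 0x80: "\u00C7", 0x81: "\u00FC" };

var merged = mergeMappings(fromUnicodeOrg, fromMakeEncoding);
console.log(merged[0x80] === "\u20AC"); // true: direct conflict, unicode.org wins
console.log(merged[0x81] === "\u00FC"); // true: gap filled from MakeEncoding.cs
```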

## Missing Codepages

The following codepages are not implemented. Normative references may not be
available in all cases. Furthermore, other software packages are known to hack
certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic
ISO-8859-6 when in fact there are many differences), so all implementations
*should* be cleanroom when possible.

- 709 Arabic (ASMO-449+, BCON V4)
- 710 Arabic - Transparent Arabic
- 21027 (deprecated) <-- is this necessary?
- 50229 ISO 2022 Traditional Chinese
- 50930 EBCDIC Japanese (Katakana) Extended
- 50931 EBCDIC US-Canada and Japanese
- 50933 EBCDIC Korean Extended and Korean
- 50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
- 50936 EBCDIC Simplified Chinese
- 50937 EBCDIC US-Canada and Traditional Chinese
- 50939 EBCDIC Japanese (Latin) Extended and Japanese
- 51950 EUC Traditional Chinese

## Sources

- [Unicode Consortium Public Mappings](http://www.unicode.org/Public/MAPPINGS/)
- [Code Page Enumeration](http://msdn.microsoft.com/en-us/library/cc195051.aspx)
- [Code Page Identifiers](http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756.aspx)

## Badges

[![githalytics.com alpha](https://cruel-carlota.pagodabox.com/afa29a5e8495a01059ee5b353f9042cb "githalytics.com")](http://githalytics.com/SheetJS/js-codepage)

[![Build Status](https://travis-ci.org/SheetJS/js-codepage.svg?branch=master)](https://travis-ci.org/SheetJS/js-codepage)

[![Coverage Status](https://coveralls.io/repos/SheetJS/js-codepage/badge.png)](https://coveralls.io/r/SheetJS/js-codepage)