# Codepages for JS

[Codepages](https://en.wikipedia.org/wiki/Codepage) are character encodings. In
many contexts, single- or double-byte character sets are used in lieu of Unicode
encodings. The codepages map between characters and numbers.

[unicode.org](http://www.unicode.org/Public/MAPPINGS/) hosts lists of mappings.
The build script automatically downloads and parses the mappings in order to
generate the full script. The `pages.csv` description in `codepage.md` controls
which codepages are used.

## Setup

In node:

    var cptable = require('codepage');

In the browser:

    <script src="cptable.js"></script>
    <script src="cputils.js"></script>

Alternatively, use the full version in the dist folder:

    <script src="cptable.full.js"></script>

The complete set of codepages is large due to some Double Byte Character Set
encodings. A much smaller file that includes only the SBCS codepages is provided
in this repo (`sbcs.js`), as well as a file for other projects (`cpexcel.js`).

If you know which codepages you need, you can include individual scripts for
each codepage. The individual files are provided in the `bits/` directory.
For example, to include only the Mac codepages:

    <script src="bits/10000.js"></script>
    <script src="bits/10006.js"></script>
    <script src="bits/10007.js"></script>
    <script src="bits/10029.js"></script>
    <script src="bits/10079.js"></script>
    <script src="bits/10081.js"></script>

All of the browser scripts define and append to the `cptable` object. To rename
the object, edit the `JSVAR` shell variable in `make.sh` and run the script.

The utility functions are contained in `cputils.js`, which assumes that the
appropriate codepage scripts were loaded.

## Usage

The codepages are indexed by number. To get the Unicode character for a given
codepoint, use the `dec` property:

    var unicode_cp10000_255 = cptable[10000].dec[255]; // ˇ

To get the codepoint for a given character, use the `enc` property:

    var cp10000_711 = cptable[10000].enc[String.fromCharCode(711)]; // 255

There are a few utilities that deal with strings and buffers:

    var 汇总 = cptable.utils.decode(936, [0xbb,0xe3,0xd7,0xdc]);
    var buf = cptable.utils.encode(936, 汇总);
    var sushi = cptable.utils.decode(65001, [0xf0,0x9f,0x8d,0xa3]); // 🍣
    var sbuf = cptable.utils.encode(65001, sushi);
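
Codepage 65001 is UTF-8, so the sushi bytes above can be cross-checked without this library. The sketch below assumes a runtime that provides the standard `TextEncoder` and `TextDecoder` globals (modern browsers and recent versions of node); it does not call `cptable` at all.

```javascript
// TextEncoder/TextDecoder are platform built-ins, not part of this library.
var sushiBytes = [0xf0, 0x9f, 0x8d, 0xa3];

// Bytes -> string, the same direction as utils.decode(65001, ...)
var decoded = new TextDecoder('utf-8').decode(new Uint8Array(sushiBytes));

// String -> bytes, the same direction as utils.encode(65001, ...)
var reencoded = Array.from(new TextEncoder().encode(decoded));

// One character outside the BMP occupies two UTF-16 code units in a JS string.
var units = decoded.length;
var codePoint = decoded.codePointAt(0);
```

The four input bytes come back as a single character that occupies two string positions, which is worth keeping in mind when indexing decoded strings.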

`cptable.utils.encode(CP, data, ofmt)` accepts a String or Array of characters
and returns a representation controlled by `ofmt`:

- Default output is a Buffer (or Array) of bytes (integers between 0 and 255).
- If `ofmt == 'str'`, return a String where `o.charCodeAt(i)` is the `i`th byte.
- If `ofmt == 'arr'`, return an Array of bytes.
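
The `'str'` and `'arr'` shapes carry the same bytes in different containers. The sketch below shows the relationship using two hypothetical helpers, `bytesToStr` and `strToBytes`, which are not part of the library; the byte values are the GBK bytes from the usage example.

```javascript
// Hypothetical helpers: they only convert between the output shapes
// described above and do not perform any codepage lookup.
function bytesToStr(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; ++i) out += String.fromCharCode(bytes[i]);
  return out;
}

function strToBytes(str) {
  var out = [];
  for (var i = 0; i < str.length; ++i) out.push(str.charCodeAt(i) & 0xFF);
  return out;
}

var arr = [0xbb, 0xe3, 0xd7, 0xdc]; // 'arr' shape: plain Array of bytes
var str = bytesToStr(arr);          // 'str' shape: one character per byte
var roundTrip = strToBytes(str);    // back to an Array of bytes
```

A string in the `'str'` shape is a byte container, not readable text: each character stands for one byte of the encoded output.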

## Known Excel Codepages

A much smaller script, including only the codepages known to be used in Excel,
is available under the name `cpexcel`. It exposes the same variable `cptable`
and is suitable as a drop-in replacement when the full codepage tables are not
needed.

In node:

    var cptable = require('codepage/dist/cpexcel.full');

## Rolling your own script

The `make.sh` script in the repo can take a manifest and generate JS source.

Usage:

    bash make.sh path_to_manifest output_file_name JSVAR

where

- `JSVAR` is the name of the exported variable (generally `cptable`)
- `output_file_name` is the output file (e.g. `cpexcel.js`, `cptable.js`)
- `path_to_manifest` is the path to the manifest file.

The manifest file is expected to be a CSV with 3 columns:

    <codepage number>,<source>,<size>

If a source is specified, the script will try to download the specified file and
parse it. The file is expected to follow the format used on the unicode.org site.
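
For orientation, the mapping files on unicode.org are plain text: `#` starts a comment, and each data line holds a hexadecimal byte value followed by a hexadecimal Unicode value. The sketch below parses text of that general shape into `dec`/`enc` lookups. `parseMapping` is a hypothetical name and the sample lines are made up for illustration; the actual build script may handle the files differently.

```javascript
// Minimal sketch of a parser for unicode.org-style mapping text.
function parseMapping(text) {
  var dec = {}, enc = {};
  text.split(/\r?\n/).forEach(function (line) {
    var body = line.split('#')[0].trim(); // drop comments
    if (!body) return;                    // blank or comment-only line
    var fields = body.split(/\s+/);
    if (fields.length < 2) return;        // byte with no Unicode assignment
    var code = parseInt(fields[0], 16);
    var uni = parseInt(fields[1], 16);
    if (isNaN(code) || isNaN(uni)) return;
    var ch = String.fromCodePoint(uni);
    dec[code] = ch; // byte value -> character
    enc[ch] = code; // character -> byte value
  });
  return { dec: dec, enc: enc };
}

// Made-up sample in the same general shape as a mapping file.
var sample = [
  '# comment lines start with a hash',
  '0x41\t0x0041\t#LATIN CAPITAL LETTER A',
  '0x80\t0x00c7\t#LATIN CAPITAL LETTER C WITH CEDILLA',
  '0x81\t      \t#UNDEFINED'
].join('\n');

var table = parseMapping(sample);
```

Lines that name a byte but give no Unicode value are skipped here, so the resulting table has no entry for them.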

The size should be `1` for a single-byte codepage and `2` for a double-byte
codepage. For mixed codepages (which use some single- and some double-byte
codes), the script assumes the mapping is a prefix code and generates efficient
JS code.
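
The prefix-code assumption means that no single-byte code is also the first byte of a double-byte code, so a decoder can decide how many bytes to consume by looking at one byte. The sketch below shows that idea with a hypothetical table layout (`mixedDec`) and a hypothetical function (`decodeMixed`); the generated code in this repo may be organized differently. The two double-byte entries are the GBK pairs from the usage example.

```javascript
// Hypothetical mixed table: keys below 0x100 are single-byte codes, keys at
// or above 0x100 are (lead << 8 | trail) double-byte codes.
var mixedDec = {
  0x41: 'A',
  0xBBE3: '\u6C47',
  0xD7DC: '\u603B'
};

function decodeMixed(dec, bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; ++i) {
    var one = dec[bytes[i]];
    if (one !== undefined) { out += one; continue; } // complete 1-byte code
    var two = dec[(bytes[i] << 8) | bytes[i + 1]];   // otherwise read a pair
    if (two === undefined) throw new Error('unmapped byte at offset ' + i);
    out += two;
    ++i; // consume the trail byte
  }
  return out;
}

var text = decodeMixed(mixedDec, [0x41, 0xbb, 0xe3, 0xd7, 0xdc]);
```

Because a byte is either a complete code or a lead byte, the loop never needs to backtrack.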

Generated scripts only include the mapping. `cat` a mapping with `cputils.js`
to produce a complete script like `cpexcel.full.js`.

## Building the complete script

This script uses [voc](https://npm.im/voc). The script to build the codepage
tables and the JS source is `codepage.md`, so building is as simple as
`voc codepage.md`.

## Generated Codepages

The complete list of hardcoded codepages can be found in the file `pages.csv`.

Some codepages are easier to implement algorithmically. Since these are
hardcoded in utils, there is no corresponding entry in `pages.csv`; they are
marked "magic" in the table below.

|   CP# | Information     | Description |
| ----: | :-------------: | :---------- |
|    37 | unicode.org     | IBM EBCDIC US-Canada |
|   437 | unicode.org     | OEM United States |
|   500 | unicode.org     | IBM EBCDIC International |
|   620 | NLS             | Mazovia (Polish) MS-DOS |
|   708 | MakeEncoding.cs | Arabic (ASMO 708) |
|   720 | MakeEncoding.cs | Arabic (Transparent ASMO); Arabic (DOS) |
|   737 | unicode.org     | OEM Greek (formerly 437G); Greek (DOS) |
|   775 | unicode.org     | OEM Baltic; Baltic (DOS) |
|   850 | unicode.org     | OEM Multilingual Latin 1; Western European (DOS) |
|   852 | unicode.org     | OEM Latin 2; Central European (DOS) |
|   855 | unicode.org     | OEM Cyrillic (primarily Russian) |
|   857 | unicode.org     | OEM Turkish; Turkish (DOS) |
|   858 | MakeEncoding.cs | OEM Multilingual Latin 1 + Euro symbol |
|   860 | unicode.org     | OEM Portuguese; Portuguese (DOS) |
|   861 | unicode.org     | OEM Icelandic; Icelandic (DOS) |
|   862 | unicode.org     | OEM Hebrew; Hebrew (DOS) |
|   863 | unicode.org     | OEM French Canadian; French Canadian (DOS) |
|   864 | unicode.org     | OEM Arabic; Arabic (864) |
|   865 | unicode.org     | OEM Nordic; Nordic (DOS) |
|   866 | unicode.org     | OEM Russian; Cyrillic (DOS) |
|   869 | unicode.org     | OEM Modern Greek; Greek, Modern (DOS) |
|   870 | MakeEncoding.cs | IBM EBCDIC Multilingual/ROECE (Latin 2) |
|   874 | unicode.org     | Windows Thai |
|   875 | unicode.org     | IBM EBCDIC Greek Modern |
|   895 | NLS             | Kamenický (Czech) MS-DOS |
|   932 | unicode.org     | Japanese Shift-JIS |
|   936 | unicode.org     | Simplified Chinese GBK |
|   949 | unicode.org     | Korean |
|   950 | unicode.org     | Traditional Chinese Big5 |
|  1026 | unicode.org     | IBM EBCDIC Turkish (Latin 5) |
|  1047 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System |
|  1140 | MakeEncoding.cs | IBM EBCDIC US-Canada (037 + Euro symbol) |
|  1141 | MakeEncoding.cs | IBM EBCDIC Germany (20273 + Euro symbol) |
|  1142 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway (20277 + Euro symbol) |
|  1143 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden (20278 + Euro symbol) |
|  1144 | MakeEncoding.cs | IBM EBCDIC Italy (20280 + Euro symbol) |
|  1145 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain (20284 + Euro symbol) |
|  1146 | MakeEncoding.cs | IBM EBCDIC United Kingdom (20285 + Euro symbol) |
|  1147 | MakeEncoding.cs | IBM EBCDIC France (20297 + Euro symbol) |
|  1148 | MakeEncoding.cs | IBM EBCDIC International (500 + Euro symbol) |
|  1149 | MakeEncoding.cs | IBM EBCDIC Icelandic (20871 + Euro symbol) |
|  1200 | magic           | Unicode UTF-16, little endian (BMP of ISO 10646) |
|  1201 | magic           | Unicode UTF-16, big endian |
|  1250 | unicode.org     | Windows Central Europe |
|  1251 | unicode.org     | Windows Cyrillic |
|  1252 | unicode.org     | Windows Latin I |
|  1253 | unicode.org     | Windows Greek |
|  1254 | unicode.org     | Windows Turkish |
|  1255 | unicode.org     | Windows Hebrew |
|  1256 | unicode.org     | Windows Arabic |
|  1257 | unicode.org     | Windows Baltic |
|  1258 | unicode.org     | Windows Vietnam |
|  1361 | MakeEncoding.cs | Korean (Johab) |
| 10000 | unicode.org     | MAC Roman |
| 10001 | MakeEncoding.cs | Japanese (Mac) |
| 10002 | MakeEncoding.cs | MAC Traditional Chinese (Big5) |
| 10003 | MakeEncoding.cs | Korean (Mac) |
| 10004 | MakeEncoding.cs | Arabic (Mac) |
| 10005 | MakeEncoding.cs | Hebrew (Mac) |
| 10006 | unicode.org     | Greek (Mac) |
| 10007 | unicode.org     | Cyrillic (Mac) |
| 10008 | MakeEncoding.cs | MAC Simplified Chinese (GB 2312) |
| 10010 | MakeEncoding.cs | Romanian (Mac) |
| 10017 | MakeEncoding.cs | Ukrainian (Mac) |
| 10021 | MakeEncoding.cs | Thai (Mac) |
| 10029 | unicode.org     | MAC Latin 2 (Central European) |
| 10079 | unicode.org     | Icelandic (Mac) |
| 10081 | unicode.org     | Turkish (Mac) |
| 10082 | MakeEncoding.cs | Croatian (Mac) |
| 12000 | magic           | Unicode UTF-32, little endian byte order |
| 12001 | magic           | Unicode UTF-32, big endian byte order |
| 20000 | MakeEncoding.cs | CNS Taiwan (Chinese Traditional) |
| 20001 | MakeEncoding.cs | TCA Taiwan |
| 20002 | MakeEncoding.cs | Eten Taiwan (Chinese Traditional) |
| 20003 | MakeEncoding.cs | IBM5550 Taiwan |
| 20004 | MakeEncoding.cs | TeleText Taiwan |
| 20005 | MakeEncoding.cs | Wang Taiwan |
| 20105 | MakeEncoding.cs | Western European IA5 (IRV International Alphabet 5) 7-bit |
| 20106 | MakeEncoding.cs | IA5 German (7-bit) |
| 20107 | MakeEncoding.cs | IA5 Swedish (7-bit) |
| 20108 | MakeEncoding.cs | IA5 Norwegian (7-bit) |
| 20127 | magic           | US-ASCII (7-bit) |
| 20261 | MakeEncoding.cs | T.61 |
| 20269 | MakeEncoding.cs | ISO 6937 Non-Spacing Accent |
| 20273 | MakeEncoding.cs | IBM EBCDIC Germany |
| 20277 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway |
| 20278 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden |
| 20280 | MakeEncoding.cs | IBM EBCDIC Italy |
| 20284 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain |
| 20285 | MakeEncoding.cs | IBM EBCDIC United Kingdom |
| 20290 | MakeEncoding.cs | IBM EBCDIC Japanese Katakana Extended |
| 20297 | MakeEncoding.cs | IBM EBCDIC France |
| 20420 | MakeEncoding.cs | IBM EBCDIC Arabic |
| 20423 | MakeEncoding.cs | IBM EBCDIC Greek |
| 20424 | MakeEncoding.cs | IBM EBCDIC Hebrew |
| 20833 | MakeEncoding.cs | IBM EBCDIC Korean Extended |
| 20838 | MakeEncoding.cs | IBM EBCDIC Thai |
| 20866 | MakeEncoding.cs | Russian Cyrillic (KOI8-R) |
| 20871 | MakeEncoding.cs | IBM EBCDIC Icelandic |
| 20880 | MakeEncoding.cs | IBM EBCDIC Cyrillic Russian |
| 20905 | MakeEncoding.cs | IBM EBCDIC Turkish |
| 20924 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System (1047 + Euro symbol) |
| 20932 | MakeEncoding.cs | Japanese (JIS 0208-1990 and 0212-1990) |
| 20936 | MakeEncoding.cs | Simplified Chinese (GB2312-80) |
| 20949 | MakeEncoding.cs | Korean Wansung |
| 21025 | MakeEncoding.cs | IBM EBCDIC Cyrillic Serbian-Bulgarian |
| 21027 | NLS             | Extended/Ext Alpha Lowercase |
| 21866 | MakeEncoding.cs | Ukrainian Cyrillic (KOI8-U) |
| 28591 | unicode.org     | ISO 8859-1 Latin 1 (Western European) |
| 28592 | unicode.org     | ISO 8859-2 Latin 2 (Central European) |
| 28593 | unicode.org     | ISO 8859-3 Latin 3 |
| 28594 | unicode.org     | ISO 8859-4 Baltic |
| 28595 | unicode.org     | ISO 8859-5 Cyrillic |
| 28596 | unicode.org     | ISO 8859-6 Arabic |
| 28597 | unicode.org     | ISO 8859-7 Greek |
| 28598 | unicode.org     | ISO 8859-8 Hebrew (ISO-Visual) |
| 28599 | unicode.org     | ISO 8859-9 Turkish |
| 28600 | unicode.org     | ISO 8859-10 Latin 6 |
| 28601 | unicode.org     | ISO 8859-11 Latin (Thai) |
| 28603 | unicode.org     | ISO 8859-13 Latin 7 (Estonian) |
| 28604 | unicode.org     | ISO 8859-14 Latin 8 (Celtic) |
| 28605 | unicode.org     | ISO 8859-15 Latin 9 |
| 28606 | unicode.org     | ISO 8859-16 Latin 10 |
| 29001 | MakeEncoding.cs | Europa 3 |
| 38598 | MakeEncoding.cs | ISO 8859-8 Hebrew (ISO-Logical) |
| 50220 | MakeEncoding.cs | ISO 2022 JIS Japanese with no halfwidth Katakana |
| 50221 | MakeEncoding.cs | ISO 2022 JIS Japanese with halfwidth Katakana |
| 50222 | MakeEncoding.cs | ISO 2022 Japanese JIS X 0201-1989 (1 byte Kana-SO/SI) |
| 50225 | MakeEncoding.cs | ISO 2022 Korean |
| 50227 | MakeEncoding.cs | ISO 2022 Simplified Chinese |
| 51932 | MakeEncoding.cs | EUC Japanese |
| 51936 | MakeEncoding.cs | EUC Simplified Chinese |
| 51949 | MakeEncoding.cs | EUC Korean |
| 52936 | MakeEncoding.cs | HZ-GB2312 Simplified Chinese |
| 54936 | MakeEncoding.cs | GB18030 Simplified Chinese (4 byte) |
| 57002 | MakeEncoding.cs | ISCII Devanagari |
| 57003 | MakeEncoding.cs | ISCII Bengali |
| 57004 | MakeEncoding.cs | ISCII Tamil |
| 57005 | MakeEncoding.cs | ISCII Telugu |
| 57006 | MakeEncoding.cs | ISCII Assamese |
| 57007 | MakeEncoding.cs | ISCII Oriya |
| 57008 | MakeEncoding.cs | ISCII Kannada |
| 57009 | MakeEncoding.cs | ISCII Malayalam |
| 57010 | MakeEncoding.cs | ISCII Gujarati |
| 57011 | MakeEncoding.cs | ISCII Punjabi |
| 65000 | magic           | Unicode (UTF-7) |
| 65001 | magic           | Unicode (UTF-8) |
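
The rows marked `magic` need no lookup table because the bytes can be computed directly from the characters. As one illustration, the sketch below handles codepage 1200 (UTF-16, little endian) with two hypothetical functions, `encodeUtf16le` and `decodeUtf16le`; the library's own utilities may implement this differently.

```javascript
// Sketch only: JS strings are already sequences of UTF-16 code units, so
// each unit is split into two bytes, low byte first.
function encodeUtf16le(str) {
  var out = [];
  for (var i = 0; i < str.length; ++i) {
    var unit = str.charCodeAt(i);
    out.push(unit & 0xFF, unit >>> 8); // low byte first = little endian
  }
  return out;
}

function decodeUtf16le(bytes) {
  var out = '';
  for (var i = 0; i + 1 < bytes.length; i += 2) {
    out += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return out;
}

var le = encodeUtf16le('A\u02C7');
```

Swapping the two pushed bytes would give the big-endian variant listed as codepage 1201.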

Note that MakeEncoding.cs deviates from unicode.org for some codepages. In the
case of direct conflicts, unicode.org takes precedence. In cases where the
unicode.org listing does not prescribe a value, the MakeEncoding.cs value is used.
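
The precedence rule amounts to a merge in which unicode.org entries overwrite MakeEncoding.cs entries and gaps are filled from MakeEncoding.cs. The sketch below shows that rule on two hypothetical partial mappings; the values are invented to show the behavior and are not taken from either source.

```javascript
// Hypothetical partial mappings for one codepage (byte -> Unicode code point).
var fromUnicodeOrg = { 0x80: 0x20AC, 0x82: 0x201A };
var fromMakeEncoding = { 0x80: 0x0080, 0x81: 0x0081, 0x82: 0x201A };

// Later sources win in Object.assign, so unicode.org is applied last.
var merged = Object.assign({}, fromMakeEncoding, fromUnicodeOrg);
```

Here byte `0x80` takes the unicode.org value because the sources conflict, while byte `0x81` keeps the MakeEncoding.cs value because unicode.org has no entry for it.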

NLS refers to the National Language Support files supplied in various versions of
Windows. In older versions of Windows (e.g. Windows 98) these files followed the
pattern `CP_#.NLS`, but newer versions use the pattern `C_#.NLS`.

## Sources

- [Unicode Consortium Public Mappings](http://www.unicode.org/Public/MAPPINGS/)
- [Code Page Enumeration](http://msdn.microsoft.com/en-us/library/cc195051.aspx)
- [Code Page Identifiers](http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756.aspx)

## Badges

[![Build Status](https://travis-ci.org/SheetJS/js-codepage.svg?branch=master)](https://travis-ci.org/SheetJS/js-codepage)

[![Coverage Status](https://coveralls.io/repos/SheetJS/js-codepage/badge.png)](https://coveralls.io/r/SheetJS/js-codepage)

[![Analytics](https://ga-beacon.appspot.com/UA-36810333-1/SheetJS/js-codepage?pixel)](https://github.com/SheetJS/js-codepage)