SheetJS 93513b6e52 version bump 1.3.0: performance

- more specializations in cptable
- removed functional badnesses in cptable
- bits reworked to minimize functional impact (which caused deopts)

some loss in coverage due to standard codepages missing astral characters

2014-06-26 01:54:13 -04:00


Codepages for JS

Codepages are character encodings. In many contexts, single-byte character sets are used in lieu of standard multibyte Unicode encodings. Each one maps the 256 possible byte values to characters through a simple lookup table.

unicode.org hosts lists of mappings. The build script automatically downloads and parses the mappings in order to generate the full script. The pages.csv description in codepage.md controls which codepages are used.
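As a rough illustration of the parsing step, the sketch below reads text in the two-column hex layout commonly used by unicode.org mapping files (byte value, Unicode code point, then a `#` comment). The `parseMapping` helper and the sample rows are hypothetical; the actual build logic lives in codepage.md and may differ.

```javascript
// Hypothetical helper (not part of the library): parse mapping-file text
// into an object that maps byte values to Unicode code points.
function parseMapping(text) {
  var table = {};
  text.split(/\r?\n/).forEach(function (line) {
    // Drop trailing comments and surrounding whitespace.
    var body = line.replace(/#.*$/, "").trim();
    if (!body) return;
    var fields = body.split(/\s+/);
    // Rows for undefined bytes carry only one field; skip them.
    if (fields.length < 2) return;
    table[parseInt(fields[0], 16)] = parseInt(fields[1], 16);
  });
  return table;
}

// Sample rows written for this example, assuming the layout described above.
var sample = [
  "# sample rows in the two-column hex format",
  "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
  "0x80\t0x20AC\t# EURO SIGN",
  "0x81\t\t# UNDEFINED"
].join("\n");

var table = parseMapping(sample);
console.log(table[0x80].toString(16)); // 20ac
```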

Setup

In node:

var cptable = require('codepage');

In the browser:

<script src="cptable.js"></script>
<script src="cputils.js"></script>

Alternatively, use the full version in the dist folder:

<script src="cptable.full.js"></script>

The complete set of codepages is large due to some Double Byte Character Set (DBCS) encodings. A much smaller file that includes only the SBCS codepages is provided in this repo (sbcs.js), as well as a file for other projects (cpexcel.js).

If you know which codepages you need, you can include individual scripts for each codepage. The individual files are provided in the bits/ directory. For example, to include only the Mac codepages:

<script src="bits/10000.js"></script>
<script src="bits/10006.js"></script>
<script src="bits/10007.js"></script>
<script src="bits/10029.js"></script>
<script src="bits/10079.js"></script>
<script src="bits/10081.js"></script>

All of the browser scripts define and append to the cptable object. To rename the object, edit the JSVAR shell variable in make.sh and run the script.

The utility functions are contained in cputils.js, which assumes that the appropriate codepage scripts have already been loaded.

Usage

The codepages are indexed by number. To get the Unicode character for a given codepoint, use the dec property:

var unicode_cp10000_255 = cptable[10000].dec[255]; // ˇ

To get the codepoint for a given character, use the enc property:

var cp10000_711 = cptable[10000].enc[String.fromCharCode(711)]; // 255
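The dec and enc properties are inverse lookups. The sketch below builds a miniature table to show that relationship; `makeTable` and the three-character table are hypothetical and say nothing about how the real tables are stored.

```javascript
// Hypothetical miniature table, for illustration only.
function makeTable(chars) {
  var dec = chars.split("");   // dec[byte] -> character
  var enc = {};                // enc[character] -> byte
  dec.forEach(function (ch, byte) { enc[ch] = byte; });
  return { dec: dec, enc: enc };
}

var tiny = makeTable("AB\u02c7");
console.log(tiny.dec[2]);        // ˇ
console.log(tiny.enc["\u02c7"]); // 2

// Every byte should survive a decode/encode round trip.
var roundTrips = tiny.dec.every(function (ch, byte) {
  return tiny.enc[ch] === byte;
});
console.log(roundTrips);         // true
```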

There are a few utilities that deal with strings and buffers:

var 汇总 = cptable.utils.decode(936, [0xbb,0xe3,0xd7,0xdc]);
var buf =  cptable.utils.encode(936,  汇总);
var sushi= cptable.utils.decode(65001, [0xf0,0x9f,0x8d,0xa3]); // 🍣
var sbuf = cptable.utils.encode(65001, sushi);

cptable.utils.encode(CP, data, ofmt) accepts a String or Array of characters and returns a representation controlled by ofmt:

By default, the output is a Buffer (or an Array if Buffer is unavailable) of bytes (integers between 0 and 255).
If ofmt == 'str', the output is a String o where o.charCodeAt(i) is the ith byte.
If ofmt == 'arr', the output is an Array of bytes.
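To make the three output shapes concrete, here is a small sketch that converts an array of bytes into each form. `formatBytes` is a hypothetical helper written for this example, not the library's implementation.

```javascript
// Hypothetical helper mirroring the three output shapes described above.
function formatBytes(bytes, ofmt) {
  if (ofmt === "str") {
    // One character per byte, so charCodeAt(i) returns the ith byte.
    return bytes.map(function (b) { return String.fromCharCode(b); }).join("");
  }
  if (ofmt === "arr") return bytes.slice();
  // Default: a Buffer where available, otherwise a plain Array.
  return typeof Buffer !== "undefined" ? Buffer.from(bytes) : bytes.slice();
}

var bytes = [0xbb, 0xe3, 0xd7, 0xdc];
var s = formatBytes(bytes, "str");
console.log(s.length, s.charCodeAt(1));           // 4 227
console.log(Array.isArray(formatBytes(bytes, "arr"))); // true
```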

Known Excel Codepages

A much smaller script, including only the codepages known to be used in Excel, is available under the name cpexcel. It exposes the same variable cptable and is suitable as a drop-in replacement when the full codepage tables are not needed.

In node:

var cptable = require('codepage/dist/cpexcel.full');

Building the script

This script uses voc. The script to build the codepage tables and the JS source is codepage.md, so building is as simple as voc codepage.md.

Generated Codepages

The complete list of hardcoded codepages can be found in the file pages.csv.

Some codepages are easier to implement algorithmically. Since these are hardcoded in utils, they have no corresponding entry in pages.csv and are marked "magic" in the table below.
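For a sense of why a lookup table is unnecessary for such codepages, the sketch below decodes codepage 1200 (UTF-16, little endian) with plain arithmetic. `decodeUtf16le` is a hypothetical function for illustration; the code in utils may be structured differently.

```javascript
// Illustrative decoder for codepage 1200 (UTF-16, little endian).
// Hypothetical; not taken from the library source.
function decodeUtf16le(bytes) {
  var out = "";
  for (var i = 0; i + 1 < bytes.length; i += 2) {
    // The low byte comes first, then the high byte.
    out += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return out;
}

console.log(decodeUtf16le([0x48, 0x00, 0x69, 0x00])); // Hi
console.log(decodeUtf16le([0xc7, 0x02]).charCodeAt(0)); // 711
```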

CP#	Information	Description
37	unicode.org	IBM EBCDIC US-Canada
437	unicode.org	OEM United States
500	unicode.org	IBM EBCDIC International
708	MakeEncoding.cs	Arabic (ASMO 708)
720	MakeEncoding.cs	Arabic (Transparent ASMO); Arabic (DOS)
737	unicode.org	OEM Greek (formerly 437G); Greek (DOS)
775	unicode.org	OEM Baltic; Baltic (DOS)
850	unicode.org	OEM Multilingual Latin 1; Western European (DOS)
852	unicode.org	OEM Latin 2; Central European (DOS)
855	unicode.org	OEM Cyrillic (primarily Russian)
857	unicode.org	OEM Turkish; Turkish (DOS)
858	MakeEncoding.cs	OEM Multilingual Latin 1 + Euro symbol
860	unicode.org	OEM Portuguese; Portuguese (DOS)
861	unicode.org	OEM Icelandic; Icelandic (DOS)
862	unicode.org	OEM Hebrew; Hebrew (DOS)
863	unicode.org	OEM French Canadian; French Canadian (DOS)
864	unicode.org	OEM Arabic; Arabic (864)
865	unicode.org	OEM Nordic; Nordic (DOS)
866	unicode.org	OEM Russian; Cyrillic (DOS)
869	unicode.org	OEM Modern Greek; Greek, Modern (DOS)
870	MakeEncoding.cs	IBM EBCDIC Multilingual/ROECE (Latin 2)
874	unicode.org	Windows Thai
875	unicode.org	IBM EBCDIC Greek Modern
932	unicode.org	Japanese Shift-JIS
936	unicode.org	Simplified Chinese GBK
949	unicode.org	Korean
950	unicode.org	Traditional Chinese Big5
1026	unicode.org	IBM EBCDIC Turkish (Latin 5)
1047	MakeEncoding.cs	IBM EBCDIC Latin 1/Open System
1140	MakeEncoding.cs	IBM EBCDIC US-Canada (037 + Euro symbol)
1141	MakeEncoding.cs	IBM EBCDIC Germany (20273 + Euro symbol)
1142	MakeEncoding.cs	IBM EBCDIC Denmark-Norway (20277 + Euro symbol)
1143	MakeEncoding.cs	IBM EBCDIC Finland-Sweden (20278 + Euro symbol)
1144	MakeEncoding.cs	IBM EBCDIC Italy (20280 + Euro symbol)
1145	MakeEncoding.cs	IBM EBCDIC Latin America-Spain (20284 + Euro symbol)
1146	MakeEncoding.cs	IBM EBCDIC United Kingdom (20285 + Euro symbol)
1147	MakeEncoding.cs	IBM EBCDIC France (20297 + Euro symbol)
1148	MakeEncoding.cs	IBM EBCDIC International (500 + Euro symbol)
1149	MakeEncoding.cs	IBM EBCDIC Icelandic (20871 + Euro symbol)
1200	magic	Unicode UTF-16, little endian (BMP of ISO 10646)
1201	magic	Unicode UTF-16, big endian
1250	unicode.org	Windows Central Europe
1251	unicode.org	Windows Cyrillic
1252	unicode.org	Windows Latin I
1253	unicode.org	Windows Greek
1254	unicode.org	Windows Turkish
1255	unicode.org	Windows Hebrew
1256	unicode.org	Windows Arabic
1257	unicode.org	Windows Baltic
1258	unicode.org	Windows Vietnam
1361	MakeEncoding.cs	Korean (Johab)
10000	unicode.org	MAC Roman
10001	MakeEncoding.cs	Japanese (Mac)
10002	MakeEncoding.cs	MAC Traditional Chinese (Big5)
10003	MakeEncoding.cs	Korean (Mac)
10004	MakeEncoding.cs	Arabic (Mac)
10005	MakeEncoding.cs	Hebrew (Mac)
10006	unicode.org	Greek (Mac)
10007	unicode.org	Cyrillic (Mac)
10008	MakeEncoding.cs	MAC Simplified Chinese (GB 2312)
10010	MakeEncoding.cs	Romanian (Mac)
10017	MakeEncoding.cs	Ukrainian (Mac)
10021	MakeEncoding.cs	Thai (Mac)
10029	unicode.org	MAC Latin 2 (Central European)
10079	unicode.org	Icelandic (Mac)
10081	unicode.org	Turkish (Mac)
10082	MakeEncoding.cs	Croatian (Mac)
12000	magic	Unicode UTF-32, little endian byte order
12001	magic	Unicode UTF-32, big endian byte order
20000	MakeEncoding.cs	CNS Taiwan (Chinese Traditional)
20001	MakeEncoding.cs	TCA Taiwan
20002	MakeEncoding.cs	Eten Taiwan (Chinese Traditional)
20003	MakeEncoding.cs	IBM5550 Taiwan
20004	MakeEncoding.cs	TeleText Taiwan
20005	MakeEncoding.cs	Wang Taiwan
20105	MakeEncoding.cs	Western European IA5 (IRV International Alphabet 5) 7-bit
20106	MakeEncoding.cs	IA5 German (7-bit)
20107	MakeEncoding.cs	IA5 Swedish (7-bit)
20108	MakeEncoding.cs	IA5 Norwegian (7-bit)
20127	magic	US-ASCII (7-bit)
20261	MakeEncoding.cs	T.61
20269	MakeEncoding.cs	ISO 6937 Non-Spacing Accent
20273	MakeEncoding.cs	IBM EBCDIC Germany
20277	MakeEncoding.cs	IBM EBCDIC Denmark-Norway
20278	MakeEncoding.cs	IBM EBCDIC Finland-Sweden
20280	MakeEncoding.cs	IBM EBCDIC Italy
20284	MakeEncoding.cs	IBM EBCDIC Latin America-Spain
20285	MakeEncoding.cs	IBM EBCDIC United Kingdom
20290	MakeEncoding.cs	IBM EBCDIC Japanese Katakana Extended
20297	MakeEncoding.cs	IBM EBCDIC France
20420	MakeEncoding.cs	IBM EBCDIC Arabic
20423	MakeEncoding.cs	IBM EBCDIC Greek
20424	MakeEncoding.cs	IBM EBCDIC Hebrew
20833	MakeEncoding.cs	IBM EBCDIC Korean Extended
20838	MakeEncoding.cs	IBM EBCDIC Thai
20866	MakeEncoding.cs	Russian Cyrillic (KOI8-R)
20871	MakeEncoding.cs	IBM EBCDIC Icelandic
20880	MakeEncoding.cs	IBM EBCDIC Cyrillic Russian
20905	MakeEncoding.cs	IBM EBCDIC Turkish
20924	MakeEncoding.cs	IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
20932	MakeEncoding.cs	Japanese (JIS 0208-1990 and 0212-1990)
20936	MakeEncoding.cs	Simplified Chinese (GB2312-80)
20949	MakeEncoding.cs	Korean Wansung
21025	MakeEncoding.cs	IBM EBCDIC Cyrillic Serbian-Bulgarian
21866	MakeEncoding.cs	Ukrainian Cyrillic (KOI8-U)
28591	unicode.org	ISO 8859-1 Latin 1 (Western European)
28592	unicode.org	ISO 8859-2 Latin 2 (Central European)
28593	unicode.org	ISO 8859-3 Latin 3
28594	unicode.org	ISO 8859-4 Baltic
28595	unicode.org	ISO 8859-5 Cyrillic
28596	unicode.org	ISO 8859-6 Arabic
28597	unicode.org	ISO 8859-7 Greek
28598	unicode.org	ISO 8859-8 Hebrew (ISO-Visual)
28599	unicode.org	ISO 8859-9 Turkish
28600	unicode.org	ISO 8859-10 Latin 6
28601	unicode.org	ISO 8859-11 Latin (Thai)
28603	unicode.org	ISO 8859-13 Latin 7 (Estonian)
28604	unicode.org	ISO 8859-14 Latin 8 (Celtic)
28605	unicode.org	ISO 8859-15 Latin 9
28606	unicode.org	ISO 8859-16 Latin 10
29001	MakeEncoding.cs	Europa 3
38598	MakeEncoding.cs	ISO 8859-8 Hebrew (ISO-Logical)
50220	MakeEncoding.cs	ISO 2022 JIS Japanese with no halfwidth Katakana
50221	MakeEncoding.cs	ISO 2022 JIS Japanese with halfwidth Katakana
50222	MakeEncoding.cs	ISO 2022 Japanese JIS X 0201-1989 (1 byte Kana-SO/SI)
50225	MakeEncoding.cs	ISO 2022 Korean
50227	MakeEncoding.cs	ISO 2022 Simplified Chinese
51932	MakeEncoding.cs	EUC Japanese
51936	MakeEncoding.cs	EUC Simplified Chinese
51949	MakeEncoding.cs	EUC Korean
52936	MakeEncoding.cs	HZ-GB2312 Simplified Chinese
54936	MakeEncoding.cs	GB18030 Simplified Chinese (4 byte)
57002	MakeEncoding.cs	ISCII Devanagari
57003	MakeEncoding.cs	ISCII Bengali
57004	MakeEncoding.cs	ISCII Tamil
57005	MakeEncoding.cs	ISCII Telugu
57006	MakeEncoding.cs	ISCII Assamese
57007	MakeEncoding.cs	ISCII Oriya
57008	MakeEncoding.cs	ISCII Kannada
57009	MakeEncoding.cs	ISCII Malayalam
57010	MakeEncoding.cs	ISCII Gujarati
57011	MakeEncoding.cs	ISCII Punjabi
65000	magic	Unicode (UTF-7)
65001	magic	Unicode (UTF-8)

Note that MakeEncoding.cs deviates from unicode.org for some codepages. In the case of direct conflicts, unicode.org takes precedence. Where the unicode.org listing does not prescribe a value, the MakeEncoding.cs value is used.
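The precedence rule can be pictured as a two-step merge. In the sketch below, `mergeMappings` and both sample maps are hypothetical, and the values are invented to show one conflict and one gap rather than copied from any real codepage.

```javascript
// Hypothetical merge of two byte -> code point maps for a single codepage.
function mergeMappings(unicodeOrg, makeEncoding) {
  var merged = {};
  // Start from the MakeEncoding.cs values...
  Object.keys(makeEncoding).forEach(function (k) { merged[k] = makeEncoding[k]; });
  // ...then let unicode.org override wherever it prescribes a value.
  Object.keys(unicodeOrg).forEach(function (k) { merged[k] = unicodeOrg[k]; });
  return merged;
}

// Invented sample data: byte 0x80 conflicts, byte 0x81 is only in one source.
var fromUnicodeOrg = { 0x80: 0x20ac };
var fromMakeEncoding = { 0x80: 0x0080, 0x81: 0x0081 };

var merged = mergeMappings(fromUnicodeOrg, fromMakeEncoding);
console.log(merged[0x80].toString(16)); // 20ac
console.log(merged[0x81].toString(16)); // 81
```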

Missing Codepages

The following codepages are not implemented. Normative references may not be available in all cases. Furthermore, other software packages are known to hack certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic ISO-8859-6 when in fact there are many differences), so all implementations should be cleanroom when possible.

709 Arabic (ASMO-449+, BCON V4)
710 Arabic - Transparent Arabic
21027 (deprecated) <-- is this necessary?
50229 ISO 2022 Traditional Chinese
50930 EBCDIC Japanese (Katakana) Extended
50931 EBCDIC US-Canada and Japanese
50933 EBCDIC Korean Extended and Korean
50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
50936 EBCDIC Simplified Chinese
50937 EBCDIC US-Canada and Traditional Chinese
50939 EBCDIC Japanese (Latin) Extended and Japanese
51950 EUC Traditional Chinese
