js-codepage/NOTES.md
SheetJS 5aacbbf522 version bump 1.5.0
new codepages:

- 808   OEM Russian; Cyrillic + Euro symbol
- 872   OEM Cyrillic (primarily Russian) + Euro Symbol
- 1010  IBM EBCDIC French
- 1132  IBM EBCDIC Lao (1132 / 1133 / 1341)
- 47451 Atari ST/TT

other changes:

- updated travis versions for test
- miscellaneous adjustments to tooling
2016-09-22 13:42:47 -04:00

3.1 KiB

Verifying Codepages

After installing every language pack in Windows 7, many codepages are available via the .NET System.Text.Encoding class. The included MakeEncoding.cs program generates a full manifest that can be parsed into a mapping table.

The included nls2tbl script extracts data from the various C_#####.NLS files available in the system or system32 directories in various versions of Windows.

Many codepages are also available in various iconv libraries, but there are some differences. For example, some codepages use the Arabic percent sign ٪ U+066A instead of the standard ASCII "%".

Extended Characters

No known codepage uses characters from the SMP, so certain code paths are never tested. The coverage will not be 100%

Missing Codepages

The following codepages are not implemented. Normative references may not be available in all cases. Furthermore, other software packages are known to hack certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic ISO-8869-6 when in fact there are many differences), so all implementations should be cleanroom when possible.

  • 709 Arabic (ASMO-449+, BCON V4)
  • 710 Arabic - Transparent Arabic
  • 50229 ISO 2022 Traditional Chinese
  • 50930 EBCDIC Japanese (Katakana) Extended
  • 50931 EBCDIC US-Canada and Japanese
  • 50933 EBCDIC Korean Extended and Korean
  • 50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
  • 50936 EBCDIC Simplified Chinese
  • 50937 EBCDIC US-Canada and Traditional Chinese
  • 50939 EBCDIC Japanese (Latin) Extended and Japanese
  • 51950 EUC Traditional Chinese

Each version of Windows adds a few and removes a few codepages, so the missing codepages most likely reside in a specific version that we may not be able to obtain. These notes document our progress.

Arabic codepages 709-710

These codepages are not available in the Arabic version of Windows XP. They may be available in the Arabic versions of MS-DOS or Windows 3.1/95/98/2000.

The "Code Page and Text Layout Conversion Utility" CONVTEXT.EXE ships with some versions of Office. It can convert from the various codepages to ANSI.

To produce a UTF16LE (1200) manifest, convert from the relevant codepage to ANSI and then convert from ANSI to "Unicode using Arabic ANSI Code Page".

Since there is no way to convert directly to unicode using the tool, CONVTEXT is useful only for the characters which exist in both the relevant codepage and in codepage 1256. There are various non-Microsoft sources which claim to document both codepages, but there is no way to verify the claim.

EUC Traditional Chinese 51950

The raw NLS file C_51950.NLS supposedly exists, although there is no way for a US version of Windows to obtain the file. As with the Arabic Codepages, most likely the manifest is only available in Chinese versions of Windows 95/98/2000

ISO 2022 Traditional Chinese 50229

Some sources claim 50229 is ISO-2022-TW and others claim it is ISO-2022-CN.

EBCDIC Codepages 50930-50939

WHATWG claims that the supposed-EBCDIC codepages are really hybrids of ASCII (even though the Microsoft name suggests they should be the same as the originals)