js-codepage/NOTES.md

2014-10-19 02:39:39 +00:00
# Verifying Codepages
After installing every language pack in Windows 7, many codepages are available
via the .NET `System.Text.Encoding` class. The included `MakeEncoding.cs` program
generates a full manifest that can be parsed into a mapping table.
The included `nls2tbl` script extracts data from the various `C_#####.NLS` files
available in the system or system32 directories in various versions of Windows.
Many codepages are also available in various iconv libraries, but there are some
differences. For example, some codepages use the Arabic percent sign ٪ U+066A
instead of the standard ASCII "%".
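
One way to spot such differences is to compare two byte-to-code-point tables
entry by entry. The sketch below is only illustrative: `diffTables`, the table
format (plain objects keyed by byte value), and the tiny sample tables are
assumptions, not part of this repository or actual output of `nls2tbl`.

```javascript
// Hypothetical helper: compare two byte -> Unicode code point tables.
// Tables are plain objects keyed by byte value (an assumed format).
function diffTables(a, b) {
  const diffs = [];
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  for (const key of [...keys].map(Number).sort((x, y) => x - y)) {
    if (a[key] !== b[key]) {
      diffs.push({ byte: key, a: a[key], b: b[key] });
    }
  }
  return diffs;
}

// Toy tables (not real codepage data): they differ only at byte 0x25.
const fromNls = { 0x25: 0x066A, 0x41: 0x0041 };   // Arabic percent sign
const fromIconv = { 0x25: 0x0025, 0x41: 0x0041 }; // ASCII percent sign

// Format a code point as U+XXXX, or flag a byte one table does not map.
const fmt = (cp) =>
  cp === undefined
    ? "(unmapped)"
    : "U+" + cp.toString(16).toUpperCase().padStart(4, "0");

for (const d of diffTables(fromNls, fromIconv)) {
  console.log("0x" + d.byte.toString(16), fmt(d.a), "vs", fmt(d.b));
}
```

Reporting every differing byte, rather than stopping at the first mismatch,
makes it easier to tell a systematic substitution from a one-off discrepancy.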
## Extended Characters
No known codepage uses characters from the Supplementary Multilingual Plane
(SMP), so the code paths that handle them are never tested. Test coverage will
therefore not reach 100%.
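
To illustrate which code paths are affected: in JavaScript strings, code points
above U+FFFF (the supplementary planes, which include the SMP) are stored as
UTF-16 surrogate pairs, so a decoder needs a separate branch for them. The
following is a minimal sketch; `usesSupplementaryPlanes`, `toUtf16Units`, and
the toy table are hypothetical and not taken from this repository.

```javascript
// Hypothetical check: does any entry in a byte -> code point table
// lie outside the Basic Multilingual Plane (i.e. above U+FFFF)?
function usesSupplementaryPlanes(table) {
  return Object.values(table).some((cp) => cp > 0xFFFF);
}

// Return the UTF-16 code units JavaScript uses to store a code point.
// Code points above U+FFFF need two units (a surrogate pair).
function toUtf16Units(cp) {
  const s = String.fromCodePoint(cp);
  return Array.from({ length: s.length }, (_, i) => s.charCodeAt(i));
}

const toyTable = { 0x41: 0x0041, 0x80: 0x20AC }; // toy data, BMP only
console.log(usesSupplementaryPlanes(toyTable));   // false
console.log(toUtf16Units(0x20AC).length);         // 1
console.log(toUtf16Units(0x1D11E).length);        // 2
```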
# Missing Codepages
The following codepages are not implemented. Normative references may not be
available in all cases. Furthermore, other software packages are known to alter
certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic
ISO-8859-6 when in fact there are many differences), so all implementations
*should* be clean-room when possible.
- 709 Arabic (ASMO-449+, BCON V4)
- 710 Arabic - Transparent Arabic
- 50229 ISO 2022 Traditional Chinese
- 50930 EBCDIC Japanese (Katakana) Extended
- 50931 EBCDIC US-Canada and Japanese
- 50933 EBCDIC Korean Extended and Korean
- 50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
- 50936 EBCDIC Simplified Chinese
- 50937 EBCDIC US-Canada and Traditional Chinese
- 50939 EBCDIC Japanese (Latin) Extended and Japanese
- 51950 EUC Traditional Chinese

Each version of Windows adds a few and removes a few codepages, so the missing
codepages most likely reside in a specific version that we may not be able to
obtain. These notes document our progress.
## Arabic codepages 709-710
These codepages are not available in the Arabic version of Windows XP. They may
be available in the Arabic versions of MS-DOS or Windows 3.1/95/98/2000.
The "Code Page and Text Layout Conversion Utility" CONVTEXT.EXE ships with some
versions of Office. It can convert from the various codepages to ANSI.
To produce a UTF-16LE (1200) manifest, convert from the relevant codepage to ANSI
and then convert from ANSI to "Unicode using Arabic ANSI Code Page".
Since there is no way to convert directly to Unicode using the tool, CONVTEXT is
useful only for the characters which exist in both the relevant codepage and
codepage 1256. Various non-Microsoft sources claim to document both codepages,
but there is no way to verify those claims.
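
The limitation can be seen by treating the two-step conversion as a composition
of tables. The sketch below is only illustrative: `composeTables` is a
hypothetical helper, and the byte values are placeholders rather than verified
mappings from codepage 709, 710, or 1256.

```javascript
// Hypothetical helper: chain "source codepage -> ANSI byte" with
// "ANSI byte -> Unicode". Source bytes with no ANSI equivalent are lost.
function composeTables(srcToAnsi, ansiToUnicode) {
  const mapped = {};
  const unmapped = [];
  for (const [srcByte, ansiByte] of Object.entries(srcToAnsi)) {
    const cp = ansiByte === null ? undefined : ansiToUnicode[ansiByte];
    if (cp === undefined) {
      unmapped.push(Number(srcByte)); // cannot be recovered via ANSI
    } else {
      mapped[srcByte] = cp;
    }
  }
  return { mapped, unmapped };
}

// Placeholder data only; null marks a source byte with no ANSI equivalent.
const srcToAnsi = { 0xA1: 0xC7, 0xA2: null };
const ansiToUnicode = { 0xC7: 0x0627 };

const result = composeTables(srcToAnsi, ansiToUnicode);
console.log(result.mapped);
console.log(result.unmapped);
```

Any byte that ends up in `unmapped` would have to be filled in from another
source, which is exactly the part that cannot be verified with CONVTEXT alone.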
## EUC Traditional Chinese 51950
The raw NLS file `C_51950.NLS` supposedly exists, although it cannot be obtained
from a US version of Windows. As with the Arabic codepages, the manifest is most
likely available only in Chinese versions of Windows 95/98/2000.
## ISO 2022 Traditional Chinese 50229
Some sources claim 50229 is ISO-2022-TW and others claim it is ISO-2022-CN.
## EBCDIC Codepages 50930-50939
WHATWG claims that these supposedly EBCDIC codepages are really ASCII hybrids,
even though the Microsoft names suggest they should be the same as the original
EBCDIC codepages.