2014-10-19 02:39:39 +00:00
|
|
|
# Verifying Codepages
|
|
|
|
|
|
|
|
After installing every language pack in Windows 7, many codepages are available
|
2016-09-22 17:42:47 +00:00
|
|
|
via the .NET System.Text.Encoding class. The included MakeEncoding.cs program
|
|
|
|
generates a full manifest that can be parsed into a mapping table.
|
2014-10-19 02:39:39 +00:00
|
|
|
|
|
|
|
The included `nls2tbl` script extracts data from the various `C_#####.NLS` files
|
|
|
|
available in the system or system32 directories in various versions of Windows.
|
|
|
|
|
|
|
|
Many codepages are also available in various iconv libraries, but there are some
|
2016-09-22 17:42:47 +00:00
|
|
|
differences. For example, some codepages use the Arabic percent sign ٪ U+066A
|
|
|
|
instead of the standard ASCII "%".
|
|
|
|
|
|
|
|
## Extended Characters
|
|
|
|
|
|
|
|
No known codepage uses characters from the SMP, so certain code paths are never
|
|
|
|
tested. The coverage will not be 100%
|
2014-10-19 02:39:39 +00:00
|
|
|
|
|
|
|
# Missing Codepages
|
|
|
|
|
|
|
|
The following codepages are not implemented. Normative references may not be
|
|
|
|
available in all cases. Furthermore, other software packages are known to hack
|
|
|
|
certain codepages (for example, Mozilla treats ASMO-708 as an alias of Arabic
|
|
|
|
ISO-8869-6 when in fact there are many differences), so all implementations
|
|
|
|
*should* be cleanroom when possible.
|
|
|
|
|
|
|
|
- 709 Arabic (ASMO-449+, BCON V4)
|
|
|
|
- 710 Arabic - Transparent Arabic
|
|
|
|
- 50229 ISO 2022 Traditional Chinese
|
|
|
|
- 50930 EBCDIC Japanese (Katakana) Extended
|
|
|
|
- 50931 EBCDIC US-Canada and Japanese
|
|
|
|
- 50933 EBCDIC Korean Extended and Korean
|
|
|
|
- 50935 EBCDIC Simplified Chinese Extended and Simplified Chinese
|
|
|
|
- 50936 EBCDIC Simplified Chinese
|
|
|
|
- 50937 EBCDIC US-Canada and Traditional Chinese
|
|
|
|
- 50939 EBCDIC Japanese (Latin) Extended and Japanese
|
|
|
|
- 51950 EUC Traditional Chinese
|
|
|
|
|
|
|
|
Each version of Windows adds a few and removes a few codepages, so the missing
|
|
|
|
codepages most likely reside in a specific version that we may not be able to
|
|
|
|
obtain. These notes document our progress.
|
|
|
|
|
|
|
|
## Arabic codepages 709-710
|
|
|
|
|
|
|
|
These codepages are not available in the Arabic version of Windows XP. They may
|
|
|
|
be available in the Arabic versions of MS-DOS or Windows 3.1/95/98/2000.
|
|
|
|
|
|
|
|
The "Code Page and Text Layout Conversion Utility" CONVTEXT.EXE ships with some
|
|
|
|
versions of Office. It can convert from the various codepages to ANSI.
|
|
|
|
|
|
|
|
To produce a UTF16LE (1200) manifest, convert from the relevant codepage to ANSI
|
|
|
|
and then convert from ANSI to "Unicode using Arabic ANSI Code Page".
|
|
|
|
|
|
|
|
Since there is no way to convert directly to unicode using the tool, CONVTEXT is
|
|
|
|
useful only for the characters which exist in both the relevant codepage and in
|
|
|
|
codepage 1256. There are various non-Microsoft sources which claim to document
|
|
|
|
both codepages, but there is no way to verify the claim.
|
|
|
|
|
|
|
|
## EUC Traditional Chinese 51950
|
|
|
|
|
|
|
|
The raw NLS file C_51950.NLS supposedly exists, although there is no way for a US
|
|
|
|
version of Windows to obtain the file. As with the Arabic Codepages, most likely
|
|
|
|
the manifest is only available in Chinese versions of Windows 95/98/2000
|
|
|
|
|
|
|
|
### ISO 2022 Traditional Chinese 50229
|
|
|
|
|
|
|
|
Some sources claim 50229 is ISO-2022-TW and others claim it is ISO-2022-CN.
|
|
|
|
|
|
|
|
### EBCDIC Codepages 50930-50939
|
|
|
|
|
|
|
|
WHATWG claims that the supposed-EBCDIC codepages are really hybrids of ASCII (even
|
|
|
|
though the Microsoft name suggests they should be the same as the originals)
|