ISO 2022 JIS Japanese encoding fails #17

Open
opened 2019-11-01 16:20:06 +00:00 by n1474335 · 2 comments
n1474335 commented 2019-11-01 16:20:06 +00:00 (Migrated from github.com)

Hi, thanks very much for your work on this repository, it's incredibly useful. We use it as the main character encoding library for CyberChef (https://github.com/gchq/CyberChef).

We've recently noticed an issue when trying to encode into ISO 2022 JIS Japanese where only null bytes are returned.

The affected CP numbers are 50220, 50221 and 50222.

Example code

import cptable from "codepage";

cptable.utils.encode(50220, "こんにちは");

Expected output

Uint8Array(10) [164, 179, 164, 243, 164, 203, 164, 193, 164, 207]

Actual output

Uint8Array(5) [0, 0, 0, 0, 0]

Can you shed any light on this behaviour?
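
One observation on the expected output above, offered as a sketch rather than a statement about the library: every byte listed is above 0x7F, whereas ISO-2022-JP is a 7-bit encoding. Clearing the high bit of each byte gives the two-byte JIS X 0208 codes that a 7-bit stream would carry between escape sequences. The helper name `clearHighBits` is hypothetical and is not part of `codepage`.

```javascript
// Hypothetical helper (not part of the codepage library):
// clears the high bit of every byte in the input.
function clearHighBits(bytes) {
  return Uint8Array.from(bytes, (b) => b & 0x7f);
}

// The "expected output" bytes from the report above.
const expected = [164, 179, 164, 243, 164, 203, 164, 193, 164, 207];
const sevenBit = clearHighBits(expected);

// Print as hex pairs for comparison with JIS X 0208 code charts.
console.log(
  Array.from(sevenBit, (b) => b.toString(16).toUpperCase()).join(" ")
);
// prints "24 33 24 73 24 4B 24 41 24 4F"
```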

n1474335 commented 2019-11-01 16:48:38 +00:00 (Migrated from github.com)

Another example that also fails:

Code

import cptable from "codepage";

cptable.utils.encode(50220, "ーム");

Expected output

Uint8Array(10) [27, 36, 66, 33, 60, 37, 96, 27, 40, 66]

Actual output

Uint8Array(2) [0, 0]
SheetJSDev commented 2019-11-01 17:56:09 +00:00 (Migrated from github.com)

Thanks for sharing! The ISO 2022 codepages 5022{0,1,2,5,7} are definitely incorrect: hiragana require a control sequence, and control sequences are not currently supported. Based on ECMA-35 (http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-035.pdf), the first kana "こ" should be encoded as 1B 24 42 24 33 (1B 24 42 to switch to the JIS double-byte encoding, 24 for the Hiragana subset and 33 for the actual character). This will require a direct implementation of control sequences and a new set of LUTs for the various character subsets.
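
A minimal sketch of the control-sequence handling described above, assuming only the two escape sequences that appear in this thread (1B 24 42 to enter the double-byte set, 1B 28 42 to return to ASCII). The function `encodeIso2022JpSketch` and the table `JIS_X_0208_SUBSET` are hypothetical names; the table is hand-written and covers only the characters from the two reports, so this is an illustration and not the library's implementation.

```javascript
// Hypothetical subset of JIS X 0208 (character -> two-byte code).
// A real encoder would need the complete table.
const JIS_X_0208_SUBSET = new Map([
  ["こ", 0x2433],
  ["ん", 0x2473],
  ["に", 0x244b],
  ["ち", 0x2441],
  ["は", 0x244f],
  ["ー", 0x213c],
  ["ム", 0x2560],
]);

const ESC_TO_JIS = [0x1b, 0x24, 0x42]; // ESC $ B
const ESC_TO_ASCII = [0x1b, 0x28, 0x42]; // ESC ( B

function encodeIso2022JpSketch(text) {
  const out = [];
  let inJis = false; // tracks which character set is currently selected

  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      // ASCII: switch back first if the double-byte set is active.
      if (inJis) {
        out.push(...ESC_TO_ASCII);
        inJis = false;
      }
      out.push(cp);
    } else {
      const code = JIS_X_0208_SUBSET.get(ch);
      if (code === undefined) {
        throw new Error("Character not in the illustrative table: " + ch);
      }
      // Double-byte: emit the escape sequence only on a mode change.
      if (!inJis) {
        out.push(...ESC_TO_JIS);
        inJis = true;
      }
      out.push(code >> 8, code & 0xff);
    }
  }

  // End the stream in ASCII mode.
  if (inJis) out.push(...ESC_TO_ASCII);
  return Uint8Array.from(out);
}

console.log(encodeIso2022JpSketch("ーム"));
// bytes: 27, 36, 66, 33, 60, 37, 96, 27, 40, 66
```

For "ーム" this reproduces the expected output from the second report. The escape sequence is emitted once per run of double-byte characters rather than once per character, which is why a mode flag is tracked.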

PS: All of the generated codepages (https://github.com/SheetJS/js-codepage#generated-codepages) with source listed as "Windows 7" are assumed to be either single-byte or double-byte. Clearly that wasn't the case here.

Reference: sheetjs/js-codepage#17