Problem with the encoding of Cyrillic characters #912

Closed
opened 2017-12-04 19:57:19 +00:00 by makcbrain · 6 comments
makcbrain commented 2017-12-04 19:57:19 +00:00 (Migrated from github.com)

There are problems with the encoding of Cyrillic characters in some files. The sample file I uploaded to google drive: file xls. This file opens correctly in Excel 16.0.
The result of the conversion to csv from the page http://oss.sheetjs.com/js-xlsx/

Îñòàòêè ÒÌÖ íà ñêëàäàõ,,,,,,,,,,,,,,,,,,,,,,,,,,
Íà äàòó: 04.12.17,,,,,,,,,,,,,,,,,,,,,,,,,,
"Ïî íîìåíêëàòóðíûì ïîçèöèÿì èç ñïèñêà (""Àâòîøèíû"").",,,,,,,,,,,,,,,,,,,,,,,,,,
Íîìåíêëàòóðà,,,Åä.,Ñêëàä ã. ×åëÿáèíñê,,Ñëîáîäñêîé ïåð. 45,,,,,,,,,,,,,,,,,,,,
,,,,öåíà À,Ñâîáîäíûé,öåíà À,Ñâîáîäíûé,,,,,,,,,,,,,,,,,,,
51344,-----,16.5/70-18 TT ÂØÇ ÊÔ-97 íñ10 ñ îá.ëåíòîé,êîìïë, ,0,"14,008.00",>12,,,,,,,,,,,,,,,,,,,

Can it be fixed by some manipulation:

const iconv = require('iconv-lite');
console.log(iconv.decode(iconv.encode('Îñòàòêè ÒÌÖ íà ñêëàäàõ,,,,,,,,,,,,,,,,,,,,,,,,,,', 'cp1252'), 'cp1251'));
//display "Остатки ТМЦ на складах,,,,,,,,,,,,,,,,,,,,,,,,,,"
There are problems with the encoding of Cyrillic characters in some files. The sample file I uploaded to google drive: [file xls](https://drive.google.com/file/d/1V6FAXtd7i4M3WFbu8GCecvvtii5m2Wo0/view?usp=sharing). This file opens correctly in Excel 16.0. The result of the conversion to csv from the page http://oss.sheetjs.com/js-xlsx/ ``` Îñòàòêè ÒÌÖ íà ñêëàäàõ,,,,,,,,,,,,,,,,,,,,,,,,,, Íà äàòó: 04.12.17,,,,,,,,,,,,,,,,,,,,,,,,,, "Ïî íîìåíêëàòóðíûì ïîçèöèÿì èç ñïèñêà (""Àâòîøèíû"").",,,,,,,,,,,,,,,,,,,,,,,,,, Íîìåíêëàòóðà,,,Åä.,Ñêëàä ã. ×åëÿáèíñê,,Ñëîáîäñêîé ïåð. 45,,,,,,,,,,,,,,,,,,,, ,,,,öåíà À,Ñâîáîäíûé,öåíà À,Ñâîáîäíûé,,,,,,,,,,,,,,,,,,, 51344,-----,16.5/70-18 TT ÂØÇ ÊÔ-97 íñ10 ñ îá.ëåíòîé,êîìïë, ,0,"14,008.00",>12,,,,,,,,,,,,,,,,,,, ``` Can it be fixed by some manipulation: ```javascript const iconv = require('iconv-lite'); console.log(iconv.decode(iconv.encode('Îñòàòêè ÒÌÖ íà ñêëàäàõ,,,,,,,,,,,,,,,,,,,,,,,,,,', 'cp1252'), 'cp1251')); //display "Остатки ТМЦ на складах,,,,,,,,,,,,,,,,,,,,,,,,,," ```
SheetJSDev commented 2017-12-04 21:20:34 +00:00 (Migrated from github.com)

@makcbrain thanks for sharing! This is a BIFF5 XLS (Excel 5.0/95) file with no CodePage record, so there's no way to inspect the file and determine the correct encoding. To see that this is a file ambiguity, try opening this in Excel 2016 for Mac and you'll see different content corresponding to the default Mac Roman codepage 10000:

That string does correspond to the original set of bytes, as you can verify manually:

var cptable = require('codepage'); 
cptable.utils.decode(1251, cptable.utils.encode(1252, "Îñòàòêè ÒÌÖ íà ñêëàäàõ"));
// 'Остатки ТМЦ на складах'
cptable.utils.decode(1251, cptable.utils.encode(10000, "ŒÒÚ‡ÚÍË “Ã÷ ̇ ÒÍ·‰‡ı"));
// 'Остатки ТМЦ на складах'

Just as discussed in #907 the final solution will involve adding a default codepage option to the read functions (e.g. XLSX.readFile("file.xls", {codepage:1251}))

@makcbrain thanks for sharing! This is a BIFF5 XLS (Excel 5.0/95) file with no CodePage record, so there's no way to inspect the file and determine the correct encoding. To see that this is a file ambiguity, try opening this in Excel 2016 for Mac and you'll see different content corresponding to the [default Mac Roman codepage 10000](https://github.com/sheetjs/js-codepage#generated-codepages): <img width="382" alt="" src="https://user-images.githubusercontent.com/6070939/33575998-713427aa-d90c-11e7-9753-629cabe8c849.png"> That string does correspond to the original set of bytes, as you can verify manually: ```js var cptable = require('codepage'); cptable.utils.decode(1251, cptable.utils.encode(1252, "Îñòàòêè ÒÌÖ íà ñêëàäàõ")); // 'Остатки ТМЦ на складах' cptable.utils.decode(1251, cptable.utils.encode(10000, "ŒÒÚ‡ÚÍË “Ã÷ ̇ ÒÍ·‰‡ı")); // 'Остатки ТМЦ на складах' ``` Just as discussed in #907 the final solution will involve adding a default codepage option to the read functions (e.g. `XLSX.readFile("file.xls", {codepage:1251})`)
makcbrain commented 2017-12-05 07:07:15 +00:00 (Migrated from github.com)

This option ({codepage: 1251}) does not change anything. What could be wrong?

This option ({codepage: 1251}) does not change anything. What could be wrong?
makcbrain commented 2017-12-07 09:39:54 +00:00 (Migrated from github.com)

Tell me please how to set the default encoding when loading a file. I will be very grateful for the answer.

Tell me please how to set the default encoding when loading a file. I will be very grateful for the answer.
SheetJSDev commented 2017-12-09 07:28:11 +00:00 (Migrated from github.com)

The option codepage:1251 should work in version 0.11.13. It's also exposed as the --codepage flag in the xlsx bin script:

$ xlsx --codepage 1252 ~/Downloads/1252.xls | head -n 2
Sheet1
,,,,,,,,,,,,,,,,,,,,,,,,,,
Îñòàòêè ÒÌÖ íà ñêëàäàõ,,,,,,,,,,,,,,,,,,,,,,,,,,
$ xlsx --codepage 1251 ~/Downloads/1252.xls | head -n 2
Sheet1
,,,,,,,,,,,,,,,,,,,,,,,,,,
Остатки ТМЦ на складах,,,,,,,,,,,,,,,,,,,,,,,,,,
The option `codepage:1251` should work in version 0.11.13. It's also exposed as the `--codepage` flag in the `xlsx` bin script: ``` $ xlsx --codepage 1252 ~/Downloads/1252.xls | head -n 2 Sheet1 ,,,,,,,,,,,,,,,,,,,,,,,,,, Îñòàòêè ÒÌÖ íà ñêëàäàõ,,,,,,,,,,,,,,,,,,,,,,,,,, $ xlsx --codepage 1251 ~/Downloads/1252.xls | head -n 2 Sheet1 ,,,,,,,,,,,,,,,,,,,,,,,,,, Остатки ТМЦ на складах,,,,,,,,,,,,,,,,,,,,,,,,,, ```
RockMother commented 2020-10-22 09:38:48 +00:00 (Migrated from github.com)

Tell me please how to set the default encoding when loading a file. I will be very grateful for the answer.

Hi, man. Did you find the solution?

> Tell me please how to set the default encoding when loading a file. I will be very grateful for the answer. Hi, man. Did you find the solution?
SheetJSDev commented 2020-10-22 09:41:38 +00:00 (Migrated from github.com)

pass the codepage option to read or readFile https://github.com/SheetJS/sheetjs/#parsing-options .

pass the `codepage` option to `read` or `readFile` https://github.com/SheetJS/sheetjs/#parsing-options .
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#912
No description provided.