Problem with reading cyrillic CSV without BOM (ANSI as UTF-8) #907

Closed
opened 2017-11-27 13:09:59 +00:00 by popovserhii · 5 comments
popovserhii commented 2017-11-27 13:09:59 +00:00 (Migrated from github.com)

I stumble over the problem with reading CSV without BOM on Windows 10 (not tested on Unix).
My script generate CSV file in next order

let path = 'path/for/save.csv';
let file = fs.createWriteStream(path, {
  flags: 'w',
  encoding: 'utf8',
  mode: '0744'
});

let csv = '6324321;case HEREVIN STORE BOX /прямоуг./1638 мл;УЦЕНКА ПОСУДА';
file.write(csv + '\n');

The file can be opened well in Notepad++ but in right-bottom corner is writen "ANSI as UTF-8".

When I try to handle this file with XLSX it returns Cyrillic in bad encoding.

let wb = XLSX.readFile(path);
// handle worksheet with this script https://stackoverflow.com/a/31083368/1335142

and when I get value in debugger it returns

{
  "sku": 6324321,
  "name": "case HEREVIN STORE BOX /прямоуг./1638 мл",
  "category": "УЦЕНКА ПОСУДА"
}

but when I change this line file.write(csv + '\n'); to file.write('\ufeff' + csv + '\n'); all work as aspected.

'\ufeff' means add BOM to the file (taken here https://stackoverflow.com/a/13859239/1335142)

I stumble over the problem with reading CSV without BOM on Windows 10 (not tested on Unix). My script generate CSV file in next order ```js let path = 'path/for/save.csv'; let file = fs.createWriteStream(path, { flags: 'w', encoding: 'utf8', mode: '0744' }); let csv = '6324321;case HEREVIN STORE BOX /прямоуг./1638 мл;УЦЕНКА ПОСУДА'; file.write(csv + '\n'); ``` The file can be opened well in Notepad++ but in right-bottom corner is writen "ANSI as UTF-8". When I try to handle this file with XLSX it returns Cyrillic in bad encoding. ```js let wb = XLSX.readFile(path); // handle worksheet with this script https://stackoverflow.com/a/31083368/1335142 ``` and when I get value in debugger it returns ```json { "sku": 6324321, "name": "case HEREVIN STORE BOX /прямоуг./1638 мл", "category": "УЦЕНКА ПОСУДА" } ``` but when I change this line ```file.write(csv + '\n');``` to ```file.write('\ufeff' + csv + '\n');``` all work as aspected. ```'\ufeff'``` means add BOM to the file (taken here https://stackoverflow.com/a/13859239/1335142)
SheetJSDev commented 2017-11-27 15:44:46 +00:00 (Migrated from github.com)

So to be clear, Excel does not assume UTF8 by default -- it actually assumes the system codepage. You can see this by trying to open the file in Excel. This is what the save.csv file (without the BOM) looks like in Excel 2013 in the English (United States) locale with default CP 1252:

issue907

If you add the BOM then Excel will treat the file as UTF8, which is why it looks correct in the second case.

If you have control over the origin, I would make sure that the generator is properly adding the BOM.

So to be clear, Excel does not assume UTF8 by default -- it actually assumes the system codepage. You can see this by trying to open the file in Excel. This is what the save.csv file (without the BOM) looks like in Excel 2013 in the English (United States) locale with default CP 1252: <img width="589" alt="issue907" src="https://user-images.githubusercontent.com/6070939/33274172-50abb20a-d35d-11e7-95a6-f32c8aa70c3c.png"> If you add the BOM then Excel will treat the file as UTF8, which is why it looks correct in the second case. If you have control over the origin, I would make sure that the generator is properly adding the BOM.
popovserhii commented 2017-11-27 17:46:35 +00:00 (Migrated from github.com)

This is not expected behavior as I obviously set encoding: 'utf8' and hope other utilities understand that.
I can add BOM in my codebase but this is single case. In most cases, developers get a complete file from other resources (services, APIs) and would like to work with that but in this case, he/she should convert a file before the following processing.

Is there any possibility "explain" XLSX that file has correct encoding without converting?

This is not expected behavior as I obviously set ```encoding: 'utf8'``` and hope other utilities understand that. I can add BOM in my codebase but this is single case. In most cases, developers get a complete file from other resources (services, APIs) and would like to work with that but in this case, he/she should convert a file before the following processing. Is there any possibility "explain" XLSX that file has correct encoding without converting?
SheetJSDev commented 2017-11-28 02:25:43 +00:00 (Migrated from github.com)

I obviously set encoding: 'utf8' and hope other utilities understand that.

Guess what program doesn't understand UTF-8 by default? You guessed it: Excel. Excel's default behavior is to interpret in the local codepage.

If you'd like to test it out, the CHAR function gives you different results based on local codepage:

fs.writeFileSync("out.csv", Array.from({length:256}, (x,i) => `=CHAR(${i})`).join("\r\n"));

Change your computer region settings and try to open the file to see a fun surprise.

We'll definitely need to add a codepage option at some point (so you would be able to force UTF8 interpretation with codepage:65001) but it's unclear if it makes sense to default to UTF8.

> I obviously set `encoding: 'utf8'` and hope other utilities understand that. Guess what program doesn't understand UTF-8 by default? You guessed it: Excel. Excel's default behavior is to interpret in the local codepage. If you'd like to test it out, the `CHAR` function gives you different results based on local codepage: ```js fs.writeFileSync("out.csv", Array.from({length:256}, (x,i) => `=CHAR(${i})`).join("\r\n")); ``` Change your computer region settings and try to open the file to see a fun surprise. We'll definitely need to add a `codepage` option at some point (so you would be able to force UTF8 interpretation with `codepage:65001`) but it's unclear if it makes sense to default to UTF8.
popovserhii commented 2017-11-28 16:30:02 +00:00 (Migrated from github.com)

I've changed my computer region and language settings to the United Kingdom and English respectively and I've opened the file, nothing changed... As I can guess, Excel is not to interpret in the local codebase.

Here is my computer settings and opened file 2017-11-28_1818

I've changed my computer region and language settings to the United Kingdom and English respectively and I've opened the file, nothing changed... As I can guess, Excel is not to interpret in the local codebase. Here is my computer settings and opened file ![2017-11-28_1818](https://user-images.githubusercontent.com/1991183/33331359-0ac3ce4a-d46a-11e7-950c-db95c78e19b7.png)
SheetJSDev commented 2017-12-09 07:29:49 +00:00 (Migrated from github.com)

The option to set the default codepage to Cyrillic is codepage:1251 e.g.

const XLSX = require('xlsx');
const wb = XLSX.readFile("save.csv", {codepage:1251});
The option to set the default codepage to Cyrillic is `codepage:1251` e.g. ```js const XLSX = require('xlsx'); const wb = XLSX.readFile("save.csv", {codepage:1251}); ```
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#907
No description provided.