Problem with reading cyrillic CSV without BOM (ANSI as UTF-8) #907
Reference: sheetjs/sheetjs#907
I stumbled over a problem with reading a CSV file without a BOM on Windows 10 (not tested on Unix).
My script generates the CSV file as follows:
The file opens fine in Notepad++, but the bottom-right corner says "ANSI as UTF-8".
When I try to read this file with XLSX, it returns the Cyrillic text in the wrong encoding.
And when I inspect the value in the debugger, it returns
but when I change this line
file.write(csv + '\n');
to
file.write('\ufeff' + csv + '\n');
everything works as expected.
'\ufeff' adds a BOM to the file (taken from https://stackoverflow.com/a/13859239/1335142).

So to be clear, Excel does not assume UTF8 by default -- it actually assumes the system codepage. You can see this by trying to open the file in Excel. This is what the save.csv file (without the BOM) looks like in Excel 2013 in the English (United States) locale with default CP 1252:
If you add the BOM then Excel will treat the file as UTF8, which is why it looks correct in the second case.
If you have control over the origin, I would make sure that the generator is properly adding the BOM.
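As a rough illustration of "making sure the generator adds the BOM", here is a minimal Node.js sketch. The helper names `withBom` and `hasUtf8Bom`, the sample rows, and the temp-file name are all made up for this example; they are not part of the original script or of XLSX.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: prepend U+FEFF unless the text already starts with it.
function withBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text : '\ufeff' + text;
}

// Hypothetical helper: check whether raw bytes start with the UTF-8 BOM (EF BB BF).
function hasUtf8Bom(buf) {
  return buf.length >= 3 && buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF;
}

// Made-up sample data with Cyrillic text.
const csv = 'имя,город\nИван,Москва\n';

// Write to a temp file (illustrative path) with the BOM in front.
const file = path.join(os.tmpdir(), 'bom-demo.csv');
fs.writeFileSync(file, withBom(csv), { encoding: 'utf8' });

// Read the raw bytes back and confirm the BOM is present.
const bytes = fs.readFileSync(file);
console.log(hasUtf8Bom(bytes));
```

The same `hasUtf8Bom` check can be run on files received from elsewhere to decide whether a BOM needs to be added before handing them to a tool that relies on it.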
This is not expected behavior, as I explicitly set
encoding: 'utf8'
and hoped other utilities would understand that.

I can add a BOM in my codebase, but this is a single case. In most cases, developers get a complete file from other sources (services, APIs) and would like to work with it as is, but in this case they have to convert the file before further processing.
Is there any way to tell XLSX that the file has the correct encoding without converting it?
Guess what program doesn't understand UTF-8 by default? You guessed it: Excel. Excel's default behavior is to interpret in the local codepage.
If you'd like to test it out, the CHAR function gives you different results based on local codepage:
Change your computer region settings and try to open the file to see a fun surprise.
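To make the codepage point concrete, here is a small Node.js sketch that decodes the same bytes two ways. It uses Node's built-in `latin1` decoding purely as a stand-in for "some single-byte codepage"; it is not CP 1252 exactly and is not a model of what Excel or XLSX does internally. The sample word is made up.

```javascript
// Made-up sample text: six Cyrillic letters.
const text = 'Привет';

// UTF-8 encodes each of these letters as two bytes.
const utf8Bytes = Buffer.from(text, 'utf8');

// Decoding the bytes as UTF-8 round-trips the original text.
const asUtf8 = utf8Bytes.toString('utf8');

// Decoding the same bytes one byte per character yields twice as many
// characters, none of them Cyrillic: the classic "bad encoding" symptom.
const asLatin1 = utf8Bytes.toString('latin1');

console.log(utf8Bytes.length, asUtf8.length, asLatin1.length);
console.log(asUtf8 === text, asLatin1 === text);
```

Without a BOM or an explicit option, a reader has only the bytes to go on, so whichever interpretation it defaults to decides which of the two strings you get.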
We'll definitely need to add a codepage option at some point (so you would be able to force UTF8 interpretation with codepage:65001) but it's unclear if it makes sense to default to UTF8.

I've changed my computer's region and language settings to United Kingdom and English respectively and opened the file, but nothing changed... As far as I can tell, Excel does not interpret the file in the local codepage.
Here are my computer settings and the opened file
The option to set the default codepage to Cyrillic is codepage:1251, e.g.