How to read a file as UTF-8? #3128

Closed
opened 2024-05-15 16:40:42 +00:00 by evoytenkoapps · 7 comments

Can't read a CSV file containing Cyrillic text.

"xlsx": "^0.18.5"
Node.js 14
macOS

Owner

SheetJS is consistent with Excel in that the default codepage is used to interpret CSV data. Modern text editors tend to use UTF8 as the default encoding.

There are two ways to force UTF8 interpretation:

  1. Ensure that the file starts with the UTF8 BOM (\xEF\xBB\xBF).

  2. Explicitly pass the option codepage: 65001 to XLSX.read.
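The first approach can be sketched in plain Node.js. This is a minimal illustration, not part of SheetJS: the helper name `ensureUtf8Bom` is hypothetical, and it only prepends the three BOM bytes when they are missing.

```javascript
// The UTF8 BOM bytes mentioned above: EF BB BF
const UTF8_BOM = Buffer.from([0xef, 0xbb, 0xbf]);

// Hypothetical helper: return the buffer unchanged if it already starts
// with the BOM, otherwise return a new buffer with the BOM prepended.
function ensureUtf8Bom(buf) {
  const hasBom = buf.length >= 3 && buf.subarray(0, 3).equals(UTF8_BOM);
  return hasBom ? buf : Buffer.concat([UTF8_BOM, buf]);
}

// "d0853b" is the UTF8 encoding of the Cyrillic letter Ѕ followed by ";"
const fixed = ensureUtf8Bom(Buffer.from("d0853b", "hex"));
console.log(fixed.toString("hex"));
```

The resulting buffer could then be handed to `XLSX.read` with `type: "buffer"`. The second approach skips this step and passes `codepage: 65001` instead.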

Author

I ran this command on macOS. Is the file encoded correctly?

file -I upload_afd2e045490707e4784134a22456c776.csv

upload_afd2e045490707e4784134a22456c776.csv: text/plain; charset=utf-8

Owner

This explanation will be added somewhere in the docs.

Consider the following content:

ЅНЕ̂ЀТЈЅ;54337
ЅНЕ̂ЀТЈЅ;54337

The UTF8 encoded version, including the BOM, is i3128-utf8-bom.csv and has the following contents:

% xxd i3128-utf8-bom.csv 
00000000: efbb bfd0 85d0 9dd0 95cc 82d0 80d0 a2d0  ................
00000010: 88d0 853b 3534 3333 370a d085 d09d d095  ...;54337.......
00000020: cc82 d080 d0a2 d088 d085 3b35 3433 3337  ..........;54337
00000030: 0a

This file has the UTF8 BOM so Excel will treat it as UTF8.


The version without a BOM is i3128-utf8-nobom.csv:

% xxd i3128-utf8-nobom.csv 
00000000: d085 d09d d095 cc82 d080 d0a2 d088 d085  ................
00000010: 3b35 3433 3337 0ad0 85d0 9dd0 95cc 82d0  ;54337..........
00000020: 80d0 a2d0 88d0 853b 3534 3333 370a       .......;54337.

Excel tries to interpret it using the default codepage. In English (United States) you see the following mess:

i3128-utf8-nobom.png

SheetJS produces a similar mess:

% npx xlsx-cli i3128-utf8-nobom.csv 
Sheet1
ЅНЕ̂ЀТЈЅ,54337
ЅНЕ̂ЀТЈЅ,54337
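The effect can be reproduced without a spreadsheet library. The sketch below takes the first six bytes of the BOM-less hex dump and decodes them twice. Node's `latin1` encoding is used here as a stand-in for a single-byte default codepage. That is an assumption for illustration: the actual default codepage depends on the system locale, and the exact garbled characters will differ.

```javascript
// First three characters of the sample file (Ѕ Н Е), as UTF8 bytes
const bytes = Buffer.from("d085d09dd095", "hex");

// Decoded as UTF8: each pair of bytes becomes one Cyrillic letter
const asUtf8 = bytes.toString("utf8");

// Decoded with a single-byte encoding: every byte becomes its own character,
// so each letter turns into two unrelated characters
const asSingleByte = bytes.toString("latin1");

console.log(asUtf8.length);        // 3 characters
console.log(asSingleByte.length);  // 6 characters
```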

The file command applies heuristics to determine the MIME type and charset. It does not care about the presence of the BOM. Other tools, including Apple Numbers, will assume UTF8 encoding by default.
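To see why a content-based check reports both sample files as UTF8, compare "is this valid UTF8?" with "does this start with a BOM?". The sketch below is a rough illustration, not how `file` is implemented. The function name `describeEncoding` is hypothetical, and it assumes a Node.js build where `TextDecoder` supports the `fatal` option (the default builds of recent versions do).

```javascript
const BOM_BYTES = Buffer.from([0xef, 0xbb, 0xbf]);

// Hypothetical helper: report whether the bytes decode cleanly as UTF8
// and, separately, whether they begin with the UTF8 BOM.
function describeEncoding(buf) {
  let validUtf8 = true;
  try {
    // With fatal: true, decode() throws on any invalid byte sequence
    new TextDecoder("utf-8", { fatal: true }).decode(buf);
  } catch (err) {
    validUtf8 = false;
  }
  const hasBom = buf.length >= 3 && buf.subarray(0, 3).equals(BOM_BYTES);
  return { validUtf8, hasBom };
}

// Leading bytes taken from the two hex dumps above
const withBom = Buffer.from("efbbbfd085d09d", "hex");
const noBom = Buffer.from("d085d09d", "hex");

console.log(describeEncoding(withBom)); // valid UTF8, BOM present
console.log(describeEncoding(noBom));   // valid UTF8, BOM absent
```

Both buffers are valid UTF8, which is all a content heuristic can observe. Only the BOM check distinguishes them, and that is the signal Excel and SheetJS use by default.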

Author

So what will help me?

Owner

Pass the option codepage: 65001 to XLSX.readFile. See https://docs.sheetjs.com/docs/api/parse-options#parsing-options
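A minimal sketch of that call follows. The wrapper name `readUtf8Csv` is hypothetical, and the module is passed in as an argument so the snippet does not depend on how `xlsx` is loaded in a given project. The calls used are `XLSX.readFile` and `XLSX.utils.sheet_to_json`.

```javascript
// Hypothetical wrapper. `XLSX` is assumed to be the object returned by
// require("xlsx"); `filename` is the path to a BOM-less UTF8 CSV file.
function readUtf8Csv(XLSX, filename) {
  // codepage 65001 asks the parser to interpret the bytes as UTF8
  const workbook = XLSX.readFile(filename, { codepage: 65001 });

  // A CSV file yields a workbook with a single worksheet
  const firstSheet = workbook.Sheets[workbook.SheetNames[0]];

  // header: 1 returns an array of row arrays instead of keyed objects
  return XLSX.utils.sheet_to_json(firstSheet, { header: 1 });
}
```

Usage would look like `readUtf8Csv(require("xlsx"), "i3128-utf8-nobom.csv")`, which should return the rows with the Cyrillic text intact if the option is honored as described above.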

Author

That doesn't work. What else can I try?

Owner

You probably want to resolve(x). If you are having further issues, join the chat: https://discord.gg/sheetjs
sheetjs locked and limited conversation to collaborators 2024-05-15 18:32:48 +00:00
Reference: sheetjs/sheetjs#3128