CSV euro symbol format error #3185

Closed
opened 2024-08-30 08:17:09 +00:00 by michael-arduni · 5 comments

Hello,
I'm trying to export a csv file and I noticed that the symbol is not properly displayed within a cell. It seems to be a miss encoding configuration somewhere.

If you take a look at the column Importo, you see that all the cells contain € instead of

Hello, I'm trying to export a csv file and I noticed that the `€` symbol is not properly displayed within a cell. It seems to be a miss encoding configuration somewhere. If you take a look at the column `Importo`, you see that all the cells contain `€` instead of `€`
Owner

Please review this response. If it makes sense, we will add this to the docs. The issue is obliquely mentioned in "Troubleshooting" but it deserves a longer explanation


SheetJS is mirroring Excel's behavior. Opening the CSV in en-US Excel for Mac shows incorrect text in column C:

i3185.png

This is because the file uses UTF-8 encoding but does not include the BOM. Excel parses the file using a system default codepage (typically codepage 1252 in en-US Windows and 10000 on en-US Mac).

To see the raw bytes in the file, inspect the output from xxd deadlines_t.csv | head:

          VV -- File does not start with BOM
00000000: 4461 7461 2053 6361 6465 6e7a 612c 436f  Data Scadenza,Co
00000010: 6e74 726f 7061 7274 652c 496d 706f 7274  ntroparte,Import
00000020: 6f2c 5374 6174 6f2c 4f72 6967 696e 652c  o,Stato,Origine,
00000030: 4461 7461 2064 6f63 756d 656e 746f 2c43  Data documento,C
00000040: 6f6e 746f 2064 6920 6170 706f 6767 696f  onto di appoggio
00000050: 2c4d 6574 6f64 6f20 6469 2070 6167 616d  ,Metodo di pagam
00000060: 656e 746f 0d0a 3331 2f30 382f 3230 3234  ento..31/08/2024
                         VVVV VV -- UTF8 encoding of €
00000070: 2c61 2c31 3030 e282 ac2c 4461 2069 6e63  ,a,100...,Da inc
00000080: 6173 7361 7265 2c43 6f6d 7065 6e73 617a  assare,Compensaz
00000090: 696f 6e65 2c2c 612c 426f 6e69 6669 636f  ione,,a,Bonifico

The sequence e2 82 ac in the 8th line corresponds to the UTF8 encoding of .

To force a UTF8 encoding on a CSV file without a proper BOM, pass the option codepage: 65001 to read or readFile.

If you are using third party software to generate CSV, ask them to add the BOM.


To verify the behavior in Excel, the BOM ef bb bf was manually added to the attached CSV file:

          VVVV VV --- UTF8 BOM manually added
00000000: efbb bf44 6174 6120 5363 6164 656e 7a61  ...Data Scadenza
00000010: 2c43 6f6e 7472 6f70 6172 7465 2c49 6d70  ,Controparte,Imp
00000020: 6f72 746f 2c53 7461 746f 2c4f 7269 6769  orto,Stato,Origi
00000030: 6e65 2c44 6174 6120 646f 6375 6d65 6e74  ne,Data document
00000040: 6f2c 436f 6e74 6f20 6469 2061 7070 6f67  o,Conto di appog
00000050: 6769 6f2c 4d65 746f 646f 2064 6920 7061  gio,Metodo di pa
00000060: 6761 6d65 6e74 6f0d 0a33 312f 3038 2f32  gamento..31/08/2
00000070: 3032 342c 612c 3130 30e2 82ac 2c44 6120  024,a,100...,Da 
00000080: 696e 6361 7373 6172 652c 436f 6d70 656e  incassare,Compen
00000090: 7361 7a69 6f6e 652c 2c61 2c42 6f6e 6966  sazione,,a,Bonif

The new CSV file opens in Excel with the correct text.

Please review this response. If it makes sense, we will add this to the docs. The issue is obliquely mentioned in ["Troubleshooting"](https://docs.sheetjs.com/docs/miscellany/errors#dbf-files-with-chinese-or-japanese-characters-have-underscores) but it deserves a longer explanation --- SheetJS is mirroring Excel's behavior. Opening the CSV in en-US Excel for Mac shows incorrect text in column C: ![i3185.png](/attachments/40c17351-bb26-4cb3-8001-0ea04a0716ef) This is because the file uses UTF-8 encoding but does not include the BOM. Excel parses the file using a system default codepage (typically codepage 1252 in en-US Windows and 10000 on en-US Mac). To see the raw bytes in the file, inspect the output from `xxd deadlines_t.csv | head`: ``` VV -- File does not start with BOM 00000000: 4461 7461 2053 6361 6465 6e7a 612c 436f Data Scadenza,Co 00000010: 6e74 726f 7061 7274 652c 496d 706f 7274 ntroparte,Import 00000020: 6f2c 5374 6174 6f2c 4f72 6967 696e 652c o,Stato,Origine, 00000030: 4461 7461 2064 6f63 756d 656e 746f 2c43 Data documento,C 00000040: 6f6e 746f 2064 6920 6170 706f 6767 696f onto di appoggio 00000050: 2c4d 6574 6f64 6f20 6469 2070 6167 616d ,Metodo di pagam 00000060: 656e 746f 0d0a 3331 2f30 382f 3230 3234 ento..31/08/2024 VVVV VV -- UTF8 encoding of € 00000070: 2c61 2c31 3030 e282 ac2c 4461 2069 6e63 ,a,100...,Da inc 00000080: 6173 7361 7265 2c43 6f6d 7065 6e73 617a assare,Compensaz 00000090: 696f 6e65 2c2c 612c 426f 6e69 6669 636f ione,,a,Bonifico ``` The sequence `e2 82 ac` in the 8th line corresponds to the UTF8 encoding of `€`. To force a UTF8 encoding on a CSV file without a proper BOM, pass the [option `codepage: 65001` to `read` or `readFile`](https://docs.sheetjs.com/docs/api/parse-options#parsing-options). If you are using third party software to generate CSV, ask them to add the BOM. --- To verify the behavior in Excel, the BOM `ef bb bf` was manually added to the attached CSV file: ``` VVVV VV --- UTF8 BOM manually added 00000000: efbb bf44 6174 6120 5363 6164 656e 7a61 ...Data Scadenza 00000010: 2c43 6f6e 7472 6f70 6172 7465 2c49 6d70 ,Controparte,Imp 00000020: 6f72 746f 2c53 7461 746f 2c4f 7269 6769 orto,Stato,Origi 00000030: 6e65 2c44 6174 6120 646f 6375 6d65 6e74 ne,Data document 00000040: 6f2c 436f 6e74 6f20 6469 2061 7070 6f67 o,Conto di appog 00000050: 6769 6f2c 4d65 746f 646f 2064 6920 7061 gio,Metodo di pa 00000060: 6761 6d65 6e74 6f0d 0a33 312f 3038 2f32 gamento..31/08/2 00000070: 3032 342c 612c 3130 30e2 82ac 2c44 6120 024,a,100...,Da 00000080: 696e 6361 7373 6172 652c 436f 6d70 656e incassare,Compen 00000090: 7361 7a69 6f6e 652c 2c61 2c42 6f6e 6966 sazione,,a,Bonif ``` The new CSV file opens in Excel with the correct text.
Author

Thanks for your answer, but why I would force UTF8 encoding without a proper BOM?
Is not possible to add the BOM when writing a csv file?

Thanks for your answer, but why I would force UTF8 encoding without a proper BOM? Is not possible to add the BOM when writing a csv file?
Owner

If you are exporting CSVs using SheetJS:

var aoa = [['€']];
XLSX.writeFile(XLSX.utils.book_new(XLSX.utils.aoa_to_sheet(aoa)), "i3185.csv")

Small demo: https://jsfiddle.net/kv59zog1/

  • sheet_to_csv returns a normal JavaScript string. If you are writing the file using DOM or NodeJS APIs, you will most likely have to add the BOM programmatically before exporting.
var aoa = [['€']];
var csv_no_bom = XLSX.utils.sheet_to_csv(XLSX.utils.aoa_to_sheet(aoa))
If you are exporting CSVs using SheetJS: - [`writeFile` always adds the BOM](https://docs.sheetjs.com/docs/api/write-options#output-type): ```js var aoa = [['€']]; XLSX.writeFile(XLSX.utils.book_new(XLSX.utils.aoa_to_sheet(aoa)), "i3185.csv") ``` Small demo: https://jsfiddle.net/kv59zog1/ - `sheet_to_csv` returns a normal JavaScript string. If you are writing the file using DOM or NodeJS APIs, you will most likely have to add the BOM programmatically before exporting. ```js var aoa = [['€']]; var csv_no_bom = XLSX.utils.sheet_to_csv(XLSX.utils.aoa_to_sheet(aoa)) ```
Author

It's clear, my only issues now is that I have to return a string representation of the genereted csv with the BOM encoding.
Once the file is read, I have an object of type WorkBook . As you said, if I use XLSX.utils.sheet_to_csv(file) it will return a string without BOM encoding.

How can I read a csv file and returning its string value keeping the BOM encoding?

It's clear, my only issues now is that I have to return a string representation of the genereted csv with the BOM encoding. Once the file is read, I have an object of type `WorkBook` . As you said, if I use `XLSX.utils.sheet_to_csv(file)` it will return a string without BOM encoding. How can I read a csv file and returning its string value keeping the BOM encoding?
Owner

If your API requires you to return a string, you can manually prepend \uFEFF.


To keep it simple, suppose the worksheet represents the following array of arrays:

var data = [ [ "a", "€", 1 ] ]
var ws = XLSX.utils.aoa_to_sheet(data);
var wb = XLSX.utils.book_new(ws);

Most modern JS functions that generate files (e.g. HTML5 download attribute) will run text data through TextEncoder:

var csv_string = XLSX.utils.sheet_to_csv(ws);
var ab = new TextEncoder().encode(csv_string);
console.log(new Uint8Array(ab));

You will see 7 bytes:

[ 97, 44, 226, 130, 172, 44, 49 ]
//        |-----------|  UTF8 encoding of the euro symbol

Manually prepending the BOM character will add the required bytes to the output:

var csv_string = XLSX.utils.sheet_to_csv(ws);
var ab = new TextEncoder().encode("\uFEFF" + csv_string);
console.log(new Uint8Array(ab));

You will see the BOM bytes 0xEF 0xBB 0xBF:

[ 239, 187, 191, 97, 44, 226, 130, 172, 44, 49 ]
//|-----------| UTF8 BOM bytes

If you are using SheetJS to read CSV files, you can force a UTF8 interpretation by passing the option codepage: 65001.

If you are receiving a raw byte string from FileReader#readAsBinaryString or a similar API, pass the option type: "binary"

If your API requires you to return a string, you can manually prepend `\uFEFF`. --- To keep it simple, suppose the worksheet represents the following array of arrays: ```js var data = [ [ "a", "€", 1 ] ] var ws = XLSX.utils.aoa_to_sheet(data); var wb = XLSX.utils.book_new(ws); ``` Most modern JS functions that generate files (e.g. HTML5 `download` attribute) will run text data through `TextEncoder`: ```js var csv_string = XLSX.utils.sheet_to_csv(ws); var ab = new TextEncoder().encode(csv_string); console.log(new Uint8Array(ab)); ``` You will see 7 bytes: ```js [ 97, 44, 226, 130, 172, 44, 49 ] // |-----------| UTF8 encoding of the euro symbol ``` Manually prepending the BOM character will add the required bytes to the output: ```js var csv_string = XLSX.utils.sheet_to_csv(ws); var ab = new TextEncoder().encode("\uFEFF" + csv_string); console.log(new Uint8Array(ab)); ``` You will see the BOM bytes 0xEF 0xBB 0xBF: ```js [ 239, 187, 191, 97, 44, 226, 130, 172, 44, 49 ] //|-----------| UTF8 BOM bytes ``` --- If you are using SheetJS to *read* CSV files, you can force a UTF8 interpretation by passing the option [`codepage: 65001`](https://docs.sheetjs.com/docs/api/parse-options#parsing-options). If you are receiving a raw byte string from [`FileReader#readAsBinaryString`](https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsBinaryString) or a similar API, pass the option [`type: "binary"`](https://docs.sheetjs.com/docs/api/parse-options#input-type)
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#3185
No description provided.