Encoding issue when convertting slk file to csv #1544

New Issue

2019-06-20T13:29:22Z

Slayess commented

2019-06-20 13:29:22 +00:00

(Migrated from github.com)

Hello,

I'm trying to convert a slk file to csv file, with the code see below :

var wb = XLSX.readFile(file.path);
var ws1 = wb.Sheets[wb.SheetNames[0]];
var csv = XLSX.utils.sheet_to_csv(ws1);
console.log(csv);

Here a link for the slk sample file

When I'm opening the source file with Excel the result is :

The result on the console or csv file is :

Could you please tell me if I'm doing something wrong ?
Thanks in advance.

Hello, I'm trying to convert a slk file to csv file, with the code see below : ``` var wb = XLSX.readFile(file.path); var ws1 = wb.Sheets[wb.SheetNames[0]]; var csv = XLSX.utils.sheet_to_csv(ws1); console.log(csv); ``` [Here a link for the slk sample file](https://we.tl/t-NysldsGJVa) When I'm opening the source file with Excel the result is : ![Excel source](https://user-images.githubusercontent.com/52039612/59852724-bcd15880-936f-11e9-8e53-64ab9f101cd8.jpg) The result on the console or csv file is : ![console result](https://user-images.githubusercontent.com/52039612/59852702-afb46980-936f-11e9-8f7e-74ec54492164.jpg) Could you please tell me if I'm doing something wrong ? Thanks in advance.

SheetJSDev commented

2019-06-20 21:04:47 +00:00

(Migrated from github.com)

The file uses some sort of 7-bit escape sequence (like VT-52). The hexadecimal byte representations, with a CP1252 interpretation replacing the ascii ESC with _, look like this:

50 72 1b 4e 42 65 73   Pr_NBes
41 4e 44 52 c9 45      ANDRÉE

You don't see the N in the console because 0x1b 0x4e is interpreted as an escape sequence.

Excel 2019 seems to prefer using those, since opening and saving the file actually rewrites cell A2 to use a similar scheme:

41 4e 44 52 1b 4e 42 45 45  ANDR_NBEE

We would have to go back and either brute-force the characters Excel uses in CHAR function or try to find old documentation. Do you have any other characters in mind?

The file uses some sort of 7-bit escape sequence (like VT-52). The hexadecimal byte representations, with a CP1252 interpretation replacing the ascii ESC with _, look like this: ``` 50 72 1b 4e 42 65 73 Pr_NBes 41 4e 44 52 c9 45 ANDRÉE ``` You don't see the `N` in the console because `0x1b 0x4e` is interpreted as an escape sequence. Excel 2019 seems to prefer using those, since opening and saving the file actually rewrites cell A2 to use a similar scheme: ``` 41 4e 44 52 1b 4e 42 45 45 ANDR_NBEE ``` We would have to go back and either brute-force [the characters Excel uses in CHAR function](https://github.com/SheetJS/js-codepage/blob/master/bits/1252.js) or try to find old documentation. Do you have any other characters in mind?

Slayess commented

2019-06-21 10:17:03 +00:00

(Migrated from github.com)

Thanks for your reply, it's most clear for me :)
I wrote the list of the french accent and I updated my code with :

var wb = XLSX.readFile(file.path);
var ws1 = wb.Sheets[wb.SheetNames[0]];
var csv = XLSX.utils.sheet_to_csv(ws1, {Fr: ';'});
var obj = {Aa:'à',Ca:'â',Ha:'ä',Be:'é',Ae:'è',Ce:'ê',He:'ë',Ci:'î',Hi:'ï',Co:'ô',Ho:'ö',Au:'ù',Cu:'û',Hu:'ü',q:'æ',z:'œ',Kc:'ç',AA:'À',CA:'Â',HA:'Ä',BE:'É',AE:'È',CE:'Ê',HE:'Ë',CI:'Î',HI:'Ï',CO:'Ô',HO:'Ö',AU:'Ù',CU:'Û',HU:'Ü',a:'Æ',j:'Œ',KC:'Ç'};
csv = csv.replace(/(\u001B\u004E)(Aa|Ca|Ha|Be|Ae|Ce|He|Ci|Hi|Co|Ho|Au|Cu|Hu|q|z|Kc|AA|CA|HA|BE|AE|CE|HE|CI|HI|CO|HO|AU|CU|HU|a|j|KC)/gm, function(_, $1, $2){return obj[$2]});
fs.writeFileSync(file.path + '__', csv, 'utf8');

Could you please tell me if it's a «clean» way ?

Thanks for your reply, it's most clear for me :) I wrote the list of the french accent and I updated my code with : ![accent](https://user-images.githubusercontent.com/52039612/59916017-5f481500-941e-11e9-80a6-5ffcb1eb646c.jpg) ``` var wb = XLSX.readFile(file.path); var ws1 = wb.Sheets[wb.SheetNames[0]]; var csv = XLSX.utils.sheet_to_csv(ws1, {Fr: ';'}); var obj = {Aa:'à',Ca:'â',Ha:'ä',Be:'é',Ae:'è',Ce:'ê',He:'ë',Ci:'î',Hi:'ï',Co:'ô',Ho:'ö',Au:'ù',Cu:'û',Hu:'ü',q:'æ',z:'œ',Kc:'ç',AA:'À',CA:'Â',HA:'Ä',BE:'É',AE:'È',CE:'Ê',HE:'Ë',CI:'Î',HI:'Ï',CO:'Ô',HO:'Ö',AU:'Ù',CU:'Û',HU:'Ü',a:'Æ',j:'Œ',KC:'Ç'}; csv = csv.replace(/(\u001B\u004E)(Aa|Ca|Ha|Be|Ae|Ce|He|Ci|Hi|Co|Ho|Au|Cu|Hu|q|z|Kc|AA|CA|HA|BE|AE|CE|HE|CI|HI|CO|HO|AU|CU|HU|a|j|KC)/gm, function(_, $1, $2){return obj[$2]}); fs.writeFileSync(file.path + '__', csv, 'utf8'); ``` Could you please tell me if it's a «clean» way ?

SheetJSDev commented

2019-06-21 14:30:34 +00:00

(Migrated from github.com)

Ultimately this logic will be pushed into the slk parser.

There seem to be two different sets of patterns involved. In addition to the \u001BN encoding, there is another two-character encoding like \u001B 9 (ESC SPC 9) which maps to the tab character! That second one is a bit more regular: the first character encodes the high 4 bits (starting at SPC which has ASCII code 0x20) and the second character encodes the low 4 bits (starting at "0" which has ASCII code 0x30). Interestingly there are two valid encodings of Œ: \u001BNj and \u001B(< . We'll do a bit more digging to see if there's an actual pattern in the \001BN case.

PS: in your regular expression the first group is not necessary:

csv = csv.replace(/\u001B\u004E(Aa|Ca|Ha|Be|Ae|Ce|He|Ci|Hi|Co|Ho|Au|Cu|Hu|q|z|Kc|AA|CA|HA|BE|AE|CE|HE|CI|HI|CO|HO|AU|CU|HU|a|j|KC)/gm, function(_, $1){return obj[$1]});

If you wanted to go even further, you could programmatically generate the regexp:

csv = csv.replace(new RegExp("\u001BN(" + Object.keys(obj).join("|") + ")", "gm"), function(_, $1){return obj[$1]});

Ultimately this logic will be pushed into the slk parser. There seem to be two different sets of patterns involved. In addition to the `\u001BN` encoding, there is another two-character encoding like `\u001B 9` (ESC SPC 9) which maps to the tab character! That second one is a bit more regular: the first character encodes the high 4 bits (starting at SPC which has ASCII code `0x20`) and the second character encodes the low 4 bits (starting at "0" which has ASCII code `0x30`). Interestingly there are two valid encodings of `Œ`: `\u001BNj` and `\u001B(<` . We'll do a bit more digging to see if there's an actual pattern in the `\001BN` case. PS: in your regular expression the first group is not necessary: ```js csv = csv.replace(/\u001B\u004E(Aa|Ca|Ha|Be|Ae|Ce|He|Ci|Hi|Co|Ho|Au|Cu|Hu|q|z|Kc|AA|CA|HA|BE|AE|CE|HE|CI|HI|CO|HO|AU|CU|HU|a|j|KC)/gm, function(_, $1){return obj[$1]}); ``` If you wanted to go even further, you could programmatically generate the regexp: ```js csv = csv.replace(new RegExp("\u001BN(" + Object.keys(obj).join("|") + ")", "gm"), function(_, $1){return obj[$1]}); ```

Slayess commented

2019-06-22 06:19:44 +00:00

(Migrated from github.com)

Thanks a lot for your help ;)

Sign in to join this conversation.