Encoding issue when convertting slk file to csv #1544

Closed
opened 2019-06-20 13:29:22 +00:00 by Slayess · 4 comments
Slayess commented 2019-06-20 13:29:22 +00:00 (Migrated from github.com)

Hello,

I'm trying to convert a slk file to csv file, with the code see below :

var wb = XLSX.readFile(file.path);
var ws1 = wb.Sheets[wb.SheetNames[0]];
var csv = XLSX.utils.sheet_to_csv(ws1);
console.log(csv);

Here a link for the slk sample file

When I'm opening the source file with Excel the result is :
Excel source

The result on the console or csv file is :
console result

Could you please tell me if I'm doing something wrong ?
Thanks in advance.

Hello, I'm trying to convert a slk file to csv file, with the code see below : ``` var wb = XLSX.readFile(file.path); var ws1 = wb.Sheets[wb.SheetNames[0]]; var csv = XLSX.utils.sheet_to_csv(ws1); console.log(csv); ``` [Here a link for the slk sample file](https://we.tl/t-NysldsGJVa) When I'm opening the source file with Excel the result is : ![Excel source](https://user-images.githubusercontent.com/52039612/59852724-bcd15880-936f-11e9-8e53-64ab9f101cd8.jpg) The result on the console or csv file is : ![console result](https://user-images.githubusercontent.com/52039612/59852702-afb46980-936f-11e9-8f7e-74ec54492164.jpg) Could you please tell me if I'm doing something wrong ? Thanks in advance.
SheetJSDev commented 2019-06-20 21:04:47 +00:00 (Migrated from github.com)

The file uses some sort of 7-bit escape sequence (like VT-52). The hexadecimal byte representations, with a CP1252 interpretation replacing the ascii ESC with _, look like this:

50 72 1b 4e 42 65 73   Pr_NBes
41 4e 44 52 c9 45      ANDRÉE

You don't see the N in the console because 0x1b 0x4e is interpreted as an escape sequence.

Excel 2019 seems to prefer using those, since opening and saving the file actually rewrites cell A2 to use a similar scheme:

41 4e 44 52 1b 4e 42 45 45  ANDR_NBEE

We would have to go back and either brute-force the characters Excel uses in CHAR function or try to find old documentation. Do you have any other characters in mind?

The file uses some sort of 7-bit escape sequence (like VT-52). The hexadecimal byte representations, with a CP1252 interpretation replacing the ascii ESC with _, look like this: ``` 50 72 1b 4e 42 65 73 Pr_NBes 41 4e 44 52 c9 45 ANDRÉE ``` You don't see the `N` in the console because `0x1b 0x4e` is interpreted as an escape sequence. Excel 2019 seems to prefer using those, since opening and saving the file actually rewrites cell A2 to use a similar scheme: ``` 41 4e 44 52 1b 4e 42 45 45 ANDR_NBEE ``` We would have to go back and either brute-force [the characters Excel uses in CHAR function](https://github.com/SheetJS/js-codepage/blob/master/bits/1252.js) or try to find old documentation. Do you have any other characters in mind?
Slayess commented 2019-06-21 10:17:03 +00:00 (Migrated from github.com)

Thanks for your reply, it's most clear for me :)
I wrote the list of the french accent and I updated my code with :
accent

var wb = XLSX.readFile(file.path);
var ws1 = wb.Sheets[wb.SheetNames[0]];
var csv = XLSX.utils.sheet_to_csv(ws1, {Fr: ';'});
var obj = {Aa:'à',Ca:'â',Ha:'ä',Be:'é',Ae:'è',Ce:'ê',He:'ë',Ci:'î',Hi:'ï',Co:'ô',Ho:'ö',Au:'ù',Cu:'û',Hu:'ü',q:'æ',z:'œ',Kc:'ç',AA:'À',CA:'Â',HA:'Ä',BE:'É',AE:'È',CE:'Ê',HE:'Ë',CI:'Î',HI:'Ï',CO:'Ô',HO:'Ö',AU:'Ù',CU:'Û',HU:'Ü',a:'Æ',j:'Œ',KC:'Ç'};
csv = csv.replace(/(\u001B\u004E)(Aa|Ca|Ha|Be|Ae|Ce|He|Ci|Hi|Co|Ho|Au|Cu|Hu|q|z|Kc|AA|CA|HA|BE|AE|CE|HE|CI|HI|CO|HO|AU|CU|HU|a|j|KC)/gm, function(_, $1, $2){return obj[$2]});
fs.writeFileSync(file.path + '__', csv, 'utf8');

Could you please tell me if it's a «clean» way ?

Thanks for your reply, it's most clear for me :) I wrote the list of the french accent and I updated my code with : ![accent](https://user-images.githubusercontent.com/52039612/59916017-5f481500-941e-11e9-80a6-5ffcb1eb646c.jpg) ``` var wb = XLSX.readFile(file.path); var ws1 = wb.Sheets[wb.SheetNames[0]]; var csv = XLSX.utils.sheet_to_csv(ws1, {Fr: ';'}); var obj = {Aa:'à',Ca:'â',Ha:'ä',Be:'é',Ae:'è',Ce:'ê',He:'ë',Ci:'î',Hi:'ï',Co:'ô',Ho:'ö',Au:'ù',Cu:'û',Hu:'ü',q:'æ',z:'œ',Kc:'ç',AA:'À',CA:'Â',HA:'Ä',BE:'É',AE:'È',CE:'Ê',HE:'Ë',CI:'Î',HI:'Ï',CO:'Ô',HO:'Ö',AU:'Ù',CU:'Û',HU:'Ü',a:'Æ',j:'Œ',KC:'Ç'}; csv = csv.replace(/(\u001B\u004E)(Aa|Ca|Ha|Be|Ae|Ce|He|Ci|Hi|Co|Ho|Au|Cu|Hu|q|z|Kc|AA|CA|HA|BE|AE|CE|HE|CI|HI|CO|HO|AU|CU|HU|a|j|KC)/gm, function(_, $1, $2){return obj[$2]}); fs.writeFileSync(file.path + '__', csv, 'utf8'); ``` Could you please tell me if it's a «clean» way ?
SheetJSDev commented 2019-06-21 14:30:34 +00:00 (Migrated from github.com)

Ultimately this logic will be pushed into the slk parser.

There seem to be two different sets of patterns involved. In addition to the \u001BN encoding, there is another two-character encoding like \u001B 9 (ESC SPC 9) which maps to the tab character! That second one is a bit more regular: the first character encodes the high 4 bits (starting at SPC which has ASCII code 0x20) and the second character encodes the low 4 bits (starting at "0" which has ASCII code 0x30). Interestingly there are two valid encodings of Œ: \u001BNj and \u001B(< . We'll do a bit more digging to see if there's an actual pattern in the \001BN case.

PS: in your regular expression the first group is not necessary:

csv = csv.replace(/\u001B\u004E(Aa|Ca|Ha|Be|Ae|Ce|He|Ci|Hi|Co|Ho|Au|Cu|Hu|q|z|Kc|AA|CA|HA|BE|AE|CE|HE|CI|HI|CO|HO|AU|CU|HU|a|j|KC)/gm, function(_, $1){return obj[$1]});

If you wanted to go even further, you could programmatically generate the regexp:

csv = csv.replace(new RegExp("\u001BN(" + Object.keys(obj).join("|") + ")", "gm"), function(_, $1){return obj[$1]});
Ultimately this logic will be pushed into the slk parser. There seem to be two different sets of patterns involved. In addition to the `\u001BN` encoding, there is another two-character encoding like `\u001B 9` (ESC SPC 9) which maps to the tab character! That second one is a bit more regular: the first character encodes the high 4 bits (starting at SPC which has ASCII code `0x20`) and the second character encodes the low 4 bits (starting at "0" which has ASCII code `0x30`). Interestingly there are two valid encodings of `Œ`: `\u001BNj` and `\u001B(<` . We'll do a bit more digging to see if there's an actual pattern in the `\001BN` case. PS: in your regular expression the first group is not necessary: ```js csv = csv.replace(/\u001B\u004E(Aa|Ca|Ha|Be|Ae|Ce|He|Ci|Hi|Co|Ho|Au|Cu|Hu|q|z|Kc|AA|CA|HA|BE|AE|CE|HE|CI|HI|CO|HO|AU|CU|HU|a|j|KC)/gm, function(_, $1){return obj[$1]}); ``` If you wanted to go even further, you could programmatically generate the regexp: ```js csv = csv.replace(new RegExp("\u001BN(" + Object.keys(obj).join("|") + ")", "gm"), function(_, $1){return obj[$1]}); ```
Slayess commented 2019-06-22 06:19:44 +00:00 (Migrated from github.com)

Thanks a lot for your help ;)

Thanks a lot for your help ;)
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#1544
No description provided.