Chinese Characters Not Showing Properly #1218

Closed
opened 2018-08-14 18:16:02 +00:00 by starmon00 · 5 comments
starmon00 commented 2018-08-14 18:16:02 +00:00 (Migrated from github.com)

I have a csv exported from google sheets

screen shot 2018-08-14 at 11 11 49 am

(Since I can't upload csv files on here, here's the link to the google sheet. Please download as csv when testing!)

https://docs.google.com/spreadsheets/d/1fJ4UqUlxAe9aP74iVVP17kwuP2WEeOqVQnImLZ7vt3k/edit?usp=sharing

My Code

const csvFile = './fileName.csv';
const workbook = xlsx.readFile(csvFile, { codepage: 65001 });
const firstSheet = workbook.SheetNames[0];
const worksheet = workbook.Sheets[firstSheet];
debug(`WORKSHEET: ${util.inspect(worksheet)}`);

Unfortunately when I print it on my terminal, the Chinese Characters (A3 in traditional chinese, A4 in simplified chinese) are not the same as the csv file.

screen shot 2018-08-14 at 11 14 36 am

I read somewhere that google exports UTF-8 so that is why I used the codepage 65001. Without it, it would spit out gibberish.

If the issue isn't me, would the workaround be to convert to UTF-16? Is this even possible?

I have a csv exported from google sheets <img width="889" alt="screen shot 2018-08-14 at 11 11 49 am" src="https://user-images.githubusercontent.com/7799546/44109823-de700f4c-9fb2-11e8-99a3-0792b8f287d6.png"> (Since I can't upload csv files on here, here's the link to the google sheet. Please download as csv when testing!) https://docs.google.com/spreadsheets/d/1fJ4UqUlxAe9aP74iVVP17kwuP2WEeOqVQnImLZ7vt3k/edit?usp=sharing My Code const csvFile = './fileName.csv'; const workbook = xlsx.readFile(csvFile, { codepage: 65001 }); const firstSheet = workbook.SheetNames[0]; const worksheet = workbook.Sheets[firstSheet]; debug(`WORKSHEET: ${util.inspect(worksheet)}`); Unfortunately when I print it on my terminal, the Chinese Characters (A3 in traditional chinese, A4 in simplified chinese) are not the same as the csv file. ![screen shot 2018-08-14 at 11 14 36 am](https://user-images.githubusercontent.com/7799546/44109968-49af4a16-9fb3-11e8-91b3-c093a2a3ac50.png) I read somewhere that google exports UTF-8 so that is why I used the codepage 65001. Without it, it would spit out gibberish. If the issue isn't me, would the workaround be to convert to UTF-16? Is this even possible?
SheetJSDev commented 2018-08-14 18:24:25 +00:00 (Migrated from github.com)

Let's see where the issue is coming from. Can you share the actual google sheet (make it public or protected, and post a share link here) -- you can make a new sheet with just the screenshotted contents

Let's see where the issue is coming from. Can you share the actual google sheet (make it public or protected, and post a share link here) -- you can make a new sheet with just the screenshotted contents
starmon00 commented 2018-08-14 18:24:59 +00:00 (Migrated from github.com)

@SheetJSDev edited original post with link already, otherwise it's here as well.

https://docs.google.com/spreadsheets/d/1fJ4UqUlxAe9aP74iVVP17kwuP2WEeOqVQnImLZ7vt3k/edit?usp=sharing

@SheetJSDev edited original post with link already, otherwise it's here as well. https://docs.google.com/spreadsheets/d/1fJ4UqUlxAe9aP74iVVP17kwuP2WEeOqVQnImLZ7vt3k/edit?usp=sharing
SheetJSDev commented 2018-08-14 18:37:15 +00:00 (Migrated from github.com)

Thanks for sharing, you can take down the sheet. The XLSX export seems to be parsed correctly but the CSV export is not

Thanks for sharing, you can take down the sheet. The XLSX export seems to be parsed correctly but the CSV export is not
starmon00 commented 2018-08-14 18:58:34 +00:00 (Migrated from github.com)

Unfortunately, I'm required to only work with CSV files, thanks for looking into it. Hopefully fixes will come soon :)

Unfortunately, I'm required to only work with CSV files, thanks for looking into it. Hopefully fixes will come soon :)
SheetJSDev commented 2018-08-14 19:05:15 +00:00 (Migrated from github.com)

Inverting through windows-1252 isn't always consistent with node buffer behavior, the following patch handles the buffer directly rather than trying to encode/decode a binary string:

--- a/bits/40_harb.js
+++ b/bits/40_harb.js
@@ -833,13 +833,17 @@ var PRN = (function() {
                switch(opts.type) {
                        case 'base64': str = Base64.decode(d); break;
                        case 'binary': str = d; break;
-                       case 'buffer': str = d.toString('binary'); break;
+                       case 'buffer':
+                               if(opts.codepage == 65001) str = d.toString('utf8');
+                               else if(opts.codepage && typeof cptable !== 'undefined') str = cptable.utils.decode(opts.codepage, d);
+                               else str = d.toString('binary');
+                               break;
                        case 'array': str = cc2str(d); break;
                        case 'string': str = d; break;
                        default: throw new Error("Unrecognized type " + opts.type);
                }
                if(bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) str = utf8read(str.slice(3));
-               else if((opts.type == 'binary' || opts.type == 'buffer') && typeof cptable !== 'undefined' && opts.codepage)  str = cptable.utils.decode(opts.codepage, cptable.utils.encode(1252,str));
+               else if((opts.type == 'binary') && typeof cptable !== 'undefined' && opts.codepage)  str = cptable.utils.decode(opts.codepage, cptable.utils.encode(1252,str));
                if(str.slice(0,19) == "socialcalc:version:") return ETH.to_sheet(opts.type == 'string' ? str : utf8read(str), opts);
                return prn_to_sheet_str(str, opts);
        }

Feel free to submit that as a PR

Inverting through windows-1252 isn't always consistent with node buffer behavior, the following patch handles the buffer directly rather than trying to encode/decode a binary string: ```diff --- a/bits/40_harb.js +++ b/bits/40_harb.js @@ -833,13 +833,17 @@ var PRN = (function() { switch(opts.type) { case 'base64': str = Base64.decode(d); break; case 'binary': str = d; break; - case 'buffer': str = d.toString('binary'); break; + case 'buffer': + if(opts.codepage == 65001) str = d.toString('utf8'); + else if(opts.codepage && typeof cptable !== 'undefined') str = cptable.utils.decode(opts.codepage, d); + else str = d.toString('binary'); + break; case 'array': str = cc2str(d); break; case 'string': str = d; break; default: throw new Error("Unrecognized type " + opts.type); } if(bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) str = utf8read(str.slice(3)); - else if((opts.type == 'binary' || opts.type == 'buffer') && typeof cptable !== 'undefined' && opts.codepage) str = cptable.utils.decode(opts.codepage, cptable.utils.encode(1252,str)); + else if((opts.type == 'binary') && typeof cptable !== 'undefined' && opts.codepage) str = cptable.utils.decode(opts.codepage, cptable.utils.encode(1252,str)); if(str.slice(0,19) == "socialcalc:version:") return ETH.to_sheet(opts.type == 'string' ? str : utf8read(str), opts); return prn_to_sheet_str(str, opts); } ``` Feel free to submit that as a PR
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#1218
No description provided.