Weird Characters parsed #2081

Closed
opened 2020-08-11 04:02:53 +00:00 by lxzhh · 3 comments
lxzhh commented 2020-08-11 04:02:53 +00:00 (Migrated from github.com)

I have this example CSV file. When I parse it using:

`const workbook = XLSX.read(data, { type: 'array' })`

it outputs characters like `<p>Â </p>`, where the content is actually a space.

[example_error_character.csv.zip](https://github.com/SheetJS/sheetjs/files/5054670/example_error_character.csv.zip)
SheetJSDev commented 2020-08-11 04:14:31 +00:00 (Migrated from github.com)

It's a UTF-8 CSV, but it is missing the BOM. You can see this using `xxd`:

```
00000070: 6972 204d 6174 222c 223c 703e c2a0 3c2f  ir Mat","<p>..</
00000080: 703e 0a3c 703e c2a0 3c2f 703e 222c 2c2c  p>.<p>..</p>",,,
```

To force a UTF-8 interpretation, pass the option `codepage: 65001`:

    const workbook = XLSX.read(data, { type: 'array', codepage: 65001 })
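To illustrate why the output shows `Â`, here is a minimal sketch that reproduces the effect outside the library. The bytes `c2 a0` in the dump are the UTF-8 encoding of U+00A0 (no-break space); read one character per byte, as a single-byte codepage would, they become `Â` followed by a no-break space. The variable names are illustrative, and the sketch assumes an environment with the standard `TextDecoder` global (modern browsers and Node.js):

```js
/* the bytes for <p>..</p> as they appear in the xxd dump */
const sampleBytes = new Uint8Array([0x3c, 0x70, 0x3e, 0xc2, 0xa0, 0x3c, 0x2f, 0x70, 0x3e]);

/* UTF-8 interpretation: c2 a0 decodes to the single character U+00A0 */
const asUtf8 = new TextDecoder("utf-8").decode(sampleBytes);

/* one-character-per-byte interpretation: c2 -> U+00C2 (Â), a0 -> U+00A0 */
const asBytes = Array.from(sampleBytes, b => String.fromCharCode(b)).join("");

console.log(asUtf8.length);  // 8 characters
console.log(asBytes.length); // 9 characters, including the stray Â
```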
lxzhh commented 2020-08-11 05:29:21 +00:00 (Migrated from github.com)

> It's a UTF-8 CSV, but it is missing the BOM. You can see this using `xxd`:
>
> ```
> 00000070: 6972 204d 6174 222c 223c 703e c2a0 3c2f  ir Mat","<p>..</
> 00000080: 703e 0a3c 703e c2a0 3c2f 703e 222c 2c2c  p>.<p>..</p>",,,
> ```
>
> To force a UTF-8 interpretation, pass the option `codepage: 65001`:
>
>     const workbook = XLSX.read(data, { type: 'array', codepage: 65001 })

Thanks for the response.
I've tried adding the `codepage` option, but it still does not work. It still outputs `<p>Â </p>`.
SheetJSDev commented 2020-08-11 05:46:30 +00:00 (Migrated from github.com)

You're right, the array case in https://github.com/SheetJS/sheetjs/blob/master/bits/40_harb.js#L888 does not handle the `codepage` argument. As a temporary workaround, convert to a binary string as shown in https://jsfiddle.net/7Lrmxb8c/ :

```js
/* assuming data is an Array or Uint8Array */
const binary = [...data].map(x => String.fromCharCode(x)).join("");
const workbook = XLSX.read(binary, { type: 'binary', codepage: 65001 });
```
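For larger files, the same conversion can be wrapped in a helper that builds the binary string in chunks instead of creating one array element per byte. This is only a sketch: `bytesToBinaryString` and its `chunkSize` parameter are hypothetical names, not part of SheetJS, and the result should be identical to the one-liner above.

```js
/* Hypothetical helper (not a SheetJS API): one character per input byte */
function bytesToBinaryString(data, chunkSize = 0x8000) {
  /* accept a plain Array of byte values or a Uint8Array */
  const bytes = data instanceof Uint8Array ? data : Uint8Array.from(data);
  const parts = [];
  for (let i = 0; i < bytes.length; i += chunkSize) {
    /* convert one chunk at a time to keep the argument list small */
    parts.push(String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize)));
  }
  return parts.join("");
}

/* Intended usage, mirroring the workaround above (requires the xlsx library):
   const workbook = XLSX.read(bytesToBinaryString(data), { type: 'binary', codepage: 65001 });
*/
```

The chunking avoids passing a very long argument list to `String.fromCharCode` in a single call, which some engines reject.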
Reference: sheetjs/sheetjs#2081