CDATA in cell values in XLSX format #775

Closed
opened 2017-08-10 15:56:42 +00:00 by awb99 · 3 comments
awb99 commented 2017-08-10 15:56:42 +00:00 (Migrated from github.com)

I have a strange problem with js-xlsx:

I have XLSX files, and sometimes, when parsing them, all
string fields are internally comosed of CDATA strings. This tends to happen when I move
files form Windows to Linux or vice versa.

This does NOT happen in xls files

This is an example:

{t: "s", v: "<![CDATA[Team Vienna]]>", w: "<![CDATA[Team Vienna]]>"}

The XLSX files can be read without a problem with Libreoffice.
Any ideas?

Regards
Andreas

I have a strange problem with js-xlsx: I have XLSX files, and sometimes, when parsing them, all string fields are internally comosed of CDATA strings. This tends to happen when I move files form Windows to Linux or vice versa. This does NOT happen in xls files This is an example: ``` {t: "s", v: "<![CDATA[Team Vienna]]>", w: "<![CDATA[Team Vienna]]>"} ``` The XLSX files can be read without a problem with Libreoffice. Any ideas? Regards Andreas
awb99 commented 2017-08-10 16:50:12 +00:00 (Migrated from github.com)

http://oss.sheetjs.com/

I can replicate the bug also on this page. All strings seem to be changed to the CDATA representation.

http://oss.sheetjs.com/ I can replicate the bug also on this page. All strings seem to be changed to the CDATA representation.
awb99 commented 2017-08-10 17:01:54 +00:00 (Migrated from github.com)

I fixed it with a XML decoding library. But I still don't get it whythis error occurs...

var xmldecode = require('xmldecode');
 var cell_ref = XLSX.utils.encode_cell({ c: col, r: row });
    var desired_cell = ws[cell_ref];
    var desired_value = desired_cell.v;
    if (typeof desired_value == "string") {
        desired_value = xmldecode (desired_value);
   }

I fixed it with a XML decoding library. But I still don't get it whythis error occurs... ``` var xmldecode = require('xmldecode'); var cell_ref = XLSX.utils.encode_cell({ c: col, r: row }); var desired_cell = ws[cell_ref]; var desired_value = desired_cell.v; if (typeof desired_value == "string") { desired_value = xmldecode (desired_value); } ```
SheetJSDev commented 2017-08-10 17:09:33 +00:00 (Migrated from github.com)

@awb99 thanks for the report!

XLSX files are really ZIP containers with XML files. The strings in the workbook are usually stored in an XML file xl/sharedStrings.xml within the XLSX file. Most writers follow the ECMA-376 spec, which has its own style of encoding special characters (x0010). You stumbled upon a file which opted for the CDATA section logic. The parser doesn't understand CDATA blocks, so it just dumps the entire string.

This doesn't show up in XLS because the representation is completely different. There, the data is stored in binary records and the strings are usually stored in length-prefixed UTF-16 or codepage strings, avoiding the XML issue entirely.

We'll push a fix in the next version.

@awb99 thanks for the report! XLSX files are really ZIP containers with XML files. The strings in the workbook are usually stored in an XML file `xl/sharedStrings.xml` within the XLSX file. Most writers follow the ECMA-376 spec, which has its own style of encoding special characters (_x0010_). You stumbled upon a file which opted for the CDATA section logic. The parser doesn't understand CDATA blocks, so it just dumps the entire string. This doesn't show up in XLS because the representation is completely different. There, the data is stored in binary records and the strings are usually stored in length-prefixed UTF-16 or codepage strings, avoiding the XML issue entirely. We'll push a fix in the next version.
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#775
No description provided.