CDATA in cell values in XLSX format #775
Reference: sheetjs/sheetjs#775
I have a strange problem with js-xlsx:
I have XLSX files, and sometimes, when parsing them, all string fields are internally composed of CDATA strings. This tends to happen when I move files from Windows to Linux or vice versa.
This does NOT happen with XLS files.
This is an example:
The XLSX files can be read without a problem in LibreOffice.
Any ideas?
Regards
Andreas
http://oss.sheetjs.com/
I can also replicate the bug on this page. All strings seem to be changed to the CDATA representation.
I fixed it with an XML decoding library, but I still don't understand why this error occurs...
@awb99 thanks for the report!
XLSX files are really ZIP containers holding XML files. The strings in the workbook are usually stored in an XML file, xl/sharedStrings.xml, within the XLSX file. Most writers follow the ECMA-376 spec, which has its own style of encoding special characters (such as _x0010_). You stumbled upon a file whose writer opted for CDATA sections instead. The parser doesn't understand CDATA blocks, so it just dumps the entire string.
This doesn't show up in XLS because the representation is completely different. There, the data is stored in binary records and the strings are usually stored as length-prefixed UTF-16 or codepage strings, avoiding the XML issue entirely.
We'll push a fix in the next version.
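For anyone hitting this before the fix lands, the two encodings described above can be sketched roughly as follows. This is only an illustration and not the library's actual parser: decodeXmlText and decodeOoxmlEscapes are hypothetical helper names, the input is assumed to be the raw inner text of a string element from xl/sharedStrings.xml, and only the five predefined XML entities are handled (numeric character references are left out).

```javascript
// Hypothetical helpers for illustration; not part of js-xlsx.

// The five predefined XML entities.
const XML_ENTITIES = {
  "&lt;": "<",
  "&gt;": ">",
  "&amp;": "&",
  "&quot;": '"',
  "&apos;": "'"
};

// Decode predefined entities in ordinary (non-CDATA) character data.
function decodeEntities(s) {
  return s.replace(/&(?:lt|gt|amp|quot|apos);/g, (e) => XML_ENTITIES[e]);
}

// Decode text that may mix CDATA sections with entity-escaped text.
function decodeXmlText(raw) {
  const cdata = /<!\[CDATA\[([\s\S]*?)\]\]>/g;
  let out = "";
  let last = 0;
  let m;
  while ((m = cdata.exec(raw)) !== null) {
    out += decodeEntities(raw.slice(last, m.index)); // escaped text before the section
    out += m[1]; // CDATA content is taken verbatim, no entity decoding
    last = cdata.lastIndex;
  }
  return out + decodeEntities(raw.slice(last));
}

// Decode the _xHHHH_ escapes that most writers use for special characters.
// A literal underscore is itself written as _x005F_.
function decodeOoxmlEscapes(s) {
  return s.replace(/_x([0-9A-Fa-f]{4})_/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

console.log(decodeXmlText("<![CDATA[Tom & Jerry]]>")); // Tom & Jerry
console.log(decodeXmlText("Tom &amp; Jerry"));         // Tom & Jerry
```

Both inputs decode to the same string, which is why a file written with CDATA sections opens fine in other applications while a parser that only handles the entity form returns the wrapper text as part of the value.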