Weird character conversion when input XLSX file contains _X2069_ #3177

Closed
opened 2024-08-14 12:12:54 +00:00 by inreoh · 2 comments

(this was originally posted on StackOverflow as a question regarding node-xlsx )

I'm having a weird problem with reading XLSX files when the input contains the character sequence _X2069_ (underscore X 2 0 6 9 underscore).

Upon reading, the string _X2069_ is converted to the Unicode character U+2069 (Pop Directional Isolate).

Here's a minimal code sample that exhibits the problem (I tried different values for the "type" option, but it seems to have no effect):

 const xlsx = require('xlsx');

 const workbook = xlsx.readFile('./input.xlsx', {type: 'binary'});
 console.log(workbook.Sheets);

Here's my input (see also attached file input.xlsx):
image

Here's a screenshot of the output:
image

OS: Windows 11 (German language settings)
Excel: 2016 (used to generate the input file)
xlsx: v0.20.2

This looks like it might be related to Issue 72: full Unicode support, but I'm not 100% sure.

How can I ensure my input is parsed properly?

(this was [originally posted on StackOverflow](https://stackoverflow.com/questions/78846937/weird-character-conversion-when-x2069-is-contained-in-input-sheet-for-node-xls) as a question regarding node-xlsx ) I'm having a weird problem with reading XLSX files when the input contains the character sequence `_X2069_` (underscore X 2 0 6 9 underscore). Upon reading, the string `_X2069_` is converted to the Unicode character U+2069 (Pop Directional Isolate). Here's a minimal code sample that exhibits the problem (I tried different values for the "type" option, but it seems to have no effect): const xlsx = require('xlsx'); const workbook = xlsx.readFile('./input.xlsx', {type: 'binary'}); console.log(workbook.Sheets); Here's my input (see also attached file input.xlsx): ![image](/attachments/c959125d-beda-460b-b3e6-0a19efc4d65c) Here's a screenshot of the output: ![image](/attachments/1a0251ee-868b-4edd-a118-5de1cf3b74e7) OS: Windows 11 (German language settings) Excel: 2016 (used to generate the input file) xlsx: v0.20.2 This looks like it might be related to [Issue 72: full Unicode support](https://git.sheetjs.com/sheetjs/sheetjs/issues/72), but I'm not 100% sure. How can I ensure my input is parsed properly?
Owner

Thanks for raising an issue!

In XLSX, certain unicode characters are encoded using a special representation _x####_ based on the hexadecimal code. For example, the string _x2069_ is actually stored as _x005F_x2069_ (where the first underscore is encoded as _x005F_, the character code for _)

The current code treats the x as case insensitive, so _X2069_ is treated as the encoded version of the string. Fortunately it is a one-character fix:

--- a/bits/22_xmlutils.js
+++ b/bits/22_xmlutils.js
@@ -68,7 +68,7 @@ var rencoding = /*#__PURE__*/evert(encodings);
 var unescapexml/*:StringConv*/ = /*#__PURE__*/(function() {
        /* 22.4.2.4 bstr (Basic String) */
-       var encregex = /&(?:quot|apos|gt|lt|amp|#x?([\da-fA-F]+));/ig, coderegex = /_x([\da-fA-F]{4})_/ig;
+       var encregex = /&(?:quot|apos|gt|lt|amp|#x?([\da-fA-F]+));/ig, coderegex = /_x([\da-fA-F]{4})_/g;
        function raw_unescapexml(text/*:string*/)/*:string*/ {
                var s = text + '', i = s.indexOf("<![CDATA[");
                if(i == -1) return s.replace(encregex, function($$, $1) { return encodings[$$]||String.fromCharCode(parseInt($1,$$.indexOf("x")>-1?16:10))||$$; }).replace(coderegex,function(m,c) {return String.fromCharCode(parseInt(c,16));});
Thanks for raising an issue! In XLSX, certain unicode characters are encoded using a special representation `_x####_` based on the hexadecimal code. For example, the string `_x2069_` is actually stored as `_x005F_x2069_` (where the first underscore is encoded as `_x005F_`, the character code for `_`) The current code treats the `x` as case insensitive, so `_X2069_` is treated as the encoded version of the string. Fortunately it is a one-character fix: ```diff --- a/bits/22_xmlutils.js +++ b/bits/22_xmlutils.js @@ -68,7 +68,7 @@ var rencoding = /*#__PURE__*/evert(encodings); var unescapexml/*:StringConv*/ = /*#__PURE__*/(function() { /* 22.4.2.4 bstr (Basic String) */ - var encregex = /&(?:quot|apos|gt|lt|amp|#x?([\da-fA-F]+));/ig, coderegex = /_x([\da-fA-F]{4})_/ig; + var encregex = /&(?:quot|apos|gt|lt|amp|#x?([\da-fA-F]+));/ig, coderegex = /_x([\da-fA-F]{4})_/g; function raw_unescapexml(text/*:string*/)/*:string*/ { var s = text + '', i = s.indexOf("<![CDATA["); if(i == -1) return s.replace(encregex, function($$, $1) { return encodings[$$]||String.fromCharCode(parseInt($1,$$.indexOf("x")>-1?16:10))||$$; }).replace(coderegex,function(m,c) {return String.fromCharCode(parseInt(c,16));}); ```
Author

So the root cause for the issue is that the code thinks that _X2069_ is actually _x2069_ and therefore part of a special representation? Ok, that explains it. Do you have a rough estimate when the fixed version will become available?

So the root cause for the issue is that the code thinks that `_X2069_` is actually `_x2069_` and therefore part of a special representation? Ok, that explains it. Do you have a rough estimate when the fixed version will become available?
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#3177
No description provided.