how to resolve the encode problem? #739

Closed
opened 2017-07-20 02:47:33 +00:00 by LittleBreak · 21 comments
LittleBreak commented 2017-07-20 02:47:33 +00:00 (Migrated from github.com)

i got a xml file include some chinese characters, when i use the XLSX.write api, the chinese characters changed.
here is my code
workbook is a xml string
var workbook = XLSX.read(content, {type: 'binary'});
workbook.Sheets['ag-grid']['A24'].v
image
the cell value became this, the oringin value is '中文测试',apologize for my bed english,.

i got a xml file include some chinese characters, when i use the XLSX.write api, the chinese characters changed. here is my code workbook is a xml string `var workbook = XLSX.read(content, {type: 'binary'});` workbook.Sheets['ag-grid']['A24'].v ![image](https://user-images.githubusercontent.com/16707072/28398132-5dfea1b4-6d38-11e7-84d4-e884a7c64967.png) the cell value became this, the oringin value is '中文测试',apologize for my bed english,.
reviewher commented 2017-07-20 02:51:30 +00:00 (Migrated from github.com)

Can you share the original file and a screenshot of what excel shows?

Can you share the original file and a screenshot of what excel shows?
kwangin commented 2017-07-28 05:28:58 +00:00 (Migrated from github.com)

I also have the same question. I'm working with japanese/korean. It cannot read these words, but english shows no problem.
Target excel file is the one which generated from python workbook library.
But something weird is, it reads the data well after export new xlsx file from original one without no changing.

I also have the same question. I'm working with japanese/korean. It cannot read these words, but english shows no problem. Target excel file is the one which generated from python workbook library. But something weird is, it reads the data well after export new xlsx file from original one without no changing.
SheetJSDev commented 2017-07-28 05:33:09 +00:00 (Migrated from github.com)

@kwangin Can you upload or send us:

  1. the original file (that was causing an issue)
  2. an Excel screenshot showing the right values
  3. the new file after export

If you are running in the browser, are you using the full version or the core version?

@kwangin Can you upload or send us: 1) the original file (that was causing an issue) 2) an Excel screenshot showing the right values 3) the new file after export If you are running in the browser, are you using the full version or the core version?
PWDream commented 2017-09-20 04:23:40 +00:00 (Migrated from github.com)

I also have the same question.@kwangin you solve the problem?
@SheetJSDev My files are as follows
预算_月度累计预算_20170920_1216.xlsx

I also have the same question.@kwangin you solve the problem? @SheetJSDev My files are as follows [预算_月度累计预算_20170920_1216.xlsx](https://github.com/SheetJS/js-xlsx/files/1316591/_._20170920_1216.xlsx)
SheetJSDev commented 2017-09-20 04:47:57 +00:00 (Migrated from github.com)

@PWDream thanks for sharing! This was very helpful. It looks like a file from Apache POI.

The tool that wrote the file used xml entities rather than the UTF8 string. Despite the fact that the subfile clearly has a UTF-8 encoding, the writer opted to use XML entities. This is the text for A1:

<!-- this is from the @PWDream file -->
<t>&#39044;&#31639;&#32534;&#30721;</t>

In a good file generated by Excel, the stored text is written directly:

<!-- this is from the resaved file -->
<t>预算编码</t>

Looking at other files in the test suite, those values should have be encoded using hexadecimal notation:

<!-- this is what should have been written -->
<t>&#x9884;&#x7b97;&#x7f16;&#x7801;</t>

The XLSX reader accepts UTF8-encoded Chinese characters (second example) and hexadecimal digit entities (third example),
but the relevant entity checker does not accept the decimal digit case (problematic case). Since Excel accepts it, we probably should do the same. Fortunately the fix is straightforward.

@PWDream thanks for sharing! This was very helpful. It looks like a file from Apache POI. The tool that wrote the file used xml entities rather than the UTF8 string. Despite the fact that the subfile clearly has a UTF-8 encoding, the writer opted to use XML entities. This is the text for A1: ```xml <!-- this is from the @PWDream file --> <t>&#39044;&#31639;&#32534;&#30721;</t> ``` In a good file generated by Excel, the stored text is written directly: ```xml <!-- this is from the resaved file --> <t>预算编码</t> ``` Looking at other files in the test suite, those values should have be encoded using hexadecimal notation: ```xml <!-- this is what should have been written --> <t>&#x9884;&#x7b97;&#x7f16;&#x7801;</t> ``` The XLSX reader accepts UTF8-encoded Chinese characters (second example) and hexadecimal digit entities (third example), but the [relevant entity checker](https://github.com/SheetJS/js-xlsx/blob/master/bits/22_xmlutils.js#L45) does not accept the decimal digit case (problematic case). Since Excel accepts it, we probably should do the same. Fortunately the fix is straightforward.
PWDream commented 2017-09-20 05:00:39 +00:00 (Migrated from github.com)

Thank you very much for your reply.What time do you fix this problem?

Thank you very much for your reply.What time do you fix this problem?
SheetJSDev commented 2017-09-20 05:14:59 +00:00 (Migrated from github.com)

A small local change seemed to have fixed the immediate problem but we need to do some more tests to be sure. Expect a release tomorrow or Friday

A small local change seemed to have fixed the immediate problem but we need to do some more tests to be sure. Expect a release tomorrow or Friday
PWDream commented 2017-09-20 05:15:56 +00:00 (Migrated from github.com)

Thank you very much

Thank you very much
PWDream commented 2017-09-21 08:57:05 +00:00 (Migrated from github.com)

@SheetJSDev Hello, is this problem solved?

@SheetJSDev Hello, is this problem solved?
PWDream commented 2017-09-22 04:11:30 +00:00 (Migrated from github.com)

@SheetJSDev Hello, is this problem solved?

@SheetJSDev Hello, is this problem solved?
PWDream commented 2017-09-22 06:51:39 +00:00 (Migrated from github.com)

@SheetJSDev Hello, is this problem solved? or you tell me how to change, I go to change.Thanks

@SheetJSDev Hello, is this problem solved? or you tell me how to change, I go to change.Thanks
SheetJSDev commented 2017-09-23 00:14:45 +00:00 (Migrated from github.com)

@PWDream it's available now in version 0.11.4

@PWDream it's available now in version 0.11.4
PWDream commented 2017-09-24 23:23:26 +00:00 (Migrated from github.com)

@SheetJSDev Thank you very much for your changes.I have upgraded to the latest version,the coding problem has been solved,but why did I turn behind the three empty lines after the line.
image
My files are as follows

预算_+——)(-&……%¥#@!-_20170925_0715.xlsx

@SheetJSDev Thank you very much for your changes.I have upgraded to the latest version,the coding problem has been solved,but why did I turn behind the three empty lines after the line. ![image](https://user-images.githubusercontent.com/4903755/30787730-0df3a9a8-a1c2-11e7-8022-a5b0dea0c1fe.png) My files are as follows [预算_+——)(-&……%¥#@!-_20170925_0715.xlsx](https://github.com/SheetJS/js-xlsx/files/1327865/_.-.-_20170925_0715.xlsx)
SheetJSDev commented 2017-09-24 23:31:53 +00:00 (Migrated from github.com)

The file has a weird record near the end of the worksheet:

<row r="5">
</row>

If you want to hide those blank rows, when converting to CSV pass the option blankrows:false. For example, in node:

> var XLSX = require('xlsx');
> var wb = XLSX.readFile('_.-.-_20170925_0715.xlsx');
> var ws = wb.Sheets[wb.SheetNames[0]];

> console.log(XLSX.utils.sheet_to_csv(ws)); // this shows the blank lines
预算编码,项目,2017年预算金额,预算超额时控制,负责人
,项目1,0.00,允许单据提交,
,,,,
,,,,
,,,,

> console.log(XLSX.utils.sheet_to_csv(ws, {blankrows:false})) // this will hide those rows
预算编码,项目,2017年预算金额,预算超额时控制,负责人
,项目1,0.00,允许单据提交,
The file has a weird record near the end of the worksheet: ```xml <row r="5"> </row> ``` If you want to hide those blank rows, when converting to CSV pass the [option `blankrows:false`](https://github.com/sheetjs/js-xlsx#delimiter-separated-output). For example, in node: ```js > var XLSX = require('xlsx'); > var wb = XLSX.readFile('_.-.-_20170925_0715.xlsx'); > var ws = wb.Sheets[wb.SheetNames[0]]; > console.log(XLSX.utils.sheet_to_csv(ws)); // this shows the blank lines 预算编码,项目,2017年预算金额,预算超额时控制,负责人 ,项目1,0.00,允许单据提交, ,,,, ,,,, ,,,, > console.log(XLSX.utils.sheet_to_csv(ws, {blankrows:false})) // this will hide those rows 预算编码,项目,2017年预算金额,预算超额时控制,负责人 ,项目1,0.00,允许单据提交, ```
PWDream commented 2017-09-24 23:34:26 +00:00 (Migrated from github.com)

Thank you very much!

Thank you very much!
happy-ruby commented 2020-09-20 13:48:30 +00:00 (Migrated from github.com)

Hello
I have the same issue for the encoding.
XLS BIFF8(office 97-2004 document) and contains japanese characters.
So i have used codepage:932 for japanese and it's working well for csv files.
Unfortunately, it's not working for office 97 xls file.
Manually, i open office 97 xls file and save as xlsx or csv file and one alert prompted as like that:
"Some features in your workbook must be lost if you save it as Microsoft Excel 5.0/95 Workbook.
Do you want to keep using that format?"
After i click "Yes" and save, then it's working without any problem.(of course we use codepage: 932)
Could you please give me instruction how can i solve that issue?
Thanks.

Hello I have the same issue for the encoding. XLS BIFF8(office 97-2004 document) and contains japanese characters. So i have used codepage:932 for japanese and it's working well for csv files. Unfortunately, it's not working for office 97 xls file. Manually, i open office 97 xls file and save as xlsx or csv file and one alert prompted as like that: "Some features in your workbook must be lost if you save it as Microsoft Excel 5.0/95 Workbook. Do you want to keep using that format?" After i click "Yes" and save, then it's working without any problem.(of course we use codepage: 932) Could you please give me instruction how can i solve that issue? Thanks.
SheetJSDev commented 2020-09-20 21:54:11 +00:00 (Migrated from github.com)

@WangHwaKok is this happening when you read in a file or make one from scratch? Can you share the bad file?

@WangHwaKok is this happening when you read in a file or make one from scratch? Can you share the bad file?
happy-ruby commented 2020-09-21 06:45:47 +00:00 (Migrated from github.com)

Hello @SheetJSDev
Yes, I am trying to read a downloaded xls file from the site.
I share office 97 xls file.(it is zipped)
test.zip
Please take a look and hope we will solve the issue easily.

Hello @SheetJSDev Yes, I am trying to read a downloaded xls file from the site. I share office 97 xls file.(it is zipped) [test.zip](https://github.com/SheetJS/sheetjs/files/5253772/test.zip) Please take a look and hope we will solve the issue easily.
SheetJSDev commented 2020-09-21 14:15:07 +00:00 (Migrated from github.com)

Does this actually work for you in Excel @WangHwaKok ? The file has a CodePage record but the specified codepage is 1252.

Does this actually work for you in Excel @WangHwaKok ? The file has a [CodePage record](https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/03651cf7-1926-4356-b42b-a378938975e5) but the specified codepage is [1252](https://en.wikipedia.org/wiki/Windows-1252).
happy-ruby commented 2020-09-21 15:34:37 +00:00 (Migrated from github.com)

image
Yes, working well so i can see japanese characters in Excel.
Normally, we can't see original japanese characters.
I am using win 10 OS and changed region/language setting as Japanese.
image
As i said before, after i save the file again & click 'yes' on alert message then i can analyze the file with codepage: 932 and i can see original characters.
Hopefully this can be helpful.

![image](https://user-images.githubusercontent.com/58267901/93785240-0cc85a80-fc58-11ea-9496-2421ec681797.png) Yes, working well so i can see japanese characters in Excel. Normally, we can't see original japanese characters. I am using win 10 OS and changed region/language setting as Japanese. ![image](https://user-images.githubusercontent.com/58267901/93786138-23bb7c80-fc59-11ea-9fb2-7c26cd44048d.png) As i said before, after i save the file again & click 'yes' on alert message then i can analyze the file with codepage: 932 and i can see original characters. Hopefully this can be helpful.
SheetJSDev commented 2020-09-21 18:23:58 +00:00 (Migrated from github.com)

This is very strange. I will raise a new issue

This is very strange. I will raise a new issue
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#739
No description provided.