About Chinese garbled code with DBF files #2781

Closed
opened 2022-09-03 04:08:18 +00:00 by ZJS248 · 3 comments
ZJS248 commented 2022-09-03 04:08:18 +00:00 (Migrated from github.com)

When I try to create a DBF file that includes Chinese words, the library replaces the characters with underscores, like this: '__'
Here is the code and the result:

const xlsx = require("xlsx");

const json = [
  {
    A1: "2020-01-04",
    A2: "English",
  },
  {
    A1: "2020-01-04",
    A2: "中文",
  },
];

const book = xlsx.utils.book_new();
const sheet = xlsx.utils.json_to_sheet(json);
xlsx.utils.book_append_sheet(book, sheet, "sheet1");
xlsx.writeFile(book, "./test.dbf", {bookType: "dbf"});
   | A1 | A2
-- | -- | --
1 | 2020-01-04 | English
2 | 2020-01-04 | __
SheetJSDev commented 2022-09-03 20:22:08 +00:00 (Migrated from github.com)

Thanks for raising the issue!

Attached is a ZIP file containing 4 DBF files for 4 separate encodings. Please open each one in your application and confirm all four display the correct characters.

issue2781.zip

As for the fix, there are two parts:

  1. Currently, for the legacy formats, the non-ASCII characters are replaced: https://github.com/SheetJS/sheetjs/blob/master/bits/23_binutils.js#L194

This can be patched as follows:

--- a/bits/23_binutils.js
+++ b/bits/23_binutils.js
@@ -189,13 +189,18 @@ function WriteShift(t/*:number*/, val/*:string|number*/, f/*:?string*/)/*:any*/
                                var cppayload = $cptable.utils.encode(current_ansi, val.charAt(i));
                                this[this.l + i] = cppayload[0];
                        }
+                       size = val.length;
+               } else if(typeof $cptable !== 'undefined') {
+                       var cppayload = $cptable.utils.encode(current_ansi, val);
+                       for(i = 0; i < cppayload.length; ++i) this[this.l + i] = cppayload[i];
+                       size = cppayload.length;
                } else {
                        /*:: if(typeof val !== 'string') throw new Error("unreachable"); */
                        val = val.replace(/[^\x00-\x7F]/g, "_");
                        /*:: if(typeof val !== 'string') throw new Error("unreachable"); */
                        for(i = 0; i != val.length; ++i) this[this.l + i] = (val.charCodeAt(i) & 0xFF);
+                       size = val.length;
                }
-               size = val.length;
        } else if(f === 'hex') {
                for(; i < t; ++i) {
                        /*:: if(typeof val !== "string") throw new Error("unreachable"); */

This is not the full fix: the DBF writer still needs to use the encoded byte lengths in its field-size calculation (long Chinese strings will overflow), and the change affects how some of the other legacy writers work. It is enough to verify encoding correctness.
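To illustrate the two points above, here is a minimal sketch. The first function reproduces the fallback `replace` call from the unpatched branch, which explains the `__` in the report. The second, `dbcsByteLength`, is a hypothetical helper (not part of the library) that assumes a double-byte codepage such as 936, where each non-ASCII character the codepage can represent occupies two bytes.

```javascript
// Fallback from the unpatched branch: every non-ASCII character becomes "_".
function asciiFallback(val) {
  return val.replace(/[^\x00-\x7F]/g, "_");
}

// Hypothetical helper (assumption): byte length of a string in a double-byte
// codepage, counting 1 byte per ASCII character and 2 bytes otherwise.
function dbcsByteLength(val) {
  let n = 0;
  for (const ch of val) n += ch.charCodeAt(0) < 0x80 ? 1 : 2;
  return n;
}

console.log(asciiFallback("中文"));   // the two characters are replaced by "__"
console.log("中文".length);           // 2 characters
console.log(dbcsByteLength("中文"));  // 4 bytes under the stated assumption
```

The gap between the character count and the byte count is why a writer that sizes fields from `val.length` can overflow on long Chinese strings.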

  2. After applying the patch, you have to tell the library which encoding to use. For example, with Simplified Chinese:
xlsx.writeFile(book, "./test.dbf", {bookType: "dbf", codepage: 936});

The main supported codepages for Chinese characters are:

  • 936 (Simplified Chinese GBK)
  • 950 (Traditional Chinese Big5)

There are two other codepages with support for the two characters in the example:

  • 949 (Korean)
  • 932 (Japanese Shift-JIS).
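
To compare the four encodings side by side, as with the attached ZIP, one could write the same workbook once per codepage. This is only a sketch: `dbfTargets` and `writeAllCodepages` are hypothetical helpers, the file naming is an assumption, and the only library call used is `writeFile` with the `bookType` and `codepage` options shown above.

```javascript
// Codepages named in this thread.
const CODEPAGES = [
  { codepage: 936, label: "Simplified Chinese GBK" },
  { codepage: 950, label: "Traditional Chinese Big5" },
  { codepage: 949, label: "Korean" },
  { codepage: 932, label: "Japanese Shift-JIS" },
];

// Hypothetical helper: one output file name and one options object per codepage.
function dbfTargets(basename) {
  return CODEPAGES.map(({ codepage }) => ({
    file: `${basename}-${codepage}.dbf`,
    opts: { bookType: "dbf", codepage },
  }));
}

// Hypothetical helper: `xlsx` is the module returned by require("xlsx"),
// `book` is a workbook such as the one built in the original report.
function writeAllCodepages(xlsx, book, basename) {
  for (const { file, opts } of dbfTargets(basename)) {
    xlsx.writeFile(book, file, opts);
  }
}
```

Calling `writeAllCodepages(require("xlsx"), book, "./test")` would then produce `test-936.dbf`, `test-950.dbf`, `test-949.dbf` and `test-932.dbf`, each of which can be opened in the target application to check the characters.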
ZJS248 commented 2022-09-04 09:26:30 +00:00 (Migrated from github.com)

Thanks for replying, it works for me now.

SheetJSDev commented 2022-09-09 06:47:35 +00:00 (Migrated from github.com)

Testing this against the latest version appears to work. The web version at https://jsfiddle.net/bg10f526/ automatically generates and downloads test.dbf. The web file is identical to the file generated in NodeJS.
For version 0.18.11, the MD5 of the generated file should be ec756d220aa7e6ce5e7d810406617842
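
A quick way to compare a generated file against that checksum in NodeJS is sketched below. It uses only the built-in `crypto` and `fs` modules. `md5Hex` and `fileMatchesMd5` are hypothetical helper names, and the `./test.dbf` path in the usage note is taken from the original report.

```javascript
const crypto = require("crypto");
const fs = require("fs");

// Hex-encoded MD5 digest of a string or Buffer.
function md5Hex(data) {
  return crypto.createHash("md5").update(data).digest("hex");
}

// Hypothetical helper: true when the file's MD5 equals the expected hex digest.
function fileMatchesMd5(path, expected) {
  return md5Hex(fs.readFileSync(path)) === expected.toLowerCase();
}
```

For example, `fileMatchesMd5("./test.dbf", "ec756d220aa7e6ce5e7d810406617842")` should return `true` if the file was generated with the same version and input as described above.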

Reference: sheetjs/sheetjs#2781