importing CSV doesn't handle string delimiters #646

Closed
opened 2017-05-01 18:25:03 +00:00 by jabbermarky · 3 comments
jabbermarky commented 2017-05-01 18:25:03 +00:00 (Migrated from github.com)

I am trying to import CSV files.
I am using commas to separate fields.
when fields have embedded commas, I either quote or double quote the string

when strings are double-quoted, the library strips the trailing double-quote, but does not strip the leading double-quote.
example in:
"hello, this is a test"
yields:
"hello, this is a test
I expected it to drop both double-quotes.

also when importing a CSV, how do I specify other field separators and string delimiters?
for instance, single quotes are not recognized at all.
how would I use pipe '|' as the field separator?

I am trying to import CSV files. I am using commas to separate fields. when fields have embedded commas, I either quote or double quote the string when strings are double-quoted, the library strips the trailing double-quote, but does not strip the leading double-quote. example in: "hello, this is a test" yields: "hello, this is a test I expected it to drop both double-quotes. also when importing a CSV, how do I specify other field separators and string delimiters? for instance, single quotes are not recognized at all. how would I use pipe '|' as the field separator?
SheetJSDev commented 2017-05-01 19:14:26 +00:00 (Migrated from github.com)

If I understood correctly, you tried a file like quotes.txt

I checked against the web demo at http://oss.sheetjs.com/js-xlsx/ (drag and drop ) and found the correct output:

screen shot 2017-05-01 at 15 07 30

The node module appears to do the right thing:

$ xlsx quotes.txt
Sheet1
"hello, this is a test"

And to confirm that the object has the correct value:

$ node -pe 'require("xlsx").readFile("quotes.txt").Sheets.Sheet1.A1'
{ t: 's', v: 'hello, this is a test' }

Can you share the exact file you tested with?

If I understood correctly, you tried a file like [quotes.txt](https://github.com/SheetJS/js-xlsx/files/968561/quotes.txt) I checked against the web demo at http://oss.sheetjs.com/js-xlsx/ (drag and drop ) and found the correct output: <img width="185" alt="screen shot 2017-05-01 at 15 07 30" src="https://cloud.githubusercontent.com/assets/6070939/25590809/eb6a8a7e-2e7f-11e7-9cf3-6168d7aed0ec.png"> The node module appears to do the right thing: ```bash $ xlsx quotes.txt Sheet1 "hello, this is a test" ``` And to confirm that the object has the correct value: ```bash $ node -pe 'require("xlsx").readFile("quotes.txt").Sheets.Sheet1.A1' { t: 's', v: 'hello, this is a test' } ``` Can you share the exact file you tested with?
jabbermarky commented 2017-05-01 19:22:21 +00:00 (Migrated from github.com)

I think it might have something to do with white space before/after the field separators.
for example, I always put a space after a comma. that shows up in the JSON.
so my hypothesis is that for this input ’Some Value, “hello, this is a test”'
2 columns are extracted
first column value is ‘Some Value’
and second column value is ‘ “hello, this is a test’

Mark Lummus
http://www.marklummus.com

On May 1, 2017, at 3:14 PM, SheetJSDev notifications@github.com wrote:

If I understood correctly, you tried a file like quotes.txt https://github.com/SheetJS/js-xlsx/files/968561/quotes.txt
I checked against the web demo at http://oss.sheetjs.com/js-xlsx/ http://oss.sheetjs.com/js-xlsx/ (drag and drop ) and found the correct output:

https://cloud.githubusercontent.com/assets/6070939/25590809/eb6a8a7e-2e7f-11e7-9cf3-6168d7aed0ec.png
The node module appears to do the right thing:

$ xlsx quotes.txt
Sheet1
"hello, this is a test"
And to confirm that the object has the correct value:

$ node -pe 'require("xlsx").readFile("quotes.txt").Sheets.Sheet1.A1'
{ t: 's', v: 'hello, this is a test' }
Can you share the exact file you tested with?


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub https://github.com/SheetJS/js-xlsx/issues/646#issuecomment-298406573, or mute the thread https://github.com/notifications/unsubscribe-auth/AHccd6jAcdzonyk6fI3mGVnmDSmVcAckks5r1i8XgaJpZM4NNSFT.

I think it might have something to do with white space before/after the field separators. for example, I always put a space after a comma. that shows up in the JSON. so my hypothesis is that for this input ’Some Value, “hello, this is a test”' 2 columns are extracted first column value is ‘Some Value’ and second column value is ‘ “hello, this is a test’ Mark Lummus http://www.marklummus.com On May 1, 2017, at 3:14 PM, SheetJSDev <notifications@github.com> wrote: If I understood correctly, you tried a file like quotes.txt <https://github.com/SheetJS/js-xlsx/files/968561/quotes.txt> I checked against the web demo at http://oss.sheetjs.com/js-xlsx/ <http://oss.sheetjs.com/js-xlsx/> (drag and drop ) and found the correct output: <https://cloud.githubusercontent.com/assets/6070939/25590809/eb6a8a7e-2e7f-11e7-9cf3-6168d7aed0ec.png> The node module appears to do the right thing: $ xlsx quotes.txt Sheet1 "hello, this is a test" And to confirm that the object has the correct value: $ node -pe 'require("xlsx").readFile("quotes.txt").Sheets.Sheet1.A1' { t: 's', v: 'hello, this is a test' } Can you share the exact file you tested with? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <https://github.com/SheetJS/js-xlsx/issues/646#issuecomment-298406573>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHccd6jAcdzonyk6fI3mGVnmDSmVcAckks5r1i8XgaJpZM4NNSFT>.
SheetJSDev commented 2017-05-01 19:58:13 +00:00 (Migrated from github.com)

@jabbermarky Thanks for reporting! You actually revealed two bugs.

  1. So here's the story: if you put a whitespace before the quote, Excel will actually treat it as 3 cells:
Some Value, "hello, this is a test"
          |       |
Some Value| "hello| this is a test"

If you don't put the whitespace Excel treats it as two cells.

The fix here is to treat any double quote as literal if the cell doesn't start with a double quote

  1. You asked about the pipe delimiter, the answer should be "start your file with sep=| as the first line" (try it in Excel!) but there's a missing else in the code.

So I believe the following diff addresses both issues as well as a related issue when the CSV does not end in a newline:

diff --git a/bits/40_harb.js b/bits/40_harb.js
index b9898a9..d22e2b1 100644
--- a/bits/40_harb.js
+++ b/bits/40_harb.js
@@ -485,29 +485,35 @@ var PRN = (function() {
 
 		/* known sep */
 		if(str.substr(0,4) == "sep=" && str.charCodeAt(5) == 10) { sep = str.charAt(4); str = str.substr(6); }
-		/* TODO: actually determine the separator */
-		if(str.substr(0,1024).indexOf("\t") == -1) sep = ","; else sep = "\t";
+		else if(str.substr(0,1024).indexOf("\t") == -1) sep = ","; else sep = "\t";
 		var R = 0, C = 0, v = 0;
 		var start = 0, end = 0, sepcc = sep.charCodeAt(0), instr = false, cc=0;
 		str = str.replace(/\r\n/g, "\n");
-		for(;end < str.length;++end) switch((cc=str.charCodeAt(end))) {
-			case 0x22: instr = !instr; break;
-			case sepcc: case 0x0a: if(instr) break;
+		function finish_cell() {
 			var s = str.slice(start, end);
 			var cell = ({}/*:any*/);
 			if(s.charCodeAt(0) == 0x3D) { cell.t = 'n'; cell.f = s.substr(1); }
 			else if(s == "TRUE") { cell.t = 'b'; cell.v = true; }
 			else if(s == "FALSE") { cell.t = 'b'; cell.v = false; }
 			else if(!isNaN(v = parseFloat(s))) { cell.t = 'n'; cell.w = s; cell.v = v; }
-			else { cell.t = 's'; cell.v = s.replace(/^"/,"").replace(/"$/,"").replace(/""/g,'"'); }
+			else {
+				cell.t = 's';
+				if(s.charAt(0) == '"' && s.charAt(s.length - 1) == '"') s = s.slice(1,-1).replace(/""/g,'"');
+				cell.v = s;
+			}
 			if(o.dense) { if(!ws[R]) ws[R] = []; ws[R][C] = cell; }
 			else ws[encode_cell({c:C,r:R})] = cell;
 			start = end+1;
 			if(range.e.c < C) range.e.c = C;
 			if(range.e.r < R) range.e.r = R;
-			if(cc == sepcc) ++C; else { C = 0; ++R; } break;
+			if(cc == sepcc) ++C; else { C = 0; ++R; }
+		}
+		for(;end < str.length;++end) switch((cc=str.charCodeAt(end))) {
+			case 0x22: if(instr || (end - start == 0)) instr = !instr; break;
+			case sepcc: case 0x0a: if(!instr) finish_cell(); break;
 			default: break;
 		}
+		if(end - start > 0) finish_cell();
 
 		ws['!ref'] = encode_range(range);
 		return ws;

We'll add some more corner cases for the CSV

@jabbermarky Thanks for reporting! You actually revealed two bugs. 1) So here's the story: if you put a whitespace before the quote, Excel will actually treat it as 3 cells: ``` Some Value, "hello, this is a test" | | Some Value| "hello| this is a test" ``` If you don't put the whitespace Excel treats it as two cells. The fix here is to treat any double quote as literal if the cell doesn't start with a double quote 2) You asked about the pipe delimiter, the answer should be "start your file with `sep=|` as the first line" (try it in Excel!) but there's a missing `else` in the code. So I believe the following diff addresses both issues as well as a related issue when the CSV does not end in a newline: ```diff diff --git a/bits/40_harb.js b/bits/40_harb.js index b9898a9..d22e2b1 100644 --- a/bits/40_harb.js +++ b/bits/40_harb.js @@ -485,29 +485,35 @@ var PRN = (function() { /* known sep */ if(str.substr(0,4) == "sep=" && str.charCodeAt(5) == 10) { sep = str.charAt(4); str = str.substr(6); } - /* TODO: actually determine the separator */ - if(str.substr(0,1024).indexOf("\t") == -1) sep = ","; else sep = "\t"; + else if(str.substr(0,1024).indexOf("\t") == -1) sep = ","; else sep = "\t"; var R = 0, C = 0, v = 0; var start = 0, end = 0, sepcc = sep.charCodeAt(0), instr = false, cc=0; str = str.replace(/\r\n/g, "\n"); - for(;end < str.length;++end) switch((cc=str.charCodeAt(end))) { - case 0x22: instr = !instr; break; - case sepcc: case 0x0a: if(instr) break; + function finish_cell() { var s = str.slice(start, end); var cell = ({}/*:any*/); if(s.charCodeAt(0) == 0x3D) { cell.t = 'n'; cell.f = s.substr(1); } else if(s == "TRUE") { cell.t = 'b'; cell.v = true; } else if(s == "FALSE") { cell.t = 'b'; cell.v = false; } else if(!isNaN(v = parseFloat(s))) { cell.t = 'n'; cell.w = s; cell.v = v; } - else { cell.t = 's'; cell.v = s.replace(/^"/,"").replace(/"$/,"").replace(/""/g,'"'); } + else { + cell.t = 's'; + if(s.charAt(0) == '"' && s.charAt(s.length - 1) == '"') s = s.slice(1,-1).replace(/""/g,'"'); + cell.v = s; + } if(o.dense) { if(!ws[R]) ws[R] = []; ws[R][C] = cell; } else ws[encode_cell({c:C,r:R})] = cell; start = end+1; if(range.e.c < C) range.e.c = C; if(range.e.r < R) range.e.r = R; - if(cc == sepcc) ++C; else { C = 0; ++R; } break; + if(cc == sepcc) ++C; else { C = 0; ++R; } + } + for(;end < str.length;++end) switch((cc=str.charCodeAt(end))) { + case 0x22: if(instr || (end - start == 0)) instr = !instr; break; + case sepcc: case 0x0a: if(!instr) finish_cell(); break; default: break; } + if(end - start > 0) finish_cell(); ws['!ref'] = encode_range(range); return ws; ``` We'll add some more corner cases for the CSV
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#646
No description provided.