Parsing string values and treating as dates is incorrect #1300

Closed
opened 2018-10-05 09:48:01 +00:00 by pzhelnov · 13 comments
pzhelnov commented 2018-10-05 09:48:01 +00:00 (Migrated from github.com)

Hi!
There is a CSV file, which I try to read and it contains field with value

"Aprobil P 0.1%"

the short example of CSV:


"Country";"Product Family"
"Germany";"Aprobil P 0.1%"

conversion to workbook is the following:

var workbook = XLSX.read(csvData, {
                    type:'string',
                    dateNF: 'D-M-YYYY',
                    cellDates:true,
                    cellText:true,
                    cellNF: false,
                raw:false});

after conversion I save the XLS, where value "Aprobil P 0.1%" is converted to a date 01.04.00

looking into the worksheet model and getting the certain cell, it contains:

{
      t: 'd',
      v: 'Sat Apr 01 2000 00:00:00 GMT+0300 (Eastern European Summer Time)',
      z:undefined
}

is there any way to cover this case?
Could you pls assist, what to do.

Thanks in advance!

Hi! There is a CSV file, which I try to read and it contains field with value "Aprobil P 0.1%" the short example of CSV: ``` "Country";"Product Family" "Germany";"Aprobil P 0.1%" ``` conversion to workbook is the following: ``` var workbook = XLSX.read(csvData, { type:'string', dateNF: 'D-M-YYYY', cellDates:true, cellText:true, cellNF: false, raw:false}); ``` after conversion I save the XLS, where value "Aprobil P 0.1%" is converted to a date 01.04.00 looking into the worksheet model and getting the certain cell, it contains: ``` { t: 'd', v: 'Sat Apr 01 2000 00:00:00 GMT+0300 (Eastern European Summer Time)', z:undefined } ``` is there any way to cover this case? Could you pls assist, what to do. Thanks in advance!
pzhelnov commented 2018-10-18 14:09:27 +00:00 (Migrated from github.com)

any update, any comments? is anybody here? ;)

any update, any comments? is anybody here? ;)
metal4timmy commented 2018-11-07 15:40:22 +00:00 (Migrated from github.com)

I am having a similar issue with certain strings being converted to date values. Any update on this?

I am having a similar issue with certain strings being converted to date values. Any update on this?
pzhelnov commented 2018-11-14 13:54:35 +00:00 (Migrated from github.com)

guys, @SheetJSDev, pls, just pay a little attention to it, tell us, is it possible to fix, or should we rely on our workaround?

guys, @SheetJSDev, pls, just pay a little attention to it, tell us, is it possible to fix, or should we rely on our workaround?
pzhelnov commented 2019-09-11 07:01:51 +00:00 (Migrated from github.com)

guys, one year has been passed, no one even commented.
Do I need to switch to another project?

guys, one year has been passed, no one even commented. Do I need to switch to another project?
SheetJSDev commented 2019-09-11 07:23:48 +00:00 (Migrated from github.com)

Hello @pzhelnov ! First, understand that this is an open source project and all work done on the project is charity. If you are concerned about response time or interested in supporting development, we offer paid builds and support as discussed in https://sheetjs.com/pro

That said, unfortunately there's no magic solution here. The Date parsing currently falls back on the JS Date builtin features, which are super flexible in chrome. As an example:

new Date("This is not a date 1")

is the date January 1 2001, despite it clearly indicating that it isn't a date.

Setting the option raw: true will give you the raw strings, suppressing any sort of value interpretation.

If you are interested in helping, the relevant function is fuzzydate. Google Docs / Office Online likely use a series of regular expressions based on some known formats like /\d{4}-\d{2}-\d{2}/ for yyyy-mm-dd and that might be a more sensible approach than using the generic Date constructor

Hello @pzhelnov ! First, understand that this is an open source project and all work done on the project is charity. If you are concerned about response time or interested in supporting development, we offer paid builds and support as discussed in https://sheetjs.com/pro That said, unfortunately there's no magic solution here. The Date parsing currently falls back on the JS Date builtin features, which are super flexible in chrome. As an example: ```js new Date("This is not a date 1") ``` is the date January 1 2001, despite it clearly indicating that it isn't a date. Setting the option `raw: true` will give you the raw strings, suppressing any sort of value interpretation. If you are interested in helping, the relevant function is [`fuzzydate`](https://github.com/SheetJS/js-xlsx/blob/master/bits/20_jsutils.js#L126). Google Docs / Office Online likely use a series of regular expressions based on some known formats like `/\d{4}-\d{2}-\d{2}/` for `yyyy-mm-dd` and that might be a more sensible approach than using the generic Date constructor
pzhelnov commented 2019-09-11 07:34:46 +00:00 (Migrated from github.com)

Thank you very much for an answer!
Sorry for raising my hand in such a rude manner, we've just stuck on this problem for a long time and my workaround approach was not so strict.

First of all I have to say this is great project, one of the best, flexible and configurable, which currently existing in the community.
So, keep going!

Thank you very much for an answer! Sorry for raising my hand in such a rude manner, we've just stuck on this problem for a long time and my workaround approach was not so strict. First of all I have to say this is great project, one of the best, flexible and configurable, which currently existing in the community. So, keep going!
SheetJSDev commented 2019-09-11 07:52:54 +00:00 (Migrated from github.com)

We'd accept a PR that does the following:

  1. add an option to the reader, call it something like strictDates (use your imagination) that would use this alternative parser

  2. change the line https://github.com/SheetJS/js-xlsx/blob/master/bits/40_harb.js#L834 to

			else if(!isNaN(fuzzydate(s, o.strictDates).getDate()) || _re && s.match(_re)) {
  1. add a second parameter to fuzzydate and load up the regexes in https://github.com/SheetJS/js-xlsx/blob/master/bits/20_jsutils.js#L126:
var STRICT_DATE_REGEXES = [
	/\d{4}-\d{2}-\d{2}/
]
function fuzzydate(s/*:string*/, strict/*:boolean*/)/*:Date*/ {
	var o = new Date(s), n = new Date(NaN);
	if(strict) {
		for(var i = 0; i < STRICT_DATE_REGEXES.length; ++i) if(s.match(STRICT_DATE_REGEXES[i])) return o;
		return n;
	}
	var y = o.getYear(), m = o.getMonth(), d = o.getDate();
	if(isNaN(d)) return n;
	if(y < 0 || y > 8099) return n;
	if((m > 0 || d > 1) && y != 101) return o;
	if(s.toLowerCase().match(/jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec/)) return o;
	if(s.match(/[^-0-9:,\/\\]/)) return n;
	return o;
}

This obviously isn't 100% correct but is roughly shaped like how we'd expect the ultimate code to be. The new code doesn't the existing behavior (so we could get away with a patch version bump) but it does give you a backdoor to solve your immediate problem and gives us room to grow as we grind further

We'd accept a PR that does the following: 1) add an option to the reader, call it something like `strictDates` (use your imagination) that would use this alternative parser 2) change the line https://github.com/SheetJS/js-xlsx/blob/master/bits/40_harb.js#L834 to ``` else if(!isNaN(fuzzydate(s, o.strictDates).getDate()) || _re && s.match(_re)) { ``` 3) add a second parameter to `fuzzydate` and load up the regexes in https://github.com/SheetJS/js-xlsx/blob/master/bits/20_jsutils.js#L126: ```js var STRICT_DATE_REGEXES = [ /\d{4}-\d{2}-\d{2}/ ] function fuzzydate(s/*:string*/, strict/*:boolean*/)/*:Date*/ { var o = new Date(s), n = new Date(NaN); if(strict) { for(var i = 0; i < STRICT_DATE_REGEXES.length; ++i) if(s.match(STRICT_DATE_REGEXES[i])) return o; return n; } var y = o.getYear(), m = o.getMonth(), d = o.getDate(); if(isNaN(d)) return n; if(y < 0 || y > 8099) return n; if((m > 0 || d > 1) && y != 101) return o; if(s.toLowerCase().match(/jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec/)) return o; if(s.match(/[^-0-9:,\/\\]/)) return n; return o; } ``` This obviously isn't 100% correct but is roughly shaped like how we'd expect the ultimate code to be. The new code doesn't the existing behavior (so we could get away with a patch version bump) but it does give you a backdoor to solve your immediate problem and gives us room to grow as we grind further
nandanv2702 commented 2021-08-06 19:28:39 +00:00 (Migrated from github.com)

Hey, is this issue still open to a PR? I'd like to give it a shot if I can

Hey, is this issue still open to a PR? I'd like to give it a shot if I can
SheetJSDev commented 2021-08-06 19:32:48 +00:00 (Migrated from github.com)

As is usually the case with these problems, the code is the easy part. The hard part is making a decision :(

If this is of interest, start by enumerating the common date formats that the parser should recognize (for example, YYYY-MM-DD) and constructing the corresponding regular expressions. Then we'd accept a PR that follows the approach outlined in the previous comment.

As is usually the case with these problems, the code is the easy part. The hard part is making a decision :( If this is of interest, start by enumerating the common date formats that the parser should recognize (for example, `YYYY-MM-DD`) and constructing the corresponding regular expressions. Then we'd accept a PR that follows the approach outlined in the previous comment.
nandanv2702 commented 2021-08-06 19:43:30 +00:00 (Migrated from github.com)

Hmm, I see - let me start working on this in a bit and I'll see if I can add something of value that works. It's also my first open-source contrib, so please lmk if there's something more I can do!

Hmm, I see - let me start working on this in a bit and I'll see if I can add something of value that works. It's also my first open-source contrib, so please lmk if there's something more I can do!
nandanv2702 commented 2021-08-13 18:15:01 +00:00 (Migrated from github.com)

Hey! Just created a PR for this here - let me know if I need to make any more changes, etc.

Hey! Just created a PR for this [here](https://github.com/SheetJS/sheetjs/pull/2354) - let me know if I need to make any more changes, etc.
SheetJSDev commented 2022-02-11 03:55:17 +00:00 (Migrated from github.com)

Based on some testing, in the en-US locale, there appear to be no valid Excel dates that contain letters outside of the month short or long names (even full day names are not accepted). Looking at a number of the issues, a guard like

    var lower = s.toLowerCase();
    if(lower.match(/jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec/)) {
      lower = lower.replace(/[^a-z]/g,"");
      if(lower.length > 3 && lower_months.indexOf(lower) == -1) return n;
      return o;
    }

(where lower_months is an array of the long names of months) would address many of the problems. The proposed "strict" mode is still needed for determining field values and matching formats that Chrome does not accept (for example, 1:23.04:56)

Based on some testing, in the `en-US` locale, there appear to be no valid Excel dates that contain letters outside of the month short or long names (even full day names are not accepted). Looking at a number of the issues, a guard like ```js var lower = s.toLowerCase(); if(lower.match(/jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec/)) { lower = lower.replace(/[^a-z]/g,""); if(lower.length > 3 && lower_months.indexOf(lower) == -1) return n; return o; } ``` (where `lower_months` is an array of the long names of months) would address many of the problems. The proposed "strict" mode is still needed for determining field values and matching formats that Chrome does not accept (for example, `1:23.04:56`)
SheetJSDev commented 2022-03-04 08:24:27 +00:00 (Migrated from github.com)

Fixed in 0.18.3: https://jsfiddle.net/9u84Lwja/

const csvData = `\
"Country";"Product Family"
"Germany";"Aprobil P 0.1%"`;
var workbook = XLSX.read(csvData, {
                    type:'string',
                    dateNF: 'D-M-YYYY',
                    cellDates:true,
                    cellText:true,
                    cellNF: false,
                raw:false});

out.innerText = JSON.stringify(workbook.Sheets.Sheet1, 2, 2);

The relevant cell is

  "B2": {
    "t": "s",
    "v": "Aprobil P 0.1%"
  },
Fixed in 0.18.3: https://jsfiddle.net/9u84Lwja/ ```js const csvData = `\ "Country";"Product Family" "Germany";"Aprobil P 0.1%"`; var workbook = XLSX.read(csvData, { type:'string', dateNF: 'D-M-YYYY', cellDates:true, cellText:true, cellNF: false, raw:false}); out.innerText = JSON.stringify(workbook.Sheets.Sheet1, 2, 2); ``` The relevant cell is ```js "B2": { "t": "s", "v": "Aprobil P 0.1%" }, ```
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#1300
No description provided.