Unexpected Automatic Date Conversion #2196

Closed
opened 2020-12-11 15:55:10 +00:00 by matthew-macgregor · 10 comments
matthew-macgregor commented 2020-12-11 15:55:10 +00:00 (Migrated from github.com)

Issue Overview

One of our systems which uses sheetjs for parsing csv/xlsx data produced an unexpected result this week. The data which caused the issue was "Mayslanding, NJ 08234", which the library coerced to a date as { t: 'n', v: 2313568, w: '5/1/34' }. My hunch is that the code which attempts to guess if a cell might be a date is too lax given some inputs.

This happens when raw: false, which is the default. We have worked around it by simply setting raw: true and doing our own type conversions. Still, this unexpected result seems likely to trip up others using the default setting.

I am willing to volunteer to submit a PR to fix this issue but would like some feedback from a maintainer before writing any code. My guess is that any changes to this date conversion code could very easily cause other unexpected side effects, so advice would be appreciated.

Steps to Reproduce:

Given the (somewhat fabricated) csv data below, the following code will produce the result:

const wkbk = xlsx.readFile('./datesville.csv')

CSV input:

Line 1
"Januaryville, NJ 08234"
"Februaryville, NJ 08234"
"Marchville, NJ 08234"
"Aprilville, NJ 08234"
"Mayslanding, NJ 08234"
"Juneville, NJ 08234"
"Julyville, NJ 08234"
"Augustville, NJ 08234"
"Septemberville, NJ 08234"
"Octoberville, NJ 08234"
"Novemberlanding, NJ 08330"
"Decembertown, NJ 08330"

Unexpected result:

{
  A1: { t: 's', v: 'Line 1' },
  A2: { t: 'n', v: 2313448, w: '1/1/34' },
  A3: { t: 'n', v: 2313479, w: '2/1/34' },
  A4: { t: 'n', v: 2313507, w: '3/1/34' },
  A5: { t: 'n', v: 2313538, w: '4/1/34' },
  A6: { t: 'n', v: 2313568, w: '5/1/34' },
  A7: { t: 'n', v: 2313599, w: '6/1/34' },
  A8: { t: 'n', v: 2313629, w: '7/1/34' },
  A9: { t: 'n', v: 2313660, w: '8/1/34' },
  A10: { t: 'n', v: 2313691, w: '9/1/34' },
  A11: { t: 'n', v: 2313721, w: '10/1/34' },
  A12: { t: 'n', v: 2348815, w: '11/1/30' },
  A13: { t: 'n', v: 2348845, w: '12/1/30' },
  '!ref': 'A1:A13'
}
# Issue Overview One of our systems which uses `sheetjs` for parsing csv/xlsx data produced an unexpected result this week. The data which caused the issue was `"Mayslanding, NJ 08234"`, which the library coerced to a `date` as `{ t: 'n', v: 2313568, w: '5/1/34' }`. My hunch is that the code which attempts to guess if a cell might be a date is too lax given some inputs. This happens when `raw: false`, which is the default. We have worked around it by simply setting `raw: true` and doing our own type conversions. Still, this unexpected result seems likely to trip up others using the default setting. I am willing to volunteer to submit a PR to fix this issue but would like some feedback from a maintainer before writing any code. My guess is that any changes to this date conversion code could very easily cause other unexpected side effects, so advice would be appreciated. ### Steps to Reproduce: Given the (somewhat fabricated) csv data below, the following code will produce the result: ``` const wkbk = xlsx.readFile('./datesville.csv') ``` CSV input: ```csv Line 1 "Januaryville, NJ 08234" "Februaryville, NJ 08234" "Marchville, NJ 08234" "Aprilville, NJ 08234" "Mayslanding, NJ 08234" "Juneville, NJ 08234" "Julyville, NJ 08234" "Augustville, NJ 08234" "Septemberville, NJ 08234" "Octoberville, NJ 08234" "Novemberlanding, NJ 08330" "Decembertown, NJ 08330" ``` Unexpected result: ```js { A1: { t: 's', v: 'Line 1' }, A2: { t: 'n', v: 2313448, w: '1/1/34' }, A3: { t: 'n', v: 2313479, w: '2/1/34' }, A4: { t: 'n', v: 2313507, w: '3/1/34' }, A5: { t: 'n', v: 2313538, w: '4/1/34' }, A6: { t: 'n', v: 2313568, w: '5/1/34' }, A7: { t: 'n', v: 2313599, w: '6/1/34' }, A8: { t: 'n', v: 2313629, w: '7/1/34' }, A9: { t: 'n', v: 2313660, w: '8/1/34' }, A10: { t: 'n', v: 2313691, w: '9/1/34' }, A11: { t: 'n', v: 2313721, w: '10/1/34' }, A12: { t: 'n', v: 2348815, w: '11/1/30' }, A13: { t: 'n', v: 2348845, w: '12/1/30' }, '!ref': 'A1:A13' } ```
SheetJSDev commented 2020-12-11 17:29:38 +00:00 (Migrated from github.com)

To be clear, this ultimately boils down to how V8 (Chrome/Node) handles dates:

new Date("Mayslanding, NJ 08234"); // Thu May 01 8234 00:00:00 in your timezone

We try to correct for it in fuzzydate by looking for one of the month labels. I suspect the month name regex can be cleaned up a bit to explicitly match the short or long form for names, like

if(s.toLowerCase().match(/\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/)) return o;
if(s.toLowerCase().match(/\b(january|february|march|april|may|june|july|august|september|october|november|december)\b/)) return o;
To be clear, this ultimately boils down to how V8 (Chrome/Node) handles dates: ```js new Date("Mayslanding, NJ 08234"); // Thu May 01 8234 00:00:00 in your timezone ``` We try to correct for it in [`fuzzydate`](https://github.com/SheetJS/sheetjs/blob/master/bits/20_jsutils.js#L133-L142) by looking for one of the month labels. I suspect the month name regex can be cleaned up a bit to explicitly match the short or long form for names, like ```js if(s.toLowerCase().match(/\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/)) return o; if(s.toLowerCase().match(/\b(january|february|march|april|may|june|july|august|september|october|november|december)\b/)) return o; ```
matthew-macgregor commented 2020-12-11 19:15:43 +00:00 (Migrated from github.com)

I'll take a closer look at those and put together a PR when I get a chance.

I'll take a closer look at those and put together a PR when I get a chance.
jbull328 commented 2020-12-12 15:54:39 +00:00 (Migrated from github.com)

I am also seeing this when creating an excel file in my project. In our case, the data is a string "1/$1.99" It gets turned into a date with some of the numbers but some randomness as well. 1/2/1999

I am also seeing this when creating an excel file in my project. In our case, the data is a string "1/$1.99" It gets turned into a date with some of the numbers but some randomness as well. 1/2/1999
matthew-macgregor commented 2020-12-12 17:07:44 +00:00 (Migrated from github.com)

@jbull328, just to confirm, you're seeing that behavior on creation/output of a document with that data? I am not seeing it on input in xlsx (with several column formats) or csv.

@jbull328, just to confirm, you're seeing that behavior on creation/output of a document with that data? I am not seeing it on input in `xlsx` (with several column formats) or `csv`.
jbull328 commented 2020-12-15 20:02:32 +00:00 (Migrated from github.com)

Correct @matthew-macgregor we did see that on output. I ended up abandoning using the module.

Correct @matthew-macgregor we did see that on output. I ended up abandoning using the module.
SheetJSDev commented 2020-12-15 20:11:05 +00:00 (Migrated from github.com)

@jbull328 If this is showing up when you write a CSV, that's Excel's automatic conversion. For example, consider the CSV

1/3,2/3,3/3

If you write that as plaintext to a file and open in Excel, it will interpret those as dates:

screenshot

In any case that's an issue unrelated to this one.

@jbull328 If this is showing up when you write a CSV, that's Excel's automatic conversion. For example, consider the CSV ``` 1/3,2/3,3/3 ``` If you write that as plaintext to a file and open in Excel, it will interpret those as dates: <img width="243" alt="screenshot" src="https://user-images.githubusercontent.com/6070939/102267317-b92c7c80-3ee7-11eb-98e2-1baad7cc7753.png"> In any case that's an issue unrelated to this one.
flaushi commented 2021-05-13 11:42:31 +00:00 (Migrated from github.com)

I want to confirm that the date parsing strategy of sheetjs is a bit too aggressive for my taste.

Effectively I can only use raw reading mode for csv files, as random decimal numbers are interpreted as date, which is not foreseeable so not reliable.

There are dozens of issues on this topic.

If I switch on raw, this problem disappears, however everything is a string then. I have to parse numbers then.

Is there something in between? So a config that will parse numbers if possible, and leaves the rest as string?

I want to confirm that the date parsing strategy of sheetjs is a bit too aggressive for my taste. Effectively I can only use `raw` reading mode for csv files, as random decimal numbers are interpreted as date, which is not foreseeable so not reliable. There are dozens of issues on this topic. If I switch on `raw`, this problem disappears, however everything is a string then. I have to parse numbers then. Is there something in between? So a config that will parse numbers if possible, and leaves the rest as string?
SheetJSDev commented 2021-09-10 09:34:58 +00:00 (Migrated from github.com)

@flaushi it's aggressive because V8 (chrome/node) is aggressive. fuzzydate attempts some light validation. Unfortunately this story probably ends with a hard-coded list of acceptable date formats, as discussed in https://github.com/SheetJS/sheetjs/issues/1300#issuecomment-530266048

@flaushi it's aggressive because V8 (chrome/node) is aggressive. `fuzzydate` attempts some light validation. Unfortunately this story probably ends with a hard-coded list of acceptable date formats, as discussed in https://github.com/SheetJS/sheetjs/issues/1300#issuecomment-530266048
SheetJSDev commented 2022-02-11 02:15:50 +00:00 (Migrated from github.com)

Moving this to #1300

Moving this to #1300
reviewher commented 2022-02-14 04:10:05 +00:00 (Migrated from github.com)

0.18.1 fixes this issue. To verify in NodeJS:

$ for v in 0.17.4 0.17.5 0.18.0 0.18.1; do npm i "xlsx@$v"; node -pe 'var XLSX = require("xlsx"); [XLSX.version, XLSX.readFile("t.csv").Sheets.Sheet1.A1]'; done
[ '0.17.4', { t: 'n', v: 2313448, w: '1/1/34' } ]
[ '0.17.5', { t: 'n', v: 2313448, w: '1/1/34' } ]
[ '0.18.0', { t: 'n', v: 2313448, w: '1/1/34' } ]
[ '0.18.1', { t: 's', v: 'Januaryville, NJ 08234' } ]
0.18.1 fixes this issue. To verify in NodeJS: ```bash $ for v in 0.17.4 0.17.5 0.18.0 0.18.1; do npm i "xlsx@$v"; node -pe 'var XLSX = require("xlsx"); [XLSX.version, XLSX.readFile("t.csv").Sheets.Sheet1.A1]'; done ``` ```js [ '0.17.4', { t: 'n', v: 2313448, w: '1/1/34' } ] [ '0.17.5', { t: 'n', v: 2313448, w: '1/1/34' } ] [ '0.18.0', { t: 'n', v: 2313448, w: '1/1/34' } ] [ '0.18.1', { t: 's', v: 'Januaryville, NJ 08234' } ] ```
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#2196
No description provided.