Spurious whitespace creates spurious columns and rows #945

Closed
opened 2018-01-04 20:01:06 +00:00 by bean5 · 7 comments
bean5 commented 2018-01-04 20:01:06 +00:00 (Migrated from github.com)

Expected
Cells containing only whitespace are treated as empty string.

Actual
Whitespace is treated as characters.

Background
In the case that I accidentally have a space in a random cell, everything up to that point ends up being imported, including empty columns/rows. Of course, it is important to leave blank columns between 2 actual columns of data (for spacing). But if I have a 5x5 spreadsheet that upon opening has 4 extra columns and 2 extra rows because at some point there was data in some far-off cell but was replaced with " ", things look strange especially since I can't see whitespace normally. While in Excel extra columns and rows appear by default, when using web-based tools like the one at http://oss.sheetjs.com/ extra rows and columns must be added explicitly.

I have some code I plan to use in the mean time. Feel free to use it. Be warned, it is not yet untested. It is as follows:

// removes columns that are whitespace or empty string
function eradicate_empty_columns(table) {
  column: for(var j = 0; j < table.length; j++) {
    // check column for other rows
    row: for(var i = 0; i < table.length; i++) {
      if(!might_as_well_be_empty_string(table[j][i])) {
        continue column;
      }
    }
    table.splice(i, 1);
    i--;
  }
}
function remove_column(table, column) {
  // remove column
  for(var i = 0; i < table.length; i++) {
    table[i].splice(column, 1);
  }
}
function eradicate_empty_rows(table) {
  row: for(var i = 0; i < table.length; i++) {
    // check column for other rows
    column: for(var j = 0; j < table.length; j++) {
      if(!might_as_well_be_empty_string(table[j][i])) {
        continue column;
      }
    }
    table.splice(i, 1);
    i--;
  }
}
**Expected** Cells containing only whitespace are treated as empty string. **Actual** Whitespace is treated as characters. **Background** In the case that I accidentally have a space in a random cell, everything up to that point ends up being imported, including empty columns/rows. Of course, it is important to leave blank columns between 2 actual columns of data (for spacing). But if I have a 5x5 spreadsheet that upon opening has 4 extra columns and 2 extra rows because at some point there was data in some far-off cell but was replaced with `" "`, things look strange especially since I can't see whitespace normally. While in Excel extra columns and rows appear by default, when using web-based tools like the one at http://oss.sheetjs.com/ extra rows and columns must be added explicitly. I have some code I plan to use in the mean time. Feel free to use it. Be warned, it is not yet untested. It is as follows: ```js // removes columns that are whitespace or empty string function eradicate_empty_columns(table) { column: for(var j = 0; j < table.length; j++) { // check column for other rows row: for(var i = 0; i < table.length; i++) { if(!might_as_well_be_empty_string(table[j][i])) { continue column; } } table.splice(i, 1); i--; } } function remove_column(table, column) { // remove column for(var i = 0; i < table.length; i++) { table[i].splice(column, 1); } } function eradicate_empty_rows(table) { row: for(var i = 0; i < table.length; i++) { // check column for other rows column: for(var j = 0; j < table.length; j++) { if(!might_as_well_be_empty_string(table[j][i])) { continue column; } } table.splice(i, 1); i--; } } ```
SheetJSDev commented 2018-01-04 20:26:19 +00:00 (Migrated from github.com)

@bean5 you can add this to a page in the wiki https://github.com/SheetJS/js-xlsx/wiki . Note that if you open the code block with ```js GH will use syntax highlighting:

// removes columns that are whitespace or empty string
function eradicate_empty_columns(table) {
  column: for(var j = 0; j < table.length; j++) {
    // check column for other rows
    row: for(var i = 0; i < table.length; i++) {
      if(!might_as_well_be_empty_string(table[j][i])) {
        continue column;
      }
    }
    table.splice(i, 1);
    i--;
  }
}
function remove_column(table, column) {
  // remove column
  for(var i = 0; i < table.length; i++) {
    table[i].splice(column, 1);
  }
}
function eradicate_empty_rows(table) {
  row: for(var i = 0; i < table.length; i++) {
    // check column for other rows
    column: for(var j = 0; j < table.length; j++) {
      if(!might_as_well_be_empty_string(table[j][i])) {
        continue column;
      }
    }
    table.splice(i, 1);
    i--;
  }
}

As for the general question of whether cells consisting solely of whitespace should be preserved, here are a few cases to consider:

  • cell consisting of errant blank spaces:
  • cell encoded as a set of blank spaces: '
  • formula explicitly set to an empty string: =""
  • formula explicitly set to blank spaces: '=" "
  • formula that evaluates to blank spaces: =CHAR(32)&CHAR(32)
@bean5 you can add this to a page in the wiki https://github.com/SheetJS/js-xlsx/wiki . Note that if you open the code block with ` ```js ` GH will use syntax highlighting: ```js // removes columns that are whitespace or empty string function eradicate_empty_columns(table) { column: for(var j = 0; j < table.length; j++) { // check column for other rows row: for(var i = 0; i < table.length; i++) { if(!might_as_well_be_empty_string(table[j][i])) { continue column; } } table.splice(i, 1); i--; } } function remove_column(table, column) { // remove column for(var i = 0; i < table.length; i++) { table[i].splice(column, 1); } } function eradicate_empty_rows(table) { row: for(var i = 0; i < table.length; i++) { // check column for other rows column: for(var j = 0; j < table.length; j++) { if(!might_as_well_be_empty_string(table[j][i])) { continue column; } } table.splice(i, 1); i--; } } ``` As for the general question of whether cells consisting solely of whitespace should be preserved, here are a few cases to consider: - cell consisting of errant blank spaces: ` ` - cell encoded as a set of blank spaces: `' ` - formula explicitly set to an empty string: `=""` - formula explicitly set to blank spaces: `'=" "` - formula that evaluates to blank spaces: `=CHAR(32)&CHAR(32)`
bean5 commented 2018-01-04 20:51:00 +00:00 (Migrated from github.com)

I updated my code block to use highlighting.

Your cases there are interesting. My code avoids at least some of those by running after evaluation has occurred. I used regular expressions to replace whatever regular expression defines as whitespace.

Aside, I found something strange with one of your cases. If you copy the value of =CHAR(32)&CHAR(32) into another cell (I am using LibreOffice), then open the file with http://oss.sheetjs.com/ it will be imported and look like whitespace. If you click on the cell, it gets updated to undefined.

Whitespace is really starting to rear its ugly head.

I updated my code block to use highlighting. Your cases there are interesting. My code avoids at least some of those by running after evaluation has occurred. I used regular expressions to replace whatever regular expression defines as whitespace. Aside, I found something strange with one of your cases. If you copy the _value_ of `=CHAR(32)&CHAR(32)` into another cell (I am using LibreOffice), then open the file with http://oss.sheetjs.com/ it will be imported and look like whitespace. If you click on the cell, it gets updated to `undefined`. Whitespace is really starting to rear its ugly head.
bean5 commented 2018-01-04 20:53:25 +00:00 (Migrated from github.com)

Also, I'm not going to add the code to the wiki...my code still has bugs--plus it doesn't aim to preserve blank rows/columns sandwiched between non-whitespace data.

Also, I'm not going to add the code to the wiki...my code still has bugs--plus it doesn't aim to preserve blank rows/columns sandwiched between non-whitespace data.
SheetJSDev commented 2018-01-11 18:27:42 +00:00 (Migrated from github.com)

Since you're working with an array of arrays, and since you want to preserve blank columns and blank rows between valid columns and rows, you can fix the data in one pass by looping backwards:

function remove_outside_blanks(aoa) {
	/* Loop backwards by row */
	row: for(var R = aoa.length-1; R >= 0; --R) {
		var rw = aoa[R];
		/* Fix the row and bail once a real value is found */
		for(var C = rw.length-1; C >= 0; --C) {
			var value = rw[C];
			/* trailing nulls should be removed */
			if(value == null) rw.length--;
			/* remove whitespace cells */
			else if(typeof value == "string" && value.trim() == "") rw.length--;
			/* end loop if useful value */
			else break;
		}
		/* remove empty row */
		if(C < 0 && R == aoa.length - 1) --aoa.length; 
	}
	return aoa;
}

Use this on array of arrays, like those you would get from sheet_to_json with header:1:

remove_outside_blanks([
	["abc", "def"],
	[" "], // this row will be preserved but the blank value will be wiped
	[123, " ", 456, " "], // the last value will be removed
	[, " "] // this row will be deleted, shrinking the AOA
]);
[
	[ 'abc', 'def' ],
	[],
	[ 123, ' ', 456 ]
]
Since you're working with an array of arrays, and since you want to preserve blank columns and blank rows between valid columns and rows, you can fix the data in one pass by looping backwards: ```js function remove_outside_blanks(aoa) { /* Loop backwards by row */ row: for(var R = aoa.length-1; R >= 0; --R) { var rw = aoa[R]; /* Fix the row and bail once a real value is found */ for(var C = rw.length-1; C >= 0; --C) { var value = rw[C]; /* trailing nulls should be removed */ if(value == null) rw.length--; /* remove whitespace cells */ else if(typeof value == "string" && value.trim() == "") rw.length--; /* end loop if useful value */ else break; } /* remove empty row */ if(C < 0 && R == aoa.length - 1) --aoa.length; } return aoa; } ``` Use this on array of arrays, like those you would get from `sheet_to_json` with `header:1`: ```js remove_outside_blanks([ ["abc", "def"], [" "], // this row will be preserved but the blank value will be wiped [123, " ", 456, " "], // the last value will be removed [, " "] // this row will be deleted, shrinking the AOA ]); [ [ 'abc', 'def' ], [], [ 123, ' ', 456 ] ] ```
bean5 commented 2018-01-11 21:37:31 +00:00 (Migrated from github.com)

Since my use cases are okay with removing blank rows and columns, I just use the forwards route. But yes, I believe the backwards route should be used at a minimum by default by this project. It would be nice to have a setting to optionally remove the rest. Normally I'd make a PR, but I am new to the codebase. I don't know where to centrally place such code.

Since my use cases are okay with removing blank rows and columns, I just use the forwards route. But yes, I believe the backwards route should be used at a minimum by default by this project. It would be nice to have a setting to _optionally_ remove the rest. Normally I'd make a PR, but I am new to the codebase. I don't know where to centrally place such code.
SheetJSDev commented 2018-01-11 22:19:32 +00:00 (Migrated from github.com)

These are nice little utility functions but don't really belong in the library code directly, which is why the wiki or a separate library is the best place to house them. For posterity sake, that code segment represents one obvious solution and is hopefully helpful to other people who stumble upon this page.

It is unclear where the "fix" should appear. For example, readers could omit string cells with no formula and whitespace as content (which already happens with HTML input). sheet_to_json, sheet_to_csv, and other utilities could skip cells with whitespace content.

These are nice little utility functions but don't really belong in the library code directly, which is why the wiki or a separate library is the best place to house them. For posterity sake, that code segment represents one obvious solution and is hopefully helpful to other people who stumble upon this page. It is unclear where the "fix" should appear. For example, readers could omit string cells with no formula and whitespace as content (which already happens with HTML input). `sheet_to_json`, `sheet_to_csv`, and other utilities could skip cells with whitespace content.
bean5 commented 2018-01-11 23:17:53 +00:00 (Migrated from github.com)

I agree on many of those points. I ran into this when consuming the json
version of the some xls documents I have.

On Thu, Jan 11, 2018 at 2:19 PM SheetJSDev notifications@github.com wrote:

These are nice little utility functions but don't really belong in the
library code directly, which is why the wiki or a separate library is the
best place to house them. For posterity sake, that code segment represents
one obvious solution and is hopefully helpful to other people who stumble
upon this page.

It is unclear where the "fix" should appear. For example, readers could
omit string cells with no formula and whitespace as content (which already
happens with HTML input). sheet_to_json, sheet_to_csv, and other
utilities could skip cells with whitespace content.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/SheetJS/js-xlsx/issues/945#issuecomment-357080207,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB9SJkt-neHsodAEdj8CEhbCuOt6IORoks5tJoj8gaJpZM4RTf24
.

I agree on many of those points. I ran into this when consuming the json version of the some xls documents I have. On Thu, Jan 11, 2018 at 2:19 PM SheetJSDev <notifications@github.com> wrote: > These are nice little utility functions but don't really belong in the > library code directly, which is why the wiki or a separate library is the > best place to house them. For posterity sake, that code segment represents > one obvious solution and is hopefully helpful to other people who stumble > upon this page. > > It is unclear where the "fix" should appear. For example, readers could > omit string cells with no formula and whitespace as content (which already > happens with HTML input). sheet_to_json, sheet_to_csv, and other > utilities could skip cells with whitespace content. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/SheetJS/js-xlsx/issues/945#issuecomment-357080207>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AB9SJkt-neHsodAEdj8CEhbCuOt6IORoks5tJoj8gaJpZM4RTf24> > . >
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#945
No description provided.