HTML inside XLS not parsed properly #1178
Labels
No Label
DBF
Dates
Defined Names
Features
Formula
HTML
Images
Infrastructure
Integration
International
ODS
Operations
Performance
PivotTables
Pro
Protection
Read Bug
SSF
SYLK
Style
Write Bug
good first issue
No Milestone
No Assignees
1 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: sheetjs/sheetjs#1178
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Hello! I am having some issues with reading XLS files that are actually HTML tables. When exporting a report from Salesforce, if you choose to export as an XLS, the report is actually exported as an HTML table, but saved as an XLS file. Weird, but whatever. I am handling the use case anyway, in case other systems do this, and in order to allow this as an input for my program instead of just throwing an error for the user.
Issue 1: The header cells in the HTML table are not being read by the parser. They are ignored. A quick fix for this was to do regex that replaces all <th>, </th>, and <th ...> variations with <td> tags. Not having to do this would be nice, but it was a simple enough workaround.
Issue 2: I tested putting HTML tags as fields in my Salesforce export, for example I set the value of a Project Manager name to . Salesforce did the correct thing by HTML encoding this value to "<td>", however when reading this file in code, these values are completely ignored. I'm not sure if this will ever be a use case for an input file, however I want to protect against it in case it is.
Here is a simple example of an HTML table that I saved as an XLS file that demonstrates both Issue 1 and 2:
<html>When calling xlsx.read, here is the output when inspecting in the debugger (as you can see, the header values are not picked up, and the HTML encoded values are not picked up either):
Please let me know if you have suggestions for the issues I am experiencing, other than telling the user to download the Salesforce report as a CSV instead :)
So I can only reproduce half the problem: http://jsfiddle.net/8uy017bo/1/ The headers are picked up, and the correct range is calculated, but the entities are being re-processed. What browser are you using?
The entity issue is fortunately a one-line fix (and we'd accept a PR if you feel so inclined :) https://github.com/SheetJS/js-xlsx/blob/master/bits/79_html.js#L33
Thanks for taking a look at this! I am actually not using the browser for this. I am loading an attachment via API call, and then parsing the XLS file with your library.
I will test out the HTML escape issue and submit a PR, thanks!