HTML inside XLS not parsed properly #1178

Closed
opened 2018-07-10 18:37:50 +00:00 by mishaberman · 2 comments
mishaberman commented 2018-07-10 18:37:50 +00:00 (Migrated from github.com)

Hello! I am having some issues with reading XLS files that are actually HTML tables. When exporting a report from Salesforce, if you choose to export as an XLS, the report is actually exported as an HTML table, but saved as an XLS file. Weird, but whatever. I am handling the use case anyway, in case other systems do this, and in order to allow this as an input for my program instead of just throwing an error for the user.

Issue 1: The header cells in the HTML table are not being read by the parser. They are ignored. A quick fix for this was to do regex that replaces all <th>, </th>, and <th ...> variations with <td> tags. Not having to do this would be nice, but it was a simple enough workaround.

Issue 2: I tested putting HTML tags as fields in my Salesforce export, for example I set the value of a Project Manager name to . Salesforce did the correct thing by HTML encoding this value to "<td>", however when reading this file in code, these values are completely ignored. I'm not sure if this will ever be a use case for an input file, however I want to protect against it in case it is.

Here is a simple example of an HTML table that I saved as an XLS file that demonstrates both Issue 1 and 2:

<html>
Header 1 Header 2
Col val 1 Col val 2
Col val 3 Col val 4
<td> <td>
</html>
<html>
<table>
	<tr>
		<th>
			Header 1
		</th>
		<th>
			Header 2
		</th>
	</tr>
	<tr>
		<td>Col val 1</td>
		<td>Col val 2</td>
	</tr>
	<tr>
		<td>Col val 3</td>
		<td>Col val 4</td>
	</tr>
	<tr>
		<td>&lt;td&gt;</td>
		<td>&lt;td&gt;</td>
	</tr>
</table>
</html>

When calling xlsx.read, here is the output when inspecting in the debugger (as you can see, the header values are not picked up, and the HTML encoded values are not picked up either):

image

Please let me know if you have suggestions for the issues I am experiencing, other than telling the user to download the Salesforce report as a CSV instead :)

Hello! I am having some issues with reading XLS files that are actually HTML tables. When exporting a report from Salesforce, if you choose to export as an XLS, the report is actually exported as an HTML table, but saved as an XLS file. Weird, but whatever. I am handling the use case anyway, in case other systems do this, and in order to allow this as an input for my program instead of just throwing an error for the user. **Issue 1:** The header cells in the HTML table are not being read by the parser. They are ignored. A quick fix for this was to do regex that replaces all &lt;th&gt;, &lt;/th&gt;, and <th ...> variations with &lt;td&gt; tags. Not having to do this would be nice, but it was a simple enough workaround. **Issue 2:** I tested putting HTML tags as fields in my Salesforce export, for example I set the value of a Project Manager name to <td>. Salesforce did the correct thing by HTML encoding this value to "&lt;td&gt;", however when reading this file in code, these values are completely ignored. I'm not sure if this will ever be a use case for an input file, however I want to protect against it in case it is. Here is a simple example of an HTML table that I saved as an XLS file that demonstrates both Issue 1 and 2: <html> <table> <tr> <th> Header 1 </th> <th> Header 2 </th> </tr> <tr> <td>Col val 1</td> <td>Col val 2</td> </tr> <tr> <td>Col val 3</td> <td>Col val 4</td> </tr> <tr> <td>&lt;td&gt;</td> <td>&lt;td&gt;</td> </tr> </table> </html> > <html> <table> <tr> <th> Header 1 </th> <th> Header 2 </th> </tr> <tr> <td>Col val 1</td> <td>Col val 2</td> </tr> <tr> <td>Col val 3</td> <td>Col val 4</td> </tr> <tr> <td>&lt;td&gt;</td> <td>&lt;td&gt;</td> </tr> </table> </html> When calling xlsx.read, here is the output when inspecting in the debugger (as you can see, the header values are not picked up, and the HTML encoded values are not picked up either): ![image](https://user-images.githubusercontent.com/68528/42530265-9737b532-8435-11e8-9c97-19c365a51843.png) Please let me know if you have suggestions for the issues I am experiencing, other than telling the user to download the Salesforce report as a CSV instead :)
SheetJSDev commented 2018-07-10 20:10:12 +00:00 (Migrated from github.com)

So I can only reproduce half the problem: http://jsfiddle.net/8uy017bo/1/ The headers are picked up, and the correct range is calculated, but the entities are being re-processed. What browser are you using?

The entity issue is fortunately a one-line fix (and we'd accept a PR if you feel so inclined :) https://github.com/SheetJS/js-xlsx/blob/master/bits/79_html.js#L33

-				m = htmldecode(unescapexml(m));
+				m = htmldecode(m);
So I can only reproduce half the problem: http://jsfiddle.net/8uy017bo/1/ The headers are picked up, and the correct range is calculated, but the entities are being re-processed. What browser are you using? The entity issue is fortunately a one-line fix (and we'd accept a PR if you feel so inclined :) https://github.com/SheetJS/js-xlsx/blob/master/bits/79_html.js#L33 ```diff - m = htmldecode(unescapexml(m)); + m = htmldecode(m); ```
mishaberman commented 2018-07-13 16:16:29 +00:00 (Migrated from github.com)

Thanks for taking a look at this! I am actually not using the browser for this. I am loading an attachment via API call, and then parsing the XLS file with your library.

I will test out the HTML escape issue and submit a PR, thanks!

Thanks for taking a look at this! I am actually not using the browser for this. I am loading an attachment via API call, and then parsing the XLS file with your library. I will test out the HTML escape issue and submit a PR, thanks!
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#1178
No description provided.