hyperlink cannot be parsed correctly #2860

Closed
opened 3 weeks ago by tw-yaxu · 4 comments

I've tried use FileReader and sheetjs to read xlsx file. And when I parse external link in cell object, it turns out to be different from what displayed in excel file.

code snippet to read excel:

public async readExcelFile(file: File): Promise<WorkBook> {
        return new Promise((resolve, reject) => {
            const fileReader = new FileReader()
            fileReader.readAsArrayBuffer(file)

            fileReader.onload = (e) => {
                const bufferArray = e.target?.result
                const workBook = read(bufferArray, { type: 'buffer' })
                resolve(workBook)
            }

            fileReader.onerror = ((error) => {
                reject(error)
            })
        })
    }

code snippet to get hyperlink:

    public getHyperlink(cell: any): string {
        if (cell?.l) {
            return cell.l.Target
        }
        return cell?.v || ''
    }

expected output for external link: https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&test=true
actual output for external link:
https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&amp;test=true

Now I just use Regex replacement as a quick solution. However, I'm not sure if there is any other characters been encoded incorrectly.
Can you help me to find one solution to get the raw external link from excel file? Thanks very much.

I've tried use `FileReader` and `sheetjs` to read xlsx file. And when I parse external link in cell object, it turns out to be different from what displayed in excel file. code snippet to read excel: ``` public async readExcelFile(file: File): Promise<WorkBook> { return new Promise((resolve, reject) => { const fileReader = new FileReader() fileReader.readAsArrayBuffer(file) fileReader.onload = (e) => { const bufferArray = e.target?.result const workBook = read(bufferArray, { type: 'buffer' }) resolve(workBook) } fileReader.onerror = ((error) => { reject(error) }) }) } ``` code snippet to get hyperlink: ``` public getHyperlink(cell: any): string { if (cell?.l) { return cell.l.Target } return cell?.v || '' } ``` expected output for external link: `https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&test=true` actual output for external link: `https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&amp;test=true` Now I just use Regex replacement as a quick solution. However, I'm not sure if there is any other characters been encoded incorrectly. Can you help me to find one solution to get the raw external link from excel file? Thanks very much.
Owner

Are you sure some other step is not automatically encoding?

https://jsfiddle.net/ohwnbxqr/ is a small demo with a simple file element:

<input type="file" id="filelt">
<pre id="out"></pre>

JS code prints out every link:

filelt.onchange = async(e) => {
  const f = e.target.files[0];
  const wb = XLSX.read(await f.arrayBuffer(), {dense: true});
  wb.SheetNames.forEach(n => {
    wb.Sheets[n]?.["!data"]?.forEach((row, R) => {
      row.forEach((cell, C) => {
        console.log(cell.l?.Target)
        if(cell?.l) out.innerText += `${n} : ${XLSX.utils.encode_cell({r:R,c:C})} : ${cell.l?.Target}`;
      })
    })
  })
}

To see if this is working properly, we can look at the demos page. https://docs.sheetjs.com/docs/csf/features/hyperlinks go to the first live editor and just change the Target to the expected output. The screenshot shows what the code looked like, and the attached XLSX file is what the script created. To confirm Excel read the file correctly, open it, right click cell A1 and select "Edit Hyperlink" to see https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&test=true.

The reason to suspect some process on your end is encoding text is in the fiddle itself. The second screenshot shows the result of selecting the file. You will see the page itself (assigned via innerText) shows the correct &. The actual browser console (right side) shows the correct &. However the JSFiddle fake console shows the encoded value. That, it seems, is a bug in JSFiddle.

If you run the same exact test in NodeJS:

const XLSX = require("xlsx");
const wb = XLSX.readFile("issue2860.xlsx", {dense: true});
wb.SheetNames.forEach(n => {
  wb.Sheets[n]?.["!data"]?.forEach((row, R) => {
    row.forEach((cell, C) => {
      if(cell?.l) console.log(`${n} : ${XLSX.utils.encode_cell({r:R,c:C})} : ${cell.l?.Target}`);
    })
  })
})

You will see the correct value:

% node issue2860.js
Sheet1 : A1 : https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&test=true
Are you sure some other step is not automatically encoding? https://jsfiddle.net/ohwnbxqr/ is a small demo with a simple file element: ```html <input type="file" id="filelt"> <pre id="out"></pre> ``` JS code prints out every link: ```js filelt.onchange = async(e) => { const f = e.target.files[0]; const wb = XLSX.read(await f.arrayBuffer(), {dense: true}); wb.SheetNames.forEach(n => { wb.Sheets[n]?.["!data"]?.forEach((row, R) => { row.forEach((cell, C) => { console.log(cell.l?.Target) if(cell?.l) out.innerText += `${n} : ${XLSX.utils.encode_cell({r:R,c:C})} : ${cell.l?.Target}`; }) }) }) } ``` To see if this is working properly, we can look at the demos page. https://docs.sheetjs.com/docs/csf/features/hyperlinks go to the first live editor and just change the Target to the expected output. The screenshot shows what the code looked like, and the attached XLSX file is what the script created. To confirm Excel read the file correctly, open it, right click cell A1 and select "Edit Hyperlink" to see `https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&test=true`. The reason to suspect some process on your end is encoding text is in the fiddle itself. The second screenshot shows the result of selecting the file. You will see the page itself (assigned via `innerText`) shows the correct `&`. The actual browser console (right side) shows the correct `&`. However the JSFiddle fake console shows the encoded value. That, it seems, is a bug in JSFiddle. If you run the same exact test in NodeJS: ```js const XLSX = require("xlsx"); const wb = XLSX.readFile("issue2860.xlsx", {dense: true}); wb.SheetNames.forEach(n => { wb.Sheets[n]?.["!data"]?.forEach((row, R) => { row.forEach((cell, C) => { if(cell?.l) console.log(`${n} : ${XLSX.utils.encode_cell({r:R,c:C})} : ${cell.l?.Target}`); }) }) }) ``` You will see the correct value: ``` % node issue2860.js Sheet1 : A1 : https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&test=true ```
Poster

I've tried in your small demo and still got the same error. That's really weird. Need more time to figure out the excel parsing issue. Thanks for your help.

here is the excel file with external link
image
here is the fiddle & console output
image

I've tried in your small demo and still got the same error. That's really weird. Need more time to figure out the excel parsing issue. Thanks for your help. here is the excel file with external link ![image](/attachments/a3f50073-e492-4616-9564-e01a67760588) here is the fiddle & console output ![image](/attachments/ad198ea9-46b6-429f-861d-a436edd8db20)
Owner

Thanks for sharing! There's a key difference between the two examples.

The URI fragment (the part after the #) is properly decoded and re-encoded. In your first example:

https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&test=true
                                 ^------------------------------------------------

The underlined part is the fragment and that part is correctly round-tripped.

In your second example,

https://google.com/test=true&test2=false

There is no fragment. The body is correctly escaped when writing (you can verify that by writing the file and opening in Excel) but the parser is not unescaping.

https://git.sheetjs.com/sheetjs/sheetjs/src/branch/master/bits/31_rels.js#L55 the patch is:

--- a/bits/31_rels.js
+++ b/bits/31_rels.js
@@ -52,7 +52,7 @@ function parse_rels(data/*:?string*/, currentFilePath/*:string*/) {
                var y = parsexmltag(x);
                /* 9.3.2.2 OPC_Relationships */
                if (y[0] === '<Relationship') {
-                       var rel = {}; rel.Type = y.Type; rel.Target = y.Target; rel.Id = y.Id; if(y.TargetMode) rel.TargetMode = y.TargetMode;
+                       var rel = {}; rel.Type = y.Type; rel.Target = unescapexml(y.Target); rel.Id = y.Id; if(y.TargetMode) rel.TargetMode = y.TargetMode;
                        var canonictarget = y.TargetMode === 'External' ? y.Target : resolve_path(y.Target, currentFilePath);
                        rels[canonictarget] = rel;
                        hash[y.Id] = rel;

Feel free to send a PR.

Thanks for sharing! There's a key difference between the two examples. The URI fragment (the part after the `#`) is properly decoded and re-encoded. In your first example: ``` https://mail.google.com/mail/u/0/#inbox/FMfcgzGrcFjbSrPQLvwvqSDgJHkZStDv&test=true ^------------------------------------------------ ``` The underlined part is the fragment and that part is correctly round-tripped. In your second example, ``` https://google.com/test=true&test2=false ``` There is no fragment. The body is correctly escaped when writing (you can verify that by writing the file and opening in Excel) but the parser is not unescaping. https://git.sheetjs.com/sheetjs/sheetjs/src/branch/master/bits/31_rels.js#L55 the patch is: ```diff --- a/bits/31_rels.js +++ b/bits/31_rels.js @@ -52,7 +52,7 @@ function parse_rels(data/*:?string*/, currentFilePath/*:string*/) { var y = parsexmltag(x); /* 9.3.2.2 OPC_Relationships */ if (y[0] === '<Relationship') { - var rel = {}; rel.Type = y.Type; rel.Target = y.Target; rel.Id = y.Id; if(y.TargetMode) rel.TargetMode = y.TargetMode; + var rel = {}; rel.Type = y.Type; rel.Target = unescapexml(y.Target); rel.Id = y.Id; if(y.TargetMode) rel.TargetMode = y.TargetMode; var canonictarget = y.TargetMode === 'External' ? y.Target : resolve_path(y.Target, currentFilePath); rels[canonictarget] = rel; hash[y.Id] = rel; ``` Feel free to send a PR.
sheetjs commented 1 day ago
Owner

This was pushed in 0.19.2, please do an update. An updated version of the fiddle (using the new library version): https://jsfiddle.net/xhtdrscq/

This was pushed in 0.19.2, please do an update. An updated version of the fiddle (using the new library version): https://jsfiddle.net/xhtdrscq/
sheetjs closed this issue 1 day ago
Sign in to join this conversation.
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#2860
Loading…
There is no content yet.