long formatted strings are lost (undefined) in workbook.Sheets[] cell values #25

Closed
opened 2013-04-30 21:57:49 +00:00 by mitchellsundt · 5 comments
mitchellsundt commented 2013-04-30 21:57:49 +00:00 (Migrated from github.com)

When js-xlsx processes a XLSX file with a long formatted text string in a cell, it somehow drops or does not process the cell value -- the value is undefined.

See the XLSX file attached to this issue: http://code.google.com/p/opendatakit/issues/detail?id=820

When js-xlsx processes a XLSX file with a long formatted text string in a cell, it somehow drops or does not process the cell value -- the value is undefined. See the XLSX file attached to this issue: http://code.google.com/p/opendatakit/issues/detail?id=820
mitchellsundt commented 2013-04-30 22:42:55 +00:00 (Migrated from github.com)

First half of fix is:

/* 18.4 Shared String Table */
function parseStrs(data) {
var s = [];
var sst = data.match(new RegExp("<sst ([^>]*)>([\\s\\S]*)<\/sst>","m"));
if(sst) {
    s = sst[2].replace(/<si>/g,"").split(/<\/si>/).map(function(x) { var z = {};
        var y=x.match(/<([^>]*)>([\s\S]*)<\/[^>]*>/);
        if(y) z[y[1].split(" ")[0]]=utf8read(unescapexml(y[2]));
        return z;
    });

    sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount;
}
return s;
}

(the inner match expression has changed).

Since this is apparently a rich text string, it has a 'r' instead of a 't' value, so the retrieval in parseSheet, which assumes:

            case 's': p.v = strs[parseInt(p.v, 10)].t; break;

Will still return undefined.

This should probably still be undefined, and the readme should be updated to note that rich text in cells is not supported by the converter.

The fix to the inner match, however, should be done so that expressions are properly extracted into the Strings array.

First half of fix is: ``` /* 18.4 Shared String Table */ function parseStrs(data) { var s = []; var sst = data.match(new RegExp("<sst ([^>]*)>([\\s\\S]*)<\/sst>","m")); if(sst) { s = sst[2].replace(/<si>/g,"").split(/<\/si>/).map(function(x) { var z = {}; var y=x.match(/<([^>]*)>([\s\S]*)<\/[^>]*>/); if(y) z[y[1].split(" ")[0]]=utf8read(unescapexml(y[2])); return z; }); sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount; } return s; } ``` (the inner match expression has changed). Since this is apparently a rich text string, it has a 'r' instead of a 't' value, so the retrieval in parseSheet, which assumes: ``` case 's': p.v = strs[parseInt(p.v, 10)].t; break; ``` Will still return undefined. This should probably still be undefined, and the readme should be updated to note that rich text in cells is not supported by the converter. The fix to the inner match, however, should be done so that expressions are properly extracted into the Strings array.
Niggler commented 2013-04-30 23:00:47 +00:00 (Migrated from github.com)

@mitchellsundt thanks for reporting this.

I think the "Right Thing" looks like this:

  • cells have a .r or some other property for the rich text
  • function to convert those values to proper HTML (replacing special tags
    like rFont)
  • for each cell, the .t will hold a plaintext rendering (strip out the
    formatting)

I'm not in front of computer now; I expect to push this later tonight

On Tue, Apr 30, 2013 at 6:42 PM, Mitch Sundt notifications@github.comwrote:

First half of fix is:

/* 18.4 Shared String Table /
function parseStrs(data) {
var s = [];
var sst = data.match(new RegExp("]
)>([\s\S])</sst>","m"));
if(sst) {
s = sst[2].replace(//g,"").split(/</si>/).map(function(x) { var z = {};
var y=x.match(/<([^>]
)>([\s\S])</[^>]>/);
if(y) z[y[1].split(" ")[0]]=utf8read(unescapexml(y[2]));
return z;
});

sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount;

}
return s;

}

(the inner match expression has changed).

Since this is apparently a rich text string, it has a 'r' instead of a 't'
value, so the retrieval in parseSheet, which assumes:

        case 's': p.v = strs[parseInt(p.v, 10)].t; break;

Will still return undefined.

This should probably still be undefined, and the readme should be updated
to note that rich text in cells is not supported by the converter.

The fix to the inner match, however, should be done so that expressions
are properly extracted into the Strings array.


Reply to this email directly or view it on GitHubhttps://github.com/Niggler/js-xlsx/issues/25#issuecomment-17259038
.

@mitchellsundt thanks for reporting this. I think the "Right Thing" looks like this: - cells have a `.r` or some other property for the rich text - function to convert those values to proper HTML (replacing special tags like `rFont`) - for each cell, the `.t` will hold a plaintext rendering (strip out the formatting) I'm not in front of computer now; I expect to push this later tonight On Tue, Apr 30, 2013 at 6:42 PM, Mitch Sundt notifications@github.comwrote: > First half of fix is: > > /\* 18.4 Shared String Table _/ > function parseStrs(data) { > var s = []; > var sst = data.match(new RegExp("]_)>([\s\S]_)<\/sst>","m")); > if(sst) { > s = sst[2].replace(//g,"").split(/<\/si>/).map(function(x) { var z = {}; > var y=x.match(/<([^>]_)>([\s\S]_)<\/[^>]_>/); > if(y) z[y[1].split(" ")[0]]=utf8read(unescapexml(y[2])); > return z; > }); > > ``` > sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount; > ``` > > } > return s; > > } > > (the inner match expression has changed). > > Since this is apparently a rich text string, it has a 'r' instead of a 't' > value, so the retrieval in parseSheet, which assumes: > > ``` > case 's': p.v = strs[parseInt(p.v, 10)].t; break; > ``` > > Will still return undefined. > > This should probably still be undefined, and the readme should be updated > to note that rich text in cells is not supported by the converter. > > The fix to the inner match, however, should be done so that expressions > are properly extracted into the Strings array. > > — > Reply to this email directly or view it on GitHubhttps://github.com/Niggler/js-xlsx/issues/25#issuecomment-17259038 > .
mitchellsundt commented 2013-04-30 23:26:46 +00:00 (Migrated from github.com)

Here's a revised parseStrs that ensures that the <t>...</t> tags match at the beginning and end and do not have any XML (open angle brackets) between those tags. Strings[] objects then have .t for simple text, and .raw for XML text (because that is a raw series of XML elements -- see below)

/* 18.4 Shared String Table */
function parseStrs(data) {
    var s = [];
    var sst = data.match(new RegExp("<sst ([^>]*)>([\\s\\S]*)<\/sst>","m"));
    if(sst) {
        s = sst[2].replace(/<si>/g,"").split(/<\/si>/).map(function(x) { var z = {};
            var simpleString=x.match(/^<t>([^<]*)<\/t>$/);
            if (simpleString) {
            if (simpleString) {
                z['t'] = utf8read(unescapexml(simpleString[1]));
            } else {
                z['raw'] = x;
            }
            return z;
        });

        sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount;
    }
    return s;
}

In the parseSheet function, I did this for the 's' case:

            case 's': {
                    var sval = strs[parseInt(p.v, 10)];
                    if ( "t" in sval ) {
                        p.v = sval.t;
                    } else {
                        delete p.v;
                        p.raw = sval.raw;
                    }
                }
                break;

I.e., put the XML string into the .raw field. My example XLSX file has the following for the 'raw' string:

"<r><t xml:space="preserve">This inserts a 'joins' entry into the column_definitions table for the </t></r><r><rPr><b/><sz val="10"/><color rgb="FF000000"/><rFont val="Arial"/><family val="2"/></rPr><t>household_id</t></r><r><rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t xml:space="preserve"> column of the </t></r><r><rPr><b/><sz val="10"/><color rgb="FF000000"/><rFont val="Arial"/><family val="2"/></rPr><t>household</t></r><r><rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t xml:space="preserve"> table_id of the form: 

"[ { table_id: household_member, element_name: household_id } ]"
</t></r>"

Note that this is just a mixed series of <r> and <t> tags so it isn't wrapped by any enclosing tag

Here's a revised parseStrs that ensures that the &lt;t&gt;...&lt;/t&gt; tags match at the beginning and end and do not have any XML (open angle brackets) between those tags. Strings[] objects then have .t for simple text, and .raw for XML text (because that is a raw series of XML elements -- see below) ``` /* 18.4 Shared String Table */ function parseStrs(data) { var s = []; var sst = data.match(new RegExp("<sst ([^>]*)>([\\s\\S]*)<\/sst>","m")); if(sst) { s = sst[2].replace(/<si>/g,"").split(/<\/si>/).map(function(x) { var z = {}; var simpleString=x.match(/^<t>([^<]*)<\/t>$/); if (simpleString) { if (simpleString) { z['t'] = utf8read(unescapexml(simpleString[1])); } else { z['raw'] = x; } return z; }); sst = parsexmltag(sst[1]); s.Count = sst.count; s.Unique = sst.uniqueCount; } return s; } ``` In the parseSheet function, I did this for the 's' case: ``` case 's': { var sval = strs[parseInt(p.v, 10)]; if ( "t" in sval ) { p.v = sval.t; } else { delete p.v; p.raw = sval.raw; } } break; ``` I.e., put the XML string into the .raw field. My example XLSX file has the following for the 'raw' string: ``` "<r><t xml:space="preserve">This inserts a 'joins' entry into the column_definitions table for the </t></r><r><rPr><b/><sz val="10"/><color rgb="FF000000"/><rFont val="Arial"/><family val="2"/></rPr><t>household_id</t></r><r><rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t xml:space="preserve"> column of the </t></r><r><rPr><b/><sz val="10"/><color rgb="FF000000"/><rFont val="Arial"/><family val="2"/></rPr><t>household</t></r><r><rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t xml:space="preserve"> table_id of the form: "[ { table_id: household_member, element_name: household_id } ]" </t></r>" ``` Note that this is just a mixed series of &lt;r&gt; and &lt;t&gt; tags so it isn't wrapped by any enclosing tag
Niggler commented 2013-05-01 05:05:54 +00:00 (Migrated from github.com)

@mitchellsundt I pushed an update that puts the plaintext in the .t field even if the text is rich (and populates the .r field for plain text as well as rich text), so for plaintext uses (like CSV) this should suffice. The other half (actually parsing the rich text and generating HTML or other output) is in the pipeline.

As for missing support, the ultimate goal is 100% coverage (so any bug or missing feature is very much undesired

@mitchellsundt I pushed an update that puts the plaintext in the .t field even if the text is rich (and populates the .r field for plain text as well as rich text), so for plaintext uses (like CSV) this should suffice. The other half (actually parsing the rich text and generating HTML or other output) is in the pipeline. As for missing support, the ultimate goal is 100% coverage (so any bug or missing feature is very much undesired
mitchellsundt commented 2013-05-01 16:30:25 +00:00 (Migrated from github.com)

Thanks!

On Tue, Apr 30, 2013 at 10:05 PM, Niggler notifications@github.com wrote:

@mitchellsundt https://github.com/mitchellsundt I pushed an update that
puts the plaintext in the .t field even if the text is rich (and populates
the .r field for plain text as well as rich text), so for plaintext uses
(like CSV) this should suffice. The other half (actually parsing the rich
text and generating HTML or other output) is in the pipeline.

As for missing support, the ultimate goal is 100% coverage (so any bug or
missing feature is very much undesired


Reply to this email directly or view it on GitHubhttps://github.com/Niggler/js-xlsx/issues/25#issuecomment-17268924
.

Mitch Sundt
Software Engineer
University of Washington
mitchellsundt@gmail.com

Thanks! On Tue, Apr 30, 2013 at 10:05 PM, Niggler notifications@github.com wrote: > @mitchellsundt https://github.com/mitchellsundt I pushed an update that > puts the plaintext in the .t field even if the text is rich (and populates > the .r field for plain text as well as rich text), so for plaintext uses > (like CSV) this should suffice. The other half (actually parsing the rich > text and generating HTML or other output) is in the pipeline. > > As for missing support, the ultimate goal is 100% coverage (so any bug or > missing feature is very much undesired > > — > Reply to this email directly or view it on GitHubhttps://github.com/Niggler/js-xlsx/issues/25#issuecomment-17268924 > . ## Mitch Sundt Software Engineer University of Washington mitchellsundt@gmail.com
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sheetjs/sheetjs#25
No description provided.