Support for parsing Comments #43

2014-01-15T07:38:30Z

kinwahlai commented

2014-01-15 07:38:30 +00:00

(Migrated from github.com)

Thanks for a great library!
Here is my best attempt at contributing support to parse comments.
I used an existing file already in the project for the regressions tests.
I realize I coded the changes directly in the built xlsx.
Let me know if you want me to change something.

Thanks!

Comments parts listed in the [Content Types] are parsed.
Sheets's relationships are parsed.
Comments parts are correlated to their corresponding sheets parts.
Comments's contents are added to the ref'ed cells.
Rich text styling properties are currently ignored.

For example:

{
  "!ref": "A1:B3",
  "A1": {
    "v": 1,
    "t": "n"
  },
  "B1": {
    "v": "one",
    "t": "s",
    "r": "one",
    "c": [
      { "a": "Yegor Kozlov",
       "t": [ "Yegor Kozlov:",
              "\r\nfirst cell" ]
      }
    ]
  }
}

Thanks for a great library! Here is my best attempt at contributing support to parse comments. I used an existing file already in the project for the regressions tests. I realize I coded the changes directly in the built xlsx. Let me know if you want me to change something. Thanks! Comments parts listed in the [Content Types] are parsed. Sheets's relationships are parsed. Comments parts are correlated to their corresponding sheets parts. Comments's contents are added to the ref'ed cells. Rich text styling properties are currently ignored. For example: ``` { "!ref": "A1:B3", "A1": { "v": 1, "t": "n" }, "B1": { "v": "one", "t": "s", "r": "one", "c": [ { "a": "Yegor Kozlov", "t": [ "Yegor Kozlov:", "\r\nfirst cell" ] } ] } } ```

SheetJSDev commented

2014-01-15 14:02:42 +00:00

(Migrated from github.com)

The test failed because I moved the POI test cases to a separate file. I pushed changes to the test_files repo and this repo's tests.lst -- can you pull the changes, make init then make test?

The test failed because I moved the POI test cases to a separate file. I pushed changes to the test_files repo and this repo's tests.lst -- can you pull the changes, `make init` then `make test`?

kinwahlai commented

2014-01-16 02:35:32 +00:00

(Migrated from github.com)

OK thanks for your attention. We rebased our branch on the top of your latest master.
We are working on putting the changes into the bits files to support the tests as expected by Travis.

OK thanks for your attention. We rebased our branch on the top of your latest master. We are working on putting the changes into the bits files to support the tests as expected by Travis.

SheetJSDev commented

2014-01-16 02:55:42 +00:00

(Migrated from github.com)

@kinwahlai I can reconcile that later. According to https://travis-ci.org/SheetJS/js-xlsx/jobs/17037467, Travis is complaining because the file SimpleWithComments.xlsx was renamed to apachepoi_SimpleWithComments.xlsx. The fix is in the test script:

var wb = XLSX.readFile('./test_files/apachepoi_SimpleWithComments.xlsx');

@kinwahlai I can reconcile that later. According to https://travis-ci.org/SheetJS/js-xlsx/jobs/17037467, Travis is complaining because the file `SimpleWithComments.xlsx` was renamed to `apachepoi_SimpleWithComments.xlsx`. The fix is in the test script: ``` var wb = XLSX.readFile('./test_files/apachepoi_SimpleWithComments.xlsx'); ```

SheetJSDev commented

2014-01-16 15:20:30 +00:00

(Migrated from github.com)

Hmm @kinwah @kinwahlai I thought I made a comment yesterday but that seemed to have disappeared ...

Can you take a look at https://github.com/SheetJS/test_files/blob/master/comments_stress_test.xlsx ? In the first sheet, there is a rich-text comment and the raw text is broken up. It may make more sense to concatenate rather than push the raw comments.

Also, do you need the rich text as well? There is some basic HTML conversion (search for 18.4.7 in the code) which really should be moved out of the shared string table function

    <comment ref="C1" authorId="0">
      <text>
        <r>
          <rPr>
            <b/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve">I </t>
        </r>
        <r>
          <rPr>
            <b/>
            <i/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t>really</t>
        </r>
        <r>
          <rPr>
            <b/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve"> </t>
        </r>
        <r>
          <rPr>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve">hope that </t>
        </r>
        <r>
          <rPr>
            <i/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t>xlsx</t>
        </r>
        <r>
          <rPr>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve"> </t>
        </r>
        <r>
          <rPr>
            <u/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t>decides not</t>
        </r>
        <r>
          <rPr>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
            <family val="2"/>
          </rPr>
          <t xml:space="preserve"> </t>
        </r>
        <r>
          <rPr>
            <i/>
            <u/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t xml:space="preserve">to use </t>
        </r>
        <r>
          <rPr>
            <b/>
            <i/>
            <u/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t>magic</t>
        </r>
        <r>
          <rPr>
            <b/>
            <u/>
            <sz val="9"/>
            <color indexed="81"/>
            <rFont val="Calibri"/>
          </rPr>
          <t xml:space="preserve"> like rPr</t>
        </r>
      </text>
    </comment>

Hmm @kinwah @kinwahlai I thought I made a comment yesterday but that seemed to have disappeared ... Can you take a look at https://github.com/SheetJS/test_files/blob/master/comments_stress_test.xlsx ? In the first sheet, there is a rich-text comment and the raw text is broken up. It may make more sense to concatenate rather than push the raw comments. Also, do you need the rich text as well? There is some basic HTML conversion (search for 18.4.7 in the code) which really should be moved out of the shared string table function ``` <comment ref="C1" authorId="0"> <text> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> <family val="2"/> </rPr> <t xml:space="preserve">I </t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> </rPr> <t>really</t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> <family val="2"/> </rPr> <t xml:space="preserve"> </t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> <family val="2"/> </rPr> <t xml:space="preserve">hope that </t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> </rPr> <t>xlsx</t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> <family val="2"/> </rPr> <t xml:space="preserve"> </t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> </rPr> <t>decides not</t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> <family val="2"/> </rPr> <t xml:space="preserve"> </t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> </rPr> <t xml:space="preserve">to use </t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> </rPr> <t>magic</t> </r> <r> <rPr> <sz val="9"/> <color indexed="81"/> <rFont val="Calibri"/> </rPr> <t xml:space="preserve"> like rPr</t> </r> </text> </comment> ```

kinwahlai commented

2014-01-17 02:27:13 +00:00

(Migrated from github.com)

Hi @SheetJSDev; well in our usecase we need to know the internal structure of the comments. And we are not interested by the styling as html.
For example: to find out that a google-spreadsheet's note matches a comment we make sure we see single rich text in the comment. We put some JSON in some of those notes and we dont want the html styling to appear in there.
When we see multiple rich texts we discard them as they are not notes.

In fact we have the same requirement with number formats: we need to know about the number-formats as defined in the spreadsheet and we are not interested in the formatting of the raw values for them.

For the sake of consistency with the rest of the code we can certainly help with doing to the comments the same type of things than what happens for the other rich texts.

What do you think?

Hi @SheetJSDev; well in our usecase we need to know the internal structure of the comments. And we are not interested by the styling as html. For example: to find out that a google-spreadsheet's note matches a comment we make sure we see single rich text in the comment. We put some JSON in some of those notes and we dont want the html styling to appear in there. When we see multiple rich texts we discard them as they are not notes. In fact we have the same requirement with number formats: we need to know about the number-formats as defined in the spreadsheet and we are not interested in the formatting of the raw values for them. For the sake of consistency with the rest of the code we can certainly help with doing to the comments the same type of things than what happens for the other rich texts. What do you think?

SheetJSDev commented

2014-01-17 03:20:03 +00:00

(Migrated from github.com)

@kinwahlai my point was according to the spec, the text tags don't have author information. It's at the outer comment tag level (see my annotations below):

<comment ref="C1" authorId="0">   <-- AUTHOR IS IDENTIFIED HERE
      <text>  <--- this element does not have the author id
        <r> <-- this element does not have the author id
          <rPr>
             ...
          </rPr>
          <t xml:space="preserve">I </t> <-- this element does not have the author id
        </r>

Hence I would think it makes sense to combine the t tags. I have not played with google docs much, so I don't know how they do it (and my comment may not make sense for gdocs). Can you share an example file with comments from multiple users?

Also, Excel generates an empty tag for comments with no text. See the comments_stress_test.xlsx file (https://github.com/SheetJS/test_files/blob/master/comments_stress_test.xlsx):

<comment ref="B2" authorId="0"><text/>

You need to guard for the possibility that x.match(/<text>([^\u2603]*)<\/text>/m) is null

@kinwahlai my point was according to the spec, the text tags don't have author information. It's at the outer comment tag level (see my annotations below): ``` <comment ref="C1" authorId="0"> <-- AUTHOR IS IDENTIFIED HERE <text> <--- this element does not have the author id <r> <-- this element does not have the author id <rPr> ... </rPr> <t xml:space="preserve">I </t> <-- this element does not have the author id </r> ``` Hence I would think it makes sense to combine the t tags. I have not played with google docs much, so I don't know how they do it (and my comment may not make sense for gdocs). Can you share an example file with comments from multiple users? --- Also, Excel generates an empty tag for comments with no text. See the comments_stress_test.xlsx file (https://github.com/SheetJS/test_files/blob/master/comments_stress_test.xlsx): ``` <comment ref="B2" authorId="0"><text/> ``` You need to guard for the possibility that `x.match(/<text>([^\u2603]*)<\/text>/m)` is null

hmalphettes commented

2014-01-18 02:12:33 +00:00

(Migrated from github.com)

Hi @SheetJSDev, @kinwahlai and I are pairing on this code.
The example sheet we were looking at in the openxml spec had 2 rich texts; I mistook it for the author and body of the comment.
Thanks for clarifying this. In fact we are parsing the author as expected.

Thanks for noting the bug regarding the potentially null text element that would crash the parser.

OK to concatenate the comments as a first step and to pull out the code that applies the comments styles as html styles. It does not really match our usecase and I think it would make our life harder if we were trying to generate xlsx files from the generated model.
But we don't know better than that for now at the moment ;-)

We will add a few more commits to that branch for your kind review then.

Hi @SheetJSDev, @kinwahlai and I are pairing on this code. The example sheet we were looking at in the openxml spec had 2 rich texts; I mistook it for the author and body of the comment. Thanks for clarifying this. In fact we are parsing the author as expected. Thanks for noting the bug regarding the potentially null text element that would crash the parser. OK to concatenate the comments as a first step and to pull out the code that applies the comments styles as html styles. It does not really match our usecase and I think it would make our life harder if we were trying to generate xlsx files from the generated model. But we don't know better than that for now at the moment ;-) We will add a few more commits to that branch for your kind review then.

SheetJSDev commented

2014-01-18 03:48:41 +00:00

(Migrated from github.com)

@kinwahlai @hmalphettes Keep in mind that the goal here is to find a solution that makes "sense". There is no standard solution here, which is why discussion is good :)

If you use the Excel UI "Insert Comment" option, it inserts the author name followed by a colon (in bold). However, since that is actual content in the comment, you can change it or remove it (and that's how you generate the empty comment in the sample sheet).

I think it would make our life harder if we were trying to generate xlsx files from the generated model.

Google docs does some wild stuff with the comments. A friend and I tried to make individual comments as well as comment "conversation", and apparently gdocs is not writing author information using the author tag (authors are identified using a dash followed by the name):

    <comment ref="A1" authorId="0">
      <text>
        <t xml:space="preserve">Foo bar

baz qux
    -Sheet JS
This is a conversation
    -Dev SheetJS</t>
      </text>
    </comment>

complete comments.xml from exported XLSX:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<comments xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <authors>
    <author/>
  </authors>
  <commentList>
    <comment ref="A2" authorId="0">
      <text>
        <t xml:space="preserve">5 6 7 8
    -Sheet JS</t>
      </text>
    </comment>
    <comment ref="A2" authorId="0">
      <text>
        <t xml:space="preserve">1 2 3 4
    -Dev SheetJS</t>
      </text>
    </comment>
    <comment ref="A1" authorId="0">
      <text>
        <t xml:space="preserve">Foo bar

baz qux
    -Sheet JS
This is a conversation
    -Dev SheetJS</t>
      </text>
    </comment>
  </commentList>
</comments>

@kinwahlai @hmalphettes Keep in mind that the goal here is to find a solution that makes "sense". There is no standard solution here, which is why discussion is good :) If you use the Excel UI "Insert Comment" option, it inserts the author name followed by a colon (in bold). However, since that is actual content in the comment, you can change it or remove it (and that's how you generate the empty comment in the sample sheet). > I think it would make our life harder if we were trying to generate xlsx files from the generated model. Google docs does some wild stuff with the comments. A friend and I tried to make individual comments as well as comment "conversation", and apparently gdocs is not writing author information using the author tag (authors are identified using a dash followed by the name): ``` <comment ref="A1" authorId="0"> <text> <t xml:space="preserve">Foo bar baz qux -Sheet JS This is a conversation -Dev SheetJS</t> </text> </comment> ``` [![screenshot](http://i.imgur.com/wtSXNX4.png)](http://i.imgur.com/wtSXNX4.png) complete comments.xml from exported XLSX: ``` <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <comments xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> <authors> <author/> </authors> <commentList> <comment ref="A2" authorId="0"> <text> <t xml:space="preserve">5 6 7 8 -Sheet JS</t> </text> </comment> <comment ref="A2" authorId="0"> <text> <t xml:space="preserve">1 2 3 4 -Dev SheetJS</t> </text> </comment> <comment ref="A1" authorId="0"> <text> <t xml:space="preserve">Foo bar baz qux -Sheet JS This is a conversation -Dev SheetJS</t> </text> </comment> </commentList> </comments> ```

hmalphettes commented

2014-01-18 08:02:34 +00:00

(Migrated from github.com)

@SheetJSDev thanks for trying all these examples.
And thanks for reminding us of the bigger cause.

In our narrow usecase what we wanted to access were google's 'notes'
Once saved as an xlsx, they show up like this:

<?xml version="1.0" encoding="UTF-8"?>
<comments xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <authors>
    <author />
  </authors>
  <commentList>
    <comment ref="A1" authorId="0">
      <text>
        <t xml:space="preserve">{stc_datasource:'c3b00170-642b-11e3-949a-0800200c9a66',stc_object:'stc_application'}</t>
      </text>
    </comment>
  </commentList>
</comments>

The simplest case of them all so far.

@SheetJSDev thanks for trying all these examples. And thanks for reminding us of the bigger cause. In our narrow usecase what we wanted to access were google's 'notes' Once saved as an xlsx, they show up like this: ``` <?xml version="1.0" encoding="UTF-8"?> <comments xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> <authors> <author /> </authors> <commentList> <comment ref="A1" authorId="0"> <text> <t xml:space="preserve">{stc_datasource:'c3b00170-642b-11e3-949a-0800200c9a66',stc_object:'stc_application'}</t> </text> </comment> </commentList> </comments> ``` The simplest case of them all so far.

SheetJSDev commented

2014-01-18 10:00:31 +00:00

(Migrated from github.com)

@hmalphettes @kinwahlai Playing with gdocs a bit more:

google docs comments and notes are plaintext. The UI is a text field and the generated xml only uses t tags
each comment has only one text tag. I can't seem to generate a case where gdocs generates a comment with multiple text tag children.
Google treats conversations and comments differently (notes and comments are separated with ---- when a note is present, but do not appear in a cell with only comments):

<!-- with note -->
<t xml:space="preserve">note with comment test

----
5 6 7 8
    -Sheet JS
----
1 2 3 4
    -Dev SheetJS</t>

<!-- without note -->
<t xml:space="preserve">Foo bar

baz qux
    -Sheet JS
This is a conversation
    -Dev SheetJS</t>

As a result, you probably will need to do post-processing in your app, and I think the concatenation comment still applies for the individual t tags within a given text tag.

@hmalphettes @kinwahlai Playing with gdocs a bit more: 1) google docs comments and notes are plaintext. The UI is a text field and the generated xml only uses `t` tags 2) each comment has only one text tag. I can't seem to generate a case where gdocs generates a comment with multiple text tag children. 3) Google treats conversations and comments differently (notes and comments are separated with `----` when a note is present, but do not appear in a cell with only comments): ```  <t xml:space="preserve">note with comment test ---- 5 6 7 8 -Sheet JS ---- 1 2 3 4 -Dev SheetJS</t>  <t xml:space="preserve">Foo bar baz qux -Sheet JS This is a conversation -Dev SheetJS</t> ``` 4) As a result, you probably will need to do post-processing in your app, and I think the [concatenation comment](https://github.com/SheetJS/js-xlsx/pull/43#issuecomment-32477653) still applies for the individual t tags within a given text tag.

hmalphettes commented

2014-01-18 13:49:48 +00:00

(Migrated from github.com)

OK @SheetJSDev thanks again for looking into more spreadsheets. I am convinced.

I have pulled up the code that parses the rich text for sst and reused it to parse the comments.
It does the concatenation and passes the raw values along.
That would work fine for our use case.

Here is what the comment of the test xlsx looks like now:

{ a: 'Yegor Kozlov',
  t: 'Yegor Kozlov:\r\nfirst cell',
  raw: '<r><rPr><b/><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/></rPr><t>Yegor Kozlov:</t></r><r><rPr><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/></rPr><t xml:space="preserve">\r\nfirst cell</t></r>',
  r: '<span style="font-weight: bold;">Yegor Kozlov:</span><span style=""><br/>first cell</span>' }

Let us know!

OK @SheetJSDev thanks again for looking into more spreadsheets. I am convinced. I have pulled up the code that parses the rich text for sst and reused it to parse the comments. It does the concatenation and passes the raw values along. That would work fine for our use case. Here is what the comment of the test xlsx looks like now: ``` { a: 'Yegor Kozlov', t: 'Yegor Kozlov:\r\nfirst cell', raw: '<r><rPr><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/></rPr><t>Yegor Kozlov:</t></r><r><rPr><sz val="8"/><color indexed="81"/><rFont val="Tahoma"/></rPr><t xml:space="preserve">\r\nfirst cell</t></r>', r: 'Yegor Kozlov: first cell' } ``` Let us know!

SheetJSDev commented

2014-01-22 04:09:08 +00:00

(Migrated from github.com)

@hmalphettes @kinwahlai Thanks for the patches! Now for the difficult part (figuring out how to do the same thing for https://github.com/SheetJS/js-xls)