Punctuation regexp is incomplete #108

@rlidwka

I came across a discrepancy between cmark and commonmark.js output:

$ echo '**。**话' | ./cmark/build/src/cmark 
<p>****</p>
$ echo '**。**话' | ./commonmark.js/bin/commonmark 
<p><strong></strong></p>

So, according to spec v26,

A punctuation character is an ASCII punctuation character or anything in the Unicode classes Pc, Pd, Pe, Pf, Pi, Po, or Ps.

The character "。" (U+3002) belongs to the class Punctuation, Other [Po] (see http://www.fileformat.info/info/unicode/char/3002/index.htm), but it is not included in the regexp here:

https://github.com/jgm/commonmark.js/blob/3587c91c62128e54a236648ff1ac4a1ad1cd5ad8/lib/inlines.js#L41

For reference, here's the regexp from the unicode-8.0.0 package (we're using that in markdown-it), which includes this character (and appears to be much larger):

https://github.com/mathiasbynens/unicode-8.0.0/blob/master/General_Category/Punctuation/regex.js
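
To make the expected behaviour concrete, here is a minimal sketch of a check that follows the spec definition quoted above. It is not the commonmark.js or markdown-it implementation: the name `isPunctuation` is hypothetical, and it assumes a JavaScript engine that supports Unicode property escapes (`\p{P}` with the `u` flag), which a generated regexp like the one linked above does not need.

```javascript
// Hypothetical helper, not part of commonmark.js or markdown-it.
// ASCII punctuation per the CommonMark spec: U+0021-002F, U+003A-0040,
// U+005B-0060, U+007B-007E. Some of these (e.g. "$", "+") are Unicode
// symbols rather than punctuation, so they need their own range check.
const ASCII_PUNCTUATION = /^[\u0021-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E]$/;

// \p{P} is the Unicode general category "Punctuation", i.e. the union of
// Pc, Pd, Pe, Pf, Pi, Po and Ps. Requires the `u` flag.
const UNICODE_PUNCTUATION = /^\p{P}$/u;

// Returns true if `ch` (a single character) counts as punctuation
// under the spec definition quoted in this issue.
function isPunctuation(ch) {
  return ASCII_PUNCTUATION.test(ch) || UNICODE_PUNCTUATION.test(ch);
}

console.log(isPunctuation('。')); // U+3002, class Po
console.log(isPunctuation('话')); // U+8BDD, a letter (Lo)
console.log(isPunctuation('*'));  // ASCII punctuation
```

With a check like this, "。" is treated as punctuation, which is what the delimiter rules need in order to agree with the cmark output shown above.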
