Replace pyparsing with lark for better error messages#124

Open
pkienzle wants to merge 17 commits into master from use-lark-parse

@pkienzle
Collaborator

@pkienzle pkienzle commented Mar 5, 2026

This is a standalone program for exploring lark as a replacement for pyparsing.

It produces better error messages (#34) and is more robust than the existing parser.

parse_formula should be a drop-in replacement for periodictable.formulas.parse_formula, but it hasn't been fully tested yet.
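One way to check the drop-in claim is to run both parsers over the same list of formulas and report where they disagree. The following is a minimal sketch of such a harness, not part of this PR: `compare_parsers` is a hypothetical helper, and `old_stub` and `new_stub` are stand-ins so the example runs without periodictable installed. In practice the two callables would be the pyparsing-based and lark-based `parse_formula` functions.

```python
def compare_parsers(old_parse, new_parse, formulas):
    """Return (formula, old_outcome, new_outcome) wherever the parsers disagree.

    An outcome is ("ok", str(result)) or ("error", ""). Error text is ignored
    on purpose, since the two parsers are expected to word errors differently.
    """
    def run(parse, text):
        try:
            return ("ok", str(parse(text)))
        except Exception:
            return ("error", "")

    mismatches = []
    for text in formulas:
        old, new = run(old_parse, text), run(new_parse, text)
        if old != new:
            mismatches.append((text, old, new))
    return mismatches


# Stand-in parsers (assumptions, for illustration only).
def old_stub(text):
    if "·" in text:
        raise ValueError("unsupported separator")
    return text.strip()


def new_stub(text):
    return text.strip()


mismatches = compare_parsers(old_stub, new_stub, ["CaCO3", "CaCO₃·6H₂O"])
for text, old, new in mismatches:
    print(f"{text!r}: old={old[0]} new={new[0]}")
```

Formulas that only the new parser accepts will show up as mismatches, so they need to be reviewed by hand rather than treated as regressions.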

@pkienzle
Collaborator Author

pkienzle commented May 22, 2026

Most of the error messages are an improvement over the old parser, though some of them are still confusing.

Here's a mix of accepted and rejected formulas along with the associated error messages. Note that lines marked with ## didn't parse in the old parser.

$ python -m periodictable.lark_parse

!!!  DNA:CAGT  # incorrect case for FASTA type not properly identified
Expected one of @DENSITY[ni] COUNT SYMBOL in
 DNA:CAGT  
    ^

!!!  dna CAGT  # missing colon in FASTA
Expected :SEQ in
 dna CAGT  
    ^

!!!# SUPERCHARGE should be the only valid token here
Expected SYMBOL in
 O²  
   ^

!!!  ₃H2O  # badly placed subscript
Expected one of NUMBER SYMBOL aa:SEQ in
 ₃H2O  
 ^

!!!  // 3g Ca  # // is not a comment
Expected one of NUMBER SYMBOL aa:SEQ in
 // 3g Ca  
 ^

!!!  3g Ca@ // 5g Si # missing density value
Expected @DENSITY[ni] in
 3g Ca@ // 5g Si 
        ^

!!!  Ca@i  # missing density value  ##
Expected @DENSITY[ni] in
 Ca@i  
    ^

!!!  Ca ⁺⁺  # extra space before valence
Expected one of @DENSITY[ni] NUMBER SYMBOL in
 Ca ⁺⁺  
    ^

!!!  Ca++  # missing braces in valence: the + is acting as SEPARATOR
Expected one of NUMBER SYMBOL in
 Ca++  
    ^

!!!  Ca2+  # missing braces in valence: the 2 is acting as COUNT and the + as SEPARATOR
Expected one of NUMBER SYMBOL in
 Ca2+  
      ^

!!!  Ca{2}  # missing charge in valence
Expected CHARGE[+-] in
 Ca{2}  
     ^

!!!  37 vol% H2O@1 / 5% D2O@1  # missing /
Expected one of // @DENSITY[ni] in
 37 vol% H2O@1 / 5% D2O@1  
              ^

!!!  37 vol% H2O@1 /// 5% D2O@1  # extra /
Expected one of NUMBER SYMBOL aa:SEQ in
 37 vol% H2O@1 /// 5% D2O@1  
                 ^

!!!  H2O@1h  # bad density mode
Expected @DENSITY[ni] in
 H2O@1h  
      ^

!!!  37 vol% NaCl@2.16 // H2O@1 // D2O@1  # percent missing in middle part
Expected end of formula in
 37 vol% NaCl@2.16 // H2O@1 // D2O@1  
                            ^

!!!  37 vol% H2O@1 // 5% D2O@1  # percent not allowed in last part
Expected one of // @DENSITY[ni] in
 37 vol% H2O@1 // 5% D2O@1  
                          ^

!!!  37 vol% H2O@1 // 5 vol% D2O@1  # only % in subsequent parts
Expected % in
 37 vol% H2O@1 // 5 vol% D2O@1  
                    ^

!!!  37% H2O@1 // D2O@1  # missing vol% or wt%
Expected one of SYMBOL UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
 37% H2O@1 // D2O@1  
   ^

!!!  37 val% H2O@1 // D2O@1  # bad spelling of vol%
Expected one of UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
 37 val% H2O@1 // D2O@1  
    ^

!!!  Fe[56O2 # bad isotope syntax
Expected ] in
 Fe[56O2 
      ^

!!!  Co[181]  # bad isotope
'181 is not an isotope of Co'

!!!  Ca{2+O2  # bad valence syntax
Expected } in
 Ca{2+O2  
      ^

!!!  Co{17-}  # bad valence
valence 17- is not valid for Co

!!!  3..5 mg NaCl
Expected one of SYMBOL UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
 3..5 mg NaCl
   ^

!!!  3.5 fm Si # bad units at the start; could be wt%/vol% or LENGTH, VOLUME, MASS 
Expected one of UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
 3.5 fm Si 
     ^

!!!  3.5 mm Si // 2.5 nm SiO2 //
Expected NUMBER in
 3.5 mm Si // 2.5 nm SiO2 //
                           ^

!!!  3.5 mm Si // 2.5 nm SiO2 // 35 mm cG
Expected :SEQ in
 3.5 mm Si // 2.5 nm SiO2 // 35 mm cG
                                    ^

!!!  ((Co) # mismatched LPAR
Expected one of @DENSITY[ni] NUMBER SYMBOL in
 ((Co) 
      ^

!!!  Co)  # mismatched RPAR
Expected one of @DENSITY[ni] COUNT SYMBOL in
 Co)  
   ^

!!!  bad:CAGT  # bad sequence type
Invalid fasta sequence type 'bad:'

*** Co
 => Co @ 8.90

*** dna:CAGT
 => C₃₉H₃₇¹H₁₀N₁₅O₂₅P₄ @ 1.69

*** (Co@5) ##
 => Co @ 5.00

*** (((Co@5)@6)) ##
 => Co @ 6.00

*** CaCO3
 => CaCO₃

*** CaCO₃
 => CaCO₃

*** CaCO3+6H2O
 => CaCO₃(H₂O)₆

*** CaCO3 6H2O
 => CaCO₃(H₂O)₆

*** CaCO3(H2O)6
 => CaCO₃(H₂O)₆

*** CaCO3 (H2O)6
 => CaCO₃(H₂O)₆

*** (Ca(CO3)((H2O)6))
 => CaCO₃(H₂O)₆

*** CaCO₃·6H₂O  ##
 => CaCO₃(H₂O)₆

*** DHO
 => DHO

!!! Ca{2++}  # bad valence string
Use 2+ instead of 2++ for valence

*** Ca⁺⁺  # also Ca{2+}  ##
 => Ca²⁺ @ 1.55

*** O²⁻   ##
 => O²⁻ @ 1.14

*** H[1]
 => ¹H @ 0.07

*** ²H⁺    # D{+} ##
 => D⁺ @ 0.14

*** O²H⁻   # OD{-} ##
 => OD⁻

*** O²⁻H⁺  # O{2-}H{+} ##
 => O²⁻H⁺

*** O²⁻²H⁺ # O{2-}D{+} ##
 => O²⁻D⁺

*** H2O@1
 => H₂O @ 1.00

*** D2O@1n
 => D₂O @ 1.11

*** D2O @ 1.11  ##
 => D₂O @ 1.11

*** D2O@1.11i
 => D₂O @ 1.11

*** HO{1-}
 => HO⁻

*** H[1]{1-}O
 => ¹H⁻O

*** H2SO4
 => H₂SO₄

*** C3H4H[1]NO@1.29n
 => C₃H₄¹HNO @ 1.29

*** 78.2H2O[16] + 21.8H2O[18] @1n  # density applies to composite
 => (H₂¹⁶O)₇₈.₂(H₂¹8O)₂₁.₈ @ 1.02

*** dna:CAGT @1n  # fasta density override
 => C₃₉H₃₇¹H₁₀N₁₅O₂₅P₄ @ 1.00

*** 50 wt% Co // Ti
 => CoTi₁.₂₃₁₁₈₆₂₈₇₀₀₃₅₇₂₆ @ 6.01

*** 33 wt% Co // 33% Fe // Ti
 => CoFe₁.₀₅₅₂₉₉₃₈₂₂₁₈₆₄₁Ti₁.₂₆₈₄₉₄₉₆₂₃₆₇₃₁₇₂ @ 6.50

!!!  93 wt% Co // 33% Fe // Ti  # More than 100 wt%
Total weight 126% is more than 100% in wt% mixture

!!!  93 vol% Co // 33% Fe // Ti  # More than 100 vol%
Total volume 126% is more than 100% in vol% mixture

*** 20 vol% (10 wt% NaCl@2.16 // H2O@1) // D2O@1n
 => NaCl(H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁(D₂O)₁₂₂.₇₈₉₅₃₅₈₈₉₁₄₅₈₅ @ 1.10

*** NaCl(H2O)29.1966(D2O)122.794@1.10i
 => NaCl(H₂O)₂₉.₁₉₆₆(D₂O)₁₂₂.₇₉₄ @ 1.10

*** 5g NaCl // 50mL H2O@1
 => NaCl(H₂O)₃₂.₄₃₉₅₀₅₅₆₇₅₈₂₅₇

*** 5g NaCl@2.16 // 50mL H2O@1
 => NaCl(H₂O)₃₂.₄₃₉₅₀₅₅₆₇₅₈₂₅₇ @ 1.05

!!!  5g NaCl // 50mL H2O   # Need density for H2O to convert volume to mass
Need the mass density of H2O

*** (10 wt% NaCl // H2O)@1.07n # set density of a mixture
 => NaCl(H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁ @ 1.07

*** 50 mL (45 mL H2O@1 // 5 g NaCl)@1.0707 // 20 mL D2O@1n
 => (H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁NaCl(D₂O)₁₂.₁₁₈₉₈₉₆₅₈₁₉₈₄₀₂ @ 1.08

*** 1 cm Si // 5 nm Cr // 10 nm Au
 => Si₁₁₉₉₉₂₂.₉₇₃₇₄₆₂₄₉₅CrAu₁.₄₁₇₂₁₈₀₀₉₃₁₂₁₃₆ @ 2.33

*** aa:RELEELNVPGEIVESLSSSEESITRINKKIEKFQSEEQQQTEDELQDKIHPFAQTQSLVYPFPGPIPNSLPQNIPPLTQTPVVVPPFLQPEVMGVSKVKEAMAPKHKEMPFPKYPVEPFTESQSLTLTDVENLHLPLPLLQSWMHQPHQPLPPTVMFPPQSVLSLSQSKVLPVPQKAVPYPQRDMPIQAFLLYQEPVLGPVRGPFPIIV
 => C₁₀₈₀H₁₃₇₀¹H₃₁₉N₂₆₈O₃₁₀S₆ @ 1.27

!!!  Bl2Oh   # Bad symbol
Element Bl doesn't exist

!!!  5 Mg NaCl // 50mL H2O@1  # Bad units
Expected one of UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
 5 Mg NaCl // 50mL H2O@1  
   ^

!!!  4 nm NaCl@2.17// 50 g Si  # Can't use mass in layer mixture
Expected UNIT[mm] in
 4 nm NaCl@2.17// 50 g Si  
                     ^
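The syntax errors above share one layout: a line naming the expected tokens, the formula text, and a caret under the offending column. A minimal sketch of that formatting step follows. It assumes the parser has already supplied a 0-based column and a set of expected token names. `format_parse_error` is a hypothetical name, and sorting the names is an assumption that happens to match the ordering in the listing.

```python
def format_parse_error(text, column, expected):
    """Build a three-line message: expected tokens, the input, a caret marker."""
    names = sorted(expected)
    if len(names) == 1:
        head = f"Expected {names[0]} in"
    else:
        head = "Expected one of " + " ".join(names) + " in"
    # The caret sits directly under the character at the given column.
    return "\n".join([head, text, " " * column + "^"])


print(format_parse_error("Ca{2}", 4, {"CHARGE[+-]"}))
print(format_parse_error("Ca++", 3, {"SYMBOL", "NUMBER"}))
```

Passing the expected names as a set keeps the message stable no matter what order the parser reports them in.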

@pkienzle pkienzle marked this pull request as ready for review May 22, 2026 16:26
@pkienzle
Collaborator Author

Here are the error messages from pyparsing for comparison:

!!!  DNA:CAGT  # incorrect case for FASTA type not properly identified
unknown element A

!!!  dna CAGT  # missing colon in FASTA
Expected end of text, found 'dna'  (at char 1), (line:1, col:2)

!!!# SUPERCHARGE should be the only valid token here
Expected end of text, found '²'  (at char 2), (line:1, col:3)

!!!  ₃H2O  # badly placed subscript
Expected end of text, found ''  (at char 1), (line:1, col:2)

!!!  // 3g Ca  # // is not a comment
Expected end of text, found '/'  (at char 1), (line:1, col:2)

!!!  3g Ca@ // 5g Si # missing density value
Expected end of text, found '3g'  (at char 1), (line:1, col:2)

!!!  Ca@i  # missing density value  ##
!!! pyparsing fails

!!!  Ca ⁺⁺  # extra space before valence
Expected end of text, found ''  (at char 4), (line:1, col:5)

!!!  Ca++  # missing braces in valence: the + is acting as SEPARATOR
Expected end of text, found '+'  (at char 3), (line:1, col:4)

!!!  Ca2+  # missing braces in valence: the 2 is acting as COUNT and the + as SEPARATOR
Expected end of text, found '+'  (at char 4), (line:1, col:5)

!!!  Ca{2}  # missing charge in valence
Expected end of text, found '{'  (at char 3), (line:1, col:4)

!!!  37 vol% H2O@1 / 5% D2O@1  # missing /
Expected end of text, found '37'  (at char 1), (line:1, col:2)

!!!  37 vol% H2O@1 /// 5% D2O@1  # extra /
Expected end of text, found '37'  (at char 1), (line:1, col:2)

!!!  H2O@1h  # bad density mode
Expected end of text, found 'h'  (at char 6), (line:1, col:7)

!!!  37 vol% NaCl@2.16 // H2O@1 // D2O@1  # percent missing in middle part
Expected end of text, found '37'  (at char 1), (line:1, col:2)

!!!  37 vol% H2O@1 // 5% D2O@1  # percent not allowed in last part
Expected end of text, found '37'  (at char 1), (line:1, col:2)

!!!  37 vol% H2O@1 // 5 vol% D2O@1  # only % in subsequent parts
Expected end of text, found '37'  (at char 1), (line:1, col:2)

!!!  37% H2O@1 // D2O@1  # missing vol% or wt%
Expected end of text, found '37'  (at char 1), (line:1, col:2)

!!!  37 val% H2O@1 // D2O@1  # bad spelling of vol%
Expected end of text, found '37'  (at char 1), (line:1, col:2)

!!!  Fe[56O2 # bad isotope syntax
Expected end of text, found '['  (at char 3), (line:1, col:4)

!!!  Co[181]  # bad isotope
'181 is not an isotope of Co'

!!!  Ca{2+O2  # bad valence syntax
Expected end of text, found '{'  (at char 3), (line:1, col:4)

!!!  Co{17-}  # bad valence
valence 17- is not valid for Co

!!!  3..5 mg NaCl
Expected end of text, found '3'  (at char 1), (line:1, col:2)

!!!  3.5 fm Si # bad units at the start; could be wt%/vol% or LENGTH, VOLUME, MASS 
Expected end of text, found '3'  (at char 1), (line:1, col:2)

!!!  3.5 mm Si // 2.5 nm SiO2 //
Expected end of text, found '3'  (at char 1), (line:1, col:2)

!!!  3.5 mm Si // 2.5 nm SiO2 // 35 mm cG
Expected end of text, found '3'  (at char 1), (line:1, col:2)

!!!  ((Co) # mismatched LPAR
Expected end of text, found '('  (at char 1), (line:1, col:2)

!!!  Co)  # mismatched RPAR
Expected end of text, found ')'  (at char 3), (line:1, col:4)

!!!  bad:CAGT  # bad sequence type
Expected end of text, found 'bad'  (at char 1), (line:1, col:2)

*** Co
 => Co @ 8.90

*** dna:CAGT
 => C₃₉H₃₇¹H₁₀N₁₅O₂₅P₄ @ 1.69

*** (Co@5) ##
!!! pyparsing fails

*** (((Co@5)@6)) ##
!!! pyparsing fails

*** CaCO3
 => CaCO₃

*** CaCO₃
 => CaCO₃

*** CaCO3+6H2O
 => CaCO₃(H₂O)₆

*** CaCO3 6H2O
 => CaCO₃(H₂O)₆

*** CaCO3(H2O)6
 => CaCO₃(H₂O)₆

*** CaCO3 (H2O)6
 => CaCO₃(H₂O)₆

*** (Ca(CO3)((H2O)6))
 => CaCO₃(H₂O)₆

*** CaCO₃·6H₂O  ##
!!! pyparsing fails

*** DHO
 => DHO

!!! Ca{2++}  # bad valence string
Expected end of text, found '{'  (at char 2), (line:1, col:3)

*** Ca⁺⁺  # also Ca{2+}  ##
!!! pyparsing fails

*** O²⁻   ##
!!! pyparsing fails

*** H[1]
 => ¹H @ 0.07

*** ²H⁺    # D{+} ##
!!! pyparsing fails

*** O²H⁻   # OD{-} ##
!!! pyparsing fails

*** O²⁻H⁺  # O{2-}H{+} ##
!!! pyparsing fails

*** O²⁻²H⁺ # O{2-}D{+} ##
!!! pyparsing fails

*** H2O@1
 => H₂O @ 1.00

*** D2O@1n
 => D₂O @ 1.11

*** D2O @ 1.11  ##
!!! pyparsing fails

*** D2O@1.11i
 => D₂O @ 1.11

*** HO{1-}
 => HO⁻

*** H[1]{1-}O
 => ¹H⁻O

*** H2SO4
 => H₂SO₄

*** C3H4H[1]NO@1.29n
 => C₃H₄¹HNO @ 1.29

*** 78.2H2O[16] + 21.8H2O[18] @1n  # density applies to composite
 => (H₂¹⁶O)₇₈.₂(H₂¹8O)₂₁.₈ @ 1.02

*** dna:CAGT @1n  # fasta density override
 => C₃₉H₃₇¹H₁₀N₁₅O₂₅P₄ @ 1.00

*** 50 wt% Co // Ti
 => CoTi₁.₂₃₁₁₈₆₂₈₇₀₀₃₅₇₂₆ @ 6.01

*** 33 wt% Co // 33% Fe // Ti
 => CoFe₁.₀₅₅₂₉₉₃₈₂₂₁₈₆₄₁Ti₁.₂₆₈₄₉₄₉₆₂₃₆₇₃₁₇₂ @ 6.50

!!!  93 wt% Co // 33% Fe // Ti  # More than 100 wt%
Expected end of text, found '93'  (at char 1), (line:1, col:2)

!!!  93 vol% Co // 33% Fe // Ti  # More than 100 vol%
Expected end of text, found '93'  (at char 1), (line:1, col:2)

*** 20 vol% (10 wt% NaCl@2.16 // H2O@1) // D2O@1n
 => NaCl(H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁(D₂O)₁₂₂.₇₈₉₅₃₅₈₈₉₁₄₅₈₅ @ 1.10

*** NaCl(H2O)29.1966(D2O)122.794@1.10i
 => NaCl(H₂O)₂₉.₁₉₆₆(D₂O)₁₂₂.₇₉₄ @ 1.10

*** 5g NaCl // 50mL H2O@1
 => NaCl(H₂O)₃₂.₄₃₉₅₀₅₅₆₇₅₈₂₅₇

*** 5g NaCl@2.16 // 50mL H2O@1
 => NaCl(H₂O)₃₂.₄₃₉₅₀₅₅₆₇₅₈₂₅₇ @ 1.05

!!!  5g NaCl // 50mL H2O   # Need density for H2O to convert volume to mass
Expected end of text, found '5g'  (at char 1), (line:1, col:2)

*** (10 wt% NaCl // H2O)@1.07n # set density of a mixture
 => NaCl(H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁ @ 1.07

*** 50 mL (45 mL H2O@1 // 5 g NaCl)@1.0707 // 20 mL D2O@1n
 => (H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁NaCl(D₂O)₁₂.₁₁₈₉₈₉₆₅₈₁₉₈₄₀₂ @ 1.08

*** 1 cm Si // 5 nm Cr // 10 nm Au
 => Si₁₁₉₉₉₂₂.₉₇₃₇₄₆₂₄₉₅CrAu₁.₄₁₇₂₁₈₀₀₉₃₁₂₁₃₆ @ 2.33

*** aa:RELEELNVPGEIVESLSSSEESITRINKKIEKFQSEEQQQTEDELQDKIHPFAQTQSLVYPFPGPIPNSLPQNIPPLTQTPVVVPPFLQPEVMGVSKVKEAMAPKHKEMPFPKYPVEPFTESQSLTLTDVENLHLPLPLLQSWMHQPHQPLPPTVMFPPQSVLSLSQSKVLPVPQKAVPYPQRDMPIQAFLLYQEPVLGPVRGPFPIIV
 => C₁₀₈₀H₁₃₇₀¹H₃₁₉N₂₆₈O₃₁₀S₆ @ 1.27

!!!  Bl2Oh   # Bad symbol
unknown element Bl

!!!  5 Mg NaCl // 50mL H2O@1  # Bad units
Expected end of text, found '5'  (at char 1), (line:1, col:2)

!!!  4 nm NaCl@2.17// 50 g Si  # Can't use mass in layer mixture
Expected end of text, found '4'  (at char 1), (line:1, col:2)
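Some of the new messages, such as "Total weight 126% is more than 100% in wt% mixture", describe a problem with the values rather than the syntax, so they can only be produced after the text has been tokenized. A minimal sketch of such a check follows. It assumes the percentages for all but the last part are already available as numbers. `remaining_percent` is a hypothetical helper, not the function used in this PR.

```python
def remaining_percent(percents, kind):
    """Return the percentage left for the final part of a mixture.

    percents: values given for every part except the last.
    kind: "wt%" or "vol%".
    """
    label = {"wt%": "weight", "vol%": "volume"}[kind]
    total = sum(percents)
    if total > 100:
        raise ValueError(
            f"Total {label} {total:g}% is more than 100% in {kind} mixture"
        )
    return 100 - total


print(remaining_percent([33, 33], "wt%"))  # share left for the last part
try:
    remaining_percent([93, 33], "wt%")
except ValueError as exc:
    print(exc)
```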

@pkienzle pkienzle changed the title Explore lark as replacement for the pyparsing formula parser Replace pyparsing with lark for better error messages May 22, 2026