Replace pyparsing with lark for better error messages #124
Open
pkienzle wants to merge 17 commits into
added 14 commits
March 5, 2026 15:45
Collaborator
Author
Most of the error messages are an improvement over the old parser, though some are still confusing. Here is a mix of accepted (***) and rejected (!!!) formulas along with the associated results and error messages. Lines marked with ## did not parse in the old parser.
$ python -m periodictable.lark_parse
!!! DNA:CAGT # incorrect case for FASTA type not properly identified
Expected one of @DENSITY[ni] COUNT SYMBOL in
DNA:CAGT
^
!!! dna CAGT # missing colon in FASTA
Expected :SEQ in
dna CAGT
^
!!! O² # SUPERCHARGE should be the only valid token here
Expected SYMBOL in
O²
^
!!! ₃H2O # badly placed subscript
Expected one of NUMBER SYMBOL aa:SEQ in
₃H2O
^
!!! // 3g Ca # // is not a comment
Expected one of NUMBER SYMBOL aa:SEQ in
// 3g Ca
^
!!! 3g Ca@ // 5g Si # missing density value
Expected @DENSITY[ni] in
3g Ca@ // 5g Si
^
!!! Ca@i # missing density value ##
Expected @DENSITY[ni] in
Ca@i
^
!!! Ca ⁺⁺ # extra space before valence
Expected one of @DENSITY[ni] NUMBER SYMBOL in
Ca ⁺⁺
^
!!! Ca++ # missing braces in valence: the + is acting as SEPARATOR
Expected one of NUMBER SYMBOL in
Ca++
^
!!! Ca2+ # missing braces in valence: the 2 is acting as COUNT and the + as SEPARATOR
Expected one of NUMBER SYMBOL in
Ca2+
^
!!! Ca{2} # missing charge in valence
Expected CHARGE[+-] in
Ca{2}
^
!!! 37 vol% H2O@1 / 5% D2O@1 # missing /
Expected one of // @DENSITY[ni] in
37 vol% H2O@1 / 5% D2O@1
^
!!! 37 vol% H2O@1 /// 5% D2O@1 # extra /
Expected one of NUMBER SYMBOL aa:SEQ in
37 vol% H2O@1 /// 5% D2O@1
^
!!! H2O@1h # bad density mode
Expected @DENSITY[ni] in
H2O@1h
^
!!! 37 vol% NaCl@2.16 // H2O@1 // D2O@1 # percent missing in middle part
Expected end of formula in
37 vol% NaCl@2.16 // H2O@1 // D2O@1
^
!!! 37 vol% H2O@1 // 5% D2O@1 # percent not allowed in last part
Expected one of // @DENSITY[ni] in
37 vol% H2O@1 // 5% D2O@1
^
!!! 37 vol% H2O@1 // 5 vol% D2O@1 # only % in subsequent parts
Expected % in
37 vol% H2O@1 // 5 vol% D2O@1
^
!!! 37% H2O@1 // D2O@1 # missing vol% or wt%
Expected one of SYMBOL UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
37% H2O@1 // D2O@1
^
!!! 37 val% H2O@1 // D2O@1 # bad spelling of vol%
Expected one of UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
37 val% H2O@1 // D2O@1
^
!!! Fe[56O2 # bad isotope syntax
Expected ] in
Fe[56O2
^
!!! Co[181] # bad isotope
'181 is not an isotope of Co'
!!! Ca{2+O2 # bad valence syntax
Expected } in
Ca{2+O2
^
!!! Co{17-} # bad valence
valence 17- is not valid for Co
!!! 3..5 mg NaCl
Expected one of SYMBOL UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
3..5 mg NaCl
^
!!! 3.5 fm Si # bad units at the start; could be wt%/vol% or LENGTH, VOLUME, MASS
Expected one of UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
3.5 fm Si
^
!!! 3.5 mm Si // 2.5 nm SiO2 //
Expected NUMBER in
3.5 mm Si // 2.5 nm SiO2 //
^
!!! 3.5 mm Si // 2.5 nm SiO2 // 35 mm cG
Expected :SEQ in
3.5 mm Si // 2.5 nm SiO2 // 35 mm cG
^
!!! ((Co) # mismatched LPAR
Expected one of @DENSITY[ni] NUMBER SYMBOL in
((Co)
^
!!! Co) # mismatched RPAR
Expected one of @DENSITY[ni] COUNT SYMBOL in
Co)
^
!!! bad:CAGT # bad sequence type
Invalid fasta sequence type 'bad:'
*** Co
=> Co @ 8.90
*** dna:CAGT
=> C₃₉H₃₇¹H₁₀N₁₅O₂₅P₄ @ 1.69
*** (Co@5) ##
=> Co @ 5.00
*** (((Co@5)@6)) ##
=> Co @ 6.00
*** CaCO3
=> CaCO₃
*** CaCO₃
=> CaCO₃
*** CaCO3+6H2O
=> CaCO₃(H₂O)₆
*** CaCO3 6H2O
=> CaCO₃(H₂O)₆
*** CaCO3(H2O)6
=> CaCO₃(H₂O)₆
*** CaCO3 (H2O)6
=> CaCO₃(H₂O)₆
*** (Ca(CO3)((H2O)6))
=> CaCO₃(H₂O)₆
*** CaCO₃·6H₂O ##
=> CaCO₃(H₂O)₆
*** DHO
=> DHO
!!! Ca{2++} # bad valence string
Use 2+ instead of 2++ for valence
*** Ca⁺⁺ # also Ca{2+} ##
=> Ca²⁺ @ 1.55
*** O²⁻ ##
=> O²⁻ @ 1.14
*** H[1]
=> ¹H @ 0.07
*** ²H⁺ # D{+} ##
=> D⁺ @ 0.14
*** O²H⁻ # OD{-} ##
=> OD⁻
*** O²⁻H⁺ # O{2-}H{+} ##
=> O²⁻H⁺
*** O²⁻²H⁺ # O{2-}D{+} ##
=> O²⁻D⁺
*** H2O@1
=> H₂O @ 1.00
*** D2O@1n
=> D₂O @ 1.11
*** D2O @ 1.11 ##
=> D₂O @ 1.11
*** D2O@1.11i
=> D₂O @ 1.11
*** HO{1-}
=> HO⁻
*** H[1]{1-}O
=> ¹H⁻O
*** H2SO4
=> H₂SO₄
*** C3H4H[1]NO@1.29n
=> C₃H₄¹HNO @ 1.29
*** 78.2H2O[16] + 21.8H2O[18] @1n # density applies to composite
=> (H₂¹⁶O)₇₈.₂(H₂¹8O)₂₁.₈ @ 1.02
*** dna:CAGT @1n # fasta density override
=> C₃₉H₃₇¹H₁₀N₁₅O₂₅P₄ @ 1.00
*** 50 wt% Co // Ti
=> CoTi₁.₂₃₁₁₈₆₂₈₇₀₀₃₅₇₂₆ @ 6.01
*** 33 wt% Co // 33% Fe // Ti
=> CoFe₁.₀₅₅₂₉₉₃₈₂₂₁₈₆₄₁Ti₁.₂₆₈₄₉₄₉₆₂₃₆₇₃₁₇₂ @ 6.50
!!! 93 wt% Co // 33% Fe // Ti # More than 100 wt%
Total weight 126% is more than 100% in wt% mixture
!!! 93 vol% Co // 33% Fe // Ti # More than 100 vol%
Total volume 126% is more than 100% in vol% mixture
*** 20 vol% (10 wt% NaCl@2.16 // H2O@1) // D2O@1n
=> NaCl(H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁(D₂O)₁₂₂.₇₈₉₅₃₅₈₈₉₁₄₅₈₅ @ 1.10
*** NaCl(H2O)29.1966(D2O)122.794@1.10i
=> NaCl(H₂O)₂₉.₁₉₆₆(D₂O)₁₂₂.₇₉₄ @ 1.10
*** 5g NaCl // 50mL H2O@1
=> NaCl(H₂O)₃₂.₄₃₉₅₀₅₅₆₇₅₈₂₅₇
*** 5g NaCl@2.16 // 50mL H2O@1
=> NaCl(H₂O)₃₂.₄₃₉₅₀₅₅₆₇₅₈₂₅₇ @ 1.05
!!! 5g NaCl // 50mL H2O # Need density for H2O to convert volume to mass
Need the mass density of H2O
*** (10 wt% NaCl // H2O)@1.07n # set density of a mixture
=> NaCl(H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁ @ 1.07
*** 50 mL (45 mL H2O@1 // 5 g NaCl)@1.0707 // 20 mL D2O@1n
=> (H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁NaCl(D₂O)₁₂.₁₁₈₉₈₉₆₅₈₁₉₈₄₀₂ @ 1.08
*** 1 cm Si // 5 nm Cr // 10 nm Au
=> Si₁₁₉₉₉₂₂.₉₇₃₇₄₆₂₄₉₅CrAu₁.₄₁₇₂₁₈₀₀₉₃₁₂₁₃₆ @ 2.33
*** aa:RELEELNVPGEIVESLSSSEESITRINKKIEKFQSEEQQQTEDELQDKIHPFAQTQSLVYPFPGPIPNSLPQNIPPLTQTPVVVPPFLQPEVMGVSKVKEAMAPKHKEMPFPKYPVEPFTESQSLTLTDVENLHLPLPLLQSWMHQPHQPLPPTVMFPPQSVLSLSQSKVLPVPQKAVPYPQRDMPIQAFLLYQEPVLGPVRGPFPIIV
=> C₁₀₈₀H₁₃₇₀¹H₃₁₉N₂₆₈O₃₁₀S₆ @ 1.27
!!! Bl2Oh # Bad symbol
Element Bl doesn't exist
!!! 5 Mg NaCl // 50mL H2O@1 # Bad units
Expected one of UNIT[mL] UNIT[mg] UNIT[mm] vol% wt% in
5 Mg NaCl // 50mL H2O@1
^
!!! 4 nm NaCl@2.17// 50 g Si # Can't use mass in layer mixture
Expected UNIT[mm] in
4 nm NaCl@2.17// 50 g Si
^
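The rejected formulas above all follow the same three-line shape: an `Expected ...` header, the formula text, and a caret marking the failure position. A minimal sketch of how such a message could be assembled is shown below. The function name `format_syntax_error` and its arguments are hypothetical, and how the PR actually obtains the column and the expected-token names from lark is not shown here; both are passed in as plain values. Sorting the names is also an assumption, chosen because it happens to reproduce the ordering seen in the listing.

```python
def format_syntax_error(text, column, expected):
    """Build an 'Expected ... in' message with a caret under `column`.

    text     -- the formula being parsed
    column   -- 0-based offset of the failure (assumed input)
    expected -- iterable of token names the parser would have accepted
    """
    names = sorted(expected)
    if len(names) == 1:
        header = f"Expected {names[0]} in"
    else:
        header = "Expected one of " + " ".join(names) + " in"
    # Pad with spaces so the caret sits under the offending character.
    caret = " " * column + "^"
    return "\n".join([header, text, caret])


if __name__ == "__main__":
    print(format_syntax_error("Ca{2}", 4, {"CHARGE[+-]"}))
    print(format_syntax_error("Co)", 2, {"SYMBOL", "COUNT", "@DENSITY[ni]"}))
```

The padding step assumes one display column per character, which may not hold for every Unicode formula.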
Collaborator
Author
Here are the error messages from pyparsing for comparison:
!!! DNA:CAGT # incorrect case for FASTA type not properly identified
unknown element A
!!! dna CAGT # missing colon in FASTA
Expected end of text, found 'dna' (at char 1), (line:1, col:2)
!!! O² # SUPERCHARGE should be the only valid token here
Expected end of text, found '²' (at char 2), (line:1, col:3)
!!! ₃H2O # badly placed subscript
Expected end of text, found '₃' (at char 1), (line:1, col:2)
!!! // 3g Ca # // is not a comment
Expected end of text, found '/' (at char 1), (line:1, col:2)
!!! 3g Ca@ // 5g Si # missing density value
Expected end of text, found '3g' (at char 1), (line:1, col:2)
!!! Ca@i # missing density value ##
!!! pyparsing fails
!!! Ca ⁺⁺ # extra space before valence
Expected end of text, found '⁺' (at char 4), (line:1, col:5)
!!! Ca++ # missing braces in valence: the + is acting as SEPARATOR
Expected end of text, found '+' (at char 3), (line:1, col:4)
!!! Ca2+ # missing braces in valence: the 2 is acting as COUNT and the + as SEPARATOR
Expected end of text, found '+' (at char 4), (line:1, col:5)
!!! Ca{2} # missing charge in valence
Expected end of text, found '{' (at char 3), (line:1, col:4)
!!! 37 vol% H2O@1 / 5% D2O@1 # missing /
Expected end of text, found '37' (at char 1), (line:1, col:2)
!!! 37 vol% H2O@1 /// 5% D2O@1 # extra /
Expected end of text, found '37' (at char 1), (line:1, col:2)
!!! H2O@1h # bad density mode
Expected end of text, found 'h' (at char 6), (line:1, col:7)
!!! 37 vol% NaCl@2.16 // H2O@1 // D2O@1 # percent missing in middle part
Expected end of text, found '37' (at char 1), (line:1, col:2)
!!! 37 vol% H2O@1 // 5% D2O@1 # percent not allowed in last part
Expected end of text, found '37' (at char 1), (line:1, col:2)
!!! 37 vol% H2O@1 // 5 vol% D2O@1 # only % in subsequent parts
Expected end of text, found '37' (at char 1), (line:1, col:2)
!!! 37% H2O@1 // D2O@1 # missing vol% or wt%
Expected end of text, found '37' (at char 1), (line:1, col:2)
!!! 37 val% H2O@1 // D2O@1 # bad spelling of vol%
Expected end of text, found '37' (at char 1), (line:1, col:2)
!!! Fe[56O2 # bad isotope syntax
Expected end of text, found '[' (at char 3), (line:1, col:4)
!!! Co[181] # bad isotope
'181 is not an isotope of Co'
!!! Ca{2+O2 # bad valence syntax
Expected end of text, found '{' (at char 3), (line:1, col:4)
!!! Co{17-} # bad valence
valence 17- is not valid for Co
!!! 3..5 mg NaCl
Expected end of text, found '3' (at char 1), (line:1, col:2)
!!! 3.5 fm Si # bad units at the start; could be wt%/vol% or LENGTH, VOLUME, MASS
Expected end of text, found '3' (at char 1), (line:1, col:2)
!!! 3.5 mm Si // 2.5 nm SiO2 //
Expected end of text, found '3' (at char 1), (line:1, col:2)
!!! 3.5 mm Si // 2.5 nm SiO2 // 35 mm cG
Expected end of text, found '3' (at char 1), (line:1, col:2)
!!! ((Co) # mismatched LPAR
Expected end of text, found '(' (at char 1), (line:1, col:2)
!!! Co) # mismatched RPAR
Expected end of text, found ')' (at char 3), (line:1, col:4)
!!! bad:CAGT # bad sequence type
Expected end of text, found 'bad' (at char 1), (line:1, col:2)
*** Co
=> Co @ 8.90
*** dna:CAGT
=> C₃₉H₃₇¹H₁₀N₁₅O₂₅P₄ @ 1.69
*** (Co@5) ##
!!! pyparsing fails
*** (((Co@5)@6)) ##
!!! pyparsing fails
*** CaCO3
=> CaCO₃
*** CaCO₃
=> CaCO₃
*** CaCO3+6H2O
=> CaCO₃(H₂O)₆
*** CaCO3 6H2O
=> CaCO₃(H₂O)₆
*** CaCO3(H2O)6
=> CaCO₃(H₂O)₆
*** CaCO3 (H2O)6
=> CaCO₃(H₂O)₆
*** (Ca(CO3)((H2O)6))
=> CaCO₃(H₂O)₆
*** CaCO₃·6H₂O ##
!!! pyparsing fails
*** DHO
=> DHO
!!! Ca{2++} # bad valence string
Expected end of text, found '{' (at char 2), (line:1, col:3)
*** Ca⁺⁺ # also Ca{2+} ##
!!! pyparsing fails
*** O²⁻ ##
!!! pyparsing fails
*** H[1]
=> ¹H @ 0.07
*** ²H⁺ # D{+} ##
!!! pyparsing fails
*** O²H⁻ # OD{-} ##
!!! pyparsing fails
*** O²⁻H⁺ # O{2-}H{+} ##
!!! pyparsing fails
*** O²⁻²H⁺ # O{2-}D{+} ##
!!! pyparsing fails
*** H2O@1
=> H₂O @ 1.00
*** D2O@1n
=> D₂O @ 1.11
*** D2O @ 1.11 ##
!!! pyparsing fails
*** D2O@1.11i
=> D₂O @ 1.11
*** HO{1-}
=> HO⁻
*** H[1]{1-}O
=> ¹H⁻O
*** H2SO4
=> H₂SO₄
*** C3H4H[1]NO@1.29n
=> C₃H₄¹HNO @ 1.29
*** 78.2H2O[16] + 21.8H2O[18] @1n # density applies to composite
=> (H₂¹⁶O)₇₈.₂(H₂¹8O)₂₁.₈ @ 1.02
*** dna:CAGT @1n # fasta density override
=> C₃₉H₃₇¹H₁₀N₁₅O₂₅P₄ @ 1.00
*** 50 wt% Co // Ti
=> CoTi₁.₂₃₁₁₈₆₂₈₇₀₀₃₅₇₂₆ @ 6.01
*** 33 wt% Co // 33% Fe // Ti
=> CoFe₁.₀₅₅₂₉₉₃₈₂₂₁₈₆₄₁Ti₁.₂₆₈₄₉₄₉₆₂₃₆₇₃₁₇₂ @ 6.50
!!! 93 wt% Co // 33% Fe // Ti # More than 100 wt%
Expected end of text, found '93' (at char 1), (line:1, col:2)
!!! 93 vol% Co // 33% Fe // Ti # More than 100 vol%
Expected end of text, found '93' (at char 1), (line:1, col:2)
*** 20 vol% (10 wt% NaCl@2.16 // H2O@1) // D2O@1n
=> NaCl(H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁(D₂O)₁₂₂.₇₈₉₅₃₅₈₈₉₁₄₅₈₅ @ 1.10
*** NaCl(H2O)29.1966(D2O)122.794@1.10i
=> NaCl(H₂O)₂₉.₁₉₆₆(D₂O)₁₂₂.₇₉₄ @ 1.10
*** 5g NaCl // 50mL H2O@1
=> NaCl(H₂O)₃₂.₄₃₉₅₀₅₅₆₇₅₈₂₅₇
*** 5g NaCl@2.16 // 50mL H2O@1
=> NaCl(H₂O)₃₂.₄₃₉₅₀₅₅₆₇₅₈₂₅₇ @ 1.05
!!! 5g NaCl // 50mL H2O # Need density for H2O to convert volume to mass
Expected end of text, found '5g' (at char 1), (line:1, col:2)
*** (10 wt% NaCl // H2O)@1.07n # set density of a mixture
=> NaCl(H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁ @ 1.07
*** 50 mL (45 mL H2O@1 // 5 g NaCl)@1.0707 // 20 mL D2O@1n
=> (H₂O)₂₉.₁₉₅₅₅₅₀₁₀₈₂₄₃₁NaCl(D₂O)₁₂.₁₁₈₉₈₉₆₅₈₁₉₈₄₀₂ @ 1.08
*** 1 cm Si // 5 nm Cr // 10 nm Au
=> Si₁₁₉₉₉₂₂.₉₇₃₇₄₆₂₄₉₅CrAu₁.₄₁₇₂₁₈₀₀₉₃₁₂₁₃₆ @ 2.33
*** aa:RELEELNVPGEIVESLSSSEESITRINKKIEKFQSEEQQQTEDELQDKIHPFAQTQSLVYPFPGPIPNSLPQNIPPLTQTPVVVPPFLQPEVMGVSKVKEAMAPKHKEMPFPKYPVEPFTESQSLTLTDVENLHLPLPLLQSWMHQPHQPLPPTVMFPPQSVLSLSQSKVLPVPQKAVPYPQRDMPIQAFLLYQEPVLGPVRGPFPIIV
=> C₁₀₈₀H₁₃₇₀¹H₃₁₉N₂₆₈O₃₁₀S₆ @ 1.27
!!! Bl2Oh # Bad symbol
unknown element Bl
!!! 5 Mg NaCl // 50mL H2O@1 # Bad units
Expected end of text, found '5' (at char 1), (line:1, col:2)
!!! 4 nm NaCl@2.17// 50 g Si # Can't use mass in layer mixture
Expected end of text, found '4' (at char 1), (line:1, col:2)
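The wt% mixture results in the listing can be checked by hand. The sketch below is one way to reproduce the mole ratios for `50 wt% Co // Ti` and `33 wt% Co // 33% Fe // Ti`, together with a total check that raises an error when the percentages exceed 100. The helper name `wt_mixture` and the atomic masses are assumptions for illustration (standard values, not read from the PR), and the code is not the PR's implementation.

```python
# Standard atomic masses in g/mol (assumed values for this sketch).
MASS = {"Co": 58.933194, "Fe": 55.845, "Ti": 47.867}


def wt_mixture(parts, remainder):
    """Convert a wt% mixture to mole ratios relative to the first component.

    parts     -- list of (percent, symbol) for the explicit components
    remainder -- symbol of the final component, which takes the rest
    """
    total = sum(percent for percent, _ in parts)
    if total > 100:
        raise ValueError(
            f"Total weight {total:g}% is more than 100% in wt% mixture")
    fractions = list(parts) + [(100 - total, remainder)]
    # Moles are proportional to mass fraction divided by atomic mass.
    moles = {symbol: percent / MASS[symbol] for percent, symbol in fractions}
    first = moles[parts[0][1]]
    return {symbol: n / first for symbol, n in moles.items()}


if __name__ == "__main__":
    print(wt_mixture([(50, "Co")], "Ti"))
    print(wt_mixture([(33, "Co"), (33, "Fe")], "Ti"))
```

With these masses the Ti ratio for the 50 wt% case is 58.933194 / 47.867, which agrees with the subscript in the listing to the digits shown.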
This is a standalone program for exploring lark as a replacement for pyparsing.
It produces better error messages (#34) and is more robust than the existing parser.
parse_formula should be a drop-in replacement for periodictable.formulas.parse_formula, but it hasn't yet been fully tested.
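Because the new parser is meant to be a drop-in replacement, the same list of formulas can be run through either parser and the outcomes compared side by side. The sketch below shows a small harness that produces listings in the `***` / `!!!` style used in this conversation. The names `report` and `toy_parse` are hypothetical, and `toy_parse` is only a stand-in so the example runs without periodictable installed; in practice either parser's `parse_formula` would be passed in.

```python
def report(parse, formulas):
    """Run `parse` over each formula and return listing lines.

    Accepted formulas yield '*** formula' followed by '=> result';
    rejected formulas yield '!!! formula' followed by the error text.
    """
    lines = []
    for formula in formulas:
        try:
            result = parse(formula)
        except Exception as exc:  # any parser error counts as a rejection
            lines.append(f"!!! {formula}")
            lines.append(str(exc))
        else:
            lines.append(f"*** {formula}")
            lines.append(f"=> {result}")
    return lines


def toy_parse(text):
    """Stand-in parser for the sketch: accepts only alphanumeric text."""
    if not text.isalnum():
        raise ValueError(f"unexpected character in {text!r}")
    return text


if __name__ == "__main__":
    for line in report(toy_parse, ["CaCO3", "Co)"]):
        print(line)
```

Calling `report` once per parser on the same inputs gives two listings that can be diffed to find formulas where the behaviour changed.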