Thanks for a great library! While using it for Yiddish, I noticed that some of the transliterations do not conform to the YIVO romanization standard.
To pinpoint what kind of errors uroman is making, I conducted a romanization experiment using the data from Saleva (2020) and another library for Yiddish romanization called yiddish.
Here are some benchmark numbers, using accuracy and mean F1 score as defined in the Proceedings of the Seventh Named Entities Workshop:
| library | mean_f1 | accuracy |
|---|---|---|
| uroman | 0.937 | 0.458 |
| yiddish | 0.990 | 0.936 |
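
For anyone who wants to reproduce numbers like these, here is a minimal sketch of the two metrics as I understand them from the NEWS shared-task definition: accuracy is the exact-match rate, and F1 is computed per word from the longest common subsequence (LCS) between the candidate and the reference. It assumes a single reference per word. The function names (`lcs_length`, `f1`, `evaluate`) are my own and are not part of uroman or yiddish.

```python
from typing import Iterable, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f1(candidate: str, reference: str) -> float:
    """LCS-based F1 between one candidate and one reference romanization."""
    if not candidate or not reference:
        # Assumption: two empty strings count as a perfect match.
        return 1.0 if candidate == reference else 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (mean_f1, accuracy) over (candidate, reference) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (candidate, reference) pair")
    mean_f1 = sum(f1(c, r) for c, r in pairs) / len(pairs)
    accuracy = sum(c == r for c, r in pairs) / len(pairs)
    return mean_f1, accuracy
```

This also shows why the two columns can diverge so much: a single wrong letter makes a word count as an error for accuracy, while its F1 stays close to 1.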
The diffs for what uroman gets wrong can be found here. Many seem to be i/y mismatches as well as Hebrew expansion errors:
-aforizm
+aforyzm
-aparatshik
-apteyk
-apteyker
-apikoyres
+aparatshyk
+aptyyk
+aptyyker
+apykurs
-apetit
+apetyt
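
To put a rough number on the i/y mismatches, here is a small sketch that checks whether two romanizations differ only by i/y substitutions. It assumes the `-` lines above are the expected YIVO forms and the `+` lines are uroman's output, paired in order; the helper name `is_iy_only` is hypothetical.

```python
def is_iy_only(expected: str, actual: str) -> bool:
    """True if the strings differ, have equal length, and every
    differing position is an i/y swap."""
    if expected == actual or len(expected) != len(actual):
        return False
    return all(e == a or {e, a} == {"i", "y"} for e, a in zip(expected, actual))


# (expected, actual) pairs taken from the excerpt above,
# assuming "-" is the YIVO form and "+" is uroman's output.
pairs = [
    ("aforizm", "aforyzm"),
    ("aparatshik", "aparatshyk"),
    ("apteyk", "aptyyk"),
    ("apteyker", "aptyyker"),
    ("apikoyres", "apykurs"),
    ("apetit", "apetyt"),
]

iy_only = [p for p in pairs if is_iy_only(*p)]
print(f"{len(iy_only)} of {len(pairs)} errors are pure i/y mismatches")
```

The remaining pairs involve other substitutions (such as e/y in apteyk) or the Hebrew-origin words whose length changes, so they need separate handling.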
Would it be possible to implement the -l yid flag such that the output conforms to the YIVO romanization standard?
As far as I'm aware, it's by far the most used romanization format for Yiddish.
Thanks!