Description
When converting a source string and a target string into a list of edits, a target string that contains the character 'K' can produce a malformed edit string.
This results in incorrect edit strings for word-edits-append and subword-edits-append.
Steps to Reproduce
from tokenizer import Tokenizer
from alignment.aligner import word_level_alignment, char_level_alignment
from create_edits import create_edits
tokenizer = Tokenizer("bert-base-uncased")
# The target prepends the single token "K" to the source.
src_sent = "test"
tgt_sent = "K test"
# Align at word level first, then refine to character level.
word_level_align = word_level_alignment(src_sent=src_sent, tgt_sent=tgt_sent)
char_level_align = char_level_alignment(word_level_align)
edits = create_edits(char_level_align, word_level_align, tokenizer)
# Both append variants show the malformed edit.
print(edits['word-edits-append'])
print(edits['subword-edits-append'])
The output for the append edits incorrectly duplicates the 'K' inside the brackets:
[{
"subword": "test",
"raw_subword": "test",
"edit": "A_[KKKK]KKKK"
}]
[{
"subword": "test",
"raw_subword": "test",
"edit": "A_[KKKK]KKKK"
}]
Both outputs contain an A_[KKKK]KKKK edit where we would expect an A_[K]KKKK edit.
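To make the difference concrete, here is a minimal sketch that splits an append edit into its bracketed payload and the operations that follow it. The helper split_append_edit is hypothetical and not part of the repository. It also assumes that the text inside the brackets is the appended content and that the characters after the closing bracket are per-character operations on "test"; that reading is inferred from the expected output above, not confirmed from the code.

```python
import re
from typing import Tuple

# Hypothetical helper for illustration only; it does not exist in the repository.
# Assumed format: "A_[<appended text>]<per-character operations>".
APPEND_EDIT = re.compile(r"^A_\[(?P<payload>.*?)\](?P<ops>.*)$")


def split_append_edit(edit: str) -> Tuple[str, str]:
    """Return (payload, ops) for an append edit string."""
    match = APPEND_EDIT.match(edit)
    if match is None:
        raise ValueError(f"not an append edit: {edit!r}")
    return match.group("payload"), match.group("ops")


expected = split_append_edit("A_[K]KKKK")
actual = split_append_edit("A_[KKKK]KKKK")

print(expected)  # ('K', 'KKKK')
print(actual)    # ('KKKK', 'KKKK')
```

Under this reading, the trailing operations are identical in both strings and only the payload differs: the appended text should be the single character "K" from the target, but the buggy output contains four of them.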
Proposed Solution
It looks like this is coming from the insert_to_append() function in edits/utils.py. I will open a PR with a fix for this shortly!