Skip to content
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Fix BPE algorithm and control flow issues
Addresses review feedback:
- #449 (comment)
- #449 (comment)
- #449 (comment)

Three fixes:
1. Remove force unwrap: Use .map() pattern instead of bestMerge!.mergeIndex
2. Flatten nested if: Use guard let ... else { continue } in merge loop
3. Fix BPE algorithm: Merge ALL occurrences of winning pair per iteration
   (standard BPE from Sennrich et al. 2016), not just first occurrence

The original implementation only merged the first occurrence of each pair,
requiring O(k) extra iterations for k duplicate pairs. Now follows standard
BPE: find best pair, merge all occurrences, repeat.

Validation:
- All CustomVocabularyTests pass (11/11)
- Build succeeds
- Follows AGENTS.md/CLAUDE.md guidelines (no force unwrap, no nested if)
  • Loading branch information
Alex-Wengg committed Mar 28, 2026
commit 29648376a983025065b48189ef344d4da01455b1
Original file line number Diff line number Diff line change
Expand Up @@ -111,26 +111,40 @@ public final class BpeTokenizer: Sendable {
// Apply BPE merges iteratively
while true {
// Find the highest priority merge (earliest in merges list)
var bestMerge: (index: Int, mergeIndex: Int)? = nil
var bestMergeIndex: Int? = nil
var bestMergePair: (String, String)? = nil

for i in 0..<word.count - 1 {
let pair = (word[i], word[i + 1])

// Check if this pair has a merge rule
if let mergeIndex = merges.firstIndex(where: { $0.0 == pair.0 && $0.1 == pair.1 }) {
if bestMerge == nil || mergeIndex < bestMerge!.mergeIndex {
bestMerge = (i, mergeIndex)
}
guard let mergeIndex = merges.firstIndex(where: { $0.0 == pair.0 && $0.1 == pair.1 }) else {
continue
}

// Update best merge if this is higher priority (lower index)
if bestMergeIndex.map({ mergeIndex < $0 }) ?? true {
bestMergeIndex = mergeIndex
bestMergePair = pair
}
}

// No more merges possible
guard let (index, _) = bestMerge else { break }

// Apply the merge
let merged = word[index] + word[index + 1]
word[index] = merged
word.remove(at: index + 1)
guard let (first, second) = bestMergePair else { break }

// Apply the merge to ALL occurrences of the winning pair (standard BPE)
var newWord: [String] = []
var i = 0
while i < word.count {
if i < word.count - 1 && word[i] == first && word[i + 1] == second {
newWord.append(first + second)
i += 2 // Skip the next token since we merged it
} else {
newWord.append(word[i])
i += 1
}
}
word = newWord
}

// Convert tokens to IDs
Expand Down
Loading