This repository accompanies the paper The Inverse Lyndon Array: Definition, Properties, and Linear-Time Construction. It contains a C++17 implementation of the linear-time construction of the inverse Lyndon array, together with the standard Lyndon-array baseline, correctness checks, dataset download utilities, and reproducible benchmark scripts.
The main object studied in the paper is the inverse Lyndon array λ⁻¹, where λ⁻¹[i] is the length of the longest inverse Lyndon factor starting at position i. In the inverse setting, the key recovery formula is:
λ⁻¹[i] = ngs[i] − i + lce(i, ngs[i])
where ngs[i] is the position of the next greater suffix and the LCE term coincides with the border correction proved in the paper.
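As a concrete reference point, λ⁻¹ can also be computed straight from the definition. The sketch below is not part of the repository; it assumes the usual definition of an inverse Lyndon word as a word strictly greater than each of its proper nonempty suffixes, and it is the kind of naive reference that a brute-force check can compare the fast construction against:

```cpp
#include <cassert>
#include <string>
#include <vector>

// A word is an inverse Lyndon word if it is strictly greater
// (lexicographically) than each of its proper nonempty suffixes.
bool is_inverse_lyndon(const std::string& w) {
    for (size_t k = 1; k < w.size(); ++k)
        if (w.compare(w.substr(k)) <= 0) return false;
    return true;
}

// Naive inverse Lyndon array: lam[i] is the length of the longest
// inverse Lyndon factor starting at position i.
std::vector<int> inverse_lyndon_array_bruteforce(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> lam(n, 1);
    for (int i = 0; i < n; ++i)
        for (int len = 1; i + len <= n; ++len)
            if (is_inverse_lyndon(s.substr(i, len))) lam[i] = len;
    return lam;
}
```

For example, on "baa" this yields [3, 2, 1]: the whole string "baa" dominates its suffixes, and so does "aa" at position 1.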
The current codebase focuses on three practical tasks.
- Construction of the standard Lyndon array λ with the LCE-NSS approach.
- Construction of the inverse Lyndon array λ⁻¹ with the LCE-NGS approach.
- Experimental comparison between the two constructions on random strings, structured synthetic families, and real corpora.
At the moment, the repository is centered on construction, verification, and benchmarking. The paper also discusses ICFL recovery from λ⁻¹, but that recovery procedure is not exposed as a separate implementation in the current code snapshot.
.
├── download_datasets.py # Downloads the benchmark corpora
├── lyndon_benchmark.cpp # C++17 implementation and benchmark driver
└── run_pipeline.sh # End-to-end reproducible pipeline
The program computes the standard Lyndon array λ using the nearest smaller suffix framework with LCE acceleration.
The program computes the inverse Lyndon array λ⁻¹ using the nearest greater suffix framework. In the implementation, the array is recovered from the NGS structure and the associated LCE values with:
lambda_inv[i] = ngs[i] - i + nlce[i];
This is the algorithmic counterpart of the characterization proved in the paper.
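The NGS-based recovery can be sketched end to end. The following illustrative C++17 fragment is independent of lyndon_benchmark.cpp (the helper names ngs, lce, and suffix_less are ours): it finds next-greater-suffix positions with a monotonic stack and naive character-by-character LCEs, then applies the recovery formula. The actual implementation replaces the naive suffix comparisons with LCE acceleration to reach linear time:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Naive longest common extension of the suffixes starting at i and j.
static int lce(const std::string& s, int i, int j) {
    int n = (int)s.size(), k = 0;
    while (i + k < n && j + k < n && s[i + k] == s[j + k]) ++k;
    return k;
}

// True iff suffix(i) < suffix(j) lexicographically.
static bool suffix_less(const std::string& s, int i, int j) {
    int k = lce(s, i, j), n = (int)s.size();
    if (j + k == n) return false;  // suffix(j) is a prefix of suffix(i)
    if (i + k == n) return true;   // suffix(i) is a proper prefix of suffix(j)
    return s[i + k] < s[j + k];
}

// Quadratic reference construction: next-greater-suffix positions via a
// monotonic stack (sentinel n when no greater suffix exists), then the
// recovery formula lam[i] = ngs[i] - i + lce(i, ngs[i]).
std::vector<int> inverse_lyndon_array(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> ngs(n, n), lam(n, 0), st;
    for (int i = n - 1; i >= 0; --i) {
        // Pop suffixes that are not greater than suffix(i).
        while (!st.empty() && !suffix_less(s, i, st.back())) st.pop_back();
        if (!st.empty()) ngs[i] = st.back();
        st.push_back(i);
        lam[i] = ngs[i] - i + (ngs[i] < n ? lce(s, i, ngs[i]) : 0);
    }
    return lam;
}
```

On "abab", for instance, the suffix "bab" at position 1 has no greater suffix to its right, so the sentinel gives length 3, while position 0 stops at its next greater suffix "bab" with an LCE of 0, giving length 1.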
The verify mode compares both computed arrays, λ and λ⁻¹, against brute-force reference implementations on random inputs.
The repository is intentionally lightweight. You only need:
- A C++17 compiler, such as g++.
- Python 3, for dataset download.
- A Unix-like shell environment to run the full pipeline script.
The provided pipeline compiles with aggressive optimization flags:
g++ -std=c++17 -O3 -march=native -DNDEBUG lyndon_benchmark.cpp -o lyndon_benchmark
To run the full experimental pipeline:
bash run_pipeline.sh
The script performs the following steps:
- downloads the datasets into datasets/,
- compiles lyndon_benchmark.cpp,
- runs correctness verification,
- benchmarks random inputs,
- benchmarks structured synthetic inputs,
- benchmarks real corpora,
- runs dedicated border-heavy and profiling experiments.
Results are written under:
results/<timestamp>/
A complete run produces CSV and log files such as:
random.csv
structured.csv
real.csv
border.csv
border_profile.csv
random_profile.csv
structured_profile.csv
real_profile.csv
The plain CSV files report mean running times for LCE-NSS, LCE-NGS, and the corresponding recovery steps. The profile CSV files also expose internal counters such as character comparisons, reuse hits, explicit extension calls, suffix-link traversals, and other profiling statistics used to study the practical linear-time behavior.
You can also compile and run the driver manually.
g++ -std=c++17 -O3 -march=native -DNDEBUG lyndon_benchmark.cpp -o lyndon_benchmark
Then use one of the supported modes.
./lyndon_benchmark verify 5000 25
./lyndon_benchmark bench 5
./lyndon_benchmark bench_profile 1
./lyndon_benchmark bench_struct 5
./lyndon_benchmark bench_border 3
./lyndon_benchmark bench_border_profile 3
./lyndon_benchmark bench_files 5 5000000 datasets/english.txt datasets/dna.txt
./lyndon_benchmark random 100000 5
./lyndon_benchmark file datasets/english.txt 5000000 5
The benchmark driver includes the following input families.
- Random strings over alphabets of size 2, 4, and 26.
- Repetitive constant strings.
- Repetitive periodic strings.
- Long-border instances with configurable border percentage.
- Quasi-monotone strings with controlled noise.
- Real corpora loaded from text files.
These families were chosen to compare the standard and inverse constructions on both generic and border-sensitive inputs.
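As an illustration of the border-sensitive family, a long-border instance can be produced by overwriting the tail of a random string with a copy of its own prefix. This sketch is not the repository's generator; make_long_border and its parameters are hypothetical:

```cpp
#include <cassert>
#include <random>
#include <string>

// Illustrative long-border generator: a random string over {a,b,c,d} whose
// last border_percent of positions repeat the prefix, forcing a long border.
std::string make_long_border(int n, int border_percent, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 3);
    std::string s(n, 'a');
    for (int i = 0; i < n; ++i) s[i] = char('a' + dist(rng));
    int b = n * border_percent / 100;                  // border length
    for (int i = 0; i < b; ++i) s[n - b + i] = s[i];   // copy prefix to suffix
    return s;
}
```

Such inputs stress the LCE correction term, since many positions see long common extensions with their next greater suffix.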
The downloader fetches two datasets from Pizza and Chili and three files from the Canterbury corpus mirror.
english.txt
dna.txt
bible.txt
e.coli
world192.txt
If a direct raw download from the mirror fails, the Python script falls back to downloading the Canterbury archive as a zip file and extracts the required files automatically.
The shell pipeline can be customized through environment variables.
VERIFY_N=5000
VERIFY_ITERS=25
BENCH_RUNS=5
PROFILE_RUNS=1
BORDER_RUNS=3
MAXLEN=5000000
DO_PROFILE=1
Example:
BENCH_RUNS=10 PROFILE_RUNS=3 MAXLEN=1000000 bash run_pipeline.sh
The repository is designed to support the experimental side of the paper.
- It implements the standard LCE-NSS construction for the Lyndon array.
- It implements the inverse LCE-NGS construction for the inverse Lyndon array.
- It validates both arrays against brute force on random inputs.
- It compares timing and profiling behavior on random, structured, and real datasets.
This makes the code suitable both for reproducing the benchmark tables and for inspecting the algorithmic behavior behind the theoretical linear-time result.
This repository currently exposes the benchmark and verification implementation directly in a single C++ source file.