FLaTNNBio/inverselyndonarray
Inverse Lyndon Array

This repository accompanies the paper The Inverse Lyndon Array: Definition, Properties, and Linear-Time Construction. It contains a C++17 implementation of the linear-time construction of the inverse Lyndon array, together with the standard Lyndon-array baseline, correctness checks, dataset download utilities, and reproducible benchmark scripts.

The main object studied in the paper is the inverse Lyndon array λ⁻¹, where λ⁻¹[i] is the length of the longest inverse Lyndon factor starting at position i. In the inverse setting, the key recovery formula is:

λ⁻¹[i] = ngs[i] - i + lce(i, ngs[i])

where ngs[i] is the position of the next suffix that is lexicographically greater than the suffix starting at i, and the LCE term coincides with the border correction proved in the paper.

What this repository contains

The current codebase focuses on three practical tasks.

  1. Construction of the standard Lyndon array λ with the LCE-NSS approach.
  2. Construction of the inverse Lyndon array λ⁻¹ with the LCE-NGS approach.
  3. Experimental comparison between the two constructions on random strings, structured synthetic families, and real corpora.

At the moment, the repository is centered on construction, verification, and benchmarking. The paper also discusses ICFL recovery from λ⁻¹, but that recovery procedure is not exposed as a separate implementation in the current code snapshot.

Repository layout

.
├── download_datasets.py   # Downloads the benchmark corpora
├── lyndon_benchmark.cpp   # C++17 implementation and benchmark driver
└── run_pipeline.sh        # End-to-end reproducible pipeline

Algorithms implemented

1. Standard Lyndon array, LCE-NSS

The program computes the standard Lyndon array λ using the nearest smaller suffix framework with LCE acceleration.

2. Inverse Lyndon array, LCE-NGS

The program computes the inverse Lyndon array λ⁻¹ using the nearest greater suffix framework. In the implementation, the array is recovered from the NGS structure and the associated LCE values with:

lambda_inv[i] = ngs[i] - i + nlce[i];

This is the algorithmic counterpart of the characterization proved in the paper.

3. Verification by brute force

The verify mode compares both computed arrays, λ and λ⁻¹, against brute-force reference implementations on random inputs.

Build requirements

The repository is intentionally lightweight. You only need:

  1. A C++17 compiler, such as g++.
  2. Python 3, for dataset download.
  3. A Unix-like shell environment to run the full pipeline script.

The provided pipeline compiles with aggressive optimization flags:

g++ -std=c++17 -O3 -march=native -DNDEBUG lyndon_benchmark.cpp -o lyndon_benchmark

Quick start

To run the full experimental pipeline:

bash run_pipeline.sh

The script performs the following steps:

  1. downloads the datasets into datasets/,
  2. compiles lyndon_benchmark.cpp,
  3. runs correctness verification,
  4. benchmarks random inputs,
  5. benchmarks structured synthetic inputs,
  6. benchmarks real corpora,
  7. runs dedicated border-heavy and profiling experiments.

Results are written under:

results/<timestamp>/

Main output files

A complete run produces CSV and log files such as:

random.csv
structured.csv
real.csv
border.csv
border_profile.csv
random_profile.csv
structured_profile.csv
real_profile.csv

The plain CSV files report mean running times for LCE-NSS, LCE-NGS, and the corresponding recovery steps. The profile CSV files also expose internal counters such as character comparisons, reuse hits, explicit extension calls, suffix-link traversals, and other profiling statistics used to study the practical linear-time behavior.

Direct command-line usage

You can also compile and run the driver manually.

g++ -std=c++17 -O3 -march=native -DNDEBUG lyndon_benchmark.cpp -o lyndon_benchmark

Then use one of the supported modes.

Correctness verification

./lyndon_benchmark verify 5000 25

Random benchmark

./lyndon_benchmark bench 5

Profiled random benchmark

./lyndon_benchmark bench_profile 1

Structured synthetic benchmark

./lyndon_benchmark bench_struct 5

Border-focused experiment

./lyndon_benchmark bench_border 3
./lyndon_benchmark bench_border_profile 3

Real-file benchmark

./lyndon_benchmark bench_files 5 5000000 datasets/english.txt datasets/dna.txt

Single random instance or single file

./lyndon_benchmark random 100000 5
./lyndon_benchmark file datasets/english.txt 5000000 5

Benchmark families

The benchmark driver includes the following input families.

  1. Random strings over alphabets of size 2, 4, and 26.
  2. Repetitive constant strings.
  3. Repetitive periodic strings.
  4. Long-border instances with configurable border percentage.
  5. Quasi-monotone strings with controlled noise.
  6. Real corpora loaded from text files.

These families were chosen to compare the standard and inverse constructions on both generic and border-sensitive inputs.

Datasets used by the pipeline

The downloader fetches two datasets from the Pizza&Chili corpus and three files from a Canterbury corpus mirror.

english.txt
dna.txt
bible.txt
e.coli
world192.txt

If a direct raw download from the mirror fails, the Python script falls back to downloading the Canterbury archive as a zip file and extracts the required files automatically.

Environment variables for the full pipeline

The shell pipeline can be customized through environment variables.

VERIFY_N=5000
VERIFY_ITERS=25
BENCH_RUNS=5
PROFILE_RUNS=1
BORDER_RUNS=3
MAXLEN=5000000
DO_PROFILE=1

Example:

BENCH_RUNS=10 PROFILE_RUNS=3 MAXLEN=1000000 bash run_pipeline.sh

Relation to the paper

The repository is designed to support the experimental side of the paper.

  1. It implements the standard LCE-NSS construction for the Lyndon array.
  2. It implements the inverse LCE-NGS construction for the inverse Lyndon array.
  3. It validates both arrays against brute force on random inputs.
  4. It compares timing and profiling behavior on random, structured, and real datasets.

This makes the code suitable both for reproducing the benchmark tables and for inspecting the algorithmic behavior behind the theoretical linear-time result.

Notes

This repository currently exposes the benchmark and verification implementation directly in a single C++ source file.
