Cython fast path cuts MDF4 open time by up to 4× #222

Merged: ratal merged 4 commits into master from dev on Apr 11, 2026



@ratal ratal commented Apr 11, 2026

mdfreader 4.3 — Release Notes


Performance

Up to 4× faster metadata parsing for large MDF4 files.
A new Cython SymBufReader replaces Python's file-object dispatch with a
bidirectional 64 KB buffered reader, cutting syscall overhead on files with
many data groups and channel groups.
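
To illustrate the buffering idea described above, here is a minimal pure-Python sketch of a reader that refills its buffer with a window centred on the requested position, so that short backward and forward seeks both stay inside the buffer. The class name `CenteredBufferReader` and the `refills` counter are hypothetical and exist only for this illustration; the actual SymBufReader is a Cython cdef class and may differ in detail.

```python
import io


class CenteredBufferReader:
    """Hypothetical stand-in for a bidirectional buffered reader."""

    def __init__(self, raw, buf_size=64 * 1024):
        self._raw = raw              # any seekable binary file object
        self._buf_size = buf_size
        self._buf = b""
        self._buf_start = 0          # file offset of self._buf[0]
        self._pos = 0                # logical read position
        self.refills = 0             # underlying reads, counted for illustration

    def seek(self, offset):
        # Only the logical position moves; no call reaches the raw file here.
        self._pos = offset
        return offset

    def read(self, n):
        lo = self._pos - self._buf_start
        if lo < 0 or lo + n > len(self._buf):
            # Miss: refill with a window centred on the requested position.
            start = max(0, self._pos - self._buf_size // 2)
            self._raw.seek(start)
            want = max(self._buf_size, self._pos - start + n)
            self._buf = self._raw.read(want)
            self._buf_start = start
            self.refills += 1
            lo = self._pos - self._buf_start
        data = self._buf[lo:lo + n]
        self._pos += len(data)
        return data


# Toy 256 KiB "file" in which the byte at offset k equals k % 256.
raw = io.BytesIO(bytes(range(256)) * 1024)
reader = CenteredBufferReader(raw)
reader.seek(100_000)
first = reader.read(4)
reader.seek(99_000)      # backward jump, still inside the centred window
second = reader.read(4)
reader.seek(101_000)     # forward jump, still inside the centred window
third = reader.read(4)
```

A forward-only buffer would have to refill on the backward jump; with the centred window all three reads are served by a single underlying read.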

Up to 3× faster CN/CC/SI/TX chain reading (Cython fast path).
The hot loop that walks channel-name, conversion, and source-information linked
lists is now implemented in Cython using POSIX pread(), C packed structs, and
a zero-copy <TX> bytes scan instead of lxml.objectify. On files with ~36 000
channels the total open time drops from ~1.9 s to ~0.6 s. The Python fallback
path is used automatically when the Cython extension is not available.
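
The linked-list walk described above can be sketched in plain Python with `os.pread` and `struct`. The node layout below (a little-endian 64-bit "next" link followed by a 32-bit payload) is a toy assumption for illustration, not the MDF4 CN block layout, and `walk_chain` and `build_toy_file` are hypothetical helper names. `os.pread` is only available on POSIX-like platforms.

```python
import os
import struct
import tempfile

# Toy node layout (an assumption, not the MDF4 CN layout):
# uint64 offset of the next node, then a uint32 payload value.
NODE = struct.Struct("<QI")


def walk_chain(fd, first_offset):
    """Follow 'next' links with positional reads until a null (0) link."""
    values = []
    offset = first_offset
    while offset:
        raw = os.pread(fd, NODE.size, offset)   # no seek(), no file object
        if len(raw) < NODE.size:
            raise ValueError(f"truncated node at offset {offset}")
        offset, value = NODE.unpack(raw)
        values.append(value)
    return values


def build_toy_file(path, payloads):
    """Write nodes back to back after an 8-byte pad so offset 0 means 'end'."""
    with open(path, "wb") as f:
        f.write(b"\0" * 8)
        for i, value in enumerate(payloads):
            here = 8 + i * NODE.size
            nxt = here + NODE.size if i + 1 < len(payloads) else 0
            f.write(NODE.pack(nxt, value))
    return 8    # offset of the first node


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    first = build_toy_file(path, [10, 20, 30])
    fd = os.open(path, os.O_RDONLY)
    try:
        chain = walk_chain(fd, first)
    finally:
        os.close(fd)
```

Each hop costs one positional read and one fixed-size unpack. The Cython version described above presumably gains further by doing the same with C structs and `memcpy` instead of Python-level unpacking.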


Bug Fixes

  • Scipy optional: import mdfreader no longer fails when scipy is not
    installed; the import is now lazy (only resample() requires it).

Documentation

  • Full Sphinx documentation overhauled with quick-start examples, architecture
    tables, and a new Performance page documenting the Cython optimisations
    and how to verify the fast path is active.
  • mdfinfo4.py and dataRead.pyx now carry comprehensive docstrings covering
    the on-disk block layout, fast-path design constraints, and the SI-cache
    strategy.

Packaging

  • PyPI metadata fixed: license_files declared in setup.cfg, pyproject.toml
    reduced to build-system table only, resolving License-File header rejection
    by twine/PyPI.

ratal and others added 4 commits April 8, 2026 00:01
- read_all_channels_sorted_record: replace chunk loop with single
  readinto() into a pre-allocated recarray (zero-copy, writeable) — the
  main win: T3 1.7 GB file drops from 1.5s to 0.4s (4x faster)
- DZ transpose: arr.T.tobytes() instead of .copy() + tobytes() (one
  fewer intermediate allocation)
- Vectorize sign-extension in _apply_unsorted_bit_masking using
  np.where(negative, bitwise_or(temp_u, sign_extend), temp_u)
- SI block cache in Info4._si_cache (keyed by file pointer) to skip
  duplicate Source Information block reads
- Add SymBufReader cdef class to dataRead.pyx: bidirectional-buffered
  file wrapper that fills its C-level buffer centred on the current
  position, matching the mdfr Rust SymBufReader design; activated for
  all Info4 metadata reads via _SymBufReader import
- Bump version to 4.3
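
As a rough illustration of the first bullet above, the sketch below reads fixed-size records with one `readinto()` call into a pre-allocated NumPy structured array instead of looping over chunks. The function name `read_records` and the two-field record layout are assumptions for this example, not the signature of read_all_channels_sorted_record.

```python
import io

import numpy as np


def read_records(f, dtype, n_records):
    """Fill a pre-allocated structured array with a single readinto() call."""
    out = np.empty(n_records, dtype=dtype)
    raw = out.view(np.uint8)        # same memory, seen as a flat byte buffer
    got = f.readinto(raw)           # bytes land directly in the array memory
    if got != raw.nbytes:
        raise EOFError(f"expected {raw.nbytes} bytes, got {got}")
    return out.view(np.recarray)    # attribute-style field access, still writeable


# Toy record: a float64 timestamp followed by an int16 value (10 bytes, packed).
record = np.dtype([("t", "<f8"), ("v", "<i2")])
source = np.zeros(5, dtype=record)
source["t"] = np.arange(5)
source["v"] = [-2, -1, 0, 1, 2]

records = read_records(io.BytesIO(source.tobytes()), record, 5)
```

The result owns its memory and is writeable, so later conversions can run in place without a further copy.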

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nel files

Adds read_cn_chain_fast() to dataRead.pyx: a Cython function that reads the
entire MDF4 CN linked list using POSIX pread() (zero Python file-object
dispatch), C packed structs + memcpy for zero-copy parsing, and a fast
<TX>...</TX> bytes scan replacing lxml.objectify for the common MD block
pattern. Falls back to full Python CCBlock for complex cc_type 3/7-11.
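
The bytes scan mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the body of _fast_read_tx_or_md: the function name `extract_tx` is hypothetical, and the rule of returning None whenever the payload contains `&` or `<` (so that the caller can fall back to a full XML parser) is an assumption made here to keep the scan safe.

```python
def extract_tx(md_bytes):
    """Return the text of a simple <TX>...</TX> element, or None to fall back."""
    start = md_bytes.find(b"<TX>")
    if start < 0:
        return None
    start += len(b"<TX>")
    end = md_bytes.find(b"</TX>", start)
    if end < 0:
        return None
    payload = md_bytes[start:end]
    # Entities or nested markup need a real XML parser; signal a fallback.
    if b"&" in payload or b"<" in payload:
        return None
    return payload.decode("utf-8")


simple = extract_tx(b"<CNcomment><TX>Engine speed</TX></CNcomment>")
escaped = extract_tx(b"<CNcomment><TX>a &amp; b</TX></CNcomment>")
missing = extract_tx(b"<CNcomment/>")
```

Two `find()` calls and a slice avoid building an element tree for the common case, which is where the saving over a full parser comes from.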

Benchmarks (3-run best):
  test.mf4  (36k channels): 0.90s → 0.61s (3.1x total from 1.9s baseline)
  T3        (480 channels):  0.40s → 0.33s (4.5x total from 1.5s baseline)

mdfinfo4.py: import read_cn_chain_fast; modify read_cn_blocks() to use fast
Cython path for files with fileno() (raw open() or SymBufReader), falling
back to the Python path otherwise. Post-processing handles composition blocks
(CA/CN/DS/CL/CV/CU), VLSD/VLSC detection, and CC completion.
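
One way to choose between the two paths is to probe whether the file object exposes an OS-level descriptor. The sketch below is an assumption about how such a check could look; `has_os_level_fd` and `choose_reader` are hypothetical names, not functions from mdfinfo4.py.

```python
import io
import tempfile


def has_os_level_fd(f):
    """True if f.fileno() yields a descriptor usable for positional reads."""
    try:
        f.fileno()
        return True
    except (AttributeError, OSError, ValueError):
        # In-memory streams raise io.UnsupportedOperation, which derives
        # from both OSError and ValueError.
        return False


def choose_reader(f):
    """Pick the descriptor-based path when possible, else the Python path."""
    return "fast" if has_os_level_fd(f) else "fallback"


in_memory = choose_reader(io.BytesIO(b"data"))
with tempfile.TemporaryFile() as real_file:
    on_disk = choose_reader(real_file)
```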

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
README.md:
  - Add Performance section documenting read_cn_chain_fast, SymBufReader and
    vectorised data reading with benchmark table (1.9s → 0.6s on 36k channels)
  - Expand Requirements into structured table; clarify Cython fallback
  - Rewrite Installation with build-from-source steps
  - Convert channel-structure list to a table; document masterChannelList
  - Memory-saving options expanded into descriptive bullets

mdfinfo4.py:
  - Module docstring rewritten: explains fast vs. fallback path, key classes,
    design constraints (why CC val/ref and composition stay in Python)
  - Info4 class docstring: full Attributes section including _si_cache,
    complete dictionary layout table with all top-level keys and their meaning

dataRead.pyx:
  - Fast-reader section header expanded with technique summary and design
    constraints
  - All six C packed structs documented with field-level comments including
    value enumerations, bit flags, and byte-offset rationale
  - _fast_read_tx_or_md, _fast_read_si docstrings expanded

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
conf.py:
  - Bump version to 4.3, copyright year to 2025
  - Switch theme to sphinx_rtd_theme
  - Add sphinx.ext.viewcode, autosummary; configure autodoc_default_options
  - Add intersphinx mappings (Python, NumPy)
  - Remove missing _static path warning

docs/index.rst:
  - Add quick-start code example and pip/source installation snippet
  - Architecture table mapping each module to its responsibility
  - Channel dict structure table
  - Integrated 'performance' page into toctree

docs/performance.rst (new):
  - Benchmark table: 1.9s → 0.6s on 36k-channel file
  - Detailed explanation of all three optimisations:
    pread()+C-structs CN chain reader, SymBufReader, single-call readinto()
  - How to verify the fast path is active

Per-module index.rst files:
  - mdfreader: Mdf vs MdfInfo purpose, typical usage snippets
  - mdf: channel dict layout table, field constant reference
  - mdfinfo4: reading-path comparison table, Info4 dict structure example
  - mdf4reader: key classes, data block type table, conversion type table
  - mdfinfo3: MDF3 block key reference
  - mdf3reader: MDF3 vs MDF4 differences
  - channel: method reference table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@ratal ratal left a comment


All good

@ratal ratal merged commit 79895c5 into master Apr 11, 2026
10 checks passed