Efficiently convert websites into human- and LLM-friendly formats using the web accessibility tree.
This is a demo of converting a somewhat involved website into a human/LLM-friendly markdown file by feeding a dump of the accessibility tree to an LLM. This allows you to retain layout and semantic context in the output file.
For reference, this is the prompt used:
# Instructions
- Render the following accessibility tree of a website in markdown.
- Take liberties with layout and structuring.
- Make sure images are linked to, but do not render them with the "!" qualifier.
## Chrome Accessibility Tree
{XML representation of AX tree}

This is a demo of converting a Wikipedia article into a human/LLM-friendly markdown file through the use of a variety of heuristics. This is much faster than feeding an accessibility tree to an LLM but does not retain semantic or layout information.
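
As a rough illustration of what a heuristic pass over a serialized accessibility tree could look like, here is a minimal Go sketch. It is not the distiller's actual code: the `<node>` element, the `role`, `name`, `url`, and `level` attributes, and the `renderMarkdown` function are all hypothetical, and the real dump format may differ. Images are emitted as plain links, mirroring the instruction in the prompt above.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Node is a hypothetical shape for one serialized accessibility node.
// The real dump may use different element and attribute names.
type Node struct {
	Role     string `xml:"role,attr"`
	Name     string `xml:"name,attr"`
	URL      string `xml:"url,attr"`
	Level    int    `xml:"level,attr"`
	Children []Node `xml:"node"`
}

// render walks the tree depth-first and writes Markdown for the few
// roles it recognizes. Unknown roles are skipped, but their children
// are still visited.
func render(n Node, b *strings.Builder) {
	switch n.Role {
	case "heading":
		level := n.Level
		if level < 1 {
			level = 1
		}
		if level > 6 {
			level = 6
		}
		fmt.Fprintf(b, "%s %s\n\n", strings.Repeat("#", level), n.Name)
		return
	case "paragraph":
		fmt.Fprintf(b, "%s\n\n", n.Name)
		return
	case "link", "image":
		// Images are linked to, not embedded with "!".
		fmt.Fprintf(b, "[%s](%s)\n\n", n.Name, n.URL)
		return
	case "listitem":
		fmt.Fprintf(b, "- %s\n", n.Name)
		return
	case "list":
		for _, c := range n.Children {
			render(c, b)
		}
		b.WriteString("\n") // blank line closes the list
		return
	}
	for _, c := range n.Children {
		render(c, b)
	}
}

// renderMarkdown parses the (assumed) XML dump and returns Markdown.
func renderMarkdown(doc string) (string, error) {
	var root Node
	if err := xml.Unmarshal([]byte(doc), &root); err != nil {
		return "", err
	}
	var b strings.Builder
	render(root, &b)
	return strings.TrimSpace(b.String()) + "\n", nil
}

func main() {
	doc := `<node role="RootWebArea" name="Example">
  <node role="heading" level="1" name="Example Domain"/>
  <node role="paragraph" name="This domain is for use in examples."/>
  <node role="list">
    <node role="listitem" name="First"/>
    <node role="listitem" name="Second"/>
  </node>
  <node role="image" name="Logo" url="https://example.com/logo.png"/>
</node>`
	out, err := renderMarkdown(doc)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

A role-based switch like this is fast because it needs no model call, but it only knows the roles it was written for, which is one reason a heuristic pass loses layout and semantic context that an LLM can infer.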
# set up headless Chrome and uBlock Origin
go run ./cmd/setup
# dump the accessibility tree of some website (with filtering of useless ax nodes)
go run ./cmd/dump-ax-tree "https://somewebsite.com/..." > serialized.xml
# run the distiller
cd cmd/distill-test
go build && ./distill-test
# check the results
cat out_en.wikipedia.org.md

cd backend/build
cmake ..
make