The documentation generator is responsible for producing the HTML and RDF-based outputs. It downloads the spreadsheets containing the data for DPV and related vocabularies (such as DPV-GDPR), converts them to RDF serialisations, and generates HTML documentation using the W3C ReSpec template.
The Data Privacy Vocabulary (DPV) is available at https://w3id.org/dpv and its repository is at https://github.com/w3c/dpv.
There are three Python scripts, one for each of the three tasks. We use uv to manage Python (see the end of this document).
If you have updated concepts or want to regenerate the spreadsheets from which all RDF and HTML is produced, use `100_download_CSV.py` (by default it downloads and extracts all spreadsheets). Use `--ds <name>` to download and extract only specific spreadsheets. See the Downloading CSV data section below for more information.
To generate the RDF files, use `200_serialise_RDF.py`, which creates RDF serialisations for all DPV modules and extensions. Use `--vocab=<name>` to generate outputs only for a specific vocabulary or extension; by default, it generates outputs for all vocabularies.
To generate the HTML files, use `300_generate_HTML.py`, which generates HTML documentation for all DPV modules and extensions. To generate only the HTML for guides, use `300_generate_HTML.py --guides`. Use `--vocab=<name>` to generate outputs only for a specific vocabulary or extension; by default, it generates outputs for all vocabularies. Use `--skip=<name>` to skip loading specific vocabularies (e.g. `loc`) to speed up the process. The skip parameter supports wildcards as suffixes, e.g. `legal*` will match all legal vocabularies.
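The suffix-wildcard behaviour of the skip parameter can be sketched as follows; this is a minimal illustration using Python's `fnmatch`, not the actual filtering code in the script, and the vocabulary names are made up for demonstration:

```python
# Minimal sketch of suffix-wildcard matching for a --skip parameter,
# as in `--skip=legal*`. The vocabulary names below are illustrative;
# the real filtering logic lives in 300_generate_HTML.py.
from fnmatch import fnmatch

def should_skip(vocab: str, skip_patterns: list[str]) -> bool:
    """Return True if `vocab` matches any skip pattern (wildcards allowed)."""
    return any(fnmatch(vocab, pattern) for pattern in skip_patterns)

vocabs = ["loc", "legal-eu", "legal-us", "tech"]
kept = [v for v in vocabs if not should_skip(v, ["legal*"])]
print(kept)  # only the non-legal vocabularies remain
```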
To generate the zip files for publishing DPV releases on GitHub, use `900_generate_releases.sh`, which will produce zip files in the `releases` folder.
To change metadata and config for the above processes, see vocab_management.py
- Internet connectivity - for downloading the spreadsheets from Google Sheets hosting the DPV terms and metadata with the `100` script
- Python 3.11+ (preferably as close to the latest version as possible) - for executing the scripts
- Python modules to be installed using `pip` - see `requirements.txt`
The following are optional additional requirements for publishing DPV in alternate serialisations:
- Ontology Converter v2.0 from https://github.com/sszuev/ont-converter (grab the latest release from https://github.com/sszuev/ont-converter/releases) - required to convert RDF to Manchester syntax with the `201` script
The following are optional additional requirements for validations using SHACL:
- Java runtime 18+ (preferably as close to the latest version as possible) - for executing SHACL validations
- TopBraid SHACL validation binary from https://github.com/TopQuadrant/shacl (grab the latest release from https://repo1.maven.org/maven2/org/topbraid/shacl/)
`100_download_CSV.py` will download the CSV data from a Google Sheets document and store it in the specified `vocab_csv` path. The outcome is one CSV file per sheet. To download and generate the CSVs only for specific modules/extensions, use `--ds <name>`, where name is a key in the `DPV_FILES` dict present in the script. E.g. to download the spreadsheets containing purposes, use `--ds purpose_processing`. Running the script without any parameters will download and extract all spreadsheets.
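The `--ds` selection can be sketched as a simple lookup into a `DPV_FILES`-style mapping; the keys and filenames below are illustrative placeholders, not the real contents of the script:

```python
# Hypothetical sketch of filtering a DPV_FILES-style mapping by a --ds
# argument; the real keys and downloads live in 100_download_CSV.py.
import argparse

DPV_FILES = {  # illustrative keys and filenames only
    "purpose_processing": ["Purpose.csv", "Processing.csv"],
    "entities": ["Entities.csv"],
}

parser = argparse.ArgumentParser()
parser.add_argument("--ds", default=None, help="download only this spreadsheet")
args = parser.parse_args(["--ds", "purpose_processing"])  # simulate CLI input

# No --ds given -> download everything; otherwise only the named spreadsheet.
selected = {args.ds: DPV_FILES[args.ds]} if args.ds else DPV_FILES
print(sorted(selected))
```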
This uses the Google Sheets export link to download the sheet data. The document ID must be specified in the `DPV_DOCUMENT_ID` variable, and the sheet name(s) listed in `DPV_SHEETS`. The default save path for CSVs is `vocab_csv`. The Google Sheet is downloaded as an Excel spreadsheet, and each tab is then exported locally as a CSV file.
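The export mechanism can be sketched as below. The URL pattern is the standard Google Sheets export endpoint; the document ID is a placeholder, and the actual download and tab-export code in the `100` script is not reproduced here:

```python
# Sketch of the Google Sheets export link described above: the document
# ID is used to build an export URL, and the downloaded Excel file's tabs
# are then saved locally as CSVs. The ID below is a placeholder.
DPV_DOCUMENT_ID = "PLACEHOLDER_DOCUMENT_ID"

def export_url(document_id: str, fmt: str = "xlsx") -> str:
    """Google Sheets export endpoint for the whole document."""
    return f"https://docs.google.com/spreadsheets/d/{document_id}/export?format={fmt}"

url = export_url(DPV_DOCUMENT_ID)
print(url)
# The file at this URL would then be fetched and each tab written out
# as <sheet name>.csv under the vocab_csv path.
```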
This uses rdflib to generate the RDF data from the CSVs. Namespaces are manually declared at the top of the document and are automatically handled in text as URI references. The serialisations to be produced are registered in the `RDF_SERIALIZATIONS` variable - see the `vocab_management.py` file for config variables regarding import, export, namespaces, metadata, etc.
The way the RDF generation works is as follows:
- In `vocab_management.py`, the CSV file path is associated with a schema in the `CSVFILES` variable. For example, classes, properties, and taxonomy are schemas commonly used for most CSVs.
- The schema is a reference to a dict in `vocab_schemas.py` where each CSV column is mapped to a function in `vocab_funcs.py`.
- The generator in the `200` script takes each row and calls the appropriate function, passing it the specific cell and the entire row as values (along with other data such as the namespace currently in use).
- The function returns a list of triples, which are added to the graph currently being generated.
- In addition to the above, the `200` script also handles modules and extensions by using the metadata/config variables in `vocab_management.py`.
- The `200` script outputs the RDF files using the RDFS+SKOS serialisation, then converts them to OWL using SPARQL CONSTRUCT queries. It can also add custom OWL-only triples at this stage.
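The schema-driven generation above can be sketched as follows. The column names, functions, and triples here are made up for demonstration (the real mappings live in `vocab_schemas.py` and `vocab_funcs.py`, and the real code builds rdflib triples rather than tuples):

```python
# Illustrative sketch: each CSV column is mapped to a function that
# receives the cell, the full row, and the current namespace, and returns
# triples. Plain tuples stand in for rdflib triples here.
def add_label(cell, row, namespace):
    return [(f"{namespace}{row['Term']}", "skos:prefLabel", cell)]

def add_parent(cell, row, namespace):
    return [(f"{namespace}{row['Term']}", "skos:broader", f"{namespace}{cell}")]

SCHEMA = {"Label": add_label, "ParentTerm": add_parent}  # column -> function

def row_to_triples(row, namespace="dpv:"):
    triples = []
    for column, func in SCHEMA.items():
        if row.get(column):  # skip empty cells
            triples.extend(func(row[column], row, namespace))
    return triples

row = {"Term": "Consent", "Label": "Consent", "ParentTerm": "LegalBasis"}
triples = row_to_triples(row)
print(triples)
```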
The `300_generate_HTML.py` script produces the HTML documentation for all DPV modules and extensions. It uses jinja2 to render the HTML files from templates. The RDF data is loaded for each vocabulary and its modules from the RDF filepaths defined in `vocab_management.py`. The data is stored in-memory as one large dictionary so that lookups can be performed across extensions (e.g. to get the label of a parent concept from another vocabulary). See the `vocab_management.py` file for export paths and the template assigned to each vocabulary or module.
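The cross-extension lookup can be sketched as below; the keys, labels, and dictionary layout are illustrative assumptions, not the actual structure used by the script:

```python
# Sketch of the in-memory lookup described above: data for every
# vocabulary is kept in one dictionary keyed by concept IRI, so a
# template rendering one extension can resolve the label of a parent
# concept defined in another. Entries are illustrative.
DATA = {
    "dpv:Purpose": {"label": "Purpose", "vocab": "dpv"},
    "eu-gdpr:A6-1-a": {"label": "Art 6(1-a) consent", "vocab": "eu-gdpr"},
}

def get_label(iri: str) -> str:
    """Resolve a concept's label regardless of which vocabulary defines it."""
    return DATA.get(iri, {}).get("label", iri)  # fall back to the IRI itself

print(get_label("eu-gdpr:A6-1-a"))  # resolved across vocabularies
print(get_label("dpv:Unknown"))     # unknown IRIs fall back to themselves
```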
This script also produces the guides HTML files. By default, it does this automatically after producing all the vocabulary documentation. To produce only the guides, use the `--guides` flag.
In between steps 2 (generate RDF) and 3 (generate HTML), a series of tests can be run to ensure the RDF has been generated correctly. For this, some basic SHACL constraints are defined in `shacl_shapes`.
The folder `shacl_shapes` holds the constraints in `shapes.ttl` that verify the vocabulary terms contain some basic annotations. The script `verify.py` executes the SHACL validator (currently hardcoded to use the TopBraid SHACL binary as `shaclvalidate.sh`), retrieves the results, and runs a SPARQL query on them to get the failing nodes and messages.
The script uses `DATA_PATHS` to declare which data files should be validated. Currently, it only validates Turtle (`.ttl`) files for simplicity, as all files are duplicate serialisations of each other. The `SHAPES` variable declares the list of shape files to use. For each folder in `DATA_PATHS`, the script executes the SHACL binary to check the constraints defined in each of the `SHAPES` files.
To execute the tests using the TopBraid SHACL binary, download the latest release from Maven, extract it somewhere, and note the path of the folder. Export `SHACLROOT` as that path in the shell the script will run in (or, e.g., save it in your bash profile). To be precise, `$SHACLROOT/shaclvalidate.sh` should result in the binary being executed.
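The invocation can be sketched as below; the fallback path is an assumption, and the `-datafile`/`-shapesfile` arguments follow the TopBraid `shaclvalidate.sh` command-line interface:

```python
# Sketch of building the validator command from SHACLROOT, as verify.py
# does with the hardcoded shaclvalidate.sh. Paths here are illustrative;
# the command is constructed but not executed.
import os

def build_validate_command(datafile: str, shapesfile: str) -> list[str]:
    # /opt/shacl/bin is an assumed fallback for illustration only
    shaclroot = os.environ.get("SHACLROOT", "/opt/shacl/bin")
    return [
        os.path.join(shaclroot, "shaclvalidate.sh"),
        "-datafile", datafile,
        "-shapesfile", shapesfile,
    ]

cmd = build_validate_command("dpv/dpv.ttl", "shacl_shapes/shapes.ttl")
print(cmd)
# To actually run it: subprocess.run(cmd, capture_output=True, text=True)
```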
The output of the script lists the data and shapes files being used in the validation process, the number of errors found, and a list of data nodes and the corresponding failure message.
The spell check is run for HTML documents using `aspell` or a similar tool, with a dictionary containing the words used in the documents provided as `./dictionary-aspell-en.pws`. If using aspell, the command is:

`aspell -d en_GB --extra-dicts=./dictionary-aspell-en.pws -c <file>`

If not using aspell, the file above is a simple text file with one word per line, which can be added or linked to whatever spell checker is being used.
For spell checking the RDF / source, this is currently best done in the source tool, e.g. Google Sheets has a spell check option which should be used to check for valid English terms. Running aspell on the CSVs is cumbersome and difficult to complete, as there are a large number of files to process. Another reason to prefer the source tool is that, if the CSVs are modified, the changes will still need to be synced back to the Google Sheets.
- Fixing an error in the vocabulary terms, i.e. a term label, property, or annotation --> make the changes in the Google Sheet, then run the `100` script to download the CSVs, the `200` script to produce the RDF, and the `300` script to produce the HTML.
- Fixing an error in a serialisation, e.g. `rdf:Property` being defined as `rdfs:Property` --> make the changes in the `200` script for generating RDF and the `300` script for generating HTML.
- Changing content in the HTML documentation, e.g. changing the motivation paragraph --> make the changes in the relevant `template` and run the `300` script to generate the HTML.
We use uv as the Python tooling of choice, replacing pip, venv, and several other tools. It is not necessary to use, but is recommended (it's fast!).
# to install requirements
uv sync
# to execute a script
uv run <file> <params>
# to update requirements
uv lock --upgrade

The required minimum Python version and dependencies are declared in pyproject.toml.