Documentation Generation
The development and maintenance of DPV takes place primarily through shared spreadsheets (see code/vocab_csv folder). The terms and their annotated metadata are declared in the spreadsheet, and used as input to generate the RDF files and HTML documentation through the code tooling. The code tool is a collection of Python scripts to assist in the automation of downloading the spreadsheet as CSV files, generating RDF files, validating them for correctness, and producing the HTML documentation.
The documentation generator is responsible for producing the HTML and RDF-based outputs. It downloads the spreadsheets containing the data for DPV and other vocabularies (such as DPV-GDPR), converts them to RDF serialisations, and generates HTML documentation using the W3C ReSpec template.
Therefore, whenever adding a new term or changing existing ones, the following steps are recommended to update the DPV vocabulary and documentation. There are three scripts, one for each task:
- If you have updated concepts or want to regenerate the CSV files from which all RDF and HTML are produced, use `./100_download_CSV.py` (by default it will download and extract all spreadsheets). You can use `--ds <name>` to only download and extract specific spreadsheets; see the Downloading CSV data section below for more information.
- If you want to generate the RDF files, use `./200_serialise_RDF.py`, which will create RDF serialisations for all DPV modules and extensions.
- If you want to generate the HTML files, use `./300_generate_HTML.py`, which will generate HTML documentation for all DPV modules and extensions. To also generate the HTML for guides, use `./300_generate_HTML.py --guides`.

To generate the zip files for publishing DPV releases on GitHub, use `./900_generate_releases.sh`, which will produce zip files in the `releases` folder.
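For a complete refresh, the three scripts are run in sequence. A minimal sketch of automating this, assuming the scripts are run from the `code` directory and are executable (the wrapper itself is hypothetical; the commands are the ones listed above):

```python
# Hypothetical wrapper: run the three steps above in order, aborting on failure.
import subprocess

for cmd in (
    ["./100_download_CSV.py"],               # 1. refresh CSVs from Google Sheets
    ["./200_serialise_RDF.py"],              # 2. regenerate RDF serialisations
    ["./300_generate_HTML.py", "--guides"],  # 3. regenerate HTML docs and guides
):
    subprocess.run(cmd, check=True)
```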
To change metadata and config for the above processes, see `vocab_management.py`.
- Internet connectivity - for downloading the spreadsheets from Google Sheets hosting the DPV terms and metadata with the `100` script.
- Python 3.12+ (preferably as close to the latest version as possible) - for executing the scripts.
- Python modules to be installed using `pip` - see `requirements.txt`.
The following are optional additional requirements for publishing DPV in alternate serialisations:
- Ontology Converter v2.0 from https://github.com/sszuev/ont-converter (grab the latest release from https://github.com/sszuev/ont-converter/releases) - required to convert RDF to Manchester Syntax with the `201` script.
The following are optional additional requirements for validations using SHACL:
- Java runtime 18+ (preferably as close to the latest version as possible) - for executing SHACL validations.
- TopBraid SHACL validation binary from https://github.com/TopQuadrant/shacl (grab the latest release from https://repo1.maven.org/maven2/org/topbraid/shacl/)
`./100_download_CSV.py` will download the CSV data from a Google Sheets document and store it in the `vocab_csv` path specified. The outcome will be a CSV file for each sheet. To only download and generate the CSVs for specific modules/extensions, use `--ds <name>`, where name is the key in `DPV_FILES` present in the script. E.g. to download spreadsheets containing purposes, use `--ds purpose_processing`. Running the script without any parameters will download and extract all spreadsheets.
This uses the Google Sheets export link to download the sheet data in CSV form: the document ID is specified in the `DPV_DOCUMENT_ID` variable and the sheet name(s) are listed in `DPV_SHEETS`. The default save path for the CSVs is `vocab_csv`. In effect, the Google Sheets document is downloaded as an Excel spreadsheet and each tab is then locally exported as a CSV file.
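A minimal sketch of this download flow, assuming `pandas` and `openpyxl` are available (the variable values and sheet names below are placeholders, not the actual configuration):

```python
# Sketch of the download step: fetch the whole document as xlsx via the
# standard Google Sheets export endpoint, then write each tab as a CSV.
import pandas as pd

DPV_DOCUMENT_ID = "..."           # the Google Sheets document ID (placeholder)
DPV_SHEETS = ["purpose", "..."]   # sheet/tab names to extract (placeholders)

url = f"https://docs.google.com/spreadsheets/d/{DPV_DOCUMENT_ID}/export?format=xlsx"
workbook = pd.read_excel(url, sheet_name=None)  # dict: sheet name -> DataFrame
for name, sheet in workbook.items():
    if name in DPV_SHEETS:
        sheet.to_csv(f"vocab_csv/{name}.csv", index=False)
```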
This uses `rdflib` to generate the RDF data from CSV. Namespaces are manually declared at the top of the script and are automatically handled in text as URI references. The serialisations to be produced are registered in the `RDF_SERIALIZATIONS` variable - see the `vocab_management.py` file for config variables regarding import, export, namespaces, metadata, etc.
The way the RDF generation works is as follows:
- In `vocab_management.py`, the CSV file path is associated with a schema in the `CSVFILES` variable. For example, classes, properties, and taxonomy are schemas commonly used for most CSVs.
- The schema is a reference to a dict in `vocab_schemas.py` where each CSV column is mapped to a function in `vocab_funcs.py`.
- The generator in the `200` script takes each row and calls the appropriate function, passing it the specific cell and the entire row as values (along with other details such as the namespace currently in use).
- The function returns a list of triples, which are added to the graph currently being generated. A condensed sketch of this dispatch loop is shown after this list.
- In addition to the above, the `200` script also deals with modules and extensions by using the metadata/config variables in `vocab_management.py`.
- The `200` script outputs the RDF files using the RDFS+SKOS serialisation, then converts them to OWL using SPARQL CONSTRUCT queries. At this stage it can also add custom OWL-only triples.
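The following condensed sketch illustrates that dispatch loop. The schema dict, handler functions, and CSV columns are invented for illustration and do not match the actual contents of `vocab_schemas.py` and `vocab_funcs.py`:

```python
# Illustrative row-to-triples dispatch (all names hypothetical).
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

DPV = Namespace("https://w3id.org/dpv#")

def add_label(cell, row, ns):
    # A "vocab_funcs"-style handler: one cell (plus its row) -> list of triples.
    return [(ns[row["Term"]], SKOS.prefLabel, Literal(cell, lang="en"))]

def add_parent(cell, row, ns):
    return [(ns[row["Term"]], SKOS.broader, ns[cell])] if cell else []

# A "vocab_schemas"-style schema: CSV column name -> handler function.
SCHEMA_CLASSES = {"Label": add_label, "ParentTerm": add_parent}

graph = Graph()
with open("vocab_csv/purpose.csv") as f:  # hypothetical CSV file
    for row in csv.DictReader(f):
        for column, func in SCHEMA_CLASSES.items():
            for triple in func(row.get(column, ""), row, DPV):
                graph.add(triple)
graph.serialize("purpose.ttl", format="turtle")
```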
The `./300_generate_HTML.py` script is used to produce the HTML documentation for all DPV modules and extensions. This uses `jinja2` to render the HTML files from templates. The RDF data is loaded for each vocabulary and its modules for all RDF filepaths as defined in `vocab_management.py`. The data is stored in-memory as a giant dictionary so that lookups can be performed across extensions (e.g. to get the label of a parent concept from another vocabulary). See the `vocab_management.py` file for export paths and the configuration of each template assigned to a vocabulary or module.
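As a rough illustration of this rendering step (the template name, data shape, and output path below are assumed, not the actual configuration):

```python
# Sketch of template-driven HTML generation with jinja2 (hypothetical names).
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("jinja2_resources"))
template = env.get_template("template_vocab.jinja2")  # assumed template file

# Stand-in for the in-memory dictionary holding all vocabularies, which lets
# a template look up labels of concepts from other extensions.
data = {"dpv": {"Purpose": {"label": "Purpose", "parent": None}}}

html = template.render(data=data, vocab="dpv")
with open("dpv/index.html", "w") as fh:  # assumed export path
    fh.write(html)
```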
This script can also produce the guides and the mappings HTML files, though by default it will not do so: to produce the guides, use the `--guides` flag, and to produce the mappings, use the `--mappings` flag.
Between steps 2 (generate RDF) and 3 (generate HTML), a series of tests can be run to ensure the RDF has been generated correctly. For this, some basic SHACL constraints are defined in `shacl_shapes`.
The folder `shacl_shapes` holds the constraints in `shapes.ttl` to verify that the vocabulary terms contain some basic annotations. The script `verify.py` executes the SHACL validator (currently hardcoded to use the TopBraid SHACL binary as `shaclvalidate.sh`), retrieves the results, and runs a SPARQL query on them to get the failing nodes and messages.
The script uses `DATA_PATHS` to declare what data files should be validated. Currently, it will only validate Turtle (`.ttl`) files for simplicity, as all files are duplicate serialisations of each other. The variable `SHAPES` declares the list of shape files to use. For each folder in `DATA_PATHS`, the script will execute the SHACL binary to check the constraints defined in each of the `SHAPES` files.
To execute the tests using the TopBraid SHACL binary, download the latest release from Maven, extract it somewhere, and note the path of the folder. Export `SHACLROOT` in the shell the script is going to run in (or, e.g., save it in the bash profile) so that it points to that folder. To be precise, `$SHACLROOT/shaclvalidate.sh` should result in the binary being executed.
The output of the script lists the data and shapes files being used in the validation process, the number of errors found, and a list of data nodes and the corresponding failure message.
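A rough sketch of how such a wrapper can invoke the binary and extract failures from the validation report - the `-datafile`/`-shapesfile` flags are the TopBraid CLI's, while the file paths and exact query here are illustrative:

```python
# Sketch: run the TopBraid validator and print failing nodes (paths assumed).
import os
import subprocess
from rdflib import Graph

report_ttl = subprocess.run(
    [os.path.join(os.environ["SHACLROOT"], "shaclvalidate.sh"),
     "-datafile", "dpv/dpv.ttl",                 # hypothetical data file
     "-shapesfile", "shacl_shapes/shapes.ttl"],  # shapes file described above
    capture_output=True, text=True,
).stdout

# The validator writes a SHACL validation report (Turtle) to stdout;
# query it for the focus node and message of each validation result.
report = Graph().parse(data=report_ttl, format="turtle")
results = report.query("""
    PREFIX sh: <http://www.w3.org/ns/shacl#>
    SELECT ?node ?message WHERE {
        ?result a sh:ValidationResult ;
                sh:focusNode ?node ;
                sh:resultMessage ?message .
    }""")
for node, message in results:
    print(node, message)
```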
The spell check is run for HTML documents by using aspell or a similar tool, with the dictionary containing words used in the documents provided as `./dictionary-aspell-en.pws`. If using aspell, the command is:

`aspell -d en_GB --extra-dicts=./dictionary-aspell-en.pws -c <file>`

If not using aspell, the file above is a simple text file with one word per line, which can be added or linked to whatever spell checker is being used.
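For checking many generated files at once, a small driver along the following lines can be used (a sketch: the glob pattern is assumed; aspell's `list` mode prints unknown words read from stdin, and `--mode=html` makes it skip markup):

```python
# Sketch: batch spell check of generated HTML with aspell (paths assumed).
import glob
import subprocess

for path in glob.glob("../dpv/**/*.html", recursive=True):  # hypothetical pattern
    with open(path) as fh:
        result = subprocess.run(
            ["aspell", "-d", "en_GB",
             "--extra-dicts=./dictionary-aspell-en.pws",
             "--mode=html", "list"],
            stdin=fh, capture_output=True, text=True)
    unknown = sorted(set(result.stdout.split()))
    if unknown:
        print(path, ", ".join(unknown))
```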
For spell checking the RDF / source, currently this is best done in the source tool, e.g. Google Sheets has a spell check option which should be used to check for valid English terms. Running aspell on the CSVs can be cumbersome and difficult to complete, as there are a large number of files to process. Another reason to prefer the source tool is that if the CSVs are modified, the changes will still need to be synced back to the Google Sheets.
- Fixing an error in the vocabulary terms, e.g. term label, property, annotation --> make the changes in the Google Sheets, and run the `100` script to download the CSVs, then `200` to produce the RDF, then `300` to produce the HTML.
- Fixing an error in serialisation, e.g. `rdf:Property` is defined as `rdfs:Propety` --> make the changes in the `200` script for generating RDF, and the `300` script to generate HTML.
- Changing content in the HTML documentation, e.g. changing a motivation paragraph --> make the changes in the relevant `template` and run the `300` script to generate HTML.
Notes from Harsh: I have switched to using uv as the Python tooling of choice to replace pip, venv, and a bunch of other stuff. This is not necessary to use, but is recommended (it's fast!).
uv run <file> <params>
uv lock --upgrade
uv pip freeze > requirements.txt

The following checklist lists actions which must be completed upon finalisation and publication of a release:
- Run the `200` script and ensure all RDF and CSV outputs are consistent (there are no fluctuating changes).
- Run the `290` script for the SHACL shape validations, which are in `code/shacl_shapes`, and ensure the output is `VALIDATION PASSED`. Otherwise, the report is generated in `validation.html` to assist in fixing the issues.
- Run the `300` script and ensure all HTML outputs are consistent (there are no fluctuating changes).
- Change `SERIALIZATION_SET` in `vocab_management.py` to `FINAL` and generate the additional RDF outputs.
- For each vocabulary, test with OOPS! using the RDF/XML (`.rdf`) serialisation -- see note at end.
- For each vocabulary, test with [FOOPS!](https://foops.linkeddata.es/) -- see note at end.
- Update `README.md` in the repo root with info on the latest version.
- Update `404.html` with links to the latest version.
- Update version numbers:
  - Version mentions in `<version>/README.md`
  - Version mentions in `README.md` in root
  - `version`, `date-released`, and `title` values in `CITATION.cff` in root
  - `DPV_VERSION` and related constants in `code/vocab_management.py`
  - `VERSION` in `code/900_generate_releases.sh`
- Update the "Changelog" section in each extension.
- Update `changelog.html` in the repo root.
- Update versioned IRIs/links:
  - In Jinja templates:
    - The link for "Search Index" in the DPV Specifications paragraph in `code/jinja2_resources/macro_dpv_document_family.jinja2`
  - In the rewrite rules in `w3id_config/.htaccess`
- Update #145 with the announcement.
- Post the announcement on the DPVCG mailing list.
- False positives:
  - `P34: Untyped class` and `P35: Untyped property`: an untyped class/property means a concept was used as if it were a class/property, but it has not been declared as such. This is because OOPS! assumes the input is a complete OWL ontology and thus requires ALL concepts used to be declared within the input graph/file -- even if they are external references. This is the default behaviour when a file is exported from OWL tools like Jena and Protege. Since this is not how DPV is modelled and provided, this issue is a false positive and is ignored in quality assurance. This issue will occur in extensions where the parent concept is not part of the extension, e.g. a concept in TECH references a DPV concept which is not defined in the TECH extension.