Use stable ID for contributor, affiliation, and source blank nodes#274
Conversation
The generated IDs will look like:
- `_:person-316c589472c97217`
- `_:org-218d661eb0940e1a`
- `_:web-5349a9b296f3a7553e6a9132e1df2712`

Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
code/vocab_management.py
Outdated
    return authors

    def hash_id(input_text: str, output_prefix: str, output_hash_len: int) -> str:
We should prefer UUIDs over hashes so that the IDs translate well (even if we are dealing with BNodes). This will also help when using the BNodes later in graphs in case of hash/ID conflicts. Continuing this thought, if we're going to the trouble of assigning IDs to BNodes, we might as well make them fully qualified IDs in the graph and remove the BNode, e.g. contrib-1234.
This is fantastic, thanks @bact! My thought on implementing these was to get rid of BNodes and use UUID3 or UUID5 to generate IRIs for contributors and orgs, so that these are consistent across graphs, won't have conflicts, and won't be dependent on our code (so we don't need to maintain stuff later). I'll merge your code, then try the above and see which one is better to work with.
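For reference, the UUID5 idea in the comment above could be sketched roughly as follows. This is only an illustration using Python's standard `uuid` module: the base IRI, the `stable_uuid` and `contributor_iri` helpers, and the name normalisation are all assumptions, not code from this PR or from DPV.

```python
import uuid

# Hypothetical base IRI, used only for this sketch (not a real DPV namespace).
BASE_IRI = "https://example.org/contributors/"

# Derive a project-specific namespace UUID from the base IRI.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, BASE_IRI)


def stable_uuid(kind: str, name: str) -> uuid.UUID:
    """Return a name-based (version 5, SHA-1) UUID for a contributor or org.

    The same kind and name always produce the same UUID, so the identifier
    stays stable across regenerations. Lower-casing and stripping the name
    is an assumed normalisation step.
    """
    key = f"{kind}:{name.strip().lower()}"
    return uuid.uuid5(NAMESPACE, key)


def contributor_iri(kind: str, name: str) -> str:
    """Build a full IRI (instead of a blank node) from the stable UUID."""
    return f"{BASE_IRI}{kind}-{stable_uuid(kind, name)}"
```

`uuid.uuid3` has the same call shape but hashes with MD5 instead of SHA-1, so switching between the two only changes the function name.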
The IDs will look like:
- _:person-DanielDo-316c5894
- _:org-Unabhäng-41451ff1
- _:web-GDPRArt4-1g-http-2fd893c49a38fd2a

Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
The stable UUID3 and UUID5 options are good too, and maybe better in terms of standards. I wasn't aware of those options before. Note that I have updated the code slightly: the most recent commit (48f2ae0) adds a short slug to make human inspection a little easier. If this is not preferred, we can go back to the 1d9df97 commit. (Also feel free to adjust the hash/slug length for uniqueness.)
closes #274

Squashed commit of the following:

commit ded8a6162245f17e423e547c13205f1c0d503b55
Author: Harshvardhan Pandit <me@harshp.com>
Date: Mon Apr 7 06:47:55 2025 +0100

    replaces BNode ID generation slugify and md5 hash

    - person has bnode with prefix person-<slugified name>
    - org has bnode with prefix org-<slugified name>
    - references are bnodes with md5 hash of url (labels can change)
    - introduces additional dependency `python-slugify`

commit 05b0ade3ad3a0795f8f0ea38a46dc00e5546326f
Author: Arthit Suriyawongkul <arthit@gmail.com>
Date: Fri Mar 28 13:48:58 2025 +0000

    Add slug for human readability

    The IDs will look like:
    - _:person-DanielDo-316c5894
    - _:org-Unabhäng-41451ff1
    - _:web-GDPRArt4-1g-http-2fd893c49a38fd2a

    Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>

commit a3769811ebf614218c85811b0bef9da5512647e9
Author: Arthit Suriyawongkul <arthit@gmail.com>
Date: Thu Mar 27 23:00:52 2025 +0000

    Use stable ID for contributor, affiliation, and source

    The generated IDs will look like:
    - `_:person-316c589472c97217`
    - `_:org-218d661eb0940e1a`
    - `_:web-5349a9b296f3a7553e6a9132e1df2712`

    Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>

Co-authored-by: Arthit Suriyawongkul <arthit@gmail.com>
@bact I implemented in 40a9aa5 what I think is a simpler (to maintain) version of your code. It produces triples as below, which are stable across regeneration and extensions. If this is fine, I'll close the PR, and add the RDF/HTML outputs.
@coolharsh55 that looks neat. please close this PR. thank you
- #273 use DPV_VERSION in links
- #274 stable BNode ids for contributors, orgs, sources
- #271 typos in references
- #265 -- same as #274
- changes affect all RDF and HTML
- this commit should be the last commit where random BNodes are used for person etc. and the subsequent commits should be cleaner

Co-authored-by: Arthit Suriyawongkul <arthit@gmail.com>
Pull Request
To address the issue discussed in #266, use contributor name, affiliation name, source label, source URL, and DPV version number to generate a stable ID within a DPV version. The generated IDs for blank nodes will look like:

- `_:person-316c589472c97217`
- `_:org-218d661eb0940e1a`
- `_:web-5349a9b296f3a7553e6a9132e1df2712`

(The prefix `_:` is automatically added by RDFLib as a convention for blank nodes)

These IDs are only intended to be used locally.
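A minimal sketch of how such an ID could be derived, for illustration only. The `hash_id` signature is taken from the diff in this PR, but the body, the choice of SHA-256, the `DPV_VERSION` value, and the `person_id` helper are assumptions and may differ from the actual implementation.

```python
import hashlib

# Assumed placeholder; the real value comes from the DPV build configuration.
DPV_VERSION = "0.0"


def hash_id(input_text: str, output_prefix: str, output_hash_len: int) -> str:
    """Return '<prefix>-<first N hex chars of a hash of input_text>'.

    SHA-256 is an assumption here; any stable digest would give IDs that
    do not change between regenerations of the same input.
    """
    digest = hashlib.sha256(input_text.encode("utf-8")).hexdigest()
    return f"{output_prefix}-{digest[:output_hash_len]}"


def person_id(name: str, version: str = DPV_VERSION) -> str:
    """Hypothetical helper: mix the DPV version into the hashed text so the
    ID is stable within one version but may differ across versions."""
    return hash_id(f"{version}|{name}", "person", 16)
```

The result could then be passed to RDFLib as the blank node identifier; truncating the digest keeps the IDs short at the cost of a small collision risk, which is why the length is a parameter.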
Will fix #266
Update: We can also have something like this, for human readability:
- `_:person-DanielDo-316c5894`
- `_:org-Unabhäng-41451ff1`
- `_:web-GDPRArt4-1g-http-2fd893c49a38fd2a`
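The slug-plus-hash variant could look roughly like the sketch below. The `slug` and `slug_hash_id` helpers, the slug rule (drop everything except word characters and hyphens, then truncate), the default lengths, and the use of SHA-256 are all assumptions made for illustration; the commit itself may build the slug differently.

```python
import hashlib
import re


def slug(text: str, max_len: int = 8) -> str:
    """Drop everything except word characters and hyphens, then truncate.

    In Python 3, \\w matches Unicode letters, so characters such as 'ä'
    are kept rather than stripped.
    """
    return re.sub(r"[^\w-]+", "", text)[:max_len]


def slug_hash_id(text: str, prefix: str, slug_len: int = 8, hash_len: int = 8) -> str:
    """Return '<prefix>-<slug>-<short hash>'.

    The slug is only for human readability; the hash part is what keeps
    two different inputs with the same slug apart.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:hash_len]
    return f"{prefix}-{slug(text, slug_len)}-{digest}"
```

Shorter slugs and hashes are easier to read but collide more easily, so both lengths are left as parameters.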