Use stable ID for contributor, affiliation, and source blank nodes #274

Closed
bact wants to merge 2 commits into w3c:dev from bact:gen-stable-local-id

@bact
Collaborator

@bact bact commented Mar 27, 2025

Pull Request

To address the issue discussed in #266, this PR uses the contributor name, affiliation name, source label, source URL, and DPV version number to generate IDs that are stable within a DPV version. The generated IDs for blank nodes will look like:

  • _:person-316c589472c97217
  • _:org-218d661eb0940e1a
  • _:web-5349a9b296f3a7553e6a9132e1df2712

(The prefix _: is automatically added by RDFLib as a convention for blank nodes)

These IDs are only intended to be used locally.
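
As a rough illustration of the approach described above, the sketch below hashes the identifying text together with a version string and truncates the digest. The signature follows the `hash_id` function touched by this PR; the function body, the choice of MD5, the `|` separator, and the sample inputs are assumptions made for illustration, not the PR's actual code.

```python
import hashlib


def hash_id(input_text: str, output_prefix: str, output_hash_len: int) -> str:
    """Return a stable local ID of the form '<prefix>-<truncated hex digest>'.

    Assumption: MD5 is used here only because its 32 hex characters cover
    the longest ID shown above; any stable hash function would work.
    """
    digest = hashlib.md5(input_text.encode("utf-8")).hexdigest()
    return f"{output_prefix}-{digest[:output_hash_len]}"


# Hypothetical inputs: identifying text combined with a version string.
DPV_VERSION = "2.1"  # placeholder value, not taken from the PR
person_id = hash_id(f"Jane Example|{DPV_VERSION}", "person", 16)
source_id = hash_id(f"Example label|https://example.org/|{DPV_VERSION}", "web", 32)
```

Because the version string is part of the hashed text, the same name yields the same ID on every regeneration within one version and a different ID in the next version.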

Will fix #266

Update: We can also have something like this, for human readability:

  • _:person-DanielDo-316c5894
  • _:org-Unabhäng-41451ff1
  • _:web-GDPRArt4-1g-http-2fd893c49a38fd2a
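
One possible way to build such readable IDs is to prepend a short slug of the input text to a shortened hash. The sketch below is an assumption about how that could work: the `slug` and `readable_id` helpers, the 8-character lengths, and the rule of keeping only word characters and hyphens are hypothetical, not the PR's code.

```python
import hashlib
import re


def slug(text: str, max_len: int = 8) -> str:
    """Drop everything except word characters and hyphens, then truncate."""
    return re.sub(r"[^\w-]", "", text)[:max_len]


def readable_id(prefix: str, text: str, hash_len: int = 8) -> str:
    """Combine a prefix, a short slug, and a truncated hash of the full text."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()[:hash_len]
    return f"{prefix}-{slug(text)}-{digest}"


# Hypothetical contributor name, used only for illustration.
example_id = readable_id("person", "Jane Example")
```

The slug is for human inspection only. Uniqueness still rests on the hash part, so shortening the hash raises the chance of collisions.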

The generated IDs will look like:

- `_:person-316c589472c97217`
- `_:org-218d661eb0940e1a`
- `_:web-5349a9b296f3a7553e6a9132e1df2712`

Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@bact bact added the code label Mar 28, 2025
return authors


def hash_id(input_text: str, output_prefix: str, output_hash_len: int) -> str:

We should prefer UUIDs over hashes to ensure the IDs translate well (even if we are dealing with BNodes). This will also help when using the BNodes later in graphs in case of hash/ID conflicts. Continuing this thought, if we're going to the trouble of assigning IDs to BNodes, we might as well make them fully qualified IDs in the graph and remove the BNode, e.g. contrib-1234.

@coolharsh55
Collaborator

This is fantastic, thanks @bact! My thoughts on implementing these were to get rid of BNodes and use UUID3 or UUID5 to generate IRIs for contributors and orgs, so that these are consistent across graphs, won't have conflicts, and won't be dependent on our code (so we don't need to maintain stuff later). I'll merge your code, then try the above and see which one is better to work with.
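
For reference, name-based UUIDs (version 3 uses MD5, version 5 uses SHA-1) are available in Python's standard `uuid` module. The sketch below shows how a deterministic IRI could be derived from a contributor name; the namespace, the base IRI, and the `contributor_iri` helper are hypothetical and do not come from the DPV code.

```python
import uuid

# Hypothetical namespace and base IRI, chosen only for this example.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/contributors")
BASE_IRI = "https://example.org/contributors/"


def contributor_iri(name: str) -> str:
    """Derive a deterministic IRI from a name using a version 5 UUID."""
    return BASE_IRI + str(uuid.uuid5(NAMESPACE, name))


# The same name always maps to the same IRI, in any graph or run.
iri = contributor_iri("Jane Example")
```

Since the result depends only on the namespace and the name, any tool that knows both can regenerate the same IRI without sharing code.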

The IDs will look like:

- _:person-DanielDo-316c5894
- _:org-Unabhäng-41451ff1
- _:web-GDPRArt4-1g-http-2fd893c49a38fd2a

Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@bact
Collaborator Author

bact commented Apr 1, 2025

The stable UUID3 and UUID5 are good too, maybe better in terms of standards. I wasn't aware of those options before.

Note that I have updated the code slightly; the most recent commit (48f2ae0) adds a short slug for human readability, to make human inspection a little easier:

  • _:person-316c5894xxxxxxxx --> _:person-DanielDo-316c5894
  • _:org-41451ff1xxxxxxxx --> _:org-Unabhäng-41451ff1
  • _:web-2fd893c49a38fd2axxxxxxxxxxxxxxxx --> _:web-GDPRArt4-1g-http-2fd893c49a38fd2a

If this is not preferred, we can go back to the 1d9df97 commit. (Also feel free to adjust the hash/slug length for uniqueness.)

coolharsh55 added a commit that referenced this pull request Apr 7, 2025
closes #274

Squashed commit of the following:

commit ded8a6162245f17e423e547c13205f1c0d503b55
Author: Harshvardhan Pandit <me@harshp.com>
Date:   Mon Apr 7 06:47:55 2025 +0100

    replaces BNode ID generation slugify and md5 hash

    - person has bnode with prefix person-<slugified name>
    - org has bnode with prefix org-<slugified name>
    - references are bnodes with md5 hash of url (labels can change)
    - introduces additional dependency `python-slugify`

commit 05b0ade3ad3a0795f8f0ea38a46dc00e5546326f
Author: Arthit Suriyawongkul <arthit@gmail.com>
Date:   Fri Mar 28 13:48:58 2025 +0000

    Add slug for human readability

    The IDs will look like:

    - _:person-DanielDo-316c5894
    - _:org-Unabhäng-41451ff1
    - _:web-GDPRArt4-1g-http-2fd893c49a38fd2a

    Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>

commit a3769811ebf614218c85811b0bef9da5512647e9
Author: Arthit Suriyawongkul <arthit@gmail.com>
Date:   Thu Mar 27 23:00:52 2025 +0000

    Use stable ID for contributor, affiliation, and source

    The generated IDs will look like:

    - `_:person-316c589472c97217`
    - `_:org-218d661eb0940e1a`
    - `_:web-5349a9b296f3a7553e6a9132e1df2712`

    Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>

Co-authored-by: Arthit Suriyawongkul <arthit@gmail.com>
@coolharsh55
Collaborator

@bact I implemented in 40a9aa5 what I think is a simpler (to maintain) version of your code. It produces triples as below, which are stable across regeneration and extensions. If this is fine, I'll close the PR, and add the RDF/HTML outputs.

<term> contributor _:person-harshvardhan-j-pandit ;
_:person-harshvardhan-j-pandit a dct:Agent,
        foaf:Person ;
    scoro:hasORCID "0000-0002-5068-3714" ;
    org:memberOf _:org-adapt-centre-dublin-city-university ;
    foaf:homepage "https://harshp.com/" ;
    foaf:name "Harshvardhan J. Pandit" .
_:org-adapt-centre-trinity-college-dublin a foaf:Organization ;
    foaf:name "ADAPT Centre, Trinity College Dublin" .
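
The labels in the triples above are consistent with lowercasing the name and joining its alphanumeric runs with hyphens. The sketch below reproduces that with the standard library only. `simple_slugify` is a simplified stand-in for the `python-slugify` dependency named in the squashed commit, and the helper names and the `ref` prefix for references are assumptions rather than the code in 40a9aa5.

```python
import hashlib
import re
import unicodedata


def simple_slugify(text: str) -> str:
    """Stdlib stand-in for python-slugify: ASCII-fold, lowercase, hyphenate."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def person_label(name: str) -> str:
    """Blank node label for a person, e.g. 'person-<slugified name>'."""
    return f"person-{simple_slugify(name)}"


def org_label(name: str) -> str:
    """Blank node label for an organisation, e.g. 'org-<slugified name>'."""
    return f"org-{simple_slugify(name)}"


def reference_label(url: str, prefix: str = "ref") -> str:
    """MD5 of the URL; the 'ref' prefix is a placeholder, not from the commit."""
    return f"{prefix}-{hashlib.md5(url.encode('utf-8')).hexdigest()}"
```

With these helpers, `person_label("Harshvardhan J. Pandit")` gives `person-harshvardhan-j-pandit`, matching the label shown above. Hashing only the URL for references keeps their IDs stable when a label is reworded.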

@bact
Collaborator Author

bact commented Apr 7, 2025

@coolharsh55 that looks neat. please close this PR. thank you

@coolharsh55 coolharsh55 closed this Apr 7, 2025
@bact bact deleted the gen-stable-local-id branch April 7, 2025 13:25
coolharsh55 added a commit that referenced this pull request Apr 7, 2025
- #273 use DPV_VERSION in links
- #274 stable BNode ids for contributors, orgs, sources
- #271 typos in references
- #265 -- same as #274
- changes affect all RDF and HTML
- this commit should be the last commit where random BNodes are used for
  person etc. and the subsequent commits should be cleaner

Co-authored-by: Arthit Suriyawongkul <arthit@gmail.com>