[Bug]: "One to many", "many to one" or "one to one" for dataset relationship to datasource - zero length temp dag file #204

@distora-w3

Description

Problem 1: Temporary zero-length dag file in the output directory

drwxrwxr-x 2 fc fc 4096 Aug 14 09:40 .
drwxrwxr-x 6 fc fc 4096 Aug 14 08:29 ..
-rw-rw-r-- 1 fc fc    0 Aug 14 08:59 9d55c1f3-b8cd-4155-9371-2fad030d059c.car
-rw-rw-r-- 1 fc fc  114 Aug 14 09:40 baga6....5qi4tsazlqbq.car

Problem 2: What is the agreed relationship between a data source and a dataset?
Which of these is correct?
 - A data source can have many datasets?
 - A data source can have only one dataset?
 - A dataset can have many data sources?


Context:
I mistakenly added a new file to the wrong input folder, and this produced a zero-length DAG file.

This might not be a bug in itself, but rather a consequence of my not understanding the data source to dataset relationship. In any case, zero-length temporary files should not remain after the process has terminated.
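
To make the leftover files easier to spot, here is a minimal sketch of a check that could be run after the worker exits. The function name find_empty_cars is hypothetical and not part of singularity; it only assumes that temporary DAG files use the .car extension, as in the listings in this report, and that the local find supports -maxdepth (GNU and BSD find do).

```shell
# Hypothetical helper (not part of singularity): list zero-length .car files
# directly inside the given output directory.
find_empty_cars() {
    dir="$1"
    # -size 0 matches only files that occupy zero blocks, i.e. empty files.
    find "$dir" -maxdepth 1 -type f -name '*.car' -size 0
}
```

For example, find_empty_cars /mnt/blockstorage/testround2/01-test-out-a should print any empty .car file left in that folder, and print nothing if the directory is clean.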

The example in the "Steps to Reproduce" section below shows a dataset with two data sources.

Related issue [1]:
If a dataset can have multiple data sources, then I expect preparation to occur on the dataset, which means the dataset worker would scan all folders for the dataset and its data sources. From the test, it appears that only the data-source-referenced items are scanned (a new item in data source 1 is missed and requires a manual scan).

Related issue [2]:
A CLI tool is needed to print the dataset-to-datasource relationship,
e.g.
$ singularity dataset schema
Dataset    Datasource
01-test    
-------------[name 1]
-------------[name 2]

Steps to Reproduce

An incorrect relationship between datasource and dataset can produce an empty DAG file (see folder 01-test-out-a and 9d55c1f3-b8cd-4155-9371-2fad030d059c.car).

01-test-out:
total 16
drwxrwxr-x 2 fc fc 4096 Aug 14 08:59 .
drwxrwxr-x 6 fc fc 4096 Aug 14 08:29 ..
-rw-rw-r-- 1 fc fc  334 Aug 14 08:59 baga6ea4seaqiwnjf2ydmouvqhovhnismpsokdttmv3tiaipdkcafhl4qyndjscy.car
-rw-rw-r-- 1 fc fc  371 Aug 14 08:59 baga6ea4seaqnaqlh4koxgqetwjnzztxnmpdn3msyslcq3l7mfwkullew6gnkenq.car

01-test-out-a:
total 12
drwxrwxr-x 2 fc fc 4096 Aug 14 09:40 .
drwxrwxr-x 6 fc fc 4096 Aug 14 08:29 ..
-rw-rw-r-- 1 fc fc    0 Aug 14 08:59 9d55c1f3-b8cd-4155-9371-2fad030d059c.car
-rw-rw-r-- 1 fc fc  114 Aug 14 09:40 baga6ea4seaqdqljklw2naa5her4j2bt6otjg7zj3ijlwkrbpzs35qi4tsazlqbq.car

Observation: The dataset has two --output-dir values (see the FilePath column).

 singularity dataset list-pieces 01-test
2023-08-14T10:03:17.647Z        INFO    database        database/connstring_cgo.go:26   Opening sqlite database (cgo version)
ID  CreatedAt             PieceCID                                                          PieceSize    RootCID                                                      FileSize  FilePath                                                                                                         DatasetID  SourceID  ChunkID
1   2023-08-14 08:59:40Z  baga6ea4seaqiwnjf2ydmouvqhovhnismpsokdttmv3tiaipdkcafhl4qyndjscy  34359738368  bafkreig6kienkbdckf6dmt7shmtq3yocnhjwbl2ki3g3wyzogyq5oprftq  334       /mnt/blockstorage/testround2/01-test-out/baga6ea4seaqiwnjf2ydmouvqhovhnismpsokdttmv3tiaipdkcafhl4qyndjscy.car    1          1         1
2   2023-08-14 08:59:40Z  baga6ea4seaqnaqlh4koxgqetwjnzztxnmpdn3msyslcq3l7mfwkullew6gnkenq  34359738368  bafkreidqt6tqn52ubergv3gjl7kuhh4xw5q4m2ao2zz2ki4j3yhrl4zuxy  371       /mnt/blockstorage/testround2/01-test-out/baga6ea4seaqnaqlh4koxgqetwjnzztxnmpdn3msyslcq3l7mfwkullew6gnkenq.car    1          1
3   2023-08-14 09:40:26Z  baga6ea4seaqdqljklw2naa5her4j2bt6otjg7zj3ijlwkrbpzs35qi4tsazlqbq  34359738368  bafkreih6xy24gufltqg7rewebsjzuf7sx7zpmh4n7xaicha66p3hiavogu  114       /mnt/blockstorage/testround2/01-test-out-a/baga6ea4seaqdqljklw2naa5her4j2bt6otjg7zj3ijlwkrbpzs35qi4tsazlqbq.car  1          1         2
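
The observation above can be checked mechanically. The following is a rough sketch, not an existing singularity command: summarise_piece_dirs is a hypothetical helper that reads list-pieces output on stdin and prints each distinct dataset/source/output-directory combination. It assumes the column layout shown above, where FilePath is the only field that begins with "/" and is immediately followed by DatasetID and SourceID.

```shell
# Hypothetical helper: summarise which output directories hold pieces for each
# dataset/source pair, based on the text output of `dataset list-pieces`.
summarise_piece_dirs() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                # Assumption: FilePath is the only field starting with "/".
                if (substr($i, 1, 1) == "/") {
                    dir = $i
                    sub("/[^/]*$", "", dir)   # strip the file name, keep the directory
                    key = "dataset=" $(i + 1) " source=" $(i + 2) " dir=" dir
                    if (!(key in seen)) { seen[key] = 1; print key }
                    break
                }
            }
        }
    '
}
```

It could be used as: singularity dataset list-pieces 01-test | summarise_piece_dirs. More than one dir= line for the same dataset and source would point to pieces spread across several output directories, as in the listing above.
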
[START] ==============##### This Seems to work ####
# prep
# reset database & delete file
rm /mnt/blockstorage/testround2/01-test*/*
singularity admin reset --really-do-it

# create test files in shell
cd /mnt/blockstorage/testround2/01-test-in
for i in {1..5}; do echo "Hello from file ${i}" > "hello${i}.txt"; done

# check empty
singularity dataset list
singularity datasource list

# create a dataset
singularity dataset create  --output-dir /mnt/blockstorage/testround2/01-test-out 01-test
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 1
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource inspect dags 1

# Now update -----------------
mkdir /mnt/blockstorage/testround2/01-test-out-a
singularity dataset update  --output-dir /mnt/blockstorage/testround2/01-test-out-a 01-test
singularity run dataset-worker --exit-on-complete --exit-on-error

# >>>   add a new file to datasource 2  <<<<<
mkdir /mnt/blockstorage/testround2/01-test-in-a
cd /mnt/blockstorage/testround2/01-test-in-a        # directory ok
echo "Hello from file 6" > hello6.txt
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in-a
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 2
singularity run dataset-worker --exit-on-complete --exit-on-error
# check to see what has occurred

[START] ==============##### This Does NOT seem to work ####

== PREP ====
# reset database & delete files

rm /mnt/blockstorage/testround2/01-test*/*
singularity admin reset --really-do-it

# create test files in shell
cd /mnt/blockstorage/testround2/01-test-in
for i in {1..5}; do echo "Hello from file ${i}" > "hello${i}.txt"; done

# check empty
singularity dataset list
singularity datasource list

# create a dataset
singularity dataset create  --output-dir /mnt/blockstorage/testround2/01-test-out 01-test
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 1
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource inspect dags 1

# Now update -----------------
mkdir /mnt/blockstorage/testround2/01-test-out-a
singularity dataset update  --output-dir /mnt/blockstorage/testround2/01-test-out-a 01-test
singularity run dataset-worker --exit-on-complete --exit-on-error

# >>> add a new file to datasource 1  <<<<<<<<<
mkdir /mnt/blockstorage/testround2/01-test-in-a
cd /mnt/blockstorage/testround2/01-test-in           # directory not ok
echo "Hello from file 6" > hello6.txt
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in-a
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 2
singularity run dataset-worker --exit-on-complete --exit-on-error
# check to see what has occurred
# new file is not scanned
# zero length temp dag file created


Version

singularity v0.2.47-1f0d416-dirty

Operating System

Linux

Database Backend

SQLite

Additional context

As part of testing #90

Labels

bug, triage
