Description
Problem 1: Temporary zero-length dag file in the output directory
drwxrwxr-x 2 fc fc 4096 Aug 14 09:40 .
drwxrwxr-x 6 fc fc 4096 Aug 14 08:29 ..
-rw-rw-r-- 1 fc fc 0 Aug 14 08:59 9d55c1f3-b8cd-4155-9371-2fad030d059c.car
-rw-rw-r-- 1 fc fc 114 Aug 14 09:40 baga6....5qi4tsazlqbq.car
Problem 2: What is the agreed relationship between data source and dataset?
Which of these is correct?
- data source can have many datasets?
- data source can only have one dataset?
- dataset can have many data sources?
Context:
I accidentally added a new file to the wrong input folder, and the subsequent run left a zero-length dag file in the output directory.
This may not be a bug in itself, but rather a consequence of my not understanding the data source to dataset relationship. In any case, zero-length temporary files should not remain after the process has terminated.
The example in the "Steps to Reproduce" section below shows a dataset with two data sources.
Related issue [1]:
If a dataset can have multiple data sources, then I expect preparation to occur at the dataset level, which means the dataset worker would scan all folders for the dataset and its data sources. From the test, it appears that only the items of the referenced data source are scanned: the new item in data source 1 is missed and requires a manual scan.
Related issue [2]:
A CLI tool is needed to print the dataset to data source relationship, e.g.
$ singularity dataset schema
Dataset Datasource
01-test
-------------[name 1]
-------------[name 2]
Steps to Reproduce
An incorrect datasource-to-dataset relationship can produce an empty dag file (see folder 01-test-out-a and 9d55c1f3-b8cd-4155-9371-2fad030d059c.car).
01-test-out:
total 16
drwxrwxr-x 2 fc fc 4096 Aug 14 08:59 .
drwxrwxr-x 6 fc fc 4096 Aug 14 08:29 ..
-rw-rw-r-- 1 fc fc 334 Aug 14 08:59 baga6ea4seaqiwnjf2ydmouvqhovhnismpsokdttmv3tiaipdkcafhl4qyndjscy.car
-rw-rw-r-- 1 fc fc 371 Aug 14 08:59 baga6ea4seaqnaqlh4koxgqetwjnzztxnmpdn3msyslcq3l7mfwkullew6gnkenq.car
01-test-out-a:
total 12
drwxrwxr-x 2 fc fc 4096 Aug 14 09:40 .
drwxrwxr-x 6 fc fc 4096 Aug 14 08:29 ..
-rw-rw-r-- 1 fc fc 0 Aug 14 08:59 9d55c1f3-b8cd-4155-9371-2fad030d059c.car
-rw-rw-r-- 1 fc fc 114 Aug 14 09:40 baga6ea4seaqdqljklw2naa5her4j2bt6otjg7zj3ijlwkrbpzs35qi4tsazlqbq.car
Observation: the dataset's pieces are spread across two --output-dir locations (see the FilePath column)
singularity dataset list-pieces 01-test
2023-08-14T10:03:17.647Z INFO database database/connstring_cgo.go:26 Opening sqlite database (cgo version)
ID CreatedAt PieceCID PieceSize RootCID FileSize FilePath DatasetID SourceID ChunkID
1 2023-08-14 08:59:40Z baga6ea4seaqiwnjf2ydmouvqhovhnismpsokdttmv3tiaipdkcafhl4qyndjscy 34359738368 bafkreig6kienkbdckf6dmt7shmtq3yocnhjwbl2ki3g3wyzogyq5oprftq 334 /mnt/blockstorage/testround2/01-test-out/baga6ea4seaqiwnjf2ydmouvqhovhnismpsokdttmv3tiaipdkcafhl4qyndjscy.car 1 1 1
2 2023-08-14 08:59:40Z baga6ea4seaqnaqlh4koxgqetwjnzztxnmpdn3msyslcq3l7mfwkullew6gnkenq 34359738368 bafkreidqt6tqn52ubergv3gjl7kuhh4xw5q4m2ao2zz2ki4j3yhrl4zuxy 371 /mnt/blockstorage/testround2/01-test-out/baga6ea4seaqnaqlh4koxgqetwjnzztxnmpdn3msyslcq3l7mfwkullew6gnkenq.car 1 1
3 2023-08-14 09:40:26Z baga6ea4seaqdqljklw2naa5her4j2bt6otjg7zj3ijlwkrbpzs35qi4tsazlqbq 34359738368 bafkreih6xy24gufltqg7rewebsjzuf7sx7zpmh4n7xaicha66p3hiavogu 114 /mnt/blockstorage/testround2/01-test-out-a/baga6ea4seaqdqljklw2naa5her4j2bt6otjg7zj3ijlwkrbpzs35qi4tsazlqbq.car 1 1 2
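The observation above can be checked without reading the table by eye. The following is a minimal sketch, not part of singularity itself: `piece_dirs` is a hypothetical helper name, and it assumes the FilePath column is the only whitespace-separated field that is an absolute path ending in `.car`.

```shell
# Hypothetical helper: read `singularity dataset list-pieces` output on stdin
# and print "<piece count> <directory>" for each distinct output directory.
# Assumption: only the FilePath column is an absolute path ending in .car.
piece_dirs() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ "^/.*[.]car$") {
          dir = $i
          sub("/[^/]*$", "", dir)   # strip the file name, keep the directory
          count[dir]++
        }
      }
    }
    END { for (d in count) print count[d], d }
  ' | LC_ALL=C sort -k2
}
```

Usage: `singularity dataset list-pieces 01-test | piece_dirs`. More than one line of output means the dataset's pieces live in more than one output directory.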
[START] ==============##### This Seems to work ####
# prep
# reset database & delete file
rm /mnt/blockstorage/testround2/01-test*/*
singularity admin reset --really-do-it
# create test files in shell
cd /mnt/blockstorage/testround2/01-test-in
for i in {1..5}; do echo "Hello from file ${i}" > "hello${i}.txt"; done
# check empty
singularity dataset list
singularity datasource list
# create a dataset
singularity dataset create --output-dir /mnt/blockstorage/testround2/01-test-out 01-test
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 1
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource inspect dags 1
# Now update -----------------
mkdir /mnt/blockstorage/testround2/01-test-out-a
singularity dataset update --output-dir /mnt/blockstorage/testround2/01-test-out-a 01-test
singularity run dataset-worker --exit-on-complete --exit-on-error
# >>> add a new file to datasource 2 <<<<<
mkdir /mnt/blockstorage/testround2/01-test-in-a
cd /mnt/blockstorage/testround2/01-test-in-a # directory ok
echo "Hello from file 6" > hello6.txt
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in-a
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 2
singularity run dataset-worker --exit-on-complete --exit-on-error
# check to see what has occurred
[START] ==============##### This Does NOT seem to work ####
== PREP ====
# reset database & delete files
rm /mnt/blockstorage/testround2/01-test*/*
singularity admin reset --really-do-it
# create test files in shell
cd /mnt/blockstorage/testround2/01-test-in
for i in {1..5}; do echo "Hello from file ${i}" > "hello${i}.txt"; done
# check empty
singularity dataset list
singularity datasource list
# create a dataset
singularity dataset create --output-dir /mnt/blockstorage/testround2/01-test-out 01-test
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 1
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource inspect dags 1
# Now update -----------------
mkdir /mnt/blockstorage/testround2/01-test-out-a
singularity dataset update --output-dir /mnt/blockstorage/testround2/01-test-out-a 01-test
singularity run dataset-worker --exit-on-complete --exit-on-error
# >>> add a new file to datasource 1 <<<<<<<<<
mkdir /mnt/blockstorage/testround2/01-test-in-a
cd /mnt/blockstorage/testround2/01-test-in # directory not ok
echo "Hello from file 6" > hello6.txt
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in-a
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 2
singularity run dataset-worker --exit-on-complete --exit-on-error
# check to see what has occurred
# new file is not scanned
# zero length temp dag file created
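To confirm whether a run left zero-length files behind, the output directories can be checked after the worker exits. This is a minimal sketch using only standard `find`; `find_empty_cars` is a hypothetical helper name, and `-maxdepth` assumes GNU or BSD `find` (available on the Linux system used here).

```shell
# Hypothetical helper: list zero-length .car files directly inside the
# given output directories (no recursion into subdirectories).
find_empty_cars() {
  find "$@" -maxdepth 1 -type f -name '*.car' -size 0
}
```

Usage: `find_empty_cars /mnt/blockstorage/testround2/01-test-out /mnt/blockstorage/testround2/01-test-out-a`. Any path printed is a leftover empty file such as the 9d55c1f3-b8cd-4155-9371-2fad030d059c.car shown above.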
Version
singularity v0.2.47-1f0d416-dirty
Operating System
Linux
Database Backend
SQLite
Additional context
As part of testing #90