Indexing algorithm

Hello,

This is more of a suggestion than an issue.

duc's indexing is already quite fast, but you might be interested in the filesystem crawling algorithm of [robinhood](https://github.com/cea-hpc/robinhood), which is also written in C, in case you don't already know the project:

> To go beyond the performance of classical scanning tools,
> robinhood implements a multi-threaded version of depth-first
> traversal[4]. To parallelize the scan, the namespace traversal
> is split into individual tasks that consist in reading single
> directories. A pool of worker threads performs these tasks
> following a depth-first strategy (as illustrated on figure 3).

This is quoted from the paper [Taking back control of HPC file systems with Robinhood Policy Engine](https://arxiv.org/pdf/1505.01448.pdf). See the [Papers & presentations](https://github.com/cea-hpc/robinhood/wiki/Papers-%26-presentations) wiki page for more.
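
To make the idea a bit more concrete, here is a minimal Python sketch of the scheme as I read it from the quote: one task per directory, a shared LIFO stack so that workers tend to go deep first, and a fixed pool of threads. This is not robinhood's actual code, nor anything taken from duc; the name `parallel_scan` and the `num_threads` parameter are made up for illustration, and it only sums file counts and sizes instead of writing to a database.

```python
import os
import queue
import threading


def parallel_scan(root, num_threads=8):
    """Return (file_count, total_bytes) for the tree under root.

    Hypothetical helper for illustration only.
    """
    tasks = queue.LifoQueue()  # LIFO order approximates a depth-first strategy
    tasks.put(root)
    lock = threading.Lock()
    totals = {"files": 0, "bytes": 0}

    def worker():
        while True:
            path = tasks.get()
            if path is None:  # sentinel: the scan is finished
                tasks.task_done()
                return
            files = size = 0
            try:
                # One task == reading a single directory.
                with os.scandir(path) as entries:
                    for entry in entries:
                        if entry.is_dir(follow_symlinks=False):
                            tasks.put(entry.path)  # new task for the subdirectory
                        elif entry.is_file(follow_symlinks=False):
                            files += 1
                            size += entry.stat(follow_symlinks=False).st_size
            except OSError:
                pass  # unreadable or vanished entry: skip the rest of this directory
            finally:
                with lock:
                    totals["files"] += files
                    totals["bytes"] += size
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()  # returns once every queued directory has been processed
    for _ in threads:
        tasks.put(None)  # wake each worker so it can exit
    for t in threads:
        t.join()
    return totals["files"], totals["bytes"]
```

Calling `parallel_scan("/some/path", num_threads=8)` returns the number of regular files and their cumulative size. Subdirectories are queued before the parent task is marked done, so `tasks.join()` cannot return while work is still pending. Whether extra threads actually help depends on how well the storage backend serves concurrent `readdir`/`stat` requests.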

If the storage handles parallel requests well, this is a huge win. I don't have any metrics comparing the crawling performance of robinhood and duc yet, but I'm running robinhood against a petabyte filer over NFSv3 that holds nearly 1 billion files, and the crawling performance scales almost linearly up to 8 threads. Beyond that, I can't tell at the moment whether the bottleneck is the filer or the MySQL database that stores the results.
