Hello,
This is more of a suggestion than an issue.
duc's indexing is already quite fast, but if you don't already know the project, you might be interested in the filesystem crawling algorithm of robinhood, which is also written in C:
To go beyond the performance of classical scanning tools, robinhood implements a multi-threaded version of depth-first traversal[4]. To parallelize the scan, the namespace traversal is split into individual tasks that consist in reading single directories. A pool of worker threads performs these tasks following a depth-first strategy (as illustrated on figure 3).
From the paper "Taking back control of HPC file systems with Robinhood Policy Engine". See here for more.
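To make the idea in the quote concrete, here is a minimal sketch of that kind of traversal using POSIX threads. It is not robinhood's or duc's actual code: `struct scan`, `push_dir`, `read_one_dir`, `worker`, and `scan_tree` are hypothetical names, the cap of 64 workers is arbitrary, and the depth-first behaviour is only approximated by a shared LIFO stack from which workers always take the most recently discovered directory. It just counts entries and stores nothing.

```c
#define _XOPEN_SOURCE 700   /* for lstat() and strdup() */
#include <assert.h>
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Shared scan state: a LIFO stack of directories still to be read. */
struct scan {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    char **stack;             /* queued directory paths (heap-allocated) */
    size_t len, cap, pending; /* pending = queued + currently being read */
    long entries;             /* entries seen so far */
};

/* Queue one directory task; takes ownership of path. */
static void push_dir(struct scan *s, char *path)
{
    pthread_mutex_lock(&s->lock);
    if (s->len == s->cap) {
        s->cap = s->cap ? s->cap * 2 : 64;
        s->stack = realloc(s->stack, s->cap * sizeof *s->stack);
        if (!s->stack) abort();
    }
    s->stack[s->len++] = path;
    s->pending++;
    pthread_cond_signal(&s->wake);
    pthread_mutex_unlock(&s->lock);
}

/* One task: read a single directory and queue its subdirectories. */
static void read_one_dir(struct scan *s, const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    struct stat st;
    long seen = 0;
    while (d && (e = readdir(d)) != NULL) {
        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, "..")) continue;
        size_t n = strlen(dir) + strlen(e->d_name) + 2;
        char *path = malloc(n);
        if (!path) abort();
        snprintf(path, n, "%s/%s", dir, e->d_name);
        seen++;
        if (lstat(path, &st) == 0 && S_ISDIR(st.st_mode))
            push_dir(s, path);      /* the stack now owns path */
        else
            free(path);
    }
    if (d) closedir(d);
    pthread_mutex_lock(&s->lock);
    s->entries += seen;
    assert(s->pending > 0);
    if (--s->pending == 0)          /* last task done: wake idle workers */
        pthread_cond_broadcast(&s->wake);
    pthread_mutex_unlock(&s->lock);
}

static void *worker(void *arg)
{
    struct scan *s = arg;
    for (;;) {
        pthread_mutex_lock(&s->lock);
        while (s->len == 0 && s->pending > 0)
            pthread_cond_wait(&s->wake, &s->lock);
        /* LIFO pop: newest directory first; NULL means the scan is over. */
        char *dir = s->len ? s->stack[--s->len] : NULL;
        pthread_mutex_unlock(&s->lock);
        if (!dir) return NULL;
        read_one_dir(s, dir);
        free(dir);
    }
}

/* Returns the number of entries below root, or -1 on bad arguments. */
long scan_tree(const char *root, int nthreads)
{
    pthread_t tid[64];              /* arbitrary cap of 64 workers */
    struct scan s = {0};
    char *first = strdup(root);
    int started = 0;
    if (nthreads < 1 || nthreads > 64 || !first) { free(first); return -1; }
    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.wake, NULL);
    push_dir(&s, first);
    for (int i = 1; i < nthreads; i++)  /* the caller acts as worker 0 */
        if (pthread_create(&tid[started], NULL, worker, &s) == 0) started++;
    worker(&s);
    for (int i = 0; i < started; i++) pthread_join(tid[i], NULL);
    free(s.stack);
    pthread_mutex_destroy(&s.lock);
    pthread_cond_destroy(&s.wake);
    return s.entries;
}
```

Calling `scan_tree("/some/path", 8)` from a `main` would scan with 8 workers. The `pending` counter is what lets idle workers tell "nothing queued right now" apart from "the scan is finished": it only reaches zero once every queued directory has been fully read.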
If the storage can handle parallel requests well, that's a huge win. Right now I don't have any metrics comparing the crawling performance of robinhood and duc, but I'm using robinhood on a petabyte filer over NFSv3 with nearly 1 billion files on it, and crawling performance scales roughly linearly up to 8 threads. Beyond that, I can't currently tell whether the filer or the MySQL database that holds the results is the bottleneck.