
Indexing algorithm #161

@jbd

Description


Hello,

This is more a suggestion than an issue.

The duc indexing is already quite fast, but you might be interested in the filesystem-crawling algorithm of robinhood (also written in C), if you don't already know the project:

To go beyond the performance of classical scanning tools, robinhood implements a multi-threaded version of depth-first traversal[4]. To parallelize the scan, the namespace traversal is split into individual tasks that consist in reading single directories. A pool of worker threads performs these tasks following a depth-first strategy (as illustrated on figure 3).

from the paper "Taking back control of HPC file systems with Robinhood Policy Engine".

If the storage handles parallel requests well, that's a huge win. I don't yet have metrics comparing crawling performance between robinhood and duc, but I'm using robinhood on a petabyte filer accessed over NFSv3, with nearly 1 billion files on it, and crawling performance scales roughly linearly up to 8 threads. Beyond that, I can't tell whether the filer or the MySQL database that holds the results becomes the bottleneck.
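For concreteness, here is a minimal sketch of the scheme the paper describes: a pool of worker threads consumes a shared LIFO stack of directory paths, so each task is "read one directory" and traversal proceeds roughly depth-first. This is not robinhood's or duc's actual code (both are in C); it is just an illustration in Python, and the function and parameter names are made up for the example.

```python
import os
import threading

def parallel_scan(root, num_workers=4):
    """Count files and directories under `root` with a parallel,
    roughly depth-first crawl (one task = read one directory)."""
    stack = [root]               # LIFO stack => depth-first order
    lock = threading.Lock()
    cond = threading.Condition(lock)
    active = 0                   # directories currently being read
    counts = {"files": 0, "dirs": 0}

    def worker():
        nonlocal active
        while True:
            with cond:
                # Sleep while the stack is empty but other workers may
                # still push subdirectories they discover.
                while not stack and active:
                    cond.wait()
                if not stack and not active:
                    cond.notify_all()  # wake any remaining sleepers
                    return
                path = stack.pop()
                active += 1
            # Read one directory outside the lock.
            nfiles, subdirs = 0, []
            try:
                with os.scandir(path) as it:
                    for entry in it:
                        if entry.is_dir(follow_symlinks=False):
                            subdirs.append(entry.path)
                        else:
                            nfiles += 1
            except OSError:
                pass  # unreadable directory: skip it
            with cond:
                counts["files"] += nfiles
                counts["dirs"] += len(subdirs)
                stack.extend(subdirs)  # new tasks for the pool
                active -= 1
                cond.notify_all()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

The `active` counter is what makes termination safe: a worker may only exit when the stack is empty and no other worker is still reading a directory (and might therefore push more work).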
