I have run into several reasons why indexing a doc can fail (API error, pdfreader error, PDF error, etc.). However, the current CLI and indexing functions simply error out when this happens. A more graceful strategy would be to add the skipped PDFs/docs to a list and report the failed instances at the end instead of erroring out.
For the moment, I'm reusing the indexing functions and adding the failed docs to a list (https://github.com/sensein/paperqa-test/blob/53d5dcf6af3d44645668a01103247d6d8b0de86a/index_abcd.py#L26). But this iterative process is slow and relies on monitoring.
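To make the request concrete, here is a rough sketch of the behavior I have in mind. It is not the library's API: `index_all`, `IndexFailure`, and `format_report` are hypothetical names, and `index_one` stands in for whatever function actually indexes a single document. The idea is just to catch the error per document, keep going, and report everything that was skipped at the end.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


@dataclass
class IndexFailure:
    """One document that could not be indexed, with the reason."""
    path: Path
    error: str


def index_all(
    paths: Iterable[Path],
    index_one: Callable[[Path], None],  # hypothetical: indexes a single doc
) -> Tuple[List[Path], List[IndexFailure]]:
    """Index every path, collecting failures instead of stopping at the first."""
    indexed: List[Path] = []
    failures: List[IndexFailure] = []
    for path in paths:
        try:
            index_one(path)
        except Exception as exc:  # broad on purpose: API, reader, and PDF errors
            failures.append(IndexFailure(path, f"{type(exc).__name__}: {exc}"))
        else:
            indexed.append(path)
    return indexed, failures


def format_report(failures: List[IndexFailure]) -> str:
    """Build the end-of-run summary of skipped documents."""
    if not failures:
        return "All documents indexed."
    lines = [f"{len(failures)} document(s) skipped:"]
    lines += [f"  {f.path}: {f.error}" for f in failures]
    return "\n".join(lines)
```

With something like this, the CLI could print `format_report(failures)` after the run and exit with a non-zero status if the list is non-empty, so a failed PDF no longer aborts the whole batch.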