Appendix A accompanies Chapter 2, "Text data and where to find them", of the book: Manika Lamba and Margam Madhusudhan (2021), Text Mining for Information Professionals: An Uncharted Territory, Springer Nature.
Lamba, Manika, & Madhusudhan, Margam. (2021). Appendix-A: Online Repositories Available for Text Mining (Version v1.0). http://doi.org/10.5281/zenodo.5104488
| Repository | Description | Data Types |
|---|---|---|
| Registry of Research Data Repositories | Searchable registry of over 2,000 repositories that host research data. Individual datasets may be subject to use restrictions | Archived, audiovisual, configuration, databases, images, network-based, raw, scientific and statistical data among others |
| Harvard Dataverse | Searchable repository of research data in a variety of formats. Individual datasets may be subject to use restrictions | Applications, audio, documents, FITS, images, tabular data, text, compressed files (e.g. ZIP) |
| Full-text corpus data | Contains full-text, downloadable corpus data from six large English corpora. Individual datasets may be subject to use restrictions or require purchase | Databases, plain text |
| English-Corpora | Contains downloadable corpora developed by Mark Davies, Brigham Young University. Individual datasets may be subject to use restrictions or require purchase | Databases, plain text |
| Project Gutenberg | Offers over 58,000 free eBooks in a variety of languages | ePub, HTML, Kindle, plain text |
| Spatial Data Repository | Provides geographically-linked health and demographic data from DHS Program and the U.S. Census Bureau for mapping in geographic information systems (GIS) | Various geospatial formats, CSV |
| Natural Earth | Free vector and raster map data | ESRI shapefile, TIFF, TFW |
| New York University (NYU) Spatial Data Repository | Provides a catalog of geospatial data and maps available from New York University | Image, Polygon, Raster, Line, Point, Mixed |
| HathiTrust | Non-profit, large-scale digital preservation repository that includes digital content from research libraries via the Google Books and Internet Archive initiatives | |
| Global NDLTD | Open-access electronic theses and dissertations database provided by the Networked Digital Library of Theses and Dissertations | |
| Open Access Theses and Dissertations | Open-access electronic theses and dissertations database | |
| PQDT Open | Full-text open access theses and dissertations database | |
| arXiv | Provides open-access, full-text preprints in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics | |
| bioRxiv | Provides open-access, full-text preprints in the life sciences | |
| Wikipedia | Collects and develops content for the public in an open-access environment | |
Adapted from
©2020 MIT Libraries - reprinted with permission. https://libraries.mit.edu/scholarly/publishing/apis-for-scholarly-resources/. Accessed 26th Feb 2020,
©2020 Purdue University - reprinted with permission. https://guides.lib.purdue.edu/c.php?g=412592. Accessed 26th Feb 2020,
©2020 USC LibGuides - reprinted with permission. https://libguides.usc.edu/contentmining/databases. Accessed 26th Feb 2020.
| Resource | Description | Fee | Result Format | Limitations | Registration |
|---|---|---|---|---|---|
| arXiv | It provides access to both metadata and article abstracts | Free | Atom | None | None |
| SAO/NASA Astrophysics Data System (ADS) | It provides access to bibliographic data on astronomy and physics publications | Free | JSON | Rate limits apply | Key required |
| BioMed Central | It provides access to both metadata and full-text content | Free | XML, JSON | None | Key required |
| Chronicling America | It provides access to historic newspapers and select digitized newspaper pages | Free | HTML (default), JSON, Atom | None | None |
| CrossRef | It provides access to metadata records with CrossRef DOIs | Free | JSON | None | None |
| Digital Public Library of America | It provides access to metadata of its collection | Free | JSON-LD | None | Key required |
| HathiTrust (Bibliographic API) | It provides access to bibliographic and rights information for its collection. It does not provide an API for bulk retrieval of records | Free | MARC-XML, JSON | No specific limits; however, it is intended only for small numbers of items. Permission must be sought for bulk retrieval | None |
| HathiTrust (Data API) | It provides access to HathiTrust and Google digitized texts of public domain works | Free | XML, JSON | No specific limits. However, consult their policies on data use | Key required |
| IEEE Xplore | It provides metadata for the articles submitted to the database | Free | XML | Max 200 results per query | Must subscribe to or be a member of an institution that subscribes to IEEE Xplore |
| JSTOR Data for Research | It provides access to content on JSTOR for research and teaching | Free | Zip files, XML | Max 25,000 documents per dataset; users can get access to more datasets by special request | Requires MyJSTOR account registration |
| Library of Congress | It provides multiple APIs available to download bibliographic data and search Library of Congress digital collections | Free | Varies | Varies | Most APIs do not require key |
| Nature | It provides access to the metadata of its collection | Free | XML, JSON, and more | No specific limits; however, downloads should be limited to “reasonable rates” (see the Springer Nature TDM Policy) | Varies |
| National Library of Medicine | It provides 29 separate APIs for accessing a wide variety of content from various NLM databases | Varies | Varies | Varies | Varies |
| National Center for Biotechnology Information | It offers several public APIs to access many databases and tools, including PubMed, PMC, Gene, Nuccore, and Protein | Free | Varies | Varies | Key required for some |
| Organisation for Economic Co-Operation and Development (OECD) | It provides access to the top used OECD datasets | Free | JSON, XML | Max 1,000,000 results per query, max URL length of 1,000 characters | None |
| Open Academic Graph | It provides datasets for citations drawn from two large academic graphs: Microsoft Academic Graph and AMiner | Free | Zip, JSON | None | None |
| ORCID | It provides researcher profile data | Free, with subscription options | HTML, XML, or JSON | Two options: 1) Users can access the free Public API, which only returns data marked as “public”; 2) Become an ORCID member to receive API credentials | ORCID ID account required |
| Oxford English Dictionary (OED) | It provides access to its datasets | Free, with subscription options | JSON | 3,000 requests per month and 60 calls per minute with the free option; other options are available | Key required. Academic researchers can request free access |
| PLOS Article-Level Metrics | It provides article-level metrics (including usage statistics, citation counts, and social networking activity) for articles published in PLOS journals and articles added to PLOS Hubs: Biodiversity | Free | XML, JSON, CSV | Results limited to batches of 50 at a time | Key required |
| PLOS Search | It allows PLOS content to be queried for integration with web, desktop, or mobile applications | Free | XML, JSON | Max 7,200 requests per day, 300 per hour, 10 per minute; users should wait 5 seconds for each query to return results; requests should not return more than 100 rows. API users are limited to no more than five concurrent connections from a single IP address | Key required |
| SpringerLink | It provides access to the metadata of its collection | Free | XML, JSON, and more | No specific limits; however, downloads should be limited to “reasonable rates” (see the Springer Nature TDM Policy) | Varies |
| World Bank | It provides access to World Bank statistical databases, indicators, projects, loans, credits, financial statements, and other data related to financial operations | Free | Varies | Request volume limits are unspecified, but should be “reasonable” | None |
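
As an illustration of how one of the APIs listed above might be used, the following minimal Python sketch targets the arXiv entry, which the table lists as free, returning Atom, and requiring no registration. The endpoint URL, the query parameter names (`search_query`, `start`, `max_results`), and the helper names are assumptions made for this example and are not taken from the appendix; consult the arXiv API documentation before relying on them.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: public arXiv query endpoint; verify against the arXiv API documentation.
ARXIV_API = "http://export.arxiv.org/api/query"
# Standard XML namespace used by Atom feeds.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(search_query, start=0, max_results=5):
    """Build a query URL (hypothetical helper; parameter names are assumed)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_atom(feed_xml):
    """Return the title and abstract of every <entry> in an Atom feed."""
    root = ET.fromstring(feed_xml)
    records = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
        # Collapse the line breaks and indentation that feeds often contain.
        records.append({
            "title": " ".join(title.split()),
            "abstract": " ".join(summary.split()),
        })
    return records


def fetch_records(search_query, max_results=5):
    """Download one page of results and parse it (requires network access)."""
    url = build_query_url(search_query, max_results=max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_atom(response.read())
```

A call such as `fetch_records("all:text mining")` would return a list of dictionaries that can be written to plain text for later analysis. For the services in the table that impose rate limits, a pause between successive requests (for example with `time.sleep`) is one simple way to stay within them.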