Open datasets for the community!
Normalized, curated and enriched, these datasets were specifically designed for data science workloads.
You can find the published datasets in GPTilt's Hugging Face profile. All datasets are published in Parquet format or are available as Apache Iceberg tables.
Alternatively, if you are interested in running the data pipelines yourself, find instructions below.
If you're looking for the previous dataset tiering, please refer to the relevant doc.
The GPTilt Dataset Catalogue splits datasets into two tiers:
- Clean: these datasets are true to the raw data, with some additional transformations that improve coherence (e.g. using `matchId` instead of `gameId`), increase usability (e.g. the addition of inventory data), reduce scope (fields that aren't particularly relevant are removed), and denormalize the underlying data (e.g. splitting kill events into kill and assist events).
- Curated: these datasets, on the other hand, apply opinionated transformations with a specific goal in mind. The data remains accurate, but the dataset structure is markedly different: it may be aggregated at a different level of granularity, or include additional columns derived from complex rules or one-hot encodings.
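To make the denormalization example concrete, here is a minimal sketch of splitting one kill event into a kill row plus one assist row per assisting player. The field names are assumptions modeled on the Riot timeline payload, not the exact GPTilt schema:

```python
# Hypothetical sketch of denormalizing a kill event; field names are assumptions.
def split_kill_event(event: dict) -> list[dict]:
    """Turn one raw kill event into a kill row and one assist row per assister."""
    rows = [{
        "type": "CHAMPION_KILL",
        "participantId": event["killerId"],
        "victimId": event["victimId"],
    }]
    for assister in event.get("assistingParticipantIds", []):
        rows.append({
            "type": "CHAMPION_KILL_ASSIST",
            "participantId": assister,
            "victimId": event["victimId"],
        })
    return rows

raw = {"killerId": 3, "victimId": 7, "assistingParticipantIds": [1, 4]}
print(split_kill_event(raw))  # one kill row followed by two assist rows
```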
- A table or storage directory's fully qualified name is `root.dataset.schema.table`.
- The `root` is the environment (e.g. `dev`, `prod`).
- The `dataset` is the highest logical aggregator - typically the platform from which the data was originally collected (e.g. `riot_api`, `youtube`).
- The `schema` specifies the degree of quality of the data (e.g. `raw`, `clean`, `curated`).
- The `table` specifies a relation (e.g. `league_entries`, which contains data collected from the Riot API on player entries in a league).
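As a sketch, the naming convention above can be captured in a few lines (a hypothetical helper for illustration, not part of the repository):

```python
# Hypothetical helper: parse a fully qualified name into its four components.
from typing import NamedTuple

class FQName(NamedTuple):
    root: str     # environment, e.g. "dev" or "prod"
    dataset: str  # source platform, e.g. "riot_api"
    schema: str   # quality tier, e.g. "raw", "clean", "curated"
    table: str    # relation, e.g. "league_entries"

def parse_fqn(name: str) -> FQName:
    parts = name.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected root.dataset.schema.table, got {name!r}")
    return FQName(*parts)

print(parse_fqn("prod.riot_api.clean.league_entries"))
```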
If you'd rather do things yourself, the easiest way is to clone the repository, open a terminal inside the newly created directory, and run `make init`. This will create the Python virtual environment and install the project dependencies. You can then activate the virtual environment with `source venv/bin/activate` on Linux, or `source venv/Scripts/activate` on Windows.
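On Linux, the steps above boil down to the following (`<repository-url>` and `<repository-directory>` are placeholders for the actual GitHub repository):

```shell
# Sketch of the setup steps; placeholders stand in for the real repository.
git clone <repository-url>
cd <repository-directory>
make init                   # creates the virtualenv and installs dependencies
source venv/bin/activate    # on Windows: source venv/Scripts/activate
```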
Then, you'll be able to use the Dagster CLI to boot the orchestrator up:
`dagster dev`

The ingestion, processing, and curation of the GPTilt Dataset Catalogue is orchestrated with Dagster. We highly recommend you get acquainted with Dagster before moving forward.
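For orientation, a minimal Dagster asset looks roughly like this (a hypothetical sketch, not code from the GPTilt repository; the asset name and return value are placeholders):

```python
# Hypothetical minimal Dagster asset, for orientation only.
import dagster as dg

@dg.asset
def league_entries() -> list[dict]:
    """Placeholder asset; the real pipelines fetch data from the Riot API."""
    return [{"summonerId": "placeholder", "tier": "GOLD"}]

# `dagster dev` discovers assets through a Definitions object.
defs = dg.Definitions(assets=[league_entries])
```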
Most pipelines require a number of secrets that should be available at runtime as environment variables. If you include them in a `.env` file in the repository root, Dagster will automatically load them before executing the pipelines.
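For example, a `.env` file might look like this (the variable names are illustrative assumptions; check each pipeline for the secrets it actually reads):

```
# Illustrative .env; variable names are assumptions, not the repo's actual keys.
RIOT_API_KEY=your-riot-api-key
HF_TOKEN=your-hugging-face-token
```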
⚠️ Not all pipelines are public, but the repository utilities (e.g. `make init`) take that into account - so everything should work fine!
- `ds-chatbot`: One of the very few private repositories of GPTilt! 🤫
- `ds-common`: Common utilities for the repository. 🛠️
- `ds-documents`: For building an enriched document store. 📚
- `ds-hugging-face`: For publishing datasets to 🤗 Hugging Face.
- `ds-riot-api`: For building assets with provenance from the Riot Games API. 🌐
- `ds-runtime`: Runtime variables and environment context. ⏲️
- `ds-scribe`: For transcribing audio to text. 🖊️
- `ds-storage`: Custom storage interface. Maintenance limited to usage. 🏭
- `ds-tables`: TBD.
Contributions are welcome! If you have ideas for new utilities, find bugs, or want to improve existing code, please feel free to open an issue or submit a pull request on the GitHub repository.
All datasets are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
GPTilt isn't endorsed by Riot Games and doesn't reflect the views or opinions of Riot Games or anyone officially involved in producing or managing Riot Games properties. Riot Games and all associated properties are trademarks or registered trademarks of Riot Games, Inc.