Commit 5e69d08

add description of dataset document (#742)
1 parent c10c349 commit 5e69d08

2 files changed, +61 -1 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -160,7 +160,7 @@ Load and prepare data by running the following code:

 This dataset is created by public data collected by [crawler scripts](scripts/data_collector/), which have been released in
 the same repository.
-Users could create the same dataset with it.
+Users could create the same dataset with it. [Description of dataset](https://github.com/microsoft/qlib/tree/main/scripts/data_collector#description-of-dataset)

 *Please pay **ATTENTION** that the data is collected from [Yahoo Finance](https://finance.yahoo.com/lookup), and the data might not be perfect.
 We recommend users to prepare their own data if they have a high-quality dataset. For more information, users can refer to the [related document](https://qlib.readthedocs.io/en/latest/component/data.html#converting-csv-format-into-qlib-format)*.

scripts/data_collector/README.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# Data Collector

## Introduction

Scripts for data collection:

- yahoo: get *US/CN* stock data from *Yahoo Finance*
- fund: get fund data from *http://fund.eastmoney.com*
- cn_index: get *CN index* data (*CSI300*/*CSI100*) from *http://www.csindex.com.cn*
- us_index: get *US index* data (*SP500*/*NASDAQ100*/*DJIA*/*SP400*) from *https://en.wikipedia.org/wiki*
- contrib: scripts for some auxiliary functions

## Custom Data Collection

> Specific implementation reference: https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo

1. Create a dataset code directory in the current directory
2. Add `collector.py`
   - add collector class:
     ```python
     import sys
     from pathlib import Path

     # Put the directory that contains `data_collector` on sys.path
     CUR_DIR = Path(__file__).resolve().parent
     sys.path.append(str(CUR_DIR.parent.parent))
     from data_collector.base import BaseCollector, BaseNormalize, BaseRun

     class UserCollector(BaseCollector):
         ...
     ```
   - add normalize class:
     ```python
     class UserNormalize(BaseNormalize):
         ...
     ```
   - add `CLI` class:
     ```python
     class Run(BaseRun):
         ...
     ```
3. Add `README.md`
4. Add `requirements.txt`
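
The four steps above can be sketched as a small scaffolding script. This is only an illustration: `scaffold_collector`, `COLLECTOR_TEMPLATE`, and the `my_dataset` directory name are hypothetical and not part of qlib. The generated `collector.py` just repeats the class skeletons from this document, with every class body left as `...`.

```python
import tempfile
from pathlib import Path

# Template that repeats the three class skeletons from the steps above.
# The class bodies are intentionally left as `...` for the user to fill in.
COLLECTOR_TEMPLATE = '''\
import sys
from pathlib import Path

CUR_DIR = Path(__file__).resolve().parent
sys.path.append(str(CUR_DIR.parent.parent))
from data_collector.base import BaseCollector, BaseNormalize, BaseRun


class UserCollector(BaseCollector):
    ...


class UserNormalize(BaseNormalize):
    ...


class Run(BaseRun):
    ...
'''


def scaffold_collector(base_dir: Path, name: str) -> Path:
    """Create <base_dir>/<name>/ with the files from steps 1-4 (hypothetical helper)."""
    target = base_dir / name
    target.mkdir(parents=True, exist_ok=True)  # step 1: dataset code directory
    (target / "collector.py").write_text(COLLECTOR_TEMPLATE, encoding="utf-8")  # step 2
    (target / "README.md").write_text(f"# {name} collector\n", encoding="utf-8")  # step 3
    (target / "requirements.txt").write_text("", encoding="utf-8")  # step 4
    return target


if __name__ == "__main__":
    # Demonstrate in a throwaway directory; a real run would target scripts/data_collector.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_collector(Path(tmp), "my_dataset")
        print(sorted(p.name for p in created.iterdir()))
```

In practice `base_dir` would be the `scripts/data_collector` directory, so that `CUR_DIR.parent.parent` in the generated file resolves to the directory containing the `data_collector` package.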

## Description of dataset

|             | Basic data |
|-------------|------------|
| Features    | **Price/Volume**: <br>&nbsp;&nbsp; - `$close`/`$open`/`$low`/`$high`/`$volume`/`$change`/`$factor` |
| Calendar    | **\<freq>.txt**: <br>&nbsp;&nbsp; - day.txt<br>&nbsp;&nbsp; - 1min.txt |
| Instruments | **\<market>.txt**: <br>&nbsp;&nbsp; - required: **all.txt**; <br>&nbsp;&nbsp; - csi300.txt/csi500.txt/sp500.txt |

- `Features`: the data values must be **numeric**
  - if the prices are not **adjusted**, set **factor=1**
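
As a rough illustration of the two rules for `Features`, the sketch below coerces one record to numeric values and fills in `factor` for unadjusted prices. The helper name `prepare_row`, the `adjusted` flag, and the use of plain dictionaries are assumptions made for this example; the field names are the ones from the table with the leading `$` dropped.

```python
# Field names taken from the Features row of the table, without the `$` prefix.
FEATURE_FIELDS = ("close", "open", "low", "high", "volume", "change", "factor")


def prepare_row(row: dict, adjusted: bool) -> dict:
    """Return a copy of `row` with float values and a `factor` field.

    Hypothetical helper: when prices are not adjusted, factor is set to 1.
    """
    out = dict(row)
    if not adjusted:
        out["factor"] = 1.0  # unadjusted prices use factor=1
    for field in FEATURE_FIELDS:
        if field not in out:
            raise KeyError(f"missing feature field: {field}")
        out[field] = float(out[field])  # raises on non-numeric input
    return out


if __name__ == "__main__":
    raw = {"close": "10.5", "open": 10, "low": 9.8, "high": 10.9,
           "volume": 1000, "change": 0.01}
    print(prepare_row(raw, adjusted=False))
```

For adjusted data the caller is expected to supply `factor` explicitly; the sketch raises `KeyError` rather than guessing a value.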

### Data-dependent component

> For a component to run correctly, the data it depends on must be available

| Component      | Required data                                     |
|----------------|---------------------------------------------------|
| Data retrieval | Features, Calendar, Instruments                   |
| Backtest       | **Features[Price/Volume]**, Calendar, Instruments |
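
The dependency table can be read as a simple lookup. The sketch below is one way to express it, assuming a hypothetical `missing_data` helper and a plain set of available data kinds; it is not a qlib API.

```python
# Required data per component, copied from the table above.
REQUIRED_DATA = {
    "Data retrieval": {"Features", "Calendar", "Instruments"},
    # Backtest specifically needs the Price/Volume features.
    "Backtest": {"Features", "Calendar", "Instruments"},
}


def missing_data(component: str, available: set) -> set:
    """Return the data kinds `component` needs that are not in `available`."""
    return REQUIRED_DATA[component] - set(available)


if __name__ == "__main__":
    print(missing_data("Backtest", {"Features", "Calendar"}))
```

An empty result means the component has everything listed in the table.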
