How to get the data for pretraining stage?

Hello,

thank you for your great work on "Driving with LLMs".

I'm trying to reproduce the pre-training part, but I'm confused about the data you used.

My understanding is that each vector should have some QA pairs associated with it, and that there should be 100k QA pairs in total.

My questions are:

How did you obtain the 100k QA pairs related to the vectors?

What is the difference between the captioning data from LanGen and the 100k QA pairs?

For each vector, how many QA pairs did you generate?
 
Thank you for your time and for any clarification you can provide.


How to get the data for pretraining stage? #30
