Currently, each DataNode is launched as a separate process/JVM, and we fool it into thinking it has all of its necessary blocks by creating the block files as 0-length files. It would be much more efficient to launch all of the DataNodes in the same JVM using MiniDFSCluster, and to use SimulatedFSDataset to store the block metadata in memory only, saving us from having to create millions of sparse files on disk.
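
To make the idea concrete, here is a minimal, self-contained sketch of what "block metadata only in memory" could look like. It is a toy model, not Hadoop's SimulatedFSDataset: the class InMemoryBlockStore and all of its methods are hypothetical names made up for illustration. It only shows that a node can report a block's existence and length from a map without creating any file on disk.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy illustration only. This is NOT Hadoop's SimulatedFSDataset;
 * the class and method names here are hypothetical.
 */
public class InMemoryBlockStore {
    // Maps block ID -> reported block length in bytes.
    // No block contents and no files are stored anywhere.
    private final Map<Long, Long> blockLengths = new HashMap<>();

    /** Registers a block by recording its ID and length in memory. */
    public void addBlock(long blockId, long numBytes) {
        if (numBytes < 0) {
            throw new IllegalArgumentException("Block length must be non-negative");
        }
        blockLengths.put(blockId, numBytes);
    }

    /** Returns true if the block has been registered. */
    public boolean hasBlock(long blockId) {
        return blockLengths.containsKey(blockId);
    }

    /** Returns the recorded length of a registered block. */
    public long getLength(long blockId) {
        Long length = blockLengths.get(blockId);
        if (length == null) {
            throw new IllegalArgumentException("Unknown block: " + blockId);
        }
        return length;
    }

    /** Returns how many blocks are currently registered. */
    public int blockCount() {
        return blockLengths.size();
    }

    public static void main(String[] args) {
        InMemoryBlockStore store = new InMemoryBlockStore();
        // Register many blocks without touching the file system.
        for (long id = 0; id < 100_000; id++) {
            store.addBlock(id, 128L * 1024 * 1024);
        }
        System.out.println("Registered blocks: " + store.blockCount());
    }
}
```

In the actual proposal, this role would be played by SimulatedFSDataset inside DataNodes started by MiniDFSCluster; the sketch above only illustrates why that avoids per-block files.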