[SPARK-33135][CORE] Use listLocatedStatus from FileSystem implementations #30019
```diff
@@ -207,18 +207,14 @@ private[spark] object HadoopFSUtils extends Logging {
     // Note that statuses only include FileStatus for the files and dirs directly under path,
     // and does not include anything else recursively.
     val statuses: Array[FileStatus] = try {
-      fs match {
-        // DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
-        // to retrieve the file status with the file block location. The reason to still fallback
-        // to listStatus is because the default implementation would potentially throw a
-        // FileNotFoundException which is better handled by doing the lookups manually below.
-        case (_: DistributedFileSystem | _: ViewFileSystem) if !ignoreLocality =>
-          val remoteIter = fs.listLocatedStatus(path)
-          new Iterator[LocatedFileStatus]() {
-            def next(): LocatedFileStatus = remoteIter.next
-            def hasNext(): Boolean = remoteIter.hasNext
-          }.toArray
-        case _ => fs.listStatus(path)
+      if (ignoreLocality) {
+        fs.listStatus(path)
+      } else {
+        val remoteIter = fs.listLocatedStatus(path)
+        new Iterator[LocatedFileStatus]() {
```
Contributor
Might be good to pull this out into something reusable for any
```diff
+          def next(): LocatedFileStatus = remoteIter.next
+          def hasNext(): Boolean = remoteIter.hasNext
+        }.toArray
```
Contributor
The more work you can do incrementally per entry in the remote iterator, the more of the latency of talking to the object stores can be hidden. See HADOOP-17074 and HADOOP-17023 for details; one of the PRs shows some numbers there. If the Spark API could return an iterator/yield and the processing of it used that, a lot of that listing cost could be absorbed entirely.
Member
Author
Yes, it would be lovely if we could get async listing here, but I think it requires much bigger surgery: up to the top level, Spark's RDD model currently requires all the input partitions to be ready before it can start processing (this is deeply embedded in its primitives such as map/reduce). We could perhaps add the async logic here in this class, but I think the "local" processing we're doing here is far cheaper than the remote listing and perhaps can't gain much from the change. We could wrap the iterator and make it look like a lazy array until certain info is needed, but again I think it won't go very far until we make extensive changes in the upper stack, like in
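To make the trade-off in this thread concrete, here is a minimal sketch of eager versus lazy consumption of a paged listing. `PagedListing` and `PagedListingExample` are hypothetical names, and the pages are an in-memory stand-in rather than any Hadoop API. The sketch does not show async prefetch; it only shows that materializing with `.toArray` forces every page up front, while a lazy consumer fetches pages as it needs them.

```scala
// Hypothetical in-memory stand-in for a paged remote listing; not a Hadoop API.
final class PagedListing(pages: Seq[Seq[String]]) extends Iterator[String] {
  var pagesFetched = 0 // number of simulated remote calls made so far
  private val remaining = pages.iterator
  private var current: Iterator[String] = Iterator.empty

  // Fetch the next page only when the current one is exhausted.
  private def advance(): Unit =
    while (!current.hasNext && remaining.hasNext) {
      current = remaining.next().iterator
      pagesFetched += 1
    }

  override def hasNext: Boolean = { advance(); current.hasNext }
  override def next(): String = { advance(); current.next() }
}

object PagedListingExample {
  // Illustrative file names only.
  val pages: Seq[Seq[String]] =
    Seq(Seq("a.parquet", "_SUCCESS"), Seq("b.parquet"), Seq("c.parquet"))

  // Eager: materialize the whole listing, as `.toArray` does in the diff.
  def eagerFetches(): Int = {
    val listing = new PagedListing(pages)
    listing.toArray
    listing.pagesFetched
  }

  // Lazy: filter entry by entry and stop after the first match.
  def lazyFetches(): Int = {
    val listing = new PagedListing(pages)
    listing.filter(_.endsWith(".parquet")).take(1).toList
    listing.pagesFetched
  }
}
```

In this toy setup the eager path touches all three pages, while the lazy path stops early. Whether Spark could benefit the same way depends on the upper layers consuming the iterator lazily, which is the larger change described above.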
```diff
       }
     } catch {
       // If we are listing a root path for SQL (e.g. a top level directory of a table), we need to
@@ -288,55 +284,7 @@ private[spark] object HadoopFSUtils extends Logging {
       if (filter != null) allFiles.filter(f => filter.accept(f.getPath)) else allFiles
     }

-    val missingFiles = mutable.ArrayBuffer.empty[String]
-    val filteredLeafStatuses = doFilter(allLeafStatuses)
-    val resolvedLeafStatuses = filteredLeafStatuses.flatMap {
-      case f: LocatedFileStatus =>
-        Some(f)
-
-      // NOTE:
-      //
-      // - Although S3/S3A/S3N file system can be quite slow for remote file metadata
-      //   operations, calling `getFileBlockLocations` does no harm here since these file system
-      //   implementations don't actually issue RPC for this method.
-      //
-      // - Here we are calling `getFileBlockLocations` in a sequential manner, but it should not
-      //   be a big deal since we always use to `parallelListLeafFiles` when the number of
-      //   paths exceeds threshold.
-      case f if !ignoreLocality =>
-        // The other constructor of LocatedFileStatus will call FileStatus.getPermission(),
-        // which is very slow on some file system (RawLocalFileSystem, which is launch a
-        // subprocess and parse the stdout).
-        try {
-          val locations = fs.getFileBlockLocations(f, 0, f.getLen).map { loc =>
-            // Store BlockLocation objects to consume less memory
-            if (loc.getClass == classOf[BlockLocation]) {
-              loc
-            } else {
-              new BlockLocation(loc.getNames, loc.getHosts, loc.getOffset, loc.getLength)
-            }
-          }
-          val lfs = new LocatedFileStatus(f.getLen, f.isDirectory, f.getReplication, f.getBlockSize,
-            f.getModificationTime, 0, null, null, null, null, f.getPath, locations)
-          if (f.isSymlink) {
-            lfs.setSymlink(f.getSymlink)
-          }
-          Some(lfs)
-        } catch {
-          case _: FileNotFoundException if ignoreMissingFiles =>
-            missingFiles += f.getPath.toString
-            None
-        }
-
-      case f => Some(f)
-    }
-
-    if (missingFiles.nonEmpty) {
-      logWarning(
-        s"the following files were missing during file scan:\n ${missingFiles.mkString("\n ")}")
-    }
-
-    resolvedLeafStatuses
+    doFilter(allLeafStatuses)
   }
   // scalastyle:on argcount
```
Switch to `listStatusIterator(path)` and, again, provide a `RemoteIterator`. This will give you paged downloads on HDFS and WebHDFS, async page prefetch on the latest S3A builds, and, at worst elsewhere, exactly the same performance as `listStatus`.
sg - I'll switch to `listStatusIterator` and create a wrapper class for the returned `RemoteIterator` in both cases.
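For reference, a wrapper of the kind discussed here could look roughly like the sketch below. This is an illustration, not the code that landed in the PR. `RemoteIteratorWrapper` and `RemoteIteratorWrapperExample` are hypothetical names, and the `RemoteIterator` trait is a local stand-in with the same two methods as `org.apache.hadoop.fs.RemoteIterator`, included only so the snippet compiles without Hadoop on the classpath.

```scala
// Local stand-in for org.apache.hadoop.fs.RemoteIterator (same two methods).
// With Hadoop on the classpath, import the real interface and drop this trait.
trait RemoteIterator[E] {
  def hasNext(): Boolean
  def next(): E
}

// Hypothetical helper: adapts a RemoteIterator to a Scala Iterator so callers
// can use map/filter/toArray instead of an anonymous Iterator at each call site.
final class RemoteIteratorWrapper[E](underlying: RemoteIterator[E]) extends Iterator[E] {
  override def hasNext: Boolean = underlying.hasNext()
  override def next(): E = underlying.next()
}

object RemoteIteratorWrapperExample {
  // Builds a fake remote iterator over an in-memory sequence, for illustration only.
  def fakeRemote[E](items: Seq[E]): RemoteIterator[E] = new RemoteIterator[E] {
    private val it = items.iterator
    def hasNext(): Boolean = it.hasNext
    def next(): E = it.next()
  }
}
```

With the real Hadoop types, the call sites in the diff would presumably reduce to something like `new RemoteIteratorWrapper(fs.listLocatedStatus(path)).toArray`, and likewise for `fs.listStatusIterator(path)`.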