[SPARK-3586][streaming] Support nested directories in Spark Streaming #2765
wangxiaojing wants to merge 25 commits into
|
Can one of the admins verify this patch? |
|
Hi @wangxiaojing, a small suggestion: why not make this improvement more flexible by adding a parameter that controls the search depth of directories? That would be more general than the current 1-depth search implementation. For example:

    class FileInputDStream[K: ClassTag, V: ClassTag, F <: NewInputFormat[K, V] : ClassTag](
        @transient ssc_ : StreamingContext,
        directory: String,
        filter: Path => Boolean = FileInputDStream.defaultFilter,
        depth: Int = 1,
        newFilesOnly: Boolean = true)

People can use this parameter to control the search depth; the default of 1 keeps the same semantics as the current code. Besides, some whitespace-related code style should be changed to align with the Scala style. |
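The depth parameter suggested above could be sketched roughly as follows. This is only an illustration, not the code in this PR: it uses java.nio instead of the Hadoop FileSystem API so that it runs standalone, the names DepthLimitedListing and listFiles are hypothetical, and it assumes that depth counts subdirectory levels below the monitored directory (0 means only the directory itself), which the thread does not define precisely. It also assumes Scala 2.13 for scala.jdk.CollectionConverters.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical helper, not part of Spark.
object DepthLimitedListing {
  /** Lists regular files under `root`, descending at most `depth`
    * subdirectory levels. Assumption: depth = 0 lists only the files
    * directly inside `root`; depth = 1 also includes its immediate
    * subdirectories, and so on. */
  def listFiles(root: Path, depth: Int): Seq[Path] = {
    if (depth < 0 || !Files.isDirectory(root)) {
      Seq.empty
    } else {
      val stream = Files.list(root)
      try {
        // Materialize the listing before recursing so the stream can be closed safely.
        val entries = stream.iterator().asScala.toList
        val (dirs, others) = entries.partition(p => Files.isDirectory(p))
        val files = others.filter(p => Files.isRegularFile(p))
        // Each recursive call spends one level of the remaining depth budget.
        files ++ dirs.flatMap(d => listFiles(d, depth - 1))
      } finally {
        stream.close()
      }
    }
  }
}
```

Calling DepthLimitedListing.listFiles(dir, 1) on a directory would then return its own files plus those of its immediate subdirectories, while deeper levels are skipped.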
|
Hi @jerryshao, I am changing the code to use this parameter to control the search depth. However, if the depth is greater than 1, filtering by the ignore time is no longer reasonable: when a second-level subdirectory gets a new file, the modification time of the first-level subdirectory does not change. For example, after a file is created in /tmp/spark1/spark2, the parent directory still shows: 2014-10-16 19:17 /tmp/spark1. If the ignore time is used for filtering, the first-level subdirectory is always ignored. Can you give me some advice? |
|
Can we just check the modification time of each file, rather than of the directory, to filter out unqualified files? I'm not sure about this. cc @tdas, would you mind taking a look at this? |
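A minimal sketch of the idea of checking file timestamps instead of directory timestamps is given here. It is an assumption-laden illustration rather than the PR's implementation: it uses java.nio (not the Hadoop FileSystem API), the names NewFileFilter, newFiles, and ignoreTimeMs are hypothetical, depth is assumed to count subdirectory levels below the root, and Scala 2.13 is assumed for scala.jdk.CollectionConverters.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical helper, not part of Spark.
object NewFileFilter {
  /** Returns regular files at most `depth` subdirectory levels below `root`
    * whose own modification time is later than `ignoreTimeMs`.
    * Directory timestamps are never consulted, so a new file in a nested
    * directory is found even if its ancestors' timestamps did not change. */
  def newFiles(root: Path, depth: Int, ignoreTimeMs: Long): Seq[Path] = {
    require(depth >= 0, "depth must be non-negative")
    // Files.walk counts the root itself as level 0, so files directly
    // inside root sit at level 1; hence the + 1.
    val stream = Files.walk(root, depth + 1)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .filter(p => Files.getLastModifiedTime(p).toMillis > ignoreTimeMs)
        .toList
    } finally {
      stream.close()
    }
  }
}
```

The trade-off is that every directory within the depth limit is listed on each scan, since no directory can be skipped based on its own timestamp.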
|
@jerryshao @tdas First, list all the directories down to the given depth, then filter the directories by comparing their modification time with the ignore time. Is this method optimal? Thanks. |
- Add space after ","
- Remove space before ":"
- Add space after ":"
- Add space after "="
|
This feature would definitely be helpful. Thanks to @wangxiaojing and whoever continues to work on this PR! |
|
@wangxiaojing could you update this PR? It conflicts with master. |
Could you change System.currentTimeMillis to clock.getTimeMillis()?
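The point of this request is presumably testability: code that reads the time through an injected clock can be driven deterministically in tests. A rough sketch of the pattern follows; the Clock, SystemClock, ManualClock, and IgnoreWindow types below are minimal stand-ins written for this example (hypothetical, not Spark's own classes), and windowMs is an assumed parameter.

```scala
// Minimal stand-in for a clock abstraction; hypothetical, not Spark's class.
trait Clock {
  def getTimeMillis(): Long
}

// Production clock: delegates to the system time.
class SystemClock extends Clock {
  def getTimeMillis(): Long = System.currentTimeMillis()
}

// Test clock: time only moves when advance() is called.
class ManualClock(private var now: Long) extends Clock {
  def getTimeMillis(): Long = now
  def advance(ms: Long): Unit = { now += ms }
}

// Hypothetical consumer: decides whether a file is recent enough to process.
class IgnoreWindow(clock: Clock, windowMs: Long) {
  // Files modified at or before this instant are ignored.
  def threshold(): Long = clock.getTimeMillis() - windowMs
  def isNew(modTimeMs: Long): Boolean = modTimeMs > threshold()
}
```

With a ManualClock, a test can advance time explicitly and check that a file falls out of the window, which is not possible when the code calls System.currentTimeMillis directly.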
|
Hi @wangxiaojing it seems that #6588 is an updated version of this PR. Would you mind closing this patch since it no longer merges cleanly with master? |
|
@andrewor14 ok. |
For text files, there is the method streamingContext.textFileStream(dataDirectory).
This improvement adds support for subdirectories: Spark Streaming can monitor the subdirectories of dataDirectory and process any files created in them.
For example:
streamingContext.textFileStream("/test")
Given the following directory contents:
/test/file1
/test/file2
/test/dr/file1
if a new file "file2" is added to the directory "/test/dr/", Spark Streaming can process it.