Skip to content

HIVE-27317: Temporary (local) session files cleanup improvements#4293

Merged
veghlaci05 merged 1 commit into
apache:masterfrom
sercanCyberVision:HIVE-27317
Jun 7, 2023
Merged

HIVE-27317: Temporary (local) session files cleanup improvements#4293
veghlaci05 merged 1 commit into
apache:masterfrom
sercanCyberVision:HIVE-27317

Conversation

@sercanCyberVision
Copy link
Copy Markdown
Contributor

@sercanCyberVision sercanCyberVision commented May 4, 2023

What changes were proposed in this pull request?

When ClearDanglingScratchDir service identifies the dangling sessions to clean HDFS, we will be cleaning files/dirs in HiveConf.ConfVars.LOCALSCRATCHDIR (local FS) as well.

Why are the changes needed?

When Hive session is killed, no chance for shutdown hook to clean-up tmp files. This causes accumulation of tmp files/dirs in local FS as below;

> ll /tmp/user/97c4ef50-5e80-480e-a6f0-4f779050852b*
drwx------ 2 user user 4096 Oct 29 10:09 97c4ef50-5e80-480e-a6f0-4f779050852b
-rw------- 1 user user    0 Oct 29 10:09 97c4ef50-5e80-480e-a6f0-4f779050852b10571819313894728966.pipeout
-rw------- 1 user user    0 Oct 29 10:09 97c4ef50-5e80-480e-a6f0-4f779050852b16013956055489853961.pipeout
-rw------- 1 user user    0 Oct 29 10:09 97c4ef50-5e80-480e-a6f0-4f779050852b4383913570068173450.pipeout
-rw------- 1 user user    0 Oct 29 10:09 97c4ef50-5e80-480e-a6f0-4f779050852b889740171428672108.pipeout 

Does this PR introduce any user-facing change?

No. The same service will have additional functionality.

How was this patch tested?

Unit test.

@sercanCyberVision
Copy link
Copy Markdown
Contributor Author

@kgyrtkirk there is only one failed test which imho is not related with the patch, please see below;

The test is org.apache.hadoop.hive.cli.TestMiniTezCliDriver#testCliDriver[tez_union_with_udf]
And the reason why it failed;

Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: java.lang.OutOfMemoryError: GC overhead limit exceeded

@sercanCyberVision
Copy link
Copy Markdown
Contributor Author

@kgyrtkirk thank you for relaunching. This time all the tests are passed;

All tests are passing
Nice one! This run fixed 45 tests and now all 47675 tests for this pipeline are passing.

But init-metastore for init@oracle failed as below;

[2023-05-05T13:59:42.088Z] waiting for oracle to be available...
[2023-05-05T14:04:48.851Z] timeout reached before the port went into state "inuse"
script returned exit code 1

Comment thread ql/src/java/org/apache/hadoop/hive/ql/session/ClearDanglingScratchDir.java Outdated
Comment thread ql/src/java/org/apache/hadoop/hive/ql/session/ClearDanglingScratchDir.java Outdated
Copy link
Copy Markdown
Contributor

@veghlaci05 veghlaci05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested some test changes, the production code looks fine for me.

Comment thread ql/src/test/org/apache/hadoop/hive/ql/cleanup/TestCleanupService.java Outdated
Comment thread ql/src/test/org/apache/hadoop/hive/ql/cleanup/TestCleanupService.java Outdated
Copy link
Copy Markdown
Contributor

@veghlaci05 veghlaci05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending tests.

@sercanCyberVision
Copy link
Copy Markdown
Contributor Author

@veghlaci05 thank you so much for all the suggestions and corrections.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 7, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 4 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@veghlaci05 veghlaci05 merged commit f6d63ad into apache:master Jun 7, 2023
@veghlaci05
Copy link
Copy Markdown
Contributor

@sercanCyberVision congrats on your first Apache Hive contribution!

@sercanCyberVision sercanCyberVision deleted the HIVE-27317 branch June 7, 2023 12:53
@deniskuzZ
Copy link
Copy Markdown
Member

@veghlaci05 please revert !!! builds are taking 6hr now http://ci.hive.apache.org/job/hive-precommit/job/PR-4293/9/

@sercanCyberVision
Copy link
Copy Markdown
Contributor Author

sercanCyberVision commented Jun 8, 2023

Hi @deniskuzZ
I know I am new at this, and probably missing something, but as I see there is only one build which took 3 hours before this commit.

All other previous and recent builds took around 8.5 hours. Even in one case it is 15 hours;

http://ci.hive.apache.org/job/hive-precommit/job/PR-4383/2/
http://ci.hive.apache.org/job/hive-precommit/job/PR-4381/5/
http://ci.hive.apache.org/job/hive-precommit/job/PR-4376/3/
http://ci.hive.apache.org/job/hive-precommit/job/PR-4366/6/
http://ci.hive.apache.org/job/hive-precommit/job/master/1753/

@deniskuzZ
Copy link
Copy Markdown
Member

deniskuzZ commented Jun 9, 2023

Hi @deniskuzZ I know I am new at this, and probably missing something, but as I see there is only one build which took 3 hours before this commit.

All other previous and recent builds took around 8.5 hours. Even in one case it is 15 hours;

http://ci.hive.apache.org/job/hive-precommit/job/PR-4383/2/ http://ci.hive.apache.org/job/hive-precommit/job/PR-4381/5/ http://ci.hive.apache.org/job/hive-precommit/job/PR-4376/3/ http://ci.hive.apache.org/job/hive-precommit/job/PR-4366/6/ http://ci.hive.apache.org/job/hive-precommit/job/master/1753/

master is broken and some of the jobs are failing with:

java.lang.RuntimeException: The dir: /home/jenkins/agent/workspace/hive-precommit_master/itests/hive-unit/target/tmp/scratchdir on HDFS should be writable. Current permissions are: rwxr-xr-x
	at org.apache.hadoop.hive.ql.exec.Utilities.ensurePathIsWritable(Utilities.java:4955)
	at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:827)
	at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:768)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:694)

http://ci.hive.apache.org/job/hive-precommit/job/master/1758/testReport/junit/org.apache.hadoop.hive.ql.txn.compactor/TestCrudCompactorOnTez/Testing___split_22___PostProcess___testRebalanceCompactionWithParallelDeleteAsSecondPessimisticLock/

@sercanCyberVision
Copy link
Copy Markdown
Contributor Author

sercanCyberVision commented Jun 9, 2023

@deniskuzZ can you help fixing this? I think we can clean the scratch dir which is created for the unit test in this PR after test is completed;
https://github.com/sercanCyberVision/hive/blob/00c63e9645021923dbbef1d868873dbf5a9fc2f2/itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/session/TestClearDanglingScratchDir.java#L153

So, the failed tests will create their own scratchdir instead of dealing with the existing one;
https://github.com/sercanCyberVision/hive/blob/7c02f733d6251fec3a0fa9537e48817094204dad/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L4952-L4957

Asking your help because I am not able to reproduce the issue.

@deniskuzZ
Copy link
Copy Markdown
Member

deniskuzZ commented Jun 10, 2023

@deniskuzZ can you help fixing this? I think we can clean the scratch dir which is created for the unit test in this PR after test is completed; https://github.com/sercanCyberVision/hive/blob/00c63e9645021923dbbef1d868873dbf5a9fc2f2/itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/session/TestClearDanglingScratchDir.java#L153

So, the failed tests will create their own scratchdir instead of dealing with the existing one; https://github.com/sercanCyberVision/hive/blob/7c02f733d6251fec3a0fa9537e48817094204dad/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L4952-L4957

Asking your help because I am not able to reproduce the issue.

Looks like there was an issue with the agents, so job was running 2x longer.
We should use custom scratch dir in the unit tests. I think issue is that when you create a SCRATCHDIR it doesn't have write permissions.

Utilities.createDirsWithPermission(hiveConf, scratchDir, WRITE_ALL_PERM, true);

@sercanCyberVision
Copy link
Copy Markdown
Contributor Author

@deniskuzZ
Followed your suggestion and created both session scratch dirs with all the permissions, and checked them;

> ll ql/target/tmp
drwxrwxrwx  3 sercan sercan 4096 Jun 10 13:10 localscratchdir/
drwxrwxrwx  3 sercan sercan 4096 Jun 10 13:10 scratchdir/

The unit test itself is okay anyway, it passed;

[INFO] Results:
[INFO] 
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

But I am not sure how to proceed now :)
Should we revert this solution and create another PR? Or can you add attached patch to your open PR?
scratchDir_permissions.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants