[ZEPPELIN-1587] (WIP) Add impersonation routine in SparkInterpreter for current user by khalidhuseynov · Pull Request #1566 · apache/zeppelin

khalidhuseynov · 2016-10-28T09:14:50Z

What is this PR for?

This adds an impersonation routine to SparkInterpreter, so that any communication with Hadoop HDFS is done with the current user's credentials.

What type of PR is it?

Improvement

Todos

- add privilege mode
- add tests

What is the Jira issue?

ZEPPELIN-1587

How should this be tested?

Any HDFS file write should be performed under your logged-in username. For example, I used the following code:

val file = sc.textFile("hdfs:/inputFilePath")
file.cache()
val wordCount = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCount.saveAsTextFile("hdfs:/outputFolderPath")

Screenshots (if appropriate)

Questions:

Do the license files need an update? no
Are there breaking changes for older versions? no
Does this need documentation? possibly

zjffdu · 2016-10-30T08:00:49Z

I don't know how much this helps with impersonation. At the very least, it doesn't affect the executors. Do you see any impact on the driver side?

khalidhuseynov · 2016-11-22T08:59:56Z

@zjffdu I updated the description and pushed some changes. As you can see from the screenshot, it uses the current username when writing to HDFS. In the example above, although the out_user1 folder is owned by user1, the files inside it (e.g. part0000) still have a different owner, khalidhuseynov. So my guess is that it affects only the driver and not the executors, and the executors are the ones writing part0000. Let me know what you think.
also feedback from @Leemoonsoo @prabhjyotsingh @felixcheung would be appreciated on this one as well

Leemoonsoo · 2016-11-29T23:18:32Z

Would using --proxy-user instead of doAs work? https://issues.apache.org/jira/browse/ZEPPELIN-1730

khalidhuseynov · 2016-11-30T02:56:26Z

@Leemoonsoo yes, that's possible; in fact, I originally implemented it using proxy-user. The only catch is that in that case your HDFS, for example, has to be configured with

<property> 
<name>hadoop.proxyuser.<username>.hosts</name> 
<value>*</value> 
</property> 
 
<property> 
<name>hadoop.proxyuser.<username>.groups</name> 
<value>*</value> 
</property>

in hdfs-site.xml; it won't run without that configuration, which makes sense.

zjffdu · 2016-11-30T03:10:04Z

@khalidhuseynov I think proxy-user is the correct way to implement impersonation for Spark. Your approach in this PR only applies to the driver side. If you look at the RM UI, the owner of the YARN app is still the original user. The reason the file owner is user1 in your screenshot is that the HDFS commit operation is invoked on the driver side. And the change in hdfs-site.xml is necessary for impersonation.

khalidhuseynov · 2016-11-30T03:19:04Z

@zjffdu makes sense. Then I'll change it back to use proxy-user, with an additional flag so that it doesn't affect first-time users who are not using impersonation.
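A minimal sketch of the "proxy-user behind a flag" idea, assuming the launch command is assembled as a list of arguments. `SparkSubmitArgs`, `build`, and the `impersonate` parameter are hypothetical names used only for illustration and are not taken from Zeppelin; only the `--proxy-user` option of `spark-submit` comes from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Zeppelin: assembles spark-submit arguments. */
final class SparkSubmitArgs {
    private SparkSubmitArgs() {}

    /**
     * @param user        logged-in notebook user to impersonate (assumed input)
     * @param impersonate opt-in flag; when false the command is left unchanged
     */
    static List<String> build(String user, boolean impersonate) {
        List<String> args = new ArrayList<>();
        args.add("spark-submit");
        // Add --proxy-user only on explicit opt-in and with a usable user name,
        // so installations that do not use impersonation see no change.
        if (impersonate && user != null && !user.isEmpty()) {
            args.add("--proxy-user");
            args.add(user);
        }
        return args;
    }
}
```

With the flag off, `SparkSubmitArgs.build("user1", false)` returns only `spark-submit`; with it on, the list is `spark-submit`, `--proxy-user`, `user1`.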

astroshim · 2016-12-01T03:28:58Z

@khalidhuseynov How about using System.setProperty("HADOOP_USER_NAME", getUserName()); instead of proxy-user setting?
Could you review astroshim#15?
It might work as you expected.

khalidhuseynov · 2016-12-05T08:11:55Z

@astroshim thanks for the help. I'll test with proxy-user first, since it's the most recommended way, and possibly we can use your configuration there.

khalidhuseynov · 2016-12-13T12:01:51Z

So I've done some research: the --proxy-user argument is normally used with either spark-submit or spark-shell, and each of them creates a new Spark context. The problem is that in shared and scoped modes Zeppelin calls spark-submit only once, when initializing the interpreter, and afterwards a single Spark context is used. This makes impersonation through --proxy-user complicated for the Spark interpreter in shared and scoped modes. For isolated mode it's easier, since we can pass the user with SparkConf for each Spark context on initialization. Any suggestions or opinions are welcome.

and thanks @astroshim for PR and help, that would work in isolated mode.

zjffdu · 2016-12-13T12:15:39Z

That's correct, impersonation for the Spark interpreter can only be applied in isolated mode. This is due to the capabilities of Spark.
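The reasoning in the last two comments can be sketched as a small decision function. This is only an illustrative model under stated assumptions: `BindingMode`, `ImpersonationSupport`, and `canUseProxyUser` are hypothetical names, not Zeppelin types, and the `perUser` parameter assumes the interpreter is instantiated per user, as in the "per user and isolated" setup mentioned for the follow-up PR.

```java
/** Hypothetical model of the interpreter modes discussed above; not Zeppelin's own types. */
enum BindingMode { SHARED, SCOPED, ISOLATED }

final class ImpersonationSupport {
    private ImpersonationSupport() {}

    /**
     * --proxy-user is fixed at the moment spark-submit launches the process.
     * In shared and scoped modes that launch happens once and the resulting
     * Spark context serves every user, so no single user can be impersonated.
     * Only a process started separately for each user can carry that user's name.
     */
    static boolean canUseProxyUser(BindingMode mode, boolean perUser) {
        return mode == BindingMode.ISOLATED && perUser;
    }
}
```

Under this model, `canUseProxyUser(BindingMode.ISOLATED, true)` is the only combination that returns `true`.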

astroshim · 2016-12-13T12:58:59Z

I agree with @khalidhuseynov's opinion.
I kept wondering whether impersonation is possible in shared or scoped mode.
My PR was created just for testing and has only been tested in 'yarn' mode, so @khalidhuseynov please use it only as a reference.

…tion

### What is this PR for?
This is to add spark impersonation using --proxy-user option. note that it enables also to use spark impersonation without having logged user as system user with configured ssh.

### What type of PR is it?
Improvement

### Todos
* [x] - add `--proxy-user`
* [x] - try on standalone spark 1.6.2
* [x] - try on yarn-client mode spark 2.0.1

### What is the Jira issue?
Directly solves [ZEPPELIN-1730](https://issues.apache.org/jira/browse/ZEPPELIN-1730) and also solves [ZEPPELIN-1587](https://issues.apache.org/jira/browse/ZEPPELIN-1587) according to discussion in #1566 since using `--proxy-user` in `spark-submit` is preferable method.

### How should this be tested?
1. switch your spark cluster to `per user` and `isolated` mode
2. set up `user impersonation` flag
3. run some job using that spark interpreter
4. spark context should be created with currently logged in user credentials on behalf of system user

### Screenshots (if appropriate)
standalone
![spark_sc_impersonation](https://cloud.githubusercontent.com/assets/1642088/21639292/24240286-d224-11e6-8099-9bc74a06f0c2.gif)
yarn-client
<img width="997" alt="screen shot 2017-01-04 at 10 00 13 am" src="https://cloud.githubusercontent.com/assets/1642088/21653117/75410fde-d264-11e6-886f-11d8b5dbd29e.png">

### Questions:
* Does the licenses files need update? no
* Is there breaking changes for older versions? no
* Does this needs documentation? yes

Author: Khalid Huseynov <khalidhnv@gmail.com>

Closes #1840 from khalidhuseynov/feat/spark-proxy-user and squashes the following commits:

e4251de [Khalid Huseynov] update doc with env var
dc61cae [Khalid Huseynov] check for env spark_proxy in interpreter.sh
8b66740 [Khalid Huseynov] add spark_proxy_user to env.sh
892b7e4 [Khalid Huseynov] add note in docs
4c3dba9 [Khalid Huseynov] add --proxy-user option for spark

…tion (cherry picked from commit 5e0aacf) Signed-off-by: Jongyoul Lee <jongyoul@apache.org>

khalidhuseynov force-pushed the feat/spark-hdfs-impersonation branch from c496c97 to 8bbb64f Compare November 22, 2016 03:50

khalidhuseynov added 3 commits December 5, 2016 17:10

add remote user priviliged exec

4f22c63

transfer credentials

e5c51b7

fix checkstyle

4547781

khalidhuseynov force-pushed the feat/spark-hdfs-impersonation branch from fc2ee6a to 4547781 Compare December 6, 2016 06:56

remote user -> proxy user

82545d1

khalidhuseynov mentioned this pull request Jan 4, 2017

[ZEPPELIN-1730, 1587] add spark impersonation through --proxy-user option #1840

Closed

3 tasks

khalidhuseynov closed this Jan 8, 2017

khalidhuseynov deleted the feat/spark-hdfs-impersonation branch January 20, 2017 19:38

khalidhuseynov wants to merge 4 commits into apache:master from khalidhuseynov:feat/spark-hdfs-impersonation

4 participants
