[ZEPPELIN-1587] (WIP) Add impersonation routine in SparkInterpreter for current user #1566

Closed
khalidhuseynov wants to merge 4 commits into apache:master from khalidhuseynov:feat/spark-hdfs-impersonation

Conversation

@khalidhuseynov
Member

@khalidhuseynov khalidhuseynov commented Oct 28, 2016

What is this PR for?

This adds an impersonation routine to SparkInterpreter, so that any communication with Hadoop HDFS is done with the current user's credentials.

What type of PR is it?

Improvement

Todos

  • add privilege mode
  • add tests

What is the Jira issue?

ZEPPELIN-1587

How should this be tested?

Any HDFS file write should be performed as your logged-in username. For example, I used the following code:

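// Read the input file from HDFS and cache it
val file = sc.textFile("hdfs:/inputFilePath")
file.cache()
// Count the occurrences of each space-separated word
val wordCount = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Write the result back to HDFS; the owner of the output shows which user did the write
wordCount.saveAsTextFile("hdfs:/outputFolderPath")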

Screenshots (if appropriate)

[Screenshot: impersonate user]

Questions:

  • Do the license files need an update? no
  • Are there breaking changes for older versions? no
  • Does this need documentation? possibly

@zjffdu
Contributor

zjffdu commented Oct 30, 2016

I don't know how much this helps with impersonation. At least it doesn't affect the executors. Do you see any impact on the driver side?

@khalidhuseynov khalidhuseynov force-pushed the feat/spark-hdfs-impersonation branch from c496c97 to 8bbb64f on November 22, 2016 03:50
@khalidhuseynov
Member Author

@zjffdu I updated the description and pushed some changes. As you can see from the screenshot, it uses the current username when writing to HDFS. In the example above, although the out_user1 folder is owned by user1, the files inside it (e.g. part0000) have a different owner, khalidhuseynov. So my guess is that it affects only the driver and not the executors, and the executors are the ones writing part0000. Let me know what you think.
Feedback from @Leemoonsoo @prabhjyotsingh @felixcheung on this one would be appreciated as well.

@Leemoonsoo
Member

Would using --proxy-user instead of doAs work? https://issues.apache.org/jira/browse/ZEPPELIN-1730

@khalidhuseynov
Member Author

@Leemoonsoo Yes, that's possible; in fact, I originally implemented it using proxy-user. The only catch is that in that case your HDFS, for example, has to be configured with

<property>
  <name>hadoop.proxyuser.<username>.hosts</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.<username>.groups</name>
  <value>*</value>
</property>

in hdfs-site.xml; it won't run without that configuration, which makes sense.
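
To make the `<username>` placeholder concrete: in Hadoop's proxy-user settings it stands for the account that performs the impersonation (here, presumably the account that launches the Spark interpreter), not the notebook user being impersonated. Below is a minimal Java sketch that renders the two properties for a given account. The class name, method names, and the `zeppelin` account are hypothetical, and the `*` values are as permissive as in the snippet above.

```java
import java.util.Arrays;
import java.util.List;

class ProxyUserConfig {
    // Renders a single Hadoop-style <property> element as text.
    static String property(String name, String value) {
        return "<property>\n"
             + "  <name>" + name + "</name>\n"
             + "  <value>" + value + "</value>\n"
             + "</property>";
    }

    // superUser is the impersonating account (hypothetical example: "zeppelin").
    // hosts and groups restrict where it may impersonate from and whom it may impersonate.
    static List<String> proxyUserProperties(String superUser, String hosts, String groups) {
        return Arrays.asList(
            property("hadoop.proxyuser." + superUser + ".hosts", hosts),
            property("hadoop.proxyuser." + superUser + ".groups", groups));
    }

    public static void main(String[] args) {
        for (String p : proxyUserProperties("zeppelin", "*", "*")) {
            System.out.println(p);
        }
    }
}
```

In a real deployment the wildcard values would likely be narrowed to specific hosts and groups.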

@zjffdu
Contributor

zjffdu commented Nov 30, 2016

@khalidhuseynov I think proxy-user is the correct way to implement impersonation for Spark. Your approach in this PR only applies to the driver side. If you look at the RM UI, the YARN app owner is still the original user. The reason the owner of the file is user1 in your screenshot is that the HDFS commit operation is invoked on the driver side. And the change in hdfs-site.xml is necessary for impersonation.

@khalidhuseynov
Member Author

@zjffdu Makes sense. Then I'll change it back to use proxy-user behind an additional flag, so as not to affect first-time users who aren't using impersonation.

@astroshim
Contributor

@khalidhuseynov How about using System.setProperty("HADOOP_USER_NAME", getUserName()); instead of the proxy-user setting?
Could you review astroshim#15?
It might work as you expect.
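
For context, a hedged sketch of why the system property can take effect: with simple (non-Kerberos) authentication, Hadoop's login looks for a `HADOOP_USER_NAME` environment variable, then a system property of the same name, and only then falls back to the OS user. The class below merely mimics that lookup order in plain Java; it is not Hadoop code, and `resolveUser` and its parameters are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.Properties;

class HadoopUserLookup {
    static final String KEY = "HADOOP_USER_NAME";

    // Mimics the lookup order: environment variable, then system property, then OS user.
    static String resolveUser(Map<String, String> env, Properties sysProps, String osUser) {
        String user = env.get(KEY);
        if (user == null) {
            user = sysProps.getProperty(KEY);
        }
        return user != null ? user : osUser;
    }

    public static void main(String[] args) {
        // Equivalent in spirit to the System.setProperty call suggested above.
        System.setProperty(KEY, "user1");
        String effective = resolveUser(
            System.getenv(), System.getProperties(), System.getProperty("user.name"));
        System.out.println("effective user: " + effective);
    }
}
```

Since the value is read when the login user is first resolved, the property presumably has to be set before any Hadoop call in that JVM, and it has no effect on a Kerberos-secured cluster.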

@khalidhuseynov
Member Author

@astroshim Thanks for the help. I'll test with proxy-user first since it's the most recommended way, and possibly we can use your configuration there.

@khalidhuseynov khalidhuseynov force-pushed the feat/spark-hdfs-impersonation branch from fc2ee6a to 4547781 on December 6, 2016 06:56
@khalidhuseynov
Member Author

khalidhuseynov commented Dec 13, 2016

So I've done some research: the --proxy-user argument is normally used with either spark-submit or spark-shell, and each of them creates a new Spark context. The problem is that in shared and scoped modes Zeppelin calls spark-submit only once, when initializing the interpreter, and afterwards a single Spark context is used. That makes impersonation through --proxy-user complicated for the Spark interpreter in shared and scoped modes. For isolated mode it's easier, since we can pass the user with SparkConf for each Spark context on initialization. Any suggestions or opinions are welcome.

And thanks @astroshim for the PR and help; that would work in isolated mode.
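
The launch-time constraint described above can be sketched as a simple decision: `--proxy-user` is an argument to `spark-submit`, so it can only be chosen when the interpreter process is started, and a process shared by several users has no single user to pass. The Java below is an illustrative sketch under that assumption, not Zeppelin's actual launcher; `InterpreterMode`, `buildSubmitArgs`, and the `impersonate` flag are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class SparkSubmitArgs {
    enum InterpreterMode { SHARED, SCOPED, ISOLATED }

    // Builds the leading spark-submit arguments for one interpreter process.
    // --proxy-user is added only when the process serves exactly one user
    // (isolated mode) and impersonation has been switched on explicitly.
    static List<String> buildSubmitArgs(InterpreterMode mode, boolean impersonate, String loginUser) {
        List<String> args = new ArrayList<>();
        args.add("spark-submit");
        boolean hasUser = loginUser != null && !loginUser.isEmpty();
        if (impersonate && mode == InterpreterMode.ISOLATED && hasUser) {
            args.add("--proxy-user");
            args.add(loginUser);
        }
        return args;
    }

    public static void main(String[] unused) {
        // prints "spark-submit --proxy-user user1"
        System.out.println(String.join(" ", buildSubmitArgs(InterpreterMode.ISOLATED, true, "user1")));
        // prints "spark-submit"
        System.out.println(String.join(" ", buildSubmitArgs(InterpreterMode.SHARED, true, "user1")));
    }
}
```

Keeping the flag off by default matches the intent stated earlier in the thread of not affecting users who do not use impersonation.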

@zjffdu
Contributor

zjffdu commented Dec 13, 2016

That's correct, impersonation for the Spark interpreter can only be applied in isolated mode. This is due to Spark's capabilities.

@astroshim
Contributor

I agree with @khalidhuseynov's opinion.
I kept wondering whether impersonation is possible in shared or scoped mode.
My PR was created just for testing and has only been tested in 'yarn' mode, so @khalidhuseynov please treat it only as a reference.

asfgit pushed a commit that referenced this pull request Jan 12, 2017
…tion

### What is this PR for?
This adds Spark impersonation using the `--proxy-user` option. Note that it also enables Spark impersonation without requiring the logged-in user to be a system user with configured SSH.

### What type of PR is it?
Improvement

### Todos
* [x] - add `--proxy-user`
* [x] - try on standalone spark 1.6.2
* [x] - try on yarn-client mode spark 2.0.1

### What is the Jira issue?
Directly solves [ZEPPELIN-1730](https://issues.apache.org/jira/browse/ZEPPELIN-1730) and also solves [ZEPPELIN-1587](https://issues.apache.org/jira/browse/ZEPPELIN-1587) according to the discussion in #1566, since using `--proxy-user` in `spark-submit` is the preferable method.

### How should this be tested?
1. switch your spark cluster to `per user` and `isolated` mode
2. set the `user impersonation` flag
3. run some job using that Spark interpreter
4. the Spark context should be created with the currently logged-in user's credentials on behalf of the system user

### Screenshots (if appropriate)
standalone
![spark_sc_impersonation](https://cloud.githubusercontent.com/assets/1642088/21639292/24240286-d224-11e6-8099-9bc74a06f0c2.gif)

yarn-client
<img width="997" alt="screen shot 2017-01-04 at 10 00 13 am" src="https://cloud.githubusercontent.com/assets/1642088/21653117/75410fde-d264-11e6-886f-11d8b5dbd29e.png">

### Questions:
* Do the license files need an update? no
* Are there breaking changes for older versions? no
* Does this need documentation? yes

Author: Khalid Huseynov <khalidhnv@gmail.com>

Closes #1840 from khalidhuseynov/feat/spark-proxy-user and squashes the following commits:

e4251de [Khalid Huseynov] update doc with env var
dc61cae [Khalid Huseynov] check for env spark_proxy in interpreter.sh
8b66740 [Khalid Huseynov] add spark_proxy_user to env.sh
892b7e4 [Khalid Huseynov] add note in docs
4c3dba9 [Khalid Huseynov] add --proxy-user option for spark
asfgit pushed a commit that referenced this pull request Jan 12, 2017

(cherry picked from commit 5e0aacf)
Signed-off-by: Jongyoul Lee <jongyoul@apache.org>
@khalidhuseynov khalidhuseynov deleted the feat/spark-hdfs-impersonation branch January 20, 2017 19:38
4 participants