SPARK-2201 Improve FlumeInputDStream's stability and make it scalable by joyyoj · Pull Request #1310 · apache/spark

joyyoj · 2014-07-06T14:09:54Z

No description provided.

… read properly

AmplabJenkins · 2014-07-06T14:10:59Z

Can one of the admins verify this patch?

tdas · 2014-07-30T19:02:58Z

Hey @joyyoj
This is a very interesting patch, and can be very useful! But its a little hard to understand the architecture from the code. Could you provide us with a simple design doc that explains whats the architecture, and how each class and architectural components are related?

PS: Apologies for the not having commented on this earlier. Fell through the cracks I guess.

joyyoj · 2014-07-31T03:12:05Z

@tdas, Thanks for noticing the PR. I’m pleased to share my design idea. I'll update it this weekend.

tdas · 2014-07-31T08:22:37Z

@harishreedharan Can you take a look? This looks really interesting for Flume.

harishreedharan · 2014-07-31T17:03:51Z

Hmm, I don't see any code. Shows +0, -0 lines. Something went wrong in the last merge?

tdas · 2014-07-31T17:57:42Z

@joyyoj Something went wrong in your last merge. Its an empty patch now!

joyyoj · 2014-08-01T05:58:35Z

Sorry, I'll soon send a PR.
The problem of the original implementation is that the config(host:port) is static and allows only one host:port. Once host or port changed, the flume agent should be restarted to reload the conf.
To solve it, one solution is to set a virtual address instead of a real address in the flume conf. Meanwhile, a address router was introduced that can tell us all the real addresses are bound to a virtual address and notify such events that a real address is added to or removed from the virtual address.
I found the router can be easily implemented by the zookeeper. In such scenario:

A spark receiver selects a free port and creates a tmp node with the path /path/to/logicalhost/host:port to zookeeper when started.
If three receivers started, three nodes (host1:port1, host2:port2, host3:port3) will be created under /path/to/logicalhost;
On the side of flume agent, the flume sink gets the children nodes (host1:port1, host2:port2, host3:port3) from /path/to/logicalhost and buffers them into a ClientPool.
When append called, it selects a client from ClientPool in a round-robin manner and call client.append to send events.
If any receiver crashed/started, the tmp zk node will be removed/added, and then ClientPool will remove/add the client from the buffer since it watched those zk children events.
In my implementation:
LogicalHostRouter is the implementation of the address router. You know, the spark or flume should not know the existence of zk.
The ZkProxy is an encapsulation of the zk curator client.

harishreedharan · 2014-08-01T17:58:47Z

@joyyoj Thanks for the explanation. This makes quite a lot of sense. I recently added a new Dstream + an associated Flume sink to fix the issue of receivers being hard-coded on the Flume config. Basically solves the same issue, by telling the Spark receiver where the Flume agents are running. So even if the executors die, they can come back and simply poll the same Flume agents for data. In my experience, the hosts on which the agents are running rarely change - so this solution works nicely. PR #807 - let me know what you think.

joyyoj · 2014-08-03T17:01:45Z

@harishreedharan The time I am confronted with this problem, PR #807 is not merged to trunk. I think PR #807 is another solution to solve the same problem and quiet good.
I think it is still valuable to solve it by introducing a host router level since this problem seems a common issue, and a host router level can be reused.
PR #1755 resubmitted
How do you think of it?

joyyoj · 2014-08-03T17:08:40Z

To PR #807, if some flume agent crashed and restarted from another host, spark should be restarted to reload conf ?

harishreedharan · 2014-08-03T17:28:38Z

@joyyoj I will take a look at it in the next couple days. As far as #807 is concerned - yes, if the flume agent's location changes, the config needs to change. In my experience (I work for a company that has a large number of Flume customers), Flume agents are usually deployed on specific nodes and if they crash - they are restarted on the same node - since Flume has no concept of workers (every agent is a worker), so that was not a concern in my design.

The ZK-based config seems interesting. I will take a look at it soon. Thanks!

SparkQA · 2014-09-05T23:45:39Z

Can one of the admins verify this patch?

harishreedharan · 2014-09-07T03:20:30Z

I still don't see any code. Did a merge fail somewhere?

JoshRosen · 2014-10-02T02:50:43Z

Hi @joyyoj,

Since this pull request doesn't show any code / changes, do you mind closing it? Feel free to update / re-open if you have code that you'd like us to review. Thanks!

AmplabJenkins · 2014-11-28T16:32:11Z

Can one of the admins verify this patch?

pwendell · 2014-11-28T17:39:26Z

Let's close this issue

[SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not…

f4660c5

… read properly

Merge remote-tracking branch 'remotes/apache/master'

3f4a602

Merge remote-tracking branch 'apache/master'

22e7821

asfgit closed this in 047ff57 Nov 29, 2014

Uh oh!

Conversation

joyyoj commented Jul 6, 2014

Uh oh!

AmplabJenkins commented Jul 6, 2014

Uh oh!

tdas commented Jul 30, 2014

Uh oh!

joyyoj commented Jul 31, 2014

Uh oh!

tdas commented Jul 31, 2014

Uh oh!

harishreedharan commented Jul 31, 2014

Uh oh!

tdas commented Jul 31, 2014

Uh oh!

joyyoj commented Aug 1, 2014

Uh oh!

harishreedharan commented Aug 1, 2014

Uh oh!

joyyoj commented Aug 3, 2014

Uh oh!

joyyoj commented Aug 3, 2014

Uh oh!

harishreedharan commented Aug 3, 2014

Uh oh!

SparkQA commented Sep 5, 2014

Uh oh!

harishreedharan commented Sep 7, 2014

Uh oh!

JoshRosen commented Oct 2, 2014

Uh oh!

AmplabJenkins commented Nov 28, 2014

Uh oh!

pwendell commented Nov 28, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants