Commit 33c3617 (1 parent: e6280fd): editing readme.org.

3 files changed: +49, −0 lines

README.org (+49, −0):
* Parameter Server

The parameter server is a distributed machine learning framework. It
targets cloud-computing settings where machines may be unreliable,
jobs may be preempted, data may be lost, and network latency and
transient workloads lead to a much more diverse performance profile.
In other words, we target the real cloud-computing scenarios found at
Google, Baidu, Amazon, Microsoft, and the like, rather than
low-utilization, exclusive-use, high-performance supercomputer
clusters.
** Features

- *Ease of use*. The globally shared parameters are represented as
  (potentially sparse) *vectors and matrices*, which are more
  convenient data structures for machine learning applications than
  the widely used (key,value) stores or tables. High-performance,
  convenient multi-threaded linear algebra operations, such as
  vector-matrix multiplication between parameters and local training
  data, are provided to facilitate application development.
- *Efficiency*. Communication between nodes is *asynchronous*.
  Importantly, synchronization does not block computation. The
  framework allows the algorithm designer to balance algorithmic
  convergence rate against system efficiency, where the best
  trade-off depends on the data, the algorithm, and the hardware.
- *Elastic Scalability*. New nodes can be added without restarting
  the running framework. This property is desirable, e.g., for
  streaming sketches, or when deploying a parameter server as an
  online service that must remain available for a long time.
- *Fault Tolerance and Durability*. Node failure is inevitable,
  particularly at large scale on commodity servers. We use an
  optimized data replication architecture that efficiently stores
  data on multiple server nodes, enabling fast recovery (in under one
  second) from node failure.
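The features above can be made concrete with a minimal, single-process sketch of the pull/compute/push cycle on a least-squares objective. The =Server= and =Client= classes and the =push=/=pull= signatures here are illustrative assumptions, not this project's actual API:

#+begin_src python
# Illustrative sketch only: single-process stand-ins for the server and
# client roles; the real system distributes them and runs asynchronously.

class Server:
    """Holds one partition of the globally shared parameter vector."""
    def __init__(self, size):
        self.weights = [0.0] * size

    def pull(self):
        # Clients retrieve the current shared parameters.
        return list(self.weights)

    def push(self, gradient, lr=0.1):
        # Clients send local gradients; the server applies them.
        self.weights = [w - lr * g for w, g in zip(self.weights, gradient)]


class Client:
    """Owns a shard of training data and computes local statistics."""
    def __init__(self, shard):
        self.shard = shard  # list of (features, label) pairs

    def gradient(self, w):
        # Least-squares gradient averaged over the local shard.
        grad = [0.0] * len(w)
        for x, y in self.shard:
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(self.shard)
        return grad


# One synchronous round on toy data (real rounds overlap asynchronously).
server = Server(2)
client = Client([([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)])
w = server.pull()                # retrieve shared parameters
server.push(client.gradient(w))  # send the local gradient back
print(server.weights)            # moved from [0.0, 0.0] toward the targets
#+end_src

In the real system many clients run this loop concurrently against partitioned servers, and pushes need not wait for other clients' pulls.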
** Architecture Overview

The parameter server architecture has two classes of nodes: *server*
nodes and *client* nodes. Each *server* node maintains a partition of
the globally shared parameters. Server nodes communicate with each
other to replicate and/or migrate parameters for reliability and
scaling. The *client* nodes perform the bulk of the computation. Each
client typically stores a portion of the training data locally and
computes local statistics such as gradients. Clients communicate only
with the server nodes, updating and retrieving the shared parameters.
Clients may be added or removed.

[[./doc/img/arch2.png]]
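How keys map to server partitions can be sketched with simple range partitioning. The function names below are illustrative assumptions; the actual system has its own partitioning and replication logic:

#+begin_src python
import bisect

def make_partition(num_keys, num_servers):
    """Split the key space [0, num_keys) into contiguous ranges."""
    # bounds[s] is the exclusive upper key of server s's range.
    return [i * num_keys // num_servers for i in range(1, num_servers + 1)]

def server_of(key, bounds):
    # The first server whose upper bound exceeds the key owns it.
    return bisect.bisect_right(bounds, key)

bounds = make_partition(100, 4)  # four servers -> [25, 50, 75, 100]
print(server_of(10, bounds))     # server 0 owns keys [0, 25)
print(server_of(99, bounds))     # server 3 owns keys [75, 100)
#+end_src

Contiguous ranges keep each server's partition compact, which also makes migrating a range to a newly added server (for the elastic scaling described above) a matter of moving one key interval.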

doc/img/arch.png (106 KB, binary file added)

doc/img/arch2.png (40.1 KB, binary file added)
