* Parameter Server

The parameter server is a distributed machine learning framework. It
targets cloud-computing environments where machines may be unreliable,
jobs may be preempted, data may be lost, and where network latency and
transient workloads lead to a much more diverse performance profile.
In other words, we target real cloud computing scenarios applicable to
Google, Baidu, Amazon, Microsoft, etc., rather than low
utilization-rate, exclusive-use, high-performance supercomputer
clusters.

** Features
- *Ease of use*. The globally shared parameters are represented as
  (potentially sparse) *vectors and matrices*, which are more convenient
  data structures for machine learning applications than the widely
  used (key,value) stores or tables. High-performance, convenient
  multi-threaded linear algebra operations, such as vector-matrix
  multiplication between parameters and local training data, are
  provided to ease application development.

- *Efficiency*. Communication between nodes is
  *asynchronous*. Importantly, synchronization does not block
  computation. The framework lets the algorithm designer balance
  algorithmic convergence rate against system efficiency, where the
  best trade-off depends on the data, the algorithm, and the hardware.

- *Elastic Scalability*. New nodes can be added without restarting the
  running framework. This property is desirable, e.g., for streaming
  sketches or when deploying a parameter server as an online service
  that must remain available for a long time.

- *Fault Tolerance and Durability*. Node failure is inevitable,
  particularly at large scale on commodity servers. We use an
  optimized data replication architecture that efficiently stores data
  on multiple server nodes, enabling recovery from a node failure in
  under one second.

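The non-blocking communication described under *Efficiency* can be
sketched as follows. This is a minimal, hypothetical illustration of
the pattern, not the framework's actual API: =compute_grad= and
=push_to_server= are stand-in callables supplied by the caller.

```python
# Sketch of overlapping computation with communication: each gradient
# push is handed to a background executor, and the client immediately
# proceeds to the next step instead of blocking on the server's reply.
# (Illustrative pattern only; not the framework's real interface.)
from concurrent.futures import ThreadPoolExecutor


def train(compute_grad, push_to_server, num_steps):
    """Run num_steps of local computation with asynchronous pushes."""
    pending = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for step in range(num_steps):
            grad = compute_grad(step)                  # local computation
            pending.append(pool.submit(push_to_server, grad))  # async push
        # The algorithm designer chooses how often (if ever) to wait on
        # outstanding pushes, trading convergence rate for throughput.
        for future in pending:
            future.result()
```

Because pushes complete in the background, a step's computation never
waits for the previous step's network round trip.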
** Architecture Overview

The parameter server architecture has two classes of nodes. Each
*server* node maintains a partition of the globally shared parameters;
server nodes communicate with one another to replicate and/or migrate
parameters for reliability and scaling. The *client* nodes perform the
bulk of the computation. Each client typically stores a portion of the
training data locally and computes local statistics such as gradients.
Clients communicate only with the server nodes, updating and
retrieving the shared parameters. Clients may be added or removed
while the system is running.

[[./doc/img/arch2.png]]
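The server/client interaction above can be sketched as a toy
in-process model. The names (=Server=, =Client=, =push=, =pull=) and
the hash-based key partitioning are illustrative assumptions for this
sketch, not the framework's actual API.

```python
# Toy model of the two node classes: servers each own a partition of
# the shared parameters; clients push gradients and pull values, and
# talk only to servers. (Illustrative sketch, not the real interface.)

class Server:
    """Holds one partition of the globally shared parameters."""
    def __init__(self):
        self.params = {}  # key -> parameter value

    def push(self, grads, lr=0.1):
        # Apply gradient updates sent by a client (simple SGD step).
        for k, g in grads.items():
            self.params[k] = self.params.get(k, 0.0) - lr * g

    def pull(self, keys):
        # Return current values for the requested keys.
        return {k: self.params.get(k, 0.0) for k in keys}


class Client:
    """Computes local gradients; communicates only with servers."""
    def __init__(self, servers):
        self.servers = servers

    def _server_for(self, key):
        # Keys are partitioned across servers by hash (an assumption
        # of this sketch; real systems often use key ranges).
        return self.servers[hash(key) % len(self.servers)]

    def push(self, grads):
        # Route each key's gradient to the server owning that key.
        buckets = {}
        for k, g in grads.items():
            buckets.setdefault(id(self._server_for(k)), (self._server_for(k), {}))[1][k] = g
        for srv, part in buckets.values():
            srv.push(part)

    def pull(self, keys):
        out = {}
        for k in keys:
            out.update(self._server_for(k).pull([k]))
        return out
```

A client never talks to another client; all shared state flows through
the server partition that owns each key.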