* Parameter Server

The parameter server is a distributed machine learning framework. It
targets cloud-computing environments where machines may be unreliable,
jobs may be preempted, data may be lost, and where network latency and
transient workloads lead to a much more diverse performance profile.
In other words, we target real cloud computing scenarios applicable to
Google, Baidu, Amazon, Microsoft, etc., rather than low
utilization-rate, exclusive-use, high-performance supercomputer
clusters.

** Features
- *Ease of use*. The globally shared parameters are represented as
  (potentially sparse) *vectors and matrices*, which are more convenient
  data structures for machine learning applications than the widely
  used (key,value) stores or tables. High-performance, convenient
  multi-threaded linear algebra operations, such as vector-matrix
  multiplication between parameters and local training data, are
  provided to ease application development.

- *Efficiency*. Communication between nodes is
  *asynchronous*. Importantly, synchronization does not block
  computation. The framework lets the algorithm designer balance
  algorithmic convergence rate against system efficiency, where the
  best trade-off depends on the data, the algorithm, and the hardware.

- *Elastic Scalability*. New nodes can be added without restarting the
  running framework. This property is desirable, e.g., for streaming
  sketches or when deploying a parameter server as an online service
  that must remain available for a long time.

- *Fault Tolerance and Durability*. Node failure is inevitable,
  particularly at large scale on commodity servers. We use an
  optimized data replication architecture that efficiently stores data
  on multiple server nodes, enabling recovery from a node failure in
  under one second.

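The non-blocking communication described under *Efficiency* can be
sketched as follows. This is a minimal, hypothetical illustration of
the pattern, not the framework's actual API: =compute_grad= and
=push_to_server= are stand-in callables supplied by the caller.

```python
# Sketch of overlapping computation with communication: each gradient
# push is handed to a background executor, and the client immediately
# proceeds to the next step instead of blocking on the server's reply.
# (Illustrative pattern only; not the framework's real interface.)
from concurrent.futures import ThreadPoolExecutor


def train(compute_grad, push_to_server, num_steps):
    """Run num_steps of local computation with asynchronous pushes."""
    pending = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for step in range(num_steps):
            grad = compute_grad(step)                  # local computation
            pending.append(pool.submit(push_to_server, grad))  # async push
        # The algorithm designer chooses how often (if ever) to wait on
        # outstanding pushes, trading convergence rate for throughput.
        for future in pending:
            future.result()
```

Because pushes complete in the background, a step's computation never
waits for the previous step's network round trip.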
** Architecture Overview

The parameter server architecture has two classes of nodes. Each
*server* node maintains a partition of the globally shared parameters;
server nodes communicate with one another to replicate and/or migrate
parameters for reliability and scaling. The *client* nodes perform the
bulk of the computation. Each client typically stores a portion of the
training data locally and computes local statistics such as gradients.
Clients communicate only with the server nodes, updating and
retrieving the shared parameters. Clients may be added or removed
while the system is running.

[[./doc/img/arch2.png]]
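The server/client interaction above can be sketched as a toy
in-process model. The names (=Server=, =Client=, =push=, =pull=) and
the hash-based key partitioning are illustrative assumptions for this
sketch, not the framework's actual API.

```python
# Toy model of the two node classes: servers each own a partition of
# the shared parameters; clients push gradients and pull values, and
# talk only to servers. (Illustrative sketch, not the real interface.)

class Server:
    """Holds one partition of the globally shared parameters."""
    def __init__(self):
        self.params = {}  # key -> parameter value

    def push(self, grads, lr=0.1):
        # Apply gradient updates sent by a client (simple SGD step).
        for k, g in grads.items():
            self.params[k] = self.params.get(k, 0.0) - lr * g

    def pull(self, keys):
        # Return current values for the requested keys.
        return {k: self.params.get(k, 0.0) for k in keys}


class Client:
    """Computes local gradients; communicates only with servers."""
    def __init__(self, servers):
        self.servers = servers

    def _server_for(self, key):
        # Keys are partitioned across servers by hash (an assumption
        # of this sketch; real systems often use key ranges).
        return self.servers[hash(key) % len(self.servers)]

    def push(self, grads):
        # Route each key's gradient to the server owning that key.
        buckets = {}
        for k, g in grads.items():
            buckets.setdefault(id(self._server_for(k)), (self._server_for(k), {}))[1][k] = g
        for srv, part in buckets.values():
            srv.push(part)

    def pull(self, keys):
        out = {}
        for k in keys:
            out.update(self._server_for(k).pull([k]))
        return out
```

A client never talks to another client; all shared state flows through
the server partition that owns each key.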