Popularity Rankings is an app for calculating statistics for products based on their ratings. It consists of two programs:
- Runner, to run calculations from the command line
- WebServer, to start a web server with a single route that accepts a valid CSV file in the request body and returns JSON output containing the calculated statistics.
Clone this repo, install the required tools, and run it (see below).
- Java 1.8
- SBT 0.13.5 or higher
- Scala 2.12
If you're on a Mac, you can install these tools with Homebrew. Another option is to use Docker containers.
After cloning the project, navigate to the root directory and build a fat JAR (with all dependencies) with:
sbt assembly
This task also runs the spec tests. The following options are then available:
java -cp target/scala-2.12/PopularityRankings-assembly-0.1.jar Runner <path to CSV file>
java -cp target/scala-2.12/PopularityRankings-assembly-0.1.jar api.WebServer
Another way to run locally is to import the project into IntelliJ (or an IDE of your choice). To do this, follow these steps:
- import the project as an SBT project
- refresh the project in the SBT tool window to make all sbt dependencies available
- create a run configuration
- run the project with the created configuration
To run with Docker, first create a Dockerfile in the project root. Then build the image:
docker build -t <image_name> .
Then run the container with:
docker run -p 8080:8080 -t -i <image_name>
The server should be accessible on port 8080.
No specific configuration is needed to run the command-line program or the server.
Run the sbt test task from the SBT console, or create a test configuration in your favourite IDE and run it. The project contains both unit tests and API route tests. Not all functions and scenarios are currently covered by tests.
To deploy the program or server, create a Dockerfile and a Jenkinsfile with appropriate build and run steps (see the Installing and running section).
Currently there are 3 implementations of the calculation:
- a naive implementation that uses Scala collection chaining, without paying too much attention to performance (CollectionChainingRanker)
- an implementation using recursion (RecursionRanker)
- an implementation using fold (FoldRanker)
During the first run in IntelliJ (all caches cleared), the implementations showed the following performance:
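To illustrate how the three styles differ, here is a minimal sketch that counts ratings per product in each of them. It is not taken from the project's rankers: the object name `RankerSketch`, the `(productId, rating)` input shape and the method names are all assumptions made for this example.

```scala
import scala.annotation.tailrec

object RankerSketch {
  // Hypothetical input shape: (productId, rating) pairs. The project's own types may differ.
  type Rating = (String, Int)

  // 1. Collection chaining: groupBy builds an intermediate Map of sequences, then each is counted.
  def countByChaining(ratings: Seq[Rating]): Map[String, Int] =
    ratings.groupBy(_._1).map { case (id, rs) => id -> rs.size }

  // 2. Tail recursion: walk the list once, carrying the counts in an accumulator.
  def countByRecursion(ratings: Seq[Rating]): Map[String, Int] = {
    @tailrec
    def loop(rest: List[Rating], acc: Map[String, Int]): Map[String, Int] = rest match {
      case Nil => acc
      case (id, _) :: tail => loop(tail, acc.updated(id, acc.getOrElse(id, 0) + 1))
    }
    loop(ratings.toList, Map.empty)
  }

  // 3. Fold: the same single pass, with foldLeft managing the accumulator.
  def countByFold(ratings: Seq[Rating]): Map[String, Int] =
    ratings.foldLeft(Map.empty[String, Int]) { case (acc, (id, _)) =>
      acc.updated(id, acc.getOrElse(id, 0) + 1)
    }

  def main(args: Array[String]): Unit = {
    val sample = Seq("fixie-01" -> 5, "saddle-01" -> 3, "fixie-01" -> 4)
    println(countByChaining(sample))
    println(countByRecursion(sample))
    println(countByFold(sample))
  }
}
```

All three return the same map for the same input. The chaining version materialises an intermediate grouping, while the recursive and fold versions make a single pass.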
Statistics calculated using collection chaining in 1098 ms: {"bestRatedProducts":["blu-ray-01","fixie-01","widetv-03"],"invalidLines":1,"lessRatedProduct":"saddle-01","mostRatedProduct":"wifi-projector-01","validLines":48,"worstRatedProducts":["endura-01","smarttv-01","patagonia-01"]}
Statistics calculated using recursion in 408 ms: {"bestRatedProducts":["blu-ray-01","fixie-01","widetv-03"],"invalidLines":2,"lessRatedProduct":"saddle-01","mostRatedProduct":"wifi-projector-01","validLines":48,"worstRatedProducts":["endura-01","smarttv-01","patagonia-01"]}
Statistics calculated using fold in 474 ms: {"bestRatedProducts":["blu-ray-01","fixie-01","widetv-03"],"invalidLines":2,"lessRatedProduct":"saddle-01","mostRatedProduct":"wifi-projector-01","validLines":48,"worstRatedProducts":["endura-01","smarttv-01","patagonia-01"]}
The API contains only one route, to make testing easier (through Postman or any other tool of your choice).
curl --location --request POST 'http://localhost:8080/api/v1/statistics' \
--form 'csv=@"<path to your CSV file>"'
Response body
{
  "bestRatedProducts": [
    "blu-ray-01",
    "fixie-01",
    "widetv-03"
  ],
  "invalidLines": 1,
  "lessRatedProduct": "saddle-01",
  "mostRatedProduct": "wifi-projector-01",
  "validLines": 48,
  "worstRatedProducts": [
    "endura-01",
    "smarttv-01",
    "patagonia-01"
  ]
}
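The fields of this response can be modelled with a plain case class. The sketch below is an assumption, not the project's actual model. In particular, it guesses that "best" and "worst" rank products by average rating and that "most" and "less" rank them by number of ratings. `StatisticsSketch`, `Statistics` and `summarize` are hypothetical names.

```scala
object StatisticsSketch {
  // Field names mirror the JSON response; the project's actual model class may differ.
  final case class Statistics(
    bestRatedProducts: Seq[String],
    worstRatedProducts: Seq[String],
    mostRatedProduct: Option[String],
    lessRatedProduct: Option[String],
    validLines: Int,
    invalidLines: Int
  )

  // Assumption: best/worst are ranked by average rating, most/less by number of ratings.
  def summarize(ratings: Map[String, Seq[Int]], invalidLines: Int): Statistics = {
    val averages = ratings.toSeq.collect {
      case (id, rs) if rs.nonEmpty => id -> (rs.sum.toDouble / rs.size)
    }
    val counts = ratings.toSeq.map { case (id, rs) => id -> rs.size }
    Statistics(
      bestRatedProducts = averages.sortBy { case (_, avg) => -avg }.take(3).map(_._1),
      worstRatedProducts = averages.sortBy { case (_, avg) => avg }.take(3).map(_._1),
      mostRatedProduct = if (counts.isEmpty) None else Some(counts.maxBy(_._2)._1),
      lessRatedProduct = if (counts.isEmpty) None else Some(counts.minBy(_._2)._1),
      validLines = counts.map(_._2).sum,
      invalidLines = invalidLines
    )
  }
}
```

How ties are broken here depends on the iteration order of the input map, so a real implementation would likely need an explicit tie-breaking rule.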
For schema validation, we can use an online schema validator, such as this
This project is only a POC, with plenty of room for improvement. If we want to scale the process so that it works in a reasonable amount of time for big files and streams, these are some of the options:
- use the FS2 library, which is designed for working with streams
- use Apache Spark
- experiment with other data structures, such as BinaryTree or similar, which may be more suitable for sorting huge data sets
- use more optimised algorithms for sorting/searching in the process, based on the data types and structures
To write the code in a more functional way, we could use the Typelevel stack (cats.io and http4s) instead of Lightbend (Akka).
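Before reaching for FS2 or Spark, the basic streaming idea can be sketched with the standard library alone: consume lines from an `Iterator` and keep only running totals per product, so the whole file never has to sit in memory. This is a rough sketch under stated assumptions. The `productId,rating` line format is hypothetical (the real CSV layout may have other columns), and `StreamingSketch`, `Totals`, `State`, `parse` and `aggregate` are names invented for this example.

```scala
import scala.util.Try

object StreamingSketch {
  // Running totals per product: enough to derive averages and rating counts afterwards.
  final case class Totals(sum: Long, count: Long) {
    def add(rating: Int): Totals = Totals(sum + rating, count + 1)
    def average: Double = sum.toDouble / count
  }

  final case class State(perProduct: Map[String, Totals], validLines: Long, invalidLines: Long)

  // Hypothetical line format "productId,rating"; anything else counts as an invalid line.
  def parse(line: String): Option[(String, Int)] =
    line.split(",").map(_.trim) match {
      case Array(id, rating) if id.nonEmpty => Try(rating.toInt).toOption.map(r => id -> r)
      case _ => None
    }

  // Single pass over the iterator; memory use grows with the number of products, not lines.
  def aggregate(lines: Iterator[String]): State =
    lines.foldLeft(State(Map.empty, 0L, 0L)) { (state, line) =>
      parse(line) match {
        case Some((id, rating)) =>
          val totals = state.perProduct.getOrElse(id, Totals(0L, 0L)).add(rating)
          state.copy(
            perProduct = state.perProduct.updated(id, totals),
            validLines = state.validLines + 1
          )
        case None =>
          state.copy(invalidLines = state.invalidLines + 1)
      }
    }
}
```

For a file on disk, the iterator could come from `scala.io.Source.fromFile(path).getLines()`. A library such as FS2 would add resource safety and backpressure on top of the same fold-style aggregation.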