Streaming Benchmarks
Our testing framework consists of the following parts:
- Data generator: The role of the `Data Generator` is to generate a steady stream of data to the `Kafka Cluster`. Each record is labeled with a timestamp, and the feed name is `Topic A`.
- Kafka cluster: Kafka is a messaging system (message queue). `Topic A` flows from the `Kafka Cluster` to the `Test Cluster`.
- Test cluster: Gets records (Ra) from the `Kafka Cluster`, runs some simple tests, and generates records (Rb) of `Topic B`. Each Rb record consists of the timestamp of Ra and the generation time of Rb. Rb is then sent asynchronously back to the `Kafka Cluster`.
- Metrics reader: Gets records of `Topic B` from the `Kafka Cluster`, calculates the time differences, and generates reports. A minimal sketch of this calculation follows this list.
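For illustration only, here is a rough sketch of the latency calculation, assuming (hypothetically) that each `Topic B` record is a plain-text `Ra_timestamp:Rb_timestamp` pair in milliseconds and that the topic is named `B`; the actual record format and topic name used by HiBench may differ:

```bash
# Consume a sample of Topic B and compute the average Ra -> Rb latency.
# Record format "ra_ts:rb_ts" and topic name "B" are assumptions for illustration.
$KAFKA_HOME/bin/kafka-console-consumer.sh \
    --zookeeper HOSTNAME:2181 --topic B --max-messages 1000 \
  | awk -F: '{ sum += $2 - $1; n++ }
             END { if (n) printf "avg latency: %.1f ms over %d records\n", sum / n, n }'
```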
After you have finished the configuration described in Getting Started, the following steps are necessary:
- Download & set up ZooKeeper (3.4.8 is preferred).
- Download & set up Apache Kafka (0.8.2.2, Scala version 2.10 is preferred); a setup sketch for both follows this list.
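As a rough single-node sketch (the archive URLs and install layout are assumptions; adapt them to your environment):

```bash
# Download and start ZooKeeper 3.4.8 with the sample standalone config.
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.8/zookeeper-3.4.8.tar.gz
tar -xzf zookeeper-3.4.8.tar.gz
cp zookeeper-3.4.8/conf/zoo_sample.cfg zookeeper-3.4.8/conf/zoo.cfg
zookeeper-3.4.8/bin/zkServer.sh start

# Download and start Kafka 0.8.2.2 (Scala 2.10 build).
# Edit config/server.properties (zookeeper.connect) if ZooKeeper is remote.
wget https://archive.apache.org/dist/kafka/0.8.2.2/kafka_2.10-0.8.2.2.tgz
tar -xzf kafka_2.10-0.8.2.2.tgz
kafka_2.10-0.8.2.2/bin/kafka-server-start.sh -daemon kafka_2.10-0.8.2.2/config/server.properties
```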
Then choose the framework(s) you want to test:
- Download & set up Apache Spark (1.6.1, Scala version 2.11 is preferred); a download sketch follows this list.
- Download & set up Apache Storm (1.0.1 is preferred).
- Download & set up Apache Flink (1.0.3 is preferred).
- Download & set up Apache Gearpump (0.8.1 is preferred).
- Download & set up Apache Samza (a Hadoop YARN cluster is needed to run Samza).
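For example, fetching a prebuilt Spark might look like the sketch below. The archive URL and Hadoop build variant are assumptions, and the prebuilt 1.6.1 binaries target Scala 2.10, so a Scala 2.11 build may need to be compiled from source; the other frameworks follow the same download-and-extract pattern.

```bash
# Download and extract a prebuilt Spark 1.6.1 distribution.
wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
tar -xzf spark-1.6.1-bin-hadoop2.6.tgz
export SPARK_HOME=$PWD/spark-1.6.1-bin-hadoop2.6
```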
The configuration for Streaming Benchmarks is split across two directories, `conf/` and `workloads/streamingbench/conf/`; settings in the latter override those in the former.
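For example (hypothetical values, only to illustrate the override order):

```
# In conf/01-default-streamingbench.conf:
hibench.streambench.datagen.recordsPerInterval    5

# In a file under workloads/streamingbench/conf/ -- this value takes effect:
hibench.streambench.datagen.recordsPerInterval    10
```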
The first file you need to configure is `conf/01-default-streamingbench.conf`:

| Param Name | Param Meaning |
| --- | --- |
| hibench.streambench.testCase | Available benchmark names: `identity`, `repartition`, `wordcount`, `fixwindow`. |
| hibench.streambench.zkHost | ZooKeeper address for the Kafka server, written as HOSTNAME:HOSTPORT. |
| hibench.streambench.sampleProbability | Probability used in the sample test case. |
| hibench.streambench.debugMode | Whether to run in debug mode for correctness verification (default: false). |
| hibench.streambench.kafka.home | /PATH/TO/KAFKA/HOME |
| hibench.streambench.kafka.topicPartitions | Number of partitions of the generated topic (default: 20). |
| hibench.streambench.kafka.consumerGroup | Consumer group of the Kafka consumer (default: HiBench). |
| hibench.streambench.kafka.brokerList | Kafka broker list, written as "host:port,host:port,..." (default: HOSTNAME:HOSTPORT). |
| hibench.streambench.kafka.offsetReset | Starting offset of the Kafka consumer (default: largest). |
| hibench.streambench.datagen.intervalSpan | Interval span in milliseconds (default: 50). |
| hibench.streambench.datagen.recordsPerInterval | Number of records to generate per interval span (default: 5). |
| hibench.streambench.datagen.totalRecords | Total number of records to generate (default: -1, meaning infinite). |
| hibench.streambench.datagen.totalRounds | Total number of rounds of data to send (default: -1, meaning infinite). |
| hibench.streambench.datagen.dir | Default path to store seed files (default: ${hibench.hdfs.data.dir}/Streaming). |
| hibench.streambench.datagen.recordLength | Fixed length of each record (default: 200). |
| hibench.streambench.datagen.producerNumber | Number of `KafkaProducer` instances, each running on its own thread (default: 1). A single `KafkaProducer` is limited to about 100 Mb/s. |
| hibench.streambench.fixWindowDuration | Duration of the window (in ms). |
| hibench.streambench.fixWindowSlideStep | Slide step of the window (in ms). |
| hibench.streambench.spark.receiverNumber | Number of nodes that will receive Kafka input (default: 4). |
| hibench.streambench.spark.batchInterval | Spark Streaming batch interval in milliseconds (default: 100). |
| hibench.streambench.spark.storageLevel | RDD storage level (default: 2). 0 means `StorageLevel.MEMORY_ONLY`, 1 means `StorageLevel.MEMORY_AND_DISK_SER`, any other value means `StorageLevel.MEMORY_AND_DISK_SER_2`. |
| hibench.streambench.spark.enableWAL | Whether to test the write-ahead log feature (default: false). |
| hibench.streambench.spark.checkpointPath | If enableWAL is true, this HDFS path for storing the stream context must be specified; if false, it can be empty (default: /var/tmp). |
| hibench.streambench.spark.useDirectMode | Whether to use the direct approach (default: true). |
| hibench.streambench.flink.home | /PATH/TO/FLINK/HOME |
| hibench.streambench.flink.parallelism | Default parallelism of the Flink job. |
| hibench.streambench.flink.bufferTimeout | |
| hibench.streambench.flink.checkpointDuration | |
| hibench.streambench.storm.home | /PATH/TO/STORM/HOME |
| hibench.streambench.storm.nimbus | Nimbus host of the Storm cluster. |
| hibench.streambench.storm.nimbusAPIPort | Nimbus port (default: 6627). |
| hibench.streambench.storm.nimbusContactInterval | Time interval at which Nimbus is contacted to check whether the job has finished. |
| hibench.streambench.storm.worker_count | Number of Storm workers. The maximum number of bolt threads is also equal to this value. |
| hibench.streambench.storm.spout_threads | Number of Kafka spout threads of Storm. |
| hibench.streambench.storm.bolt_threads | Total number of bolt threads. |
| hibench.streambench.storm.read_from_start | Whether to read data from the start of the Kafka topic or continue from the last position (default: true). |
| hibench.streambench.storm.ackon | Whether to run with acking enabled (default: true). |
| hibench.streambench.gearpump.home | /PATH/TO/GEARPUMP/HOME |
| hibench.streambench.gearpump.executors | |
| hibench.streambench.gearpump.parallelism | |
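Putting a few of these together, a minimal single-broker configuration might look like this sketch (the host names and values are illustrative assumptions):

```
hibench.streambench.testCase                    identity
hibench.streambench.zkHost                      localhost:2181
hibench.streambench.kafka.home                  /opt/kafka_2.10-0.8.2.2
hibench.streambench.kafka.brokerList            localhost:9092
hibench.streambench.kafka.topicPartitions       20
hibench.streambench.datagen.intervalSpan        50
hibench.streambench.datagen.recordsPerInterval  5
```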
Usually you need to run the streaming data generation scripts to push data to Kafka while the streaming job is running. Create the Kafka topics first, then generate the seed file, and finally generate the real data, by running the following three scripts:

```
workloads/streamingbench/prepare/initTopics.sh
workloads/streamingbench/prepare/genSeedDataset.sh
workloads/streamingbench/prepare/gendata.sh
```
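Before starting the streaming job, you can sanity-check that the topics were created, e.g. with Kafka's own tooling (host and port are assumptions):

```bash
# List the topics that the prepare scripts created on the Kafka cluster.
$KAFKA_HOME/bin/kafka-topics.sh --zookeeper HOSTNAME:2181 --list
```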
While the data are being sent to Kafka, start the streaming job (e.g., Spark Streaming) to process the data:

```
workloads/streamingbench/spark/bin/run.sh
```