Hibench-Terasort¶
TL;DR: HiBench is a big data benchmark suite. This workload implements two of the benchmarks from HiBench Terasort on a Hadoop/Spark stack.
Source: workload/Hibench-Terasort/README.md
Note: The Workload Services Framework is a benchmarking framework and is not intended to be used for the deployment of workloads in production environments. It is recommended that users consider any adjustments which may be necessary for the deployment of these workloads in a production environment including those necessary for implementing software best practices for workload scalability and security.
Introduction¶
HiBench is a big data benchmark suite. This workload implements two of the benchmarks from HiBench Terasort on a Hadoop/Spark stack.
TeraSort is a standard map-reduce benchmark created by Jim Gray. Its input data is generated by Hadoop TeraGen example program.
Test Cases¶
For testcase, two engines supported,Spark and MapReduce
please use HIBENCH_SCALE_PROFILE to defines test datasize, default is tiny
| value | datasize |
|---|---|
| tiny | 20GB |
| small | 50GB |
| large | 100GB |
| huge | 200GB |
| gigantic | 500GB |
| bigdata | 1TB |
Tunable Parameters¶
Tunable Parameters are defined in validate.sh. There are 6 sets of configuration involved: HiBench settings, Spark Settings, YARN settings, MapReduce Settings, DFS settings. In tables below, parameters with default value auto are assigned values according to system condition(either CPU or memory).
CPU_CORES refers to number of CPUs in system; MEM_TOTAL refers to free memory (minus 5) in system in following Algorithms.
1. HiBench settings¶
| Parameters | Default | Algorithm/Comments |
|---|---|---|
| hibench.scale.profile | tiny |
Available settings: tiny, small, large, huge, gigantic, and bigdata |
| hibench.default.map.parallelism | auto | hibench.yarn.executor.cores * hibench.yarn.executor.num |
| hibench.default.shuffle.parallelism | auto | hibench.yarn.executor.cores * hibench.yarn.executor.num |
3. Spark settings¶
| Parameters | Default | Algorithm |
|---|---|---|
| hibench.yarn.executor.cores | auto | if CPU_CORES < 8; hibench.yarn.executor.cores=2; else hibench.yarn.executor.cores=5 |
| hibench.yarn.executor.num | auto | CPU_CORES_IN_CLUSTER=CPU_CORES * WORKERNODE_NUM hibench.yarn.executor.num=CPU_CORES_IN_CLUSTER / hibench.yarn.executor.cores |
| spark.executor.memory | auto | MEM_TOTAL_IN_CLUSTER=MEM_TOTAL * WORKERNODE_NUM spark.executor.memory=MEM_TOTAL_IN_CLUSTER / hibench.yarn.executor.num * 80% |
| spark.executor.memoryOverhead | auto | spark.executor.memoryOverhead=MEM_TOTAL_IN_CLUSTER / hibench.yarn.executor.num * 20% |
| spark.driver.memory | auto | if MEM_TOTAL > 32 spark.driver.memory=20; if MEM_TOTAL > 8, spark.driver.memory=8; else spark.driver.memory=2 |
| spark.default.parallelism | auto | spark.default.parallelism = CPU_CORES_IN_CLUSTER |
| spark.sql.shuffle.partitions | auto | spark.sql.shuffle.partitions= CPU_CORES_IN_CLUSTER |
| hibench.spark.master | yarn |
4. Yarn settings¶
| Parameters | Default | Algorithm |
|---|---|---|
| yarn.scheduler.minimum-allocation-mb | 1024 | |
| yarn.scheduler.maximum-allocation-mb | auto | MEM_TOTAL * 1024 |
| yarn.scheduler.minimum-allocation-vcores | 1 | |
| yarn.scheduler.maximum-allocation-vcores | auto | CPU_CORES |
| yarn.nodemanager.vmem-pmem-ratio | 2.1 | |
| yarn.nodemanager.resource.percentage-physical-cpu-limit | 100 | |
| yarn.nodemanager.resource.memory-mb | auto | MEM_TOTAL * 1024 |
| yarn.nodemanager.resource.cpu-vcores | auto | CPU_CORES |
| yarn.resourcemanager.scheduler.client.thread-count | 50 |
5. MapReduce settings¶
| Parameters | Default |
|---|---|
| mapreduce.map.cpu.vcores | 1 |
| mapreduce.reduce.cpu.vcores | 1 |
| mapreduce.task.io.sort.factor | 64 |
| mapreduce.task.io.sort.mb | 512 |
| mapreduce.map.sort.spill.percent | 0.8 |
| mapreduce.job.reduce.slowstart.completedmaps | 1 |
| mapreduce.job.reduce.slowstart.completedmaps | org.apache.hadoop.io.compress.SnappyCodec |
6. DFS settings¶
| Parameters | Default |
|---|---|
| dfs.namenode.handler.count | 30 |
| dfs.datanode.handler.count | 30 |
| dfs.blocksize | 128m |
CSP Settings¶
For cloud tests, data directory is mounted on OS disk by default. If additional storage space is required, set ENABLE_MOUNT_DIR to true to enable DISK_SPEC_1 and attach additional disks to workers; Default MOUNT_DIR is /mnt/disk{N}, N is the number of disks. Please keep DISK_COUNT the same value as <CSP>_DISK_SPEC_1_DISK_COUNT and CLIENT_INSTANCE_TYPE the same as WORKER_INSTANCE_TYPE for cloud tests.
| Parameters | Comments | |
|---|---|---|
| GCP_CONTROLLER_OS_DISK_SIZE | Controller OS disk size | |
| GCP_CLIENT_OS_DISK_SIZE | Client OS disk size | |
| GCP_WORKER_OS_DISK_SIZE | Worker OS disk size | |
| GCP_CONTROLLER_OS_DISK_TYPE | Controller OS disk type | |
| GCP_CLIENT_OS_DISK_TYPE | Client OS disk type | |
| GCP_WORKER_OS_DISK_TYPE | Worker OS disk type | |
| GCP_DISK_SPEC_1_DISK_COUNT | Number of disks (apart from OS disk) that will be attached to Worker | |
| GCP_DISK_SPEC_1_DISK_SIZE | Disk size that will be attached to Worker | |
| GCP_DISK_SPEC_1_DISK_TYPE | Disk type that will be attached to Worker | |
| ENABLE_MOUNT_DIR | Enable/Disable disk attach to Worker | |
| GCP_CLIENT_INSTANCE_TYPE | Client instance type | |
| GCP_WORKER_INSTANCE_TYPE | Worker instance type | |
| GCP_ZONE | ||
| DISK_COUNT | Number of disks (apart from OS disk) expected to be attached to Worker |
General Settings¶
| Parameters | Comments |
|---|---|
| WORKERNODE_NUM | Number of worker nodes |
| NODE_NUM | Total number of nodes (workers+client); for gated testcase, NODE_NUM=1 as client and worker are on the same node |
Test Config¶
All Tunable Parameters, CSP Settings, and General Settings can be specified by users through test config file and applied by ./ctest.sh -R testcase_name --config path_to_test_config.yaml -VV . Using test config file is helpful for users to reproduce test results and keep track of performance.
*gcp_hibench_terasort_mapreduce_pkm:
# disk spec
GCP_CONTROLLER_OS_DISK_SIZE: 120
GCP_CLIENT_OS_DISK_SIZE: 120
GCP_WORKER_OS_DISK_SIZE: 120
GCP_CONTROLLER_OS_DISK_TYPE: pd-standard
GCP_CLIENT_OS_DISK_TYPE: pd-standard
GCP_WORKER_OS_DISK_TYPE: pd-ssd
GCP_DISK_SPEC_1_DISK_COUNT: 2
GCP_DISK_SPEC_1_DISK_SIZE: 120
GCP_DISK_SPEC_1_DISK_TYPE: pd-ssd
# platform
GCP_CLIENT_INSTANCE_TYPE: n2d-standard-8
GCP_WORKER_INSTANCE_TYPE: n2d-standard-8
GCP_WORKER_MIN_CPU_PLATFORM: AMD%20Milan
GCP_ZONE: us-east4-c
# workload config
# DISK_COUNT needs to be equal to GCP_DISK_SPEC_1_DISK_COUNT
DISK_COUNT: 2
HIBENCH_SCALE_PROFILE: huge
WORKERNODE_NUM: 4
NODE_NUM: 5
ENABLE_MOUNT_DIR: true
Docker Image¶
The workload defines two Docker images.
* hibench is the cluster node image. It is a basic image used for various roles in the Hadoop cluster.
* hibench-client is the main benchmark image based on cluster node image. It contains configuration files and scripts for instrumenting the workload.
The workload must be run using Kubernetes.
KPI¶
Run the kpi.sh script to parse the KPIs from the HiBench run report.
The following KPIs are exposed:
- Duration (s): Total running duration.
- Throughput (bytes/s): Workload throughput.
- Throughput per node (bytes/s): Workload throughput divided by number of execution nodes.
- Known Issues:
- None
Index Info¶
- Name:
HiBench-Kmeans - Category:
DataServices - Platform:
SPR,ICX,EMR,SRF - keywords:
- Permission:
- Supported Labels : HAS-SETUP-DISK-SPEC