Hibench-Kmeans¶
TL;DR: HiBench is a big data benchmark suite. This workload implements two of the benchmarks from HiBench KMeans on a Hadoop/Spark stack.
Source: workload/Hibench-Kmeans/README.md
Note: The Workload Services Framework is a benchmarking framework and is not intended to be used for the deployment of workloads in production environments. It is recommended that users consider any adjustments which may be necessary for the deployment of these workloads in a production environment including those necessary for implementing software best practices for workload scalability and security.
IntroductionAdd commentMore actions¶
HiBench is a big data benchmark suite. This workload implements two of the benchmarks from HiBench KMeans on a Hadoop/Spark stack.
K-Means is a well-known clustering algorithm for knowledge discovery and data mining. This workload benchmarks the clustering implementation in spark.mllib. The input data set is generated by GenKMeansDataset based on Uniform Distribution and Guassian Distribution.
Test Cases¶
test_gcp_hibench_kmeans_gated- Single-node gated testcase
- Scale profile is set to
tinyfor fast test test_gcp_hibench_kmeans_pkm(default 4 nodes)- 3 Worker Nodes + 1 Client Node testcase
- Default Scale profile is
tinyfor disk protection test_gcp_hibench_kmeans_default(default 4 nodes)- Has no difference with pkm testcase so far
- Designed as PCOM requirements
Tunable Parameters¶
Tunable Parameters are defined in validate.sh. There are 6 sets of configuration involved: HiBench settings, Kmeans settings, Spark Settings, YARN settings, MapReduce Settings, DFS settings. In tables below, parameters with default value auto are assigned values according to system condition(either CPU or memory). These auto-decided parameters are also open to be defined by users.
eg. hibench.default.map.parallelism and hibench.default.shuffle.parallelism can be assigned values directly with the below config.
./ctest.sh -R test_gcp_hibench_kmeans_pkm --config test-confi.yaml -VV
*gcp_hibench_kmeans_pkm:
HIBENCH_DEFAULT_MAP_PARALLELISM: 200
HIBENCH_DEFAULT_SHUFFLE_PARALLELISM: 100
*Note: CPU_CORES refers to number of CPUs in system; MEM_TOTAL refers to free memory (minus 5) in system in following Algorithms. *
1. HiBench settings¶
Please note that hibench.scale.profile is not auto-decided while it has significant effect on benchmark results. It decides a group of Kmeans related parameters, including num_of_clusters, dimension, num_of_samples, samples_per_inputfile, max_iteration and k. Please refer to Kmeans settings for their default settings.
User will need to decide scale profile according to SUT capability.
| Parameters | Default | Algorithm/Comments |
|---|---|---|
| hibench.scale.profile | tiny |
Available settings: tiny, small, large, huge, gigantic, and bigdata |
| hibench.default.map.parallelism | auto | hibench.yarn.executor.cores * hibench.yarn.executor.num |
| hibench.default.shuffle.parallelism | auto | hibench.yarn.executor.cores * hibench.yarn.executor.num |
2. Kmeans settings¶
Kmeans settings are connected to hibench.scale.profile; Default values for each scale profile is listed below. All settings are open to defined by user through ctest.
| Parameters | tiny | small | large | huge | gigantic | bigdata |
|---|---|---|---|---|---|---|
| hibench.kmeans.num_of_clusters | 5 | 5 | 5 | 5 | 5 | 5 |
| hibench.kmeans.dimensions | 100 | 300 | 400 | 600 | 800 | 1000 |
| hibench.kmeans.num_of_samples | 31000000 | 31000000 | 31000000 | 31000000 | 31000000 | 31000000 |
| hibench.kmeans.samples_per_inputfile | 10000 | 10000 | 10000 | 10000 | 10000 | 20000 |
| hibench.kmeans.tiny.max_iteration | 40 | 40 | 40 | 40 | 40 | 40 |
| hibench.kmeans.k | 300 | 300 | 300 | 300 | 300 | 300 |
| hibench.kmeans.convergedist | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
eg. The following test will have 9000000 samples with default 2500 samples per input file.
./ctest.sh -R test_gcp_hibench_kmeans_pkm --config test-confi_bigdata.yaml -VV
*gcp_hibench_kmeans_pkm:
HIBENCH_SCALE_PROFILE: bigdata
HIBENCH_KMEANS_NUM_OF_SAMPLES: 9000000
3. Spark settings¶
| Parameters | Default | Algorithm |
|---|---|---|
| hibench.yarn.executor.cores | auto | if CPU_CORES < 8; hibench.yarn.executor.cores=2; else hibench.yarn.executor.cores=5 |
| hibench.yarn.executor.num | auto | CPU_CORES_IN_CLUSTER=CPU_CORES * WORKERNODE_NUM hibench.yarn.executor.num=CPU_CORES_IN_CLUSTER / hibench.yarn.executor.cores |
| spark.executor.memory | auto | MEM_TOTAL_IN_CLUSTER=MEM_TOTAL * WORKERNODE_NUM spark.executor.memory=MEM_TOTAL_IN_CLUSTER / hibench.yarn.executor.num * 80% |
| spark.executor.memoryOverhead | auto | spark.executor.memoryOverhead=MEM_TOTAL_IN_CLUSTER / hibench.yarn.executor.num * 20% |
| spark.driver.memory | auto | if MEM_TOTAL > 32 spark.driver.memory=20; if MEM_TOTAL > 8, spark.driver.memory=8; else spark.driver.memory=2 |
| spark.default.parallelism | auto | spark.default.parallelism = CPU_CORES_IN_CLUSTER |
| spark.sql.shuffle.partitions | auto | spark.sql.shuffle.partitions= CPU_CORES_IN_CLUSTER |
| hibench.spark.master | yarn |
4. Yarn settings¶
| Parameters | Default | Algorithm | Comments |
|---|---|---|---|
| yarn.scheduler.minimum-allocation-mb | 1024 | user do not change | |
| yarn.scheduler.maximum-allocation-mb | auto | MEM_TOTAL * 1024 | |
| yarn.scheduler.minimum-allocation-vcores | 1 | user do not change | |
| yarn.scheduler.maximum-allocation-vcores | auto | CPU_CORES | |
| yarn.nodemanager.vmem-pmem-ratio | 2.1 | user do not change | |
| yarn.nodemanager.resource.percentage-physical-cpu-limit | 100 | user do not change | |
| yarn.nodemanager.resource.memory-mb | auto | MEM_TOTAL * 1024 | |
| yarn.nodemanager.resource.cpu-vcores | auto | CPU_CORES | |
| yarn.resourcemanager.scheduler.client.thread-count | 50 | user do not change |
5. MapReduce settings¶
| Parameters | Default |
|---|---|
| mapreduce.map.cpu.vcores | 1 |
| mapreduce.reduce.cpu.vcores | 1 |
| mapreduce.task.io.sort.factor | 64 |
| mapreduce.task.io.sort.mb | 512 |
| mapreduce.map.sort.spill.percent | 0.8 |
| mapreduce.job.reduce.slowstart.completedmaps | 1 |
| mapreduce.job.reduce.slowstart.completedmaps | org.apache.hadoop.io.compress.SnappyCodec |
6. DFS settings¶
| Parameters | Default | Comments |
|---|---|---|
| dfs.namenode.handler.count | 30 | user do not change |
| dfs.datanode.handler.count | 30 | user do not change |
| dfs.blocksize | 128m | user do not change |
CSP Settings¶
For cloud tests, data directory is mounted on OS disk by default. If additional storage space is required, set ENABLE_MOUNT_DIR to true to enable DISK_SPEC_1 and attach additional disks to workers; Default MOUNT_DIR is /mnt/disk{N}, N is the number of disks. Please keep DISK_COUNT the same value as <CSP>_DISK_SPEC_1_DISK_COUNT, and CLIENT_INSTANCE_TYPE the same as WORKER_INSTANCE_TYPE for cloud tests.
Suggestion forDISK_COUNT is minimum 3 to maximum 8.
| Parameters | Comments |
|---|---|
| GCP_CONTROLLER_OS_DISK_SIZE | Controller OS disk size |
| GCP_CLIENT_OS_DISK_SIZE | Client OS disk size |
| GCP_WORKER_OS_DISK_SIZE | Worker OS disk size |
| GCP_CONTROLLER_OS_DISK_TYPE | Controller OS disk type |
| GCP_CLIENT_OS_DISK_TYPE | Client OS disk type |
| GCP_WORKER_OS_DISK_TYPE | Worker OS disk type |
| GCP_DISK_SPEC_1_DISK_COUNT | Number of disks (apart from OS disk) that will be attached to Worker |
| GCP_DISK_SPEC_1_DISK_SIZE | Disk size that will be attached to Worker |
| GCP_DISK_SPEC_1_DISK_TYPE | Disk type that will be attached to Worker |
| ENABLE_MOUNT_DIR | Enable/Disable disk attach to Worker |
| GCP_CLIENT_INSTANCE_TYPE | Client instance type |
| GCP_WORKER_INSTANCE_TYPE | Worker instance type |
| GCP_ZONE | |
| DISK_COUNT | Number of disks (apart from OS disk) expected to be attached to Worker |
General Settings¶
| Parameters | Comments |
|---|---|
| WORKERNODE_NUM | Number of worker nodes |
| NODE_NUM | Total number of nodes (workers+client); for gated testcase, NODE_NUM=1 as client and worker are on the same node |
Test Config¶
All Tunable Parameters, CSP Settings, and General Settings can be specified by users through test config file and applied by ./ctest.sh -R testcase_name --config path_to_test_config.yaml -VV . Using test config file is helpful for users to reproduce test results and keep track of performance.
Please check whether the title of test-config.yaml matches with testcase name before starting a test. If user execute./ctest.sh -R test_gcp_hibench_kmeans_pkm -V with test title *gcp_hibench_kmeans_gated:, any config will have no effect on test.
Test configuration to saturate CLX/ICX/SPR/Milan high memory series on GCP are provided under test-config folder.
# test tile need to match with your testcase name
*gcp_hibench_kmeans_pkm:
# disk spec
GCP_CONTROLLER_OS_DISK_SIZE: 120
GCP_CLIENT_OS_DISK_SIZE: 120
GCP_WORKER_OS_DISK_SIZE: 120
GCP_CONTROLLER_OS_DISK_TYPE: pd-standard
GCP_CLIENT_OS_DISK_TYPE: pd-standard
GCP_WORKER_OS_DISK_TYPE: pd-ssd
GCP_DISK_SPEC_1_DISK_COUNT: 2
GCP_DISK_SPEC_1_DISK_SIZE: 120
GCP_DISK_SPEC_1_DISK_TYPE: pd-ssd
ENABLE_MOUNT_DIR: true
# DISK_COUNT must be equal to GCP_DISK_SPEC_1_DISK_COUNT
DISK_COUNT: 2
# platform
GCP_CLIENT_INSTANCE_TYPE: n2d-standard-8
GCP_WORKER_INSTANCE_TYPE: n2d-standard-8
GCP_ZONE: us-east4-c
# general config
WORKERNODE_NUM: 2
NODE_NUM: 3
# tunable parameters
HIBENCH_SCALE_PROFILE: huge
HIBENCH_KMEANS_NUM_OF_SAMPLES: 1200000
HIBENCH_KMEANS_SAMPLES_PER_INPUTFILE: 1000
HIBENCH_YARN_EXECUTOR_NUM: 40
SPARK_DRIVER_MEMORY: 18g
YARN_SCHEDULER_MAXIMUM_ALLOCATION_MB: 409600
DFS_NAMENODE_HANDLER_COUNT: 29
Docker Image¶
The workload defines two Docker images.
hibench-kmeans-workeris the base image with Hadoop setup and used for worker role.hibench-kmeans-clientis the client image based onhibench-kmeans-workerimage with HiBench tools setup
This workload must be run using Kubernetes. Worker pods are named with hadoop-hdfs-* and Client pod are named with hibench-benchmark-*
System Requirements¶
- Minimum 200GB per write per disk required
KPI¶
Run the kpi.sh script to parse the KPIs from the HiBench run report.
The following KPIs are exposed:
Duration (s): Total running duration.Throughput (bytes/s): Workload throughput.Throughput per node (bytes/s): Workload throughput divided by number of execution nodes.
Index Info¶
- Name:
HiBench-Kmeans - Category:
DataServices - Platform:
SPR,ICX,EMR,SRF - keywords:
- Permission:
- Supported Labels : HAS-SETUP-DISK-SPEC
See Also¶
Add commentMore actions - HiBench on GitHub