LLMs-IPEX-Public¶
TL;DR: In the very beginning, please follow the terraform setup guide to setup terraform environment first.
Source: workload/LLMs-IPEX-Public/README.md
Note: The Workload Services Framework is a benchmarking framework and is not intended to be used for the deployment of workloads in production environments. It is recommended that users consider any adjustments which may be necessary for the deployment of these workloads in a production environment including those necessary for implementing software best practices for workload scalability and security.
Introduction This workload is targeting for Large Language Models (LLMs) benchmarking using Intel® Extension for PyTorch (IPEX) solution on Intel® Xeon® processors.
In the very beginning, please follow the terraform setup guide to setup terraform environment first.
Below is an execution example terraform static test:
Create a build dir
cd
mkdir -p
Note:
Precision woq_int4 does not support for chatglm2, chatglm3, flan-t5, baichuan-13b, baichuan2-7b, baichuan2-13b, mpt-30b. Precision woq_int8 does not support for mpt-30b. Precision static_int8 does not support for mistral-7b, mpt-30b, Qwen-7B, Qwen-14B. MODEL_PATH: Specify the root path which stores the pre-downloaded model files. Please use the huggingface model cache as the value of this parameter, e.g. /root/.cache/huggingface/hub. We highly recommand to use WSF's big image solution (pls refer to the steps in section Workload Execution). The default path is: /opt/dataset/chatglm2/6b. USE_DEEPSPEED: Specify whether using deepspeed. You can choose True or False. The default value is False. Note:
Only precisions bfloat16 and woq_int8 support DeepSpeed. THUDM/chatglm2-6b and THUDM/chatglm3-6b do not support DeepSpeed. meta-llama/Llama-2-7b-chat-hf and meta-llama/Llama-2-13b-chat-hf do not support accuracy+DeepSpeed. CORES_PER_INSTANCE: Specify how many cores will be used for single instance. The default value is number of cores per numa node. (This parameter only works when USE_DEEPSPEED=False) GREEDY: Specify whether to use greedy search. False: beam number is 4. True: beam number is 1. ONEDNN_VERBOSE: Specify if print the oneDNN information. 1 means on, 0 means off. The default value is 0. RANK_USE: Specify the socket you want to use for DeepSpeed: 0: Use numa nodes on 1st socket 1: Use numa nodes on 2nd socket all: Use all numa nodes in system. BENCHMARKING_TRACE: Specify when to stat/stop traces (emon, sar, vtune, ...). Default is True: True: Collect traces during benchmarking time. False: Collect traces during whole workload process, including preconditonal check, model loading, warmup and benchmarking. CSP VM image preparasion for large dataset and model See AI Setup for coresponding part
BM preparasion for large dataset and model See AI Setup for coresponding part
Docker Image The LLMs-IPEX-Public workload provides 2 docker images: llms-ipex-public-base and llms-ipex-public-inference-lite
build docker image from scrach do cmake and make to build a specific workload Please refer to cmake doc You can also run the workload using docker run directly, providing the set of environment variables described in the Parameters section as follows:
mkdir -p logs-llms-ipex-public_inference_latency id=$(docker run --network host --privileged -e MODE=latency -e WORKLOAD=llms_ipex_public -e PRECISION=bfloat16 -e NUMA_NODES_USE=0 -e FUNCTION=inference -e DATA_TYPE=real -e BATCH_SIZE=1 -e INPUT_TOKENS=32 -e OUTPUT_TOKENS=32 -e STEPS=20 -e GREEDY=False -e WARMUP_STEPS=2 -e MODEL_NAME=THUDM/chatglm2-6b -e MODEL_PATH=/opt/dataset/chatglm2/6b -e USE_DEEPSPEED=False -e ONEDNN_VERBOSE=0 -e TARGET_PLATFORM=SPR -e RANK_USE=all -e CORES_PER_INSTANCE= -v /opt/dataset/chatglm2/6b:/root/.cache/huggingface/hub --rm --detach llms-ipex-public-inference-lite:latest) docker exec $id cat /export-logs | tar xf - -C logs-llms-ipex-public_inference_latency docker rm -f $id KPI Run the list-kpi.sh script to parse the KPIs from the validation logs.
For example:
./list-kpi.sh --all logs-llms-ipex-public_inference_latency Please refer to AI for more KPI details.
Furthermore, in order to reduce the workload docker image size and save cloud execution time, Big image solution is implemented on this Workload.
This dataset name is build_dataset_ai_llama2-7b. The dataset will be located at /opt/dataset/llama2/7b, it will occupy 28G disk space. This dataset name is build_dataset_ai_llama2-13b. The dataset will be located at /opt/dataset/llama2/13b, it will occupy 50G disk space. This dataset name is build_dataset_ai_gptj-6b. The dataset will be located at /opt/dataset/gptj/6b, it will occupy 38G disk space.
Index Info Name: LLMs, IPEX, Pytorch Category: ML/DL/AI Platform: SPR, EMR, GNR Keywords: LLMs, BIGIMAGE, CPU