Running LangChain and an LLM on AMD ROCm

System Requirements

  • An AMD GPU that supports ROCm (check the compatibility list on the docs.amd.com page)
  • A Linux-based operating system, preferably Ubuntu 20.04 or 22.04
  • Conda and Docker environments
  • Python 3.6 or higher

For ROCm installation, refer to this link

https://rocm.docs.amd.com/en/latest/

For Docker runtime installation, refer to this link

https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository

For Conda environment installation (inside the container), refer to this link

https://vegastack.com/tutorials/how-to-install-anaconda-on-ubuntu-22-04/

Current Test System

In this guide I use the following spec:

  • AMD Instinct MI210 GPU
  • ROCm 6.0
$ apt show rocm-libs -a
Package: rocm-libs
Version: 6.0.0.60000-91~20.04
  • Ubuntu 20.04
$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal
  • Docker 25
$ docker -v
Docker version 25.0.1, build 29cf629

Using Docker Image for ROCm

Now that ROCm is installed on the host OS, we can run a container with a specific ROCm, Python, and PyTorch version.
We use the -d -it options to keep the container running so we can work inside it.
Change --shm-size to the amount of system memory this container is allowed to use.

sudo docker run -d -it \
--network=host \
--device=/dev/kfd \
--device=/dev/dri \
--ipc=host \
--shm-size 32G \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--workdir=/dockerx \
--name=llm-with-langchain-rocm \
-v $HOME/dockerx:/dockerx rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 /bin/bash

For other image versions and pre-installed software, refer to Docker Hub.

https://hub.docker.com/r/rocm/pytorch/tags

$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3e88bafb5844 rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 "/bin/bash" 11 seconds ago Up 11 seconds llm-with-langchain-rocm

Go inside the container and verify the Docker image

$ sudo docker exec -it 3e88bafb5844 /bin/bash
root@amdserver:/dockerx#
root@amdserver:/dockerx# conda --version
conda 23.7.4
root@amdserver:/dockerx# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
root@amdserver:/dockerx# apt show rocm-libs -a
Package: rocm-libs
Version: 5.7.0.50700-63~22.04
root@amdserver:/dockerx# python --version
Python 3.10.13

Now we create and use a conda environment

conda create --name env_llm
conda init

Log back in to the console and activate the environment

conda activate env_llm  
   
(env_llm) root@amdserver:/dockerx# 

Install the basic libraries for LangChain and the LLM

pip install langchain openai tiktoken transformers accelerate cohere

**We are using these versions for the installation; you can pin to them if you hit issues with other versions.
(env_llm) root@amdserver:/dockerx# python --version
Python 3.10.13
(env_llm) root@amdserver:/dockerx# pip --version
pip 24.0 from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/pip (python 3.10)
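Optionally, run a quick sanity check that PyTorch inside the container can see the AMD GPU; on ROCm builds of PyTorch the GPU is exposed through the regular torch.cuda API:

python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"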

Langchain LLM Script

Use this sample script to test that everything works.

llm.py

##Import Library
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.prompts import PromptTemplate
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from getpass import getpass
import warnings
import torch
import gc


def run_myllm():
    ##Template prompt for chain
    template = """Question: {question}
    Answer: Let's think step by step."""
    prompt = PromptTemplate.from_template(template)

    ##Model used. We use Yi-6B since it is lightweight; later we will try Mixtral. These models can be found on the Hugging Face website
    model_id = "01-ai/Yi-6B"
    #model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
    
    ##Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_id
    )

    ##AutoModel
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto"  ##Use device_map="cpu" to run on CPU only
    )
    
    ##Combine tokenizer and automodel to pipeline
    pipe = pipeline(
        "text-generation", 
        model=model, 
        tokenizer=tokenizer,
        repetition_penalty=1.2,
        top_p=0.4,
        temperature=0.4,
        max_new_tokens=1000
    )
    gpu_llm = HuggingFacePipeline(pipeline=pipe)

    ##Finally combine them all with prompt
    gpu_chain = prompt | gpu_llm
    question = "Write a report the life of Thomas Jefferson and a separate report John Hopkins. Each report must be at least 1000 words. The report must be complete. Provide a critique of the report and out line areas of improvements. Based on the critique of the report, rewrite the report based on the critique. Note format the output so that there are no more then 20 words per-line. Do not break words across lines. Complete this task in full"
    print(gpu_chain.invoke({"question": question}),end="")
    print("\n----------------------\n")

if __name__ == '__main__':
    warnings.filterwarnings("ignore")

    run_myllm()

This script will take some time to download the model from Hugging Face.
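Run it from inside the conda environment:

python llm.py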

Result:

Quantizing a large model

Generally, a model with many parameters needs a large amount of GPU memory. But if you want to test a large LLM on a GPU with limited memory, you can quantize it.

The error when we force a large model into limited GPU memory can look like the following:

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 172.00 MiB (GPU 0; 63.98 GiB total capacity; 12.58 GiB already allocated; 0 bytes free; 12.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
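Before reaching for quantization, you can try the allocator hint mentioned in the error message; this only helps with fragmentation, not with a model that is genuinely too large for the GPU (the 512 MB split size below is just an example value):

export PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:512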

To check the required memory for a specific model, you can use the accelerate “estimate-memory” command:

(env_llm) root@amdserver:/dockerx/llm# accelerate estimate-memory 01-ai/Yi-6B
Loading pretrained config for 01-ai/Yi-6B from transformers…
┌────────────────────────────────────────────────────┐
│ Memory Usage for loading 01-ai/Yi-6B               │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│ 1000.0 MB   │ 21.73 GB │ 86.91 GB          │
│float16│ 500.0 MB    │ 10.86 GB │ 43.46 GB          │
│ int8  │ 250.0 MB    │ 5.43 GB  │ 21.73 GB          │
│ int4  │ 125.0 MB    │ 2.72 GB  │ 10.86 GB          │
└───────┴─────────────┴──────────┴───────────────────┘
*GPU memory required: 21.73 GB (float32)

(env_llm) root@amdserver:/dockerx/llm# accelerate estimate-memory mistralai/Mixtral-8x7B-Instruct-v0.1
Loading pretrained config for mistralai/Mixtral-8x7B-Instruct-v0.1 from transformers…
┌──────────────────────────────────────────────────────────────────────┐
│ Memory Usage for loading mistralai/Mixtral-8x7B-Instruct-v0.1        │
├───────┬─────────────┬──────────┬─────────────────────────────────────┤
│ dtype │Largest Layer│Total Size│ Training using Adam                 │
├───────┼─────────────┼──────────┼─────────────────────────────────────┤
│float32│ 5.44 GB     │174.49 GB │ 697.97 GB                           │
│float16│ 2.72 GB     │ 87.25 GB │ 348.99 GB                           │
│ int8  │ 1.36 GB     │ 43.62 GB │ 174.49 GB                           │
│ int4  │ 696.02 MB   │ 21.81 GB │ 87.25 GB                            │
└───────┴─────────────┴──────────┴─────────────────────────────────────┘
*GPU memory required: 174.49 GB (float32)
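As a rough rule of thumb, the totals in these tables are simply parameter count × bytes per parameter; the small sketch below (using an approximate 6B parameter count) reproduces the Yi-6B numbers to within a few percent:

##Rough rule of thumb: model memory ≈ parameter count x bytes per parameter
params = 6e9                                                    ##~6B parameters (Yi-6B)
bytes_per = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}
for dtype, nbytes in bytes_per.items():
    ##float16 -> ~11 GiB, close to the 10.86 GB reported above
    print(f"{dtype}: ~{params * nbytes / 2**30:.1f} GiB")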

Some of the pros of quantized models are as follows:

  1. Reduced Memory Footprint
  2. Faster Inference
  3. Deployment Flexibility
  4. Compatibility with Hardware Acceleration

While some of the cons are as follows:

  1. Decreased Model Accuracy
  2. Quantized models have a limited dynamic range due to the reduced number of available values to represent numbers.
  3. Quantization introduces quantization error, which is the discrepancy between the original floating-point values and their quantized representations.
  4. Training quantized LLM models can be more challenging compared to full-precision models.
  5. Developing and optimizing quantized LLM models can be more complex and time-consuming compared to full-precision models.

Generally, we can use the bitsandbytes Python library for quantization, but since upstream bitsandbytes requires CUDA, we cannot use it as-is on our hardware.

However, there is a ROCm port of bitsandbytes here:

https://git.ecker.tech/mrq/bitsandbytes-rocm

We will try it. First, clone it:

git clone https://git.ecker.tech/mrq/bitsandbytes-rocm.git

Compile and install it inside the conda environment

cd bitsandbytes-rocm
make hip ROCM_TARGET=gfx90a
pip install .

For other devices, you can find the appropriate ROCM_TARGET=gfx… value at this link:

https://www.llvm.org/docs/AMDGPUUsage.html#processors
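If you are not sure which gfx target your GPU uses, one way to check from inside the container (assuming rocminfo is available, as it is in the ROCm images) is:

rocminfo | grep gfx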

Verify it

python -m bitsandbytes


COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['9.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Running a quick check that:
    + library is importable
    + CUDA function is callable


WARNING: Please be sure to sanitize sensible info from any such env vars!

SUCCESS!
Installation was successful!

Now we can write a bitsandbytes quantization script:
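Below is a minimal sketch of 8-bit loading, assuming the ROCm bitsandbytes build compiled above works with the transformers BitsAndBytesConfig integration (feature coverage depends on the fork's version); the generation parameters are illustrative only.

##Minimal 8-bit loading sketch (assumes the ROCm bitsandbytes build works with
##transformers' BitsAndBytesConfig integration)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

model_id = "01-ai/Yi-6B"

##Load weights in 8-bit via bitsandbytes, roughly halving GPU memory vs float16
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200
)
print(pipe("Question: What is quantization?\nAnswer:")[0]["generated_text"])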

Alternatively, for model quantization (reducing model memory use), we can use AutoGPTQ.

https://github.com/AutoGPTQ/AutoGPTQ

https://huggingface.github.io/autogptq-index/whl/rocm573/

Install the auto-gptq library

pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
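Below is a minimal sketch of running a pre-quantized GPTQ checkpoint with auto-gptq; the model repository name is only an example of a GPTQ-quantized model on Hugging Face, so substitute whichever quantized checkpoint you want to test. On ROCm builds of PyTorch, the AMD GPU is still addressed as "cuda:0".

##Minimal AutoGPTQ inference sketch (the repo below is an example GPTQ checkpoint)
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  ##example pre-quantized model

tokenizer = AutoTokenizer.from_pretrained(model_id)
##device="cuda:0" targets the AMD GPU on ROCm builds of PyTorch
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")

pipe = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipe("Question: Why quantize an LLM?\nAnswer:", max_new_tokens=200)[0]["generated_text"])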
