Run a Llama2 app
In this guide we create and deploy a llama2 inference server and expose an API to it. To run this example, follow these steps:
- Install the kraft CLI tool and a container runtime engine, e.g. Docker.
- Clone the examples repository and cd into the examples/llama2/ directory:
git clone https://github.com/kraftcloud/examples
cd examples/llama2/
Make sure to log into Unikraft Cloud by setting your token and a metro close to you. We use fra0 (Frankfurt, 🇩🇪) in this guide:
# Set Unikraft Cloud access token
export UKC_TOKEN=token
# Set metro to Frankfurt, DE
export UKC_METRO=fra0
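Before deploying, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch; `check_ukc_env` is a hypothetical helper written for this guide, not part of the kraft CLI, and it only checks that the variables are non-empty, not that the token is valid.

```shell
#!/bin/sh
# Hypothetical helper (not part of the kraft CLI): verify that the two
# environment variables used in this guide are set and non-empty.
check_ukc_env() {
    missing=0
    for name in UKC_TOKEN UKC_METRO; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            missing=1
        fi
    done
    # Returns 0 when both variables are set, 1 otherwise.
    return "$missing"
}
```

One way to use it is to chain it in front of the deploy command, for example `check_ukc_env && kraft cloud deploy -p 443:8080 -M 1024 .`, so the deploy only runs when both variables are present.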
When done, invoke the following command to deploy this application on Unikraft Cloud:
kraft cloud deploy -p 443:8080 -M 1024 .
Note that in this example we assign 1024 MiB (1 GiB) of memory with the -M flag. The amount required varies depending on the model (we cover how to deploy different models below).
The output shows the instance URL and other details:
[●] Deployed successfully!
 │
 ├────────── name: llama2-cl5bw
 ├────────── uuid: eddb16d4-44e7-48d6-a226-328a18745d13
 ├───────── state: running
 ├─────────── url: https://funky-rain-xds8dxbg.fra0.kraft.host
 ├───────── image: llama2@sha256:5af77e7381931c9f5b8f605789a238a64784b631d4b3308c5948b681c862f25a
 ├───── boot time: 38.29 ms
 ├──────── memory: 1024 MiB
 ├─────── service: funky-rain-xds8dxbg
 ├── private fqdn: llama2-cl5bw.internal
 ├──── private ip: 172.16.6.3
 └────────── args: 8080
In this case, the instance name is llama2-cl5bw and the URL is https://funky-rain-xds8dxbg.fra0.kraft.host. They are different for each run.
We can retrieve a story through the llama2 API endpoint:
curl -o - https://funky-rain-xds8dxbg.fra0.kraft.host/api/llama2
Once upon a time, there was a little girl named Lily. She loved to eat grapes. One day, she saw a big grape on the table. Lily wanted to eat it, but she was too small. She thought, "I will try to get it when no one is looking."
The next day, Lily saw a big rock near the tower. She thought, "Maybe I can move the rock." She tried to push the rock, but it was too heavy. Lily did not give up. She tried again and again. Finally, she had a big idea. She would use a long stick to push the rock.
Lily went to the tower and pushed the rock with the stick. The rock moved! She was so happy. She picked up the grape and said, "Thank you, Rock!" Lily learned that if you are persistent and try hard, you can do anything.
At any point in time, you can list information about the instance:
kraft cloud instance list
NAME          FQDN                                 STATE    CREATED AT    IMAGE                         MEMORY   ARGS  BOOT TIME
llama2-cl5bw  funky-rain-xds8dxbg.fra0.kraft.host  running  1 minute ago  llama2@sha256:5af77e73819...  1.0 GiB  8080  38286us
When done, you can remove the instance:
kraft cloud instance remove llama2-cl5bw
Customize your Application
To customize the application, update the files in the repository, listed below:
Kraftfile: the Unikraft Cloud specification
Dockerfile: the Docker-specified application filesystem
tokenizer.bin: the tokenizer used by the model
stories15M.bin: the LLM model
spec: v0.6
runtime: llama2:latest
rootfs: ./Dockerfile
cmd: ["8080"]
FROM alpine:3.14 as base
WORKDIR /
# Create a symlink for default model
RUN set -xe; \
    mkdir -p /models/stories15M && \
    ln -sfn /models/stories15M /models/DEFAULT
FROM scratch
COPY --from=base ./models /models
COPY ./stories15M.bin /models/stories15M/model.bin
COPY ./tokenizer.bin /models/stories15M/tokenizer.bin
Lines in the Kraftfile have the following roles:
- spec: v0.6: The current Kraftfile specification version is 0.6.
- runtime: llama2:latest: The Unikraft runtime kernel to use is llama2, at the latest tag.
- rootfs: ./Dockerfile: Build the application root filesystem using the Dockerfile.
- cmd: ["8080"]: Pass 8080 as the argument to the server, which is the port the service is exposed on.
Lines in the Dockerfile have the following roles:
- FROM alpine:3.14 as base: Build the filesystem from the alpine:3.14 image, to create a base image.
- COPY: Copy the model and tokenizer to the Docker filesystem (to /models).
The following options are available for customizing the application:
- You can replace the model with others, for example from Hugging Face.
- The tokenizer we took from here, but feel free to replace it.
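To add a second model next to stories15M, one option is to follow the same directory layout the Dockerfile already uses: /models/NAME/model.bin and /models/NAME/tokenizer.bin. The sketch below prints the matching COPY lines for a given model name. `model_copy_lines` is a hypothetical helper, stories42M.bin is only an example file name, and the idea that the server resolves a model by its directory name under /models is an assumption drawn from the existing layout, not something the guide states.

```shell
#!/bin/sh
# Hypothetical helper: print Dockerfile COPY lines that place a model under
# /models/<name>/, mirroring the layout used for stories15M in this guide.
# Assumption: the server looks up a model by its directory name in /models.
model_copy_lines() {
    name=${1:-}        # directory name under /models, e.g. stories42M
    model_file=${2:-}  # path to the model weights in the build context
    tok_file=${3:-}    # path to the tokenizer in the build context
    if [ -z "$name" ] || [ -z "$model_file" ] || [ -z "$tok_file" ]; then
        echo "usage: model_copy_lines NAME MODEL_FILE TOKENIZER_FILE" >&2
        return 1
    fi
    printf 'COPY %s /models/%s/model.bin\n' "$model_file" "$name"
    printf 'COPY %s /models/%s/tokenizer.bin\n' "$tok_file" "$name"
}
```

For example, `model_copy_lines stories42M ./stories42M.bin ./tokenizer.bin` prints two COPY lines that could be appended to the final stage of the Dockerfile. The /models/DEFAULT symlink would still point at stories15M unless the RUN step is changed as well, and a larger model will likely need more memory than the 1024 MiB assigned earlier.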
You can customize parameters for your story through a POST request on the same API endpoint. The following parameters are recognized:
prompt: seed the LLM with a specific string
model: use specific model instead of DEFAULT
temperature: valid range 0.0 - 1.0; 0.0 is deterministic, 1.0 is original (default 1.0)
topp: valid range 0.0 - 1.0; top-p in nucleus sampling; 1.0 = off, 0.9 works well, but slower (default 0.9)
For example:
curl -o - https://funky-rain-xds8dxbg.fra0.kraft.host/api/llama2 -d '{ "model": "stories15M", "temperature": 0.95, "topp": 0.8, "prompt": "There once was a monkey named Bobo." }'
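When the request body is assembled from variables, a small helper can check the documented ranges and escape the prompt before calling curl. This is a minimal sketch: `in_unit_range`, `json_escape` and `build_payload` are hypothetical names, the range checks are client-side only (the server may validate differently), and the escaping handles only backslashes and double quotes.

```shell
#!/bin/sh
# Hypothetical helpers for building the JSON body sent to /api/llama2.
# Field names and default values come from the parameter list in this guide.

# Succeeds when $1 is a plain decimal number between 0.0 and 1.0 inclusive.
in_unit_range() {
    awk -v v="$1" 'BEGIN {
        ok = (v ~ /^[0-9]+(\.[0-9]+)?$/) && ((v + 0) >= 0) && ((v + 0) <= 1)
        exit !ok
    }'
}

# Escape backslashes and double quotes so the string fits in a JSON value.
# Control characters such as newlines are not handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Usage: build_payload MODEL PROMPT [TEMPERATURE] [TOPP]
build_payload() {
    model=${1:-}
    prompt=${2:-}
    temperature=${3:-1.0}   # documented default
    topp=${4:-0.9}          # documented default
    if [ -z "$model" ]; then
        echo "usage: build_payload MODEL PROMPT [TEMPERATURE] [TOPP]" >&2
        return 1
    fi
    in_unit_range "$temperature" || { echo "temperature must be in 0.0 - 1.0" >&2; return 1; }
    in_unit_range "$topp" || { echo "topp must be in 0.0 - 1.0" >&2; return 1; }
    printf '{ "model": "%s", "temperature": %s, "topp": %s, "prompt": "%s" }\n' \
        "$(json_escape "$model")" "$temperature" "$topp" "$(json_escape "$prompt")"
}
```

The result can be passed straight to the same endpoint, for example `curl -o - https://funky-rain-xds8dxbg.fra0.kraft.host/api/llama2 -d "$(build_payload stories15M 'There once was a monkey named Bobo.' 0.95 0.8)"`. Keeping the validation in a separate function means an out-of-range value fails locally instead of producing a request the server might reject or silently clamp.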
Learn More
Use the --help option for detailed information on using Unikraft Cloud:
kraft cloud --help
Or visit the CLI Reference.