Chainguard Container for text-generation-inference

Chainguard Containers are regularly updated, secure-by-default container images.

Download this Container Image

For those with access, this container image is available on cgr.dev:

docker pull cgr.dev/ORGANIZATION/text-generation-inference:latest

Be sure to replace the ORGANIZATION placeholder with the name used for your organization's private repository within the Chainguard Registry.
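
If your local Docker client is not yet authenticated to the Chainguard Registry, one option (assuming you have chainctl installed and are signed in to your organization) is to configure Docker to use chainctl as a credential helper:

chainctl auth configure-docker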

Compatibility Notes

The text-generation-inference image is based on Hugging Face's Text Generation Inference (TGI) toolkit. This Chainguard image provides the same functionality as the upstream TGI container, with the following key differences and shared capabilities:

  • Like all Chainguard Images, this image features a minimal design with only essential dependencies
  • Built on Wolfi/Chainguard OS with few-to-zero CVEs
  • Regular automated builds ensure the image stays up-to-date with security patches
  • Includes the same high-performance features: Flash Attention, Paged Attention, tensor parallelism, and continuous batching
  • Compatible with popular open-source LLMs including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5
  • Provides OpenAI-compatible API endpoints for easy integration

These differences provide enhanced security and maintainability while preserving full compatibility with TGI's production-ready capabilities for serving Large Language Models.

Prerequisites

To use this image effectively, you should have:

  • Docker installed on your system (with GPU support configured if using GPU acceleration)
  • Access to Hugging Face models (either public or private repositories)
  • Sufficient disk space for model storage (models are downloaded on first run)
  • For GPU workloads: NVIDIA GPU with appropriate drivers and NVIDIA Container Toolkit

For more information on TGI capabilities and requirements, refer to the official Hugging Face TGI documentation.
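
If you plan to use GPU acceleration, a quick sanity check (assuming the NVIDIA drivers and NVIDIA Container Toolkit are already installed on the host) is to confirm that the host sees the GPU and that Docker has the NVIDIA runtime registered:

nvidia-smi
docker info | grep -i runtimes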

Getting Started

The text-generation-inference image provides a production-ready server for deploying Large Language Models. The following examples demonstrate common usage patterns.

Basic Usage

Start by checking the available options:

docker run --rm cgr.dev/ORGANIZATION/text-generation-inference --help

Verify the installed version:

docker run --rm cgr.dev/ORGANIZATION/text-generation-inference --version

Serving a Model

To start a TGI server with a Hugging Face model, run:

docker run -d \
  --name tgi-server \
  -p 8080:80 \
  -v $PWD/data:/data \
  cgr.dev/ORGANIZATION/text-generation-inference \
  --model-id microsoft/DialoGPT-medium \
  --num-shard 1

This command starts the server in the background, maps port 8080 on your host to port 80 in the container, and mounts a local data directory for model caching.
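
Model weights are downloaded into /data on first run, so startup can take several minutes. You can follow progress in the container logs and wait for TGI's /health endpoint to return a 200 status before sending requests:

docker logs -f tgi-server

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health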

Making API Requests

Once the server is running, you can interact with it through multiple API endpoints. TGI provides OpenAI-compatible endpoints for easy integration with existing tools:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": false
  }'

The server responds with a JSON object containing the generated text. For streaming responses, set "stream": true in the request.
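
For example, the same chat request with streaming enabled returns tokens incrementally as server-sent events (the -N flag disables curl's output buffering):

curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": true
  }'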

Using with Python

You can interact with the server using the Hugging Face huggingface_hub library:

cat > tgi_client.py <<EOF
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080")

response = client.text_generation(
    prompt="What is the capital of France?",
    max_new_tokens=50
)

print(response)
EOF

Run the Python script:

python tgi_client.py
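
The same generation request can also be sent directly to TGI's native /generate endpoint with curl:

curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 50}
  }'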

GPU Acceleration

For production workloads requiring GPU acceleration, use the --gpus flag:

docker run -d \
  --gpus all \
  --name tgi-server-gpu \
  -p 8080:80 \
  -v $PWD/data:/data \
  cgr.dev/ORGANIZATION/text-generation-inference \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --num-shard 2

This distributes the model across multiple GPUs using tensor parallelism for improved performance with larger models.

Configuration

The text-generation-inference image can be configured through command-line arguments and environment variables. This section demonstrates a common production configuration for serving a model with specific resource constraints.

Command-Line Arguments

Key configuration options include:

  • --model-id: Hugging Face Hub model identifier (e.g., meta-llama/Llama-2-7b-chat-hf)
  • --num-shard: Number of GPU shards for tensor parallelism (default: 1)
  • --port: Server listening port (default: 80)
  • --max-concurrent-requests: Maximum number of concurrent requests (default: 128)
  • --max-input-length: Maximum input token length (default: 1024)
  • --max-total-tokens: Maximum total tokens including input and output (default: 2048)
  • --max-batch-prefill-tokens: Maximum tokens for prefill batching (default: 4096)
  • --max-batch-total-tokens: Maximum tokens for total batch (default: 16384)

Production Configuration Example

For a production deployment serving a Llama 2 model with custom resource limits:

docker run -d \
  --gpus all \
  --name tgi-production \
  -p 8080:80 \
  -v $PWD/models:/data \
  cgr.dev/ORGANIZATION/text-generation-inference \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --num-shard 2 \
  --max-concurrent-requests 64 \
  --max-input-length 2048 \
  --max-total-tokens 4096 \
  --max-batch-prefill-tokens 8192 \
  --max-batch-total-tokens 32768

This configuration distributes the model across 2 GPUs, limits concurrent requests to 64, and sets appropriate token limits for handling longer conversations. The model cache is persisted to the local models directory for faster subsequent startups.

Environment Variables

Alternatively, you can configure TGI using environment variables:

docker run -d \
  --gpus all \
  --name tgi-production \
  -p 8080:80 \
  -e MODEL_ID=meta-llama/Llama-2-7b-chat-hf \
  -e NUM_SHARD=2 \
  -e MAX_CONCURRENT_REQUESTS=64 \
  -e MAX_INPUT_LENGTH=2048 \
  -e MAX_TOTAL_TOKENS=4096 \
  -v $PWD/models:/data \
  cgr.dev/ORGANIZATION/text-generation-inference

This approach is useful for container orchestration platforms where environment variables are easier to manage than command-line arguments.
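
For example, the same configuration can be kept in an environment file and passed to Docker with the --env-file flag (the file name tgi.env below is arbitrary):

cat > tgi.env <<EOF
MODEL_ID=meta-llama/Llama-2-7b-chat-hf
NUM_SHARD=2
MAX_CONCURRENT_REQUESTS=64
MAX_INPUT_LENGTH=2048
MAX_TOTAL_TOKENS=4096
EOF

docker run -d \
  --gpus all \
  --name tgi-production \
  -p 8080:80 \
  --env-file tgi.env \
  -v $PWD/models:/data \
  cgr.dev/ORGANIZATION/text-generation-inference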

Documentation and Resources

For more information on working with Text Generation Inference and Large Language Models, refer to the following resources:

  • Hugging Face TGI documentation: https://huggingface.co/docs/text-generation-inference
  • TGI source repository: https://github.com/huggingface/text-generation-inference

What are Chainguard Containers?

The container images in Chainguard's free Starter tier are built with Wolfi, our minimal Linux undistro.

All other Chainguard Containers are built with Chainguard OS, Chainguard's minimal Linux operating system designed to produce container images that meet the requirements of a more secure software supply chain.

The main features of Chainguard Containers include:

  • Minimal design with only the packages an application needs at runtime
  • Regular automated builds that keep images up-to-date with security patches
  • Build-time SBOMs documenting the contents of each image
  • Verifiable provenance and signatures for the software supply chain

For cases where you need container images with shells and package managers to build or debug, most Chainguard Containers come paired with a development, or -dev, variant.

In all other cases, including Chainguard Containers tagged as :latest or with a specific version number, the container images include only an open-source application and its runtime dependencies. These minimal container images typically do not contain a shell or package manager.

Although the -dev container image variants have similar security features as their more minimal versions, they include additional software that is typically not necessary in production environments. We recommend using multi-stage builds to copy artifacts from the -dev variant into a more minimal production image.
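
As a rough illustration of that pattern (the -dev tag, file paths, and build step below are assumptions for the sketch, not part of this image's documentation), a multi-stage Dockerfile can prepare artifacts in the -dev variant and copy only the results into the minimal image:

cat > Dockerfile <<EOF
# Build stage: the -dev variant includes a shell and package manager
FROM cgr.dev/ORGANIZATION/text-generation-inference:latest-dev AS builder
# Hypothetical preparation step producing an artifact
RUN echo "example artifact" > /tmp/example-artifact

# Final stage: the minimal runtime image
FROM cgr.dev/ORGANIZATION/text-generation-inference:latest
# Copy only the prepared artifact from the build stage
COPY --from=builder /tmp/example-artifact /tmp/example-artifact
EOF

docker build -t tgi-custom .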

Need additional packages?

To improve security, Chainguard Containers include only essential dependencies. Need more packages? Chainguard customers can use Custom Assembly to add packages, either through the Console, chainctl, or API.

To use Custom Assembly in the Chainguard Console: navigate to the image you'd like to customize in your Organization's list of images, and click on the Customize image button at the top of the page.

Learn More

Refer to our Chainguard Containers documentation on Chainguard Academy. Chainguard also offers VMs and Libraries; contact us for access.

Trademarks

This software listing is packaged by Chainguard. The trademarks set forth in this offering are owned by their respective companies, and use of them does not imply any affiliation, sponsorship, or endorsement by such companies.

Licenses

Chainguard container images contain software packages that are direct or transitive dependencies. The following licenses were found in the "latest" tag of this image:

  • Apache-2.0
  • BSD-2-Clause
  • BSD-3-Clause
  • CC-BY-4.0
  • FTL
  • GCC-exception-3.1
  • GPL-2.0

For a complete list of licenses, please refer to this Image's SBOM.

Category
AI
