Chainguard Container for text-generation-inference

Chainguard Containers are regularly updated, secure-by-default container images.

Download this Container Image

For those with access, this container image is available on cgr.dev:

docker pull cgr.dev/ORGANIZATION/text-generation-inference:latest

Be sure to replace the ORGANIZATION placeholder with the name used for your organization's private repository within the Chainguard Registry.
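
If your local Docker client is not yet authenticated to the Chainguard Registry, one option (assuming you have chainctl installed and are signed in to your organization) is to configure Docker to use chainctl as a credential helper:

chainctl auth configure-docker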

Compatibility Notes

The text-generation-inference image is based on Hugging Face's Text Generation Inference (TGI) toolkit. This Chainguard image provides the same functionality as the upstream TGI container, with the following key differences and shared capabilities:

  • Like all Chainguard Images, this image features a minimal design with only essential dependencies
  • Built on Wolfi/Chainguard OS with few-to-zero CVEs
  • Regular automated builds ensure the image stays up-to-date with security patches
  • Includes the same high-performance features: Flash Attention, Paged Attention, tensor parallelism, and continuous batching
  • Compatible with popular open-source LLMs including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5
  • Provides OpenAI-compatible API endpoints for easy integration

These differences provide enhanced security and maintainability while preserving full compatibility with TGI's production-ready capabilities for serving Large Language Models.

Prerequisites

To use this image effectively, you should have:

  • Docker installed on your system (with GPU support configured if using GPU acceleration)
  • Access to Hugging Face models (either public or private repositories)
  • Sufficient disk space for model storage (models are downloaded on first run)
  • For GPU workloads: NVIDIA GPU with appropriate drivers and NVIDIA Container Toolkit

For more information on TGI capabilities and requirements, refer to the official Hugging Face TGI documentation.
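
If you plan to use GPU acceleration, a quick sanity check (assuming the NVIDIA drivers and NVIDIA Container Toolkit are already installed on the host) is to confirm that the host sees the GPU and that Docker has the NVIDIA runtime registered:

nvidia-smi
docker info | grep -i runtimes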

Getting Started

The text-generation-inference image provides a production-ready server for deploying Large Language Models. The following examples demonstrate common usage patterns.

Basic Usage

Start by checking the available options:

docker run --rm cgr.dev/ORGANIZATION/text-generation-inference --help

Verify the installed version:

docker run --rm cgr.dev/ORGANIZATION/text-generation-inference --version

Serving a Model

To start a TGI server with a Hugging Face model, run:

docker run -d \
  --name tgi-server \
  -p 8080:80 \
  -v $PWD/data:/data \
  cgr.dev/ORGANIZATION/text-generation-inference \
  --model-id microsoft/DialoGPT-medium \
  --num-shard 1

This command starts the server in the background, maps port 8080 on your host to port 80 in the container, and mounts a local data directory for model caching.
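
Model weights are downloaded into /data on first run, so startup can take several minutes. You can follow progress in the container logs and wait for TGI's /health endpoint to return a 200 status before sending requests:

docker logs -f tgi-server

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health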

Making API Requests

Once the server is running, you can interact with it through multiple API endpoints. TGI provides OpenAI-compatible endpoints for easy integration with existing tools:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": false
  }'

The server responds with a JSON object containing the generated text. For streaming responses, set "stream": true in the request.
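
For example, the same chat request with streaming enabled returns tokens incrementally as server-sent events (the -N flag disables curl's output buffering):

curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": true
  }'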

Using with Python

You can interact with the server using the Hugging Face huggingface_hub library:

cat > tgi_client.py <<EOF
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080")

response = client.text_generation(
    prompt="What is the capital of France?",
    max_new_tokens=50
)

print(response)
EOF

Run the Python script:

python tgi_client.py
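
The same generation request can also be sent directly to TGI's native /generate endpoint with curl:

curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 50}
  }'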

GPU Acceleration

For production workloads requiring GPU acceleration, use the --gpus flag:

docker run -d \
  --gpus all \
  --name tgi-server-gpu \
  -p 8080:80 \
  -v $PWD/data:/data \
  cgr.dev/ORGANIZATION/text-generation-inference \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --num-shard 2

This distributes the model across multiple GPUs using tensor parallelism for improved performance with larger models.

Configuration

The text-generation-inference image can be configured through command-line arguments and environment variables. This section demonstrates a common production configuration for serving a model with specific resource constraints.

Command-Line Arguments

Key configuration options include:

  • --model-id: Hugging Face Hub model identifier (e.g., meta-llama/Llama-2-7b-chat-hf)
  • --num-shard: Number of GPU shards for tensor parallelism (default: 1)
  • --port: Server listening port (default: 80)
  • --max-concurrent-requests: Maximum number of concurrent requests (default: 128)
  • --max-input-length: Maximum input token length (default: 1024)
  • --max-total-tokens: Maximum total tokens including input and output (default: 2048)
  • --max-batch-prefill-tokens: Maximum tokens for prefill batching (default: 4096)
  • --max-batch-total-tokens: Maximum tokens for total batch (default: 16384)

Production Configuration Example

For a production deployment serving a Llama 2 model with custom resource limits:

docker run -d \
  --gpus all \
  --name tgi-production \
  -p 8080:80 \
  -v $PWD/models:/data \
  cgr.dev/ORGANIZATION/text-generation-inference \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --num-shard 2 \
  --max-concurrent-requests 64 \
  --max-input-length 2048 \
  --max-total-tokens 4096 \
  --max-batch-prefill-tokens 8192 \
  --max-batch-total-tokens 32768

This configuration distributes the model across 2 GPUs, limits concurrent requests to 64, and sets appropriate token limits for handling longer conversations. The model cache is persisted to the local models directory for faster subsequent startups.

Environment Variables

Alternatively, you can configure TGI using environment variables:

docker run -d \
  --gpus all \
  --name tgi-production \
  -p 8080:80 \
  -e MODEL_ID=meta-llama/Llama-2-7b-chat-hf \
  -e NUM_SHARD=2 \
  -e MAX_CONCURRENT_REQUESTS=64 \
  -e MAX_INPUT_LENGTH=2048 \
  -e MAX_TOTAL_TOKENS=4096 \
  -v $PWD/models:/data \
  cgr.dev/ORGANIZATION/text-generation-inference

This approach is useful for container orchestration platforms where environment variables are easier to manage than command-line arguments.
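
For example, the same configuration can be kept in an environment file and passed to Docker with the --env-file flag (the file name tgi.env below is arbitrary):

cat > tgi.env <<EOF
MODEL_ID=meta-llama/Llama-2-7b-chat-hf
NUM_SHARD=2
MAX_CONCURRENT_REQUESTS=64
MAX_INPUT_LENGTH=2048
MAX_TOTAL_TOKENS=4096
EOF

docker run -d \
  --gpus all \
  --name tgi-production \
  -p 8080:80 \
  --env-file tgi.env \
  -v $PWD/models:/data \
  cgr.dev/ORGANIZATION/text-generation-inference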

Documentation and Resources

For more information on working with Text Generation Inference and Large Language Models, refer to the following resources:

  • Hugging Face TGI documentation: https://huggingface.co/docs/text-generation-inference
  • TGI source repository: https://github.com/huggingface/text-generation-inference

What are Chainguard Containers?

The container images in Chainguard's free Starter tier are built with Wolfi, our minimal Linux undistro.

All other Chainguard Containers are built with Chainguard OS, Chainguard's minimal Linux operating system designed to produce container images that meet the requirements of a more secure software supply chain.

The main features of Chainguard Containers include:

  • Minimal design with only the packages an application needs at runtime
  • Regular automated builds that keep images up-to-date with security patches
  • Build-time SBOMs documenting the contents of each image
  • Verifiable provenance and signatures for the software supply chain

For cases where you need container images with shells and package managers to build or debug, most Chainguard Containers come paired with a development, or -dev, variant.

In all other cases, including Chainguard Containers tagged as :latest or with a specific version number, the container images include only an open-source application and its runtime dependencies. These minimal container images typically do not contain a shell or package manager.

Although the -dev container image variants have similar security features as their more minimal versions, they include additional software that is typically not necessary in production environments. We recommend using multi-stage builds to copy artifacts from the -dev variant into a more minimal production image.
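
As a rough illustration of that pattern (the -dev tag, file paths, and build step below are assumptions for the sketch, not part of this image's documentation), a multi-stage Dockerfile can prepare artifacts in the -dev variant and copy only the results into the minimal image:

cat > Dockerfile <<EOF
# Build stage: the -dev variant includes a shell and package manager
FROM cgr.dev/ORGANIZATION/text-generation-inference:latest-dev AS builder
# Hypothetical preparation step producing an artifact
RUN echo "example artifact" > /tmp/example-artifact

# Final stage: the minimal runtime image
FROM cgr.dev/ORGANIZATION/text-generation-inference:latest
# Copy only the prepared artifact from the build stage
COPY --from=builder /tmp/example-artifact /tmp/example-artifact
EOF

docker build -t tgi-custom .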

Need additional packages?

To improve security, Chainguard Containers include only essential dependencies. Need more packages? Chainguard customers can use Custom Assembly to add packages, either through the Console, chainctl, or API.

To use Custom Assembly in the Chainguard Console: navigate to the image you'd like to customize in your Organization's list of images, and click on the Customize image button at the top of the page.

Learn More

Refer to our Chainguard Containers documentation on Chainguard Academy. Chainguard also offers VMs and Libraries; contact us for access.

Trademarks

This software listing is packaged by Chainguard. The trademarks set forth in this offering are owned by their respective companies, and use of them does not imply any affiliation, sponsorship, or endorsement by such companies.

Licenses

Chainguard container images contain software packages that are direct or transitive dependencies. The following licenses were found in the "latest" tag of this image:

  • Apache-2.0
  • BSD-2-Clause
  • BSD-3-Clause
  • CC-BY-4.0
  • FTL
  • GCC-exception-3.1
  • GPL-2.0

For a complete list of licenses, please refer to this Image's SBOM.

Category
AI
