How to Deploy an ML Model on Hugging Face Spaces with GPU 2026
Table of Contents
- Introduction & Architecture
- Prerequisites & Setup
- Core Implementation: Step-by-Step
- Configuration & Production Optimization
- Advanced Tips & Edge Cases (Deep Dive)
- Results & Next Steps
Introduction & Architecture
Deploying machine learning models efficiently and securely is a critical aspect of modern AI development. In this tutorial, we will focus on deploying a pre-trained model from the Hugging Face Model Hub onto Hugging Face Spaces using a GPU for inference acceleration. This process involves several steps: setting up your environment, configuring the deployment settings, and ensuring optimal performance.
The architecture behind this approach leverages Docker containers to encapsulate the entire runtime environment, including dependencies and configuration files. By utilizing GPUs via Hugging Face's paid plans, we can significantly reduce latency and improve throughput for inference tasks. This setup is particularly beneficial for models that require substantial computational resources, such as large language models or transformer-based architectures.
As of 2026, deploying ML models on cloud platforms like Hugging Face Spaces has surged in popularity due to their ease of use and integration with other services. According to a study published on arXiv, "Exploring the Carbon Footprint of Hugging Face's ML Models: A Repository Mining Study," developers are increasingly aware of the environmental impact of deploying large models, which underscores the importance of optimizing resource usage.
Prerequisites & Setup
To follow this tutorial, you need Python installed on your machine along with huggingface_hub and transformers. These packages provide the essential functionality for interacting with Hugging Face's Model Hub and Spaces. Additionally, ensure that Docker is properly set up on your system, as it will be used to build the deployment container.
pip install huggingface_hub transformers docker-compose
These dependencies are preferred over alternatives such as TorchServe or SageMaker because of their seamless integration with Hugging Face's ecosystem and broad community support. Docker is chosen for its ability to package applications into lightweight, portable containers that run consistently across different environments.
Core Implementation: Step-by-Step
Step 1: Initialize the Project
First, create a new directory for your project and navigate into it:
mkdir hf_spaces_deploy && cd hf_spaces_deploy
In the following steps, you will define the runtime environment for deployment, including Python dependencies and GPU access, in a Dockerfile and a Docker Compose file.
Step 2: Define the Dockerfile
Create a Dockerfile in your project directory:
# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set environment variables; supply HF_HUB_TOKEN at runtime instead of
# baking the secret into the image
ENV PYTHONUNBUFFERED=1 \
    MODEL_NAME=<MODEL_NAME>

# Install dependencies
RUN pip install --upgrade pip && pip install huggingface_hub transformers torch

# Copy the application code into the container image
WORKDIR /app
COPY . /app

# Define the entrypoint script
CMD ["python", "run_inference.py"]
This Dockerfile sets up a Python environment with the necessary libraries and copies your local project files into the container. The HF_HUB_TOKEN variable is used for authentication when accessing private models or repositories; pass the token in at runtime rather than committing it to version control.
Step 3: Create an Entry Script
Create a script named run_inference.py to handle model loading and inference:
import os

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
model_name = os.environ["MODEL_NAME"]
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Move the model to the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def predict(text):
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1).cpu().numpy()
    return probabilities

if __name__ == "__main__":
    while True:
        text = input("Enter your sentence: ")
        print(predict(text))
This script loads the model and tokenizer from Hugging Face's Model Hub and performs inference on user-provided text. Note that the inputs and the model must be placed on the same device: moving only the inputs onto the GPU is not sufficient, so make sure the model is moved there as well, with a CPU fallback when no GPU is present.
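For intuition, the softmax applied to the logits can be reproduced in plain Python. This is a didactic sketch only; in the script above, torch.nn.functional.softmax does the same work on the GPU:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three example class logits -> probabilities summing to 1
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```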
Step 4: Configure Docker Compose
Create a docker-compose.yml file to define services:
version: '3'
services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      HF_HUB_TOKEN: ${HF_HUB_TOKEN}
This configuration builds the Docker image from your Dockerfile, exposes port 8000, and sets an environment variable for authentication.
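One detail worth noting: when you run this compose file locally, the container will not see a GPU unless you request one explicitly. A sketch of the service definition with a device reservation, assuming the NVIDIA Container Toolkit is installed on the host (on Spaces itself, GPU hardware is instead selected in the Space settings):

```yaml
services:
  app:
    build: .
    environment:
      HF_HUB_TOKEN: ${HF_HUB_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```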
Step 5: Deploy to Hugging Face Spaces
To deploy your application to Hugging Face Spaces, follow these steps:
- Create a new Space on the Hugging Face Hub and choose the Docker SDK.
- Push your project files, including the Dockerfile, to the Space's Git repository, or upload them with the huggingface_hub library; Spaces builds the image from your Dockerfile automatically.
- Configure environment variables (such as HF_HUB_TOKEN, stored as a Space secret) and the exposed port in the Space settings.
Configuration & Production Optimization
To optimize this setup for production, consider the following configurations:
- Batch Processing: Implement batch processing logic within your entry script to handle multiple requests at once, reducing overhead.
- Asynchronous Processing: Use asynchronous frameworks like fastapi served by uvicorn to manage concurrent requests efficiently.
- GPU Utilization Monitoring: Monitor GPU usage and adjust model parameters or batch sizes based on observed performance metrics.
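The batch-processing idea can be sketched independently of any model: queue incoming texts and split them into fixed-size batches, since Hugging Face tokenizers accept lists of strings. A minimal, framework-free version of the chunking step (names here are illustrative):

```python
def chunk(items, batch_size):
    """Split a list of items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Example: five queued requests processed in batches of two
queued = ["a", "b", "c", "d", "e"]
print(chunk(queued, 2))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Each resulting batch can then be passed to the tokenizer in one call, so the GPU processes several inputs per forward pass.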
# Example FastAPI app; serve it with: uvicorn run_inference:app --host 0.0.0.0 --port 8000
import os

import torch
from fastapi import FastAPI, Request
from transformers import AutoModelForSequenceClassification, AutoTokenizer

app = FastAPI()

model_name = os.environ["MODEL_NAME"]
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

@app.post("/predict/")
async def predict(request: Request):
    data = await request.json()
    text = data["text"]
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1).cpu().numpy()
    return {"probabilities": probabilities.tolist()}
Advanced Tips & Edge Cases (Deep Dive)
Error Handling
Implement robust error handling in your entry script to manage exceptions gracefully:
try:
    inputs = tokenizer(text, return_tensors="pt").to('cuda')
    outputs = model(**inputs)
except (RuntimeError, ValueError) as e:
    # Log the failure, then re-raise so the caller can return an error response
    print(f"Error processing input: {e}")
    raise
Security Risks
Be cautious of security risks such as prompt injection attacks. Ensure that user inputs are sanitized and validated before being passed to the model.
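What counts as valid input depends on the application, but a minimal sketch might reject empty or oversized inputs before they reach the model. The limit below is an illustrative assumption, not a Hugging Face requirement:

```python
MAX_INPUT_CHARS = 2000  # illustrative limit; tune to your model's context size

def sanitize(text):
    """Validate raw user input before passing it to the model."""
    if not isinstance(text, str):
        raise ValueError("input must be a string")
    text = text.strip()
    if not text:
        raise ValueError("input must not be empty")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input longer than {MAX_INPUT_CHARS} characters")
    return text

print(sanitize("  hello world  "))  # hello world
```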
Scaling Bottlenecks
Monitor CPU/GPU usage and adjust resource allocation accordingly. Consider implementing load balancing if your application scales horizontally across multiple instances.
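As an illustration of the load-balancing idea, round-robin selection over replicas can be sketched in a few lines. The replica addresses are hypothetical, and in practice a reverse proxy such as nginx would usually do this:

```python
from itertools import cycle

# Hypothetical replica endpoints behind a round-robin balancer
replicas = cycle(["replica-0:8000", "replica-1:8000", "replica-2:8000"])

def next_backend():
    """Pick the next replica in round-robin order."""
    return next(replicas)

# Four requests wrap around to the first replica again
print([next_backend() for _ in range(4)])
# ['replica-0:8000', 'replica-1:8000', 'replica-2:8000', 'replica-0:8000']
```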
Results & Next Steps
By following this tutorial, you have successfully deployed an ML model on Hugging Face Spaces with GPU support. This setup provides a robust foundation for serving models in production environments while optimizing performance and minimizing costs.
Next steps could include:
- Monitoring: Set up monitoring tools to track application performance.
- Scalability: Explore auto-scaling options based on demand.
- Security Enhancements: Implement additional security measures like rate limiting or IP whitelisting.
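As a concrete sketch of one such enhancement, rate limiting can be prototyped in-process with a token bucket. This is an illustrative class, not a Hugging Face API; production deployments usually rely on a gateway or middleware instead:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows `rate` requests per second,
    with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)
results = [bucket.allow() for _ in range(3)]
print(results)  # burst of two allowed, third rejected: [True, True, False]
```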