
How to Deploy an ML Model on Hugging Face Spaces with GPU

Practical tutorial: Deploy an ML model on Hugging Face Spaces with GPU

IA Academy Blog · April 20, 2026 · 6 min read · 1,161 words
This article was generated by Daily Neural Digest's autonomous neural pipeline: multi-source verified, fact-checked, and quality-scored.

📺 Watch: Neural Networks Explained (video by 3Blue1Brown)

Introduction & Architecture

Deploying machine learning models is a critical step in transitioning from research and development to practical applications. This tutorial focuses on deploying a model using Hugging Face's Spaces, which offers a straightforward way to host your models online. The use of GPUs can significantly enhance the performance of inference tasks, especially for large language models or computationally intensive tasks.

Hugging Face Spaces runs your application inside Docker containers on Hugging Face's own infrastructure. This setup allows you to run complex applications with minimal overhead. For this tutorial, we will deploy a pre-trained model from Hugging Face's Model Hub in a GPU-enabled environment. The architecture involves writing a Dockerfile that specifies the necessary dependencies and configuration, then pushing the project to a Space, which builds and serves the container.

Prerequisites & Setup

To follow along with this tutorial, you need Python installed on your machine (version 3.8 or higher is recommended). You should also have Docker installed if you want to build and test the container image locally, and the Hugging Face CLI (installed with the huggingface_hub package) for interacting with the Model Hub and Spaces.

Required Packages

The following packages are necessary for setting up our environment:

  • transformers: The Hugging Face library that provides pre-trained models.
  • torch: The deep learning framework used to run the model.
  • huggingface_hub: To interact with the Model Hub and Spaces (includes the huggingface-cli tool).

Install these packages using pip:

pip install transformers torch huggingface_hub

Creating a New Repository on Hugging Face

Before deploying your model, create a new Space on Hugging Face and select Docker as the Space SDK. This Space will serve as the base for your deployment. Make sure you are authenticated: generate an access token in your Hugging Face account settings and follow the login instructions in the Hugging Face documentation.

Core Implementation: Step-by-Step

The core of our implementation involves creating a Dockerfile and writing a Python script that loads and runs an inference task on the model. We will use a pre-trained language model from Hugging Face's Model Hub as an example.

Step 1: Create a Dockerfile

Create a Dockerfile in your project directory to define the environment for running your application. The following is a basic example:

# Use the official Python image as the base image.
FROM python:3.8-slim

# Set the working directory inside the container.
WORKDIR /app

# Copy the current directory contents into the container at /app.
COPY . /app

# Install the packages listed in requirements.txt.
RUN pip install --no-cache-dir -r requirements.txt

# Spaces routes traffic to port 7860 by default.
EXPOSE 7860

# Run app.py when the container launches.
CMD ["python", "app.py"]
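The Dockerfile installs from a requirements.txt file that your project directory must contain, although the tutorial never shows one. A minimal version covering the packages used here might look like this (unpinned for brevity; in production you would pin exact versions):

```text
transformers
torch
huggingface_hub
```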

Step 2: Write Your Application Code (app.py)

In app.py, you will load your model and define an endpoint for inference. Here is a basic example using a pre-trained BERT model:

from transformers import BertTokenizer, BertModel
import torch

# Load the tokenizer and model from the Hugging Face Model Hub.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Move the model to the GPU once at startup, if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict(input_text):
    # Tokenize the input text and move the tensors to the model's device.
    inputs = tokenizer(input_text, return_tensors='pt', truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Run inference without tracking gradients.
    with torch.no_grad():
        output = model(**inputs).last_hidden_state

    return output

if __name__ == "__main__":
    input_text = "Hello, how are you?"
    print(predict(input_text))
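Note that last_hidden_state is a per-token tensor of shape (batch, sequence_length, hidden_size), not a single sentence vector. A common next step is masked mean pooling: average the token vectors where the attention mask is 1. The sketch below uses plain Python lists in place of tensors to show the arithmetic without requiring torch; the function name is illustrative.

```python
def mean_pool(hidden_states, attention_mask):
    """Masked mean pooling: average the token vectors whose mask bit is 1.

    hidden_states: list of token vectors (seq_len x hidden_dim),
    attention_mask: list of 0/1 ints of length seq_len.
    These are plain-list stand-ins for the tensors torch would give you.
    """
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    count = 0
    for vec, mask in zip(hidden_states, attention_mask):
        if mask:
            count += 1
            for i, v in enumerate(vec):
                totals[i] += v
    return [t / count for t in totals]

# Two real tokens plus one padding position that must be ignored:
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))  # [2.0, 3.0]
```

With real tensors, the same idea is a masked sum along the sequence dimension divided by the mask's token count.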

Step 3: Push Your Project to the Space

Unlike a plain Docker registry, you do not push a built image to Spaces. Instead, you push your source files (Dockerfile, app.py, requirements.txt) to the Space's git repository, and Spaces builds and runs the image for you. First, ensure you have logged in:

huggingface-cli login

Then create a Docker Space (if you have not already) and push your files:

huggingface-cli repo create my_spaces_app --type space --space_sdk docker
git clone https://huggingface.co/spaces/hf_username/my_spaces_app
cd my_spaces_app
# copy Dockerfile, app.py, and requirements.txt into this directory, then:
git add .
git commit -m "Initial deployment"
git push

Replace hf_username with your actual Hugging Face username.

Configuration & Production Optimization

To optimize the deployment for production, consider the following configurations:

Batching and Asynchronous Processing

For large-scale applications, batching requests together can significantly improve throughput, since each model call amortizes its fixed overhead across many inputs (at the cost of slightly higher per-request latency). Handling requests asynchronously with libraries like asyncio or aiohttp can further enhance performance.
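The batching idea can be sketched as simple request chunking: collect incoming texts into fixed-size groups so that each group becomes a single model call. The function name and batch size below are illustrative, not part of any Spaces API.

```python
from typing import List

def make_batches(texts: List[str], batch_size: int = 8) -> List[List[str]]:
    """Split incoming texts into fixed-size batches, one model call per batch."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

requests = [f"request {i}" for i in range(10)]
batches = make_batches(requests, batch_size=4)
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch would then go through the tokenizer with padding=True and a single forward pass, instead of one pass per request.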

Hardware Optimization (GPU/CPU)

Ensure your base image includes CUDA support if you plan to run on a GPU, and select a GPU tier (for example, a T4) under your Space's hardware settings:

# Use the official CUDA runtime image as the base image.
FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04

# Install Python and the required packages.
RUN apt-get update && \
    apt-get install -y python3-pip && \
    python3 -m pip install --no-cache-dir transformers torch

Environment Variables

Use environment variables to manage configurations such as model paths, API keys, or other settings that might change between environments.
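A minimal sketch of this pattern in app.py, reading settings from the environment with safe defaults; the variable names here are assumptions for illustration, not a Spaces convention:

```python
import os

# Settings that may differ between development and production.
# Each os.environ.get falls back to a default when the variable is unset.
MODEL_NAME = os.environ.get("MODEL_NAME", "bert-base-uncased")
MAX_INPUT_CHARS = int(os.environ.get("MAX_INPUT_CHARS", "2000"))
DEVICE_PREFERENCE = os.environ.get("DEVICE_PREFERENCE", "cuda")

print(MODEL_NAME, MAX_INPUT_CHARS, DEVICE_PREFERENCE)
```

On Spaces, secrets such as API keys should be set through the Space's settings rather than committed to the repository.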

Advanced Tips & Edge Cases (Deep Dive)

When deploying models in production, several edge cases and challenges must be addressed:

Error Handling

Implement robust error handling mechanisms to catch exceptions during inference. This is crucial for maintaining service availability and reliability.

import logging

try:
    output = model(**inputs).last_hidden_state
except Exception as e:
    logging.exception("Inference failed")
    # Surface a clear error to the caller instead of crashing the service.
    raise RuntimeError(f"Inference error: {e}") from e

Security Risks (Prompt Injection)

If your model involves user-generated input, be cautious of prompt injection attacks. Validate and sanitize all inputs before processing.
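A basic input-hygiene sketch under those assumptions: cap the input length and strip control characters before the text reaches the model. This does not fully prevent prompt injection; it only removes the easiest vectors, and the limit and ASCII-only filter here are illustrative choices (a multilingual service would need a gentler filter).

```python
import re

MAX_INPUT_CHARS = 2000  # illustrative limit; tune for your model's context size

def sanitize_input(text: str) -> str:
    """Reject oversized input and drop control characters (keeps newlines/tabs)."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    # Keep printable ASCII plus newline and tab; drop everything else.
    return re.sub(r"[^\x20-\x7E\n\t]", "", text)

print(sanitize_input("Hello\x00 world"))  # Hello world
```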

Scaling Bottlenecks

Monitor resource usage to identify potential bottlenecks. Use tools like Prometheus for monitoring CPU/GPU utilization and response times.
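As a starting point before wiring up Prometheus, you can time the inference path yourself and export the numbers later. A minimal sketch with a context manager; the labels and the dict-based storage are local illustrations, not a Prometheus client API.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, log: dict):
    """Record the wall-clock duration of a block into `log` under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log[label] = time.perf_counter() - start

metrics = {}
with timed("inference", metrics):
    time.sleep(0.01)  # stand-in for a model call
print(f"inference took {metrics['inference']:.4f}s")
```

In production, the recorded durations would feed a histogram or gauge in a real metrics client instead of a local dict.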

Results & Next Steps

By following this tutorial, you have successfully deployed an ML model on Hugging Face Spaces with GPU support. This setup provides a robust foundation for serving models in production environments.

What's Next?

  • Monitoring and Logging: Integrate logging services like Sentry or Logz.io to monitor application performance.
  • Scaling Up: Explore Kubernetes for managing multiple instances of your service, especially if you expect high traffic.
  • Model Updates: Automate the process of updating your model as new versions become available.
