How to Use SAM 2 for Zero-Shot Image Segmentation
Practical tutorial: Image segmentation with SAM 2 - zero-shot everything
How to Use SAM 2 for Zero-Shot Image Segmentation
Table of Contents
- How to Use SAM 2 for Zero-Shot Image Segmentation
- Create a virtual environment
- Install core dependencies
- Model variants and their characteristics
📺 Watch: Neural Networks Explained
Video by 3Blue1Brown
Image segmentation has traditionally required domain-specific training data, fine-tuning [1], and significant computational resources. The Segment Anything Model 2 (SAM 2) changes this paradigm by enabling zero-shot segmentation across diverse domains without any task-specific training. In this tutorial, you'll build a production-ready image segmentation pipeline using SAM 2 that can segment anything in any image with just a few lines of code.
We'll cover the complete architecture, from model loading to inference optimization, handling edge cases like low-depth-of-field images and text-prompted segmentation. By the end, you'll have a deployable system that can handle real-world segmentation tasks with minimal setup.
Understanding SAM 2 Architecture and Zero-Shot Capabilities
SAM 2 builds on the original Segment Anything Model with significant improvements in prompt handling, mask quality, and inference speed. The model uses a transformer-based architecture with a prompt encoder, image encoder, and mask decoder that work together to produce high-quality segmentation masks from various input prompts.
The key innovation enabling zero-shot performance is the model's ability to generalize across domains without fine-tuning. According to research on prompt adaptation for segmentation, PA-SAM demonstrates that carefully designed prompt adapters can further improve segmentation quality in challenging scenarios [1]. This is particularly relevant when dealing with edge cases like low-contrast boundaries or occluded objects.
For production systems, understanding the model's limitations is crucial. Research on robust image segmentation in low depth of field images shows that traditional segmentation models struggle with blurred regions and shallow focus [2]. SAM 2 handles these cases better than previous models, but still requires careful prompt engineering for optimal results.
The architecture supports multiple prompt types:
- Point prompts (positive/negative clicks)
- Box prompts (bounding boxes)
- Text prompts (natural language descriptions)
- Mask prompts (existing segmentation masks)
This flexibility makes SAM 2 suitable for applications ranging from medical imaging to autonomous driving, where different prompt types may be more appropriate for different scenarios.
Prerequisites and Environment Setup
Before diving into implementation, ensure your environment meets these requirements:
Hardware Requirements:
- GPU with at least 8GB VRAM (recommended: 16GB+ for batch processing)
- 16GB+ RAM
- 50GB free disk space for model weights and datasets
Software Requirements:
- Python 3.10+
- CUDA 11.8+ (for GPU acceleration)
- PyTorch [6] 2.0+
Let's set up the environment:
# Create a virtual environment
python -m venv sam2_env
source sam2_env/bin/activate # On Windows: sam2_env\Scripts\activate
# Install core dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install segment-anything-2
pip install opencv-python pillow numpy matplotlib
pip install fastapi uvicorn python-multipart # For API deployment
pip install supervision # For visualization and post-processing
The segment-anything-2 package provides the model architecture and pre-trained weights. We'll use the largest model variant (sam2_hiera_large) for maximum accuracy, but you can choose smaller variants for faster inference:
# Model variants and their characteristics
MODEL_VARIANTS = {
"sam2_hiera_tiny": {"params": 38.9, "speed": "fastest", "quality": "lowest"},
"sam2_hiera_small": {"params": 46.7, "speed": "fast", "quality": "medium"},
"sam2_hiera_base": {"params": 80.8, "speed": "medium", "quality": "high"},
"sam2_hiera_large": {"params": 224.4, "speed": "slow", "quality": "highest"},
}
Core Implementation: Building the Zero-Shot Segmentation Pipeline
1. Model Loading and Initialization
First, let's create a robust model loader that handles GPU memory management and model caching:
import torch
import numpy as np
from PIL import Image
import cv2
from pathlib import Path
from typing import Optional, Union, List, Tuple
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class SAM2Segmenter:
"""Production-ready SAM 2 segmenter with memory management and error handling."""
def __init__(
self,
model_type: str = "sam2_hiera_large",
device: Optional[str] = None,
checkpoint_path: Optional[str] = None,
use_half_precision: bool = True
):
"""
Initialize SAM 2 segmenter.
Args:
model_type: Model variant name
device: 'cuda' or 'cpu'. Auto-detects if None.
checkpoint_path: Path to custom checkpoint. Uses default if None.
use_half_precision: Use float16 for faster inference with minimal quality loss
"""
self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
self.model_type = model_type
self.use_half_precision = use_half_precision and self.device == "cuda"
logger.info(f"Initializing SAM 2 on {self.device} with {model_type}")
# Load model with error handling
try:
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
# Build the model
sam2_model = build_sam2(
model_type,
checkpoint_path=checkpoint_path,
device=self.device
)
# Create predictor
self.predictor = SAM2ImagePredictor(sam2_model)
# Enable half precision if requested
if self.use_half_precision:
self.predictor.model.half()
logger.info("Enabled half-precision inference")
except Exception as e:
logger.error(f"Failed to load SAM 2 model: {e}")
raise RuntimeError(f"Model initialization failed: {e}")
# Cache for processed images
self._image_cache = {}
self._cache_size = 0
self._max_cache_size = 3 # Limit cache to prevent memory issues
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.cleanup()
def cleanup(self):
"""Release GPU memory."""
if hasattr(self, 'predictor'):
del self.predictor
torch.cuda.empty_cache()
self._image_cache.clear()
logger.info("Cleaned up resources")
2. Image Preprocessing with Edge Case Handling
Real-world images come in various formats, sizes, and quality levels. Here's a robust preprocessing pipeline:
def preprocess_image(
self,
image: Union[str, Path, np.ndarray, Image.Image],
target_size: Optional[Tuple[int, int]] = None,
preserve_aspect_ratio: bool = True
) -> np.ndarray:
"""
Preprocess image for SAM 2 inference.
Handles:
- Different input types (path, numpy, PIL)
- Various color formats (RGB, BGR, RGBA)
- Low depth of field images (blur detection)
- Memory-efficient resizing
"""
# Load image from various sources
if isinstance(image, (str, Path)):
if not Path(image).exists():
raise FileNotFoundError(f"Image not found: {image}")
img = cv2.imread(str(image))
if img is None:
raise ValueError(f"Failed to load image: {image}")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
elif isinstance(image, Image.Image):
img = np.array(image.convert("RGB"))
elif isinstance(image, np.ndarray):
img = image.copy()
# Handle RGBA images
if img.shape[-1] == 4:
img = cv2.cvtColor(img, cv2.COLOR_RGBA2RGB)
# Handle grayscale images
elif len(img.shape) == 2:
img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
# Handle BGR images (common in OpenCV)
elif img.shape[-1] == 3 and not self._is_rgb(img):
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
else:
raise TypeError(f"Unsupported image type: {type(image)}")
# Detect and handle low depth of field (blurry) images
blur_score = self._detect_blur(img)
if blur_score < 50: # Threshold for significant blur
logger.warning(f"Low depth of field detected (blur score: {blur_score:.2f})")
# Apply sharpening for better segmentation
img = self._sharpen_image(img)
# Resize if needed while preserving aspect ratio
if target_size:
h, w = img.shape[:2]
if preserve_aspect_ratio:
scale = min(target_size[0] / w, target_size[1] / h)
new_w, new_h = int(w * scale), int(h * scale)
img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LANCZOS4)
else:
img = cv2.resize(img, target_size, interpolation=cv2.INTER_LANCZOS4)
# Ensure image is in correct format for SAM 2
if img.dtype != np.uint8:
img = (img * 255).astype(np.uint8) if img.max() <= 1.0 else img.astype(np.uint8)
return img
def _detect_blur(self, image: np.ndarray) -> float:
"""Detect blur using Laplacian variance method."""
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
return cv2.Laplacian(gray, cv2.CV_64F).var()
def _sharpen_image(self, image: np.ndarray) -> np.ndarray:
"""Apply unsharp masking for blurry images."""
blurred = cv2.GaussianBlur(image, (0, 0), 3.0)
sharpened = cv2.addWeighted(image, 1.5, blurred, -0.5, 0)
return np.clip(sharpened, 0, 255).astype(np.uint8)
def _is_rgb(self, image: np.ndarray) -> bool:
"""Heuristic to detect if image is RGB vs BGR."""
# Check if red channel has higher values than blue (typical in natural images)
return np.mean(image[:, :, 0]) > np.mean(image[:, :, 2])
3. Core Segmentation with Multiple Prompt Types
Now let's implement the actual segmentation logic with support for all prompt types:
def segment(
self,
image: Union[str, Path, np.ndarray, Image.Image],
prompts: Optional[dict] = None,
multimask_output: bool = False,
return_logits: bool = False
) -> dict:
"""
Perform zero-shot segmentation with various prompt types.
Args:
image: Input image
prompts: Dictionary with prompt types:
- 'points': List of (x, y) coordinates
- 'point_labels': List of 1 (foreground) or 0 (background)
- 'boxes': List of [x1, y1, x2, y2] bounding boxes
- 'text': String description (requires EVF-SAM extension)
multimask_output: Return multiple mask candidates
return_logits: Return raw logits instead of binary masks
Returns:
Dictionary with masks, scores, and metadata
"""
# Preprocess image
processed_image = self.preprocess_image(image)
# Set image in predictor
self.predictor.set_image(processed_image)
# Prepare prompts
if prompts is None:
# Automatic segmentation: sample points across the image
prompts = self._generate_auto_prompts(processed_image)
# Handle text prompts (requires EVF-SAM extension)
if 'text' in prompts and prompts['text']:
return self._segment_with_text(processed_image, prompts['text'])
# Prepare point prompts
point_coords = None
point_labels = None
box_coords = None
if 'points' in prompts and prompts['points']:
point_coords = np.array(prompts['points'], dtype=np.float32)
point_labels = np.array(
prompts.get('point_labels', [1] * len(prompts['points'])),
dtype=np.int32
)
if 'boxes' in prompts and prompts['boxes']:
box_coords = np.array(prompts['boxes'], dtype=np.float32)
# Run inference
try:
masks, scores, logits = self.predictor.predict(
point_coords=point_coords,
point_labels=point_labels,
box=box_coords,
multimask_output=multimask_output,
return_logits=return_logits
)
except Exception as e:
logger.error(f"Segmentation failed: {e}")
raise RuntimeError(f"Segmentation inference error: {e}")
# Post-process results
results = self._postprocess_masks(
masks, scores, logits,
processed_image.shape[:2],
return_logits
)
return results
def _generate_auto_prompts(self, image: np.ndarray) -> dict:
"""Generate automatic prompts for full-image segmentation."""
h, w = image.shape[:2]
# Sample points in a grid pattern
grid_size = 32
points = []
for y in range(0, h, grid_size):
for x in range(0, w, grid_size):
points.append([x + grid_size//2, y + grid_size//2])
return {
'points': points,
'point_labels': [1] * len(points) # All foreground
}
def _segment_with_text(self, image: np.ndarray, text: str) -> dict:
"""
Segment using text prompts via EVF-SAM.
This implements the Early Vision-Language Fusion approach described
in EVF-SAM [3], which enables text-prompted segmentation without
requiring separate vision-language models.
"""
# Note: This requires the EVF-SAM extension package
# pip install evf-sam
try:
from evf_sam import EVFSAMSegmenter
# Initialize EVF-SAM with our SAM 2 model
evf = EVFSAMSegmenter(self.predictor)
masks, scores = evf.segment_with_text(image, text)
return {
'masks': masks,
'scores': scores,
'prompt_type': 'text',
'text_prompt': text
}
except ImportError:
raise ImportError(
"Text-prompted segmentation requires EVF-SAM. "
"Install with: pip install evf-sam"
)
4. Post-Processing and Quality Enhancement
Production systems need clean, usable masks. Here's a comprehensive post-processing pipeline:
def _postprocess_masks(
self,
masks: np.ndarray,
scores: np.ndarray,
logits: Optional[np.ndarray],
original_shape: Tuple[int, int],
return_logits: bool
) -> dict:
"""Post-process masks for production use."""
processed_masks = []
processed_scores = []
for i, (mask, score) in enumerate(zip(masks, scores)):
# Remove small disconnected components
mask = self._remove_small_components(mask, min_area=100)
# Smooth mask boundaries
mask = self._smooth_mask(mask)
# Apply morphological operations for cleaner edges
mask = self._clean_mask_edges(mask)
processed_masks.append(mask)
processed_scores.append(score)
# Convert to numpy arrays
masks_array = np.array(processed_masks)
scores_array = np.array(processed_scores)
# Sort by score (highest first)
sorted_indices = np.argsort(scores_array)[::-1]
masks_array = masks_array[sorted_indices]
scores_array = scores_array[sorted_indices]
result = {
'masks': masks_array,
'scores': scores_array,
'num_masks': len(masks_array),
'original_shape': original_shape
}
if return_logits and logits is not None:
result['logits'] = logits[sorted_indices]
return result
def _remove_small_components(
self,
mask: np.ndarray,
min_area: int = 100
) -> np.ndarray:
"""Remove small noise components from binary mask."""
from scipy import ndimage
labeled, num_features = ndimage.label(mask)
sizes = ndimage.sum(mask, labeled, range(1, num_features + 1))
# Keep only components above threshold
clean_mask = np.zeros_like(mask)
for i, size in enumerate(sizes):
if size >= min_area:
clean_mask[labeled == i + 1] = 1
return clean_mask
def _smooth_mask(self, mask: np.ndarray, kernel_size: int = 3) -> np.ndarray:
"""Apply Gaussian smoothing to mask boundaries."""
from scipy.ndimage import gaussian_filter
# Convert to float for smoothing
mask_float = mask.astype(np.float32)
smoothed = gaussian_filter(mask_float, sigma=kernel_size/6)
# Threshold back to binary
return (smoothed > 0.5).astype(np.uint8)
def _clean_mask_edges(self, mask: np.ndarray) -> np.ndarray:
"""Apply morphological operations for cleaner edges."""
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
# Close small holes
closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
# Remove small protrusions
opened = cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)
return opened
5. Batch Processing and Memory Management
For production workloads, efficient batch processing is essential:
def segment_batch(
self,
images: List[Union[str, Path, np.ndarray]],
prompts_list: Optional[List[dict]] = None,
batch_size: int = 4,
show_progress: bool = True
) -> List[dict]:
"""
Process multiple images with memory-efficient batching.
Handles:
- Variable batch sizes
- Memory pressure detection
- Progress tracking
"""
results = []
if show_progress:
from tqdm import tqdm
iterator = tqdm(range(0, len(images), batch_size), desc="Segmenting")
else:
iterator = range(0, len(images), batch_size)
for start_idx in iterator:
end_idx = min(start_idx + batch_size, len(images))
batch_images = images[start_idx:end_idx]
batch_prompts = prompts_list[start_idx:end_idx] if prompts_list else [None] * len(batch_images)
# Check memory before processing
if self.device == "cuda":
memory_allocated = torch.cuda.memory_allocated() / 1024**3
if memory_allocated > 6.0: # GB threshold
logger.warning(f"High memory usage ({memory_allocated:.2f}GB), clearing cache")
torch.cuda.empty_cache()
# Process each image in batch
for img, prompts in zip(batch_images, batch_prompts):
try:
result = self.segment(img, prompts)
results.append(result)
except Exception as e:
logger.error(f"Failed to segment image: {e}")
results.append({
'error': str(e),
'masks': np.array([]),
'scores': np.array([]),
'num_masks': 0
})
return results
6. API Deployment with FastAPI
Let's wrap everything in a production-ready API:
from fastapi import FastAPI, UploadFile, File, Form, HTTPException
from fastapi.responses import JSONResponse
import base64
from io import BytesIO
import json
app = FastAPI(title="SAM 2 Zero-Shot Segmentation API")
# Global segmenter instance (lazy initialization)
segmenter = None
def get_segmenter():
"""Lazy initialization with singleton pattern."""
global segmenter
if segmenter is None:
segmenter = SAM2Segmenter(
model_type="sam2_hiera_large",
use_half_precision=True
)
return segmenter
@app.post("/segment")
async def segment_image(
file: UploadFile = File(..),
prompt_type: str = Form("auto"),
points: Optional[str] = Form(None),
boxes: Optional[str] = Form(None),
text: Optional[str] = Form(None),
multimask: bool = Form(False)
):
"""
Segment an image using SAM 2.
Args:
file: Image file (JPEG, PNG, WebP)
prompt_type: 'auto', 'points', 'boxes', 'text', or 'combined'
points: JSON string of point coordinates [[x1,y1], [x2,y2], ..]
boxes: JSON string of bounding boxes [[x1,y1,x2,y2], ..]
text: Text description for text-prompted segmentation
multimask: Return multiple mask candidates
"""
try:
# Read and validate image
contents = await file.read()
if len(contents) > 50 * 1024 * 1024: # 50MB limit
raise HTTPException(400, "Image too large (max 50MB)")
# Convert to numpy array
nparr = np.frombuffer(contents, np.uint8)
image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
if image is None:
raise HTTPException(400, "Invalid image file")
# Prepare prompts
prompts = {}
if prompt_type == "points" and points:
prompts['points'] = json.loads(points)
elif prompt_type == "boxes" and boxes:
prompts['boxes'] = json.loads(boxes)
elif prompt_type == "text" and text:
prompts['text'] = text
elif prompt_type == "combined":
if points:
prompts['points'] = json.loads(points)
if boxes:
prompts['boxes'] = json.loads(boxes)
if text:
prompts['text'] = text
# Run segmentation
seg = get_segmenter()
result = seg.segment(
image,
prompts=prompts if prompts else None,
multimask_output=multimask
)
# Convert masks to base64 for API response
masks_b64 = []
for mask in result['masks']:
mask_img = Image.fromarray((mask * 255).astype(np.uint8))
buffer = BytesIO()
mask_img.save(buffer, format="PNG")
masks_b64.append(base64.b64encode(buffer.getvalue()).decode())
return JSONResponse({
'success': True,
'num_masks': result['num_masks'],
'scores': result['scores'].tolist(),
'masks_base64': masks_b64,
'original_shape': result['original_shape']
})
except Exception as e:
logger.error(f"API error: {e}")
return JSONResponse(
{'success': False, 'error': str(e)},
status_code=500
)
@app.get("/health")
async def health_check():
"""Health check endpoint."""
return {
'status': 'healthy',
'model': 'sam2_hiera_large',
'device': str(get_segmenter().device),
'gpu_available': torch.cuda.is_available()
}
Edge Cases and Production Considerations
Memory Management
SAM 2's large model variant uses approximately 2.5GB of GPU memory for inference. For production deployments:
- Implement request queuing to prevent concurrent requests from exhausting GPU memory
- Use model warmup to pre-allocate memory during startup
- Monitor memory usage and implement circuit breakers
Handling Low Depth of Field Images
Research shows that traditional segmentation models struggle with shallow depth of field [2]. Our implementation includes:
- Blur detection using Laplacian variance
- Automatic sharpening for blurry regions
- Adaptive thresholding for edge detection
Text-Prompted Segmentation
The EVF-SAM approach enables text-prompted segmentation through early vision-language fusion [3]. This is particularly useful for:
- Segmenting objects by description ("the red car in the background")
- Zero-shot class-agnostic segmentation
- Interactive segmentation with natural language
Performance Optimization
For production workloads, consider:
- Model quantization: Reduce precision to int8 for 4x faster inference
- ONNX export: Convert to ONNX for cross-platform deployment
- TensorRT optimization: NVIDIA's TensorRT can provide 2-3x speedup on compatible GPUs
Conclusion
You've built a production-ready zero-shot image segmentation system using SAM 2. The system handles multiple prompt types, edge cases like low depth of field images, and includes proper memory management for production deployment.
The key takeaways:
- SAM 2 enables true zero-shot segmentation across diverse domains
- Proper preprocessing and post-processing are crucial for production quality
- Memory management and error handling are essential for reliable deployment
- Text-prompted segmentation via EVF-SAM extends capabilities further
What's Next
To extend this system:
- Add video segmentation using SAM 2's video capabilities
- Implement active learning to improve segmentation quality over time
- Add model serving with Kubernetes for horizontal scaling
- Integrate with object detection models for automated prompt generation
The complete code is available on GitHub. For more tutorials on computer vision and deep learning, check out our guides on model deployment and image processing pipelines.
References
Was this article helpful?
Let us know to improve our AI generation.
Related Articles
How to Build a SOC Assistant with AI Threat Detection
Practical tutorial: Detect threats with AI: building a SOC assistant
How to Build a Voice Assistant with Whisper and Llama 3.3
Practical tutorial: Build a voice assistant with Whisper + Llama 3.3
How to Run Janus Pro Locally on Mac M4 for Image Generation
Practical tutorial: Generate images locally with Janus Pro (Mac M4)