How to Generate Production Code with GPT-4o
Practical tutorial: Using GPT-4o for advanced code generation
The gap between a working prototype and a production deployment is easy to underestimate. GPT-4o can generate syntactically correct code in seconds, but that code often fails under real-world conditions: it misses error handling, ignores rate limits, or leaks memory. In this tutorial, you'll learn a systematic approach to GPT-4o code generation that aims for production-ready output. We'll build a microservice that generates, validates, and serves Python functions, incorporating self-verification techniques inspired by recent research on code-based verification in large language models.
Understanding the Code Generation Pipeline Architecture
Before writing any code, we need to understand why naive GPT-4o prompts fail in production. The core issue is that GPT-4o, like all language models, generates text that looks like code but carries none of the structural guarantees required for reliable execution. The arXiv paper "Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification" reports that adding verification loops significantly improves output reliability. We'll apply similar principles to code generation.
Our architecture consists of three layers:
- Prompt Engineering Layer: Structures requests with explicit constraints, type signatures, and test cases
- Generation and Validation Layer: Executes generated code in sandboxed environments, checks for syntax errors, runtime exceptions, and logical correctness
- Deployment Layer: Packages validated code with proper error handling, logging, and monitoring
The key insight is that GPT-4o should never be the final arbiter of code quality. Instead, we use it as a generator within a larger validation framework. This mirrors the approach used in JaCoText, a pretrained model for Java code-text generation described in another ArXiv paper, which emphasizes the importance of structured generation pipelines.
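The generate-then-validate idea can be sketched without any API calls. The example below is a minimal illustration, not the service built later: `validate`, `generate_validated`, and `fake_generate` are hypothetical names, the model is replaced by a stub that returns canned candidates, and `exec()` runs in-process only because the inputs here are trusted.

```python
import ast
from typing import Callable, Optional


def validate(source: str, function_name: str, test_cases: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the candidate passed."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    namespace: dict = {}
    # Illustration only: untrusted code belongs in a sandbox, not in-process exec().
    exec(source, namespace)
    func = namespace.get(function_name)
    if not callable(func):
        return [f"function {function_name!r} not defined"]
    problems = []
    for case in test_cases:
        try:
            actual = func(**case["input"])
        except Exception as exc:
            problems.append(f"{case['input']}: raised {type(exc).__name__}")
            continue
        if actual != case["expected"]:
            problems.append(f"{case['input']}: expected {case['expected']}, got {actual}")
    return problems


def generate_validated(
    generate: Callable[[int], str],
    function_name: str,
    test_cases: list[dict],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask the generator for candidates until one passes validation."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if not validate(candidate, function_name, test_cases):
            return candidate
    return None


# Stand-in for the model: the first candidate is wrong, the second is right.
CANDIDATES = [
    "def add(x, y):\n    return x - y\n",
    "def add(x, y):\n    return x + y\n",
]


def fake_generate(attempt: int) -> str:
    return CANDIDATES[min(attempt, len(CANDIDATES) - 1)]


accepted = generate_validated(
    fake_generate, "add", [{"input": {"x": 2, "y": 3}, "expected": 5}]
)
```

The generator never decides whether its own output is acceptable; only the validator does, which is the property the rest of the tutorial builds on.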
Prerequisites and Environment Setup
We'll build this system using Python 3.11+, FastAPI for the API layer, and Docker for sandboxed execution. You'll need:
- Python 3.11 or higher
- Docker installed and running
- OpenAI API key with GPT-4o access
- Basic familiarity with async Python
Set up your environment:
# Create and activate virtual environment
# (named .venv so it does not collide with the gpt4o-codegen project directory created below)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install core dependencies
pip install openai==1.12.0 fastapi==0.109.0 uvicorn==0.27.0 pydantic==2.5.3
pip install pydantic-settings # imported by app/config.py; left unpinned here
pip install docker==7.0.0 pytest==8.0.0 httpx==0.26.0
# For code analysis
pip install pylint==3.0.3 mypy==1.8.0 black==24.1.1
Create your project structure:
mkdir gpt4o-codegen && cd gpt4o-codegen
mkdir -p app/{routers,services,models,sandbox}
touch app/__init__.py app/main.py app/config.py
touch app/routers/__init__.py app/routers/generation.py
touch app/services/__init__.py app/services/generator.py app/services/validator.py
touch app/models/__init__.py app/models/schemas.py
touch app/sandbox/__init__.py app/sandbox/executor.py
Building the Core Generation Service
The heart of our system is the generation service, which constructs structured prompts and processes GPT-4o responses. We'll implement a prompt template system that enforces production constraints.
# app/services/generator.py
import json
import logging
from typing import Optional
from openai import AsyncOpenAI
from pydantic import BaseModel, Field
logger = logging.getLogger(__name__)
class CodeGenerationRequest(BaseModel):
    """Structured request for code generation."""
    task_description: str = Field(..., min_length=10, max_length=2000)
    input_types: dict = Field(default_factory=dict)
    output_type: str = "Any"
    constraints: list[str] = Field(default_factory=list)
    test_cases: list[dict] = Field(default_factory=list)
    max_retries: int = Field(default=3, ge=1, le=10)


class GeneratedCode(BaseModel):
    """Validated generated code output."""
    source_code: str
    function_name: str
    imports: list[str]
    type_annotations: dict
    test_results: list[dict]
    is_valid: bool = False
class GPT4oCodeGenerator:
    """
    Production-grade code generator using GPT-4o with self-verification.
    Implements the verification loop described in ArXiv research on
    code-based self-verification for LLMs.
    """

    def __init__(self, api_key: str, model: str = "gpt-4o-2024-11-20"):
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model
        self.system_prompt = self._build_system_prompt()

    def _build_system_prompt(self) -> str:
        """Construct the system prompt with production constraints."""
        return """You are an expert Python developer generating production-ready code.
CRITICAL RULES:
1. Always include complete type annotations for all functions and parameters
2. Add comprehensive docstrings following Google style
3. Include error handling for all edge cases
4. Never use bare except clauses
5. Always validate input parameters
6. Add logging for debugging
7. Include unit tests in the response
8. Never generate code that could cause infinite loops
9. Always close file handles and network connections
10. Use async/await for I/O operations when appropriate
Output format: Return a JSON object with keys:
- "source_code": The complete Python function
- "imports": List of required imports
- "function_name": The main function name
- "test_cases": List of test cases with expected outputs
"""
    async def generate(self, request: CodeGenerationRequest) -> GeneratedCode:
        """
        Generate code with self-verification loop.
        Implements the verification strategy from the GPT-4 Code Interpreter paper.
        """
        for attempt in range(request.max_retries):
            try:
                # Build the user prompt with constraints
                user_prompt = self._build_user_prompt(request)

                # Generate initial code
                response = await self.client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "system", "content": self.system_prompt},
                        {"role": "user", "content": user_prompt}
                    ],
                    temperature=0.2,  # Lower temperature for more deterministic output
                    max_tokens=2000,
                    response_format={"type": "json_object"}
                )

                # Parse the response
                parsed = json.loads(response.choices[0].message.content)

                # Validate the generated code structure
                if not self._validate_structure(parsed):
                    logger.warning(f"Attempt {attempt + 1}: Invalid structure, retrying...")
                    continue

                # Extract and validate the code
                source_code = parsed["source_code"]
                function_name = parsed["function_name"]
                imports = parsed.get("imports", [])

                # Perform static analysis
                static_errors = await self._static_analysis(source_code)
                if static_errors:
                    logger.warning(f"Attempt {attempt + 1}: Static analysis failed: {static_errors}")
                    continue

                # Run test cases if provided
                test_results = []
                if request.test_cases:
                    test_results = await self._run_tests(source_code, function_name, request.test_cases)
                    if not all(t.get("passed", False) for t in test_results):
                        logger.warning(f"Attempt {attempt + 1}: Tests failed, retrying...")
                        continue

                return GeneratedCode(
                    source_code=source_code,
                    function_name=function_name,
                    imports=imports,
                    type_annotations=self._extract_types(source_code),
                    test_results=test_results,
                    is_valid=True
                )
            except Exception as e:
                logger.error(f"Attempt {attempt + 1} failed: {str(e)}")
                if attempt == request.max_retries - 1:
                    raise

        raise RuntimeError(f"Failed to generate valid code after {request.max_retries} attempts")
    def _build_user_prompt(self, request: CodeGenerationRequest) -> str:
        """Build a structured user prompt with all constraints."""
        prompt_parts = [
            f"Generate a Python function that: {request.task_description}",
            f"\nInput types: {json.dumps(request.input_types, indent=2)}",
            f"Output type: {request.output_type}",
        ]
        if request.constraints:
            prompt_parts.append("\nConstraints:")
            for constraint in request.constraints:
                prompt_parts.append(f"- {constraint}")
        if request.test_cases:
            prompt_parts.append("\nTest cases to pass:")
            for tc in request.test_cases:
                prompt_parts.append(f"- Input: {tc.get('input')}, Expected: {tc.get('expected')}")
        return "\n".join(prompt_parts)

    async def _static_analysis(self, source_code: str) -> list[str]:
        """Run static analysis tools on generated code."""
        errors = []
        # Check syntax
        try:
            compile(source_code, '<generated>', 'exec')
        except SyntaxError as e:
            errors.append(f"Syntax error: {str(e)}")
        # Check for common anti-patterns
        dangerous_patterns = [
            ("eval(", "Use of eval() detected - security risk"),
            ("exec(", "Use of exec() detected - security risk"),
            ("__import__", "Dynamic import detected - security risk"),
            ("pickle.loads", "Unsafe deserialization detected"),
        ]
        for pattern, message in dangerous_patterns:
            if pattern in source_code:
                errors.append(message)
        return errors

    def _validate_structure(self, parsed: dict) -> bool:
        """Validate the response structure from GPT-4o."""
        required_keys = ["source_code", "function_name", "imports"]
        return all(key in parsed for key in required_keys)
    def _extract_types(self, source_code: str) -> dict:
        """Extract type annotations from generated code."""
        types = {}
        import ast
        try:
            tree = ast.parse(source_code)
            for node in ast.walk(tree):
                if isinstance(node, ast.FunctionDef):
                    types[node.name] = {
                        "args": [
                            (arg.arg, ast.dump(arg.annotation) if arg.annotation else "Any")
                            for arg in node.args.args
                        ],
                        "return": ast.dump(node.returns) if node.returns else "Any"
                    }
        except SyntaxError:
            pass
        return types
    async def _run_tests(self, source_code: str, function_name: str, test_cases: list[dict]) -> list[dict]:
        """
        Execute test cases in a sandboxed environment.
        This is a simplified version - production would use Docker containers.
        """
        results = []
        namespace = {}
        try:
            exec(source_code, namespace)
            func = namespace.get(function_name)
            if not func:
                results.append({"error": f"Function {function_name} not found", "passed": False})
                return results
            for tc in test_cases:
                try:
                    input_data = tc.get("input")
                    expected = tc.get("expected")
                    # Handle both positional and keyword arguments
                    if isinstance(input_data, dict):
                        result = func(**input_data)
                    elif isinstance(input_data, (list, tuple)):
                        result = func(*input_data)
                    else:
                        result = func(input_data)
                    passed = result == expected
                    results.append({
                        "input": input_data,
                        "expected": expected,
                        "actual": result,
                        "passed": passed
                    })
                except Exception as e:
                    results.append({
                        "input": tc.get("input"),
                        "expected": tc.get("expected"),
                        "error": str(e),
                        "passed": False
                    })
        except Exception as e:
            results.append({"error": f"Code execution failed: {str(e)}", "passed": False})
        return results
This service implements several production-critical features:
- Structured prompting with explicit constraints and type information
- Self-verification loops that retry generation when validation fails
- Static analysis to catch syntax errors and security issues
- Test execution to verify logical correctness
- Type extraction for documentation and API generation
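The static analysis step in the listing matches substrings, so it would also flag a string literal or comment that merely contains `eval(`. A possible refinement is to inspect call nodes in the syntax tree instead. The sketch below uses only the standard `ast` module; `find_banned_calls` and the two banned-name sets are hypothetical names chosen for illustration, and the check is still easy to evade (for example through aliasing), so it complements the sandbox and does not replace it.

```python
import ast

# Names treated as dangerous when called directly (illustrative, not exhaustive).
BANNED_CALLS = {"eval", "exec", "__import__"}
# (module, attribute) pairs treated as dangerous when called as module.attr(...).
BANNED_ATTR_CALLS = {("pickle", "loads")}


def find_banned_calls(source: str) -> list[str]:
    """Report calls to banned functions, ignoring strings and comments."""
    tree = ast.parse(source)
    findings = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in BANNED_CALLS:
            findings.append(f"line {node.lineno}: call to {func.id}()")
        elif (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and (func.value.id, func.attr) in BANNED_ATTR_CALLS
        ):
            findings.append(f"line {node.lineno}: call to {func.value.id}.{func.attr}()")
    return findings


# A string that mentions eval( is not a call, so it is not reported.
harmless = 'def f(x):\n    return "eval(" + x\n'
risky = "import pickle\ndef g(blob):\n    return pickle.loads(blob)\n"
harmless_findings = find_banned_calls(harmless)
risky_findings = find_banned_calls(risky)
```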
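The retry mechanism is particularly important. The arXiv paper on GPT-4 Code Interpreter reports that code-based self-verification substantially improves accuracy on challenging math benchmarks; the size of the gain depends on the task, so measure it on your own workload rather than assuming a fixed percentage. Our implementation applies the same idea across several validation stages: structure checks, static analysis, and test execution.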
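Note that the `generate` loop resends the same prompt on every attempt, so a retry only helps through sampling variance. One possible refinement, sketched here under the assumption that validation failures are available as plain strings, is to append them to the next prompt. `build_retry_prompt` is a hypothetical helper, not part of the service above.

```python
def build_retry_prompt(base_prompt: str, failures: list[str], max_items: int = 5) -> str:
    """Append a bounded list of validation failures to the original prompt."""
    if not failures:
        return base_prompt
    shown = failures[:max_items]
    lines = [base_prompt, "", "The previous attempt was rejected for these reasons:"]
    lines += [f"- {item}" for item in shown]
    hidden = len(failures) - len(shown)
    if hidden > 0:
        # Cap the feedback so a noisy failure list cannot crowd out the task itself.
        lines.append(f"- ... and {hidden} more")
    lines.append("Return a corrected version in the same JSON format.")
    return "\n".join(lines)


retry_prompt = build_retry_prompt(
    "Generate a Python function that adds two integers.",
    ["Syntax error: expected ':'", "Tests failed for input {'x': 2, 'y': 3}"],
)
```

In the service, the strings would come from `_static_analysis` and the failed entries of `_run_tests`.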
Building the Sandboxed Execution Environment
Running arbitrary generated code is dangerous. We need a sandboxed environment that prevents malicious operations and resource exhaustion. Docker provides the isolation we need.
# app/sandbox/executor.py
import asyncio
import docker
import tempfile
import os
import json
import logging
from pathlib import Path
from typing import Optional
from datetime import datetime, timedelta
logger = logging.getLogger(__name__)
class SandboxedExecutor:
    """
    Executes generated code in isolated Docker containers.
    Prevents resource exhaustion and malicious operations.
    """

    def __init__(self, timeout: int = 30, memory_limit: str = "256m"):
        self.client = docker.from_env()
        self.timeout = timeout
        self.memory_limit = memory_limit
        self.image = "python:3.11-slim"
    async def execute_code(
        self,
        source_code: str,
        function_name: str,
        test_input: dict,
        requirements: Optional[list[str]] = None
    ) -> dict:
        """
        Execute generated code in a sandboxed container.
        Returns execution results or error information.
        """
        # Create temporary directory for the code
        with tempfile.TemporaryDirectory() as tmpdir:
            # Write the source code
            code_path = Path(tmpdir) / "generated_code.py"
            code_path.write_text(source_code)

            # Write the test harness
            harness = self._build_test_harness(function_name, test_input)
            harness_path = Path(tmpdir) / "test_harness.py"
            harness_path.write_text(harness)

            # Build Docker command
            cmd = ["python", "test_harness.py"]

            # Create container with resource limits
            container = self.client.containers.create(
                image=self.image,
                command=cmd,
                working_dir="/code",
                volumes={tmpdir: {"bind": "/code", "mode": "ro"}},
                mem_limit=self.memory_limit,
                cpu_period=100000,
                cpu_quota=50000,  # Limit to 0.5 CPU
                network_disabled=True,  # No network access
                read_only=True,  # Read-only filesystem
                security_opt=["no-new-privileges:true"],
                cap_drop=["ALL"],  # Drop all capabilities
            )
            try:
                # Start container with timeout
                container.start()
                # Wait for completion with timeout
                result = container.wait(timeout=self.timeout)
                # Get logs
                logs = container.logs(stdout=True, stderr=True).decode("utf-8")
                # Parse output
                output = self._parse_output(logs)
                return {
                    "success": result["StatusCode"] == 0,
                    "output": output,
                    "logs": logs,
                    "execution_time": None  # Would need timing instrumentation
                }
            except docker.errors.APIError as e:
                logger.error(f"Docker API error: {str(e)}")
                return {"success": False, "error": f"Docker error: {str(e)}"}
            except Exception as e:
                logger.error(f"Execution error: {str(e)}")
                return {"success": False, "error": str(e)}
            finally:
                # Clean up container
                try:
                    container.remove(force=True)
                except Exception:
                    pass
    def _build_test_harness(self, function_name: str, test_input: dict) -> str:
        """Build a test harness that imports and runs the generated code."""
        # Embed the input as a JSON string literal and decode it inside the
        # container, so JSON values such as true/false/null survive the trip
        # (pasting raw JSON into Python source would break on them).
        input_literal = repr(json.dumps(test_input))
        # The template is flush-left so the generated file has valid indentation.
        return f"""
import json
import sys
from generated_code import {function_name}

def run_test():
    try:
        # Parse input
        input_data = json.loads({input_literal})
        # Execute function
        if isinstance(input_data, dict):
            result = {function_name}(**input_data)
        elif isinstance(input_data, list):
            result = {function_name}(*input_data)
        else:
            result = {function_name}(input_data)
        # Output result as JSON
        output = {{
            "success": True,
            "result": str(result),
            "type": type(result).__name__
        }}
        print(json.dumps(output))
    except Exception as e:
        output = {{
            "success": False,
            "error": str(e),
            "error_type": type(e).__name__
        }}
        print(json.dumps(output))
        sys.exit(1)

if __name__ == "__main__":
    run_test()
"""
    def _parse_output(self, logs: str) -> Optional[dict]:
        """Parse JSON output from container logs."""
        for line in logs.split("\n"):
            line = line.strip()
            if line:
                try:
                    return json.loads(line)
                except json.JSONDecodeError:
                    continue
        return None
The sandbox implementation provides multiple layers of security:
- Resource limits: Memory and CPU constraints prevent resource exhaustion
- Network isolation: Disabled network access prevents data exfiltration
- Read-only filesystem: Prevents persistent modifications
- Capability dropping: Removes all Linux capabilities
- Automatic cleanup: Containers are removed after execution
This approach aligns with security best practices for running untrusted code, similar to how platforms like Replit and GitHub Codespaces handle code execution.
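Two gaps remain in the executor listing: `execution_time` is always `None`, and the Docker SDK calls (`container.start()`, `container.wait()`) are blocking, so they hold up the event loop when called from an `async` method. A minimal way to address both, assuming Python 3.9+ for `asyncio.to_thread`, is to run the blocking step in a worker thread and time it. `run_blocking_timed` and `slow_square` are hypothetical names; the stub stands in for the container run.

```python
import asyncio
import time
from typing import Any, Callable


async def run_blocking_timed(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> tuple[Any, float]:
    """Run a blocking callable in a worker thread and return (result, elapsed_ms)."""
    start = time.perf_counter()
    # to_thread keeps the event loop free to serve other requests meanwhile.
    result = await asyncio.to_thread(func, *args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


def slow_square(x: int) -> int:
    """Stand-in for a blocking call such as waiting on a container."""
    time.sleep(0.05)
    return x * x


result, elapsed_ms = asyncio.run(run_blocking_timed(slow_square, 7))
```

In the executor, the start, wait, and log-collection steps could be wrapped in one synchronous helper and passed to this function, with `elapsed_ms` filling the `execution_time` field.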
Creating the FastAPI API Layer
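Now we'll expose our generation service through an API with error handling and basic metrics logging. The rate-limit values defined in the configuration later in this tutorial are not enforced by the code shown here, so add a limiter before exposing the endpoint publicly.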
# app/routers/generation.py
import logging
from fastapi import APIRouter, HTTPException, Depends, BackgroundTasks
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field
from typing import Optional
import time
from datetime import datetime
from app.services.generator import GPT4oCodeGenerator, CodeGenerationRequest
from app.sandbox.executor import SandboxedExecutor
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/v1/codegen", tags=["code-generation"])
class GenerateRequest(BaseModel):
    """API request model for code generation."""
    task_description: str = Field(
        ...,
        min_length=10,
        max_length=2000,
        description="Description of the function to generate"
    )
    input_types: dict = Field(
        default_factory=lambda: {"x": "int", "y": "int"},
        description="Dictionary mapping parameter names to types"
    )
    output_type: str = Field(
        default="int",
        description="Expected return type"
    )
    constraints: list[str] = Field(
        default_factory=list,
        description="Additional constraints for code generation"
    )
    test_cases: list[dict] = Field(
        default_factory=list,
        description="Test cases to validate against"
    )
    sandbox_execution: bool = Field(
        default=False,
        description="Whether to execute code in sandbox"
    )


class GenerateResponse(BaseModel):
    """API response model for code generation."""
    success: bool
    function_name: str
    source_code: str
    imports: list[str]
    type_annotations: dict
    test_results: list[dict]
    execution_time_ms: float
    sandbox_results: Optional[dict] = None
# Dependency injection
# Defined before the endpoint: Depends(...) is evaluated when the route
# function is defined, so these names must already exist at that point.
async def get_generator():
    """Dependency for GPT-4o generator."""
    from app.config import settings
    return GPT4oCodeGenerator(
        api_key=settings.OPENAI_API_KEY,
        model=settings.OPENAI_MODEL
    )


async def get_executor():
    """Dependency for sandbox executor."""
    return SandboxedExecutor(timeout=30, memory_limit="256m")


@router.post("/generate", response_model=GenerateResponse)
async def generate_code(
    request: GenerateRequest,
    background_tasks: BackgroundTasks,
    generator: GPT4oCodeGenerator = Depends(get_generator),
    executor: SandboxedExecutor = Depends(get_executor)
):
    """
    Generate production-ready Python code using GPT-4o.
    This endpoint implements the self-verification loop described in
    recent ArXiv research on code-based verification for LLMs.
    It generates code, validates it, and optionally executes it
    in a sandboxed environment.
    """
    start_time = time.time()
    try:
        # Convert API request to internal format
        gen_request = CodeGenerationRequest(
            task_description=request.task_description,
            input_types=request.input_types,
            output_type=request.output_type,
            constraints=request.constraints,
            test_cases=request.test_cases
        )

        # Generate code with self-verification
        generated = await generator.generate(gen_request)

        # Optionally execute in sandbox
        sandbox_results = None
        if request.sandbox_execution and generated.is_valid:
            # Pass only the "input" part of the first test case; the full
            # test-case dict also carries "expected", which is not an argument.
            first_input = request.test_cases[0].get("input", {}) if request.test_cases else {}
            sandbox_results = await executor.execute_code(
                source_code=generated.source_code,
                function_name=generated.function_name,
                test_input=first_input
            )

        execution_time = (time.time() - start_time) * 1000

        # Log generation metrics
        background_tasks.add_task(
            log_generation_metrics,
            function_name=generated.function_name,
            is_valid=generated.is_valid,
            execution_time_ms=execution_time,
            test_count=len(request.test_cases)
        )

        return GenerateResponse(
            success=generated.is_valid,
            function_name=generated.function_name,
            source_code=generated.source_code,
            imports=generated.imports,
            type_annotations=generated.type_annotations,
            test_results=generated.test_results,
            execution_time_ms=execution_time,
            sandbox_results=sandbox_results
        )
    except ValueError as e:
        logger.error(f"Validation error: {str(e)}")
        raise HTTPException(status_code=400, detail=str(e))
    except RuntimeError as e:
        logger.error(f"Generation error: {str(e)}")
        raise HTTPException(status_code=500, detail=str(e))
    except Exception as e:
        logger.error(f"Unexpected error: {str(e)}")
        raise HTTPException(status_code=500, detail="Internal server error")


@router.get("/health")
async def health_check():
    """Health check endpoint for monitoring."""
    return {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "version": "1.0.0"
    }
async def log_generation_metrics(
    function_name: str,
    is_valid: bool,
    execution_time_ms: float,
    test_count: int
):
    """Background task for logging metrics."""
    logger.info(
        f"Generation completed - function: {function_name}, "
        f"valid: {is_valid}, time: {execution_time_ms:.2f}ms, "
        f"tests: {test_count}"
    )
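The router above does not limit request rates. As a rough sketch of what an in-process limiter could look like, the class below keeps a sliding window of timestamps per client. `SlidingWindowLimiter` is a hypothetical name, the state lives in one process only (a multi-worker deployment would need shared storage), and the clock is injectable purely so the behavior can be demonstrated without waiting.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Demonstration with a fake clock: two requests allowed per 60 seconds.
fake_now = [0.0]
limiter = SlidingWindowLimiter(2, 60, clock=lambda: fake_now[0])
decisions = [limiter.allow("client-a") for _ in range(3)]
fake_now[0] = 61.0
after_window = limiter.allow("client-a")
```

In the endpoint, a rejected request would typically be answered with HTTP 429 before any call to the model is made.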
Configuration and Main Application
# app/config.py
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Application configuration with environment variable support."""

    # pydantic v2 style configuration (replaces the nested `class Config`)
    model_config = SettingsConfigDict(env_file=".env", case_sensitive=True)

    # OpenAI Configuration
    OPENAI_API_KEY: str
    OPENAI_MODEL: str = "gpt-4o-2024-11-20"

    # API Configuration
    API_HOST: str = "0.0.0.0"
    API_PORT: int = 8000
    DEBUG: bool = False

    # Rate Limiting
    RATE_LIMIT_REQUESTS: int = 100
    RATE_LIMIT_WINDOW: int = 60  # seconds

    # Sandbox Configuration
    SANDBOX_TIMEOUT: int = 30
    SANDBOX_MEMORY_LIMIT: str = "256m"

    # Logging
    LOG_LEVEL: str = "INFO"


settings = Settings()
# app/main.py
import logging
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import time
from app.config import settings
from app.routers import generation
# Configure logging
logging.basicConfig(
    level=getattr(logging, settings.LOG_LEVEL),
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Create FastAPI application
app = FastAPI(
    title="GPT-4o Code Generation API",
    description="Production-grade code generation with self-verification",
    version="1.0.0",
    docs_url="/docs" if settings.DEBUG else None,
    redoc_url="/redoc" if settings.DEBUG else None
)

# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure appropriately for production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# Add request timing middleware
@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    return response


# Include routers
app.include_router(generation.router)


@app.on_event("startup")
async def startup_event():
    """Initialize services on startup."""
    logger.info("Starting GPT-4o Code Generation API")
    logger.info(f"Model: {settings.OPENAI_MODEL}")
    logger.info(f"Debug mode: {settings.DEBUG}")


@app.on_event("shutdown")
async def shutdown_event():
    """Cleanup on shutdown."""
    logger.info("Shutting down GPT-4o Code Generation API")
Running the Application
Create a .env file with your OpenAI API key:
OPENAI_API_KEY=your-api-key-here
OPENAI_MODEL=gpt-4o-2024-11-20
DEBUG=true
LOG_LEVEL=INFO
Start the application:
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
Test the API with a sample request:
curl -X POST "http://localhost:8000/api/v1/codegen/generate" \
-H "Content-Type: application/json" \
-d '{
"task_description": "Calculate the Fibonacci sequence up to n terms",
"input_types": {"n": "int"},
"output_type": "list[int]",
"constraints": ["Must handle n=0 and n=1 edge cases", "Must use iterative approach"],
"test_cases": [
{"input": {"n": 0}, "expected": []},
{"input": {"n": 1}, "expected": [0]},
{"input": {"n": 5}, "expected": [0, 1, 1, 2, 3]}
],
"sandbox_execution": true
}'
Handling Edge Cases and Production Concerns
Rate Limiting and API Costs
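GPT-4o API calls are not cheap. OpenAI priced GPT-4o at $5.00 per 1M input tokens and $15.00 per 1M output tokens at launch, and later snapshots (including the gpt-4o-2024-11-20 model used here) are listed at $2.50 and $10.00; check the current pricing page before budgeting. Our retry mechanism can multiply the cost of a request by up to max_retries. Implement these safeguards: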
- Token budgeting: Track token usage per request and set maximum limits
- Caching: Cache generated code for identical requests using a hash of the prompt
- Fallback models: Use GPT-3.5-turbo for initial generation, GPT-4o only for validation
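Two of the safeguards above are small enough to sketch directly. The example assumes requests are plain JSON-serializable dictionaries; `request_cache_key` and `worst_case_cost_usd` are hypothetical helpers, and the prices in the usage lines are illustrative inputs, not current rates.

```python
import hashlib
import json


def request_cache_key(payload: dict, model: str) -> str:
    """Stable cache key: the same request yields the same key regardless of key order."""
    canonical = json.dumps(
        {"model": model, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def worst_case_cost_usd(
    prompt_tokens: int,
    max_output_tokens: int,
    max_retries: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Upper bound on spend if every retry uses the full output budget."""
    per_attempt = (
        prompt_tokens * input_price_per_m + max_output_tokens * output_price_per_m
    ) / 1_000_000
    return per_attempt * max_retries


key_a = request_cache_key({"task": "fib", "types": {"n": "int"}}, "gpt-4o")
key_b = request_cache_key({"types": {"n": "int"}, "task": "fib"}, "gpt-4o")
# 1,000 prompt tokens, 2,000 output tokens, 3 attempts, illustrative prices.
budget = worst_case_cost_usd(1000, 2000, 3, 5.0, 15.0)
```

Including the model name in the key keeps cached output from one model from being served for another.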
Memory Management
Generated code can contain memory leaks or infinite loops. Our sandbox handles this with timeouts, but you should also:
- Static analysis for resource leaks: Check for unclosed file handles, database connections
- Memory profiling: Track memory usage during test execution
- Circuit breakers: Stop processing if error rate exceeds threshold
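The circuit-breaker idea can be illustrated with a rolling window of recent outcomes. This is a deliberately small sketch: `ErrorRateBreaker` is a hypothetical class, it has no half-open or cool-down state, and the thresholds are arbitrary example values.

```python
from collections import deque


class ErrorRateBreaker:
    """Open the circuit when the recent failure rate exceeds a threshold."""

    def __init__(self, window: int = 20, threshold: float = 0.5, min_samples: int = 5):
        self._outcomes: deque = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    @property
    def is_open(self) -> bool:
        # Too few samples to judge: stay closed.
        if len(self._outcomes) < self.min_samples:
            return False
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes) > self.threshold


breaker = ErrorRateBreaker(window=10, threshold=0.5, min_samples=4)
for _ in range(3):
    breaker.record(False)
open_after_three = breaker.is_open   # below min_samples
breaker.record(False)
open_after_four = breaker.is_open    # 4 of 4 failed
for _ in range(6):
    breaker.record(True)
open_after_recovery = breaker.is_open  # 4 of 10 failed
```

A caller would check `is_open` before invoking the generator and reject the request early while the circuit is open.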
Security Considerations
Beyond the sandbox, consider:
- Prompt injection: Malicious users might try to inject code through the task description
- Data leakage: Generated code might contain sensitive information from training data
- Supply chain attacks: Generated code might import malicious packages
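For the supply-chain concern, one cheap check is to compare the top-level modules a generated snippet imports against an allowlist before it ever runs. The sketch below relies only on the standard `ast` module; `find_disallowed_imports` and the allowlist contents are hypothetical, and dynamic imports through `importlib` or `__import__` are not covered.

```python
import ast


def find_disallowed_imports(source: str, allowed: set[str]) -> list[str]:
    """Return the sorted top-level modules imported by source that are not allowed."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) have no external package root to check.
            roots = [node.module.split(".")[0]] if node.module and node.level == 0 else []
        else:
            continue
        found.update(root for root in roots if root not in allowed)
    return sorted(found)


generated = (
    "import os.path\n"
    "import requests\n"
    "from json import loads\n"
    "from evil_pkg.sub import helper\n"
)
violations = find_disallowed_imports(generated, allowed={"os", "json", "math"})
```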
What's Next
This tutorial provides a foundation for production-grade code generation with GPT-4o. To extend this system:
- Add support for multiple languages: Extend the generator to handle TypeScript, Java, or Go
- Implement continuous learning: Store successful generations and use them as few-shot examples
- Add performance benchmarking: Compare generated code against hand-written implementations
- Integrate with CI/CD pipelines: Automatically generate and test code during development
The approach described here (structured prompting, self-verification loops, and sandboxed execution) is a solid baseline for using LLMs in production code generation. Even as models improve, the validation layer remains essential for ensuring that generated code meets production standards before deployment.
Remember that GPT-4o is a tool, not a replacement for human judgment. Always review generated code for correctness, security, and performance before deploying to production. The system we've built helps automate the validation process, but final responsibility rests with the developer.