Training an LLM in Swift, Part 1: Taking matrix multiplication from Gflop/s to Tflop/s
A developer demonstrates how to accelerate matrix multiplication in Swift from Gflop/s to Tflop/s for training LLMs, challenging the assumption that Python and CUDA are required for high-performance machine learning.
The Silicon Alchemy: How One Developer Turned Swift Into a Machine Learning Powerhouse
The conventional wisdom in machine learning has held firm for nearly a decade: to train neural networks at scale, you use Python, CUDA, and NVIDIA hardware. Anything else is either a research curiosity or a fool's errand. But on May 13, 2026, a detailed technical post on CocoaWithLove [1] landed like a grenade in that orthodoxy, demonstrating something widely dismissed as impractical just a few years ago: training an LLM in Swift, with matrix multiplication performance scaling from Gflop/s into Tflop/s territory. This isn't a toy implementation or a single-core proof of concept. It's a deep architectural exploration of how to wring every last drop of performance from Apple Silicon by rewriting the fundamental building blocks of neural network computation in a language most of the AI world has written off as a frontend tool.
The timing is telling. While the industry obsesses over scaling laws, trillion-parameter models, and the geopolitical chess match of GPU export controls, a quieter revolution has brewed in Cupertino. Apple's M-series chips have quietly accumulated computational horsepower rivaling dedicated accelerators, but the software ecosystem has lagged behind. This article changes that calculus by going back to first principles—specifically, to the matrix multiplication routines that form the beating heart of every transformer model.
The Architecture Behind the Alchemy
To understand why this matters, consider the sheer ambition of the technical approach. The original article [1] doesn't wrap a Python library in Swift bindings and call it a day. It dives into the raw mechanics of matrix multiplication on Apple Silicon, exploiting the specific memory hierarchy, cache architecture, and vector processing units of the M-series chips. The author walks through taking naive matrix multiplication—which typically achieves only a few Gflop/s on consumer hardware—and systematically optimizing it until it breaks through the Tflop/s barrier.
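To make the starting point concrete, the naive baseline looks roughly like the sketch below. This is illustrative Swift, not code reproduced from the article, and it assumes row-major Float arrays throughout:

// Naive row-major single-precision multiply: C = A * B,
// where A is m x k, B is k x n, and C is m x n.
// The inner loop strides through B by n floats at a time, so it misses
// cache constantly and typically stays in the single-digit Gflop/s range.
func naiveMatmul(_ a: [Float], _ b: [Float], _ c: inout [Float],
                 m: Int, n: Int, k: Int) {
    for i in 0..<m {
        for j in 0..<n {
            var acc: Float = 0
            for p in 0..<k {
                acc += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = acc
        }
    }
}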
This is not incremental improvement. Going from Gflop/s to Tflop/s represents a thousandfold increase in computational throughput. To put that in perspective, the gap between a mediocre implementation and an optimized one on the same hardware can exceed the gap between a smartphone chip and a data center GPU. The key insight, detailed in the source material [1], involves careful management of memory locality, tiling strategies that keep data in the fastest caches, and exploitation of Apple's AMX matrix coprocessor—a piece of silicon most developers never touch directly.
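The article's full optimization path goes well beyond what fits here, but the first and most important of those ideas, tiling for cache locality, can be sketched as follows. The tile size of 64 is an illustrative assumption, not a value taken from the source, and real kernels layer vectorization and register blocking on top of it:

// Cache-blocked variant of the same multiply. Iterating over square tiles
// keeps the working set of A and B resident in the fast caches while it is
// reused. Assumes c starts zeroed; the tile size is a tunable parameter.
func tiledMatmul(_ a: [Float], _ b: [Float], _ c: inout [Float],
                 m: Int, n: Int, k: Int, tile: Int = 64) {
    for i0 in stride(from: 0, to: m, by: tile) {
        for j0 in stride(from: 0, to: n, by: tile) {
            for p0 in stride(from: 0, to: k, by: tile) {
                for i in i0..<min(i0 + tile, m) {
                    for p in p0..<min(p0 + tile, k) {
                        let aip = a[i * k + p]
                        for j in j0..<min(j0 + tile, n) {
                            c[i * n + j] += aip * b[p * n + j]
                        }
                    }
                }
            }
        }
    }
}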
The related academic papers surfaced alongside the piece [5][6][7] provide context of a different kind. One, on two-dimensional magnetic interactions in LaFeAsO [5], might seem unrelated, but it reflects the broader point that optimized matrix operations are the shared workhorse of computational physics and machine learning alike. Another, on matrix identities [6], underscores the mathematical foundations these optimizations rest on. The third, about searching for thermonuclear X-ray bursts with the Neil Gehrels Swift Observatory [7], is a reminder that "Swift" has multiple meanings; this article is firmly about Apple's programming language, not NASA's space telescope.
What makes this particularly significant is the contrast with the dominant paradigm. The vLLM project, which has accumulated 72,929 stars and 14,263 forks on GitHub, is written primarily in Python atop CUDA kernels and represents the state of the art in high-throughput LLM inference. The "LLMs-from-scratch" repository, with 87,799 stars, teaches developers how to implement ChatGPT-like models in PyTorch. Both projects are immensely valuable, but they remain locked into the Python+CUDA ecosystem. The Swift approach opens an entirely new front in the battle for efficient AI computation.
The Developer Friction and the Hidden Tax
The practical implications extend far beyond academic curiosity. Consider the developer experience of building AI applications today. To train or fine-tune a model, you must navigate the labyrinthine world of Python environment management, CUDA version compatibility, and NVIDIA driver hell. A single mismatch between PyTorch and your CUDA toolkit can waste hours of debugging. The anything-LLM project, with 56,111 stars, attempts to abstract away this complexity with an "all-in-one AI productivity accelerator" that is "privacy first with no annoying setup or configuration." But it still rests on the same underlying Python infrastructure.
Swift offers a fundamentally different proposition. The language's strong typing, compile-time optimization, and seamless integration with Apple's hardware abstraction layers let developers write code that is both safer and faster. The article [1] demonstrates that with careful optimization, Swift can match or exceed the performance of equivalent Python code running on the same hardware—without the overhead of an interpreter or the fragility of dynamic typing.
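To make that concrete, the most direct route from Swift to Apple's tuned matrix routines today is the Accelerate framework, whose BLAS entry points are widely reported to dispatch to the AMX hardware on M-series chips. A minimal sketch, using the same row-major Float layout as the earlier examples (exact integer parameter types can differ between SDK versions):

import Accelerate

// Delegate C = A * B to Accelerate's single-precision GEMM.
// A is m x k, B is k x n, C is m x n, all row-major.
func acceleratedMatmul(_ a: [Float], _ b: [Float], _ c: inout [Float],
                       m: Int, n: Int, k: Int) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                Int32(m), Int32(n), Int32(k),
                1.0,           // alpha
                a, Int32(k),   // A and its leading dimension
                b, Int32(n),   // B and its leading dimension
                0.0,           // beta
                &c, Int32(n))  // C and its leading dimension
}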
This matters because the AI industry faces a talent bottleneck that will only worsen. The number of developers who can write high-performance CUDA kernels is vanishingly small. The number who can optimize matrix multiplication for a specific chip architecture is even smaller. But the number of developers who know Swift is substantial, thanks to the iOS and macOS ecosystems. If Swift becomes a viable language for AI development, it dramatically expands the pool of people who can contribute to the field.
The data from our proprietary model database reinforces this point. The SmolLM2-135M-Instruct model has been downloaded 1,525,390 times from Hugging Face, while its base variant has 997,736 downloads. These are small models designed to run on consumer hardware. The tiny-random-LlamaForCausalLM model shows an eye-catching 2,988,632 downloads, though a figure that large for a randomly initialized test model says as much about automated CI pipelines as about human users. Either way, there is clearly substantial demand for models that run locally, without cloud dependencies, and Swift-based training and inference could unlock this market.
The Macro Industry Shift and What the Mainstream Is Missing
The broader context is that the AI industry is undergoing a painful adolescence. The era of "just throw more GPUs at it" is ending. The recent paper "Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control," published on May 10, 2026, with a rank score of 25, grapples with exactly this problem: reinforcement learning for LLMs is hitting fundamental performance ceilings that more compute alone cannot overcome. Another paper, "Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding," also from May 10 with a rank score of 25, explores how to make models smarter without making them bigger.
Meanwhile, the security landscape grows increasingly treacherous. VentureBeat's coverage [2] of four separate security research teams publishing findings about Anthropic's Claude between May 6 and 7 reveals a disturbing pattern: the very features that make LLMs powerful—their ability to understand context, follow instructions, and interact with external tools—also make them vulnerable to novel attack vectors. One team found that Claude identified a water utility's SCADA gateway without being told to look for one [2]. Another demonstrated OAuth token hijacking through Claude Code. These are not theoretical vulnerabilities; they are working exploits already demonstrated by researchers.
The BerriAI LiteLLM SQL injection vulnerability, classified as critical severity by CISA, further underscores the point. The vulnerability allows attackers to read and potentially modify data from the proxy's database, leading to unauthorized access. When your AI infrastructure rests on a stack with fundamental security holes, every optimization merely makes the crash more spectacular.
This is where the Swift approach becomes not just interesting, but strategically important. By building AI systems in a language that enforces memory safety, eliminates entire classes of buffer overflow vulnerabilities, and compiles to native code that can be audited, developers can address both the performance problem and the security problem simultaneously. The MIT Technology Review's coverage [4] of Nobel-winning economist Daron Acemoglu's argument that AI will give only a small boost to productivity takes on new meaning here. If the industry continues down its current path of bloated, insecure, Python-dependent infrastructure, Acemoglu's pessimism may prove vindicated. But if we can build leaner, safer, more efficient systems in languages like Swift, the productivity gains could be far more substantial.
The Competitive Landscape and the Apple Question
The obvious elephant in the room is Apple's strategic intent. The company has been conspicuously quiet about its AI ambitions, at least compared to the bombastic pronouncements from Microsoft, Google, and Meta. But the release of the Metal-Sci benchmark, published on May 10, 2026, with a rank score of 25, suggests serious academic interest in scientific computing on Apple Silicon. The benchmark, described as "A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon," indicates that researchers are actively exploring how to optimize LLM kernels for Apple's hardware.
This is not happening in a vacuum. The Hugging Face Blog's coverage [3] of vLLM's evolution from V0 to V1, with its focus on "Correctness Before Corrections in RL," shows that even the most popular inference engines still grapple with fundamental issues of reliability and correctness. The vLLM project, with its 72,929 GitHub stars and 14,263 forks, is written in Python and optimized for NVIDIA hardware. But what happens when the next generation of Apple Silicon ships with even more matrix acceleration hardware? What happens when the Mac Studio becomes a viable alternative to a rack of A100s for certain workloads?
The numbers from our job board are telling. A Senior Director of Fulfillment Operations position at ShipBob, Inc. is listed on RemoteOK, but no senior AI engineering positions specifically require Swift expertise. The market has not yet caught up to the technical reality. This is a classic early-mover opportunity. Developers who invest in learning Swift-based AI development today will be in high demand when the industry inevitably pivots.
The Hidden Risks and the Path Forward
For all the excitement, there are real risks that mainstream coverage misses. The first is ecosystem lock-in. By optimizing for Apple Silicon, developers tie their fortunes to a single hardware vendor. If Apple decides to deprioritize its AI hardware efforts, or if the M-series chips fail to keep pace with NVIDIA's next-generation architectures, all that optimization work becomes stranded. The second risk is fragmentation. The Swift ecosystem for AI is currently tiny compared to Python's; libraries, tools, and community support are all lacking. Developers who jump in early will have to build much of their tooling from scratch, without the safety net of a mature ecosystem.
The third risk, and perhaps the most insidious, is the temptation to optimize prematurely. The article [1] is explicit about the difficulty of achieving Tflop/s performance. It requires deep understanding of memory hierarchies, cache line sizes, and instruction-level parallelism. Most developers will not replicate these results. The danger is that this work creates unrealistic expectations, leading to a wave of poorly optimized Swift AI projects that perform worse than their Python equivalents, discrediting the entire approach.
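A practical safeguard is to measure before claiming anything: a full multiply costs roughly 2 x m x n x k floating-point operations, so a few lines of timing code convert wall-clock seconds into Gflop/s. The sketch below reuses the illustrative tiledMatmul from earlier; any implementation can be swapped in:

import Foundation

// Rough throughput check: time a multiply and convert to Gflop/s.
// A full GEMM performs about 2 * m * n * k floating-point operations
// (one multiply plus one add per inner-product term).
func measureGflops(m: Int, n: Int, k: Int, runs: Int = 5) {
    let a = [Float](repeating: 1.0, count: m * k)
    let b = [Float](repeating: 1.0, count: k * n)
    var best = Double.greatestFiniteMagnitude

    for _ in 0..<runs {
        var c = [Float](repeating: 0.0, count: m * n)
        let start = Date()
        tiledMatmul(a, b, &c, m: m, n: n, k: k)   // swap in whichever kernel you are testing
        best = min(best, Date().timeIntervalSince(start))
    }
    let gflops = 2.0 * Double(m) * Double(n) * Double(k) / best / 1e9
    print("best of \(runs) runs: " + String(format: "%.3f s, %.1f Gflop/s", best, gflops))
}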
But these risks are manageable. The key is to recognize that this is not a replacement for the existing AI stack, but an addition to it. Just as the industry learned to use Python for prototyping and CUDA for production, we may soon learn to use Swift for certain classes of problems where performance and safety are paramount. The models from our database—SmolLM2-135M with its 1.5 million downloads, tiny-random-LlamaForCausalLM with nearly 3 million downloads—are all small enough to benefit from the kind of hardware-specific, per-model optimization that Swift enables. They are the perfect testbed for this approach.
The article [1] is titled "Part 1," implying more is coming. If subsequent installments deliver on the promise of the first, we may witness the birth of a new paradigm in AI development. One where the language you use is not an afterthought, but a first-class consideration in how you design and optimize your models. One where security is baked in from the start, not bolted on as an afterthought. One where the hardware you run on is not a limitation, but an opportunity.
The AI industry has spent the last five years scaling up. It's time to start scaling down, scaling smart, and scaling secure. Swift might just be the tool that makes it possible.
References
[1] CocoaWithLove — Training an LLM in Swift, Part 1: Taking matrix multiplication from Gflop/s to Tflop/s — https://www.cocoawithlove.com/blog/matrix-multiplications-swift.html
[2] VentureBeat — Running Claude Code or Claude in Chrome? Here's the audit matrix for every blind spot your security stack misses — https://venturebeat.com/security/claude-confused-deputy-audit-matrix-security-blind-spots
[3] Hugging Face Blog — vLLM V0 to V1: Correctness Before Corrections in RL — https://huggingface.co/blog/ServiceNow-AI/correctness-before-corrections
[4] MIT Tech Review — The Download: a Nobel winner on AI, and the case for fixing everything — https://www.technologyreview.com/2026/05/12/1137103/the-download-nobel-winner-ai-maintenance-of-everything/
[5] ArXiv — Related paper on two-dimensional magnetic interactions in LaFeAsO — http://arxiv.org/abs/1303.4033v1
[6] ArXiv — Related paper on matrix identities — http://arxiv.org/abs/0902.1155v5
[7] ArXiv — Related paper on searching for thermonuclear X-ray bursts with the Neil Gehrels Swift Observatory — http://arxiv.org/abs/1811.06486v2