Researchers Reverse-Engineered Apple’s Neural Engine—And Trained Models on Inference-Only Hardware

Researchers cracked Apple's Neural Engine, bypassing CoreML: they discovered a direct-access path and trained models on inference-only hardware.

Apple doesn’t want you to know how its Neural Engine works. They don’t publish the instruction set architecture. They don’t document internal specifications. They don’t give developers a way to program it directly—everything goes through CoreML, Apple’s high-level machine learning framework, which adds layers of abstraction that make it nearly impossible to understand what the hardware actually does.

So two researchers—a human engineer called maderix and Claude Opus 4.6 working as a collaborative pair—reverse-engineered it. Over several days, they mapped the entire software stack from CoreML down to the kernel driver, discovered how to compile and execute programs on the Neural Engine without CoreML, cracked the binary format, measured true performance (spoiler: Apple’s “38 TOPS” claim is misleading), and ultimately trained a neural network on hardware Apple designed exclusively for inference.

This is the story of how they did it—and why it matters.


Breaking Through Apple’s Black Box

The Neural Engine sits inside every recent iPhone, iPad, and Mac with Apple Silicon. It’s not a GPU or CPU—it’s a fixed-function accelerator (a specialized, non-programmable hardware unit built to run a specific computational task far faster and more efficiently than a general-purpose CPU) that takes a compiled neural network graph and executes the entire computation as one atomic operation. You don’t issue individual instructions. You submit a compiled program describing the full computation, and the hardware runs it end-to-end. The M4’s Neural Engine has 16 cores, can queue 127 evaluation requests simultaneously, and drops to exactly zero milliwatts when idle through hard power gating.

Apple doesn’t document how to use the Neural Engine directly. The official path is CoreML—a high-level framework that handles model optimization, compilation, and execution. CoreML is convenient for developers but adds overhead and obscures what’s actually happening at the hardware level. For researchers trying to understand Neural Engine capabilities or push it beyond Apple’s intended use cases, CoreML is a black box wrapped around another black box.

The breakthrough came through discovering _ANEClient—a private class in AppleNeuralEngine.framework that provides direct access to the compile-load-execute pipeline. Using Objective-C runtime introspection tools, the researchers found over 40 private classes including _ANEModel, _ANERequest, _ANEIOSurfaceObject, and critically, _ANEInMemoryModel which accepts neural network programs directly in memory without filesystem round-trips.
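The introspection step can be sketched in Python by talking to the Objective-C runtime through `ctypes`. This is a minimal illustration, not the researchers' actual tooling: the framework path is an assumption about the macOS layout (on recent macOS the binary lives in the dyld shared cache, but `dlopen` by path still resolves it), and on non-macOS platforms the function simply returns nothing.

```python
import ctypes
import ctypes.util
import sys

def ane_private_classes():
    """List loaded Objective-C classes whose names begin with '_ANE'.

    macOS only: queries the Objective-C runtime (libobjc) directly.
    Returns an empty list on other platforms.
    """
    if sys.platform != "darwin":
        return []
    libobjc = ctypes.CDLL(ctypes.util.find_library("objc"))
    # Loading the private framework registers its classes with the runtime.
    # NOTE: path is an assumption about the framework's install location.
    ctypes.CDLL("/System/Library/PrivateFrameworks/"
                "AppleNeuralEngine.framework/AppleNeuralEngine")
    libobjc.objc_copyClassList.restype = ctypes.POINTER(ctypes.c_void_p)
    libobjc.objc_copyClassList.argtypes = [ctypes.POINTER(ctypes.c_uint)]
    libobjc.class_getName.restype = ctypes.c_char_p
    libobjc.class_getName.argtypes = [ctypes.c_void_p]
    count = ctypes.c_uint(0)
    class_list = libobjc.objc_copyClassList(ctypes.byref(count))
    names = {libobjc.class_getName(class_list[i]).decode()
             for i in range(count.value)}
    return sorted(n for n in names if n.startswith("_ANE"))
```

On a Mac, this is the kind of enumeration that would surface `_ANEClient`, `_ANEModel`, `_ANERequest`, and the rest of the private surface described above.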

This matters because CoreML requires writing intermediate files to disk, creating directory structures, and pointing the compiler at them—acceptable for inference where you compile once and run forever, but unworkable for training where weights update every iteration and recompilation is constant. The in-memory path eliminates this bottleneck entirely.

The Neural Engine doesn’t speak ONNX (Open Neural Network Exchange, the open interchange format that lets models move between frameworks like PyTorch and TensorFlow) or TensorFlow formats. It uses MIL—Machine Learning Intermediate Language—a typed intermediate representation that looks more like compiler IR than a neural network file format. A simple matrix multiplication in MIL specifies tensor shapes, precision (fp16), and operations with explicit keyword arguments. When ANECompiler processes MIL, it produces E5 binaries—FlatBuffer-structured files typically just 2-3 kilobytes regardless of matrix size.
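A schematic of what a matmul looks like in MIL’s textual form—this is an illustration of the shape of the IR (typed tensors, named blocks, keyword arguments), not verbatim compiler output, and the exact syntax varies across coremltools versions:

```
main(%x: tensor<fp16, [1024, 1024]>) {
  block0() {
    %w: tensor<fp16, [1024, 1024]> = const(val=...)
    %y: tensor<fp16, [1024, 1024]> = matmul(x=%x, y=%w)
  } -> (%y)
}
```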

Here’s where it gets interesting: a 1024×1024 matrix multiplication compiles to 2,688 bytes. A 128×128 matmul compiles to 2,680 bytes. Nearly identical sizes despite the larger matrix holding 64× more elements and requiring roughly 512× more arithmetic. The E5 binary isn’t encoding the multiplication algorithm—it’s encoding a parameterized program whose behavior is controlled by tensor shape descriptors at runtime. The implication: Neural Engine hardware has fixed compute primitives (convolution, matmul, elementwise ops) that are parameterized rather than programmed instruction-by-instruction.
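The arithmetic behind that observation is easy to check. Using the standard 2·n³ FLOP count for an n×n matmul and the E5 sizes reported above:

```python
def matmul_flops(n: int) -> int:
    # An n-by-n matmul performs n*n dot products of length n: ~2*n**3 FLOPs.
    return 2 * n ** 3

# E5 binary sizes measured by the researchers (bytes).
e5_size = {1024: 2688, 128: 2680}

compute_ratio = matmul_flops(1024) / matmul_flops(128)  # 512x more arithmetic
binary_ratio = e5_size[1024] / e5_size[128]             # ~1.003x larger binary
```

Three orders of magnitude more work, essentially the same binary—strong evidence the binary carries parameters for fixed hardware primitives rather than an unrolled program.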

Data transfer uses IOSurfaces—the same shared memory mechanism Apple uses for GPU textures. This means zero-copy transfers between GPU and Neural Engine are theoretically possible if both accelerators share the same IOSurface reference. For ML workflows that preprocess data on GPU before running inference on Neural Engine, this could eliminate expensive memory copies entirely.

The researchers also discovered the Neural Engine supports queue depths of 127—far deeper than most accelerator queues, suggesting the hardware is optimized for high-throughput streaming inference rather than single-request latency. Independent voltage/frequency scaling means the Neural Engine can adjust power and performance separately from CPU and GPU, with sophisticated adaptive triggers for different workload characteristics.

Several mysteries remain. The exact core microarchitecture and instruction set are still unknown. How cores get assigned to operations within a graph isn’t documented. Whether hardware performance counters are accessible remains unclear. Classes like _ANEChainingRequest and _ANESharedEvents hint at capabilities for chaining multiple models or synchronizing with GPU operations that haven’t been fully explored yet.

But the core achievement stands: direct Neural Engine access without CoreML, in-memory compilation for training workflows, and successful neural network training on hardware Apple explicitly designed only for inference. In upcoming coverage, we’ll examine the performance reality—why Apple’s “38 TOPS” specification is misleading, why convolution runs 3× faster than matrix multiplication on the same hardware, and how bypassing CoreML delivers 2-4× more throughput than the official path.

For now, what matters is that Apple’s most advanced AI accelerator isn’t as locked down as the company intended. The hardware is accessible. The compilation pipeline can be driven directly. And the Neural Engine can do more than Apple publicly claims—you just have to reverse-engineer your way past the abstractions to find out.

This is Part 1 of our Apple Neural Engine series. Coming next: performance benchmarks and the training breakthrough. For discussions on Apple Silicon, AI accelerators, and systems research, join our WhatsApp community.



WireUnwired Editorial Team
