AI models like ChatGPT can write code, debug software, and answer technical questions. So can they design computer chips? That’s the promise chip companies are chasing—using AI to accelerate the brutally complex process of hardware engineering. But here’s the problem: nobody actually knows how well these AI systems work for real chip design, because the tests we’ve been using are too easy and don’t reflect what engineers actually do.
Researchers at UC San Diego and Columbia University just released ChipBench, a benchmark specifically designed to evaluate how well large language models perform at actual chip design tasks. The results are humbling: Claude Opus 4.5, currently one of the most capable AI models available, managed only 30% success on generating correct hardware code and just 13% on creating reference models—the simplified versions engineers use to verify chip behavior. For context, those same AI models score above 95% on existing programming benchmarks.
This gap reveals a critical problem for the semiconductor industry. Companies are racing to adopt AI tools for chip design, hoping to compress development cycles from years to months and reduce costs by millions. But if the AI can’t reliably handle the complexity of real hardware engineering tasks, those investments risk producing designs that don’t work—an expensive failure in an industry where a single chip project can cost $500 million and take three years from concept to production.
⚡ WireUnwired • Fast Take
- New ChipBench benchmark tests AI models on real chip design tasks—not simplified toy problems
- Best AI model (Claude Opus 4.5) achieves only 30% success on hardware code generation vs 95%+ on software
- Covers three critical tasks: writing chip code, debugging errors, creating verification models
- Exposes massive capability gap between AI’s software skills and hardware engineering requirements

Why Existing Benchmarks Don’t Work for Chip Design
The problem with current AI evaluation methods is that they test general programming ability on relatively simple, isolated problems. A typical benchmark might ask: “Write a function to sort a list of numbers” or “Debug this 50-line code snippet.” AI models excel at these tasks because they’ve seen millions of similar examples during training, and the problems have clear, verifiable correct answers.
Chip design is fundamentally different. Hardware engineers write in languages like Verilog or VHDL, which describe circuits rather than step-by-step instructions. A single chip module might involve thousands of lines of code with complex timing relationships, hierarchical structures where components nest inside other components, and constraints that only become apparent when you consider how the entire system works together.
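To make that difference concrete, here is a minimal Python sketch (illustrative only, not from the paper) of the simultaneous-update semantics a circuit has and ordinary sequential code lacks: every register in a pipeline captures its new value at the same clock edge, so a faithful model must compute all next-state values from the old state before committing any of them, mirroring Verilog’s nonblocking assignments.

```python
# Illustrative sketch: modeling a two-stage pipeline's clock edge in Python.
# All registers update simultaneously, so next-state values are computed
# from the *current* state first, then committed together: the behavior
# Verilog's nonblocking assignments (<=) describe.
def clock_edge(state: dict, data_in: int) -> dict:
    next_state = {
        "stage1": data_in,          # stage1 captures the new input
        "stage2": state["stage1"],  # stage2 captures the OLD stage1 value
    }
    return next_state               # commit both registers at once

state = {"stage1": 0, "stage2": 0}
for sample in (7, 8, 9):
    state = clock_edge(state, sample)
print(state)  # {'stage1': 9, 'stage2': 8}: data takes two cycles to reach stage2
```

Naively updating `stage1` before reading it would corrupt `stage2` in the same cycle, which is exactly the class of mistake that sequential-code intuition invites.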
More importantly, existing benchmarks had become saturated—AI models were achieving 95%+ pass rates, making it impossible to distinguish between genuinely capable systems and those that just memorized common patterns. When tests are too easy, you can’t tell whether improvements come from better fundamental capabilities or just better training on test-like problems.
What ChipBench Actually Tests
The researchers designed ChipBench around three tasks that mirror real chip design workflows:
Verilog Generation: Given a specification describing what a chip module should do, can the AI write correct Verilog code that implements it? The benchmark includes 44 realistic modules with complex hierarchical structures—not simple logic gates but actual components you’d find in processors or signal processing chips. Success means the generated code compiles correctly, passes functional tests, and implements the specification accurately.
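As a rough illustration of that success criterion (a hypothetical sketch, not ChipBench’s actual harness), the check below compiles a candidate module with the open-source Icarus Verilog tools and runs its testbench; the `PASS` marker is an assumed testbench convention.

```python
# Hypothetical pass/fail check for generated Verilog, assuming the
# open-source Icarus Verilog tools (iverilog/vvp) are installed.
# ChipBench's real harness is more elaborate; this shows only the idea
# that success requires both compilation and a passing testbench.
import pathlib
import subprocess
import tempfile

def candidate_passes(dut_src: str, tb_src: str) -> bool:
    """Return True if the generated module compiles and its testbench passes."""
    with tempfile.TemporaryDirectory() as tmp:
        work = pathlib.Path(tmp)
        (work / "dut.v").write_text(dut_src)
        (work / "tb.v").write_text(tb_src)
        # Gate 1: the code must compile at all.
        compile_step = subprocess.run(
            ["iverilog", "-o", str(work / "sim"),
             str(work / "dut.v"), str(work / "tb.v")],
            capture_output=True, text=True,
        )
        if compile_step.returncode != 0:
            return False
        # Gate 2: the testbench must run and report success
        # ("PASS" is an assumed convention, not a ChipBench detail).
        run_step = subprocess.run(
            ["vvp", str(work / "sim")], capture_output=True, text=True,
        )
        return run_step.returncode == 0 and "PASS" in run_step.stdout
```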
Debugging: Hardware engineers spend enormous time tracking down bugs in chip designs. ChipBench provides 89 systematically constructed debugging cases where the AI must identify what’s wrong in existing Verilog code and fix it. This tests whether AI can understand existing code, recognize incorrect behavior from test results, and make targeted corrections without breaking other functionality.
Reference Model Generation: Before building actual hardware, engineers create simplified software models in languages like Python, SystemC, or CXXRTL that simulate chip behavior. These models are critical for verification—you test your complex hardware design against the simpler reference model to catch errors. ChipBench includes 132 reference model samples testing whether AI can translate hardware specifications into correct behavioral models.
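For a sense of what such a model looks like, here is a minimal Python sketch for an imagined 4-bit counter with synchronous reset and enable (the module and the `Counter4Ref` name are hypothetical; real ChipBench modules are far more complex). The form is the same, though: a golden model advanced one clock cycle per call.

```python
# Minimal, hypothetical reference model: a cycle-accurate Python model of
# an imagined 4-bit counter with synchronous reset and enable. Real
# ChipBench reference models cover far more complex modules.
class Counter4Ref:
    """Golden model: call step() once per clock edge."""

    def __init__(self) -> None:
        self.count = 0

    def step(self, rst: bool, en: bool) -> int:
        if rst:
            self.count = 0                       # synchronous reset wins
        elif en:
            self.count = (self.count + 1) & 0xF  # wrap at 4 bits
        return self.count

# In verification, the same stimulus drives this model and the Verilog
# simulation; any output mismatch flags a bug in the hardware design.
ref = Counter4Ref()
outputs = [ref.step(rst=False, en=True) for _ in range(17)]
assert outputs[15] == 0 and outputs[16] == 1  # counter wraps past 15
```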
The benchmark’s difficulty comes from realistic complexity. Modules have multiple interacting components, timing constraints, state machines, and the kind of edge cases that cause real chips to fail. You can’t succeed by pattern matching against training data—you need to actually understand hardware semantics and design principles.

The Results: AI Struggles With Hardware
When the researchers tested leading AI models on ChipBench, the performance gap became starkly clear. Claude Opus 4.5—among the most capable models currently available—achieved 30.74% on Verilog generation. That means nearly 70% of the time, the AI-generated hardware code either didn’t compile, failed functional tests, or didn’t correctly implement the specification.
Reference model generation proved even harder, with just 13.33% success on Python models. This is particularly concerning because reference models are supposed to be simpler than the actual hardware—if AI struggles to create the verification models, it certainly can’t be trusted to design the hardware itself.
These numbers contrast dramatically with the 95%+ pass rates the same models achieve on standard programming benchmarks. The gap reveals that hardware design requires capabilities beyond general code generation—understanding timing, state, concurrency, and the physical constraints of actual circuits.
Why This Matters for the Chip Industry
Semiconductor companies face a critical skills shortage. Training a competent hardware engineer takes years, and experienced designers are increasingly scarce as the field becomes more complex. The industry hoped AI could partially address this by automating routine design tasks, allowing human engineers to focus on architecture and high-level decisions.
ChipBench’s results suggest we’re much further from that goal than marketing materials from AI tool vendors might imply. A 30% success rate means you can’t trust AI-generated hardware code without extensive human review and correction, which undercuts the very efficiency gains the tools are supposed to deliver. Companies investing heavily in “AI-aided chip design” tools need to understand these limitations before making their designs dependent on capabilities that don’t yet exist reliably.
The benchmark also provides a roadmap for improvement. By identifying specific weaknesses—particularly in reference model generation and debugging—it guides research toward the capabilities that matter most for real workflows. The researchers even provided an automated toolbox for generating high-quality training data, recognizing that AI models need more exposure to realistic hardware design problems.
What Needs to Happen Next
The research highlights several directions for making AI genuinely useful in chip design:
Better training data: Current AI models, trained primarily on software code, see too little high-quality code in Verilog, VHDL, SystemC, and other hardware languages. The relatively small hardware design community produces far less publicly available code than software developers do, limiting what AI can learn from.
Hardware-specific architectures: General-purpose language models might not be ideal for chip design. Specialized models that understand hardware semantics—timing constraints, resource limits, power consumption—could outperform larger general models.
Hybrid approaches: Rather than expecting AI to generate complete designs autonomously, integrating AI tools into existing design flows as assistants might be more practical. AI could suggest optimizations, catch common bugs, or generate test cases while humans maintain architectural control.
Verification focus: Given AI’s struggles with reference model generation, improving capabilities here could have immediate impact. Better automated verification models would help catch design errors earlier, even if the AI can’t design complete chips reliably.
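Part of why this direction is attractive is that checking against a golden model is mechanically simple. As a hedged sketch (hypothetical names, not from the paper): replay one stimulus stream through the reference model, compare against the outputs captured from the HDL simulation, and report the first cycle that diverges.

```python
# Hypothetical sketch of reference-model-based checking: replay stimulus
# through a golden model and compare against outputs captured from the
# HDL simulation, returning the first cycle that diverges (or None).
from typing import Callable, Optional, Sequence, Tuple

def first_mismatch(
    golden_step: Callable[..., int],
    stimulus: Sequence[Tuple],
    hw_trace: Sequence[int],
) -> Optional[int]:
    for cycle, (inputs, hw_out) in enumerate(zip(stimulus, hw_trace)):
        if golden_step(*inputs) != hw_out:
            return cycle  # earliest cycle where hardware diverges from intent
    return None

# Example with a trivial golden model (a register that echoes its input
# one cycle late) and a hardware trace with an injected bug at cycle 2.
_state = {"q": 0}
def delayed_echo(d: int) -> int:
    out, _state["q"] = _state["q"], d
    return out

print(first_mismatch(delayed_echo, [(5,), (6,), (7,)], [0, 5, 99]))  # -> 2
```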
The semiconductor industry’s AI ambitions aren’t impossible, but ChipBench demonstrates they’re significantly harder than optimistic predictions suggested. Understanding these limitations helps companies make realistic investments in AI tools while continuing to develop the human expertise that remains irreplaceable for complex chip design.
FAQ
Q: If AI can write software well, why is hardware so much harder?
A: Software is sequential—one instruction after another, with clear cause and effect. Hardware is parallel—everything happens simultaneously with precise timing relationships. A software bug might cause one function to fail; a hardware bug can make an entire chip unusable. Additionally, hardware description languages like Verilog describe structure and behavior together, requiring you to think about both logic (what it computes) and implementation (how gates connect) simultaneously. Software engineering separates these concerns. Finally, there’s far less public Verilog code for AI models to learn from compared to the billions of lines of Python, JavaScript, and Java available online.
Q: Does 30% success mean AI is useless for chip design?
A: Not useless, but limited. A 30% success rate means AI can help with simpler modules or generate starting points that engineers refine, but it can’t be trusted for autonomous design. Think of it like spell-check versus writing an entire article—helpful for catching some issues, not ready to replace the author. The key is understanding these limitations when deploying AI tools in production workflows. Companies treating AI as a design assistant (suggesting improvements, generating test cases, finding common patterns) can extract value. Those expecting AI to replace experienced engineers will be disappointed.
Q: How does this benchmark help improve AI for chip design?
A: ChipBench provides a standardized way to measure progress on real tasks that matter for actual chip design, not artificial toy problems. Researchers can now quantify improvements: “our new model achieves 45% on Verilog generation, up from 30%.” The benchmark also identifies specific weaknesses—debugging is harder than generation, reference models are hardest of all—which guides where to focus research efforts. The included training data generation toolbox helps address the shortage of high-quality hardware design examples that AI models need to learn from. Over time, this should accelerate development of genuinely useful AI tools for chip engineering.
Research Paper: arXiv:2601.21448 | Code: GitHub