The "Design Conductor" paper is bunk
Introduction
Recently, a paper (PDF) was published on arXiv entitled “Design Conductor: An agent autonomously builds a 1.5 GHz Linux-capable RISC-V CPU”. This paper describes the authors, from a startup called “Verkor”, using an AI “agent” they call “Design Conductor” to design a RISC-V chip called VerCore. This has been breathlessly covered in the media, most notably by Adafruit; but by others (1), (2) as well.
This paper, however, has an enormous number of unfair comparisons and missing limitations. Adafruit did not do their due diligence (shame!), and apparently neither has anyone else, so I feel like it’s my duty to do so.
In this article, I’ll pull apart the main claims of the paper and demonstrate how they are either false or unfair. Then, I will cover further limitations the authors did not mention. Finally, I will discuss the impact these types of papers are having on arXiv and academia as a whole.
Claims
The paper makes a number of claims; most notably about the capabilities of the VerCore, but also about the capabilities of the Design Conductor agent.
Claim: “[VerCore] is roughly equivalent to an Intel Celeron SU2300 from mid-2011”
This is the most important claim of the paper, and one that is repeated on Verkor’s website and in the media coverage the paper has since received. This claim, however, does not stand up to scrutiny and is misleading.
Firstly, the Intel Celeron SU2300 was launched in 2009, not 2011 (ref 1, ref 2). It is a dual-core processor that ran at 1.2 GHz, was manufactured on Intel’s 45 nm process node, and had a 10 W TDP (ibid).
EEMBC publishes CoreMark benchmark scores, and luckily there are two available for the SU2300. The first uses a single core, and reaches a score of 3367.6, or a CoreMark/MHz of 2.80. The second uses both cores, and reaches a score of 6543.8, or a CoreMark/MHz of 5.45.
The Design Conductor paper states that their chip was able to achieve timing closure at 1.48 GHz on the ASAP7 speculative PDK (we will get into that later). They claim a CoreMark of 3261, which is close to the SU2300’s single-core CoreMark of 3367.6, but is nowhere near the dual-core score of 6543.8.
They do not state this in the paper, but that also yields a CoreMark/MHz of 2.203, which is worse than the SU2300’s scores of 2.80 and 5.45 for single- and dual-core respectively.
In general, this comparison is grossly unfair to the Celeron. As we will get into later, the VerCore was not properly taped out, and even so it only occupies an area of 0.002809 mm^2 on ASAP7. The Celeron, by comparison, occupies an area of 107 mm^2, which is roughly 38,091x larger. This is no slight to the Celeron, by the way: the reason it occupies so much area (in addition to being built on a larger process node) is that it implements an entire x86 frontend, an L1 cache, a 1 MB L2 cache, two cores, a speculative out-of-order architecture, and all the legacy baggage necessary for an x86 chip. Given such an enormous die, it is no surprise that parasitics, wire length and other well-documented factors decrease the Celeron’s overall clock speed, compared to just “taping out” the very core of a chip with unclear memory and data interfaces. This is not even accounting for the fact that the Intel chip was on a newly released 45 nm node, whereas VerCore is on a speculative 7 nm model. I argue that if the AI CPU were properly taped out, end-to-end in a SoC, we would see a lower overall CoreMark score and likely a lower overall CoreMark/MHz score as well.
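To put the two area figures in perspective, here is a minimal Python sketch that recomputes the ratio and converts each area into the side length of an equivalent square. The square shape is purely an assumption for illustration; neither die is claimed to be square, and the constant names are my own.

```python
import math

# Areas as quoted in this article.
VERCORE_AREA_MM2 = 0.002809  # core-only area on the ASAP7 predictive PDK
CELERON_AREA_MM2 = 107.0     # full SU2300 die area on 45 nm

# How many times larger the Celeron die is than the VerCore layout.
area_ratio = CELERON_AREA_MM2 / VERCORE_AREA_MM2

# Side length of a square with the same area (illustrative assumption only).
vercore_side_um = math.sqrt(VERCORE_AREA_MM2) * 1000.0  # mm -> micrometres
celeron_side_mm = math.sqrt(CELERON_AREA_MM2)

print(f"Area ratio: {area_ratio:,.1f}x")
print(f"VerCore equivalent square side: {vercore_side_um:.1f} um")
print(f"Celeron equivalent square side: {celeron_side_mm:.2f} mm")
```

In other words, the comparison pits a layout a few tens of micrometres across against a complete die roughly a centimetre across.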
Finally, I want to note that the excellently designed, FOSS Ibex core (which is formally verified and not AI slop) has a CoreMark/MHz of 3.13. This knocks both the Celeron (in single-core mode) and VerCore out of the park. If Ibex ran at VerCore’s 1.48 GHz, we would expect a CoreMark score of 4632.4.
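The CoreMark figures quoted in this section are easy to sanity-check. The Python sketch below recomputes CoreMark/MHz from the scores and clock speeds given above, and projects an Ibex score at 1.48 GHz. The function names are my own, and the projection assumes Ibex would close timing at the same 1.48 GHz, which has not been shown.

```python
def coremark_per_mhz(score, clock_mhz):
    """Normalise a CoreMark score by clock frequency in MHz."""
    return score / clock_mhz


def projected_coremark(per_mhz, clock_mhz):
    """Project a CoreMark score from a CoreMark/MHz figure and a clock."""
    return per_mhz * clock_mhz


# Scores and clocks as quoted in this article.
su2300_single = coremark_per_mhz(3367.6, 1200)  # one core at 1.2 GHz
su2300_dual = coremark_per_mhz(6543.8, 1200)    # both cores at 1.2 GHz
vercore = coremark_per_mhz(3261, 1480)          # claimed score at 1.48 GHz

# Assumption: Ibex keeps its 3.13 CoreMark/MHz at VerCore's clock speed.
ibex_at_vercore_clock = projected_coremark(3.13, 1480)

print(f"SU2300 single-core: {su2300_single:.2f} CoreMark/MHz")
print(f"SU2300 dual-core:   {su2300_dual:.2f} CoreMark/MHz")
print(f"VerCore:            {vercore:.3f} CoreMark/MHz")
print(f"Ibex at 1.48 GHz:   {ibex_at_vercore_clock:.1f} CoreMark")
```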
Claim: “[VerCore is] Linux-capable”
Linux typically requires an MMU to run. From the documentation:
The kernel has limited support for memory mapping under no-MMU conditions, such as are used in uClinux environments.
It is technically true that Linux will “run” on a device without an MMU, but there are some severe limitations:
- ELF file binaries (i.e. all of them) will not work
- The fork syscall will not work
Whilst Linux can run without an MMU, potentially even on this chip, calling it “Linux-capable” is a stretch. Ibex, mentioned above, would also technically be “Linux-capable” by this definition, but its maintainers obviously don’t go around saying it is.
Furthermore, the paper includes absolutely no evidence - beyond the title - that the chip is at all Linux-capable. There are no boot logs, no results at all. This is even further suspect given that, from the OpenROAD floorplan shown, there are clearly no SRAM macros instantiated. It is not clear what the memory capabilities of this chip are, if it even has any, or if it uses an off-chip bus, or what. I strongly suspect that this has not been accounted for given how many pins the OR GDS instantiates on its perimeter, which is what usually happens when you build a core directly into a GDS file and don’t use any sensible off-chip bus (i.e., you tape out my_cpu_top.sv and include things like instr_i[31:0], a 32-bit bus).
This design, of course, would not be remotely able to be fabricated, but we’ll get into that later.
Problems
We’ve discussed the main claims of the paper, but now I’d like to address some problems that the authors neglected, in addition to the limitations they already stated in their paper.
ASAP7 is a speculative PDK
The PDK that the authors use, ASAP7, is speculative/predictive (Clark et al., 2016). In other words, no ASAP7 fab exists in reality, and no chips can be made on ASAP7. It is a predictive undertaking, based on the best guesses of a number of experienced foundry engineers from 2016, before EUV was widely available. It is fair and reasonable to use ASAP7 to get an idea of what timing for a device might look like. However, saying that VerCore is a 1.5 GHz CPU is misleading. It is not 1.5 GHz on any technology node that actually exists in reality. It might reach 1.5 GHz on a very recent TSMC or Samsung node, but it’s hard to say. I am not aware of any works that compare ASAP7’s predictions against real silicon; such works would likely be under severe NDAs anyway.
Given the ready availability of real FOSS PDKs like Skywater, Global Foundries or IHP, one wonders about the choice to use this particular PDK. I speculate that it’s mostly to do with the title: after all, “1.5 GHz” sounds impressive.
Proper SoC design was not completed; memory/bus IF is unclear
Based on my analysis of the top-level view of the GDS presented in the paper, and the pinout, I conclude that the VerCore is not a proper System on Chip (SoC). There are no SRAM macros instantiated, so it’s unclear how much - if any - memory this system can address. In addition, there’s no description of what (assuming any) interconnect is used. In all likelihood, there is none, and this is just a core design.
This is… fine, kind of, except for the fact that the interconnect and memory addressing style are among the key factors that define a CPU in practical use, and these non-trivial choices will naturally affect the Power, Performance and Area (PPA) of the core design, and by extension, the Fmax and CoreMark score of the design.
I am currently designing an Ibex-based SoC, which you can see a block diagram of below. (Lightly redacted due to ongoing research, with my apologies.)
As you can see, there are a lot of design decisions about, for example, if the main memory should be tightly coupled, how large it should be, how it should be connected to the interconnect crossbar, etc. None of these decisions were considered in the paper, since it seems to be just a core design.
Since it is just a core design, the paper should have been more clear about this. It should have made it clear that the GDS design pictured in the lead section not only does not run Linux, but straight up does not function without additional peripherals.
The implemented RISC-V ISA is extremely limited
As part of the prompt to the agent, the authors explicitly specified:
Your task is to build VerCore, a RISC-V CPU core that supports RV32I and ZMMUL … DO NOT support compressed instructions.
It’s not immediately clear why they did not implement compressed instruction decoding, as it’s not an enormously significant challenge on top of the base RV32I_Zmmul ISA. If I were to hazard a guess, it would probably be because the agent had difficulties implementing it in the past. That, in turn, is likely because on RV32I, you can assume that all instructions are 4-byte word aligned; whereas on RV32IC, some instructions are going to be 2 bytes instead. The choice of Zmmul over M is understandable, since M also requires division instructions, and a hardware divider is cumbersome.
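To make the alignment point concrete, here is a minimal Python sketch of how a fetch unit has to find instruction boundaries once compressed instructions are allowed. In the standard RISC-V encoding, an instruction whose lowest two bits are 0b11 is 32 bits wide, and anything else is a 16-bit compressed instruction. The function name and the test stream are my own, and the sketch deliberately ignores the longer-than-32-bit encodings.

```python
def instruction_offsets(code):
    """Return (byte offset, byte length) for each instruction in a
    little-endian RISC-V instruction stream (16- and 32-bit forms only)."""
    found = []
    offset = 0
    while offset < len(code):
        if offset + 2 > len(code):
            raise ValueError(f"truncated parcel at offset {offset}")
        first_parcel = int.from_bytes(code[offset:offset + 2], "little")
        # Lowest two bits == 0b11 marks a 32-bit instruction;
        # any other value marks a 16-bit compressed instruction.
        length = 4 if (first_parcel & 0b11) == 0b11 else 2
        if offset + length > len(code):
            raise ValueError(f"truncated instruction at offset {offset}")
        found.append((offset, length))
        offset += length
    return found


ADDI_NOP = (0x00000013).to_bytes(4, "little")  # addi x0, x0, 0
C_NOP = (0x0001).to_bytes(2, "little")         # c.nop

# A 32-bit instruction, a compressed one, then another 32-bit instruction.
stream = ADDI_NOP + C_NOP + ADDI_NOP

for offset, length in instruction_offsets(stream):
    aligned = "aligned" if offset % 4 == 0 else "NOT 4-byte aligned"
    print(f"offset {offset}: {length}-byte instruction ({aligned})")
```

The second 32-bit instruction starts at byte offset 6, so the fetch stage can no longer assume word alignment and has to cope with instructions that straddle two fetched words. That is extra work, but it is well-trodden ground in existing open-source cores.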
By all means, an RV32I_Zmmul chip is a valid barebones implementation. However, there are numerous existing cores - hoovered up in the training set of the LLM, no doubt - that support RV32IMAFDC and probably more extensions. The fact that an agent, which uses an undisclosed number of tokens at vast expense, can regurgitate an RV32I_Zmmul design that has been described in numerous textbooks, is not immediately impressive; even amongst other research in the AI/agent-coding “field”.
The “memory” system of the agent is unclear
One of the biggest claims of the paper is that their new agent has an “unlimited” memory system. From the paper (emphasis mine):
Specific knowledge is provided to DC via a dedicated knowledge base. This knowledge base is contained within the main memory system. Memories exist indefinitely and are managed fully autonomously. DC makes use of this memory when it onboards itself onto a new codebase, or when it ingests requirements provided to it by users. This memory is also critical to ensuring that DC meets all requirements on the design requested by users, and that the design it is building meets all correctness requirements. A single DC “instance” is dedicated to one customer’s design, such that no code, memories, or any information crosses between customers.
The actual design of these modules is proprietary and not discussed further in this report.
That last paragraph… Great, just great. Exactly what we need to see in an arXiv paper; truly revolutionary and pushing the state of academia forward 🙄
My sincere belief is that the current “memory” systems of all AI “agents” are hacks on top of hacks. RAGs are hacks. Vector databases are hacks. None of these techniques embed correct information directly into the model; they simply shove it into the context window before the user’s prompt and pray it works, which is reminiscent of the lack of quality and novelty in the field in its entirety.
In any case, it remains unclear - intentionally so, by their writing - how exactly the memory system is utilised, what type of memory system it is, and if the memories are truly “unlimited”.
Claiming unlimited storage by itself is already a massive claim, but claiming unlimited memories for AI models with notably limited context windows is almost certainly false.
It is not clear how the Booth-Wallace multiplier was implemented
The paper makes a big deal out of the agent implementing a Booth-Wallace multiplier, as they describe it:
It features … a high-efficiency Booth-Wallace multiplier (which, on its own, clocks at 2.57 GHz). These attributes were discovered by DC and were not included in any input instructions (see 3).
Strictly speaking, a “Booth-Wallace multiplier” doesn’t exist; rather, a Wallace tree is used as the structure, and the Booth encoding is used as part of the multiplication algorithm itself.
Minor nitpicks aside, the problem with this is that Yosys, which was likely used in this paper as they also use OpenROAD, is itself capable of synthesising a Booth multiplier. This is implemented as part of the booth command in the technology mapping stage. For signed multiplication, specifically, it uses a radix-4 Booth-encoded multiplier by Chang et al., (2014).
Without the RTL, it is impossible to say; but if the multiplier was synthesised by Yosys, then it of course lends no credit to the model, but rather to the hard work of the Yosys maintainers. This section can of course be refuted easily if the RTL is released and it proves that the model did design a valid Wallace-tree multiplier with Booth encoding.
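For readers unfamiliar with the two ideas being combined here, the following Python sketch shows the textbook technique: radix-4 Booth recoding produces the partial products, and a Wallace-style tree of carry-save adders reduces them. This is a word-level illustration with names of my own choosing. It is not the Yosys implementation, and it is certainly not the (unreleased) VerCore RTL.

```python
def booth_radix4_digits(multiplier, width):
    """Recode a signed `width`-bit multiplier into radix-4 Booth digits,
    each in {-2, -1, 0, 1, 2}, least significant digit first."""
    if width % 2:
        width += 1  # sign-extend to an even number of bits
    bits = [(multiplier >> i) & 1 for i in range(width)]
    digits = []
    previous = 0  # implicit bit to the right of bit 0
    for i in range(0, width, 2):
        digits.append(bits[i] + previous - 2 * bits[i + 1])
        previous = bits[i + 1]
    return digits


def carry_save_add(x, y, z):
    """3:2 compressor: turn three addends into a sum word and a carry word."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry


def reduce_tree(terms):
    """Wallace-style reduction: compress layers of three addends into two
    until at most two remain, then do one final ordinary addition."""
    terms = list(terms)
    while len(terms) > 2:
        full = len(terms) - len(terms) % 3
        layer = []
        for i in range(0, full, 3):
            layer.extend(carry_save_add(*terms[i:i + 3]))
        layer.extend(terms[full:])  # leftovers pass through to the next layer
        terms = layer
    return sum(terms)


def booth_multiply(a, b, width=32):
    """Multiply signed integers, where b fits in `width` bits."""
    digits = booth_radix4_digits(b, width)
    partial_products = [(a * d) << (2 * i) for i, d in enumerate(digits)]
    return reduce_tree(partial_products)


print(booth_multiply(-1234, 5678) == -1234 * 5678)
```

Radix-4 recoding halves the number of partial products (16 instead of 32 for a 32-bit multiplier), and the carry-save tree avoids propagating carries until the final addition. Both are standard, well-documented techniques, which is precisely why the provenance of the paper's multiplier matters.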
Discussion
Academic writing quality
I do personally consider it the height of rudeness in 2026 to drop AI spew directly into a conversation. I’ve seen people do this a lot: that is, copy and paste AI spew Markdown into GitHub threads, and I can’t stand it. It is then surprising, in 2026, to see a paper that almost entirely consists of this. Large swathes of this entire paper include verbatim conversations between the authors and the AI model. I don’t think this would, or should, be acceptable in any other discipline, yet we are allowing it in computer science.
Reproducibility
Speaking directly, this paper is not reproducible. It doesn’t even make an attempt to be reproducible:
The actual design of these modules is proprietary and not discussed further in this report.
The AI field already has a crisis of reproducibility 1, 2, 3, 4, 5. I particularly draw your attention to the last citation, which covers Google’s AlphaChip paper. This, I also believe, is another “bunk” paper; but this article is already quite long so I won’t go into it in its entirety. The tl;dr of the drama is that Google claimed to have developed an RL system that could do floorplanning better than humans. There was much disagreement within Google, but it was silenced, because Google had a promising agreement with Synopsys. A group of academics attempted to publish a work, known as Stronger Baselines, which showed that traditional techniques such as electrostatic placement and simulated annealing (which happens to be my PhD topic of research!) perform equal to or better than the new RL paper. This paper was silenced, and the authors were illegally fired from Google. Google claimed that the paper did not implement their AlphaChip algorithm correctly, but also refused to publish a work which made their algorithm reproducible.
In other words, Google uses the lack of reproducibility of their work against other researchers.
I strongly believe that the lack of reproducibility of works in the AI space is a crisis of academia, and that these works need to be held to higher standards. Otherwise, as can be seen in Google’s example, this can have grave effects on researchers and be used as a weapon against good science.
I do believe that it’s fair and reasonable to draw comparisons between this new Design Conductor paper and Google’s conduct in the AlphaChip paper: both are chip design papers involving allegedly impressive results, and both are completely unable to be reproduced whatsoever. It is a shame, truly, that this Design Conductor paper adds onto the pile of yet more irreproducible works.
Effects on pre-print servers
My strong opinion is that pre-print servers, such as arXiv, are increasingly becoming dumping grounds for corporate whitepapers void of any academic value whatsoever. We saw this, for example, in the GPT-4 whitepaper which does not include even the minimum necessary details about the model, such as the number of weights. This trend seems to continue in the AI age, and I believe this “Design Conductor” paper is the latest entry of note in the infiltration of corporate whitepapers into what was once an academic pre-print archive.
It is true that arXiv is a non-peer-reviewed site, as all pre-print servers are. There has never been, and there still is, absolutely no expectation of peer review. However, I argue that the publications uploaded to the website should at least be academic in nature. That is, I believe arXiv should cease acceptance of corporate whitepapers, papers that contribute no value to the field (i.e. just reproduce AI chat logs verbatim), and papers that are not reproducible by design. I think that arXiv serves an important goal in open access science, and it would be a shame to see that squandered in favour of this rubbish. It’s time that corporations either publish real papers, or stop leaning on the good will of arXiv.
What would I want to see?
Fundamentally, I don’t want to see AI chip papers at all!! But it would seem unfair to shred this paper without at least offering some actual structural critique.
To improve this paper, I would want to see a genuine, real, serious SoC design with memory, peripherals, and a pad frame. This could target any PDK, but for a true apples-to-apples comparison, it could target Skywater 130 nm or Global Foundries 180 nm, both of which are open-source, and hence can be compared against other open-source RISC-V implementations. In keeping with the paper’s methodology, this should all be done using their agentic AI system without human intervention.
Verification should be more rigorous: the agent should write RTL testbenches for individual components, and it should even do formal verification, given how powerful people claim AI agents are nowadays.
The paper, in general, should be written less like an advertising piece and acknowledge the obvious limitations of not only the core, but of the “agent coding” system as well. Nowhere else in academia would it be acceptable to spew so much gushing praise for a university-textbook-level design, or to make so many unfair comparisons to devices that teams of talented engineers spent thousands of hours slaving over… but it’s apparently fine here because, well, it was made by AI.
Once that is all done, the design should be sent through the usual PPA characterisation flows. If, then, the agent can produce a design that is as robust as Ibex or OpenTitan, and has a similar (estimated) PPA, then I would be impressed. Briefly. Then upset when I realise how many millions of tokens, and gigawatts of compute, were wasted in the fruitless pursuit of the automation of work I love. But alas.
Conclusion
As we see our institutions crumbling, our water being drained, our electricity harvested, our farmland razed, our internet polluted, our money taken, our devices locked down… I would like to leave you with Brandolini’s law:
The amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it.
Until next time.
Thank you to Thalia for proofreading :)
