Student, School of Information Technology and Engineering Kazakhstan-British Technical University, Republic of Kazakhstan, Almaty
A STREAMING FPGA SOC FOR INVERSE COLLATZ PREDECESSOR GENERATION
УДК 004.9
ABSTRACT
The Collatz conjecture attracts attention due to its simplicity of formulation and complexity of analysis. This paper describes a lightweight system-on-chip (SoC) for streaming inverse Collatz graph generation and subsequent writing of the results to an SD card. The design combines a PicoRV32 (RV32I) core with a hardware accelerator that expands the task list using inverse-preimage rules and communicates with the firmware via memory-mapped FIFO queues. Multi-block writes over SPI were used to reduce data transfer overhead and increase throughput. Experimental evaluation showed that the bottleneck of the system is not the hardware accelerator, but rather the CPU overhead and peripheral servicing. The final configuration achieves 375,590 B/s (93,897 words/s), while the throughput of the SD card alone is 659,820 B/s.
АННОТАЦИЯ
Гипотеза Коллатца привлекает внимание своей простотой формулировки и сложностью анализа. В данной статье описывается лёгкая система на кристалле для потоковой генерации обратного графа Коллатца с последующей записью результатов на SD-карту. Конструкция объединяет ядро PicoRV32 (RV32I) с аппаратным ускорителем, который расширяет список задач по правилам обратного прообраза и взаимодействует с микропрограммой через FIFO-очереди с отображением в память. Для снижения накладных расходов на передачу данных и повышения пропускной способности применялась многоблочная запись по SPI. Экспериментальная оценка показала, что узким местом системы является не аппаратный ускоритель, а накладные расходы ЦП и обслуживание периферии. Конечная конфигурация достигает 375 590 Б/с (93 897 Words/s), тогда как пропускная способность только SD-карты составляет 659 820 Б/с.
Keywords: Collatz conjecture; inverse graph; FPGA SoC; hardware accelerator; FIFO queues; SPI SD card.
Ключевые слова: гипотеза Коллатца; обратный граф; FPGA SoC; аппаратный ускоритель; FIFO очереди; SPI SD карта.
Introduction
The Collatz Conjecture is a simple but unsolved problem: x/2 for even x and 3x+1 for odd x, and it is claimed that every positive integer eventually converges to 1 [1; 2].
In this paper, we propose an FPGA-based SoC for constructing a Collatz preimage tree up to a configurable C_max. Unlike the work of Ito and Nakano [3; 4], the problem is treated as graph construction and verification: a hardware accelerator generates preimages via FIFO queues, and a RISC-V core manages the stopping conditions.
Ito and Nakano used an FPGA coprocessor to accelerate direct verification [3], subsequently increasing parallelism and throughput [4]. Ren et al. verified very large-width integers via bit strings and external storage [5]. Our prototype, in contrast, focuses on the 32-bit domain and explores the tradeoffs between accelerator, buffering, and SD card I/O.
Andrey and colleagues describe an explicit construction of Collatz trees using inverse rules [6], and Tezkan develops reduced inverse formulations [7]. Our contribution is a measured hardware and software implementation of these ideas in the form of a streaming FIFO path with an SD write pipeline. Alves et al. relate Collatz orbits to linear algebraic identities [8], Monks et al. analyze the graph through its modular structure and sufficient sets [9]. Mailland and Grobelna model it using Petri nets with an emphasis on inverse construction [10].
In general, existing work emphasizes the throughput of direct verification, representation scalability, or structural analysis of the graph [3–10]. The presented SoC complements these by considering the generation of inverse preimages as a streaming workload and quantifying the performance under limited FPGA resources.
We define the Collatz transform function
as
/Davletyrov.files/image002.png)
For given
, predecessor set ispre:
/Davletyrov.files/image004.png)
From (1) it follows that each y has an even preimage p0=2y, and an odd preimage when y=3p1+1 for odd p1, i.e.,
/Davletyrov.files/image005.png)
We generate a bounded reverse graph by emitting edges
for every
, that also satisfies
.
The preimage of p1 in equation (3) is derived from the inversion of the odd Collatz rule. If
, this means that
is also an integer for any given
. This means that for p1 to be an integer,
must be true. Furthermore, p1 must be odd, which constrains x to
. Substituting the previous statement yields
. Together, these two conditions reduce to
. This serves as both a necessary and sufficient condition for finding p1.
Materials and methods
Figure 1 describes the overall architecture based on the PicoRV32 (rv32i) core with memory-mapped peripherals [11; 12]. The configuration has three goals: a) the CPU manages system execution by signaling the return loop and stopping when termination conditions are reached; b) the hardware accelerator (HA) together with FIFO queues smooths out the difference in throughput between the fast HA and the slow CPU and SD write operations; c) the dump to the SD card is performed by firmware on the CPU in parallel with the loop management.
/Davletyrov.files/image015.png)
Figure 1. High-level architecture of a dataflow FPGA SoC
Figure 2 shows that the HA is wrapped in a small finite-state machine (FSM) that accounts for the latency of the input and output queues. When the FSM is enabled and the input queue is not empty, a pop command is issued, and the number extracted from the queue is made available in the next cycle and sent to the HA. The controller then sends p0 and p1 to the output queue, stopping if it is full to avoid data loss.
/Davletyrov.files/image016.png)
Figure 2. State machine for the hardware accelerator shell.
The system uses closed-loop firmware that redirects output data to the input, ensuring constant HA utilization and compensating for performance differences between the HA, CPU, and SD card. The SoC writes 512-byte sectors over SPI through an auxiliary module that offloads shift operations; multi-block writes maintain throughput. The overall performance is limited by the slowest element, so the key metrics are words/s for the HA and bytes/s for the SD card.
The accelerator is wrapped in a wrapper module (rcc_stream) in the form of an FSM that manages interaction with the FIFO. Since the input FIFO responds to a request with a one-cycle latency, the wrapper contains an intermediate wait state. The workflow is: wait for a non-empty input FIFO
extract the value
pass it to the accelerator
push p0, then p1 (if valid) to the output FIFO. When the output FIFO (out_full) is full, the controller holds the state and tries again without discarding data.
The input and output queues are synchronous FIFOs with MMIO wrappers providing 32-bit data and status registers. The final configuration: the input queue depth is 64, the output queue depth is 8192. Simultaneous push and pop operations when the FIFO is full are handled correctly: a write is allowed if a read occurs in the same cycle, freeing the slot.
A memory-mapped SPI module is used for writing to the SD card, handling clock generation and byte shifting. In streaming mode, the firmware sends a single CMD25 command, after which data is continuously transmitted through the module's TX FIFO at a fixed frequency of
= 13,5 MHz, reducing MMIO overhead compared to block-based writes. The output stream is written in 512-byte sectors as 32-bit unsigned integers in little-endian format. To increase throughput, sectors are batched and written with a single multi-block write command—the final configuration is 32 sectors per transaction.
Table 1 shows resource utilization on the Tang Nano 20K at 27 MHz: 22% logic elements and 20% registers. The main limitation is the on-chip memory: with an output queue depth of 8192, 16 single-port BSRAM blocks and 18 semi-dual-port (SDPB) blocks are used, which is 74% of the BSRAM utilization.
Table 1.
Resource usage after synthesis
|
Resource |
Used / Available |
Util. |
|
Logic |
4,367 / 20,736 |
22% |
|
Registers |
3,049 / 15,750 |
20% |
|
CLS |
3,778 / 10,368 |
37% |
|
I/O ports |
15 / 66 |
23% |
|
BSRAM |
16 SP + 18 SDPB |
74% |
The experiments were conducted on a Tang Nano 20K (GW2AR-LV18QN88C8/I7) board running at 27 MHz. The system includes a PicoRV32 (RV32I) core and memory-mapped peripherals: two MMIO interfaces for HA-CPU queues, an SPI auxiliary module, a UART, and a cycle counter. Memory mapping: cycle counter - 0x80000010-13, SPI - 0x80000030-3f, input queue - 0x80000040-4f, output queue - 0x80000050-5f.
The accelerator accepts 32-bit integers from the input queue and generates up to two preimages to the output queue. The firmware continuously redirects the output queue to the input until C_max is reached; The initial value of y is 1. The firmware was compiled statically with riscv64-unknown-elf-gcc for the RV32I (-march=rv32i2p0 -mabi=ilp32), with the -O1 optimization. The clock frequency is specified by the -DCLK_FREQ=Hz parameter in both the RTL and the firmware. The ~32 KiB SRAM (address width 13) is initialized via *.ini files using the method [13], after which the bitstream is synthesized and traced in the GoWin IDE.
Three benchmarks were run: a) HA only — words/s were measured without writing to SD, sensitivity to queue depth and backpressure was assessed; b) SD only — the firmware wrote a deterministic pattern using multi-block writes, the metric was bytes/s. c) End-to-end test – direct stream output to SD with periodic throughput reports.
Dump format: start marker sector, then a stream of 32-bit little-endian numbers at 512 bytes per sector, followed by an end marker sector. Sectors are combined into packets of configurable size and written with a single multi-block write command.
Throughput is calculated using the built-in cycle counter and the known frequency f_clk = 27 MHz:
/Davletyrov.files/image019.png)
/Davletyrov.files/image020.png)
The FIFO queue depth and multi-block write packet size are varied; for each configuration, the end-to-end throughput is recorded and qualitative effects, such as the impact of a nearly full output queue on system stability, are noted.
The correctness of the dump is verified in two ways: a) by confirming the rules for generating preimages and their order; b) by byte-by-byte comparison with the dump of a software reference model implementing the same rules (3) with an identical file structure. Additionally, the UART output and SD card dump are compared, and signal suppression conditions are checked: p1 is not output for
p0 is suppressed for
.
y ≢ 4 (mod 6), and p
Under non-saturation (
), the system behaves stably and deterministically. Rare sequence shift events were observed when the output queue was full, which are a consequence of backpressure rather than an accelerator error. This is considered a known limitation; throughput is estimated independently, and full correctness under saturation conditions requires further investigation.
Results and discussion
In HA-only mode, the CPU continuously empties the output queue while simultaneously feeding data to the input. As Table 2 shows, the cycles per word (cpw) remains constant (51) at all clock rates, while wps grows linearly—the system is limited by fixed control/I/O overhead, not throughput.
Table 2.
Throughput in HA mode only versus clock frequency
|
fclk (MHz) |
cpw (cycles/words) |
wps (words/s) |
|
27.0 |
51 |
529,408 |
|
13.5 |
51 |
264,704 |
|
9.0 |
51 |
176,469 |
|
6.75 |
51 |
132,352 |
Table 3 shows that output queue depth does not affect wps, but it does increase BSRAM consumption: as the depth increases from 512 to 8192, SDPB utilization increases from 42% to 74%.
Table 3.
Throughput and BSRAM usage in HA mode depending on output FIFO depth
|
IN depth |
OUT depth |
wps 27MHz |
Использование usage |
|
64 |
512 |
529,408 |
SP=16, SDPB=3 (42%) |
|
64 |
2048 |
529,408 |
SP=16, SDPB=6 (48%) |
|
64 |
8192 |
529,408 |
SP=16, SDPB=18 (74%) |
In SD-only mode, throughput increases with packet size due to a reduction in overhead. For packets larger than 32 sectors, the data does not fit in SRAM, so 32 is the optimal value, as can be seen in table 4.
Table 4.
SoC throughput in SD mode
|
Packets (блоки) |
B/s (average) |
B/s (min-max) |
|
4 |
550,885 |
504,198–557,808 |
|
8 |
600,683 |
545,588–610,189 |
|
16 |
636,245 |
570,039–639,978 |
|
32 |
659,820 |
653,284–669,460 |
The results are of end-to-end mode (HA+SD) in table 5 show significant performance drop compared to isolated tests: the CPU becomes the bottleneck, simultaneously servicing the push/pop FIFO and the SD packet write.
Table 5.
27 MHz Bandwidth Summary.
|
Benchmark |
Words/s |
Bytes/s |
|
HA |
529,408 Words/s |
— |
|
SD |
— |
659,820 Bytes/s |
|
HA+SD |
93,897 Words/s |
375,590 Bytes/s |
Conclusion
Conclusions. This paper examines a PicoRV32 (rv32i)-based SoC for generating a reverse Collatz tree, with FIFO queues and auxiliary modules, implemented on a Tang Nano 20K at 27 MHz. In HA mode, a stable throughput of 51 cycles/word (529,408 words/s) was achieved; queue depth has no effect on throughput but increases BSRAM consumption from 42% to 74%. Multi-block writes to SD memory increased throughput from 550,885 to 659,820 B/s as the packet size increased from 4 to 32 sectors. End-to-end throughput was 375,590 B/s, with the bottleneck being the CPU load when servicing HA and SD memory simultaneously.
In output queue saturation mode, rare sequence mismatch events were observed due to backpressure. Future plans include addressing this limitation, reducing CPU load, and narrowing the throughput gap between end-to-end and SD-only modes.
References:
- Lagarias, J. C. The 3 x + 1 problem and its generalizations / J. C. Lagarias// The American Mathematical Monthly. — 1985. — Vol. 92. — P. 3–23.
- Lagarias, J. C. The 3x+ 1 problem: An overview / J. C. Lagarias. — American Mathematical СнКiety, 2010. — Vol. 1. — P. 3–29.
- Ito, Y. A Hardware-Software Cooperative Approach for the Exhaustive Verification of the Collatz Conjecture / Y. Ito, K. Nakano. — IEEE, 2009.
- Ito, Y. Efficient exhaustive verification of the Collatz conjecture using DSP48E blocks of Xilinx Virtex-5 FPGAs / Y. Ito, K. Nakano. — IEEE, 2011.
- Ren, W. Collatz conjecture for 2^100000−1 is true: algorithms for verifying extremely large numbers / W. Ren [et al.] // 2018 IEEE SmartWorld (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). — IEEE, 2018. — P. 411–416. — URL: https://ieeexplore.ieee.org/document/8560077/ (дата обращения: 11.03.2026).
- Andrei, T. About the collatz conjecture : Tech. Rep. / T. Andrei, C. Masalagiu. — 1998.
- Tezcan, E. On collatz conjecture / E. Tezcan. — 2019. — URL: http://arxiv.org/abs/1902.07312 (дата обращения: 11.03.2026).
- Alves, J. F. A linear algebra approach to the conjecture of collatz / J. F. Alves [et al.] // Linear Algebra and Its Applications. — 2005. — Vol. 394. — P. 277–289.
- Monks, K. Strongly sufficient sets and the distribution of arithmetic sequences in the 3 x + 1 graph / K. Monks [et al.] // Discrete Mathematics. — 2013. — Vol. 313. — P. 468–489.
- Mailland, D. A novel approach to the collatz conjecture with petri nets / D. Mailland, I. Grobelna // Information (Switzerland). — 2025. — Vol. 16.
- Jahnke, M. Performance evaluation of picorv32 risc-v softcore for resource-constrained devices / M. Jahnke, L. Bublitz, U. Kulau // 2023 IEEE Nordic Circuits and Systems Conference (NorCAS). — IEEE, 2023. — P. 1–6.
- Todd, D. W. Tightly coupling the PicoRV32 RISC-V processor with custom logic accelerators via a generic interface / D. W. Todd. — Clemson University, 2021.
- Huhler G. Picorv32 Tang Nano Unified: PicoRV32 мини-СнК для FPGA-плат Tang Nano 9K и 20K [Электронный ресурс]. — 2026. — URL: https://github.com/grughuhler/picorv32_tang_nano_unified (дата обращения: 07.03.2026).