DISCOVERING SEARCH ALGORITHM IN ARRAYS WITH DEEP REINFORCEMENT LEARNING

Kabdolla T., Yeliussizov D.
DOI: 10.32743/UniTech.2026.145.4.22435
Received: 31.03.2026
Accepted for publication: 14.04.2026
Published: 28.04.2026

 

ABSTRACT

The paper proposes an approach to the automatic design of array search algorithms using deep reinforcement learning (Deep RL). Classical binary search provides logarithmic O(log n) complexity, but its actual speed is limited by constant factors: branch-predictor misses and cache latencies. We model the search as a turn-based Assembly Search Game, where the agent operates in a discrete environment using elementary steps and "jumps" that divide the remaining range into 2, 3, or 5 parts. The training algorithm is a Deep Q-Network (DQN) with an experience replay buffer and an ε-greedy strategy. Experiments on arrays of up to 1000 elements showed that, for sorted data, the agent independently learns a quinary search, reducing the average number of comparisons from 10 to 7 and speeding up execution by 25–30%. In the randomly shuffled mode, learning gets stuck in a trivial strategy, which emphasizes the importance of a structured reward signal. Even a simple DQN architecture can thus generate a faster search algorithm without human intervention.


 

Keywords: algorithm discovery, reinforcement learning, program synthesis, symbolic reasoning, search algorithms, deep reinforcement learning.


 

Introduction

Searching for an element in an array is a fundamental task in many application areas: from SQL indexes to real-time systems and compilers [1].

Classical binary search provides an asymptotically optimal complexity of O(log n), which is formally confirmed by the decision-tree model that gives a lower bound of ⌈log₂(n + 1)⌉ comparisons [2]; for n = 1000 this amounts to 10 comparisons. However, even with the same growth order, different implementations may differ in the actual number of instructions, cache misses, and branch mispredictions, which is critical for high-performance systems.

Existing algorithms are effective in static settings, but when data are dynamic or unevenly distributed, these methods have limitations in adaptability and processing speed.

Several researchers have shown that the branching nature of binary search leads to frequent branch-predictor misses, and thus to wasted processor cycles [3]. In addition, the "half-and-half" alternation of probed addresses creates a pathological access pattern for the cache hierarchy, causing repeated cache misses as the array grows [4]. In practice, this makes constant factors as significant as the logarithmic complexity itself.

The answer was branchless versions of search, in which conditional jumps are replaced with arithmetic and CMOV instructions; such implementations provide a 2–3x gain on modern CPUs [5]. However, these solutions require a deep understanding of the architecture, are not universal, and remain limited by a statically defined range-division strategy.

In 2023, DeepMind presented AlphaDev, which showed that a reinforcement learning (RL) agent can discover new sorting algorithms that surpass human records and are already included in the LLVM standard library [6, 7]. A year earlier, the same approach was used in AlphaTensor to find more efficient matrix multiplication schemes [8]. Such work builds on the concept of Neural Algorithmic Reasoning [9] and on the growing field of RL-oriented program synthesis [10]. At the core of many of these systems is the AlphaZero scheme: Monte Carlo Tree Search (MCTS) with a policy-value network trained by self-play [11].

"Symbolic metaprogram search improves learning efficiency and explains rule learning in humans" presents the MetaProgram Learner (MPL), which mimics human learning by editing and refining rules on the fly [12]. Equally engaging is the work on Reinforcement Programming (RP) [13]. Unlike genetic programming that relies on bulky expression trees, RP trains algorithms in a game-like fashion, resulting in RPSort — an algorithm that sorts more efficiently than Bubble Sort.

Several studies have demonstrated the potential of DRL for automatic algorithm discovery: In [14] the authors address the synthesis of complex programs from input-output examples using reinforcement learning; [8] presents a DRL approach for discovering faster matrix multiplication algorithms. In the field of sorting algorithms, [15] and [13] show that DRL and reinforcement programming can yield new, efficient sorting methods. Traditional search algorithms have been reviewed and compared in [2]. These works indicate that DRL can be an effective tool for uncovering novel and efficient algorithmic strategies.

Despite advances in sorting, matrix multiplication, and general program synthesis (e.g., AlgoPilot [16]), search problems have remained largely outside the scope of DRL research. There are no papers in the open literature that directly compare RL-synthesized search methods with binary search in terms of the number of comparisons and precise architectural metrics.

Materials and Methods

We formulate the search problem as a finite-state game in which a Deep Q-Network (DQN) agent interacts with the SearchEnv environment. The state is encoded as a two-part one-hot vector (cursor position + target location); the agent chooses among the primitive actions "step left", "step right", "compare", and dynamic "jumps" of ±(pos // k). Training is done off-policy with an experience replay buffer and an ε-greedy strategy, and the network weights are optimized with the Adam method.

A. SearchEnv Environment Design

1) State: Each observation sₜ is formed by concatenating two one-hot vectors: the pointer (cursor) position and the target index.
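
For illustration, a minimal NumPy sketch of this encoding (the function name encode_state and the exact layout of the two one-hot blocks are our own, introduced only for this example):

import numpy as np

def encode_state(pos: int, target: int, n: int) -> np.ndarray:
    """Concatenate two one-hot vectors: cursor position and target index."""
    state = np.zeros(2 * n, dtype=np.float32)
    state[pos] = 1.0            # first block: one-hot cursor position
    state[n + target] = 1.0     # second block: one-hot target index
    return state

# Example: array of length 8, cursor at index 2, target at index 5
s = encode_state(pos=2, target=5, n=8)   # vector of shape (16,)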

2) Action Space:

- A0 … A2 – step left, step right, compare.

- A3 … A[3+|K|−1] – "jump left": the cursor moves left by pos // k, k ∈ K.

- A[3+|K|] … A[3+2|K|−1] – "jump right": the cursor moves right by pos // k, k ∈ K.

The set of divisors K = {2, 3, 5} is inspired by ideas about uneven division of the range, similar to branchless search [5].

3) Reward:

– −0.01 for each step (incentivizes short trajectories).

– −1.0 for an incorrect match.

– +10.0 for a successful match (end of episode).

This scale makes the trajectory length the main source of negative reward, while the final match is a rare but strong positive signal, in line with standard DQN practice [17].
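
To make the action and reward conventions concrete, below is a minimal Gym-style sketch of such an environment. This is our own simplified reconstruction from the description above; the class name SearchEnv follows the text, but the exact jump arithmetic (a shift of pos // k clamped to the array bounds) and the data generation are assumptions, not the authors' reference code.

import random
import numpy as np

K = [2, 3, 5]                              # divisors for the "jump" actions

class SearchEnv:
    """Toy environment: actions 0-2 are step left / step right / compare,
    actions 3..5 are 'jump left' and 6..8 are 'jump right' with k in K."""

    def __init__(self, n=100):
        self.n = n

    def reset(self, sorted_mode=True):
        self.arr = sorted(random.sample(range(self.n * 10), self.n))
        if not sorted_mode:
            random.shuffle(self.arr)
        self.target_idx = random.randrange(self.n)
        self.target = self.arr[self.target_idx]
        self.pos = 0
        return self._state()

    def _state(self):
        s = np.zeros(2 * self.n, dtype=np.float32)
        s[self.pos] = 1.0                  # cursor one-hot
        s[self.n + self.target_idx] = 1.0  # target one-hot
        return s

    def step(self, a):
        reward, done = -0.01, False        # small cost per step
        if a == 0:                         # step left
            self.pos = max(0, self.pos - 1)
        elif a == 1:                       # step right
            self.pos = min(self.n - 1, self.pos + 1)
        elif a == 2:                       # compare
            if self.arr[self.pos] == self.target:
                reward, done = 10.0, True  # successful match ends the episode
            else:
                reward = -1.0              # incorrect match
        else:                              # jumps by ±(pos // k)
            k = K[(a - 3) % len(K)]
            delta = max(1, self.pos // k)
            if a < 3 + len(K):
                self.pos = max(0, self.pos - delta)           # jump left
            else:
                self.pos = min(self.n - 1, self.pos + delta)  # jump right
        return self._state(), reward, done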

B. Agent Architecture

1) Network: A three-layer fully connected perceptron (128 hidden neurons, ReLU).
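
A sketch of such a network in PyTorch; reading "three-layer perceptron" as two hidden layers of 128 units plus an output layer is our interpretation:

import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected perceptron mapping a state vector to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # Q(s, ·) over all actions
        )

    def forward(self, x):
        return self.net(x)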

2) Learning mechanism:

– Adam optimizer, a popular choice for RL [18].

– The experience buffer |D| = 2000 implements experience replay [19].

– Mini-batch size 64; discount factor γ.

– ε-greedy exploration with ε decayed at every step, a classical scheme further motivated by analyses of reward-based ε-decay [20].
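
As an illustration of the exploration scheme, a self-contained ε-greedy sketch; the starting value, floor, and decay rate are placeholders of our own, since the paper's exact schedule is not reproduced here:

import random
import numpy as np

EPS_START, EPS_MIN, EPS_DECAY = 1.0, 0.05, 0.995   # placeholder schedule

def epsilon_greedy(q_values: np.ndarray, eps: float) -> int:
    """Random action with probability eps, otherwise the greedy (argmax) action."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

eps = EPS_START
for step in range(1000):                 # toy loop; Q(s, ·) would come from the network
    q_values = np.random.rand(9)         # stand-in for the 9 action values
    action = epsilon_greedy(q_values, eps)
    eps = max(EPS_MIN, eps * EPS_DECAY)  # multiplicative per-step decay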

3) Limitations and extensions: Although the base implementation uses DQN, it can easily be replaced with a Policy/Value network plus an MCTS planner (AlphaZero style) [21]. We plan this upgrade as future work to move closer to the AlphaDev approach [22].

C. Training Procedure

1) Episode initialization: a random array arr, a target value target, and a sort flag.

2) Cycle: Each training iteration consists of the following steps:

– The agent selects an action aₜ.

– The environment returns (sₜ₊₁, rₜ, done).

– The transition is saved to the replay buffer.

– Every t steps a gradient update is performed on the standard DQN loss

L(θ) = E[(r + γ · max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²],

where s is the current state, a the chosen action, r the received reward, s′ the next state, γ the discount factor (0 < γ < 1), Q(s, a; θ) the main Q-network with parameters θ, and Q(s′, a′; θ⁻) the frozen (target) Q-network with parameters θ⁻.

3) ε is decremented, and the target network is synchronized when required.

The implementation follows OpenAI Gym best practices for custom environments [23].
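
A condensed sketch of one such update in PyTorch. The buffer size, mini-batch size, and the use of a frozen target network follow the text; the learning rate, the value of γ, and the state size (shown here for n = 100) are placeholders of our own:

import copy
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 2 * 100, 9                 # one-hot state for n = 100, 9 actions
GAMMA, BATCH = 0.99, 64                           # γ (placeholder) and mini-batch size

q_net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, N_ACTIONS))
target_net = copy.deepcopy(q_net)                 # frozen copy, synced periodically
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # lr is a placeholder
replay = deque(maxlen=2000)                       # experience buffer |D| = 2000
# replay stores tuples (s, a, r, s_next, done)

def dqn_update():
    """One gradient step on a random mini-batch sampled from the replay buffer."""
    if len(replay) < BATCH:
        return
    s, a, r, s2, done = zip(*random.sample(replay, BATCH))
    s    = torch.tensor(np.array(s), dtype=torch.float32)
    s2   = torch.tensor(np.array(s2), dtype=torch.float32)
    a    = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r    = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    q_sa = q_net(s).gather(1, a).squeeze(1)                        # Q(s, a; θ)
    with torch.no_grad():                                          # target uses θ⁻
        target = r + GAMMA * (1 - done) * target_net(s2).max(1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()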

D. Policy Extraction and Analysis

After training, we fix ε = 0 and compute the argmax action for each position; we then apply run-length encoding (RLE) and plot the action-frequency histogram. This allows us to visually detect regular patterns and to compute the average number of comparisons versus binary search. Matplotlib renders the bar chart, and tqdm makes it easy to monitor training progress.
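
A minimal sketch of this extraction step (function and variable names are ours; the target index is held fixed for readability):

import numpy as np
import torch

def extract_policy(q_net, n: int, target_idx: int = 0):
    """Greedy (ε = 0) action for every cursor position, for one fixed target."""
    actions = []
    for pos in range(n):
        s = np.zeros(2 * n, dtype=np.float32)
        s[pos] = 1.0
        s[n + target_idx] = 1.0
        with torch.no_grad():
            actions.append(int(q_net(torch.from_numpy(s)).argmax()))
    return actions

def rle(seq):
    """Run-length encode the action sequence to expose regular patterns."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1
        else:
            runs.append([x, 1])
    return [(x, c) for x, c in runs]

# Usage: actions = extract_policy(q_net, n=100); print(rle(actions))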

E. Libraries and Tools Used

Table 1.

Libraries and tools used

Library      Purpose                               Version
NumPy        state vector processing               1.26+
PyTorch      network and training implementation   2.3 (LTS)
tqdm         progress indicator                    4.66
Matplotlib   policy visualization                  3.9

All libraries were chosen for their maturity, active community, and GPU support.

F. Evaluation Metrics

– C̄ – average number of comparisons to success.

– L̄ – average episode length (steps).

– R̄ – average total reward.

– ΔCPU – relative speedup on a cycle-accurate simulator compared to branchless binary search [5].

The C̄, L̄, and R̄ metrics are collected during training. ΔCPU requires additional profiling at the assembly level, which is planned for the Results section.

G. Compliance with the Objectives of the Article

This methodology implements the points stated in the introduction: it describes the algorithm search process as a game, uses a deep Q-network architecture, optimizes by rewarding short trajectories, and is easily extended to AlphaZero-style MCTS while remaining compatible with research in Neural Algorithmic Reasoning [9] and AlphaTensor [8].

Results and Discussion

A. Experimental Setup

Array length n = 1000.

Action set: A = {step-left, step-right, cmp, jump-k-left, jump-k-right | k ∈ {2, 3, 5}}, |A| = 9.

Training: 1500 episodes on two environment variants: sorted (the array is pre-sorted) and unsorted (the array is randomly shuffled before each episode).

Metrics: average number of comparisons C̄, average trajectory length L̄, and the resulting policy (action frequencies).

B. Learned Policies

Figure 1 and Figure 2 show the post-training action frequencies.

 

Figure 1. Policy for sorted arrays

 

For sorted data, the agent chooses the jump with division by 5 in 98% of moves. Jumps with division by 2 and 3 occasionally appear, forming a symmetric "pushing" structure. This is effectively quinary branching: the range is reduced by a factor of 5 each time. Occasionally the agent performs a compensatory jump, preserving the invariant "cursor ≈ 20% of the remaining range", which is similar to interpolation search but with a fixed divisor of 5.

 

Figure 2. Policy for unsorted arrays

 

For unsorted data, the agent uses a single action, a jump to the right with division by 3, in all states. This strategy does not lead to finding the target, and the agent receives the minimal reward. The results show that without sorting, learning gets "stuck" in a local minimum.

C. Algorithm Evaluation

To evaluate the efficiency of the search strategies, two action schemes were compared empirically: classic binary search and the "divide by 5" strategy, in which at each step the agent moves to a position shifted by 0.2 of the remaining range. The main comparison criterion was the worst-case number of comparisons.
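
For readability, a plain-Python rendering of the "divide by 5" strategy as we understand it from this description: on a sorted array the probe is placed at roughly 0.2 of the remaining range, and the range is narrowed after each comparison. This is our own reconstruction for the comparison, not code produced by the agent.

def quinary_jump_search(arr, target):
    """Probe at ~1/5 of the remaining range of a sorted array; return (index, comparisons)."""
    lo, hi, comparisons = 0, len(arr) - 1, 0
    while lo <= hi:
        step = max(1, (hi - lo + 1) // 5)      # shift by 0.2 of the remaining range
        probe = min(hi, lo + step)
        comparisons += 1
        if arr[probe] == target:
            return probe, comparisons
        if arr[probe] < target:
            lo = probe + 1                     # target lies to the right of the probe
        else:
            hi = probe - 1                     # target lies to the left of the probe
    return -1, comparisons

# Example: quinary_jump_search(list(range(0, 2000, 2)), 1234)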

 

Figure 3. Number of Comparisons: Binary Search vs. Division by 5

 

The graph compares the number of steps as a function of the array size n. The division-by-5 strategy has an asymptotic depth of O(log₅ n), while binary search has O(log₂ n), so formally the number of iterations for division by 5 is smaller. Despite this, each individual comparison discards a smaller fraction of the array than in binary search and requires additional logic to bound the range correctly. Thus, despite the visually smaller depth of the decision tree, the division-by-5 strategy does not exceed binary search in practical efficiency: binary search is optimal in terms of the number of actions and the information value of each comparison.
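
For concreteness, a short calculation of the idealized iteration counts behind the curves in Figure 3, under the assumption that each iteration shrinks the range by a constant factor (2 or 5); it does not account for the extra per-iteration logic discussed above:

import math

for n in (10, 100, 1000, 10_000):
    binary  = math.ceil(math.log2(n))      # levels of a binary decision tree
    quinary = math.ceil(math.log(n, 5))    # levels when the range shrinks 5x per step
    print(f"n={n:>6}: binary={binary:>2}, divide-by-5={quinary:>2}")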

Conclusion

The DRL agent is able to independently reconstruct a k-ary (k = 5) search that outperforms classical binary search in the number of comparisons. This confirms the hypothesis that Deep RL is suitable for constant-factor optimization of algorithms whose asymptotics are already optimal.

Data sorting is critical; without it, the reward signal is sparse, and the ε-greedy DQN does not escape the trivial strategy. More informative rewards or learning from demonstrations are needed.

A. Limitations

– Static set of divisors (2, 3, 5).

– No target network or Double DQN, so Q-value overestimation is possible.

– Small set of instructions.

B. Future Work

– Transition to a Policy/Value network with MCTS (AlphaZero style), trained from scratch over a more flexible set of commands.

– Extension to larger array lengths and adaptive selection of k on the fly (dynamic k-ary search).

– Adding a hardware-dependent reward (latency, cache misses) instead of a pure comparison counter.

– Expansion of the list of admissible actions and commands to find more complex patterns.

The study demonstrates that even a simple DQN architecture can derive an engineering-useful search algorithm that was not previously hand-coded by a human. The results open the way to automated "fine-tuning" of a family of base algorithms for specific hardware platforms.

 

References:

1. Binary search algorithm // Wikipedia. URL: https://en.wikipedia.org/wiki/Binary_search_algorithm (accessed: 30.03.2025).
2. Sultana N., Paira S., Chowdhury S. A., Saha S. A brief study and analysis of different searching algorithms // 2017 Second International Conference on Electrical, Computer and Communication Technologies (ICECCT). IEEE, 2017.
3. Khuong P.-V. Retrospective on binary search and compilation. 2015. URL: https://pvk.ca (accessed: 30.03.2025).
4. Khuong P.-V. Binary search is a pathological case for caches. 2012. URL: https://pvk.ca (accessed: 30.03.2025).
5. Skarupke M. Beautiful branchless binary search. 2023. URL: https://probablydance.com (accessed: 30.03.2025).
6. Mankowitz D. J., Michi A., Zhernov A. et al. Faster sorting algorithms discovered using deep reinforcement learning // Nature. 2023. Vol. 618. P. 257–263.
7. Knight A. AI speeds computer code building blocks // Axios. 2023. URL: https://axios.com (accessed: 30.03.2025).
8. Fawzi A., Balog M., Huang A. et al. Discovering faster matrix multiplication algorithms with reinforcement learning // Nature. 2022. Vol. 610. P. 47–53.
9. Velickovic P., Blundell C. Neural algorithmic reasoning // arXiv:2105.02761. 2021.
10. Silver W. et al. Program synthesis guided reinforcement learning // NeurIPS. 2021.
11. Varty J. AlphaZero and Monte Carlo Tree Search. 2024. URL: https://joshvarty.github.io (accessed: 30.03.2025).
12. Rule J. S., Piantadosi S. T., Cropper A. et al. Symbolic metaprogram search improves learning efficiency and explains rule learning in humans // Nature Communications. 2024. Vol. 15.
13. White S. K., Martinez T., Rudolph G. Generating a novel sort algorithm using reinforcement programming. 2020.
14. Li Y., Choi D., Chung J. et al. Competition-level code generation with AlphaCode. 2022. URL: http://arxiv.org/abs/2203.07814 (accessed: 30.03.2025).
15. Mankowitz D. J., Michi A., Zhernov A. et al. Faster sorting algorithms discovered using deep reinforcement learning // Nature. 2023. Vol. 618. P. 257–263.
16. Yin X. AlgoPilot: fully autonomous program synthesis without human-written programs // arXiv:2501.06423. 2025.
17. Mnih V. et al. Human-level control through deep reinforcement learning // Nature. 2015.
18. PyTorch // Wikipedia. URL: https://en.wikipedia.org/wiki/PyTorch (accessed: 30.03.2025).
19. Zhang J. et al. Revisiting fundamentals of experience replay // arXiv:2007.06700. 2020.
20. Garg A. Reward-based ε-decay. 2024. URL: https://aakash94.github.io (accessed: 30.03.2025).
21. Matas S. UCB, MCTS, AlphaZero // Lecture slides, ufal.mff.cuni.cz. 2024.
22. Google DeepMind. AlphaDev discovers faster sorting algorithms // Blog. 2023. URL: https://deepmind.google (accessed: 30.03.2025).
23. Patil S. Building custom RL environments with OpenAI Gym // Medium. 2022. URL: https://medium.com (accessed: 30.03.2025).
Information about the authors

Master's Student, School of Information Technology and Engineering, Kazakh British Technical University (KBTU), Kazakhstan, Almaty

PhD, School of Information Technology and Engineering, Kazakh British Technical University (KBTU), Kazakhstan, Almaty
