REINFORCEMENT LEARNING APPROACH TO ADAPTIVE MAINTENANCE SCHEDULING IN SEMICONDUCTOR MANUFACTURING

Mykhaylov M., Wang A.
Citation: Mykhaylov M., Wang A. Reinforcement Learning Approach to Adaptive Maintenance Scheduling in Semiconductor Manufacturing // Universum: технические науки: электрон. научн. журн. 2025. 3(132). URL: https://7universum.com/ru/tech/archive/item/19517
DOI: 10.32743/UniTech.2025.132.3.19517

 

ABSTRACT

Effective maintenance scheduling is crucial in semiconductor manufacturing due to the high uptime requirements of production machinery and the substantial financial losses associated with downtime. Traditional scheduling methods often rely on predefined heuristics or static optimization approaches that do not fully leverage the available data on machine performance and future production demands. In this study, we develop a reinforcement learning (RL) model designed to optimize maintenance scheduling in a semiconductor fabrication environment. Our model integrates historical maintenance records, real-time machine performance data, and future order projections to generate adaptive maintenance schedules. By allowing an RL agent to interact with a simulated production environment, the system learns to make scheduling decisions that minimize downtime while ensuring equipment reliability. Experimental results demonstrate that our RL-based scheduler outperforms conventional industry-standard scheduling techniques by up to 30%, offering a more efficient and data-driven solution for maintenance planning in semiconductor manufacturing.


 

Keywords: Reinforcement Learning, Maintenance Scheduling, Semiconductor Manufacturing, Adaptive Scheduling, Production Optimization


 

Introduction. Semiconductor manufacturing presents unique challenges in maintenance scheduling due to the highly specialized nature of production equipment and the stringent uptime requirements. Each manufacturing contract comes with distinct specifications, including production capacity, chip technology, and urgency, all of which influence the rate of equipment wear and degradation. Furthermore, there is a strong correlation between equipment condition and chip yield, meaning that while extended operation without maintenance may seem beneficial in the short term, it can lead to diminished productivity and increased defect rates. Additionally, semiconductor fabrication tools are highly complex and expensive, and their failure can require days or even weeks to repair, resulting in significant production losses.

Traditional maintenance scheduling approaches, such as calendar-based maintenance, often lead to unnecessary downtime, reducing overall efficiency and failing to account for the dynamic nature of semiconductor production. Conversely, reactive strategies like ad-hoc maintenance address failures only after they occur, leading to costly disruptions and making them unsuitable for modern high-throughput manufacturing environments. To overcome these limitations, an adaptive system that continuously evaluates equipment conditions and optimally schedules maintenance is necessary. This research explores the application of reinforcement learning (RL) to develop an intelligent maintenance scheduling system. By leveraging historical maintenance data, real-time equipment performance metrics, and future production requirements, our RL-based approach aims to generate optimized maintenance schedules that maximize uptime and efficiency while minimizing operational risks.

Methodology

Problem Formulation

The initial approach to modeling maintenance scheduling in semiconductor manufacturing was based on the Economic Lot Scheduling Problem (ELSP) [2], a well-known problem in operations management. ELSP traditionally involves a single machine capable of producing multiple products at a constant rate, with a fixed demand and switching cost for each product. The objective is to schedule production in a way that minimizes overall costs, including production and storage expenses. In our initial formulation, maintenance was treated as an additional "product" with its own demand, cost, and "production capacity," allowing the scheduling algorithm to determine the optimal proportion of time allocated to maintenance. However, this approach proved inadequate for long-term planning (beyond one month), as it failed to account for the dynamic nature of semiconductor production, where the production rate depends on the machine's current condition and its maintenance history.
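For reference, a minimal sketch of the classical ELSP objective under the common independent-cycles relaxation is given below; the notation (setup cost A_i, holding cost h_i, demand rate d_i, production rate p_i, and cycle length T_i for product i) is ours rather than the paper's. In our initial formulation, maintenance would simply appear as one additional index with its own "demand" and cost.

```latex
% Independent-cycles relaxation of the ELSP (illustrative; our notation).
% Each product i is produced in cycles of length T_i with lot size d_i T_i.
\min_{T_1,\dots,T_n}\;\sum_{i=1}^{n}\left[\frac{A_i}{T_i}
  + \frac{1}{2}\,h_i d_i\!\left(1-\frac{d_i}{p_i}\right)T_i\right]
\qquad \text{subject to} \qquad \sum_{i=1}^{n}\frac{d_i}{p_i}\le 1.
```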

To address these limitations, we redefined the problem by shifting from a demand-based to an order-based representation of semiconductor production. Instead of modeling production as daily quotas for specific chip types, the system was redesigned to consider upcoming orders, incorporating parameters such as the total number of chips required, die area, deadlines, and penalties for missed deliveries. The state of the machine and its maintenance needs were inferred from simulated sensor readings, including temperature fluctuations and vibration patterns, rather than being explicitly dictated by predefined maintenance cycles. Under this revised framework, the goal of the reinforcement learning algorithm was to maximize overall profit, which inherently incentivized optimal maintenance scheduling by balancing machine uptime, production efficiency, and long-term equipment health.
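A minimal sketch of this order-based representation is shown below, assuming plain Python dataclasses; the field names and units are illustrative and do not reproduce the exact schema of the implementation [5].

```python
from dataclasses import dataclass

# Illustrative sketch of the order-based state; field names and units are ours.
@dataclass
class Order:
    quantity: int          # total number of chips required
    die_area_mm2: float    # die area; larger dies reduce effective yield
    price_per_chip: float
    deadline_days: int     # days remaining until delivery is due
    late_penalty: float    # charged per day of delay past the deadline

@dataclass
class MachineState:
    status: str                    # "working", "maintained", or "broken"
    days_since_maintenance: int
    days_since_breakdown: int
    maintenance_scheduled: bool
    days_to_next_maintenance: int  # 0 = today or not scheduled
```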

Environment Simulation

The environment simulation model receives inputs from several sources, including future orders, the current state of the machine, production data, past maintenance records, and future scheduled maintenance. The simulation operates on a daily time step, with each state transition representing a one-day interval [8]. Future orders are modeled as the next five orders in the queue, each characterized by the order quantity, chip area, price per chip, deadline in days, and a late penalty, defined as five times the expected daily profit per day of delay. These simplified parameters were estimated from historical data and industry benchmarks.

The machine state is represented by one of three possible states: working, maintained, or broken. The yield of the machine follows an exponential decay function from the last day it was maintained, adjusted to account for larger chip sizes, which reduce the yield [4]. The number of chips produced on a given day depends on the yield and the chip size specified in the current order. Past maintenance records provide inputs such as the number of days since the machine was last maintained or broken. Future scheduled maintenance is represented by a binary indicator of whether maintenance is scheduled and the number of days until the next maintenance event, with a value of 0 indicating maintenance is either occurring today or not scheduled at all. A visual representation of these inputs is provided in Figure 1.
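As an illustration of the yield model described above, the sketch below assumes an exponential decay from the last maintenance day and a simple penalty for larger dies; the functional form of the area adjustment and all constants are our assumptions, not the simulator's exact parameters.

```python
import math

# Hedged sketch of daily output: yield decays exponentially since the last
# maintenance and is further reduced for larger dies. Constants are illustrative.
def chips_produced_today(days_since_maintenance: int, die_area_mm2: float,
                         daily_wafer_area_mm2: float = 500_000.0,
                         base_yield: float = 0.95, decay_rate: float = 0.02) -> int:
    yield_fraction = base_yield * math.exp(-decay_rate * days_since_maintenance)
    yield_fraction *= min(1.0, 100.0 / die_area_mm2)     # crude large-die penalty
    gross_dies = daily_wafer_area_mm2 / die_area_mm2      # dies that fit in a day's wafer starts
    return int(gross_dies * yield_fraction)
```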

 

Figure 1. Representation of semiconductor production state

 

Based on these inputs, the model generates one of three possible actions: (1) do nothing, proceeding to the next day without altering the maintenance schedule; (2) schedule maintenance, adding or replacing a maintenance event 7, 14, 30, 60, or 90 days in the future depending on the specific action; or (3) deschedule maintenance, removing a previously scheduled maintenance event. The model assigns a reward to each state based on daily production, late penalties, and maintenance costs. The daily production reward is proportional to the number of chips produced that satisfy the current order, calculated at the order’s chip cost rate. If no chips are produced, the reward is zero. A late penalty is subtracted from the reward if the production deadline for the current order is missed. Additionally, if maintenance is performed, the reward is reduced by the cost of maintenance.
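To make the action space and reward structure concrete, a hedged Python sketch follows; the offsets mirror the scheduling actions described above, while all names and cost values are illustrative rather than taken from the implementation.

```python
# Actions: 0 = do nothing, 1-5 = schedule maintenance at one of the offsets
# below (days ahead), 6 = deschedule the pending maintenance event.
SCHEDULE_OFFSETS = (7, 14, 30, 60, 90)

def daily_reward(chips_today: int, price_per_chip: float,
                 days_past_deadline: int, late_penalty_per_day: float,
                 maintenance_performed: bool, maintenance_cost: float) -> float:
    reward = chips_today * price_per_chip      # zero if nothing was produced
    if days_past_deadline > 0:
        reward -= late_penalty_per_day         # charged for each day past the deadline
    if maintenance_performed:
        reward -= maintenance_cost
    return reward
```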

Reinforcement Learning Model

The reinforcement learning model is based on the Proximal Policy Optimization (PPO) algorithm, as proposed by [6]. PPO employs two neural networks with identical architectures: an actor network that interacts with the environment and a critic network that evaluates the performance of the actor. The training process proceeds iteratively, with each iteration consisting of the following steps. First, the current policy is executed in the environment for 1,000 steps, during which tuples of (state, action, reward) are collected and stored in memory. Next, the advantage of the taken actions is calculated using the critic network. The model then processes minibatches from the collected data, using the computed advantages to calculate the loss for the actor network. Finally, the loss is backpropagated, and both the actor and critic networks are updated using the Adam optimizer [3]. This architecture is illustrated in Figure 2.
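The sketch below illustrates one such iteration, assuming the 1,000-step rollout of (state, action, reward) tuples has already been collected with the current policy and that the actor and critic are PyTorch modules; hyperparameter values are our assumptions rather than the paper's settings.

```python
import torch
from torch.distributions import Categorical

# Hedged sketch of one PPO update with a clipped surrogate objective and
# critic-based advantages. Hyperparameters are illustrative.
CLIP_EPS, GAMMA, EPOCHS, BATCH = 0.2, 0.99, 4, 64

def discounted_returns(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    returns = torch.zeros_like(rewards)
    running = torch.tensor(0.0)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def ppo_update(actor, critic, opt_actor, opt_critic,
               states, actions, rewards, old_logps):
    with torch.no_grad():
        returns = discounted_returns(rewards, GAMMA)
        advantages = returns - critic(states).squeeze(-1)  # advantage from the critic's value estimate
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    for _ in range(EPOCHS):
        for idx in torch.randperm(len(states)).split(BATCH):
            dist = Categorical(probs=actor(states[idx]))   # softmax policy head
            logps = dist.log_prob(actions[idx])
            ratio = torch.exp(logps - old_logps[idx])
            clipped = torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS)
            actor_loss = -torch.min(ratio * advantages[idx],
                                    clipped * advantages[idx]).mean()
            critic_loss = (critic(states[idx]).squeeze(-1) - returns[idx]).pow(2).mean()

            opt_actor.zero_grad()
            actor_loss.backward()
            opt_actor.step()
            opt_critic.zero_grad()
            critic_loss.backward()
            opt_critic.step()
```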

 

Figure 2. PPO actor-critic architecture

 

The actor and critic networks share the same architecture, differing only in their output layers. The network architecture, illustrated in Figure 3, consists of an input layer that accepts all model inputs, totaling 40 values, followed by hidden layers of 512 neurons with ReLU activation functions [1]. The actor's output layer applies a Softmax, producing a probability distribution over the seven possible actions, while the critic's output layer uses a linear activation to produce a single scalar estimate of the state value, from which the advantage is computed. The full implementation is available at [5].
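A minimal PyTorch sketch of this architecture follows; the number of hidden layers is our assumption, since the text specifies only the 512-neuron ReLU inner layers, the 40-value input, and the two output heads.

```python
import torch.nn as nn

# Sketch of the shared actor/critic architecture described above.
def build_network(out_dim: int, softmax_head: bool,
                  in_dim: int = 40, hidden: int = 512) -> nn.Sequential:
    layers = [nn.Linear(in_dim, hidden), nn.ReLU(),
              nn.Linear(hidden, hidden), nn.ReLU(),
              nn.Linear(hidden, out_dim)]
    if softmax_head:
        layers.append(nn.Softmax(dim=-1))   # probability distribution over actions
    return nn.Sequential(*layers)

actor_net = build_network(out_dim=7, softmax_head=True)    # action probabilities
critic_net = build_network(out_dim=1, softmax_head=False)  # scalar state-value estimate
```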

 

Figure 3. Neural network architecture

 

Results

The proposed reinforcement learning model for adaptive maintenance scheduling was evaluated against three standard industry techniques: ad-hoc maintenance, bi-weekly maintenance, and monthly maintenance. The ad-hoc approach involves no preemptive scheduling; maintenance is performed only when the machine breaks down.

The bi-weekly and monthly approaches schedule maintenance every two weeks and four weeks, respectively, regardless of the machine's condition. The performance of each method was assessed based on two key metrics: production loss due to equipment failure and production loss due to maintenance downtime. Table 1 compares the production losses for each maintenance strategy.

Table 1.

Production loss due to equipment failure and maintenance (in units of 10,000 chips)

Maintenance Strategy    Equipment Failure Loss    Maintenance Loss
Ad-Hoc                  6.8                       0.9
Monthly                 3.1                       3.3
Bi-weekly               2.4                       6.9
Adaptive                1.6 ± 0.3                 2.1 ± 0.4

 

The ad-hoc maintenance strategy resulted in significant production losses due to equipment failures, as the machine was not maintained proactively. However, this approach incurred minimal production loss from maintenance downtime, as maintenance was performed only during breakdowns. In contrast, the bi-weekly and monthly maintenance strategies demonstrated improved performance in reducing production losses due to equipment failures, as regular maintenance prevented unexpected breakdowns. However, these methods incurred higher production losses due to maintenance downtime, with the bi-weekly approach experiencing greater losses than the monthly approach due to its more frequent maintenance schedule.

 

Figure 4. Visualization of tested approaches in maintenance scheduling

 

The proposed reinforcement learning model outperformed all three baseline techniques, as can be seen in Figure 4, achieving a balance between minimizing production losses due to equipment failures and maintenance downtime. By adaptively scheduling maintenance based on the machine’s condition and production demands, the model significantly reduced the likelihood of equipment failures while maintaining a low maintenance overhead. This adaptive approach resulted in higher overall production efficiency compared to the rigid schedules of the bi-weekly and monthly methods and the reactive nature of the ad-hoc strategy. The results highlight the potential of reinforcement learning to optimize maintenance scheduling in semiconductor manufacturing, offering a robust solution that mitigates both failure-related and maintenance-related production losses.

Conclusion

This paper presented a reinforcement learning-based approach to adaptive maintenance scheduling in semiconductor manufacturing, addressing the critical challenge of balancing equipment uptime and maintenance costs. By leveraging the Proximal Policy Optimization (PPO) algorithm, the proposed model demonstrated superior performance compared to standard industry techniques, including ad-hoc, bi-weekly, and monthly maintenance strategies. The model effectively minimized production losses due to both equipment failures and maintenance downtime, showcasing its ability to adaptively schedule maintenance based on real-time machine conditions and production demands. These results underscore the potential of reinforcement learning to optimize complex decision-making processes in industrial settings, particularly in high-stakes environments like semiconductor fabrication.

Future work would explore several avenues to further enhance the model’s performance and applicability. First, incorporating additional environmental factors, such as varying production workloads, machine aging effects, and external supply chain disruptions, could improve the robustness of the scheduling algorithm. Second, extending the model to multi-machine environments [7], where maintenance scheduling must account for interdependencies between equipment, could provide a more comprehensive solution for large-scale semiconductor fabs. Third, integrating human expertise through hybrid approaches that combine reinforcement learning with rule-based systems could enhance interpretability and facilitate adoption in real-world settings. Finally, investigating the use of more advanced reinforcement learning algorithms, such as those incorporating hierarchical or meta-learning techniques, could further improve the model’s adaptability and scalability. By addressing these challenges, future research can continue to advance the state of the art in adaptive maintenance scheduling and its applications in industrial automation.

 

References:

  1. Andrychowicz, Marcin, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Leonard Hussenot, et al. “What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study,” 2020. https://openreview.net/forum?id=nIAxjsniDzg&.
  2. Elmaghraby, Salah E. “The Economic Lot Scheduling Problem (ELSP): Review and Extensions.” Management Science 24, no. 6 (1978): 587–98.
  3. Kingma, Diederik P., and Jimmy Ba. “Adam: A Method for Stochastic Optimization.” arXiv, January 30, 2017. https://doi.org/10.48550/arXiv.1412.6980.
  4. Kuo, Way, and Taeho Kim. “An Overview of Manufacturing Yield and Reliability Modeling for Semiconductor Products.” Proceedings of the IEEE 87, no. 8 (August 1999): 1329–44. https://doi.org/10.1109/5.775417.
  5. Mykhaylov, Michael. “Mikemykhaylov/Semiconductor-Env.” Jupyter Notebook, November 16, 2023. https://github.com/mikemykhaylov/semiconductor-env.
  6. Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal Policy Optimization Algorithms.” arXiv, August 28, 2017. https://doi.org/10.48550/arXiv.1707.06347.
  7. Wang, M., J. Zhang, P. Zhang, and M. Jin. “Cooperative Multi-Agent Reinforcement Learning for Multi-Area Integrated Scheduling in Wafer Fabs.” International Journal of Production Research (October 2024): 1–18. https://doi.org/10.1080/00207543.2024.2411615.
  8. Zhou, MengChu. “Modeling, Analysis, Simulation, Scheduling, and Control of Semiconductor Manufacturing Systems: A Petri Net Approach.” IEEE Transactions on Semiconductor Manufacturing 11, no. 3 (August 1998): 333–57. https://doi.org/10.1109/66.705370.
Information about the authors

Bachelor of Computer Science, Arizona State University, United States of America, Arizona, Tempe

Associate Professor of Computer Science, Arizona State University, United States of America, Arizona, Tempe
