BEHAVIORAL FINANCE AND PORTFOLIO OPTIMIZATION

Cite as: Dussaliyeva A., Bissembayev A. Behavioral Finance and Portfolio Optimization // Universum: технические науки. 2025. No. 5(134). URL: https://7universum.com/ru/tech/archive/item/20164
DOI: 10.32743/UniTech.2025.134.5.20164

 

ABSTRACT

This paper explores the integration of behavioral finance principles into portfolio optimization by leveraging sentiment analysis and reinforcement learning. Traditional models often assume investor rationality, yet empirical evidence highlights the influence of cognitive biases and emotions in financial decision-making. To address this gap, we propose a sentiment-aware optimization framework using Proximal Policy Optimization (PPO), where market sentiment is extracted from financial news headlines and weighted based on recency to reflect behavioral biases.

The methodology involves the construction of a custom reinforcement learning environment in which the agent learns to allocate dynamic portfolio weights by maximizing a Sharpe ratio-based reward. Two PPO models were independently trained and evaluated: one using both historical financial returns and recency-weighted sentiment signals, and another relying solely on return data. Additionally, Grey Wolf Optimizer (GWO) and Whale Optimization Algorithm (WOA) were employed as metaheuristic optimizers to identify optimal static portfolio allocations, using the PPO-trained value function as a fitness evaluator rather than directly interacting with the environment. Experimental results demonstrate that the sentiment-enhanced PPO model consistently outperformed the sentiment-free variant and the SPY benchmark in terms of total return. The findings support the hypothesis that incorporating behavioral signals can significantly improve financial decision-making and portfolio performance. This work highlights the value of hybridizing reinforcement learning with behavioral finance aspects and opens new pathways for adaptive investment strategies.


 

Keywords: Behavioral Finance, Portfolio Optimization, Sentiment Analysis, Reinforcement Learning, Proximal Policy Optimization (PPO), Grey Wolf Optimization (GWO), Whale Optimization Algorithm (WOA), Sharpe Ratio, Dynamic Asset Allocation, Metaheuristic Optimization.


 

Introduction

Portfolio optimization is a crucial part of modern investment management, aiming to balance risk and return in an efficient manner. Classical portfolio theories, such as Markowitz’s Modern Portfolio Theory (MPT) [1] and the Capital Asset Pricing Model (CAPM) [2], have long provided theoretical foundations for investment decision-making. However, these models rely on the assumption that investors behave rationally and that markets are efficient, as suggested by the Efficient Market Hypothesis (EMH) [3]. In reality, investors often exhibit irrational behavior, influenced by psychological biases, heuristics, emotions, and cognitive limitations. This has led to the emergence of behavioral finance, a field that integrates insights from psychology and economics to explain deviations from rationality in financial decision-making [4].

Behavioral finance challenges traditional financial theories by emphasizing the role of loss aversion, overconfidence, herding, and framing effects in shaping investor behavior [5]. These cognitive and emotional biases affect portfolio allocation, leading investors to under-diversify, chase past performance, or panic-sell during market downturns. Recognizing these behavioral tendencies, researchers have sought to incorporate behavioral factors into portfolio optimization models, developing approaches such as behavioral portfolio theory (BPT) [5] and prospect theory-based optimization [6].

Parallel to this, advancements in artificial intelligence (AI), evolutionary computation, and machine learning (ML) have enabled the development of behavioral-aware portfolio optimization models that dynamically adapt to changing market conditions and investor preferences. Recent research has explored deep learning-based portfolio selection, reinforcement learning-driven trading strategies, and heuristic optimization algorithms such as Genetic Algorithms (GA), Whale Optimization Algorithm (WOA), Grey Wolf Optimization (GWO), and Particle Swarm Optimization (PSO) to enhance portfolio decision-making [2, 11]. Moreover, sentiment analysis using social media and financial news data has gained prominence as a way to quantify market sentiment and investor mood, further integrating behavioral finance into quantitative portfolio management [17, 22].

Given these developments, this research aims to explore the integration of behavioral finance and portfolio optimization, bridging the gap between traditional financial theories, investor psychology, and computational intelligence. Specifically, this study will investigate how investor biases impact asset allocation and risk-taking behavior, and how AI-driven models can mitigate inefficient decision-making in portfolio construction. The research will leverage machine learning techniques, sentiment analysis, and heuristic optimization algorithms to design a behavioral-aware portfolio optimization framework, providing an innovative approach to investment decision-making in highly volatile financial markets.

This research is guided by the following key hypotheses:

H1: Incorporating behavioral sentiment improves portfolio performance over return-only models.

H2: PPO-trained value functions can serve as effective static fitness evaluators.

H3: GWO and WOA optimizers, when guided by PPO, can construct superior static portfolios.

H4: PPO-augmented static optimizers can outperform passive strategies like SPY.

Related works

This section reviews existing literature relevant to integrating behavioral finance, machine learning, and metaheuristic optimization within portfolio management. It emphasizes sentiment analysis, reinforcement learning (RL), particularly Proximal Policy Optimization (PPO), and metaheuristic algorithms including Grey Wolf Optimization (GWO) and Whale Optimization Algorithm (WOA).

1. Traditional Portfolio Optimization and Behavioral Finance

Portfolio optimization originated with Markowitz's Modern Portfolio Theory (MPT), which laid the foundational mean-variance optimization framework [7]. While influential, MPT relies heavily on assumptions such as rational investor behavior and stable market correlations, often invalid in practice [8]. Extensions such as the Capital Asset Pricing Model (CAPM) [9] and Arbitrage Pricing Theory (APT) [6] attempt to address these limitations but retain restrictive assumptions about investor rationality and market efficiency.

Behavioral finance addresses these limitations by incorporating psychological biases, such as loss aversion, overconfidence, and herding behavior, into investment decisions [10]-[11]. Behavioral Portfolio Theory (BPT) explicitly models investor preferences through layered risk-return structures, aligning more closely with actual investor psychology than traditional models [12]. Prospect Theory-based portfolio optimization further incorporates nonlinear investor risk preferences and subjective probability distortions, enhancing the realism of portfolio strategies [3].

2. Sentiment Analysis in Portfolio Optimization

Integrating investor sentiment through natural language processing (NLP) has significantly advanced portfolio optimization. Chen & Zou [13] demonstrated improved Sharpe ratios using sentiment from financial news. Extending this, Pundir [14] proposed recency-weighted sentiment signals to capture dynamic investor moods, enhancing portfolio returns and risk management. Muthivhi and van Zyl [15] combined LSTM-based predictive models with Twitter-derived sentiment, outperforming traditional predictive models and passive strategies significantly in terms of risk-adjusted returns.

3. Machine Learning and Reinforcement Learning in Finance

Machine learning (ML) and artificial intelligence (AI) introduce dynamic adaptability to portfolio optimization, allowing models to adjust to market structures, economic indicators, and investor sentiment in real time [16]. Deep learning methods, such as Long Short-Term Memory (LSTM) networks, have demonstrated superior predictive performance in financial markets [17].

Reinforcement learning (RL) has become prominent due to its capacity for dynamic decision-making. Jiang et al. [4] employed Deep Deterministic Policy Gradient (DDPG) to achieve substantial returns and Sharpe ratios in portfolio management. Recently, Proximal Policy Optimization (PPO), recognized for its stability in continuous action spaces, emerged as particularly effective. Wang et al. [18] confirmed PPO’s robustness, significantly outperforming DDPG and Q-learning algorithms in annual returns and Sharpe ratios. To enhance interpretability, de la Rica Escudero et al. [19] integrated PPO with explainability frameworks such as SHAP and LIME, bridging the gap between model transparency and performance. Additionally, Zhang et al. [20] explored hybrid PPO approaches combining discrete and continuous action spaces, achieving reduced turnover and stable returns.

4. Metaheuristic Optimization (GWO, WOA)

Metaheuristic optimization methods such as Grey Wolf Optimization (GWO) and Whale Optimization Algorithm (WOA) efficiently address complex, high-dimensional portfolio optimization problems. Ahmad and Shahid [21] demonstrated GWO’s superior efficient frontiers compared to traditional heuristics. Hasan et al. [22] validated WOA’s convergence and risk-return efficiency, surpassing Genetic Algorithms (GA) and Particle Swarm Optimization (PSO).

Liang et al. [1] introduced the concept of using RL-trained value functions as surrogate fitness evaluators in genetic algorithms, significantly improving computational efficiency and optimization outcomes. Similar principles guide the hybrid RL-metaheuristic approaches presented in this research.

5. Hybrid Approaches and Advanced Techniques

Recent advances have explored hybrid methods combining deep learning, reinforcement learning, and metaheuristics for portfolio optimization. Pratama and Putra [24] developed an LSTM and GA-enhanced PMPT framework, achieving substantial outperformance over traditional models. Jeribi et al. [23] integrated deep learning models (CNN-based) and metaheuristics (IBWO) to significantly improve prediction accuracy and Sharpe ratios.

Quantum-inspired approaches have also shown potential. Kuo et al. [26] utilized quantum-inspired tabu search, outperforming classical methods in capturing market trends and managing drawdowns. Additionally, He and Zhang [25] integrated dynamic forecasting and regularization methods into heuristic optimization, significantly enhancing long-term wealth growth and Sharpe ratios.

6. Challenges and Future Directions

Despite advancements, challenges persist: RL models often lack interpretability, complicating investor trust and regulatory compliance. Sentiment analysis models face challenges in accurately representing nuanced market sentiment. Future research should focus on enhancing interpretability, refining sentiment integration, and validating methodologies across broader and diverse market conditions to develop robust, behaviorally-informed portfolio management frameworks.

Materials and methods

This section explains the methodological framework employed to investigate the integration of behavioral sentiment and advanced optimization algorithms in portfolio management. The pipeline is conceived as a multi-stage process, where each phase builds upon the results of the preceding one, ensuring both theoretical rigor and empirical validity. The primary research question ("How can recency-weighted behavioral sentiment signals and global optimization techniques be integrated to construct portfolios with superior risk-adjusted performance?") serves as the guiding thread throughout the workflow. The pipeline comprises five core stages and is reflected in Figure 1.

1. Data collection and processing

The implementation of the portfolio optimization model relies heavily on robust data collection and meticulous preprocessing to ensure that both quantitative market data and qualitative sentiment information are accurately incorporated into the framework. This section outlines, step-by-step, the methods employed for sourcing, integrating, and transforming these diverse datasets.

a. Financial Data Acquisition

The data collection involved two distinct yet complementary datasets aimed at capturing both market behavior and investor sentiment. The first dataset comprises labeled financial news articles containing headline-level sentiment labels (positive, negative, or neutral). The second dataset consists of daily stock prices of S&P 500 constituents from 2014 to 2024. The goal was to transform these raw inputs into synchronized weekly matrices of asset returns and behavioral sentiment, suitable for reinforcement learning and optimization models.

Figure 1. Flowchart of the proposed methodology

 

b. Sentiment Data Preprocessing

The labeled news dataset was merged with a supplementary file containing publication timestamps by matching headline texts. To ensure temporal alignment, only entries with valid publication dates were retained. Each news headline was parsed using regular expressions to extract potential ticker symbols, defined as sequences of 2 to 5 uppercase alphabetical characters (e.g., "AAPL", "TSLA").

To incorporate behavioral bias—specifically recency bias—the sentiment score associated with each headline was weighted by the inverse of the number of days since its publication. Formally, the weighted score for a sentiment score $s$ attached to a headline published $d$ days ago is defined as $s_{\text{weighted}} = s / d$.

These scores were grouped by week and ticker, then averaged to form a weekly sentiment score per asset. The resulting sentiment matrix $S \in \mathbb{R}^{T \times N}$, where $T$ denotes the number of weeks and $N$ the number of assets, captures temporal fluctuations in market sentiment, with greater emphasis on more recent news.
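
For illustration, the recency weighting and weekly aggregation described above can be sketched with pandas. This is a minimal sketch rather than the exact implementation: the column names (headline, sentiment, date), the mapping of the three labels to numerical scores, the reference date as_of, and the +1 offset in the denominator (to avoid division by zero for same-day news) are all assumptions.

import re
import pandas as pd

LABEL_TO_SCORE = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}  # assumed label mapping

def weekly_sentiment(news: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Build a week x ticker matrix of recency-weighted sentiment scores (sketch)."""
    df = news.dropna(subset=["date"]).copy()
    df["date"] = pd.to_datetime(df["date"])
    # Extract candidate tickers: sequences of 2-5 uppercase letters, as described above
    df["tickers"] = df["headline"].str.findall(r"\b[A-Z]{2,5}\b")
    df = df.explode("tickers").dropna(subset=["tickers"])
    df["score"] = df["sentiment"].map(LABEL_TO_SCORE)
    # Recency weighting: inverse of the number of days since publication
    days_old = (as_of - df["date"]).dt.days.clip(lower=0)
    df["weighted"] = df["score"] / (1.0 + days_old)   # +1 avoids division by zero (assumption)
    # Average weighted scores per calendar week (starting Monday) and ticker
    df["week"] = df["date"].dt.to_period("W-SUN").dt.start_time
    return df.pivot_table(index="week", columns="tickers", values="weighted", aggfunc="mean")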

c. Market Data Preprocessing

The financial dataset consisted of daily adjusted closing prices for a broad set of S&P 500 tickers. The dataset was cleaned by removing columns with missing values to maintain consistency across assets. Daily returns were computed as percentage changes in closing prices. These were then resampled to a weekly frequency by summing daily returns within each calendar week: $R_{i,t} = \sum_{d \in \text{week}\ t} \frac{P_{i,d} - P_{i,d-1}}{P_{i,d-1}}$, where $R_{i,t}$ is the return for asset $i$ during week $t$, and $P_{i,d}$ is the adjusted closing price of asset $i$ on day $d$.

To maintain temporal consistency with the sentiment matrix, the weekly returns matrix was reindexed to align with the starting Monday of each week. Only those tickers present in both sentiment and price datasets were retained.
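
A possible pandas implementation of this resampling step, assuming a hypothetical DataFrame prices of daily adjusted closes indexed by trading date with one column per ticker, could look as follows.

import pandas as pd

def weekly_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Daily adjusted closes -> weekly returns, summed within each calendar week (sketch)."""
    prices = prices.dropna(axis=1)                 # drop assets with missing values
    daily_ret = prices.pct_change().dropna()       # daily percentage returns
    # Sum daily returns within each Monday-to-Sunday week, labeled by the starting Monday
    return daily_ret.resample("W-MON", closed="left", label="left").sum()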

d. Synchronization and Final Feature Matrices

The sentiment matrix and the return matrix were aligned by taking the intersection of their time indices (weeks) and columns (tickers). This ensured that each element in the final dataset had corresponding sentiment and return information. The output of this process consists of two matrices:

$S \in \mathbb{R}^{T \times N}$: recency-weighted sentiment scores;

$R \in \mathbb{R}^{T \times N}$: weekly asset returns.

These matrices form the foundational input for both the PPO-based reinforcement learning agent and the GWO/WOA evolutionary optimizers, enabling behavioral-aware portfolio decision-making.
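
The synchronization itself reduces to an index and column intersection; a minimal sketch, assuming the weekly matrices S (sentiment) and R (returns) produced by the previous steps:

# Align the sentiment matrix S and the return matrix R on common weeks and tickers
common_weeks = S.index.intersection(R.index)
common_tickers = S.columns.intersection(R.columns)
S_aligned = S.loc[common_weeks, common_tickers]
R_aligned = R.loc[common_weeks, common_tickers]
assert S_aligned.shape == R_aligned.shape   # one sentiment score per return observation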

2. Reinforcement learning environment design

To simulate the decision-making process of a portfolio manager, a custom reinforcement learning environment named PortfolioEnvSharpe was developed using the OpenAI Gym framework. This environment is tailored for weekly portfolio allocation tasks, enabling the agent to learn investment strategies through interaction with market data and behavioral sentiment signals.

a. State Space

The observation vector provided to the agent at each step consists of three components:

Weekly return vector: Historical returns for all assets over a specified lookback window.

Sentiment vector: Recency-weighted sentiment scores for each asset at the current timestep (optional, depending on experiment configuration).

Previous portfolio weights: The allocation of assets in the previous week.

The inclusion of the sentiment vector enables the agent to incorporate behavioral information into its decision-making process, potentially improving its ability to anticipate price movements influenced by market mood.

b. Action Space

The action taken by the agent is a vector of portfolio weights representing the proportion of capital allocated to each asset in the current week. The weights are subject to the following constraints:

Each weight must be non-negative: $w_i \ge 0$ for all $i$.

The sum of weights must equal 1: $\sum_{i=1}^{N} w_i = 1$.

An optional upper bound (e.g., 0.2) can be imposed to limit exposure to any single asset.

These constraints ensure realistic and diversified portfolio allocations.

c. Reward Function

The agent receives a scalar reward at each step based on the Sharpe Ratio computed over a rolling window of portfolio returns. The reward at time $t$ is calculated as $\text{reward}_t = \frac{\bar{r}}{\sigma_r + \epsilon}$, where $r$ denotes the portfolio return series over the past $k$ timesteps (e.g., 8 weeks), $\bar{r}$ is the mean return, $\sigma_r$ is the standard deviation, and $\epsilon$ is a small constant added for numerical stability.

This reward structure encourages the agent to seek portfolios that offer high risk-adjusted returns, rather than simply maximizing raw profit.

d. Reset and Step Functions

Upon reset, the environment initializes the portfolio with uniform weights and sets the portfolio value to 1.0. At each step:

The environment calculates the portfolio return for the week based on the agent’s chosen weights and the realized asset returns.

The portfolio value is updated multiplicatively.

The reward is computed based on recent portfolio performance.

A new observation is returned to the agent.
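
A condensed sketch of the reset and step logic described above is given below. It assumes a classic OpenAI Gym interface and the aligned weekly matrices R and S as NumPy arrays; observation normalization, the optional per-asset cap, and other details of the actual PortfolioEnvSharpe implementation are simplified.

import numpy as np
import gym
from gym import spaces

class PortfolioEnvSharpe(gym.Env):
    """Weekly portfolio allocation environment with a Sharpe-based reward (illustrative sketch)."""

    def __init__(self, returns, sentiment=None, lookback=4, reward_window=8, eps=1e-6):
        super().__init__()
        self.R = np.asarray(returns, dtype=np.float32)   # (T, N) weekly returns
        self.S = None if sentiment is None else np.asarray(sentiment, dtype=np.float32)
        self.T, self.N = self.R.shape
        self.lookback, self.k, self.eps = lookback, reward_window, eps
        obs_dim = self.N * lookback + self.N + (self.N if self.S is not None else 0)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(obs_dim,), dtype=np.float32)
        self.action_space = spaces.Box(0.0, 1.0, shape=(self.N,), dtype=np.float32)

    def _obs(self):
        window = self.R[self.t - self.lookback:self.t].flatten()
        parts = [window, self.weights]
        if self.S is not None:
            parts.insert(1, self.S[self.t])              # current-week sentiment vector
        return np.concatenate(parts).astype(np.float32)

    def reset(self):
        self.t = self.lookback
        self.value = 1.0
        self.weights = np.full(self.N, 1.0 / self.N, dtype=np.float32)   # uniform start
        self.history = []
        return self._obs()

    def step(self, action):
        # Normalize the action so weights are non-negative and sum to one
        w = np.clip(action, 0.0, None)
        w = w / (w.sum() + self.eps)
        port_ret = float(np.dot(w, self.R[self.t]))       # realized weekly portfolio return
        self.value *= (1.0 + port_ret)                    # multiplicative value update
        self.history.append(port_ret)
        recent = np.array(self.history[-self.k:])
        # Rolling Sharpe-style reward over the last k weeks
        reward = float(recent.mean() / (recent.std() + self.eps)) if len(recent) > 1 else 0.0
        self.weights = w
        self.t += 1
        done = self.t >= self.T
        obs = np.zeros(self.observation_space.shape, dtype=np.float32) if done else self._obs()
        return obs, reward, done, {"value": self.value}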

e. Motivation for Design Choices

The design of PortfolioEnvSharpe reflects real-world considerations in portfolio management:

Weekly granularity mimics institutional rebalancing cycles.

Sharpe-based reward captures investor preference for stability and return consistency.

Inclusion of sentiment reflects behavioral finance insights, allowing the agent to exploit crowd-driven inefficiencies.

This environment serves as the foundation for training and evaluating the PPO-based reinforcement learning agents described in subsequent sections.

3. Proximal Policy Optimization (PPO) Implementation

The reinforcement learning agent for portfolio allocation was trained using the Proximal Policy Optimization (PPO) algorithm implemented via the Stable-Baselines3 library. PPO was chosen for its stability, sample efficiency, and suitability for continuous action spaces such as portfolio weights.

a. Algorithm Overview

PPO is an on-policy actor–critic algorithm, concurrently learning a policy (actor) and a value function (critic). It operates through five main steps:

Experience Collection: Agent interactions produce state-action-reward trajectories.

Advantage Estimation: Generalized Advantage Estimation (GAE) computes the advantage estimate $\hat{A}_t$, balancing bias and variance in reward prediction.

Clipped Policy Update: PPO updates the policy by maximizing a clipped surrogate objective:

$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t\right)\right]$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ represents the importance sampling ratio between the new and old policies, ensuring off-policy corrections during updates. Clipping $r_t(\theta)$ to $[1-\epsilon,\ 1+\epsilon]$ prevents overly large policy shifts and maintains stable learning.

Value Function Update: The value network minimizes Mean Squared Error (MSE) against observed returns.

Exploration Regularization: An entropy term encourages sufficient exploration to avoid premature convergence.

b. Advantages and Limitations

Key benefits of PPO include:

  • Stability from clipped policy updates,
  • Direct handling of continuous actions,
  • Simplicity of implementation,
  • Efficient on-policy data usage.

However, PPO also faces challenges:

  • Data intensity due to on-policy requirements,
  • Sensitivity to hyperparameter tuning,
  • Potential exploration restrictions,
  • Higher computational overhead from multiple optimization epochs per batch.

c. Training Setup

The PortfolioEnvSharpe environment was vectorized via DummyVecEnv.
Two variants were trained:

  • PPO with sentiment (sentiment included in the state vector),
  • PPO without sentiment (baseline).

The input state comprised normalized weekly returns, sentiment scores, and previous portfolio weights.

d. Hyperparameters

Empirically tuned hyperparameters are summarized in the table below:

Table 1.

PPO hyperparameters

Hyperparameter | Value | Rationale
Learning rate | 2.5e-5 | Avoid overfitting on noisy returns
Discount factor (γ) | 0.985 | Emphasize mid-term returns
Entropy coefficient | 0.001 | Reduce randomness in actions
Batch size | 128 | Improve gradient smoothness
Number of steps per update | 512 | Enable more accurate policy updates
Number of epochs | 10 | Multiple passes on each batch
PPO clip range | 0.2 | Stabilize policy updates
GAE lambda | 0.95 | Standard λ for GAE
Total training timesteps | 30,000 | Increase confidence in learning outcome
Seed | 42 | Ensure reproducibility

 

These parameters balance exploration and exploitation, emphasizing stable, long-term portfolio performance.
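
With Stable-Baselines3, the training setup can be wired up in a few lines; the sketch below plugs the hyperparameters from Table 1 into the library's PPO constructor. The "MlpPolicy" network and the reuse of the environment and aligned matrices from the earlier sketches are assumptions, since the paper does not specify the policy architecture.

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: PortfolioEnvSharpe(R_aligned.values, S_aligned.values)])

model = PPO(
    "MlpPolicy",            # assumed policy architecture
    env,
    learning_rate=2.5e-5,
    gamma=0.985,
    ent_coef=0.001,
    batch_size=128,
    n_steps=512,
    n_epochs=10,
    clip_range=0.2,
    gae_lambda=0.95,
    seed=42,
    verbose=0,
)
model.learn(total_timesteps=30_000)
model.save("ppo_sentiment_portfolio")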

e. Training and Evaluation Procedure

The agent was trained on historical data (2014–2024). At each timestep, the agent chose portfolio allocations, observed rewards, and updated its policy.
Performance was evaluated via backtesting, comparing cumulative returns, volatility, and Sharpe Ratios of PPO models (with/without sentiment) and the SPY market benchmark.
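
Backtesting amounts to a deterministic rollout of the trained policy over the evaluation environment, collecting the portfolio value path from which the evaluation metrics are computed. A minimal sketch, assuming the vectorized environment and the "value" info key from the earlier environment sketch:

import numpy as np

obs = env.reset()
values, done = [1.0], False
while not done:
    action, _ = model.predict(obs, deterministic=True)   # greedy allocation from the trained policy
    obs, rewards, dones, infos = env.step(action)
    done = bool(dones[0])
    values.append(infos[0]["value"])                      # portfolio value after the current week
values = np.asarray(values)
weekly_rets = values[1:] / values[:-1] - 1.0              # weekly return series used for evaluation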

4. PPO-Guided Fitness Function for Evolutionary Optimization

In addition to training the PPO agent for dynamic portfolio allocation, this study proposes a novel static optimization method using the trained PPO value function (critic) as a surrogate fitness evaluator. Rather than deploying the PPO policy directly, the learned value estimates are reused to evaluate static portfolios, enabling evolutionary algorithms to efficiently explore allocation spaces.

a. Motivation

The PPO value network estimates future risk-adjusted returns based on returns, sentiment, and portfolio weights.

Reusing these learned estimates as fitness scores allows efficient optimization aligned with the PPO agent’s internal market and behavioral understanding, significantly reducing computational overhead.

b. Surrogate Fitness Function

The surrogate function evaluates static portfolio allocations by:

  1. Concatenating the latest weekly return vector, sentiment vector, and portfolio weights into one feature vector.
  2. Normalizing using PPO’s training data mean and standard deviation.
  3. Feeding this input into the PPO value function network.

The output scalar serves as the portfolio’s estimated fitness score.

c. Mathematical Formulation

Let $V_\phi(\cdot)$ denote the PPO value function. Then, the fitness function $F$ for a weight vector $w$ is defined as:

$F(w) = V_\phi\big([\,r_t,\ s_t,\ w\,]\big)$

where $r_t$ is the return vector, $s_t$ is the sentiment vector at the current time step, and $[\cdot]$ denotes concatenation followed by normalization with the PPO training statistics.
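
In code, the surrogate can be read directly from the trained Stable-Baselines3 model's critic. The sketch below assumes the same observation layout used during training and relies on obs_to_tensor and predict_values, which are exposed by recent SB3 actor-critic policies; the optional obs_mean/obs_std arguments stand in for the PPO training statistics mentioned above.

import numpy as np
import torch

def ppo_fitness(model, weights, ret_vec, sent_vec, obs_mean=None, obs_std=None):
    """Score a static weight vector with the PPO critic (surrogate fitness sketch)."""
    feat = np.concatenate([ret_vec, sent_vec, weights]).astype(np.float32)
    if obs_mean is not None:                              # normalize with PPO training statistics
        feat = (feat - obs_mean) / (obs_std + 1e-8)
    obs_tensor, _ = model.policy.obs_to_tensor(feat)      # adds the batch dimension
    with torch.no_grad():
        value = model.policy.predict_values(obs_tensor)   # critic forward pass
    return float(value.squeeze())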

d. Advantages

Key benefits include:

  1. Efficiency: Rapidly evaluates numerous portfolios, significantly reducing computation.
  2. Generalization: Learns complex relationships, allowing assessment of novel portfolios.
  3. Consistency: Uses fixed evaluation criteria derived from the trained PPO model.
  4. Behavioral Signal Integration: Implicitly incorporates sentiment and behavioral factors into evaluations.

e. Disadvantages

Potential drawbacks include:

1. Approximation Error: Imperfect PPO training may result in suboptimal approximations.

2. Training Dependency: Surrogate accuracy relies on PPO agent’s training quality.

3. Limited Adaptivity: Static portfolios lack dynamic adaptation to evolving market conditions.

4. Loss of Temporal Dependencies: Static optimization ignores sequential decision-making captured by RL.

This surrogate fitness approach bridges reinforcement learning and metaheuristics, forming the basis for the GWO and WOA strategies discussed subsequently.

5. Evolutionary Optimization with GWO and WOA

To complement the reinforcement learning-based strategy, this study explores evolutionary optimization techniques — specifically Grey Wolf Optimization (GWO) and Whale Optimization Algorithm (WOA) — to search for globally optimal static portfolio allocations. Unlike PPO, which dynamically updates allocations over time, these methods seek a single optimal weight allocation evaluated via a surrogate model.

a.  Algorithm Overview

GWO and WOA are population-based metaheuristic algorithms inspired by animal behavior:

  • GWO (Grey Wolf Optimization): Simulates grey wolves' leadership hierarchy and hunting, where top candidates (alpha, beta, delta) guide the search process.
  • WOA (Whale Optimization Algorithm): Mimics humpback whales' bubble-net hunting behavior, balancing exploration and exploitation through encircling and spiraling movements.

These algorithms are well-suited for black-box, non-differentiable, and high-dimensional optimization problems, making them ideal for portfolio allocation tasks.

b. Optimization Constraints

In the portfolio optimization context, the candidate solutions represent weight vectors across assets. The search is performed under realistic financial constraints:

Non-negativity: $w_i \ge 0$, ensuring no short-selling.

Budget constraint: $\sum_{i=1}^{N} w_i = 1$, enforcing full investment of capital.

Maximum exposure constraint: $w_i \le w_{\max}$ (e.g., $w_{\max} = 0.2$), limiting concentration in any single asset.

Constraints are enforced via normalization and boundary projection at each generation.
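
One simple way to implement this projection is to alternate clipping and renormalization; the sketch below is an approximate scheme (the exact projection used in the study is not specified), with the cap w_max defaulting to the 0.2 exposure limit mentioned earlier.

import numpy as np

def project_weights(w, w_max=0.2, iters=10):
    """Approximately project a candidate onto {w >= 0, sum(w) = 1, w_i <= w_max} (sketch)."""
    w = np.clip(np.asarray(w, dtype=float), 0.0, None)
    for _ in range(iters):
        w = w / (w.sum() + 1e-12)      # budget constraint
        w = np.minimum(w, w_max)       # maximum exposure constraint
    return w / (w.sum() + 1e-12)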

c. Fitness Evaluation Using PPO Value Network

Rather than recalculating traditional financial metrics, the pretrained PPO value network is used to score portfolios:

  1. Concatenate the current returns vector, sentiment vector, and candidate weights.
  2. Normalize the input using PPO training statistics.
  3. Feed the input into the PPO value network to obtain a fitness score.

This method leverages PPO’s learned understanding of market patterns and behavioral signals.

d. Search Procedure

The evolutionary search progresses as follows:

  • Initialization: Generate an initial population of feasible portfolios.
  • Evaluation: Score each candidate using the PPO surrogate.
  • Update: Modify candidate positions via GWO or WOA update rules.
  • Constraint enforcement: Reproject any infeasible solutions.
  • Convergence check: Stop if improvement stagnates or maximum iterations are reached.
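
The sketch below outlines how such a GWO search over portfolio weights could be written around the PPO surrogate, using the canonical alpha/beta/delta update rules with the coefficient a decaying linearly from 2 to 0. It assumes the project_weights helper and the ppo_fitness surrogate from the earlier sketches; the population size and iteration budget are illustrative, not the values used in the study.

import numpy as np

def gwo_optimize(fitness, n_assets, pop_size=30, iters=100, w_max=0.2, seed=42):
    """Grey Wolf Optimization over portfolio weights, maximizing a surrogate fitness (sketch)."""
    rng = np.random.default_rng(seed)
    pop = np.array([project_weights(rng.random(n_assets), w_max) for _ in range(pop_size)])
    scores = np.array([fitness(w) for w in pop])
    best_w, best_score = pop[np.argmax(scores)].copy(), float(scores.max())

    for it in range(iters):
        a = 2.0 - 2.0 * it / iters                   # exploration coefficient decays from 2 to 0
        order = np.argsort(scores)[::-1]             # maximization: best candidates first
        alpha, beta, delta = pop[order[:3]]
        for i in range(pop_size):
            new_pos = np.zeros(n_assets)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(n_assets), rng.random(n_assets)
                A, C = 2 * a * r1 - a, 2 * r2
                D = np.abs(C * leader - pop[i])      # distance to the leader
                new_pos += (leader - A * D) / 3.0    # average of the three guided moves
            pop[i] = project_weights(new_pos, w_max) # re-impose portfolio constraints
            scores[i] = fitness(pop[i])
            if scores[i] > best_score:
                best_w, best_score = pop[i].copy(), float(scores[i])
    return best_w, best_score

# Example (hypothetical): best_w, best_fit = gwo_optimize(
#     lambda w: ppo_fitness(model, w, latest_returns, latest_sentiment), n_assets=117)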

e. Outcome and Backtesting

The optimization yields a single static weight vector, which is backtested alongside PPO dynamic strategies and the SPY benchmark to assess comparative performance.

f. Interpretability and Utility

Static GWO/WOA-optimized portfolios enhance interpretability for institutional investors. By leveraging the PPO-trained value network, these methods efficiently incorporate learned market behaviors without requiring continual retraining, uniting the adaptability of reinforcement learning with the global search capability of evolutionary algorithms.

6. Evaluation metrics

To comprehensively assess the performance of all portfolio optimization strategies, a suite of financial evaluation metrics was employed. These metrics collectively capture return potential, risk exposure, and risk-adjusted efficiency, enabling a multidimensional comparison between dynamic reinforcement learning-based models and static evolutionary optimization strategies.

a. Total Return

Total return measures the absolute percentage gain or loss of a portfolio over the entire evaluation period (2014–2024). It reflects the cumulative effect of weekly compounding of returns and is defined as:

$\text{Total Return} = \frac{V_T - V_0}{V_0} \times 100\%$

where $V_T$ is the terminal portfolio value and $V_0 = 1$ is the initial investment amount.

b. Average Weekly Return

This metric calculates the mean return earned by the portfolio on a weekly basis. It provides insights into the consistency and directionality of returns:

$\bar{r} = \frac{1}{T} \sum_{t=1}^{T} r_t$

where $T$ is the number of weeks and $r_t$ is the portfolio return in week $t$.

c. Volatility (Standard Deviation of Weekly Returns)

Volatility quantifies the dispersion of returns and is used as a proxy for portfolio risk. A lower volatility implies more stable performance, which is particularly desirable under behavioral finance paradigms that penalize downside fluctuations:

$\sigma = \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} (r_t - \bar{r})^2}$

where $\bar{r}$ is the average weekly return.

d. Sharpe Ratio

The Sharpe ratio is a widely adopted metric for evaluating risk-adjusted return. It measures the amount of excess return per unit of risk and is defined as:

$\text{Sharpe} = \frac{\bar{r} - r_f}{\sigma}$

In this study, the risk-free rate $r_f$ is assumed to be zero, simplifying the equation to:

$\text{Sharpe} = \frac{\bar{r}}{\sigma}$

where $\bar{r}$ is the average return and $\sigma$ is the standard deviation.
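
Given the weekly return series of any strategy, the four metrics can be computed directly; a short sketch:

import numpy as np

def evaluate(weekly_rets):
    """Compute total return, average weekly return, volatility, and Sharpe ratio (r_f = 0)."""
    r = np.asarray(weekly_rets, dtype=float)
    total_return = np.prod(1.0 + r) - 1.0          # cumulative effect of weekly compounding
    mean_ret = r.mean()
    vol = r.std(ddof=1)
    sharpe = mean_ret / vol
    return {
        "Total Return (%)": 100 * total_return,
        "Average Weekly Return (%)": 100 * mean_ret,
        "Volatility (%)": 100 * vol,
        "Sharpe Ratio": sharpe,
    }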

By applying these evaluation criteria uniformly, the study provides a robust and interpretable basis for comparing the relative effectiveness of sentiment-aware reinforcement learning, traditional PPO models, evolutionary optimization, and passive index tracking strategies.

Results and discussion

This section presents the empirical findings from backtesting five portfolio optimization strategies over 2014–2024: PPO with sentiment, PPO without sentiment, GWO, WOA (both using PPO value function as surrogate), and SPY as a passive benchmark. Performance is evaluated using cumulative return, average weekly return, volatility, and Sharpe Ratio.

1. Asset Selection

From the S&P 500 index, a subset of 117 assets was selected based on availability of complete price and sentiment data, ensuring reliable and synchronized datasets.

2. Portfolio Evolution: Initial Testing

Using random actions in the PortfolioEnvSharpe environment, portfolio value dynamics were observed:

  • Portfolio initialized with equal weights; starting value = 1.0.
  • Early fluctuations showed sensitivity to returns and volatility.
  • Reward normalization correctly reflected risk-adjusted gains/losses.

 

Figure 2. Portfolio Value Evolution During Initial Random Actions

 

3. Portfolio Evolution: PPO Agent with Sentiment

After training, the PPO agent achieved:

  • Total Return: +405.55%
  • Average Weekly Return: +0.5122%
  • Volatility: 2.5773%
  • Sharpe Ratio: 0.1990

 


Figure 3. Portfolio Value Over Time (PPO with Sentiment)

 

4. Impact of Sentiment: Ablation Study

To isolate the effect of sentiment integration, a comparative ablation study was performed. Two environments were used:

  • PortfolioEnvSharpe (With Sentiment): The observation space included historical returns, recency-weighted sentiment scores, and previous portfolio weights.
  • PortfolioEnvSharpeNoSentiment (Without Sentiment): The observation space included only historical returns and previous portfolio weights.

Key differences between the environments:

  • Input Features: Financial returns + sentiment vs only returns.
  • Market Information: Quantitative + behavioral vs only quantitative.
  • Decision Basis: Sentiment-aware decisions vs price-history-only decisions.
  • Expected Sensitivity: Higher sensitivity to market shifts with sentiment.
  • Potential Volatility Handling: Improved volatility anticipation using sentiment.

This comparison highlights the informational advantage offered by sentiment signals.

5. Portfolio Evaluation Using PPO Value Function

Static portfolio quality was assessed using the PPO value network. Random portfolio Sharpe scores clustered around 11.02–11.04.

 


Figure 4. Distribution of Predicted Sharpe Ratios (Random Portfolios)

 

Observations:

Low variance in portfolio quality estimates.

Preference for diversified portfolios with minor score differences.

This suggests strong generalization, but potentially limited fine-grained discrimination.

6. Optimization via GWO and WOA

Evolutionary optimizers were applied using the PPO surrogate:

  • GWO Best Predicted Sharpe: 11.0384
  • WOA Best Predicted Sharpe: 11.0454

 


Figure 5. Convergence of GWO and WOA

 

WOA achieved slightly faster convergence and marginally better fitness scores.

7. Analysis and Interpretation

 


Figure 6. Return Comparison Across Strategies

 

Table 2.

Summary of Strategy Performance

Strategy | Total Return (%) | Average Weekly Return (%) | Volatility (%) | Sharpe Ratio
PPO with Sentiment | 405.55 | 0.5122 | 2.5773 | 0.1987
PPO without Sentiment | 352.43 | 0.4759 | 2.4383 | 0.1952
GWO (PPO fitness) | 308.87 | 0.4451 | 2.5033 | 0.1778
WOA (PPO fitness) | 222.10 | 0.3729 | 2.4196 | 0.1541
SPY Benchmark | 163.25 | 0.3184 | 2.5571 | 0.1245

 

a. Analysis of PPO-Based Strategies

The PPO agent trained with sentiment signals outperformed all others:

  • Total Return: 405.55%
  • Sharpe Ratio: 0.1987

PPO without sentiment:

  • Total Return: 352.43%
  • Sharpe Ratio: 0.1952

Although Sharpe Ratios were close, the cumulative wealth was significantly higher with sentiment integration. This confirms the predictive value of behavioral signals.

b. Performance of Evolutionary Strategies

  • GWO: Total Return 308.87%, Sharpe Ratio 0.1778
  • WOA: Total Return 222.10%, Sharpe Ratio 0.1541

Both evolutionary strategies outperformed the SPY benchmark but lagged behind PPO models, confirming the advantages of dynamic adaptation.

c. Benchmark Comparison

SPY Benchmark:

  • Total Return: 163.25%
  • Sharpe Ratio: 0.1245

The benchmark showed significantly lower performance, validating the superiority of adaptive, sentiment-enhanced models.

d. Conclusions from Experimental Analysis

H1: The consistent outperformance of the sentiment-aware PPO agent over its sentiment-free counterpart validates H1. The integration of behavioral sentiment signals into the observation space significantly enhanced both cumulative returns and Sharpe Ratios, confirming the value of qualitative data in improving portfolio performance beyond return-only models.

H2: The effectiveness of the PPO-trained value function as a static evaluator supports H2. Even without full environment rollouts, this learned surrogate reliably estimated the quality of portfolio configurations, as evidenced by the stable and discriminative Sharpe Ratio predictions across thousands of sampled allocations.

H3: The ability of the GWO and WOA optimizers to identify high-quality static portfolios using the PPO value function as a fitness guide confirms H3.
Among the two, WOA demonstrated faster convergence and marginally higher predicted Sharpe Ratios, indicating its strength in navigating complex search landscapes.

H4: The fact that both PPO-guided metaheuristics (GWO and WOA) outperformed the SPY benchmark across all key performance metrics validates H4.
Despite being static in nature, these strategies leveraged the intelligence embedded in the PPO model to achieve superior results relative to traditional passive investment approaches. In summary, the results demonstrate that sentiment-integrated reinforcement learning agents, when coupled with nature-inspired optimization techniques, can significantly improve investment outcomes.
This research highlights the importance of combining adaptive policy models with behavioral signals and metaheuristic exploration to unlock new potential in computational finance. 

Conclusion

This research presents a comprehensive investigation into integrating behavioral sentiment signals and AI techniques — specifically reinforcement learning (PPO) and evolutionary optimization (GWO, WOA) — for portfolio allocation.

1. Summary of Findings

  • H1: Sentiment-aware PPO agents achieved superior cumulative returns (+405.55%) and the highest Sharpe Ratio (0.1987).
  • H2: PPO’s value function effectively served as a fitness evaluator.
  • H3: Evolutionary optimizers (GWO, WOA) successfully constructed high-quality static portfolios.
  • H4: PPO-augmented strategies outperformed the SPY index.

2. Contributions to Research and Practice

  • Developed a sentiment-aware reinforcement learning framework.
  • Introduced PPO’s value function for static portfolio evaluation.
  • Empirically validated sentiment signals in portfolio management.
  • Demonstrated that hybrid RL–metaheuristic models outperform standalone methods.

3. Methodological Innovations

  • Created a custom environment (PortfolioEnvSharpe) with recency-weighted sentiment.
  • Implemented PPO agents under realistic financial constraints.
  • Applied GWO/WOA under strict portfolio constraints.

4. Theoretical and Practical Implications

  • Theoretically bridges behavioral finance with AI-driven decision-making.
  • Offers practical, adaptive strategies for institutional portfolio managers.

5. Limitations and Future Work

Limitations:

  • Reliance on headline sentiment.
  • Exclusion of transaction costs.
  • Static fitness evaluation post-training.

Future directions:

  • Incorporating advanced NLP models (e.g., FinBERT).
  • Implementing rolling-window validation.
  • Expanding to ESG, multi-asset, or global portfolios.
  • Exploring explainable RL for greater transparency.

Final Remarks

Combining deep reinforcement learning with behavioral sentiment analysis and evolutionary optimization forms a powerful and flexible framework for intelligent portfolio management. This study advocates for sentiment-aware, AI-driven strategies that balance returns, risks, and behavioral context in financial decision-making.

 

References:

  1. Markowitz H. Portfolio selection. // The Journal of Finance. – 1952. – Vol. 7(1). – P. 77–91.
  2. Fama E. F. Efficient Capital Markets: A Review of Theory and Empirical Work. // Journal of Finance. – 1970. – Vol. 25(2). – P. 383–417.
  3. Kahneman D., Tversky A. Prospect Theory: An Analysis of Decision under Risk. // Econometrica. – 1979. – Vol. 47(2). – P. 263–291.
  4. Ross S. A. The Arbitrage Theory of Capital Asset Pricing. // Journal of Economic Theory. – 1976. – Vol. 13(3). – P. 341–360.
  5. Thaler R. H. Mental Accounting Matters. // Journal of Behavioral Decision Making. – 1999. – Vol. 12(3). – P. 183–206.
  6. Chen N., Zou Z. Sentiment-based portfolio optimization. // Finance Research Letters. – 2022. – Vol. 46. – 102325.
  7. Pundir P. Dynamic asset allocation using recency-weighted sentiment. // Expert Systems with Applications. – 2023. – Vol. 208. – 118244.
  8. Pratama B. N., Putra F. Deep Learning and Portfolio Optimization. // Journal of Risk and Financial Management. – 2022. – Vol. 15(6). – 263.
  9. Jeribi A., et al. Deep Learning Expert Framework for Stock Markets. // Neural Computing and Applications. – 2023. – Vol. 35. – P. 14975–14992.
  10. Jiang Z., Xu D., Liang J. A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. // arXiv preprint arXiv:1706.10059. – 2017.
  11. Wang Y., et al. Dynamic portfolio optimization using PPO. // IEEE Access. – 2021. – Vol. 9. – P. 69407–69420.
  12. Liang Z., et al. Financial Portfolio Management via Deep Reinforcement Learning. // Quantitative Finance. – 2022. – Vol. 22(2). – P. 271–290.
  13. de la Rica Escudero F., et al. Explainable Deep Reinforcement Learning for Portfolio Management. // Journal of Financial Data Science. – 2023. – Vol. 5(1). – P. 59–73.
  14. Zhang J., et al. Hybrid Action Space PPO for Financial Portfolio Optimization. // Knowledge-Based Systems. – 2023. – Vol. 266. – 110293.
  15. Hasan R., et al. Whale Optimization Algorithm Applied to Portfolio Optimization. // Computational Economics. – 2022. – Vol. 60(3). – P. 931–952.
  16. Ahmad T., Shahid M. Grey Wolf Optimizer for Portfolio Optimization. // International Journal of Finance and Economics. – 2022.
  17. Zhou B., et al. Directional-Change Genetic Algorithm for Portfolio Rebalancing. // Applied Soft Computing. – 2023. – Vol. 126. – 109353.
  18. Kuo R. J., et al. Quantum-Inspired Tabu Search for Portfolio Optimization. // Expert Systems with Applications. – 2022. – Vol. 201. – 116981.
  19. Carrascal G. A., et al. Quantum Approximate Optimization Algorithm for Portfolio Optimization. // Quantum Information Processing. – 2023. – Vol. 22(3). – 86.
  20. Tripathy A., et al. D-Wave Constrained Quadratic Model for Portfolio Optimization. // Journal of Quantum Computing. – 2023.
  21. He X., Zhang C. Dynamic Price Prediction with Transaction Costs. // Information Sciences. – 2022. – Vol. 596. – P. 416–432.
  22. Loke Y. H., et al. Hybrid Metaheuristics for Portfolio Optimization: A Review. // Artificial Intelligence Review. – 2023.
  23. Muthivhi T., van Zyl G. Sentiment-Integrated LSTM Portfolios. // Journal of Computational Finance. – 2023.
  24. Leung M. T., et al. Machine Learning and Behavioral Finance. // International Review of Financial Analysis. – 2023.
  25. Aithal A., et al. Comparative Study of PPO, DDPG, and SAC for Financial Trading. // Applied Soft Computing. – 2022. – Vol. 128. – 109585.
Information about the authors

Student, School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan


PhD, Associate Professor, School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan

