Service Delivery Lead, HeadHunter (hh.ru), Moscow, Russia
DIFFERENTIAL IMPACT OF STORY POINT REVISIONS ON EFFORT PREDICTABILITY AND DELIVERY EFFICIENCY IN AGILE DEVELOPMENT
ABSTRACT
Agile software development relies on initial story point estimates for planning purposes; however, these estimates are frequently revised, raising questions about their impact on project predictability and delivery efficiency. This study examines the effect of post-estimation story point revisions on the predictability of actual effort, operationalized as absolute residuals from a baseline regression model, as well as on resolution time and time in progress. Using a dataset of 5,579 resolved software tasks, multiple regression models were applied with controls for project, issue type, and priority. A key methodological finding is that all tasks in the sample were estimated at creation, which excluded the analysis of estimation timing and allowed the study to focus explicitly on the core effect of estimate revisions. The results indicate that while story point revisions do not have a statistically significant impact on the predictability of final effort, they are strongly associated with substantially longer development cycles. Specifically, tasks with revised estimates exhibit approximately 48% longer resolution times and 70% more time in progress. These findings suggest that story point revisions serve as a critical signal of underlying complexity or scope change, significantly increasing active development effort and overall delivery timelines, and thus provide important indicators for risk management in agile contexts.
Keywords: Agile development, story points, estimate revision, effort predictability, delivery efficiency, issue resolution time, risk management, requirement complexity, software engineering.
Introduction. Agile software development, characterized by its iterative and adaptive nature, has become the dominant paradigm in modern software engineering. A cornerstone of agile planning and resource allocation is the practice of story point estimation. Story points function as a heuristic, relative measure encompassing the effort, complexity, and inherent risk associated with the implementation of a user story or task. Initial estimates are crucial in enabling teams to forecast delivery timelines, manage backlogs, and effectively evaluate their capacity for upcoming sprints, thereby supporting efficient project management and the maintenance of predictability in dynamic development environments.
However, the inherent dynamism of software development means that initial plans rarely remain static. Story point estimates, despite their foundational role, are frequently revised as teams develop a deeper understanding of requirements, encounter unforeseen technical challenges, or adapt to changing business needs. Post-estimation adjustments are commonplace, raising fundamental questions about their actual implications. Are such revisions merely a healthy adaptive response to new information, or do they represent critical signals of underlying complexity, scope creep, or planning deficiencies that may substantially affect project predictability and delivery efficiency? The challenge lies in disentangling these possibilities within a complex and multifaceted development process.
It is important to note that the existing literature often conceptualizes estimation accuracy as a static comparison between initial estimates and actual outcomes, thereby overlooking the dynamic aspects of the estimation process itself [4]. Two critical yet underexplored dimensions are the timing of estimation finalization—the duration from issue creation to the point at which the final story point estimate is recorded (hereafter referred to as estimation finalization delay)—and the frequency of story point revisions after the initial estimate. Delays in estimation finalization may indicate early uncertainty or an extended exploration phase, whereas estimate revisions explicitly signal changes in the team’s understanding of the required work. The interaction between these two factors, as well as their combined impact on development outcomes, represents a significant gap in our understanding of agile project dynamics [4].
This study aims to address this critical gap by examining the dual effects and interaction of estimation finalization delay and the occurrence of story point revisions on both the predictability of actual development effort and key indicators of delivery efficiency in agile contexts [4; 9]. We hypothesize that both delays in estimation finalization and subsequent estimate revisions interactively influence the accuracy with which actual effort can be predicted, as well as the overall development cycle time [4; 9]. Specifically, effort predictability is operationalized as the absolute residuals from a baseline regression model predicting actual effort (in minutes) based on story points and other standard issue attributes [4; 9; 11], with larger absolute residuals indicating lower predictability. To assess delivery efficiency, we analyze resolution time and time in progress (both measured in minutes) as direct indicators of development efficiency and throughput [4].
To rigorously test the hypothesis, we analyze a large dataset of 5,579 resolved software issues. The methodology employs multiple regression models to systematically disentangle the individual and interactive effects of ‘Estimation_Finalization_Lag’ (as a continuous variable) and ‘story_point_changed_after_estimation’ (as a binary indicator of revision occurrence) [4]. Potential confounding factors, such as ‘project_id’, ‘type’, and ‘priority’, are carefully controlled for in our statistical models. This statistical approach enables testing the significance and magnitude of the effects and provides empirical evidence for interpreting the complex signals embedded in estimate revisions and their timing [4].
By providing a detailed understanding of how the dynamic nature of story point estimation—both in terms of timing and revisions—affects subsequent development outcomes, this study aims to deliver vital, empirically grounded insights [7; 17]. Ultimately, the work moves beyond simplified views of estimation accuracy [10] and offers practical indicators for agile teams to enhance planning precision, proactively manage risks, and substantially improve the predictability and efficiency of their software delivery processes [7].
Research methodology. The empirical basis of the study is a large dataset comprising 5,579 resolved software issues. The dataset, extracted from an agile software development environment and stored as issues.csv, contains detailed information related to the lifecycle of each issue. Key attributes include creation dates, dates of initial and final estimation, resolution dates, story point estimates, total effort recorded in minutes, and various categorical descriptors such as issue type, priority level, and unique project identifiers.
The initial phase of data preprocessing involved converting all relevant date–time columns, namely creation_date, estimation_date, resolution_date, and last_updated, into a standardized datetime object format [13; 15]. This transformation was critical to ensuring accurate temporal calculations required for subsequent feature engineering [12].
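As an illustrative sketch of this preprocessing step (the sample rows are hypothetical; the column names follow the paper), the conversion could look like:

```python
import pandas as pd

# Toy rows standing in for issues.csv; the values are illustrative only.
df = pd.DataFrame({
    "creation_date":   ["2023-01-05 09:00", "2023-01-06 14:30"],
    "estimation_date": ["2023-01-05 09:00", "2023-01-06 14:30"],
    "resolution_date": ["2023-01-20 17:00", "2023-02-01 11:00"],
    "last_updated":    ["2023-01-21 08:00", "2023-02-01 12:00"],
})

date_cols = ["creation_date", "estimation_date", "resolution_date", "last_updated"]
for col in date_cols:
    # errors="coerce" turns unparseable entries into NaT instead of raising
    df[col] = pd.to_datetime(df[col], errors="coerce")
```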
Following this, a rigorous filtering procedure was applied to construct a clean analytical sample, referred to as df_model_data [12; 20]. The refined dataset included only issues that met the following criteria: a non-zero story_point estimate, recorded non-zero total_effort_minutes, valid (non-null) entries for creation_date, estimation_date, and resolution_date, a status explicitly indicating completion (specifically, ‘Closed’, ‘Resolved’, or ‘Done’), and a total_effort_minutes value greater than zero [3]. These criteria ensured that the analysis focused exclusively on issues that had completed the full development lifecycle—from estimation through active work to final resolution—thereby providing a robust and relevant sample for examining effort predictability and delivery efficiency.
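A minimal sketch of this filtering, assuming a toy DataFrame with the paper's column names, might be:

```python
import pandas as pd

# Hypothetical mini-sample; the real data come from issues.csv.
df = pd.DataFrame({
    "story_point":          [3, 0, 5, 2],
    "total_effort_minutes": [120, 90, 0, 240],
    "creation_date":        pd.to_datetime(["2023-01-01"] * 4),
    "estimation_date":      pd.to_datetime(["2023-01-01"] * 4),
    "resolution_date":      pd.to_datetime(
        ["2023-01-10", "2023-01-11", "2023-01-12", None]
    ),
    "status":               ["Closed", "Done", "Resolved", "Closed"],
})

completed_statuses = {"Closed", "Resolved", "Done"}
df_model_data = df[
    (df["story_point"] > 0)
    & (df["total_effort_minutes"] > 0)
    & df["creation_date"].notna()
    & df["estimation_date"].notna()
    & df["resolution_date"].notna()
    & df["status"].isin(completed_statuses)
].copy()
# Of the four toy rows, only the first satisfies every criterion.
```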
From the preprocessed dataset, several critical variables were constructed to directly address the hypotheses outlined in the study [5].
To quantitatively capture estimation finalization time, a central construct of the study, the variable Estimation_Finalization_Lag_Days was computed. This continuous variable measures the duration, in days, between an issue’s creation_date and its estimation_date. The estimation_date specifically denotes the point at which the final story point estimate for the issue was recorded [4].
To ensure logical consistency and prevent spurious data points, any calculated negative lag values were removed from the dataset, ensuring that all issues included in the analysis exhibited a non-negative estimation finalization delay.
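These two steps, computing the lag and dropping negative values, could be sketched as follows with hypothetical rows:

```python
import pandas as pd

# Illustrative rows; the second has an inconsistent (negative) lag and is dropped.
df = pd.DataFrame({
    "creation_date":   pd.to_datetime(["2023-01-01 09:00", "2023-01-05 09:00"]),
    "estimation_date": pd.to_datetime(["2023-01-03 09:00", "2023-01-04 09:00"]),
})

# Duration in days between issue creation and final estimation
df["Estimation_Finalization_Lag_Days"] = (
    (df["estimation_date"] - df["creation_date"]).dt.total_seconds() / 86400.0
)
# Keep only logically consistent, non-negative lags
df = df[df["Estimation_Finalization_Lag_Days"] >= 0].copy()
```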
The frequency of story point revisions after the initial estimation was captured by the binary indicator variable story_point_changed_after_estimation. This variable is fundamental to our investigation of whether such revisions serve as signals of underlying complexity or evolving work scope.
A value of True for this variable indicates that the issue’s initial story point estimate was subsequently modified or revised at least once after it was first recorded. Conversely, a value of False indicates that the story point estimate remained unchanged throughout the entire lifecycle of the issue.
In line with the assumptions of linear regression models and acknowledging the inherent right-skewness commonly observed in software engineering metrics such as effort and time, logarithmic transformations were applied to several key continuous variables.
Specifically, total_effort_minutes, resolution_time_minutes, in_progress_minutes, and Estimation_Finalization_Lag_Days were transformed using the natural logarithm of one plus the variable value (np.log1p).
This transformation effectively mitigates skewness, stabilizes variance, and promotes linearity in the relationships between variables, thereby enhancing the validity and interpretability of the regression analyses. The resulting transformed variables were labeled log_total_effort, log_resolution_time, log_in_progress_time, and log_estimation_lag, respectively.
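A short sketch of the transformations, using toy values:

```python
import numpy as np
import pandas as pd

# Illustrative values only; zeros show why log1p is used rather than log.
df = pd.DataFrame({
    "total_effort_minutes":    [0, 60, 600],
    "resolution_time_minutes": [0, 1440, 7200],
    "in_progress_minutes":     [0, 480, 2400],
    "Estimation_Finalization_Lag_Days": [0.0, 0.0, 2.0],
})

# log1p(x) = ln(1 + x); keeps zeros finite while compressing the right tail
df["log_total_effort"]     = np.log1p(df["total_effort_minutes"])
df["log_resolution_time"]  = np.log1p(df["resolution_time_minutes"])
df["log_in_progress_time"] = np.log1p(df["in_progress_minutes"])
df["log_estimation_lag"]   = np.log1p(df["Estimation_Finalization_Lag_Days"])
```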
To rigorously operationalize effort predictability as the accuracy with which actual development effort can be forecast, a baseline ordinary least squares (OLS) regression model was constructed [19]. The model was used to predict log_total_effort based on standard and widely used issue attributes, thereby establishing a benchmark for prediction error [19].
The dependent variable for this baseline model was log_total_effort. The set of independent variables included story_point, treated as a continuous numerical predictor, along with several categorical control variables: type, representing the issue category (e.g., bug, feature); priority, indicating its urgency; and project_id, to account for project-specific variation. Prior to model fitting, categorical predictors were converted into a numerical format suitable for regression through one-hot encoding using pd.get_dummies. To prevent multicollinearity, the first category for each one-hot encoded variable was excluded. A constant term was also explicitly added to the matrix of independent variables.
The OLS model was fitted using the statsmodels.api library in Python. After fitting the model, predicted values of log_total_effort were generated. The primary metric for effort predictability was then computed as the absolute value of the residuals from the baseline model [19]. Effort_Predictability_Residuals represent the absolute difference between the observed log_total_effort and the model-predicted log_total_effort [2]. Accordingly, larger absolute residuals indicate lower predictability of actual effort, serving as a direct and quantifiable measure of prediction error.
The central analytical phase of the study involved fitting three distinct multiple regression models. These models were designed to systematically examine the individual and interactive effects of log_estimation_lag and story_point_changed_after_estimation on the primary outcome variables. All models employed the OLS method [8] using the statsmodels.formula.api library in Python, which facilitates concise and intuitive formula-based specification of regression models, including complex interaction terms and categorical variables.
Outcome ~ log_estimation_lag * story_point_changed_after_estimation + C(project_id) + C(type) + C(priority)
In this formula, the ‘*’ operator automatically includes both main effects—log_estimation_lag and story_point_changed_after_estimation—as well as their multiplicative interaction term. The C() wrapper explicitly designates project_id, type, and priority as categorical variables, ensuring their appropriate treatment as sets of dummy variables within the regression framework.
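A runnable sketch of this specification on synthetic data follows; note that, unlike the study's sample, log_estimation_lag is given variation here so the interaction term is estimable:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 60
# Synthetic data with the paper's variable names.
df = pd.DataFrame({
    "log_estimation_lag": rng.uniform(0, 2, size=n),
    "story_point_changed_after_estimation": rng.choice([False, True], size=n),
    "project_id": rng.choice(["P1", "P2"], size=n),
    "type":       rng.choice(["bug", "feature"], size=n),
    "priority":   rng.choice(["high", "low"], size=n),
})
df["log_resolution_time"] = (
    5.0
    + 0.4 * df["story_point_changed_after_estimation"].astype(float)
    + rng.normal(0, 0.5, size=n)
)

# '*' expands to both main effects plus their interaction;
# C() forces categorical (dummy-variable) treatment.
model = smf.ols(
    "log_resolution_time ~ log_estimation_lag"
    " * story_point_changed_after_estimation"
    " + C(project_id) + C(type) + C(priority)",
    data=df,
).fit()
```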
Model 1. Predicting effort predictability residuals. The first primary regression model was aimed at assessing the effects of estimation finalization delay, the occurrence of story point revisions, and their interaction on Effort_Predictability_Residuals. The model directly addressed whether delays in estimation finalization or the act of revising estimates contribute to reduced predictability of actual development effort, thereby increasing uncertainty in project forecasting [4].
Model 2. Predicting resolution time. The second model focused on predicting log_resolution_time. This outcome variable, representing the total duration from issue creation to its final resolution, serves as an important indicator of overall delivery efficiency and development cycle length [18]. The model sought to quantitatively assess how estimation dynamics influence the efficiency and speed of issue completion.
Model 3. Predicting time in progress. The third model examined the effects of the key predictors on log_in_progress_time. This metric specifically captures the duration an issue spends in an active development status, excluding any waiting time or periods of inactivity. By analyzing time in progress, the model provided insight into the efficiency of actual active development effort, offering a more granular understanding of how estimate revisions affect direct work execution.
For each of the three models, detailed regression summaries were carefully recorded and analyzed, including estimated coefficients, standard errors, p-values [1], and model fit statistics (such as R-squared), in order to establish the statistical significance [14] and practical magnitude of the hypothesized effects.
Prior to conducting the main regression analyses, an extensive exploratory data analysis (EDA) was performed to gain a deeper understanding of the dataset characteristics, variable distributions, and preliminary relationships. Histograms were generated for key continuous variables, including Effort_Predictability_Residuals, log_estimation_lag, log_resolution_time, and log_in_progress_time. These visualizations enabled assessment of their distributions, identification of skewness, and detection of potential outliers [16], thereby justifying the application of logarithmic transformations.
Bivariate analysis was conducted using boxplots to visually compare the distributions of the three outcome variables (Effort_Predictability_Residuals, log_resolution_time, and log_in_progress_time) across the two groups defined by story_point_changed_after_estimation (i.e., issues with revisions versus those without), providing an initial, intuitive assessment of the differential impact of story point revisions on critical performance metrics [10].
To aid in interpreting the interaction terms derived from the main regression models, interaction plots were generated using the regplot function from the seaborn library. These plots illustrate how the relationship between log_estimation_lag and each outcome variable varies depending on whether a story point revision occurred. This visualization approach provided a clear and intuitive understanding of conditional effects, demonstrating how the impact of estimation delay on predictability and delivery efficiency is moderated by the presence of story point revisions captured by story_point_changed_after_estimation [6]. All generated figures were saved as high-resolution PNG files for comprehensive documentation and reporting.
All stages of data manipulation, statistical modeling, and graphical visualization were carried out using the Python programming language (version 3.x). The primary libraries employed in the study included pandas for efficient data loading, cleaning, and manipulation; numpy for advanced numerical operations; statsmodels for robust statistical modeling, specifically statsmodels.api for general OLS regression and statsmodels.formula.api for formula-based model specification and inclusion of interaction terms; sklearn.linear_model for linear regression utilities; and matplotlib.pyplot in conjunction with seaborn for the creation of high-quality statistical visualizations.
Results and discussion. The initial dataset of software issues underwent a rigorous cleaning and filtering process, as described earlier, resulting in a final analytical sample of 5,579 resolved issues. This refined dataset formed the foundation for all subsequent analyses.
A key observation during feature engineering was the complete absence of variation in ‘Estimation_Finalization_Lag_Days’. For every issue in the analytical sample, the estimation_date was identical to the creation_date. This indicates a consistent development practice across the selected projects, whereby story point estimates are invariably recorded at the moment of issue creation. The pattern is visually illustrated in Figure 1 and Figure 2, where ‘Estimation_Finalization_Lag_Days’ (or its logarithmically transformed version) exhibits zero variance. While this provides valuable insight into the teams’ workflow, it precluded using ‘Estimation_Finalization_Lag_Days’ as a predictor and, consequently, testing the primary hypothesis regarding its interaction with story point revisions. As a result, the study was refocused on the main effect of story point revisions alone.
The univariate distributions of the primary outcome variables—‘Effort_Predictability_Residuals’, ‘log_resolution_time’, and ‘log_in_progress_time’—exhibited the characteristic right-skewness commonly observed in software engineering metrics. This justified the application of logarithmic transformations to ‘resolution_time_minutes’ and ‘in_progress_minutes’, as described in the Methods section, ensuring that the transformed variables (‘log_resolution_time’ and ‘log_in_progress_time’) more closely satisfied the assumptions of linear regression models. Figure 1 presents a comparison of the original and logarithmically transformed distributions, clearly illustrating how the transformations mitigate right-skewness. In addition, Figure 2 specifically highlights the distributions of key variables, including ‘Effort_Predictability_Residuals’, ‘Log(Resolution Time)’, and ‘Log(In Progress Time)’, confirming their right-skewed nature prior to transformation (for the time-based metrics) and the zero variance of ‘Log(Estimation Finalization Lag)’.
Figure 1. Univariate distributions comparing the original and logarithmically transformed values for total effort, resolution time, time in progress, and estimation finalization lag.
The original effort and time metrics exhibit right-skewness, which is mitigated by logarithmic transformation to better satisfy regression assumptions. Estimation finalization lag displays zero variance, indicating consistent estimation at issue creation across the entire dataset.
Figure 2. Univariate distributions of key variables
The histograms illustrate right-skewed distributions for effort predictability residuals, Log(Resolution Time), and Log(Time in Progress), thereby justifying the application of logarithmic transformations to time-based metrics. Critically, Log(Estimation Finalization Lag) exhibits zero variance, indicating that all issues were estimated at creation, which precludes the analysis of estimation timing.
Preliminary bivariate analyses and the correlation matrix presented in Figure 3 revealed weak unadjusted correlations between ‘story_point_changed_after_estimation’ and the outcome variables (e.g., r = 0.073 for ‘log_resolution_time’, r = 0.053 for ‘log_in_progress_time’, and r = −0.008 for ‘Effort_Predictability_Residuals’). These initial observations underscored the necessity of employing multivariate regression to isolate the specific effect of story point revisions while rigorously controlling for potential confounding factors such as ‘project_id’, ‘type’, and ‘priority’. The correlation matrix also visually confirms the absence of correlations involving ‘log_estimation_lag’, reinforcing its zero variance.
Figure 3. Correlation matrix quantifying linear relationships among the key variables
The correlation matrix quantitatively defines the linear relationships among the key variables, including story points, effort, resolution time, time in progress, estimation lag, story point revision status, and effort predictability residuals. It illustrates weak unadjusted correlations between story point revision status and the outcome variables (logarithm of resolution time, logarithm of time in progress, and effort predictability residuals), underscoring the need for multivariate analysis. The absence of correlations involving ‘log_estimation_lag’ further confirms its zero variance.
To operationalize effort predictability, a baseline ordinary least squares (OLS) regression model was constructed to predict ‘log_total_effort’ based on ‘story_point’ and categorical control variables (‘type’, ‘priority’, and ‘project_id’). The model achieved an R-squared value of 0.379, indicating that approximately 37.9% of the variance in the logarithm of total effort is explained by the standard predictors. This level of explanatory power is considered moderate and aligns with expectations for effort estimation models in complex software development environments.
The absolute values of the residuals derived from the baseline model were then designated as ‘Effort_Predictability_Residuals’. This metric serves as a direct measure of prediction error, whereby larger absolute residuals indicate lower predictability of actual effort, reflecting greater divergence between observed and expected effort.
An assessment of multicollinearity among the predictors in the baseline model, particularly among the one-hot encoded categorical variables, revealed high variance inflation factors (VIFs) for several levels of ‘priority’ and ‘project_id’. Although this indicates substantial multicollinearity among the control variables, it is a common issue with highly granular categorical predictors and typically does not bias the coefficients of the primary variables of interest, nor does it invalidate the use of the model residuals for subsequent analysis, provided that the variable of interest is not highly collinear with the affected controls.
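A VIF check of this kind could be sketched as follows; the design matrix here is synthetic, so its VIFs will be low, unlike the high values reported for the real controls:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 50
# Synthetic stand-in for the one-hot encoded predictor matrix.
X = pd.DataFrame({
    "story_point":   rng.integers(1, 8, size=n).astype(float),
    "priority_high": rng.choice([0.0, 1.0], size=n),
    "project_P2":    rng.choice([0.0, 1.0], size=n),
})

# VIF for column i regresses it on all other columns
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
# As a rough rule of thumb, VIF > 10 is often read as substantial multicollinearity
```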
As a direct consequence of the zero variance in ‘log_estimation_lag’, as discussed in the descriptive analysis, the planned examination of the interaction between estimation delay and story point revision could not be conducted. Accordingly, the focus of the study was refined to investigate the main effect of ‘story_point_changed_after_estimation’ on the three primary outcome variables: ‘Effort_Predictability_Residuals’, ‘log_resolution_time’, and ‘log_in_progress_time’. Three separate multiple regression models were fitted, each including ‘story_point_changed_after_estimation’ as the key predictor and controlling for ‘project_id’, ‘type’, and ‘priority’. Table 1 summarizes the key coefficients for ‘story_point_changed_after_estimation’ from these models.
Table 1.
Key coefficients for ‘story_point_changed_after_estimation’
| Model | Dependent variable | Coefficient (β) | Std. error | t | p-value | R-squared |
|---|---|---|---|---|---|---|
| 1 | ‘Effort_Predictability_Residuals’ | −0.0970 | 0.061 | −1.595 | 0.111 | 0.131 |
| 2 | ‘log_resolution_time’ | 0.3923 | 0.091 | 4.313 | < 0.001 | 0.206 |
| 3 | ‘log_in_progress_time’ | 0.5279 | 0.120 | 4.394 | < 0.001 | 0.429 |

Note: All models control for ‘C(project_id)’, ‘C(type)’, and ‘C(priority)’. The dependent variables in Models 2 and 3 (‘log_resolution_time’ and ‘log_in_progress_time’) are log-transformed.
Model 1, as detailed in Table 1, examined the relationship between story point revisions and the predictability of actual development effort, measured as ‘Effort_Predictability_Residuals’. The regression coefficient for ‘story_point_changed_after_estimation’ was estimated at −0.0970. However, this effect was not statistically significant (p = 0.111). This result is visually corroborated by Figures 4 and 5, which show largely comparable distributions of ‘Effort Predictability Residuals’ for issues with revised versus unrevised story point estimates.
The result suggests that, after controlling for various issue characteristics and project context, the occurrence of a story point revision does not substantially alter the magnitude of the final prediction error for an issue’s total effort. This finding may initially appear counterintuitive. However, it can be interpreted in light of the purpose of revisions: they are often undertaken to correct an initial, potentially inaccurate estimate based on new information or a deeper understanding of the task. If a revision successfully adjusts the estimate to better reflect the true effort required, the final story point value used in the baseline model may yield a prediction that is no less accurate than one based on an unrevised estimate. The underlying factors that necessitated the revision (e.g., emerging complexity) may be captured by other variables in the baseline model, or the revision itself may function as an effective recalibration mechanism.
Model 2, reported in Table 1, examined the impact of story point revisions on ‘log_resolution_time’, which represents the total duration from issue creation to final resolution. The analysis revealed a highly significant positive association, with a coefficient (β) of 0.3923 (p < 0.001). This effect is clearly illustrated in Figures 4 and 5, where issues with revised story point estimates (shown in orange in Figure 4 and in the right-hand box in Figure 5) consistently exhibit higher values of ‘Log(Resolution Time)’.
These results constitute a robust finding, indicating that issues undergoing story point revisions are associated with substantially longer resolution times. Given that the dependent variable is logarithmically transformed, the coefficient can be interpreted in terms of percentage change. Exponentiating the coefficient (e^0.3923 ≈ 1.48) implies that, holding all other control variables constant, issues with revised story point estimates experience, on average, approximately 48% longer resolution times compared to issues whose estimates remained unchanged. This strongly positions story point revisions as a critical signal of elevated schedule risk and extended overall delivery cycles.
Model 3, also summarized in Table 1, focused on ‘log_in_progress_time’, a more granular measure reflecting the duration an issue spends in an active development status. The model revealed an even stronger and highly significant positive association between story point revisions and time in progress, with a coefficient (β) of 0.5279 (p < 0.001). Consistent with this statistical finding, Figures 4 and 5 visually demonstrate that ‘Log(In-Progress Time)’ is markedly higher for issues with revised story point estimates.
Interpreting the coefficient (e^0.5279 ≈ 1.695) indicates that issues with revised story point estimates spend, on average, approximately 70% more time in the “In Progress” status compared to issues without revisions, after controlling for project, type, and priority. This result provides important insight into the source of the extended resolution time observed in Model 2. The substantial increase in active development effort, rather than merely administrative delays or waiting time, drives the overall lengthening of the development cycle. This finding strongly supports the view that story point revisions are symptomatic of underlying issues such as previously unanticipated complexity, scope expansion, or the need for significant rework, all of which translate directly into greater hands-on development time.
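The percentage interpretations in Models 2 and 3 follow directly from exponentiating the Table 1 coefficients:

```python
import math

# Coefficients from Table 1 (Models 2 and 3)
beta_resolution  = 0.3923
beta_in_progress = 0.5279

# For a log-transformed outcome, exp(beta) - 1 approximates the
# percentage change associated with the binary predictor
pct_resolution  = (math.exp(beta_resolution) - 1) * 100   # ~48% longer resolution
pct_in_progress = (math.exp(beta_in_progress) - 1) * 100  # ~70% more time in progress
```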
Although the interaction plots were initially intended to illustrate the combined effects of estimation delay and revisions, the zero variance in ‘log_estimation_lag’ limited their utility. Nevertheless, as discussed and shown, Figures 4 and 5 visually corroborate the main effects observed for ‘log_resolution_time’ and ‘log_in_progress_time’, demonstrating a clear upward shift in the distributions for issues with revisions compared to those without, under a uniformly zero estimation delay.
Figure 4. Scatter plots illustrating outcome measures by story point revision status, accounting for the fact that all estimates were recorded at issue creation (zero ‘Log (Estimation Finalization Lag)’)
Issues with revised story point estimates (orange) consistently exhibit higher ‘Log(Resolution Time)’ and ‘Log(Time in Progress)’ than unrevised issues (blue), indicating extended work duration. ‘Effort Predictability Residuals’ are comparable across both groups, suggesting that revisions do not substantially affect predictability under these conditions.
Figure 5. Boxplots comparing effort predictability residuals, Log (Resolution Time), and Log (Time in Progress) for issues with and without story point revisions
Issues with revised story point estimates exhibit noticeably higher median log-transformed resolution times and time in progress, indicating that such issues require substantially more time to complete. In contrast, the median absolute residuals for effort predictability are similar across both groups, suggesting that revisions do not materially alter the magnitude of prediction error.
The results reveal a nuanced and critical role of story point revisions in agile development. Contrary to what might be intuitively expected, the act of revising story point estimates does not substantially reduce the predictability of final actual effort, as evidenced by the non-significant coefficient in Model 1 and the visual similarity in ‘Effort Predictability Residuals’ shown in Figures 4 and 5. This suggests that teams are, on average, successful in recalibrating their estimates to align with an updated understanding of the work. However, this “correction” comes at a considerable cost: issues with revised story point estimates emerge as strong indicators of substantially extended development cycles. Specifically, as reported in Table 1 and visually corroborated by Figures 4 and 5, such issues are associated with approximately 48% longer overall resolution times and a striking 70% increase in active time spent in development.
The results underscore that story point revision is not merely an administrative update; rather, it serves as a powerful, empirically observable signal of underlying complexity, scope changes, or unforeseen challenges that translate directly into increased development effort and extended delivery timelines. For agile teams, this implies that the occurrence of a revision should be treated as a critical event. It signals an elevated risk of delays and warrants immediate attention, potentially prompting deeper root-cause analysis, reassessment of priorities, or replanning of sprint commitments. Proactive management of issues that undergo story point revisions may therefore represent a key strategy for improving delivery predictability and efficiency in agile contexts. The consistent practice of estimating issues at creation, as observed in our dataset and illustrated in Figures 1 and 2, further emphasizes the importance of robust upfront analysis and planning to minimize the need for costly revisions later in the development lifecycle.
Conclusion. Agile software development, while emphasizing adaptability, relies heavily on initial story point estimates for effective planning and resource allocation. However, the frequent revision of such estimates poses a significant challenge, creating ambiguity regarding their true implications for project predictability and delivery efficiency. This study sought to move beyond simplified views of estimation accuracy by empirically examining the differential impact of post-estimation story point revisions on the predictability of actual development effort and key indicators of delivery efficiency.
To address this objective, a comprehensive dataset of 5,579 resolved software issues was analyzed. The methodological approach involved constructing a baseline regression model to operationalize effort predictability as the absolute residuals of predicted actual effort. Multiple regression models were then employed to assess the relationships between story point revision (as a binary indicator) and these residuals, as well as with the logarithmically transformed resolution time and time in progress. Importantly, the initial plan to examine the interaction between revisions and estimation finalization delay was adapted due to a key data characteristic: all issues in the sample were estimated at creation, resulting in zero variance for the delay variable. Consequently, the study focused on the main effects of story point revisions while carefully controlling for confounding factors such as project, issue type, and priority.
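The two-stage design described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction on synthetic data, not the study's actual code: the variable names are hypothetical, the data are randomly generated stand-ins for the 5,579 real issues, and the categorical controls (project, issue type, priority) are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data for resolved issues.
story_points = rng.choice([1, 2, 3, 5, 8, 13], size=n).astype(float)
revised = rng.integers(0, 2, size=n).astype(float)  # binary revision indicator
log_effort = 0.8 * np.log(story_points) + 0.5 * revised + rng.normal(0, 0.4, n)

def ols(X, y):
    """Least-squares fit with an intercept column prepended; returns (beta, fitted)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta, X1 @ beta

# Stage 1: baseline model predicting actual effort from the estimate alone;
# effort predictability is operationalized as the absolute residuals.
_, fitted = ols(np.log(story_points)[:, None], log_effort)
abs_resid = np.abs(log_effort - fitted)

# Stage 2: regress the absolute residuals on the revision indicator
# (the real models added project, type, and priority as controls).
beta, _ = ols(revised[:, None], abs_resid)
print(f"revision coefficient on |residual|: {beta[1]:.3f}")
```

The same second-stage regression is repeated with log resolution time and log time in progress as outcomes, which is where the reported 48% and 70% effects arise.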
The findings reveal a nuanced and critical role of story point revisions in agile development. Contrary to expectations, the occurrence of a story point revision was not found to have a substantial effect on the predictability of final actual effort. This suggests that, on average, teams are able to successfully recalibrate their estimates to better align with an updated understanding of the work, thereby preserving the predictive power of the revised estimate. However, this recalibration comes at a significant cost to delivery efficiency. Issues that underwent story point revisions were strongly associated with substantially longer development cycles. Specifically, after controlling for other factors, they experienced approximately 48% longer overall resolution times and a striking 70% increase in time spent in the active “In Progress” status compared to issues whose estimates remained unchanged.
The results further emphasize that story point revision is not merely an administrative update or a benign adaptive response. Instead, it serves as a powerful, empirically verifiable signal of underlying complexity, scope changes, or unforeseen challenges that translate directly into increased development effort and extended delivery timelines. For agile teams and project managers, this implies that each story point revision should be treated as a critical event, signaling an elevated risk of delays and warranting immediate attention—potentially triggering deeper root-cause analysis, reprioritization, or replanning of sprint commitments. Proactive management and deeper examination of issues that undergo story point revisions may therefore represent a key strategy for enhancing delivery predictability and improving overall efficiency in agile contexts. The consistent practice of estimating issues at creation, as observed in our dataset, further underscores the importance of robust upfront analysis and planning to minimize the need for costly revisions later in the development lifecycle. Thus, this study moves beyond simplified views of estimation accuracy by offering practical insights for more effective risk management and improved project outcomes in dynamic agile environments.
References:
- Bühlmann P. Statistical significance in high dimensional linear models // Bernoulli. 2013. Vol. 19. No. 4. P. 1212–1242. DOI: 10.3150/12-BEJSP11.
- Carvalho H. D. P., Lima M. N. C. A., Santos W. B. Ensemble regression models for software development effort estimation: a comparative study // International Journal of Software Engineering and Applications. 2020. Vol. 11. No. 3. P. 1–20. DOI: 10.5121/ijsea.2020.11305.
- Deep learning, machine learning, advancing big data analytics and management / W. Hsieh, Z. Bi, K. Chen [et al.]. 2024. URL: https://arxiv.org/abs/2412.02187
- Dynamic prediction of delays in software projects using delay patterns and Bayesian modeling / E. Kula, E. Greuter, A. van Deursen, G. Gousios. 2023. URL: https://arxiv.org/abs/2309.12449
- Impacts of data preprocessing and hyperparameter optimization on the performance of machine learning models applied to intrusion detection systems / M. G. Lima, A. Carvalho, J. G. Álvares [et al.]. 2024. URL: https://arxiv.org/abs/2407.11105
- Inglis A., Parnell A., Hurley C. Visualizing variable importance and variable interaction effects in machine learning models. 2021. URL: https://arxiv.org/abs/2108.04310
- Khan M. M., Xi X., Meneely A. Efficient story point estimation with comparative learning. 2025. URL: https://arxiv.org/abs/2507.14642
- Kuchibhotla A. K., Brown L. D., Buja A. Model free study of ordinary least squares linear regression. 2018. URL: https://arxiv.org/abs/1809.10538
- Pasuksmit J., Thongtanunam P., Karunasekera S. A systematic literature review on reasons and approaches for accurate effort estimations in agile. 2024. URL: https://arxiv.org/abs/2405.01569
- Pham K. P., Neumann M. How to measure performance in agile software development? A mixed method study. 2024. URL: https://arxiv.org/abs/2407.06357
- Poženel M., Fürst L., Vavpotič D. Agile effort estimation: comparing the accuracy and efficiency of planning poker, bucket system, and affinity estimation methods // International Journal of Software Engineering and Knowledge Engineering. 2024. Vol. 34. No. 1. DOI: 10.1142/S021819402350064X.
- Qi D., Miao Z., Wang J. CleanAgent: automating data standardization with LLM based agents. 2025. URL: https://arxiv.org/abs/2403.08291
- Seeam M. I. R., Sheng V. S. Proactive statistical process control using AI: a time series forecasting approach for semiconductor manufacturing. 2025. URL: https://arxiv.org/abs/2509.16431
- Statistical agnostic regression: a machine learning method to validate regression models / J. M. Gorriz, J. Ramirez, F. Segovia [et al.] // Journal of Applied Research and Technology. 2025. DOI: 10.1016/j.jare.2025.04.026.
- TabArena: a living benchmark for machine learning on tabular data / N. Erickson, L. Purucker, A. Tschalzev [et al.]. 2025. URL: https://arxiv.org/abs/2506.16791
- Tavakoli Y., Soares A., Pena L. A novel multilevel taxonomical approach for describing high dimensional unlabeled movement data. 2025. URL: https://arxiv.org/abs/2504.20174
- Tawosi V., Moussa R., Sarro F. Agile effort estimation: have we solved the problem yet? Insights from a replication study // IEEE Transactions on Software Engineering. 2022. DOI: 10.1109/TSE.2022.3228739.
- Towards an interpretable analysis for estimating the resolution time of software issues / D. N. Nastos, T. Diamantopoulos, D. Tosi [et al.]. 2025. URL: https://arxiv.org/abs/2505.01108
- Whigham P. A., Owen C. A., MacDonell S. G. A baseline model for software effort estimation // ACM Transactions on Software Engineering and Methodology. 2021. DOI: 10.1145/2738037.
- Zhang H., Dong Y., Xiao C. Jellyfish: a large language model for data preprocessing. 2024. URL: https://arxiv.org/abs/2312.01678