A COMPARISON OF FEATURE SELECTION TECHNIQUES IN QSAR MODELING

DOI: 10.32743/UniTech.2023.115.10.16040

 

ABSTRACT

This research aimed to assess the effectiveness of nine distinct feature selection techniques, examining their stability and performance on an environmental dataset. The methods evaluated were Chi-square, Mutual Information, ANOVA F-value, Fisher Score, Recursive Feature Elimination, Permutation Importance, Random Forest, LightGBM, and SHAP (Shapley Additive Explanations). Features were ranked by each method, and the top-ranked 6% of features were retained for the final evaluation. Classifiers were trained using an 80% training and 20% testing split, and the results were assessed with the accuracy metric. The experiments revealed that Recursive Feature Elimination and SHAP consistently outperformed the other methods, with SHAP emerging as the top performer, closely followed by Recursive Feature Elimination. The findings underscore the importance of selecting an appropriate feature selection method for optimal classification performance. All Python code used in our experiments is available on GitHub (https://github.com/kushmuratoff/feature_selection).


 

Keywords: Structure-activity mathematical model (QSAR), feature selection methods, comparative analysis.


 

Introduction

Drug discovery is a complex area of study that encompasses the understanding of cellular processes, the ability to predict protein structures, and the evaluation of interactions between molecules and their targets in living organisms [1]. Scientists aim to unravel the mechanisms underlying diseases and develop compounds that can effectively combat these disease-causing agents. However, the intricate nature of cellular activities presents considerable obstacles in the comprehensive design of drugs. To address these challenges, methods such as high-throughput screening have been introduced [1].

The advent of combinatorial chemistry in the 1980s led to the synthesis of a vast number of novel molecular compounds, necessitating the development of more focused search methods beyond exhaustively exploring every possible combination of molecules. As a result, statistical techniques and advanced computational tools have become crucial in the field of drug development [2]. Among these methods, the Structure-Activity Relationship (SAR) approach assumes that the structural characteristics of a molecule are closely linked to its biological function. SAR aims to uncover these correlations and utilize the physicochemical properties of new molecules to predict their biological activity [3-4].

Machine learning has demonstrated potential in SAR analyses, enabling the assessment of the potential therapeutic effectiveness of compounds for specific diseases or targets [5]. For example, artificial neural networks have been utilized to efficiently search large databases and identify potential drug candidates [3]. Research conducted by Wagener et al. [4] has achieved accuracies of 70-80% using decision trees, while Burbidge et al. [2] have reported that support vector machines outperformed other machine learning techniques in predicting specific inhibitions. These studies highlight the promising applications of machine learning in enhancing SAR analyses and aiding in the discovery of new drugs.

Describing complex molecular compounds can involve a vast range of features and attributes, including topological indices and quantum mechanical descriptors. However, the high dimensionality of these features can pose challenges for many learning algorithms [6]. Therefore, in the initial stages of machine learning, a crucial step is to identify the most relevant features. Having excessive redundant features can complicate algorithmic decision-making, requiring more training data or leading to longer convergence times. An optimal approach often involves identifying essential variables and selecting subsets of features that improve algorithm efficiency. It has been observed that reducing the feature space can maintain accuracy while enhancing performance in various contexts, such as text categorization [7].

The objective of this study is to compare different techniques for feature selection in order to streamline high-dimensional feature spaces in drug discovery. The study specifically focuses on using the Logistic Regression method for classifying compounds.

Methods

Data Description: This research utilized the dataset mentioned in reference [8] for analysis. Each compound in the dataset is represented by a feature vector consisting of 105 features, along with a corresponding class label ("A" for active and "I" for inactive).
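To make the data layout concrete, the following is a minimal loading sketch; the file name, column names, and CSV format are our assumptions for illustration, not details taken from [8]:

```python
# A hedged loading sketch: the file name and column layout are hypothetical.
import pandas as pd

df = pd.read_csv("qsar_dataset.csv")             # hypothetical file name
X = df.drop(columns=["class"]).to_numpy()        # the 105 descriptor columns
y = (df["class"] == "A").astype(int).to_numpy()  # "A" (active) -> 1, "I" (inactive) -> 0
```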

Methods for Feature Selection: In this study, nine different techniques were evaluated, all with the objective of ranking features by their importance for model training and classification (a brief scikit-learn sketch illustrating several of them follows the SHAP formulas below).

  • Chi-square: In this study, a univariate filter approach was employed, utilizing the chi-square statistical test. This test measures the extent to which the distribution of a feature deviates from the expected distribution if it were independent of the class value (Jin et al., 2006). A higher chi-square value signifies greater relevance of the feature.
  • Mutual Information (Information Gain): Initially proposed by Quinlan (1986) and subsequently discussed by Hoque et al. (2014), this univariate feature selection approach is widely used in practice due to its computational efficiency. It assesses the reduction in entropy associated with individual features. However, it is considered a 'myopic' method, as it examines features independently without considering their relationships with other features.
  • ANOVA F-value: A univariate filter method that uses variance to evaluate the separability of features among different classes, which is particularly relevant for multi-class endpoints (Ding et al., 2014; Jafari and Azuaje, 2006).
  • Fisher Score: This filter ranks features by considering both their mean and variance (Duda et al., 2001). Features that are considered ideal show consistent values within the same class but exhibit variation across different classes. However, this approach has limitations in handling feature redundancy (Duda et al., 2012).
  • Recursive Feature Elimination (RFE): This technique iteratively removes the least significant features until the desired feature count is reached (Guyon, I., et al., 2002).
  • Permutation Feature Importance: This technique allows for the examination of any fitted model using tabular data (L. Breiman, 2001).
  • Random Forests: This ensemble technique involves using multiple decision trees during training for classification, regression, and feature selection purposes (Tin Kam Ho, 1995).
  • LightGBM: A tree-based gradient boosting framework (Guolin Ke et al., 2017).
  • SHAP (Shapley Additive Explanations): SHAP is a method that provides insights into the outputs of machine learning models by leveraging concepts from game theory to calculate each feature's contribution to a specific prediction. It distributes the "payout" of a prediction among the features, treating them as players in a coalition. SHAP can be applied to individual features or to groups of features, such as superpixels in an image. An important characteristic of SHAP is its use of additive feature attributions for explanations, which establishes a connection with LIME (Local Interpretable Model-Agnostic Explanations) and Shapley values. The SHAP model defines explanations as:

g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j \qquad (1)

where g is the explanation model, z' \in \{0, 1\}^M is the coalition vector, M is the maximum coalition size, and \phi_j is the feature attribution (the Shapley value) for feature j. In the SHAP paper, the coalition vector is referred to as the "simplified features," likely because, for data such as images, representations are consolidated into superpixels rather than kept at pixel granularity. It is helpful to think of the z''s as representations of coalitions: within the coalition vector, an entry of 1 indicates that the corresponding feature value is "present," whereas 0 indicates that it is "absent." This mirrors the computation of Shapley values, where scenarios are emulated in which only some feature values are present while others are absent. The representation as a linear model of coalitions is a trick for computing the \phi_j's. For x, the instance of interest, the coalition vector x' is a vector of all 1's, i.e., all feature values are "present," and the formula simplifies to:

g(x') = \phi_0 + \sum_{j=1}^{M} \phi_j \qquad (2)

This is the classical Shapley value decomposition written in similar notation; the \phi_j's themselves are estimated by the SHAP framework.
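To illustrate how such rankings can be produced in practice, the following is a minimal scikit-learn sketch covering several of the methods above; synthetic data stands in for the real descriptors, and all parameter choices here are ours, not the paper's:

```python
# A minimal sketch of several of the selectors above on synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, chi2, f_classif, mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=105, random_state=0)
X = X - X.min(axis=0)                  # chi-square requires non-negative inputs

chi2_scores, _ = chi2(X, y)            # univariate filter: chi-square
mi_scores = mutual_info_classif(X, y)  # univariate filter: mutual information
f_scores, _ = f_classif(X, y)          # univariate filter: ANOVA F-value

# Methods built around an unmodified logistic regression model.
lr = LogisticRegression(max_iter=1000)
rfe = RFE(lr, n_features_to_select=6).fit(X, y)
perm = permutation_importance(lr.fit(X, y), X, y, n_repeats=5, random_state=0)

def top6(scores):
    """Indices of the six highest-scoring features, best first."""
    return np.argsort(scores)[::-1][:6]

print(top6(chi2_scores), top6(mi_scores), top6(f_scores))
print(top6(perm.importances_mean))
print(np.where(rfe.support_)[0])       # the six features RFE retains
```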

Classifiers

Once the most discriminative features were identified, a classifier was trained to assess the effectiveness of the feature selection techniques.

Logistic regression: Although its name suggests a regression technique, logistic regression is primarily used as a linear classification method. It is also known in the literature as logit regression, maximum-entropy classification (MaxEnt), or log-linear classification. Rather than predicting numerical values, logistic regression models the probabilities of the possible outcomes using a logistic function [11]. As an optimization problem, binary-class L2-penalized logistic regression minimizes the following cost function:

\min_{w, c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right) \qquad (3)

Similarly, L1-regularized logistic regression solves the following optimization problem:

\min_{w, c} \|w\|_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right) \qquad (4)

During fitting, the logistic regression model takes input arrays X and y and stores the coefficients w of the linear model in its coef_ attribute.
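As a minimal illustration of the above in scikit-learn (the solver and regularization settings below are our choices for the sketch, not values taken from the paper):

```python
# Fitting the L2- and L1-penalized logistic regressions of Eqs. (3) and (4).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=105, random_state=0)

clf_l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
clf_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print(clf_l2.coef_.shape)   # the weights w are stored in the coef_ attribute
print(clf_l2.intercept_)    # the intercept c
```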

Accuracy: Accuracy is a commonly used metric for assessing the performance of classification models. It measures the proportion of correct predictions made by the model. Formally, accuracy is defined as:

\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \qquad (5)

For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)

where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
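Equation (6) corresponds directly to scikit-learn's accuracy_score; a tiny worked example:

```python
# 4 of 5 predictions match the labels, so accuracy = (TP + TN) / total = 0.8.
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy_score(y_true, y_pred))  # 0.8
```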

Results and discussion

We evaluated various feature selection techniques in combination with a specific machine learning algorithm using a given dataset. The experimental process consisted of the following steps:

  1. We applied each feature selection technique individually, which produced a ranking of features based on their corresponding scores.
  2. From these rankings, we selected the top-ranked 6% of features (six of the 105) for the final evaluation.
  3. The chosen algorithm was utilized to classify the data using an 80% training and 20% testing configuration, without making any adjustments to its parameters.
  4. The results were evaluated using the Accuracy metric, as reported in Table 1 (a sketch of this evaluation loop, under our assumptions, follows this list).
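The following sketch reconstructs this loop under our assumptions; chi-square stands in for an arbitrary selector and synthetic data for the real descriptors, and the published code at the GitHub link above is authoritative:

```python
# Rank features, keep the top 6% (six of 105), train/test LogisticRegression 80/20.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=105, random_state=0)
X = X - X.min(axis=0)                        # chi-square needs non-negative inputs

k = int(0.06 * X.shape[1])                   # top 6% of 105 features -> 6
scores, _ = chi2(X, y)
top = np.argsort(scores)[::-1][:k]           # steps 1-2: rank and select

X_tr, X_te, y_tr, y_te = train_test_split(   # step 3: 80/20 split
    X[:, top], y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # default parameters
print(accuracy_score(y_te, model.predict(X_te)))           # step 4: accuracy
```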

Table 1.

Accuracy Scores for Each Feature Selection Method

Method                          Accuracy  Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
Chi-square                      0.88      MAC3(Ap)   MAC4(ECI)  MAC4(Ap)   MAC5(Ap)   MAC6(Ap)   MAC7(Ap)
Mutual Information              0.76      MAC4(Ap)   MAC5(Mw)   MAC5(At)   MAC5(Ap)   MAC6(Mw)   MAC6(Ap)
ANOVA F-value                   0.80      MAC4(IP)   MAC4(Ap)   MAC5(IP)   MAC5(At)   MAC5(Ap)   MAC7(Ap)
Fisher Score                    0.64      MAC4(ECI)  MAC3(Pa)   MAC3(ECI)  MAC3(Vm)   MAC1(ECI)  MAC4(Pa)
Recursive Feature Elimination   0.92      MAC3(IP)   MAC4(Ap)   MAC5(Mw)   MAC5(IP)   MAC5(Ap)   MAC7(At)
Permutation Importance          0.84      MAC3(At)   MAC3(Ap)   MAC4(Ap)   MAC5(At)   MAC7(Mw)   MAC7(Vm)
Random Forest                   0.80      MAC3(Ap)   MAC4(Ap)   MAC5(Mw)   MAC5(At)   MAC6(Ap)   MAC7(At)
SHAP                            0.92      MAC5(Mw)   MAC4(Ap)   MAC4(Z1)   MAC3(Z2)   MAC5(IP)   MAC1(IP)
LightGBM                        0.72      MAC1(Mw)   MAC1(HP)   MAC1(IP)   MAC1(ECI)  MAC1(Vm)   MAC1(Anp)

 

In our study, we conducted the experiments in Python using the pandas, numpy, and scikit-learn libraries [9]. We first loaded the dataset and checked for missing values; none were found. We used the feature selection methods provided by scikit-learn and related libraries, including Chi-square, Mutual Information, ANOVA F-value, Fisher Score, Permutation Importance, Random Forest, and LightGBM, keeping default parameters throughout. For Recursive Feature Elimination, we employed the Logistic Regression algorithm from scikit-learn, treating it as a black box over normalized features; we chose this algorithm for its widespread adoption, speed, and satisfactory results. For the SHAP analysis, we used the shap package [10] in combination with the XGBoost algorithm. Notably, our findings indicate that Recursive Feature Elimination and SHAP consistently outperformed the other approaches across the evaluated tasks.
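The SHAP ranking step can be sketched as follows; this is a minimal reconstruction assuming the shap and xgboost packages with default settings, where the mean absolute SHAP value is one common way to turn per-sample attributions into a global feature ranking:

```python
# Global SHAP feature ranking from an XGBoost model on synthetic stand-in data.
import numpy as np
import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=105, random_state=0)

model = xgboost.XGBClassifier(n_estimators=50).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)        # one attribution per sample and feature
mean_abs = np.abs(shap_values).mean(axis=0)   # average impact per feature
print(np.argsort(mean_abs)[::-1][:6])         # six highest-impact features
```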

Conclusion

In our study, we examined nine different feature selection techniques. We evaluated the stability and performance of these methods using an environmental dataset and one of the top classification algorithms in the field.

Based on our findings, SHAP and Recursive Feature Elimination demonstrated superior performance compared to the other techniques, while also maintaining a commendable level of stability. SHAP emerged as the top performer, closely followed by Recursive Feature Elimination.

 

References:

  1. C. Corwin and I. D. Kuntz, "Database searching: past, present and future", in Designing Bioactive Molecules, American Chemical Society: Washington, 1 (1998).
  2. R. Burbidge, M. Trotter, B. Buxton and S. Holden, "Drug design by machine learning", Proceedings of the AISB'00 Symposium on Artificial Intelligence in Bioinformatics (2000).
  3. T. M. Frimurer, R. Bywater, L. Nærum, L. N. Lauritsen and S. Brunak, "Improving the odds in discriminating drug-like from non drug-like compounds", J. Chem. Inf. Comput. Sci. 40, 1315-1324 (2000).
  4. M. Wagener and V. J. van Geerestein, "Potential drugs and nondrugs: prediction and identification of important structural features", J. Chem. Inf. Comput. Sci. 40, 280-292 (2000).
  5. S. Eschrich, N. V. Chawla and L. O. Hall, "BIOKDD02: Workshop on Data Mining in Bioinformatics" (2002).
  6. Y. Yang and J. O. Pedersen, "International Conference on Machine Learning (ICML'97)" (1997).
  7. Y. Dabrowski and J. M. Deltorn, Machine learning application to drug design, http://www.inimngmachines.com
  8. A. K. Halder and M. Natália Dias Soeiro Cordeiro, "QSAR-Co-X: an open source toolkit for multitarget QSAR modelling", Journal of Cheminformatics 13, 29 (2021).
  9. scikit-learn: simple and efficient tools for predictive data analysis, https://scikit-learn.org/stable/index.html
  10. shap: a game-theoretic approach to explaining the output of machine learning models, https://github.com/shap/shap
  11. Logistic function definition, https://en.wikipedia.org/wiki/Logistic_function
Information about the authors

Ph.D., Senior Researcher V.I. Romanovsky Institute of Mathematics of the Academy of Sciences of the Republic of Uzbekistan, Republic of Uzbekistan, Tashkent


Junior Researcher, V.I. Romanovsky Institute of Mathematics of the Academy of Sciences of the Republic of Uzbekistan, Republic of Uzbekistan, Tashkent

