REGRESSION BASED ON DECISION TREE ALGORITHM

Eshankulov H., Malikov A.
To cite:
Eshankulov H., Malikov A. REGRESSION BASED ON DECISION TREE ALGORITHM // Universum: технические науки: electronic scientific journal. 2022. 6(99). URL: https://7universum.com/ru/tech/archive/item/14006 (accessed: 24.04.2024).
DOI - 10.32743/UniTech.2022.99.6.14006

 

ABSTRACT

A decision tree is a tree whose internal nodes can be taken as tests (on input data patterns) and whose leaf nodes can be taken as categories (of these patterns). These tests are filtered down through the tree to get the right output for the input pattern. Decision tree algorithms can be applied in many different fields: as a replacement for statistical procedures to find data, to extract text, to fill in missing data in a class, to improve search engines, and in various medical applications. Many decision tree algorithms have been formulated; they differ in accuracy and cost effectiveness, so it is important to know which algorithm is best to use. We also discuss the advantages and disadvantages of using regression methods to analyze the data.


 

Keywords: supervised learning, decision tree, regression analysis.


1. Introduction

Predicting the values of numeric or continuous attributes is known as regression in the statistical literature, and it is an active research area. Predicting real values is also an important topic for machine learning: many of the skills humans learn in real life, such as sporting abilities, are continuous. Dynamic control is one such problem studied in machine learning; for example, learning to catch a ball moving in three-dimensional space is studied in robotics. In such applications, machine learning algorithms are used to control robot motions, where the response to be predicted by the algorithm is a numeric, real-valued distance and direction. In this paper we review current regression techniques developed in machine learning and statistics. After describing the main motivation for the development of new techniques, we review the decision tree method in the next section.

2. Decision Tree Algorithm

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression (https://scikit-learn.org/stable/modules/tree.html). Decision trees are an extremely intuitive way to classify or label objects: you simply ask a series of questions designed to zero in on the classification.

Classification decision trees − In this kind of decision tree, the decision variable is categorical. The worked example below builds a classification decision tree.

Regression decision trees − In this kind of decision tree, the decision variable is continuous.

The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
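As a minimal illustration of this piecewise constant behaviour (a sketch on synthetic data, separate from the student dataset used below), a regression tree fitted to a noisy sine curve predicts a single constant value within each interval it learns:

# A sketch: a regression tree approximates a noisy sine curve
# with a piecewise constant function (synthetic data, illustration only)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)       # 80 points in [0, 5]
y = np.sin(X).ravel() + 0.1 * rng.randn(80)    # noisy sine targets

reg = DecisionTreeRegressor(max_depth=3)       # shallow tree => few constant pieces
reg.fit(X, y)

X_grid = np.arange(0.0, 5.0, 0.01).reshape(-1, 1)
y_pred = reg.predict(X_grid)                   # step-shaped: constant within each leaf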

For instance, in the example below, a decision tree learns from student records and decides which candidates should be offered a dormitory place. The deeper the tree, the more complex the decision rules and the fitter the model. First, the data is prepared for machine learning [1, p. 45].

# Import the necessary modules and libraries
# First, load the dataset
import pandas as pd

df = pd.read_excel('combine.xlsx')

# Determine the input variables and the target
inputs = pd.concat([df.Familiya, df.Ism, df.Otasining_ismi, df.Jinsi,
                    df.Mutaxassislik, df.Tulov_shakli, df.Yashash_manzili,
                    df.Nogironligi, df.Chin_yetimligi, df.Kurs], axis=1)

target = df.Ball

# Then encode the string variables into integers
inputs.Jinsi = inputs.Jinsi.map({'ayol': 1, 'erkak': 0})
inputs.Tulov_shakli = inputs.Tulov_shakli.map({'Tulov-shartnoma': 1, 'Davlat-granti': 0})

from sklearn.preprocessing import LabelEncoder

le_Yashash_manzili = LabelEncoder()
le_Nogironligi = LabelEncoder()
le_Chin_yetimligi = LabelEncoder()
le_Kurs = LabelEncoder()

inputs['Yashash_manzili_n'] = le_Yashash_manzili.fit_transform(inputs['Yashash_manzili'])
inputs['Nogironligi_n'] = le_Nogironligi.fit_transform(inputs['Nogironligi'])
inputs['Chin_yetimligi_n'] = le_Chin_yetimligi.fit_transform(inputs['Chin_yetimligi'])
inputs['Kurs_n'] = le_Kurs.fit_transform(inputs['Kurs'])

# And drop the columns that are no longer needed, keeping the already numeric
# Jinsi and Tulov_shakli together with the newly encoded columns
inputs_n = inputs.drop(['Yashash_manzili', 'Nogironligi', 'Chin_yetimligi', 'Kurs',
                        'Familiya', 'Ism', 'Otasining_ismi', 'Mutaxassislik'], axis='columns')

# Combine the prepared features and the target for inspection
hs = pd.concat([inputs_n, target], axis=1)

 

Preparation is done. Now the data is used to train a decision tree classifier:

from sklearn import tree

model = tree.DecisionTreeClassifier()

# This trains the model on the prepared features
model.fit(inputs_n, target)

# This shows the model score on the training data
model.score(inputs_n, target)

The model score is 0.91 on the training data, which is good enough.

Now we apply this model to another dataset to predict the target.

# Load another dataset
df2 = pd.read_excel('test.xlsx')

# Prepare the dataset; the columns must match the training features
inputs2 = pd.concat([df2.Jinsi, df2.Tulov_shakli, df2.Yashash_manzili_n, df2.Nogironligi_n,
                     df2.Chin_yetimligi_n, df2.Kurs_n], axis=1)

# And predict the result
p = model.predict(inputs2)
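Because the target Ball is a numeric score, the same prepared features can also be fed to a regression tree, which is the setting this article focuses on. The following minimal sketch (an illustrative variant, not a separately validated model) reuses the inputs_n, target and inputs2 variables defined above; the max_depth value is only an example.

# A sketch: train a regression tree on the same prepared features
from sklearn.tree import DecisionTreeRegressor

reg_model = DecisionTreeRegressor(max_depth=5)   # depth limit chosen for illustration
reg_model.fit(inputs_n, target)

# R^2 score on the training data and predictions for the new dataset
print(reg_model.score(inputs_n, target))
p_reg = reg_model.predict(inputs2)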

2.1. Important Terminology Related to Tree-Based Algorithms

Let’s look at the basic terminology used with Decision trees:

  1. Root Node: A root node is either the topmost or the bottom node in a tree data structure, depending on how the tree is drawn. The root is at the top if the visual representation is top-down and at the bottom if it is bottom-up; the analogy is that the tree starts at its roots and grows to its crown, so the first node is considered the root.
  2. Splitting: Describes the process of dividing a node into two or more sub-nodes. There exist several methods to split a decision tree, involving different metrics (e.g. Information Gain, Gini Impurity).
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf/Terminal Node: If a sub-node cannot be divided any further, we call that node a leaf node. The leaf node represents the response value (e.g. the most common class label) used for the prediction.
  5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.
  6. Branch/Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
  7. Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node [2, p. 10].
  8. Information Gain: Information gain measures how much information a feature provides, or in other words, how much entropy is removed. The information gain of a split is calculated by subtracting the weighted entropies of the children from the parent's entropy, which makes it especially useful for evaluating a candidate split and choosing the optimal one. An information gain of 1 would be the best possible value, whereas a value of 0 means no uncertainty (entropy) has been removed; a small computation is sketched after this list.
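To make the last item concrete, the following sketch (hypothetical helper functions written for illustration, not library code) computes the entropy of a parent node and the information gain of a candidate binary split:

# A sketch: entropy and information gain for a binary split
import numpy as np

def entropy(labels):
    # Shannon entropy of a list of class labels (in bits)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Parent entropy minus the weighted entropy of the two children
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 0, 0, 0, 0, 0]            # 3 positives, 5 negatives
left, right = [1, 1, 1, 0], [0, 0, 0, 0]     # a candidate split
print(information_gain(parent, left, right)) # > 0: the split removes entropy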

These are the terms commonly used for decision trees. As we know, every algorithm has advantages and disadvantages; below are the important factors one should know [3, p. 58].

2.2. Advantages

  1. Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret, and its graphical representation is very intuitive, so users can easily relate it to their hypotheses.
  2. Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. It can also be used in the data exploration stage: for example, when we are working on a problem where information is available in hundreds of variables, a decision tree helps to identify the most significant ones.
  3. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques and is fairly robust to outliers and missing values.
  4. Data type is not a constraint: It can handle both numerical and categorical variables.
  5. Non-parametric method: A decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the space distribution and the classifier structure.
  6. The tree structure can be easily understood and interpreted by domain experts with little statistical knowledge, since it is essentially a logical decision flow diagram.
  7. The tree structure can handle both categorical and numerical features in a natural and straightforward way. Specifically, there is no need to pre-process categorical features, say via the introduction of dummy variables.
  8. The final tree obtained after the training phase can be compactly stored for the purpose of making predictions for new feature vectors. The prediction process only involves a single tree traversal from the root to a leaf.
  9. In the classification setting, it is common to report not only the predicted class of a feature vector but also the respective class probabilities. Decision trees handle this task without any additional effort: for a new feature vector we perform a tree traversal, the point ends up in a certain leaf w, and the probability of this feature vector lying in class z can be estimated as the proportion of training points in w that are in class z (see the sketch after this list).
  10. As each training point is treated equally in the construction of a tree, the structure of the tree is relatively robust to outliers. In a way, trees exhibit a similar kind of robustness as the sample median does for real-valued data.
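As an illustration of point 9, scikit-learn classification trees expose these leaf proportions through predict_proba. A minimal sketch, assuming the fitted model and the inputs2 frame from the worked example above:

# Class probabilities from the leaf each sample ends up in
proba = model.predict_proba(inputs2)  # one row per sample, one column per class
print(model.classes_)                 # the class labels the columns refer to
print(proba[:5])                      # probabilities for the first five samples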

2.3. Disadvantages

1. Overfitting: Overfitting is one of the most practical difficulties for decision tree models. This problem can be addressed by setting constraints on the model parameters and by pruning, as sketched after this list.

2. Decision trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation.

3. More generally, the main issue with decision trees is that they are very sensitive to small variations in the training data.

4. Not fit for continuous variables: while working with continuous numerical variables, a decision tree loses information when it categorizes the variables into different categories [4, p. 188].
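A minimal sketch of the two standard remedies for point 1, pre-pruning constraints and cost-complexity pruning, using scikit-learn parameters (the specific values are illustrative only):

# A sketch: two ways to fight overfitting in scikit-learn trees
from sklearn.tree import DecisionTreeClassifier

# 1) Constrain the tree while it grows (pre-pruning)
constrained = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10)

# 2) Cost-complexity post-pruning: larger ccp_alpha removes more nodes
pruned = DecisionTreeClassifier(ccp_alpha=0.01)

# Both are fitted in the usual way, e.g. constrained.fit(inputs_n, target)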

3. Regression Analysis

Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes in response to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining cause-and-effect relationships between variables.

Some examples of regression can be as:

  • Prediction of rain using temperature and other factors
  • Determining Market trends
  • Prediction of road accidents due to rash driving.

3.1. Why do we use Regression Analysis?

As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios in which we need future predictions, such as weather conditions, sales, or marketing trends, and for such cases we need a technique that makes accurate predictions. Regression analysis is such a technique: a statistical method used in machine learning and data science. Below are some other reasons for using regression analysis:

Regression estimates the relationship between the target and the independent variable.

It is used to find the trends in data.

It helps to predict real/continuous values.

By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.

3.2. Types of Regression

There are various types of regression used in data science and machine learning. Each type has its own importance in different scenarios, but at the core all regression methods analyze the effect of the independent variables on the dependent variable. Some important types of regression are listed below; a short sketch after the list shows that several of them share a common interface in scikit-learn.

Linear Regression

Logistic Regression

Polynomial Regression

Support Vector Regression

Decision Tree Regression

Random Forest Regression

Ridge Regression

Lasso Regression
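As a minimal sketch (on synthetic data, for illustration only), several of the listed regression types can be fitted and compared with the same fit/predict interface in scikit-learn:

# A sketch: several regression types share the same interface
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                                 # 100 samples, 3 features
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.randn(100)     # synthetic linear target with noise

for est in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.01),
            DecisionTreeRegressor(max_depth=3)):
    est.fit(X, y)
    print(type(est).__name__, round(est.score(X, y), 3))  # R^2 on the training data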

3.3. Impurity and Its Measurement

Impurity measures how heterogeneous our data is. It is calculated with two measures: the Gini index and entropy. The formulas for both are shown below.

1. Gini index: G = 1 − Σ pᵢ²

2. Entropy: H = −Σ pᵢ log₂(pᵢ)

Here pᵢ is the probability of class i. For example, suppose the data is distributed as [3, 5, 4, 3, 6, 3, 5, 5, 5]. The probability of 5 is p(5) = 4/9 and the probability of 3 is p(3) = 3/9.

If the entropy equals 0, the data in a node is completely pure: all samples belong to the same class. If the entropy equals 1 (for a two-class problem), the node is maximally impure, with the classes evenly mixed; at this point the Gini index reaches its maximum value of 0.5.
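A minimal sketch that computes both measures for the example distribution above (plain NumPy, written for illustration):

# A sketch: Gini index and entropy for the example distribution above
import numpy as np

data = [3, 5, 4, 3, 6, 3, 5, 5, 5]
_, counts = np.unique(data, return_counts=True)
p = counts / counts.sum()          # p(3)=3/9, p(4)=1/9, p(5)=4/9, p(6)=1/9

gini = 1 - np.sum(p ** 2)          # Gini index: 1 minus the sum of squared probabilities
entropy = -np.sum(p * np.log2(p))  # Shannon entropy in bits
print(round(gini, 3), round(entropy, 3))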

So should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.[4,p.183]

4. Summary and Conclusion

In this article, we have discussed the decision tree algorithm in depth. It is a supervised learning algorithm that can be used for both classification and regression. The primary goal of a decision tree is to split the dataset as a tree based on a set of rules and conditions. Lastly, we discussed the advantages and disadvantages of using decision trees. There is still a lot more to learn, and this article gives you a quick start for exploring other regression and classification algorithms.

 

References:

  1. AbouEisha H., Amin T., Chikalov I., Hussain S., Moshkov M. Extensions of Dynamic Programming for Combinatorial Optimization and Data Mining. Intelligent Systems Reference Library, vol. 146. Springer, 2019.
  2. Tu P.-L., Chung J.-Y. A New Decision-Tree Classification Algorithm for Machine Learning // Proc. of the 1992 IEEE Int. Conf. on Tools with AI, Arlington, Nov. 1992.
  3. Mitchell T. Machine Learning. McGraw Hill, 1997.
  4. Géron A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly, 2019.
Information about the authors

Associate Professor, Bukhara State University, Republic of Uzbekistan, Bukhara


Master's student, Bukhara State University, Republic of Uzbekistan, Bukhara

