DETERMINATION OF INFORMATIVE FEATURES USING THE METHOD OF DIVISION INTO INTERVALS BASED ON THE COMPACTITY HYPOTHESIS

Shodiev F.Yu. Davronova M.I.

27.03.2024

3(120)

10. Computer science, computer engineering and control

Cite as:

Shodiev F.Yu., Davronova M.I. DETERMINATION OF INFORMATIVE FEATURES USING THE METHOD OF DIVISION INTO INTERVALS BASED ON THE COMPACTITY HYPOTHESIS // Universum: технические науки: electronic scientific journal. 2024. 3(120). URL: https://7universum.com/ru/tech/archive/item/17028 (accessed: 05.05.2024).

DOI - 10.32743/UniTech.2024.120.3.17028

ABSTRACT

The sample contains information about 295 soft wheat varieties (objects) obtained from the experimental fields of the Southern Agricultural Research Institute, together with the values of their features.

Using the method of partitioning feature values into intervals based on the compactness hypothesis, the optimal interval boundaries were found. On this basis, the weights of the features in the sample were calculated and the informative features were determined. In addition, the feature values were smoothed and latent features were constructed.

Keywords: sample, feature, informative feature, latent feature, feature weight, quantitative feature, division into intervals, stochastic methods, deterministic methods, standardization.

Introduction. For a sample of data on wheat varieties, the article solves the problem of partitioning feature values into optimal intervals using the interval-partitioning method based on the compactness hypothesis. Drought-resistant varieties are identified by calculating the weights of features (parameters) from the optimal intervals. One of the main goals of the research is to develop a new approach to determining feature weights in the selection of wheat varieties using the method of division into compactness intervals. Adoption of this approach will help bring about positive changes in the seed industry.

Feature weights are used for the following purposes:

to calculate the proximity measure between objects;
to select and sort informative features;
in the search for patterns to model the intuitive decision-making process;
in order to reduce the space of features in the calculation of generalized values (latent features) [1].

Materials and methods. Feature-weighting methods are aimed at solving supervised and unsupervised learning problems. It is known that there is no universal classification method; therefore, constrained and unconstrained optimization algorithms are used in the calculations. It should be noted that the terms "feature weight" and "feature contribution" are not strictly distinguished in meaning. The criteria used to calculate the weight and contribution of features are based on verifying the truth of the compactness hypothesis [2].

Weights of quantitative features. Let the values of a quantitative feature be sorted into the sequence

( 1 )

and let a set of integers be given that records, for each range of sequence numbers in (1), the number of feature values in the descriptions of objects that fall in that range.

By means of the following criterion, all values of a quantitative feature in the descriptions of objects, together with their sequence numbers and ranges in (1), are reduced to gradations equivalent to a nominal measurement scale:

(2)

The maximum value of this criterion is taken as the weight of the quantitative feature over the set of values in its range [3].

Dividing the values of quantitative features into intervals is widely used when building models aimed at extracting new knowledge (hidden regularities) from databases in poorly formalized subject areas. Stochastic and deterministic methods are used for dividing into intervals.

Stochastic methods are usually used in the initial analysis of data. From the point of view of division into intervals, measurements on quantitative scales fall into two cases:

the objects of the sample are not divided into classes;
the objects of the sample are divided into classes.

Traditional methods for the first case include histograms and decile and percentile distributions. In these methods, the range of values of the feature under consideration is divided into k intervals. For decile and percentile distributions the number of intervals is k = 10 and k = 100, respectively, which corresponds to 9 and 99 interior boundaries.
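The class-free partitioning described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `equal_width_edges` and the sample values are assumptions made for the example and do not come from the article.

```python
from statistics import quantiles


def equal_width_edges(values, k):
    """Return the k + 1 edges of k equal-width intervals (histogram bins)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k + 1)]


# Hypothetical feature values, used only to show the mechanics.
values = list(range(1, 101))

# Equal-frequency partitioning: quantiles() returns the interior cut points.
decile_cuts = quantiles(values, n=10)       # 9 cut points, 10 intervals
percentile_cuts = quantiles(values, n=100)  # 99 cut points, 100 intervals

print(len(decile_cuts), len(percentile_cuts))
print(equal_width_edges([0, 10], 5))
```

Equal-width bins follow the range of the values, while quantile cuts place roughly the same number of objects in each interval; neither uses class labels.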

When the objects are divided into classes, partitioning can be carried out by the method developed by V. Vapnik, which is based on the distribution law and the number of intervals. This method is heuristic: when dividing into intervals, the entropy of the class membership of objects is taken into account [4].

Two methods are known for partitioning the values of quantitative features into non-intersecting intervals based on deterministic criteria [5]. The algorithms of these methods are invariant to measurement scales and are used in the following cases:

in the search for latent features in a database when modeling the intuitive decision-making process;
to ensure that the information lost when nominal features are formed from quantitative features is minimal;
to select informative sets of features of different types.

Interpretation of the criteria. Consider a set of admissible objects divided into two disjoint classes. Each object is described by n features of different types, some measured on a quantitative scale and the rest on a nominal scale. Let there be an operator that maps the original features to a set of quantitative features, whose elements may also include latent features. Examples of latent features are products and ratios of features and their combinations, as well as generalized indicators derived from quantitative and nominal features [6].

Let there be two criteria for dividing the values of the features taken from a subset of the sample into non-intersecting intervals. The first criterion is based on the condition that the number of intervals equals the number of classes; in the case considered here, this number is 2.

For each feature, partitioning by this criterion proceeds as follows. The ordered set of feature values is divided into two intervals, [c0, c1] and (c1, c2]. The boundary of the intervals is calculated under the hypothesis that each interval contains the feature values of objects from only one class [7].

Suppose that, for each class and each interval, the number of feature values belonging to that class and falling in that interval is known, where the feature values taken from the sample are sorted in ascending order and the interval limit is defined by a position in this sorted sequence.

The following criterion can be used to calculate the optimal value of the limit of the interval and use its value as an indicator of the compactness of the quantitative feature when dividing objects of the set into classes:

(3)

If each of the two intervals contains the feature values of objects from only one class, then the value of criterion (3) is equal to 1 (one).

In the opposite limiting case the value of criterion (3) is equal to 0 (zero). In all other cases, the value of the criterion lies in the interval (0, 1) [8].
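The behaviour described above (a value of 1 for a perfect split, values between 0 and 1 otherwise) can be illustrated with a small sketch. The exact form of criterion (3) is not reproduced in the text, so the code assumes one form with these properties: the product of an intra-class similarity term and an inter-class difference term computed from the per-class, per-interval counts. The function names `compactness` and `best_boundary`, the formula itself, and the sample data are assumptions of this example, not the authors' implementation.

```python
def compactness(values, labels, boundary):
    """Score the split [c0, boundary], (boundary, c2] for a two-class sample.

    ASSUMPTION: the score is the product of an intra-class similarity term
    and an inter-class difference term; this form is illustrative only.
    """
    labels = list(labels)
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are expected")
    first, second = classes
    size = {c: labels.count(c) for c in classes}
    # u[c][p] = number of values of class c in interval p (0: left, 1: right)
    u = {c: [0, 0] for c in classes}
    for value, label in zip(values, labels):
        u[label][0 if value <= boundary else 1] += 1
    intra_total = sum(size[c] * (size[c] - 1) for c in classes)
    if intra_total == 0:
        return 0.0  # one object per class: no intra-class pairs exist
    intra = sum(u[c][p] * (u[c][p] - 1) for c in classes for p in (0, 1))
    inter = sum(
        u[c][p] * (size[other] - u[other][p])
        for c, other in ((first, second), (second, first))
        for p in (0, 1)
    )
    return (intra / intra_total) * (inter / (2 * size[first] * size[second]))


def best_boundary(values, labels):
    """Return (score, boundary) maximizing the score over observed values."""
    candidates = sorted(set(values))[:-1]  # keeps both intervals non-empty
    if not candidates:
        raise ValueError("at least two distinct values are required")
    return max((compactness(values, labels, c), c) for c in candidates)


# Hypothetical, perfectly separable sample.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [1, 1, 1, 2, 2, 2]
score, boundary = best_boundary(values, labels)
print(score, boundary)  # 1.0 3.0
```

Under this assumed form, the maximal score over all candidate boundaries plays the role of the feature weight, and the maximizing boundary plays the role of c1.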

Results and discussion. The sample includes 295 soft wheat varieties and the values of their (quantitative) features from SARI (Southern Agricultural Research Institute). The objects in the sample are divided into two classes according to the recommendations of experts: varieties resistant to drought (class 1, 15 objects) and varieties not resistant to drought (class 2, 280 objects).

Table 1 below presents the interval boundaries and feature weights obtained by smoothing the feature values of the wheat varieties and dividing them into compactness intervals based on criteria (2) and (3).

Calculating the weights of the features in the sample directly (without smoothing) leads to considerable losses, because the numerical values in the feature columns differ sharply in scale. For example, the values of the “The nature of the grain” feature lie in the range [669.46; 838.1].

In order to improve the quality of the results, we smooth each quantitative feature column by standardization.
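Standardization of a feature column can be sketched as follows. The article does not state which variant was used, so this example assumes the common z-score (subtract the mean, divide by the population standard deviation); the function name `standardize` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores of a column: (x - mean) / population std deviation.

    ASSUMPTION: z-score scaling with the population standard deviation;
    the article only says that columns are standardized.
    """
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no spread
    return [(x - mu) / sigma for x in column]


# Hypothetical values inside the range [669.46; 838.1] quoted in the text.
grain_nature = [669.46, 700.0, 750.0, 800.0, 838.1]
z = standardize(grain_nature)
print([round(v, 3) for v in z])
```

After this transformation every column has zero mean and unit spread, so features measured in hundreds of units and features measured in fractions become comparable.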

Table 1.

Interval boundaries and feature weights after smoothing the sample

№	Features	C0	C1	C2	Feature weight
1	1000 grain weight	-2.5615	1.1405	2.3561	0.643057
2	Productivity	-3.6627	1.1193	4.6424	0.600568
3	The nature of the grain	-4.6941	0.52603	1.7866	0.405435
4	Plant height	-3.042	0.43686	2.9022	0.334819
5	Protein content	-2.1688	0.51598	2.3836	0.312253
6	Spike length	-2.2199	-0.93573	3.6874	0.298118
7	Amount of gluten	-2.2876	0.97734	2.6098	0.282956
8	The length of the last internode	-3.0479	0.36968	3.86	0.280146
9	The number of spikes	-2.8483	0.47097	2.8419	0.271753
10	Vegetation period	-2.9167	0.20305	2.0749	0.267084
11	IDK	-6.138	-0.22963	1.2866	0.264227
12	Grain vitreousness	-1.2678	-0.09024	3.3513	0.253981
13	Grain moisture	-1.6489	-0.4849	2.8778	0.251748

Features with a weight of 0.4 and above in Table 1 can be taken as informative, because they contribute the most to distinguishing drought-resistant wheat varieties.
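The threshold rule can be written as a short filter. The weights below are copied from the first five rows of Table 1; the function name `informative_features` is an assumption of this example.

```python
# Weights copied from the first five rows of Table 1.
table1_weights = {
    "1000 grain weight": 0.643057,
    "Productivity": 0.600568,
    "The nature of the grain": 0.405435,
    "Plant height": 0.334819,
    "Protein content": 0.312253,
}


def informative_features(weights, threshold=0.4):
    """Return feature names with weight >= threshold, heaviest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, weight in ranked if weight >= threshold]


selected = informative_features(table1_weights)
print(selected)
```

With the 0.4 threshold, three features pass: 1000 grain weight, Productivity, and The nature of the grain.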

When evaluating drought-resistant varieties, it is necessary to consider combinations of features rather than each feature in isolation. For this purpose, better results can be achieved if the weights are calculated for latent features built from the original ones.
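One way to build such combinations is to form pairwise products and ratios of feature columns and then repeat the step on the result. This is a minimal sketch under that assumption; the function name `latent_features`, the naming scheme for the new columns, and the sample values are hypothetical and are not taken from the authors' software.

```python
from itertools import combinations


def latent_features(columns):
    """Build pairwise products and ratios of feature columns.

    columns: dict mapping a feature name to a list of values.
    A ratio is skipped when the denominator column contains a zero.
    """
    result = {}
    for a, b in combinations(columns, 2):
        xa, xb = columns[a], columns[b]
        result[f"({a}*{b})"] = [p * q for p, q in zip(xa, xb)]
        if all(q != 0 for q in xb):
            result[f"({a}/{b})"] = [p / q for p, q in zip(xa, xb)]
    return result


# Hypothetical standardized values for three features of four varieties.
base = {
    "Spike length": [0.5, -1.0, 1.5, 2.0],
    "Grain moisture": [1.0, 2.0, -0.5, 4.0],
    "Protein content": [0.0, 1.0, 2.0, -1.0],
}
first_level = latent_features(base)          # e.g. (Spike length/Grain moisture)
second_level = latent_features(first_level)  # combinations of combinations
print(len(first_level), len(second_level))
```

Each generated column can then be scored with the same interval-based weight as an ordinary feature, and only the highest-weight combinations are kept.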

Table 2.

Weights of the latent features of the sample

№	Features	Feature weight
1	(Productivity*1000 grain weight)	0.693828
2	((The nature of the grain*Protein content)*(Spike length/Grain moisture))	0.462482
3	((Protein content*IDK)*(Spike length*Number of spikes, units))	0.459401
4	((Number of spikes, grain/Grain moisture)*(Protein content*IDK))	0.427136
5	((Number of spikes/Grain moisture)*(Spike length*Number of spikes))	0.417391
6	((Spike length/Grain moisture)*(Spike length*Number of spikes))	0.410489
7	((Number of spikes/Grain moisture)*(Vegetation period/Grain moisture))	0.397817
8	((Protein content/Grain moisture)*(Spike length*Protein content))	0.393862
9	((Spike length/Grain moisture)*(The length of the last internode*Protein content))	0.391693
10	((Number of spikes*Protein content)*(IDK/Grain vitreousness))	0.391693
11	((Number of spikelets/Grain vitreousness)*(Protein content*IDK))	0.391693
12	((Vegetation period*Number of spikes)*(Spike length*Protein content))	0.386234
13	((Spike length/Grain moisture)*(Number of spikes*Protein content))	0.384879
14	((Number of spikes/Grain moisture)*(Spike length*Protein content))	0.384879
15	((Protein content/Grain moisture)*(Spike length*Number of spikes))	0.384879
16	((Protein content*IDK)*(Vegetation period/Grain moisture))	0.374164
17	((The length of the last internode*Protein content)*(Spike length/Amount of gluten))	0.372762
18	((Vegetation period/Grain moisture)*(Spike length*Number of spikes))	0.372694
19	((Number of spikes/Grain vitreousness)*(Spike length*Number of spikes))	0.371363
20	((Number of spikes/Grain moisture)*(Number of spikes*Protein content))	0.370322
21	(Plant height*Protein content)	0.336662
22	(Number of spikes*Amount of gluten)	0.312379

Table 2 shows that, after the latent transformation has been applied twice, the weights of the newly formed latent features (based on hidden regularities) increase: the largest weight rises from 0.643057 in Table 1 to 0.693828, and six latent features have a weight of 0.4 or above. This means that new informative features have been formed, which make a more significant contribution to the assessment of varieties.

In conclusion, this article calculated the weights of the features in the sample by dividing the feature values of wheat varieties into compactness intervals.

The results show that the informative features of drought-resistant wheat varieties largely coincide with the features recognized by experts [9].

The identified informative features not only confirm the opinion of experts in the field, but also point to features that experts have overlooked and that merit attention.

References:

1. Madraximov S. F., Saidov D. Y. Stability of the objects of classes and grouping the features //Проблемы вычислительной и прикладной математики. – 2016. – №. 3. – С. 50-54.
2. Shodiyev F. Intellectual system based on the determination of hidden legality //Central Asian journal of education and computer sciences (CAJECS). – 2022. – Т. 1. – №. 5. – С. 11-16.
3. Ignatyev N. A., Madrakhimov S. F., Saidov D. Y. Stability of object classes and selection of the latent features //International journal of engineering technology and sciences. – 2017. – Т. 4. – №. 1. – С. 61-71.
4. Вапник В.Н. Алгоритмы и программы восстановления зависимостей. – М.: Наука, 1984. – 816 с.
5. Згуральская Е.Н. Алгоритм выбора оптимальных границ интервалов разбиения значений признаков при классификации // Известия Самарского научного центра Российской академии наук. Т.14, №4 (3), 2012. – С.826-829.
6. Шодиев Ф. Ю., Эшбоев Э. А., Эгамбердиев Э. Х. Использование обобщенных оценок для прогнозирования устойчивости сортов пшеницы к болезням //Азиатский журнал многомерных исследований. – 2021. – Т. 10. – No 4. – С. 602-610.
7. Игнатьев Н.А. Вычисление обобщённых показателей и интеллектуальный анализ данных // Автоматика и телемеханика. – 2011. –№ 5. – С.183-190.
8. Шодиев Ф., Эшбоев Е., Суярова А. Прогнозирование устойчивости к болезням высококачественных сортов пшеницы с использованием метода расчета обобщенных оценок //E3S Web of Conferences. – EDP Sciences, 2023. – Т. 401. – С. 04063.
9. Sharma S.N, Sain R.S, Sharma R.K. Genetics of spike length in durum wheat. Euphytica 130: 2003. –PP. 155-161.