COMBINATORIAL METHODS OF RANKING DATA

КОМБИНАТОРНЫЕ МЕТОДЫ РАНЖИРОВАНИЯ ДАННЫХ
Nuridinov Z. Kartbayev A.Zh.
Цитировать:
Nuridinov Z., Kartbayev A.Zh. COMBINATORIAL METHODS OF RANKING DATA // Universum: технические науки : электрон. научн. журн. 2025. 6(135). URL: https://7universum.com/ru/tech/archive/item/20266 (дата обращения: 05.12.2025).
Прочитать статью:

 

ABSTRACT

We search different methods for ranking data, which is often incomplete and unbalanced features shared by contemporary datasets from web applications, e-commerce, and healthcare management systems. Our techniques provide valuable insights not only for ordinal data but also for cardinal data derived from ratings or assessments. Starting with raw ranking data, we build pairwise rankings, represented as edge flows on a suitable graph. One benefit of this graph structure is that it helps our statistical ranking process. [1]

АННОТАЦИЯ

Мы рассматриваем различные методы для ранжирования данных, которые часто неполны и несбалансированы — черты, общие для современных наборов данных из веб-приложений, электронной коммерции и систем управления здравоохранением. Наши методы дают ценную информацию не только для порядковых данных, но и для кардинальных данных, полученных из рейтингов или оценок. Начиная с необработанных данных ранжирования, мы строим парные ранжирования, представленные в виде потоков ребер на подходящем графе. Одним из преимуществ этой структуры графа является то, что она помогает нашему статистическому процессу ранжирования. [1]

 

Keywords: ranking, combinatorial ranking, graph, pairways ranking, Hodgerank

Ключевые слова: ранжирование, комбинаторное ранжирование, графы, парное ранжирование, Hodgerank

 

Introduction

The problem of ranking in various situations has gained importance in machine learning. Numerous datasets must be rated in order to identify significant items, extract key properties, and perform efficient search and sorting operations. Large-scale datasets have been generated by the rapid growth of modern e-commerce and internet applications. CiteSeer’s citation database [1], eBay’s feedback-reputation system [2], Google’s search engine [1], and Netflix’s movie recommenda- tion system [3] are a few examples of these applications.

Traditional ranking techniques, such as those employed in social choice theory, are frequently insufficient or harmful due to one or more characteristics present in certain con- temporary datasets: In general, the data is: (1) unbalanced, with significant variations in the amount of information avail- able between entries and/or criteria; (2) highly incomplete, with many entries lacking important information; (3) cardinal scores instead of ordinal rankings, in contrast to traditional ranking issues like voting and tournaments; and (4) frequently, either explicitly or implicitly, the data resides on a large, complex network, the structure of which is crucial in the ranking process. We’ll look at a strategy that takes into account the difficulties these unusual qualities provide and how creative thinking is required.

Various sorts of ranking are necessary for the examination of many large-scale modern datasets in order to determine which entries or features are most relevant. Many examples are found in various disciplines, including internet-related applications like the Mechanical Turk system on Amazon [1], the recommendation system on Netflix [3], eBay’s feedback- based reputation mechanism [2], and Google’s search engine [1].

Identification of disease-causing variables and prediction are major areas of research in the medical field. Medical professionals offer evidence-based therapy recommendations based on these predictions [4]. Health organisations classify diseases like diabetes, dengue, and cancer using data mining on a large scale [5], [6]. Various machine learning and data mining algorithms are essential for assessing and forecasting illness severity in the field of bioinformatics.

Materials and methods

Determining a global rating from a dataset that includes various alternatives selected by multiple voters is the main issue this paper discusses. Decision science [2], [7], financial economics [3], machine learning [8], social choice [9], statis- tics [4], and other fields have all expressed interest in this problem. In statistical ranking, we have two goals in mind: first, similar to others, we want to infer a global ranking from the data whenever it is feasible. We also wish to examine the barriers to ”global ranking” and identify instances in which the data preclude a statistically significant global rating.

Even with quite big sets, humans frequently struggle to accurately determine preferences. According to some research, most people can only rank five to nine options at a time [1]. This constraint probably explains why a lot of rating systems, such the ones used by YouTube, Netflix, Amazon, and eBay, are based on a 5-star system. As a result, one anticipates that sizable amounts of human-generated ranking data will be, at most, partially sorted, with chains ranging in length from five to nine [10]. Comparing two items at once is typically easier for people to do than ranking a big set at once. Pairwise comparisons are the only conceivable comparisons in some situations, like wine tastings and tennis championships. Comparing two items at once is typically easier for people to do than ranking a big set at once. Pairwise comparisons are the only conceivable comparisons in some situations, like wine tastings and tennis championships. Therefore, it makes sense to analyse ranking data using pairwise comparison methods, which entail the smallest partial rankings.

By using a relative measure, pairwise comparisons can aid in reducing bias brought on by the arbitrary nature of rating scales. They offer a method for dealing with missing values, which are frequently encountered because people typically lack the motivation or patience to examine big datasets. Pair- wise comparison techniques have gained popularity in social choice theory, statistics, psychology, and management science for these reasons [10]. The machine learning community is also becoming interested in these techniques since they can be modified to investigate categorization issues [9]. We shall discuss two quite distinct scenarios in which pairwise rankings occur: exchange economic systems and recommendation systems.

Typically, in recommendation systems, m voters rank n options. While Netflix viewers rate films on a 5-star rating, analysts in the financial markets provide stocks or other assets a grade based on five classes of recommendations. Cite Cremonesi (2010). These datasets typically have a sizable amount of missing data. For instance, 99% of the variables in the viewer-movie matrix in the dataset Netflix provided for their reward competition were missing. The typical challenge here is to predict these missing values from the available data; this is not covered in our paper. Rather than guessing at the missing values, our aim is to create a global rating of the alternatives based only on the data that is currently available. Remarkably, when pairwise rankings are included, only 0.22% of the pairwise comparison values from the Netflix dataset are missing.

Additionally included is a background research on data mining methods for predicting dengue fever. In a prospective research, Vietnam [11] studied 85 individuals with ”Other Febrile Illness” (OFI) and 111 patients with confirmed dengue. Based on clinical results, patients were categorised into Level 1, 2, or 3. Every day before and during defervescence, as well as after discharge, blood samples were taken from each patient to investigate the relationship between the plasma levels of chymase and dengue-specific immunoglobulin E (IGE) and the severity of dengue.

In a different research, several data mining approaches were combined with three feature selection algorithms—Particle Swarm Optimisation (PSO), Genetic Algorithm (GA), and Rank Search (RS)—to predict and forecast dengue outbreaks. Three predictive modelling strategies (J48, DTNB, and Naive Bayes) were applied for dengue outbreak detection using a dataset from the Public Health Department, Seremban, Negeri Sembilan, Malaysia [12]. Moreover, [8] provided a machine learning model for dengue prediction. A common viral infection, dengue affects over one-third of the world’s population, including several Indian cities. The input was reported from microscope pictures, and the signals were filtered and eliminated before being fed into neural networks. Using the Back Propagation Network (BPN), 98% accuracy in categorization was attained.

Although the present raw ranking data  may not be entirely full, it may be aggregated over all voters to generate a pairwise ranking matrix Y, which is often significantly more comprehensive. The following four statistics can be viewed as a kind of ”average pairwise ranking” over all voters.

Logarithmic odds ratio: similar to a binary comparison, with the exception that we use a logarithmic scale:

                                                   (1.1)

Binary Comparision: Yα = sign(aαj−aαi) The average represents the likelihood that option j will be chosen over option i rather than the other way around:

                                (1.2)

Arithmetic average of score variations: Yα = aαj−aαi is the score difference. Every client who has rated both I and j has the following mathematical average:

                                                        (1.3)

98% accuracy in categorization was attained.

Result and discussion

In mathematics and science, Hodge and Helmholtz decompositions are widely recognised, usually in a continuous setting where the underlying spaces have the structure of an algebraic variety or a Riemannian manifold. On the other hand, the combinatorial Hodge theory that we have introduced functions on the most basic underlying space that can exist: a graph. As a result, our situation does not present many of the difficulties involved in advancing Hodge theory in differential and algebraic geometry. But accessibility is also improved by this simplicity. We mainly rely on multivariate calculus and elementary matrix theory in our combinatorial Hodge theory method. Since we are not aware of any comparable methods in the body of literature now in publication, we view our simple method as a small explanatory contribution. The application of the graph Helmholtzian and Hodge decomposition in other domains, such data analysis and machine learning, may become more common thanks to this discovery.

Conclusion

The Netflix reward dataset consists of over 17,000 films that 480,000 people assessed over a 74-month period, from November 1999 to December 2005. On average, each customer rated 209 films, with 99% of the ratings not being reflected in the customer-product matrix. We have no plans to take on the Netflix reward’s ratings prediction challenge. Rather, we assess the rank aggregation performance of our technique using this special publically available dataset. Our goal is to create a global movie ranking using the audience reviews and assess how accurate that global ranking is. It is noteworthy that viewing groups with similar interests can be found in order to modify this rank aggregation. These groupings might then be used to forecast ratings if necessary.

We limit our choices to films that have received ratings for the entire 74-month period, for reasons we shall shortly clarify. There are just 25 of these kinds of films. The monthly average scores for a few of these films indicate notable upward or decreasing tendencies. The time-varying scores of six of these films—Dune (17064), October Sky (12473), Shakespeare in Love (17764), Interview with the Vampire (8079), The Waterboy (14660), and Witness (15057). The numerical indices in the Netflix dataset are provided in parenthesis. Because user ratings from different eras may not be comparable on the same scale, it is challenging to grade films by averaging user scores due to these temporal variances. It is noteworthy that the methodology employed by Bell and Koren is significantly dependent on their understanding of the temporal dynamics present in the Netflix dataset. We will demonstrate how HodgeRank can be used to efficiently rank these films on a global scale, find any underlying differences, and maintain the system’s stability over time. Formation of a hierarchical pair. Since pairwise rankings are relative measurements, temporal drift should be less of an effect. We only use evaluations from the same customer within the same month to determine our pairwise rankings, which are the geometric mean of score ratios, the arithmetic mean of score discrepancies, and the binary comparisons. We opted to omit the logarithmic odds ratio since, as predicted, it produces a poor result because there is no evidence to establish the importance of a logarithmic scale. We compute the average score for every movie for every consumer, disregarding any temporal information, so that we may compare them. An independent reference score is derived from MRQE (Movie Review Query Engine), the largest online database of movie reviews available.

 

References:

  1. X. Jiang, L. H. Lim, Y. Yao, and Y. Ye, “Statistical ranking and combinatorial hodge theory,” Mathematical Programming, vol. 127, pp. 203–244, 3 2011.
  2. L. Xiong and L. Liu, “A reputation-based trust model for peer-to-peer ecommerce communities,” 2003.
  3. P. Cremonesi, Y. Koren, and R. Turrin, “Performance of recommender algorithms on top-n recommendation tasks,” 2010, pp. 39–46.
  4. J. Li, H. Wang, G. E. Daggard, H. Hu, A. Plank, and G. Daggard, “A comparative study of classification methods for microarray data analysis,” 2006. [Online]. Available: https://www.researchgate.net/publication/221337973
  5. D. Y. Liu, H. L. Chen, B. Yang, X. E. Lv, L. N. Li, and J. Liu, “Design of an enhanced fuzzy k-nearest neighbor classifier based computer aided diagnostic system for thyroid disease,” Journal of Medical Systems, vol. 36, pp. 3243–3254, 10 2012.
  6. “Advances in large margin classifiers.”
  7. D. S. Hochbaum, The Separation, and Separation-Deviation Methodol- ogy for Group Decision Making and Aggregate Ranking. INFORMS, 9 2010, pp. 116–141.
  8. X. Q. Cheng, P. Du, J. Guo, X. Zhu, and Y. Chen, “Ranking on data manifold with sink points,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, pp. 177–191, 2013.
  9. J. L. Johnson and T. Goldring, “ieee copublished by the ieee cs and the aip,” 2013.
  10. H. Yin, X. R. Li, and J. Lan, “Pairwise comparison based ranking vector approach to estimation performance ranking,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, pp. 942–953, 6 2018.
  11. F. Ong and M. Lustig, “Beyond low rank + sparse: Multiscale low rank matrix decomposition,” IEEE Journal on Selected Topics in Signal Processing, vol. 10, pp. 672–687, 6 2016.
  12. S. A. alias Balamurugan, M. S. Mallick, and G. Chinthana, “Improved prediction of dengue outbreak using combinatorial feature selector and classifier based on entropy weighted score based optimal ranking,” Informatics in Medicine Unlocked, vol. 20, 1 2020.
Информация об авторах

Master Student, Department of Computer Science, Kazakh British Technical University (KBTU), Kazakhstan, Almaty

магистрант, кафедра компьютерных наук, Казахско-Британский технический университет (КБТУ), Казахстан, г. Алматы

PhD, Associate Professor of School of Information Technology and Engineering at Kazakh-British Technical University, Kazakhstan, Almaty

канд. техн. наук, доцент Школы Информационных Технологий и Инженерии Казахстанско-Британского технического университета, Казахстан, г. Алматы

Журнал зарегистрирован Федеральной службой по надзору в сфере связи, информационных технологий и массовых коммуникаций (Роскомнадзор), регистрационный номер ЭЛ №ФС77-54434 от 17.06.2013
Учредитель журнала - ООО «МЦНО»
Главный редактор - Звездина Марина Юрьевна.
Top