Comprehensive Coverage of the Entire Area of Classification

Research on the problem of classification tends to be fragmented across such areas as pattern recognition, database, data mining, and machine learning. Addressing the work of these different communities in a unified way, Data Classification: Algorithms and Applications explores the underlying algorithms of classification as well as applications of classification in a variety of problem domains, including text, multimedia, social network, and biological data.

This comprehensive book focuses on three primary aspects of data classification:

Methods: The book first describes common techniques used for classification, including probabilistic methods, decision trees, rule-based methods, instance-based methods, support vector machine methods, and neural networks.

Domains: The book then examines specific methods used for data domains such as multimedia, text, time-series, network, discrete sequence, and uncertain data. It also covers large data sets and data streams due to the recent importance of the big data paradigm.

Variations: The book concludes with insight on variations of the classification process. It discusses ensembles, rare-class learning, distance function learning, active learning, visual learning, transfer learning, and semi-supervised learning as well as evaluation aspects of classifiers.

Cited By

Li P, Zhang H, Hu X and Wu X (2023). High-Dimensional Multi-Label Data Stream Classification With Concept Drifting Detection, IEEE Transactions on Knowledge and Data Engineering , 35 :8 , (8085-8099), Online publication date: 1-Aug-2023 .

Hidalgo J, Santos S and Barros R (2021). Dynamically Adjusting Diversity in Ensembles for the Classification of Data Streams with Concept Drift, ACM Transactions on Knowledge Discovery from Data , 16 :2 , (1-20), Online publication date: 30-Apr-2022 .

Bi X, Zhang C, Wang F, Liu Z, Zhao X, Yuan Y and Wang G (2021). An Uncertainty-based Neural Network for Explainable Trajectory Segmentation, ACM Transactions on Intelligent Systems and Technology , 13 :1 , (1-18), Online publication date: 28-Feb-2022 .

Karim M, Cochez M, Zappa A, Sahay R, Rebholz-Schuhmann D, Beyan O and Decker S (2022). Convolutional Embedded Networks for Population Scale Clustering and Bio-Ancestry Inferencing, IEEE/ACM Transactions on Computational Biology and Bioinformatics , 19 :1 , (369-382), Online publication date: 1-Jan-2022 .

Yu S, Wang Y, Gu Y, Dhulipala L and Shun J (2021). ParChain, Proceedings of the VLDB Endowment , 15 :2 , (285-298), Online publication date: 1-Oct-2021 .

Li Q, Zhang X, Liu H, Dai Q and Wu X Dimensionwise Separable 2-D Graph Convolution for Unsupervised and Semi-Supervised Learning on Graphs Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, (953-963)

Yang R, Shi J, Yang Y, Huang K, Zhang S and Xiao X Effective and Scalable Clustering on Massive Attributed Graphs Proceedings of the Web Conference 2021, (3675-3687)

Lensen A, Xue B and Zhang M (2021). Genetic Programming for Evolving Similarity Functions for Clustering, Evolutionary Computation , 28 :4 , (531-561), Online publication date: 1-Dec-2020 .

Vanderlei Fernandes L, Sarmet M, Castanho C, Pezzuol Jacobi R and e Silva T Analysis of Clustering Techniques in MMOG with Restricted Data: The Case of Final Fantasy XIV Design, User Experience, and Usability. Design for Contemporary Interactive Environments, (586-604)

Guidotti A and Vanelli‐Coralli A (2019). Clustering strategies for multicast precoding in multibeam satellite systems, International Journal of Satellite Communications and Networking , 38 :2 , (85-104), Online publication date: 17-Feb-2020 .

Solorio-Fernández S, Carrasco-Ochoa J and Martínez-Trinidad J (2019). A review of unsupervised feature selection methods, Artificial Intelligence Review , 53 :2 , (907-948), Online publication date: 1-Feb-2020 .

Zhang X, Liu H, Li Q and Wu X Attributed graph clustering via adaptive graph convolution Proceedings of the 28th International Joint Conference on Artificial Intelligence, (4327-4333)

Cheng A, Zhou C, Yang H, Wu J, Li L, Tan J and Guo L Deep active learning for anchor user prediction Proceedings of the 28th International Joint Conference on Artificial Intelligence, (2151-2157)

Rezig E, Ouzzani M, Elmagarmid A, Aref W and Stonebraker M Towards an End-to-End Human-Centric Data Cleaning Framework Proceedings of the Workshop on Human-In-the-Loop Data Analytics, (1-7)

Ndenga M, Ganchev I, Mehat J, Wabwoba F and Akdag H (2019). Performance and cost-effectiveness of change burst metrics in predicting software faults, Knowledge and Information Systems , 60 :1 , (275-302), Online publication date: 1-Jul-2019 .

Tang M, Marin D, Ben Ayed I and Boykov Y (2019). Kernel Cuts, International Journal of Computer Vision , 127 :5 , (477-511), Online publication date: 1-May-2019 .

Barton T, Bruna T and Kordik P (2019). Chameleon 2, ACM Transactions on Knowledge Discovery from Data , 13 :1 , (1-27), Online publication date: 28-Feb-2019 .

Cabanban-Casem C Analytical Visualization of Higher Education Institutions' Big Data for Decision Making Proceedings of the 2019 Asia Pacific Information Technology Conference, (61-64)

He Z and Yu C (2019). Clustering stability-based Evolutionary K-Means, Soft Computing - A Fusion of Foundations, Methodologies and Applications , 23 :1 , (305-321), Online publication date: 1-Jan-2019 .

Hu J and Pei J (2018). Subspace multi-clustering, Knowledge and Information Systems , 56 :2 , (257-284), Online publication date: 1-Aug-2018 .

Zhang Y, Zhao P, Cao J, Ma W, Huang J, Wu Q and Tan M Online Adaptive Asymmetric Active Learning for Budgeted Imbalanced Data Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, (2768-2777)

Deshmukh J, Jin X, Majumdar R and Prabhu V Parameter optimization in control software using statistical fault localization techniques Proceedings of the 9th ACM/IEEE International Conference on Cyber-Physical Systems, (220-231)

Ahmadi Z and Kramer S (2018). Modeling recurring concepts in data streams, Knowledge and Information Systems , 55 :1 , (15-44), Online publication date: 1-Apr-2018 .

Nguyen C and Artemiadis P (2018). EEG feature descriptors and discriminant analysis under Riemannian Manifold perspective, Neurocomputing , 275 :C , (1871-1883), Online publication date: 31-Jan-2018 .

Losing V, Hammer B and Wersing H (2018). Incremental on-line learning, Neurocomputing , 275 :C , (1261-1274), Online publication date: 31-Jan-2018 .

Li Y, Hou D, Pan A and Gong Z DeMalC Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, (1559-1567)

Liu W, Ye M, Wei J and Hu X (2017). Compressed constrained spectral clustering framework for large-scale data sets, Knowledge-Based Systems , 135 :C , (77-88), Online publication date: 1-Nov-2017 .

Barbon A, Barbon S, Campos G, Seixas J, Peres L, Mastelini S, Andreo N, Ulrici A and Bridi A (2017). Development of a flexible Computer Vision System for marbling classification, Computers and Electronics in Agriculture , 142 :PB , (536-544), Online publication date: 1-Nov-2017 .

Liu Q, Wu X, Kittinger L, Levy M and Jung C (2017). BenchPrime, ACM Transactions on Embedded Computing Systems , 16 :5s , (1-22), Online publication date: 10-Oct-2017 .

Baghdadi Y, Al-Thuhli A, Al-Badawi M and Al-Hamdani A (2017). A Framework for Interfacing Unstructured Data Into Business Process From Enterprise Social Networks, International Journal of Enterprise Information Systems , 13 :4 , (15-30), Online publication date: 1-Oct-2017 .

Sathe S and Aggarwal C Similarity Forests Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (395-403)

Cassavia N, Flesca S and Masciari E Choose The Best! Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, (1209-1216)

Lensen A, Xue B and Zhang M Improving k-means clustering with genetic programming for feature construction Proceedings of the Genetic and Evolutionary Computation Conference Companion, (237-238)

Lensen A, Xue B and Zhang M GPGC Proceedings of the Genetic and Evolutionary Computation Conference, (449-456)

Xu J, Wang G, Li T, Deng W and Gou G (2017). Fat node leading tree for data stream clustering with density peaks, Knowledge-Based Systems , 120 :C , (99-117), Online publication date: 15-Mar-2017 .

Barbon S, Igawa R and Bogaz Zarpelão B (2017). Authorship verification applied to detection of compromised accounts on online social networks, Multimedia Tools and Applications , 76 :3 , (3213-3233), Online publication date: 1-Feb-2017 .

Luna J, Castro C and Romero C (2017). MDM tool, Computer Applications in Engineering Education , 25 :1 , (90-102), Online publication date: 1-Jan-2017 .

Xu J, Wang G and Deng W (2016). DenPEHC, Information Sciences: an International Journal , 373 :C , (200-218), Online publication date: 10-Dec-2016 .

Esmaelian M, Shahmoradi H and Vali M (2016). A novel classification method, Applied Soft Computing , 49 :C , (56-70), Online publication date: 1-Dec-2016 .

Jang J, Lee Y, Lee S, Shin D, Kim D and Rim H (2016). A novel density-based clustering method using word embedding features for dialogue intention recognition, Cluster Computing , 19 :4 , (2315-2326), Online publication date: 1-Dec-2016 .

Schneider S, Wolf J, Hildrum K, Khandekar R and Wu K Dynamic Load Balancing for Ordered Data-Parallel Regions in Distributed Streaming Systems Proceedings of the 17th International Middleware Conference, (1-14)

Hug N, Prade H, Richard G and Serrurier M Analogical classifiers Proceedings of the Twenty-second European Conference on Artificial Intelligence, (689-697)

Tang N, Chen Q and Mitra P Graph Stream Summarization Proceedings of the 2016 International Conference on Management of Data, (1481-1496)

Khan F, Qamar U and Bashir S (2016). SWIMS, Knowledge-Based Systems , 100 :C , (97-111), Online publication date: 15-May-2016 .

Zhang F, Zheng Q, Zou Y and Hassan A Cross-project defect prediction using a connectivity-based unsupervised classifier Proceedings of the 38th International Conference on Software Engineering, (309-320)

Oikawa M, Dias Z, de Rezende Rocha A and Goldenstein S (2016). Manifold Learning and Spectral Clustering for Image Phylogeny Forests, IEEE Transactions on Information Forensics and Security , 11 :1 , (5-18), Online publication date: 1-Jan-2016 .

Rastin P and Kanawati R A multiplex-network based approach for clustering ensemble selection Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, (1332-1339)

Cai Y, Ratan R, Shen C and Alameda J Grouping game players using parallelized k-means on supercomputers Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, (1-7)

Chaddad A (2015). Automated feature extraction in brain tumor by magnetic resonance imaging using Gaussian mixture models, Journal of Biomedical Imaging , 2015 , (8-8), Online publication date: 1-Jan-2015 .

Aggarwal C The setwise stream classification problem Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, (432-441)

Kanawati R Seed-Centric Approaches for Community Detection in Complex Networks Proceedings of the 6th International Conference on Social Computing and Social Media - Volume 8531, (197-208)

Charu C. Aggarwal

IBM Thomas J. Watson Research Center

Index Terms

Data Classification: Algorithms and Applications

Reviews

Reviewer: Michael Goldberg

Charu Aggarwal is the editor of this compendium of chapters on data classification for data mining applications. He is a distinguished researcher at the IBM Watson Research Center. He has edited and/or written 14 books on data mining and uncertain data. He has also served on the editorial staff of various IEEE journals, among others, and is a fellow of the three main research societies for mathematical sciences and engineering: ACM, IEEE, and SIAM.

In this text, Aggarwal has authored or coauthored seven of the 25 chapters: the introductory chapter (chapter 1), instance-based learning using lazy methods (chapter 6), a survey of stream classification algorithms (chapter 9), text classification (chapter 11), rare class detection (chapter 17), a survey on active learning (chapter 22), and the final chapter, which discusses educational and software resources for data classification.

Duda and Hart wrote a fundamental treatise [1] on data and pattern classification, which until recently was considered the classic text. Subsequent editions have been equally important. Wu et al. collected and summarized the top ten most important algorithms in the data mining literature [2]. These ten algorithms are likewise discussed in the first eight chapters of this compendium.

Aggarwal takes a three-pronged approach in selecting the chapters for this text. One main focus is on the fundamental methods (chapters 1 through 8). A second focus is on specific data domains (chapters 9 through 16). The third focus is on advanced applications and variations on classic themes (chapters 17 through 23). The last two chapters are generic and can be useful in tandem with any of the prior chapters. Chapter 24 presents evaluation techniques for classification methods, and chapter 25 considers academic resources for data classification.

From a commercial point of view, the most popular techniques are C4.5 (commercially available as C5.0) and classification and regression trees (CART). They are discussed in chapter 4, with case studies presenting their practical application. Both approaches build a classifier that represents the decision-making process as a tree structure. From that structure, the necessary set of rules can be constructed. The C4.5 algorithm identifies properties in order to form a rule of the form, "if the object has a certain set of properties Π, then its category must be X." Questions remain about the robustness of these methods in the presence of error, and whether a hybrid approach would improve these methods by incorporating rules obtained according to other criteria. CART precedes C4.5, and its decision tree is obtained by a recursive partitioning algorithm. The main difference between the two is that CART decisions are binary, whereas C4.5 considers multiple outcomes.

Stream classification algorithms are considered to be one of the hottest topics, from a practical perspective, in data classification today. Recent advances in hardware and network technology have enabled large amounts of data and network streams to be handled at any given moment. How to reason and learn about the data from the streams is an open and difficult problem. The sources of these streams can range from credit card transactions to voice over Internet protocol (VoIP). Aggarwal surveys the topic in chapter 9, with other authors also adding some insights in their respective chapters (1, 2, 4, and 22).

Given the credentials of the editor and the quality of the contributions, it goes without saying that this text is an essential reference for anyone interested in data classification or data mining.

Online Computing Reviews Service

Recommendations

Modified nearest neighbour classifier for hyperspectral data classification

A modified k-nearest neighbour (k-NN) classifier is proposed for supervised remote sensing classification of hyperspectral data. To compare its performance in terms of classification accuracy and computational cost, k-NN and a back-propagation neural .

Big data classification using heterogeneous ensemble classifiers in Apache Spark based on MapReduce paradigm

In this era of big data, processing large scale data efficiently and accurately has become a challenging problem. Ensemble classification is a type of supervised learning that uses multiple experts to generate the final output. It .

Big data classification: problems and challenges in network intrusion prediction with machine learning

This paper focuses on the specific problem of Big Data classification of network intrusion traffic. It discusses the system challenges presented by the Big Data problems associated with network intrusion prediction. The prediction of a possible .