A survey of Clustering Algorithms

Data Mining and Knowledge Discovery Handbook

Summary

This chapter presents a tutorial overview of the main clustering methods used in data mining. The goal is to provide a self-contained review of the concepts and mathematics underlying clustering techniques. The chapter begins with the measures and criteria used to determine whether two objects are similar or dissimilar. It then presents the clustering methods, divided into hierarchical, partitioning, density-based, model-based, grid-based, and soft-computing methods. Next, it discusses the challenges of clustering large data sets. Finally, it describes how to determine the number of clusters.

Author information

Correspondence to Lior Rokach.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Rokach, L. (2009). A survey of Clustering Algorithms. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_14

  • DOI: https://doi.org/10.1007/978-0-387-09823-4_14

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-09822-7

  • Online ISBN: 978-0-387-09823-4

  • eBook Packages: Computer Science, Computer Science (R0)