Art

Density-based Algorithms for Active and Anytime Clustering. Mai Thai Son

Categories
Published
of 35
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Related Documents
Share
Description
Density-based Algorithms for Active and Anytime Clustering Mai Thai Son München Density-based Algorithms for Active and Anytime Clustering Mai Thai Son Dissertation an der Fakultät Für Mathematik,
Transcript
Density-based Algorithms for Active and Anytime Clustering Mai Thai Son München 2014 Density-based Algorithms for Active and Anytime Clustering Mai Thai Son Dissertation an der Fakultät Für Mathematik, Informatik und Statistik, Institut Für Informatik der Ludwig Maximilians Universität München vorgelegt von Mai Thai Son aus Vietnam München, den 09 Sep 2013 Erstgutachter: Prof. Dr. Christian Böhm Zweitgutachter: Prof. Dr. Xiaowei Xu Tag der mündlichen Prüfung: 26 Sep 2014 Abstract Data intensive applications like biology, medicine, and neuroscience require effective and efficient data mining technologies. Advanced data acquisition methods produce a constantly increasing volume and complexity. As a consequence, the need of new data mining technologies to deal with complex data has emerged during the last decades. In this thesis, we focus on the data mining task of clustering in which objects are separated in different groups (clusters) such that objects inside a cluster are more similar than objects in different clusters. Particularly, we consider density-based clustering algorithms and their applications in biomedicine. The core idea of the density-based clustering algorithm DBSCAN is that each object within a cluster must have a certain number of other objects inside its neighborhood. Compared with other clustering algorithms, DBSCAN has many attractive benefits, e.g., it can detect clusters with arbitrary shape and is robust to outliers, etc. Thus, DBSCAN has attracted a lot of research interest during the last decades with many extensions and applications. In the first part of this thesis, we aim at developing new algorithms based on the DBSCAN paradigm to deal with the new challenges of complex data, particularly expensive distance measures and incomplete availability of the distance matrix. Like many other clustering algorithms, DBSCAN suffers from poor performance when facing expensive distance measures for complex data. To tackle this problem, we propose a new algorithm based on the DBSCAN paradigm, called Anytime Density-based Clustering (A-DBSCAN), that works in an anytime scheme: in contrast to the original batch scheme of DBSCAN, the algorithm A-DBSCAN first produces a quick approximation of the clustering result and then continuously refines the result during the further run. Experts can interrupt the algorithm, examine the results, and choose between (1) stopping the algorithm at any time whenever they are satisfied with the result to save runtime and (2) continuing the algorithm to achieve better results. Such kind of anytime scheme has been proven in the literature as a very useful technique when dealing with vi time consuming problems. We also introduced an extended version of A-DBSCAN called A-DBSCAN-XS which is more efficient and effective than A-DBSCAN when dealing with expensive distance measures. Since DBSCAN relies on the cardinality of the neighborhood of objects, it requires the full distance matrix to perform. For complex data, these distances are usually expensive, time consuming or even impossible to acquire due to high cost, high time complexity, noisy and missing data, etc. Motivated by these potential difficulties of acquiring the distances among objects, we propose another approach for DBSCAN, called Active Density-based Clustering (Act-DBSCAN). Given a budget limitation B, Act-DBSCAN is only allowed to use up to B pairwise distances ideally to produce the same result as if it has the entire distance matrix at hand. The general idea of Act-DBSCAN is that it actively selects the most promising pairs of objects to calculate the distances between them and tries to approximate as much as possible the desired clustering result with each distance calculation. This scheme provides an efficient way to reduce the total cost needed to perform the clustering. Thus it limits the potential weakness of DBSCAN when dealing with the distance sparseness problem of complex data. As a fundamental data clustering algorithm, density-based clustering has many applications in diverse fields. In the second part of this thesis, we focus on an application of density-based clustering in neuroscience: the segmentation of the white matter fiber tracts in human brain acquired from Diffusion Tensor Imaging (DTI). We propose a model to evaluate the similarity between two fibers as a combination of structural similarity and connectivity-related similarity of fiber tracts. Various distance measure techniques from fields like time-sequence mining are adapted to calculate the structural similarity of fibers. Density-based clustering is used as the segmentation algorithm. We show how A-DBSCAN and A-DBSCAN-XS are used as novel solutions for the segmentation of massive fiber datasets and provide unique features to assist experts during the fiber segmentation process. Zusammenfassung Datenintensive Anwendungen wie Biologie, Medizin und Neurowissenschaften erfordern effektive und effiziente Data-Mining-Technologien. Erweiterte Methoden der Datenerfassung erzeugen stetig wachsende Datenmengen und Komplexität. In den letzten Jahrzehnten hat sich daher ein Bedarf an neuen Data-Mining- Technologien für komplexe Daten ergeben. In dieser Arbeit konzentrieren wir uns auf die Data-Mining-Aufgabe des Clusterings, in der Objekte in verschiedenen Gruppen (Cluster) getrennt werden, so dass Objekte in einem Cluster untereinander viel ähnlicher sind als Objekte in verschiedenen Clustern. Insbesondere betrachten wir dichtebasierte Clustering-Algorithmen und ihre Anwendungen in der Biomedizin. Der Kerngedanke des dichtebasierten Clustering-Algorithmus DBSCAN ist, dass jedes Objekt in einem Cluster eine bestimmte Anzahl von anderen Objekten in seiner Nachbarschaft haben muss. Im Vergleich mit anderen Clustering- Algorithmen hat DBSCAN viele attraktive Vorteile, zum Beispiel kann es Cluster mit beliebiger Form erkennen und ist robust gegenüber Ausreißern. So hat DBSCAN in den letzten Jahrzehnten großes Forschungsinteresse mit vielen Erweiterungen und Anwendungen auf sich gezogen. Im ersten Teil dieser Arbeit wollen wir auf die Entwicklung neuer Algorithmen eingehen, die auf dem DB- SCAN Paradigma basieren, um mit den neuen Herausforderungen der komplexen Daten, insbesondere teurer Abstandsmaße und unvollständiger Verfügbarkeit der Distanzmatrix umzugehen. Wie viele andere Clustering-Algorithmen leidet DBSCAN an schlechter Performanz, wenn es teuren Abstandsmaßen für komplexe Daten gegenüber steht. Um dieses Problem zu lösen, schlagen wir einen neuen Algorithmus vor, der auf dem DBSCAN Paradigma basiert, genannt Anytime Density-based Clustering (A- DBSCAN), der mit einem Anytime Schema funktioniert. Im Gegensatz zu dem ursprünglichen Schema DBSCAN, erzeugt der Algorithmus A-DBSCAN zuerst eine schnelle Annäherung des Clusterings-Ergebnisses und verfeinert dann kon- viii tinuierlich das Ergebnis im weiteren Verlauf. Experten können den Algorithmus unterbrechen, die Ergebnisse prüfen und wählen zwischen (1) Anhalten des Algorithmus zu jeder Zeit, wann immer sie mit dem Ergebnis zufrieden sind, um Laufzeit sparen und (2) Fortsetzen des Algorithmus, um bessere Ergebnisse zu erzielen. Eine solche Art eines Anytime Schemas ist in der Literatur als eine sehr nützliche Technik erprobt, wenn zeitaufwendige Problemen anfallen. Wir stellen auch eine erweiterte Version von A-DBSCAN als A-DBSCAN-XS vor, die effizienter und effektiver als A-DBSCAN beim Umgang mit teuren Abstandsmaßen ist. Da DBSCAN auf der Kardinalität der Nachbarschaftsobjekte beruht, ist es notwendig, die volle Distanzmatrix auszurechen. Für komplexe Daten sind diese Distanzen in der Regel teuer, zeitaufwendig oder sogar unmöglich zu errechnen, aufgrund der hohen Kosten, einer hohen Zeitkomplexität oder verrauschten und fehlende Daten. Motiviert durch diese möglichen Schwierigkeiten der Berechnung von Entfernungen zwischen Objekten, schlagen wir einen anderen Ansatz für DB- SCAN vor, namentlich Active Density-based Clustering (Act-DBSCAN). Bei einer Budgetbegrenzung B, darf Act-DBSCAN nur bis zu B ideale paarweise Distanzen verwenden, um das gleiche Ergebnis zu produzieren, wie wenn es die gesamte Distanzmatrix zur Hand hätte. Die allgemeine Idee von Act-DBSCAN ist, dass es aktiv die erfolgversprechendsten Paare von Objekten wählt, um die Abstände zwischen ihnen zu berechnen, und versucht, sich so viel wie möglich dem gewünschten Clustering mit jeder Abstandsberechnung zu nähern. Dieses Schema bietet eine effiziente Möglichkeit, die Gesamtkosten der Durchführung des Clusterings zu reduzieren. So schränkt sie die potenzielle Schwäche des DBSCAN beim Umgang mit dem Distance Sparseness Problem von komplexen Daten ein. Als fundamentaler Clustering-Algorithmus, hat dichte-basiertes Clustering viele Anwendungen in den unterschiedlichen Bereichen. Im zweiten Teil dieser Arbeit konzentrieren wir uns auf eine Anwendung des dichte-basierten Clusterings in den Neurowissenschaften: Die Segmentierung der weißen Substanz bei Faserbahnen im menschlichen Gehirn, die vom Diffusion Tensor Imaging (DTI) erfasst werden. Wir schlagen ein Modell vor, um die Ähnlichkeit zwischen zwei Fasern als einer Kombination von struktureller und konnektivitätsbezogener Ähnlichkeit von Faserbahnen zu beurteilen. Verschiedene Abstandsmaße aus Bereichen wie dem Time-Sequence Mining werden angepasst, um die strukturelle Ähnlichkeit von Fasern zu berechnen. Dichte-basiertes Clustering wird als Segmentierungsalgorithmus verwendet. Wir zeigen, wie A-DBSCAN und A-DBSCAN-XS als neuartige Lösungen für die Segmentierung von sehr großen Faserdatensätzen verwendet Zusammenfassung ix werden, und bieten innovative Funktionen, um Experten während des Fasersegmentierungsprozesses zu unterstützen. x Contents Abstract Zusammenfassung Table of Contents v vii x I Introduction 1 1 Introduction Density-based Clustering Challenges of Complex Data Contributions and Thesis Outline Density-based Clustering Algorithms Introduction The Algorithm DBSCAN The Algorithm OPTICS Other Algorithms Applications of DBSCAN Extensions of DBSCAN Parameter Optimization Clustering with Varying Densities Speeding up the Algorithm xii CONTENTS Parallel and Distributed Clustering Incremental Clustering Subspace Clustering Semi-supervised Clustering Clustering Complex Data Other Algorithms Conclusions Preliminaries Cluster Validation Information Theoretic Validation Techniques Other Validation Techniques Lower bounding Distance Piecewise Aggregate Approximation Convergent Bounds on the Euclidean Distance Discrete Wavelet Transform Haar Wavelet Transform Applications of Haar Wavelet Transform II Density-based Clustering of Complex Data 59 4 Anytime and Active Clustering Anytime Clustering Anytime Algorithms Applications of Anytime Algorithms Anytime Clustering Conclusions Active Clustering Active Learning Applications of Active Algorithms CONTENTS xiii Active Clustering Conclusions Anytime Density-based Clustering Introduction Backgrounds Anytime Clustering The Algorithm DBSCAN Anytime Density-based Clustering Extended A-DBSCAN The Xseedlist The Algorithm A-DBSCAN-XS Similarity Measure and Lower bounding Experiments Evaluation Methodology Performance Evaluation Parameter Analysis Discussions Conclusions Active Density-based Clustering Introduction Backgrounds Density-based Clustering Active Clustering Active Density-based Clustering The Algorithm Act-DBSCAN Monotonicity Property Edge Probability Core Object Probability xiv CONTENTS Edge Score and Comparison Reduction Property Algorithm Analysis Similarity Measures Experiments Algorithms and Comparison Criteria Performance Analysis Parameter Analysis Comparison with Spectral Clustering How many similarities do other algorithms use? Runtime Analysis Discussions Conclusions III Application for Fiber Clustering Background on Fiber Segmentation Diffuse Tensor Imaging Fiber Similarity Measures Fiber Segmentation Algorithms Conclusions Advantage Fiber Similarity Measure Techniques Introduction Fiber Similarity Measure Shape Similarity of Two Fibers Lower Bounding Distance for WLCS Connection Similarity of Two Fibers Unified Fiber Similarity Measure Other Important Characteristics of Fiber Similarity Table of Contents xv 8.3 Fiber Segmentation Empirical Evaluation Effectiveness of The Shape Similarity Measures Effectiveness of the Unified Similarity Measure SIM Characteristics of SIM Efficiency of the Unified Similarity Measure Discussions Conclusions Advantage Fiber Clustering Techniques Introduction Fiber Similarity Measure Empirical Evaluation Conclusions IV Summary Summary Conclusions Future Works Bibliography 203 List of Figures 228 List of Tables 235 Publications 239 Acknowledgements 241 Curriculum Vitae 243 xvi CONTENTS Part I Introduction Chapter 1 Introduction Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [87]. The KDD process consists of several steps as illustrated in Figure 1.1. Preprocessing Data mining Evaluation Raw Data Processed Data Patterns Knowledge Figure 1.1: Knowledge Discovery in Databases (KDD) process. In the beginning of the KDD process, raw data are collected from different data sources. These data are usually not in a good form as they may contain noise, missing entries, inconsistencies, etc. From these data, most relevant data are then selected and preprocessed, e.g., noise removal, by the KDD process in order to increase data quality to support the subsequent processes. In the next step, data mining, as the key component of the KDD process, is performed to extract previously unknown and useful patterns from the data using automatic or semi-automatic algorithms. At the end of the KDD process, these patterns are evaluated in order to extract useful knowledge from data. In this thesis, we focus on data clustering [121], a central task of the data 4 1. Introduction mining step, in which objects are separated into different groups (clusters) so that the ones inside a cluster are more similar than those in different clusters. In particular, we aim at the density-based clustering algorithm DBSCAN [83], a fundamental data clustering technique proposed in the literature, its extensions and its applications in various fields. 1.1 Density-based Clustering In density-based clustering, clusters are regarded as areas of high object density in the data space separated by areas of lower object density. The algorithm DBSCAN [83] formalizes a density notion for clustering using two parameters: ɛ denoting a volume and µ denoting a minimal number of objects. An object belongs to a cluster if it has at least µ objects inside its ɛ-neighborhood. Compared with other clustering algorithms, DBSCAN has several attractive benefits, e.g., it can detect clusters with arbitrary shapes and is robust to outliers. Moreover, the total number of clusters does not need to be specified beforehand. Thus, DBSCAN has attracted much research interest during the last decades with many extensions and applications in various fields, e.g., [45, 160, 228, 157, 16, 272, 32]. Among many different extensions of DBSCAN, density-based clustering algorithms for complex data have become an emerging research topic with many proposed techniques in the literature recently, e.g., [84, 228, 272, 284, 86, 206, 157, 158]. However, the rapid growth of advanced data acquisition methods in many fields, e.g., medicine, biology and environment, has continuously produced a large amount of data with increasing volume and complexity, e.g., stream, time-series, graph or uncertain data. As a consequence, many challenges have been constantly arisen in order to provide efficient and effective data mining algorithms to extract knowledge from these data, in particular density-based clustering algorithms. In the following Section, we address some challenges of complex data for the density-based clustering algorithm DBSCAN. 1.2 Challenges of Complex Data Since DBSCAN relies on the pairwise similarities (dissimilarities or distances) among objects to operate, it suffers from two important problems including expensive similarity measures and similarity sparseness as described below: 1.2 Challenges of Complex Data 5 Expensive Simmilarity Measures. For complex data, there exist many effective (dis)similarity measures between objects such as Dynamic Time Warping [132] and Longest Common Subsequence [261]. However, most of these similarity measures have high time complexity (usually quadratic time complexity) which obviously becomes a bottle neck in many applications, e.g., the clustering of start light curves in [300]. A star light curve is the measurement of light intensity of a celestial object or region as a function of time. A light curve can be used to estimate the rotation period of a planet or comet nucleus in planetology or to discover supernovas in astronomy, etc. For these tasks, clustering is commonly used to support the analysis of the digital star light curves. In [300], Dynamic Time Warping (DTW) is used as an effective similarity measure for the clustering process. However, it consumes around 127 days to cluster a mere 9236 curves due to the quadratic time complexity of DTW [300]. Obviously, such the runtime bottle neck is undesired, especially in real time and interactive systems [303]. Thus, it poses an important challenge: how to improve the performance of clustering algorithms when coping with these expensive similarity measures. Among various existing approaches for density-based clustering algorithms, only some of them are designed to handle this problem, e.g., [45, 44]. Similarity Sparseness. In many applications, obtaining all similarities among objects is a nontrivial process since they may be difficult or even unavailable to obtain. For example, in transportation monitoring and control systems, GPS is usually used to collect the positions of vehicles, people, airplanes, etc. Then, clustering algorithms are used to discover common or unusual movement patterns as in [201, 93]. However, in many cases, the GPS signals may be very noisy or may be lost due to bad weather, obstacles, etc. Thus, measuring the similarities among moving object trajectories becomes very hard or even infeasible. Another example comes from the task of clustering of photos acquired from a wearable camera in [275], which plays an important role in a variety of applications, e.g., improving life quality for Alzheimer s patients or summarizing personal memories. Since the photos are collected in an arbitrary manner, assessing the similarities among these photo is out of the capability of existing automatic image processing techniques. Consequently, human annotators must be involved to rate the similarities among photos. It therefore makes the clustering an expensive process in terms of both time and money. The potential difficulties of acquiring the similarities among all objects raise another important challenge: 6 1. Introduction how to perform clustering without accessing the whole similarity matrix in order to reduce the potential costs or to cope with the unavailability of pairwise similarities. Though there are many density-based clustering algorithms proposed in the literature [155, 9], to the best of our knowledge, none of them are designed to deal with this challenge. Recently, interactive exploring of data has become a significant feature in many data mining algorithms, especially for complex data, e.g., [234, 33], since it allows domain experts to be involved into the clustering process to improve the performance and outcome. However, throughout the literature review, all the existing extensions of DBSCAN only work in a batch scheme. They produce a single result at the end and do not allow user interaction during their runtime. Providing an interactive extension of DBSCAN, therefore, is another chal
Search
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks