Sequence Analysis for Large Databases
Author | |
---|---|
Abstract |
This article reviews and develops methods for building sequence analysis typologies in large databases. Building such typologies relies on computing distances between all pairs of observations, which quickly becomes intractable for large databases, even with modern computers. We start by discussing the CLARA algorithm before extending it with methods recently proposed for sequence analysis. The strengths of the approaches are assessed using simulations, from which we draw practical guidelines. Next, we discuss three approaches to measuring the quality of a clustering without computing all distances. The first is based on representative sequences (i.e., medoids), while the second is based on bootstrapping. We then introduce a third, innovative approach based on clustering stability, which also allows assessing the convergence of the clustering algorithm. The methods are illustrated through a study of family trajectories in India with more than 180,000 cases. All the methods are available in the WeightedCluster R package. |
Year of publication |
2024
|
Journal |
LIVES Working papers
|
Volume |
104
|
Start Page |
1
|
Number of pages |
42
|
Publication date |
09/2024
|
ISSN |
2296-1658
|
URL |
http://dx.doi.org/10.12682/lives.2296-1658.2024.104
|
DOI |
10.12682/lives.2296-1658.2024.104
|
Keywords | |