Sequence Analysis for Large Databases

Author
Abstract

This article reviews and develops methods for building sequence analysis typologies in large databases. Building such typologies relies on computing the distances between all pairs of observations, which quickly becomes intractable as the number of cases grows, even on modern computers. We start by discussing the CLARA algorithm, then extend it with methods recently proposed for sequence analysis. The strengths of these approaches are assessed using simulations, from which we also derive practical guidelines. Next, we discuss three approaches for measuring clustering quality without computing all distances. The first is based on representative sequences (i.e., medoids) and the second on bootstrapping. The third, which we introduce here, is based on clustering stability and also makes it possible to assess the convergence of the clustering algorithm. The methods are illustrated with a study of family trajectories in India covering more than 180,000 cases. All methods are available in the WeightedCluster R package.
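
To make the scale problem concrete: a full distance matrix for n cases holds n(n-1)/2 pairwise distances, which for 180,000 cases is roughly 1.6 × 10^10 values. The sketch below illustrates the general idea behind CLARA as it is usually described (cluster several small random subsamples with k-medoids, assign every case to its nearest medoid, keep the best solution). It is a minimal illustration under stated assumptions, not the paper's implementation and not the WeightedCluster API: the function names `hamming`, `pam`, and `clara`, the parameter defaults, and the use of Hamming distance as the sequence dissimilarity are all hypothetical choices made for this example.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ.
    Assumption: used here only as a simple stand-in dissimilarity."""
    return sum(x != y for x, y in zip(a, b))

def pam(items, k, dist, rng):
    """Naive k-medoids on a small sample: swap medoids while total cost drops."""
    n = len(items)
    # The full distance matrix is affordable here because n is the sample size.
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)

    def cost(meds):
        return sum(min(d[i][m] for m in meds) for i in range(n))

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    return [items[m] for m in medoids]

def clara(data, k, dist=hamming, n_samples=5, sample_size=40, seed=0):
    """Sampling-based k-medoids: only sample-by-sample and data-by-medoid
    distances are computed, never the full data-by-data matrix."""
    rng = random.Random(seed)
    best_cost, best_medoids = float("inf"), None
    for _ in range(n_samples):
        sample = rng.sample(data, min(sample_size, len(data)))
        medoids = pam(sample, k, dist, rng)
        # Evaluate the sample's medoids on the whole data set.
        total = sum(min(dist(s, m) for m in medoids) for s in data)
        if total < best_cost:
            best_cost, best_medoids = total, medoids
    # Assign every case to its nearest medoid in the best solution.
    labels = [min(range(k), key=lambda j: dist(s, best_medoids[j])) for s in data]
    return best_medoids, labels, best_cost
```

Calling `clara(sequences, k=2)` on a list of equal-length state strings returns the selected medoid sequences, a cluster label per case, and the total distance to the nearest medoid. Each subsample costs on the order of sample_size² distances plus n × k distances for the assignment step, which is why the approach scales to data sets where the full matrix is out of reach.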

Year of publication
2024
Journal
LIVES Working papers
Volume
104
Start Page
1
Number of pages
42
Publication date
09/2024
ISSN
2296-1658
URL
http://dx.doi.org/10.12682/lives.2296-1658.2024.104
DOI
10.12682/lives.2296-1658.2024.104
Keywords