Sequence Analysis for Large Databases

Author	Matthias Studer, Rojin Sadeghi, Louis Tochon
Abstract	This article develops and reviews methods for creating sequence analysis typologies in large databases. Creating such typologies relies on computing distances between all pairs of observations, which quickly becomes intractable with large databases, even with modern computers. We start by discussing the CLARA algorithm before extending it with methods recently proposed for sequence analysis. The strengths of the approaches are assessed using simulations, from which we derive practical guidelines. Next, we discuss three approaches to measuring the quality of a clustering without computing all distances. The first is based on representative sequences (i.e., medoids), while the second is based on bootstrapping. We then introduce a third, innovative approach based on clustering stability, which also allows assessing the convergence of the clustering algorithm. The methods are illustrated through a study of family trajectories in India with more than 180,000 cases. All the methods are available in the WeightedCluster R package.
Year of Publication	2024
Journal	LIVES Working papers
Volume	104
Start Page	1
Number of Pages	42
Date Published	09/2024
ISSN Number	2296-1658
URL	http://dx.doi.org/10.12682/lives.2296-1658.2024.104
DOI	10.12682/lives.2296-1658.2024.104
Keywords	Typologies; Sequence analysis; Large Databases; Clustering Algorithms; Family Trajectories