
Sequence Analysis for Large Databases

Author
Abstract

This article reviews and develops methods for building sequence analysis typologies in large databases. Building such a typology relies on computing the distances between all pairs of observations, which quickly becomes intractable as the database grows, even on modern computers. We start by discussing the CLARA algorithm before extending it with methods recently proposed for sequence analysis. The strengths of these approaches are assessed using simulations, from which we also derive practical guidelines. Next, we discuss three approaches for measuring clustering quality without computing all distances. The first is based on representative sequences (i.e., medoids) and the second on bootstrapping. The third, a new approach introduced here, is based on clustering stability and also makes it possible to assess the convergence of the clustering algorithm. The methods are illustrated through a study of family trajectories in India with more than 180,000 cases. All the methods are available in the WeightedCluster R package.
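To give a sense of the scale problem mentioned in the abstract: 180,000 sequences imply about 16.2 billion pairwise distances. The sketch below illustrates the general CLARA idea (cluster a small random subsample with k-medoids, assign every observation to its nearest medoid, repeat, and keep the best solution). It is a minimal Python illustration, not the WeightedCluster implementation: the function names, the Hamming distance, the naive swap-based k-medoids step, and the toy state labels are all assumptions chosen for brevity.

```python
import random


def n_pairwise(n):
    """Number of distinct pairs among n observations."""
    return n * (n - 1) // 2


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pam(items, k, dist, rng):
    """Naive k-medoids on a small sample: swap medoids while the cost improves."""
    n = len(items)
    # The full distance matrix is only computed on the subsample.
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)

    def cost(meds):
        return sum(min(d[i][m] for m in meds) for i in range(n))

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    return [items[m] for m in medoids]


def clara(data, k, dist=hamming, n_samples=5, sample_size=40, seed=0):
    """CLARA-style clustering: best medoids found over several subsamples."""
    rng = random.Random(seed)
    best_cost, best_medoids = float("inf"), None
    for _ in range(n_samples):
        sample = rng.sample(data, min(sample_size, len(data)))
        medoids = pam(sample, k, dist, rng)
        # Only n * k distances are needed to evaluate the whole database.
        total = sum(min(dist(s, m) for m in medoids) for s in data)
        if total < best_cost:
            best_cost, best_medoids = total, medoids
    labels = [min(range(k), key=lambda j: dist(s, best_medoids[j])) for s in data]
    return best_medoids, labels, best_cost


# Toy data with hypothetical states: S = single, M = married.
toy = ["SSSSSSSSSS"] * 100 + ["MMMMMMMMMM"] * 100
medoids, labels, total_cost = clara(toy, k=2)
print(n_pairwise(180_000), medoids, total_cost)
```

Each iteration computes a full distance matrix only for the subsample, plus one distance per observation and medoid, so the cost grows linearly with the size of the database instead of quadratically.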

Year of Publication
2024
Journal
LIVES Working papers
Volume
104
Start Page
1
Number of Pages
42
Date Published
09/2024
ISSN Number
2296-1658
URL
http://dx.doi.org/10.12682/lives.2296-1658.2024.104
DOI
10.12682/lives.2296-1658.2024.104
Keywords