Text Clustering - Latent Dirichlet Allocation
- ckevinkusuma
- Jun 3, 2021
Summary
DOMO is a technology company that provides cloud-based business intelligence software. Its customers store and process billions of rows of data every week for purposes such as tracking sales, goal management, and human resources. It is very difficult to determine how customers use the platform because each customer has countless cards, pages, dataflows, and connectors in their account. It is also hard to get that insight from customers directly, because the answers depend on the customer, the industry they are in, and the role of the person providing the information. There is value in understanding how customers use the product: it helps in a number of ways, such as improving product features, identifying upsell opportunities, and targeting marketing campaigns.
By utilizing a machine learning algorithm, we can gather all card, page, and dataset names and analyze them efficiently. The card, page, and dataset names are rolled up to a level at which we can identify their usage. We then apply various cleaning and preprocessing techniques to bring the data to a state where text clustering is possible. We pick Latent Dirichlet Allocation as the main engine to cluster the words because it gives us not only the groupings but also the words that belong to each group, which we can use for labeling.
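To make the preprocessing and clustering steps concrete, here is a minimal sketch in plain Python. It is not the pipeline used in the project: the sample names, the stopword list, the hyperparameters (`alpha`, `beta`, `n_iter`), and the helper names `preprocess` and `fit_lda` are all assumptions for illustration, and the fitting uses a simple collapsed Gibbs sampler rather than any particular library. In practice a library implementation (for example in scikit-learn or gensim) would be the more likely choice.

```python
import random
import re
from collections import Counter

# Hypothetical card/page/dataset names; the real inputs are not shown in the post.
NAMES = [
    "Q3 Sales Pipeline by Region",
    "Sales Revenue vs Quota",
    "Regional Sales Pipeline Forecast",
    "Facebook Ads Campaign Clicks",
    "Google Ads Campaign Spend",
    "Email Campaign Clicks and Impressions",
    "Employee Headcount by Department",
    "Employee Turnover and Headcount",
]
STOPWORDS = {"by", "vs", "and", "the", "of"}  # assumed, deliberately tiny


def preprocess(name):
    """Lowercase, keep alphabetic tokens, drop stopwords and very short tokens."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (doc_topic, topic_word) counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


docs = [preprocess(n) for n in NAMES]
doc_topic, topic_word = fit_lda(docs, n_topics=3)

# Top words per topic are what a person reads to label each cluster.
top_words = [[w for w, c in tw.most_common(4) if c > 0] for tw in topic_word]
# Hard cluster assignment: the topic that owns most of a name's tokens.
labels = [max(range(3), key=lambda k: row[k]) for row in doc_topic]

for k, words in enumerate(top_words):
    print(f"topic {k}: {', '.join(words)}")
```

The two outputs mirror the two things the post says LDA provides: `labels` is the grouping, and `top_words` is the list of words per group that a person can read to name the cluster. The number of topics is a choice the analyst has to make; three is used here only because the toy data has three obvious themes.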

The Latent Dirichlet Allocation algorithm gives us cleanly separated clusters, and the figure on the left shows the most likely words for one of the clusters. Based on the words in this cluster, we can determine that many customers use the platform to track their digital marketing efforts.
The output of the Latent Dirichlet Allocation algorithm is then used to train a multi-class classification model using a different machine learning algorithm. The overall accuracy of the model on the testing data is around 75%, but accuracy varies across clusters: some clusters have accuracies above 85%, while others are just above 60%.
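The post does not name the classification algorithm, so the sketch below swaps in a multinomial Naive Bayes classifier written in plain Python purely as an assumption. The labeled examples are hypothetical stand-ins for names that LDA has already assigned to clusters, and `train_nb`, `predict`, and `accuracy_report` are illustrative helper names. The point is only to show how overall accuracy and per-cluster accuracy are computed on a held-out test split.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (tokens, cluster label) pairs standing in for the LDA output.
DATA = [
    (["sales", "pipeline", "region"], "sales"),
    (["sales", "revenue", "quota"], "sales"),
    (["regional", "sales", "forecast"], "sales"),
    (["pipeline", "forecast", "quota"], "sales"),
    (["sales", "pipeline", "stage"], "sales"),
    (["revenue", "region", "quota"], "sales"),
    (["facebook", "ads", "campaign"], "marketing"),
    (["google", "ads", "spend"], "marketing"),
    (["email", "campaign", "clicks"], "marketing"),
    (["campaign", "clicks", "impressions"], "marketing"),
    (["ads", "campaign", "clicks"], "marketing"),
    (["email", "impressions", "spend"], "marketing"),
    (["employee", "headcount", "department"], "hr"),
    (["employee", "turnover", "headcount"], "hr"),
    (["hiring", "headcount", "department"], "hr"),
    (["employee", "hiring", "turnover"], "hr"),
    (["turnover", "department", "employee"], "hr"),
    (["hiring", "employee", "tenure"], "hr"),
]

# Deterministic split: every third example is held out for testing.
train = [s for i, s in enumerate(DATA) if i % 3 != 0]
test = [s for i, s in enumerate(DATA) if i % 3 == 0]


def train_nb(samples, smoothing=1.0):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in samples for w in tokens}
    return class_counts, word_counts, vocab, smoothing


def predict(model, tokens):
    """Return the class with the highest log posterior under Naive Bayes."""
    class_counts, word_counts, vocab, smoothing = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + smoothing * len(vocab)
        score = math.log(n / total)
        for t in tokens:
            if t in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][t] + smoothing) / denom)
        if score > best_score:
            best, best_score = label, score
    return best


def accuracy_report(model, samples):
    """Overall accuracy plus accuracy broken down by true cluster."""
    hits, totals = Counter(), Counter()
    for tokens, label in samples:
        totals[label] += 1
        hits[label] += predict(model, tokens) == label
    overall = sum(hits.values()) / sum(totals.values())
    per_class = {c: hits[c] / totals[c] for c in totals}
    return overall, per_class


model = train_nb(train)
overall, per_class = accuracy_report(model, test)
print(f"overall accuracy: {overall:.2f}")
for cluster, acc in sorted(per_class.items()):
    print(f"  {cluster}: {acc:.2f}")
```

Reporting accuracy per cluster, as the post does, matters because a single overall figure can hide clusters the model handles poorly. Numbers from a toy set like this one say nothing about the 75% figure reported above.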


