Open Access Open Access  Restricted Access Subscription Access

Topic Detection in Tweets using Hashtags

Shazia Manzoor


Data clustering is a common technique used for topic detection and identification in text. In this paper we propose tokenizing of hashtags as an effective and efficient way of generating bags of words for tweets (text messages on Twitter).  The document- term matrix generated using these terms is smaller and more relevant compared to the full text tokenization. Tweets from the 2018 Oscars event tagged with #oscars hashtag were collected and split into sets of 50000 tweets. Three strategies for tokenization of text and hashtags were tested using two popular data clustering algorithms, K-Means++ and Non-negative Matrix Factorization (NMF). The results were compared in terms of accuracy and performance. We concluded that tokenizing hashtags is a better alternative in terms of performance and achieves results equivalent to that of tokenizing tweet text.

Keywords: Tweets, Hashtags, Clustering, Non-negative Matrix factorization, kmeans++, Differential Cluster Labeling, Point-wise Mutual Information

Cite this Article
Shazia Manzoor, Taha Hafeez Siddiqi. Topic Detection in Tweets using Hashtags. Recent Trends in Programming Languages. 2018; 5(2): 6–14p.

Full Text:



  • There are currently no refbacks.

This site has been shifted to