Open Access Open Access  Restricted Access Subscription Access

Topic Detection in Tweets using Hashtags

Shazia Manzoor

Abstract


Data clustering is a common technique used for topic detection and identification in text. In this paper we propose tokenizing of hashtags as an effective and efficient way of generating bags of words for tweets (text messages on Twitter).  The document- term matrix generated using these terms is smaller and more relevant compared to the full text tokenization. Tweets from the 2018 Oscars event tagged with #oscars hashtag were collected and split into sets of 50000 tweets. Three strategies for tokenization of text and hashtags were tested using two popular data clustering algorithms, K-Means++ and Non-negative Matrix Factorization (NMF). The results were compared in terms of accuracy and performance. We concluded that tokenizing hashtags is a better alternative in terms of performance and achieves results equivalent to that of tokenizing tweet text.

Full Text:

PDF

Refbacks

  • There are currently no refbacks.