Topic Detection in Tweets using Hashtags
Data clustering is a common technique for detecting and identifying topics in text. In this paper we propose tokenizing hashtags as an effective and efficient way of generating bags of words for tweets (text messages on Twitter). The document-term matrix built from these terms is smaller and more relevant than the one produced by full-text tokenization. Tweets from the 2018 Oscars event tagged with the #oscars hashtag were collected and split into sets of 50,000 tweets. Three strategies for tokenizing text and hashtags were tested with two popular data clustering algorithms, K-Means++ and Non-negative Matrix Factorization (NMF). The results were compared in terms of accuracy and performance. We conclude that tokenizing hashtags is the better alternative in terms of performance and achieves results equivalent to those of tokenizing the tweet text.
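The idea of building a bag of words from hashtags alone can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`hashtag_tokens`, `text_tokens`, `document_term_matrix`), the regular expressions, the lowercasing step, and the three sample tweets are all assumptions made for illustration, and the abstract does not specify the three tokenization strategies that were actually tested.

```python
import re
from collections import Counter

# Assumed patterns: a hashtag is '#' followed by word characters,
# and a full-text token is any run of word characters.
HASHTAG_RE = re.compile(r"#(\w+)")
WORD_RE = re.compile(r"\w+")


def hashtag_tokens(tweet):
    """Return the lowercased hashtags of a tweet, without the leading '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(tweet)]


def text_tokens(tweet):
    """Return every lowercased word-like token of a tweet (full-text baseline)."""
    return [word.lower() for word in WORD_RE.findall(tweet)]


def document_term_matrix(tweets, tokenizer):
    """Build a dense count matrix (one row per tweet) and its sorted vocabulary."""
    bags = [Counter(tokenizer(tweet)) for tweet in tweets]
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[term] for term in vocabulary] for bag in bags]
    return matrix, vocabulary


# Hypothetical sample tweets, invented for this sketch.
tweets = [
    "Great speech tonight #oscars #bestpicture",
    "Loved the red carpet looks #oscars #redcarpet",
    "Who wins best picture? #Oscars #BestPicture",
]

hashtag_matrix, hashtag_vocab = document_term_matrix(tweets, hashtag_tokens)
text_matrix, text_vocab = document_term_matrix(tweets, text_tokens)

print("hashtag vocabulary:", hashtag_vocab)
print("hashtag matrix:", hashtag_matrix)
print("vocabulary sizes (hashtags, full text):", len(hashtag_vocab), len(text_vocab))
```

On this toy input the hashtag vocabulary is much smaller than the full-text vocabulary, which is the effect the abstract describes for the document-term matrix. Either matrix could then be passed to a clustering routine such as K-Means++ or NMF.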