Top 20 Twitter Datasets for Natural Language Processing and Machine Learning
- Author Limarc Ambalina
- Published March 13, 2020
While it may be difficult for AI researchers and developers to find social media data for machine learning, one open source of data is Twitter. Numerous educational organizations, research teams, and independent researchers have scraped tweets from Twitter and made the data available for public use.
From sentiment analysis models to content moderation models and other NLP use cases, Twitter data can be used to train various machine learning algorithms.
Below is a list of 20 open Twitter datasets for machine learning.
Best Twitter Datasets for Natural Language Processing and Machine Learning
- Apple Twitter Sentiment
A dataset of tweets about the large tech company Apple. The tweets were collected by searching for the hashtag #AAPL, the mention @apple, and other references, and were then labeled as positive, negative, or neutral in sentiment.
- Avengers Endgame Tweets
This dataset for machine learning consists of 10,000 tweets that include the hashtag #AvengersEndgame.
- Charlottesville on Twitter
This dataset contains 150,000 tweets mentioning Charlottesville or containing the #Charlottesville hashtag.
- Credibility Corpus in French and English
The Credibility Corpus in French and English was created to analyze information credibility and detect misinformation and rumors. The dataset consists of both French and English tweets about rumors.
- Customer Support on Twitter
This dataset is a large corpus of tweets and replies to and from customer service support lines on Twitter.
- Every Donald Trump Tweet
The Every Donald Trump Tweet dataset is a compilation of every tweet Donald Trump has posted. The data was later moved to the TrumpTwitterArchive but can still be accessed.
- FollowTheHashtag: Tokyo
From FollowtheHashtag, this dataset is a collection of 200,000 geolocated tweets from Tokyo.
- FollowTheHashtag: USA
Also from FollowtheHashtag, this dataset is a collection of 200,000 geolocated tweets from the United States of America.
- Game of Thrones Season 8 Tweets
The tweets in this dataset capture audience reactions to season 8 of Game of Thrones. Game of Thrones-related tweets were collected after each episode was released.
- Pre-processed Twitter Tweets
This is a simple social media dataset composed of pre-processed tweets for sentiment analysis. The tweets have been organized into positive, neutral, and negative categories.
- Russian Troll Tweets
During an investigation into Russia’s influence on the 2016 US election, Twitter deleted 200,000 Russian troll tweets. This Twitter dataset includes details on both the individual tweets and accounts from which they were posted.
- Sentiment 140
Sentiment 140 is a tool for discovering the overall sentiment toward a brand, topic, or product on Twitter. Its creators have also made the training data available for download on the Sentiment 140 site.
- SMILE Twitter Emotion
A simple dataset for sentiment analysis, the SMILE Twitter Emotion dataset contains 3,085 tweets, each labeled with one of five emotions: anger, disgust, happiness, surprise, or sadness.
- Stanford SNAP Twitter Dataset
From the SNAP library database at Stanford University, this dataset contains 476 million tweets from 20 million users over a 7-month period.
- Top 20 Most-Followed Users on Twitter
This Twitter dataset is composed of over 52,000 tweets from the 20 most-followed Twitter profiles. Retweets were not collected for this dataset.
- Twitter Airline Sentiment
The Twitter US Airline Sentiment Dataset contains tweets about major US airlines classified into the following categories: positive, neutral, and negative.
- Twitter Friends
Twitter Friends is a machine learning dataset of user information. It contains the following fields: avatar, follower count, friends count, account name, user ID, accounts the user is following, user’s language, last post info, hashtags used by the user, and the ID of the user’s last tweet.
- Twitter News Dataset
This Twitter dataset contains 5,234 news events from Twitter, as well as the tweets talking about those news events.
- Twitter User Data
A Twitter dataset composed of 20,000 rows, Twitter User Data includes the following information: user name, random tweet, account profile, image, and location information.
- UMass Global English on Twitter Dataset
Including over 10,000 tweets, this dataset was created to build classifiers that identify the language of tweets. Each tweet is annotated as English, non-English, code-switched, ambiguous in language, or automatically generated. The tweets came from 130 countries.
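Several of the datasets above were built by collecting tweets that contain a given hashtag or mention and then labeling them as positive, neutral, or negative. As a rough illustration of that workflow, here is a minimal Python sketch that filters tweets by hashtag or mention and counts the sentiment labels. The sample rows, the column names `text` and `sentiment`, and the helper functions are all hypothetical; none of them reflects the actual schema or collection method of any dataset listed here.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample data. The column names "text" and "sentiment" are
# assumptions for illustration, not the schema of any dataset listed above.
SAMPLE_CSV = """text,sentiment
"Loving the new phone from @apple #AAPL",positive
"#AAPL is down again today",negative
"Watching #AvengersEndgame tonight",neutral
"Is @Apple support even awake? #aapl",negative
"""

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def matches(text, hashtags=(), mentions=()):
    """Return True if the tweet contains any wanted hashtag or mention.

    Matching is case-insensitive, so "#aapl" and "#AAPL" are the same tag.
    """
    tags = {t.lower() for t in HASHTAG_RE.findall(text)}
    users = {u.lower() for u in MENTION_RE.findall(text)}
    wanted_tags = {h.lower().lstrip("#") for h in hashtags}
    wanted_users = {m.lower().lstrip("@") for m in mentions}
    return bool(tags & wanted_tags or users & wanted_users)


def label_counts(rows):
    """Count how many rows carry each sentiment label."""
    return Counter(row["sentiment"] for row in rows)


rows = load_rows(SAMPLE_CSV)
apple_rows = [
    r for r in rows
    if matches(r["text"], hashtags=["#AAPL"], mentions=["@apple"])
]
counts = label_counts(apple_rows)
print(len(apple_rows), dict(counts))
```

Checking the label distribution this way before training is a quick way to spot class imbalance. To use it on a downloaded file, replace `SAMPLE_CSV` with the file contents and change the column names to match that dataset's documentation.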
Can’t find the data you’re looking for? From linguistic annotation to text classification, translation corpus data, and more, Lionbridge provides a wide array of AI training data services.