- Who is the data set about? Who was sampled in this data set? Who was oversampled or undersampled? Are they representative of the main characters in Assignment 1? Is there any identifiable information, or is there any risk of disclosing identifiable information? This is fundamentally about sampling and anonymity.
The dataset covers Twitter users who posted tweets containing the keyword ‘AAPL’. Because many users tweet about the stock, the keyword search yields a large volume of data to analyze. The dataset contains only publicly available information and no private or undisclosed information. These users are the main characters identified in Assignment 1.
- What events, activities, behaviors, and observations etc. are recorded by the data set? Does the data set record the targeted events, activities, behaviors, etc. in Assignment 1? This is fundamentally about the variables.
The dataset contains tweets extracted from Twitter. Each record includes user information, tweet text, hashtags, creation time, and entity information. These are the key variables used in the project, and they cover all the variables required in Assignment 1.
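As a rough illustration of the variables listed above, the sketch below pulls them out of a single tweet record. It assumes the records follow the JSON layout of the Twitter v1.1 API (`user`, `text`, `entities.hashtags`, `created_at`); the report does not state the actual schema, so the field names, the `extract_fields` helper, and the sample record are all hypothetical.

```python
import json


def extract_fields(tweet):
    """Return the variables named in the answer from one tweet record.

    Assumption: the record uses Twitter v1.1-style keys. Missing keys
    fall back to None or an empty value instead of raising an error.
    """
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    return {
        "user": user.get("screen_name"),
        "text": tweet.get("text", ""),
        "hashtags": [tag["text"] for tag in entities.get("hashtags", [])],
        "created_at": tweet.get("created_at"),
        "location": user.get("location"),
    }


# Made-up sample record, for illustration only.
sample = json.loads("""
{
  "created_at": "Thu Oct 01 14:30:00 +0000 2020",
  "text": "Watching #AAPL ahead of earnings",
  "user": {"screen_name": "example_user", "location": "New York"},
  "entities": {"hashtags": [{"text": "AAPL"}]}
}
""")

fields = extract_fields(sample)
print(fields["user"], fields["hashtags"])
```

Returning a flat dictionary keeps each record easy to write out as one row for later processing.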
- When did the event, activity, behavior, and observation, etc. take place? When were the data collected? Is it longitudinal or cross-sectional? Are they real-time data? How old or fresh are the data? To what extent can generalizations be made across time to inform Assignment 1? This is fundamentally about the temporal structure of the data set, and the external validity of the data set across time.
Each record is generated when a user tweets about AAPL stock on Twitter, and many users do so. Our dataset consists of all tweets containing the keyword AAPL that were posted from Oct 1st to Oct 28th. It is therefore historical data, not real-time data. The collected data is static, so analyzing other date ranges in the future would require extracting more data. It satisfies the abstract stated in Assignment 1.
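The selection rule described above (keyword AAPL, posted between Oct 1st and Oct 28th) can be sketched as a simple filter. This is only an illustration under stated assumptions: the timestamp format is the Twitter v1.1 `created_at` style, the helper names are hypothetical, and the year is a placeholder because the report gives only the month and days.

```python
from datetime import datetime, timezone

# Assumption: timestamps look like "Thu Oct 01 14:30:00 +0000 2020".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Placeholder year: the report states only "Oct 1st to Oct 28th".
YEAR = 2020
START = datetime(YEAR, 10, 1, tzinfo=timezone.utc)
# The end bound is exclusive, so all of Oct 28th is included.
END = datetime(YEAR, 10, 29, tzinfo=timezone.utc)


def mentions_keyword(text, keyword="AAPL"):
    """Case-insensitive check that the tweet text contains the keyword."""
    return keyword.lower() in text.lower()


def in_window(created_at, start=START, end=END):
    """True if the tweet timestamp falls inside [start, end)."""
    posted = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    return start <= posted < end


def keep(tweet):
    """Apply both selection rules to one tweet record."""
    return mentions_keyword(tweet["text"]) and in_window(tweet["created_at"])
```

A half-open window avoids having to decide what the last representable instant of Oct 28th is.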
- Where did the event, activity, behavior, and observation, etc. take place? Where were the data collected, if the information is available? What does the geographical coverage of the data set look like? Does the data set contain geographical information (GIS)? Is this a local, regional, national, or global data set? To what extent can generalizations be made across settings to inform Assignment 1? This is fundamentally about geographic variables in the data set, and the external validity of the data set across settings.
The dataset consists of tweets extracted from Twitter, an online social media platform with a large user base and a huge amount of data. The dataset contains geographical information about the location each tweet was posted from. It has no geographical limitations, so its coverage is global. It contains the data described in Assignment 1.
- Why did the event, activity, behavior, or observation etc. take place? Why were the data collected?
Twitter is an important data source. Many people post their opinions on Twitter. Likewise, many companies and investors post news and opinions there, which affect stock trends. Hence we use these opinions (tweets) to analyze stock movement.
- How: If you would like, you can add a dimension of how. How did it happen? Sometimes, the answer to how can be covered by what, when and where.
Most of this was already covered above. To be precise, we collect tweets from Twitter, run sentiment analysis to determine whether each opinion is positive or negative, and then analyze the whole dataset using the Hadoop ecosystem.
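To make the positive/negative labelling step concrete, here is a minimal sketch of one possible approach: a tiny word-list scorer followed by a count per label. It is not the project's actual sentiment method, which the report does not specify; the word lists, the `classify` and `tally` helpers, and the added neutral label for ties are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists. A real analysis would use
# a trained model or an established sentiment lexicon.
POSITIVE = {"gain", "up", "bullish", "beat", "strong", "buy"}
NEGATIVE = {"loss", "down", "bearish", "miss", "weak", "sell"}


def classify(text):
    """Label a tweet by counting positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # assumption: ties and unknown words count as neutral


def tally(texts):
    """Count tweets per label, like the reduce step of a word count job."""
    return Counter(classify(t) for t in texts)


tweets = [
    "AAPL looks bullish, strong buy",
    "AAPL down after weak guidance",
    "AAPL earnings call today",
]
print(tally(tweets))
```

Because `classify` works on one tweet at a time and `tally` only sums labels, the same split maps naturally onto a map step and a reduce step in a Hadoop job.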
This work was partially sponsored by NSF.
NSF IUSE #1935076
CUE Ethics: Collaborative Research: Open Collaborative Experiential Learning (OCEL.AI): Bridging Digital Divides in Undergraduate Education of Data Science
01/01/2020 – 6/30/2021, $ 350,000
Copyright © 2020 OCEL.AI.