Who is the data set about? Who were sampled in this data set? Who were oversampled or undersampled? Are they representative of the main characters in Assignment 1? Is there any identifiable information, or is there any risk of disclosing identifiable information? This is fundamentally about the sampling issue, and anonymity.
There are two datasets considered in this project.

The first is the stock market dataset. It contains price data for the majority of U.S. stocks listed on the exchanges. We use this data to predict future stock values.

The second is the news article dataset. It contains news articles related to those stocks. We use this dataset for sentiment analysis to support the stock value prediction.
Both stock data and the news archive have been sampled here. The use case is the analysis of how news data relates to stock price fluctuations: predicting a continuous variable (price) from the parameters in place by understanding their fluctuations.

The question of over- and under-sampling arises across the different companies and their news archives, which ultimately drive the price signal. About 725 firms have converging data, appearing in both the news and the stock datasets. The graph below shows the imbalance between the two datasets.
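The overlap between the two datasets can be measured directly from their ticker columns. A minimal sketch with plain Python sets (the ticker values here are illustrative, not taken from the real files, where the intersection is about 725 firms):

```python
# Sketch: find the firms that appear in BOTH datasets.
news_tickers = {"AAPL", "MSFT", "TSLA", "XOM"}    # tickers seen in the news archive
stock_tickers = {"AAPL", "MSFT", "GOOG", "XOM"}   # tickers seen in the price data

common = news_tickers & stock_tickers     # firms with converging data
news_only = news_tickers - stock_tickers  # news coverage but no price history
stock_only = stock_tickers - news_tickers # price history but no news coverage

print(len(common), sorted(common))  # → 3 ['AAPL', 'MSFT', 'XOM']
```

The relative sizes of `common`, `news_only`, and `stock_only` are what the imbalance graph visualizes.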
The datasets are public. They contain identifiable information about the firms we analyze, but this consists of publicly available stock prices and news articles, which we use for analysis.
What events, activities, behaviors, and observations etc. are recorded by the data set? Does the data set record the targeted events, activities, behaviors, etc. in Assignment 1? This is fundamentally about the variables.
The first dataset holds each company's ticker symbol and its open, close, high, low, and volume values.

The second dataset holds the company's ticker symbol; the title, category, and content of the news item; the article's release date and provider; and the URL of the news article. The tables below show what data is recorded, the first covering the news data and the second the stock data.
News Archive Data:

| Field | Source |
| --- | --- |
| Content | Original Field |
| Summary | Generated Field |
| Sentiment | Generated Field |
| Ticker Label | Original Field |

Stock Data:

| Field | Source |
| --- | --- |
| Stock Ticker | Generated Field |
| Close | Original Field |
The variables above play a vital role in the model predictions and the time-series analysis. In the News Archive Data, the content is the description provided in the news article, on which we analyze the type of sentiment the article conveys; the ticker label identifies the firm whose stocks we are analyzing. In the Stock Data, the stock ticker labels the firm, and Close is a continuous variable giving the price at which the stock closed that day.
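To use these variables together, the two tables need to be joined on ticker and date. A pandas sketch with hypothetical column names (`ticker`, `date`, `release_date`, `sentiment`; the real Kaggle files may name them differently):

```python
import pandas as pd

# Toy rows standing in for the real files (values are made up).
stocks = pd.DataFrame({
    "ticker": ["AAPL", "AAPL"],
    "date": pd.to_datetime(["2020-01-02", "2020-01-03"]),
    "close": [75.1, 74.4],
})
news = pd.DataFrame({
    "ticker": ["AAPL"],
    "release_date": pd.to_datetime(["2020-01-02"]),
    "sentiment": [0.8],
})

# Left join: keep every trading day, even days with no news coverage.
merged = stocks.merge(
    news,
    left_on=["ticker", "date"],
    right_on=["ticker", "release_date"],
    how="left",
)
print(merged[["ticker", "date", "close", "sentiment"]])
```

Days without an article get a missing sentiment value, which the downstream model must handle (e.g. fill with a neutral score).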
When did the event, activity, behavior, and observation, etc. take place? When were the data collected? Is it longitudinal or cross-sectional? Are they real-time data? How old or fresh are the data? To what extent can generalization be made across time to inform Assignment 1? This is fundamentally about the temporal structure of the data set, and the external validity of the data set across time.
Our goal is to demonstrate the model's performance on historic stock prices, but the model can also predict the current day's stock value based on that historic knowledge.
- Firstly, the datasets focus entirely on companies and stocks in the USA. The data is longitudinal rather than cross-sectional: it tracks prices and news for the same firms over time instead of capturing a single snapshot. Some articles do reference other countries' impact on financial news, but this is not central to the analysis, which is focused on the United States.
- Secondly, the data is not real-time; it is derived from historic stock prices and historic news archives for the firms whose stocks we evaluate.
- Furthermore, both datasets are fairly recent. They include recent news articles about some currently listed companies, and they show how values change over time. Our goal is to determine whether the sentiment of a particular company's news (from the archives) alters its stock price and, if not, to observe how the values change over time; this determines how far the findings generalize across time.
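The longitudinal span of the data can be checked directly from the date column. A small pandas sketch with made-up dates (the real files cover far more rows and a longer period):

```python
import pandas as pd

# Hypothetical sample; the real Kaggle files carry a date per row.
stocks = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-02", "2017-11-10"]),
    "close": [100.0, 120.0],
})

# Earliest and latest observation, and the span between them.
start, end = stocks["date"].min(), stocks["date"].max()
span = end - start
print(start.date(), "->", end.date(), f"({span.days} days)")
```

A multi-year span like this confirms the longitudinal (time-series) structure rather than a cross-sectional snapshot.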
Where did the event, activity, behavior, and observation, etc. take place? Where were the data collected, if the information is available? What does the geographical coverage of the data set look like? Does the data set contain geographical information (GIS)? Is this a local, regional, national, or global data set? To what extent can generalization be made across settings to inform Assignment 1? This is fundamentally about geographic variables in the data set, and the external validity of the data set across settings.
The datasets, collected from Kaggle, are linked below:
https://www.kaggle.com/gennadiyr/us-equities-news-data
https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
Both datasets are:
- U.S.-based stock market and news archive datasets; hence they cover most U.S. stock listings.
- based on U.S. stock market values, making them easier to work with and understand.
In our analysis we set out to perform a common time-series analysis across the two sources, focusing on the companies for which both stock prices and news articles about their financial standing are reported.
Why did the event, activity, behavior, or observation etc. take place? Why were the data collected?
High-quality financial data is expensive to acquire and therefore rarely shared free of cost. Stock news data that links price movements to news content is also hard to find. Using this rare combination of datasets, we were able to analyze the sentiment of each article and predict the change of stock prices over time, so that end users can decide whether a stock is worth a shot.
How: If you would like, you can add a dimension of how. How did it happen? Sometimes, the answer to how can be covered by what, when and where.
Summarization of the data was performed with a BERT-based abstractive summarizer rather than an extractive one, and metrics such as ROUGE (recall-oriented), BLEU (precision-oriented), and an F1 score combining the two were evaluated. After abstractive summarization, each summary is passed to a BERT-based sentiment analyzer, which generates a sentiment score for each record of stock news.
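In their simplest unigram form, the metrics above reduce to token-overlap recall (ROUGE-1-like), precision (BLEU-1-like), and their F1 combination. A self-contained sketch of that idea (not the full ROUGE/BLEU definitions, which add higher-order n-grams, stemming, and brevity penalties):

```python
from collections import Counter

def overlap_metrics(reference: str, summary: str):
    """Unigram overlap: ROUGE-1-style recall, BLEU-1-style precision, F1.

    A simplification for illustration only; real ROUGE/BLEU are richer.
    """
    ref = Counter(reference.lower().split())
    cand = Counter(summary.lower().split())
    overlap = sum((ref & cand).values())          # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)  # ROUGE-like: coverage of reference
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-like: summary accuracy
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return recall, precision, f1

r, p, f = overlap_metrics(
    "the stock price rose after strong earnings",
    "stock price rose on earnings",
)
print(round(r, 2), round(p, 2), round(f, 2))  # → 0.57 0.8 0.67
```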
For the stock data analysis we use a time-series prediction model to forecast the stock price; finally, we integrate the sentiment analysis with the time-series model to see how stock prices vary with the sentiment values derived from the news articles.
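One simple way to integrate a sentiment signal into a price forecast is to regress the next day's close on today's close and today's sentiment score. A toy sketch with synthetic numbers (a stand-in illustration, not the project's actual time-series model):

```python
import numpy as np

# Synthetic series: daily close prices and matching sentiment scores.
close = np.array([100.0, 101.0, 100.5, 102.0, 103.0, 102.5, 104.0])
sentiment = np.array([0.1, 0.5, -0.2, 0.6, 0.4, -0.1, 0.7])

# Design matrix: [today's close, today's sentiment, intercept] -> tomorrow's close.
X = np.column_stack([close[:-1], sentiment[:-1], np.ones(len(close) - 1)])
y = close[1:]

# Ordinary least squares fit, then a one-step-ahead forecast.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred_next = float(coef @ np.array([close[-1], sentiment[-1], 1.0]))
print(round(pred_next, 2))
```

The sentiment coefficient indicates how much, under this toy model, positive news shifts the next day's price beyond what the price history alone predicts.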