Guideline for Machine Learning
There are three steps to machine learning:
- Data cleaning: handling duplicates, missing data, outliers (e.g., using z-scores), etc.
- Data preprocessing: building and experimenting with different statistical models.
- Machine learning: once the first two steps are done, you are ready to train the machine using the data and the model.
Step 1: Data cleaning is a process of ensuring data quality.
You may already know how to deal with these data cleaning issues from Statistics 101, but here is a quick recap:
- How do you deal with duplicated and missing data? These issues often arise from simple human error or incomplete input, and common sense plus a few basic statistical procedures can usually address them.
- How do you deal with multiple datasets that do not follow the same format?
- How do you deal with data arriving in different formats?
- Do you have enough data to run your models?
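As a rough illustration, the cleaning questions above might be handled like this in pandas. All column names and values here are invented for illustration, and the z-score cutoff of 2 is just one common convention, not a rule:

```python
import pandas as pd

# Hypothetical car-listing data with a duplicate row, a missing value,
# and one extreme price (all names and values are made up).
df = pd.DataFrame({
    "age_years": [3, 3, 5, 8, 2, 4, 6, 7, 9, 1],
    "price": [15000, 15000, 12000, None, 16000,
              14000, 13000, 15500, 12500, 900000],
})

df = df.drop_duplicates()            # remove exact duplicate rows
df = df.dropna(subset=["price"])     # drop rows missing the key variable

# Flag univariate outliers with a simple z-score rule.
z = (df["price"] - df["price"].mean()) / df["price"].std()
df = df[z.abs() < 2]                 # keep only non-extreme prices
```

In a real project you would inspect the flagged rows before dropping them; an "outlier" is sometimes the most interesting observation in the dataset.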
Step 2: Data preprocessing is the process of building the statistical models that will be implemented in machine learning. This is a very important step: it ensures that the results are interpretable and accurate.
A unique feature of OCEL.AI is using domain expertise to create interpretable models that bring AI out of the black box. The process is analogous to building a regression model with the hierarchical method versus the enter method. Current machine learning resembles the exploratory “enter method”: you put all the variables into the regression model and see which ones turn out significant. The problem with the enter-method approach is that the results are often uninterpretable, despite their accuracy. For example, suppose you predict tax returns with the enter method and your model shows that commuting distance is a significant predictor of the size of the tax return. How do you interpret that result? What does it mean? Is it a correlation or a causal relationship? Are there mediators or moderators? If you cannot interpret the model, people will ask: “Are you kidding me? I need to live close to my workplace to get a bigger tax return?”
To avoid this kind of black-box situation, OCEL.AI emphasizes interpretable modeling: we implement use cases and storytelling to build a theoretical model that guides machine learning. The process is like hierarchical regression: you are guided by a theoretical framework (the use case), with additional help from the enter method. It relies heavily on domain knowledge, such as knowledge of economics, journalism, or advertising. Based on that domain knowledge, you deliberately build models around hypotheses, just as in hierarchical regression, where a theory guides the decisions about which variables to enter first, which to enter next, and which to drop from the model.
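The hierarchical idea can be sketched numerically. Below is a minimal, self-contained illustration on synthetic data with made-up variable names: a theory-driven predictor is entered first, then a second, atheoretical variable is added, and the R-squared increment tells you how much the extra block really contributes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic illustration: "income" is the theory-driven predictor;
# "noise_var" is an atheoretical extra variable with no real effect.
income = rng.normal(size=n)
noise_var = rng.normal(size=n)
y = 2.0 * income + rng.normal(scale=0.5, size=n)   # outcome variable

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Hierarchical step 1: enter the theoretically motivated variable first.
r2_block1 = r_squared(income.reshape(-1, 1), y)

# Hierarchical step 2: add the second block and check the R^2 increment.
r2_block2 = r_squared(np.column_stack([income, noise_var]), y)
increment = r2_block2 - r2_block1   # near zero -> block 2 adds little
```

Because the theory-driven variable is entered first, a negligible increment from the second block is itself an interpretable finding: the extra variable earns no place in the model.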
In the current training round, we encourage you to use personal experience to build use cases (“theoretical frameworks”). This differs from what domain experts usually do, which is to start by surveying existing theories. There are two reasons for the difference: 1) it saves time, since this is a training exercise; and 2) in practice, the big data approach often ventures into uncharted waters. That is why I call use-case development “the grounded theory approach.”
When applying this process to real problem solving beyond this training, we will have to build use cases by relying on both existing theories (domain knowledge) and “the grounded theory approach.”
Many of these data preprocessing questions can be answered with knowledge and training in statistics.
The bigger picture is: transform the data, check the assumptions, and then build the statistical models.
- Transforming data: performing the necessary transformations on categorical or continuous variables, for example dummy coding nominal variables or standardizing continuous variables where necessary.
- Scaling: ensuring that different variables carry balanced weight. For example, car age may range from 1 to 20 years while mileage ranges from 0 to 200,000 miles; without scaling, mileage would dominate any distance-based computation.
- Outliers: these include univariate and multivariate outliers. How will you identify outliers and deal with them?
- Skewed data: certain statistical models assume normally distributed data. How do you know whether your data are normally distributed? If they are not, what transformation will you apply?
- How are you going to deal with imbalance in the data?
- Unequal sample sizes. For example, comments on male STEM professors may be five times as numerous as comments on female STEM professors.
- Too much data on one variable and too little on another. For example, you have a large number of tweets about road conditions but very little information on household income.
- Missing data: too many missing values in one variable, such as a gender field that is mostly blank.
- Checking assumptions: different statistical models have different assumptions, such as independence of observations, normality, etc. How do you check whether all assumptions are met?
- What is your theory about the important or significant features/variables you want to analyze? Which ones will you select? What is your basis for including or excluding a variable from the model? Do not forget to revisit your use case (“theoretical framework”) or to check for significant variables using the “enter method.”
- Discovering relationships: association, co-occurrence, correlation, causality, etc.
- How do you protect privacy and prevent the re-identification of individuals from anonymized data? Starting now, we should also think about the ethical issues of data science.
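To make the transformation and scaling items above concrete, here is one possible sketch in pandas. The variable names and values are invented; `get_dummies` handles the dummy coding of the nominal variable, and a simple z-score handles the standardization:

```python
import pandas as pd

# Hypothetical dataset mixing one nominal and two continuous variables
# (all names and values are made up for illustration).
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "electric"],
    "age_years": [2, 10, 5, 1],
    "mileage": [20000, 150000, 60000, 8000],
})

# Dummy-code the nominal variable: one 0/1 column per category.
df = pd.get_dummies(df, columns=["fuel"])

# Standardize the continuous variables so neither dominates by scale.
for col in ["age_years", "mileage"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()
```

After this step, age and mileage are on the same scale (mean 0, standard deviation 1), so a model cannot favor mileage merely because its raw numbers are larger.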
At the end of data preprocessing, you should know which variables and which kind of statistical model (for example, regression, clustering, or PCA) you are going to apply in training the machine.
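As one example of that final choice, here is a hedged sketch of PCA, one of the model families mentioned above, implemented directly with NumPy's SVD on synthetic, made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 100 observations of 3 strongly correlated variables
# (invented for illustration), so one component should dominate.
base = rng.normal(size=(100, 1))
X = base @ np.ones((1, 3)) + 0.1 * rng.normal(size=(100, 3))

Xc = X - X.mean(axis=0)                      # center each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / (S**2).sum()              # variance share per component
scores = Xc @ Vt[:2].T                       # data projected onto top 2 PCs
```

Inspecting `explained` tells you how many components are worth keeping; in this toy example nearly all the variance falls on the first component, exactly the kind of interpretable summary the preprocessing step is meant to produce.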