The first step in any Data Science or Machine Learning workflow is cleaning the raw data. Raw data is typically messy and unstructured and cannot be used directly to build Machine Learning models. Without clean, well-formatted data it is difficult to surface insights during Data Exploration, and the problems with messy data become most apparent once model training begins. The central idea is that the data must be clean and ready before it is fed to a model. In the context of Machine Learning and Data Science, data cleaning is the process of filtering and altering the data so that insights can be derived easily and the data ends up in the format the model requires: filtering out the parts that are irrelevant to the business problem at hand, and reshaping the parts that are needed so they can be used in the business context.
Broadly, in any machine learning workflow, data cleaning follows data collection and data wrangling and plays a vital role in preparing the data to be fed into a model. This stage deals with handling missing values, encoding categorical data that was collected as text, dropping redundant features, and reducing dimensionality using standard dimensionality reduction techniques. Together, these steps prepare the data set as a whole to be applied to any machine learning algorithm.
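Two of these steps, handling missing values and dropping redundant features, can be sketched with pandas. This is a minimal illustration on a made-up DataFrame; the column names and the mean-imputation strategy are assumptions, not prescriptions:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with one missing value and one redundant feature
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "income": [40000, 52000, 61000, 75000],
    "income_copy": [40000, 52000, 61000, 75000],  # exact duplicate of "income"
})

# Missing value treatment: fill the numeric gap with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Drop the redundant feature
df = df.drop(columns=["income_copy"])

print(df.isna().sum().sum())  # 0 -- no missing values remain
print(list(df.columns))       # ['age', 'income']
```

Mean imputation is only one of several strategies (median, mode, or model-based imputation are common alternatives); the right choice depends on the data and the business problem.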
Raw data can contain many different types of data, both structured and unstructured, and must be processed to bring it into a form usable by Machine Learning models. At a high level, data is divided into Structured Data and Unstructured Data, and structured data is further classified into Numeric and Categorical data. Some of the common structured data types used in Machine Learning and Data Science are listed below:
- Continuous Data: Numeric data that can take any value in a specified interval. This type of data is also called interval, float or numeric data. Arithmetic operations such as the mean and standard deviation can be computed on continuous data, and statistical procedures such as the Pearson correlation coefficient, the t-test and the F-test can be applied to it to gain insights.
Examples: Height or Weight of an individual, Rate of Interest on loans, etc.
- Discrete Data: Data that can take on only integer values. This type of data usually arises from counting the number of occurrences of an event; discrete data cannot take floating-point or decimal values.
Examples: Student count in a class, Colour count in a Rainbow.
- Nominal Data: Categorical data that has no explicit ordering associated with it. Nominal data is used plainly as labels. Statistical operations such as the mean, median or standard deviation cannot be performed on nominal data, as the results of such operations would not imply anything insightful.
Examples: States in a country, zip codes of areas.
- Ordinal Data: Categorical data that has an explicit, definitive ordering associated with it. Calculations such as frequency distributions, percentage-of-total calculations and other non-parametric statistics can be performed with ordinal data. However, mean and standard deviation calculations and other parametric statistical tests make no sense for this kind of data.
Examples: Ratings for a restaurant (e.g. very good, good, bad, very bad), level of education of an individual (e.g. Doctorate, Post Graduate, Undergraduate), etc.
- Binary Data: A special case of Nominal data that carries no order and can take only two values. The operations that can be performed on binary data are the same as those performed on nominal data.
Examples: Gender (male or female), Fraudulent transaction (Yes or No), Cancerous Cell (True or False).
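These structured data types map naturally onto pandas dtypes. The sketch below, with hypothetical column names drawn from the examples above, shows one way each type might be represented in a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170.2, 158.7, 182.4],       # continuous (float)
    "student_count": [30, 28, 35],            # discrete (integer)
    "state": ["Texas", "Ohio", "Iowa"],       # nominal (unordered labels)
    "rating": pd.Categorical(
        ["good", "very good", "bad"],
        categories=["very bad", "bad", "good", "very good"],
        ordered=True,                         # ordinal (explicit ordering)
    ),
    "fraudulent": [True, False, False],       # binary (two values)
})

print(df.dtypes)
```

Declaring ordinal columns with an ordered `pd.Categorical` preserves the ordering information that plain strings would lose, which matters for the encoding step discussed next.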
After identifying the data types of the features present in the data set, the next step is to process the data so that it is suitable for Machine Learning models. This pre-processing stage consists of several steps, including Missing Value Treatment, Feature Encoding, Dimensionality Reduction and so on. Of these techniques, the one that relates most closely to the data types discussed above is Feature Encoding: the conversion of Categorical features to numeric values, since Machine Learning models cannot handle text data directly. The performance of most Machine Learning algorithms varies with the way Categorical data is encoded. Three popular techniques for converting Categorical values to Numeric values are:
- Label Encoding.
- One Hot Encoding.
- Binary Encoding.
- Label Encoding: In this technique, each category in the data is assigned a value from 1 to N (where N is the number of distinct categories present in the data). Label encoding is applied to ordinal data. The values from 1 to N are assigned in either increasing or decreasing order; once ascending or descending is chosen, it is fixed for all values in the column and cannot be changed part-way through. The only restriction that comes with ordinal data is that the definitive order must be respected in either direction: for level of education, the order should be either Doctorate, Post Graduate, Undergraduate or Undergraduate, Post Graduate, Doctorate. It makes no sense to order the categories randomly, such as Post Graduate, Doctorate, Undergraduate. Applying this technique to the level of education of an individual would look something like:
| Level of Education | Label Encoding on Level of Education |
| --- | --- |
| Undergraduate | 1 |
| Post Graduate | 2 |
| Doctorate | 3 |
An alternative way would look like:
| Level of Education | Label Encoding on Level of Education |
| --- | --- |
| Doctorate | 1 |
| Post Graduate | 2 |
| Undergraduate | 3 |
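A minimal way to apply label encoding in pandas is an explicit mapping dictionary, which keeps the ordinal order under your control (the column name here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "level_of_education": ["Undergraduate", "Doctorate", "Post Graduate", "Undergraduate"]
})

# Fixed ascending order: Undergraduate < Post Graduate < Doctorate
order = {"Undergraduate": 1, "Post Graduate": 2, "Doctorate": 3}
df["education_encoded"] = df["level_of_education"].map(order)

print(df["education_encoded"].tolist())  # [1, 3, 2, 1]
```

An explicit dictionary is preferable to an automatic encoder (such as scikit-learn's `LabelEncoder`, which assigns codes alphabetically) when the data is ordinal, because it guarantees the encoding follows the definitive order rather than an arbitrary one.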
- One Hot Encoding: In this technique, each category present in the feature is mapped to a vector of 1s and 0s indicating the presence or absence of that category. The number of vectors equals the number of categories present in that particular feature. If a feature has a very large number of categories, this technique greatly increases the number of columns in the data set and can significantly slow down the training of the algorithm. One hot encoding is usually applied to the nominal data present in the data set.
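One hot encoding can be sketched with pandas' `get_dummies`, which creates one indicator column per category (the column name and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"state": ["Texas", "Ohio", "Texas", "Iowa"]})

# One indicator column per category; 1 marks presence, 0 absence
encoded = pd.get_dummies(df["state"], prefix="state", dtype=int)

print(encoded.columns.tolist())  # ['state_Iowa', 'state_Ohio', 'state_Texas']
print(encoded.loc[0].tolist())   # [0, 0, 1]  -> row 0 is "Texas"
```

For a pipeline that must encode unseen data at prediction time, scikit-learn's `OneHotEncoder` is the more common choice, since it remembers the categories seen during fitting.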
- Binary Encoding: Binary Encoding is a special case of One Hot Encoding in which the column has only two categories. The data in that column is replaced with either 0 or 1, with no order implied, meaning that here 1 is not greater than 0. Binary encoding for a particular column of data is done in the following way:
| Fraudulent_Transaction | Binary Encoded |
| --- | --- |
| Yes | 1 |
| No | 0 |
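The table above can be reproduced with a one-line boolean comparison in pandas (column name assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({"fraudulent_transaction": ["Yes", "No", "No", "Yes"]})

# Two categories collapse into a single 0/1 column; the 1 carries no ordering
df["fraud_encoded"] = (df["fraudulent_transaction"] == "Yes").astype(int)

print(df["fraud_encoded"].tolist())  # [1, 0, 0, 1]
```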
Applying the techniques discussed above after identifying the data types of the features in the data set is not only important but necessary to make the data ready for a machine learning model. The encoding techniques covered here are the most popular and are applicable to most data sets that need to be modelled using Machine Learning algorithms.