In my last blog, we looked at steps 1 and 2 in the following diagram:
In this episode of the continuing saga that is data science (the general stuff, not TDSP), we will look at parts 3 and 4.
- Put the Data in a Table
Neither you nor an AI can work with unstructured data. If you are dealing with a Data Lake, you will need to extract the necessary data and put it in a structured table that you can work with. The amount of detail you put in your table will depend on your project goals. Each row in your table will represent one event, item, or instance. Each column will be a single feature of those rows.
If you are lucky, the raw data may correspond to your target. If not, you may have to massage it to fit. If, for example, the raw data is being streamed in from an IoT device with hourly time stamps, but your target is an answer based on months, then you will have to reshape your data so that a column corresponds to the target. The same rule applies to your rows: each row must contain a single instance of your target.
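As a sketch of that hourly-to-monthly reshaping, here is one way it might look in pandas. The column names and the date range are hypothetical, invented for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly IoT readings covering roughly three months
idx = pd.date_range("2024-01-01", periods=24 * 90, freq="h")
hourly = pd.DataFrame(
    {"reading": np.random.default_rng(0).normal(50, 5, len(idx))},
    index=idx,
)

# Collapse to one row per month so each row matches a month-level target
monthly = hourly["reading"].resample("MS").agg(["mean", "max", "count"])
print(monthly)
```

Each row of `monthly` is now one month (one instance of the target), and each aggregate becomes a candidate feature column.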
- Check for Quality
Now it is time to step back and spend some time perusing your data. The garbage-in, garbage-out rule is your guide here: you want to ensure that any bad data is either fixed or removed. You will also want to know the rows and columns as well as the proverbial “back of your hand”. This familiarity is necessary if you are going to design good algorithms to extract your target from this mountain of data. Look at and evaluate each column. Ensure that the label is appropriate for the data and is meaningful, and that the data under that label is accurate and reliable.
It is helpful to plot each column of numbers as a histogram. This will give you a quick picture that will highlight outliers that may be candidates for further inspection.
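That histogram pass can be sketched in a few lines. The DataFrame and its columns here are hypothetical stand-ins for your own table:

```python
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Hypothetical table: one row per unit, one numeric column per feature
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "temperature_c": rng.normal(21, 2, 500),
    "lifetime_months": rng.exponential(36, 500),
})

# One histogram per numeric column; outliers appear as isolated bars
# far from the main mass of the distribution
axes = df.hist(bins=30, figsize=(10, 4))
plt.savefig("histograms.png")
```

A long, thin tail or a lone bar far from the rest is your cue to go look at those rows individually.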
Having thoroughly inspected the data, it is now time to correct the things that can be corrected and remove the things that cannot. Correcting labels is the easy part. What do you do if you find values that are simply wrong? A value showing that a product failed after 20 years of use, when that product has only been in production for 5 years, will require further scrutiny. Perhaps the correct value was 20 months, or perhaps it was 2 years? You have a few choices here. You can correct the value, if the correction is obvious. You can remove the entire row, if the egregious value is critical to understanding the row. Or you can try to track down the correct answer (not generally feasible).
Whatever choice you make, make sure the decision pertains to a known incorrect value, not just that the value does not fit with your preconceived ideas as to what it should be. Removing data that you think is simply undesirable is unethical and will likely lead you to a wrong conclusion.
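A minimal sketch of that distinction: flag only values that violate a *known* domain rule, and set them aside for follow-up rather than silently dropping them. The 5-year production limit and the column names are hypothetical:

```python
import pandas as pd

# Hypothetical failure records; lifetimes are in months
df = pd.DataFrame({
    "unit_id": [101, 102, 103, 104],
    "lifetime_months": [18, 240, 31, 55],  # 240 months = 20 years
})

# Domain rule: the product has only existed for 5 years (60 months),
# so any larger lifetime is known to be impossible, not merely unusual
MAX_POSSIBLE_MONTHS = 60
impossible = df["lifetime_months"] > MAX_POSSIBLE_MONTHS

# Keep the suspect rows for investigation; keep the rest for modeling
suspects = df[impossible]
clean = df[~impossible]
print(suspects)
```

Note the cutoff encodes a verifiable fact about the product, not a statistical preference, which is what keeps the removal ethical.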
Replacing Missing Values
Missing values in large data sets are par for the course. No matter the reason for the missing data, you will have to make some adjustments. Most machine learning algorithms do not deal well with blanks, so you will have to fill them in.
As you are now the expert, having thoroughly perused and familiarized yourself with the data, you are going to have to make some judgement calls on how best to fill in those blanks. Every data set and every column will need to be massaged in its own special way. This is where your knowledge and judgement are used to earn your keep.
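As a sketch of two common column-by-column judgment calls: a median fill for a numeric column (less distorted by outliers than the mean) and an explicit sentinel category for a categorical one. The columns and values are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical table with blanks in two columns
df = pd.DataFrame({
    "lifetime_months": [18, np.nan, 31, 55, np.nan],
    "region": ["north", "south", None, "north", "south"],
})

# Numeric column: fill blanks with the median of the observed values
df["lifetime_months"] = df["lifetime_months"].fillna(df["lifetime_months"].median())

# Categorical column: an explicit "unknown" keeps the fact of
# missingness visible to the model instead of hiding it
df["region"] = df["region"].fillna("unknown")

print(df)
```

The point is not these particular choices but that each column gets its own deliberate treatment, informed by what you learned while perusing the data.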
If you find after all this work that you have very little data left, then it is time to head back to step 1 and get more data. It is better to figure this out here than to build a model using little or bad data. You are looking for the right answer, or perhaps a good answer, not any answer.
In my next blog, I will look at step 5 – Transform the Features.