Today, in this unconscionably long sidebar, I will look at step five in our data science (DS) diagram.
- Transform the Features
At this point you are chomping at the bit to start playing with some machine learning algorithms on Azure. That is the sort of thing that we geeks live for! But you have one more intervening step: feature engineering. Feature engineering is the art of rearranging your data to make it more useful, and it can be the difference between poor and great results. I call it an “art” because there are no hard and fast rules for deciding which features to engineer or how to engineer them. That is why, when you start building those machine learning models, it is best to experiment with different features and keep the ones whose results best fit the known outcomes.
Depending on your project, this step can be quite complicated, or as simple as combining two data points to make a third. A good example of how small changes in the features can have unexpectedly large effects on the results can be found here: https://gallery.azure.ai/Experiment/Create-useful-features-for-trains-1
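To make the simple end of that range concrete, here is a minimal sketch of combining two data points to make a third. The housing records, the field names, and the add_price_per_square_foot helper are all hypothetical; I made them up purely for illustration, and your own data will suggest different combinations.

```python
# Hypothetical records: each one holds two raw data points.
homes = [
    {"price": 300000.0, "square_feet": 1500.0},
    {"price": 450000.0, "square_feet": 3000.0},
    {"price": 200000.0, "square_feet": 800.0},
]


def add_price_per_square_foot(rows):
    """Return copies of the rows with an engineered 'price_per_sqft' feature."""
    engineered = []
    for row in rows:
        new_row = dict(row)  # copy, so the raw data stays untouched
        new_row["price_per_sqft"] = row["price"] / row["square_feet"]
        engineered.append(new_row)
    return engineered


features = add_price_per_square_foot(homes)
for row in features:
    print(row)
```

Keeping the raw table untouched and writing the engineered rows to a new one makes it easier to try several variations side by side.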
Even experienced data scientists will often use some trial and error here. You may end up building several tables that you will run through the machine learning algorithm in your search for results that fit what is known.
As a starting point, you can color-code your target and plot it against every pair of variables in the table. This may bring hidden relationships to the surface, but be careful not to confuse correlation with causation. When you do see some correlation, and it is plausible that the features really are related, you can look seriously at some possible feature engineering. Finding a combination of two variables that nobody knew was correlated with the target is precisely what makes data science a powerful tool. Features built from correlations like this can make your predictions of the target sharper and more informative.
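If you would rather start with numbers than with plots, a rough way to screen a candidate combination is to compare its correlation with the target against the correlations of the raw variables. The sketch below does this with a hand-written Pearson correlation. The data, the feature names, and the choice of a product as the combined feature are all assumptions for illustration, and a high value here is still only correlation, not causation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data, for illustration only.
feature_a = [1, 2, 3, 4, 5, 6]
feature_b = [6, 5, 4, 3, 2, 1]
target = [6.5, 9.5, 12.5, 11.5, 10.5, 5.5]

# Candidate engineered feature: the product of the two raw features.
a_times_b = [a * b for a, b in zip(feature_a, feature_b)]

correlations = {
    "feature_a": pearson(feature_a, target),
    "feature_b": pearson(feature_b, target),
    "a_times_b": pearson(a_times_b, target),
}
for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}")
```

On this toy data neither raw feature correlates much with the target on its own, while their product does. That is the kind of hidden relationship the pairwise plots are meant to surface.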
Or, you may discover that none of the variables or combinations of variables are helpful in predicting your target. That means that you probably just landed on a snake and need to go back to square one and get more data.
That’s it for today. In my next blog post I will finish this sidebar by looking at steps 6 and 7.