TDSP Lifecycle – Data Acquisition and Understanding

Microsoft in the News

In one of my previous blogs, I talked briefly about the rate of change in our industry and how hard it is to keep up.  You really have to pick your battles.  There are so many industry-changing technologies being developed that you cannot possibly stay current in more than a handful of them.  To add to this conundrum, I keep throwing new things at you and encouraging you to explore more, and even make a hobby out of some of them, because they are poised to change our industry (Blockchain, for example).

Sorry, but I am not going to stop doing that.  As Thomas Huxley once said, “Try to learn something about everything and everything about something.”  To my core, I’m a data geek with a predilection for SQL.  But to do my job well, I must also know how to tweak servers, process a bank loan, conduct data science analysis, program in a few languages, manage employees, and pre-order my Starbucks using an app.  There will likely come a day when I have to work on a project involving Blockchain, and I need at least a basic understanding of it to offer meaningful input on how my data background can help.

So, with that understanding, for those of you who want to continue being indispensable, here is another tech avenue to explore:

On July 23rd, Microsoft announced an open-source project that lets you learn how to program for quantum computing at your own pace.  One day, perhaps in the not-too-distant future, someone is going to announce that they have a quantum computer for sale.  You have to know that this will change everything, and the demand for programmers will explode overnight.

If you have any interest in exploring quantum computing, or riding that demand wave, check out the self-paced tutorials here:  https://github.com/Microsoft/QuantumKatas

In karate, “katas” are detailed patterns of movement that you practice repeatedly.  In programming, a kata is an exercise designed to hone skills through practice and repetition.  These Quantum Katas will teach you the elements of quantum computing as well as the Q# programming language.


TDSP Lifecycle – Data Acquisition and Understanding


OK, we are back on track and again talking about TDSP (Microsoft’s “Team Data Science Process”).  For those of you who are new to my blog, I started this TDSP series back on March 16.  I posted three TDSP blogs in quick succession, followed by a four-part sidebar on how to ask the right questions when applying Data Science.

With today’s blog, we pick up where we left off, but here is a visual reminder of the process, and where we are in that process:

[Figure: the TDSP lifecycle diagram, with the Data Acquisition and Understanding phase highlighted]

For each step in the TDSP, Microsoft provides goals, tasks, and artifacts to help ensure the success of your Data Science project and make it easier for you to organize your team.

Goals

Within the Data Acquisition & Understanding phase, the goals you will want to achieve are:

  1. Create a data set that is clean, reliable, and usable.
    1. There must be a clearly understood relationship between the data and the target variables.
    2. The data must be clean and organized so that it can be used by machine learning or AI models.
    3. The data must be located in the analytics environment.
  2. Create a solution architecture for the data pipeline that refreshes and scores the data at regular intervals.

How to do it

Move the data:  Move the data you plan to use into the analytics environment.  Only move the data you think will contribute to achieving your target; noise and poor-quality data will just make the next step that much more difficult.
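
To make that concrete, here is a minimal pandas sketch of trimming a raw extract down to candidate features before landing it in the analytics environment.  The file names and loan columns are invented for illustration; your staging location might be blob storage or a data lake rather than a local folder:

```python
import pandas as pd
from pathlib import Path

# Load the raw extract from the source system (file name is illustrative).
raw = pd.read_csv("loan_applications_raw.csv")

# Keep only the columns we believe contribute to the target;
# everything else is noise we would rather not carry forward.
candidate_features = ["applicant_income", "loan_amount", "credit_score", "defaulted"]
subset = raw[candidate_features]

# Land the trimmed set in the analytics environment (a local staging
# folder here, standing in for blob storage or a data lake).
Path("analytics_staging").mkdir(exist_ok=True)
subset.to_csv("analytics_staging/loan_applications.csv", index=False)
```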

Explore the data:  You cannot rely on machine learning to do all the heavy lifting.  These algorithms are only as good as the data and instructions being fed to them.  To provide the proper food for your ML or AI, you must be intimately familiar with the data and the relationships within it.  You can start the process by going through and cleaning it up.

In a perfect world, everyone remembers to fill in all the blanks using the proper units.  When you see “6” in a field that is supposed to represent someone’s height in inches, you know somebody dun messed up.  Is that supposed to be six feet (72 inches), or was it supposed to be 62 inches?  Guessing doesn’t help here, so we clean.  To help you find the messy bits, use data summarization and visualization.
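
Here is a hedged sketch of handling exactly that height problem (the file and column names are assumptions): flag values that cannot possibly be inches and handle them explicitly rather than guessing, then summarize to find the rest of the mess:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("analytics_staging/survey.csv")  # illustrative file

# Adult heights in inches plausibly fall between roughly 48 and 90.
# A value of 6 is almost certainly feet or a typo, so mark it missing
# rather than guessing between 72 and 62.
implausible = (df["height_in"] < 48) | (df["height_in"] > 90)
print(f"Implausible height values: {implausible.sum()}")
df.loc[implausible, "height_in"] = np.nan

# Summarization is the quickest way to spot other messy bits:
# wild min/max values, suspicious counts, and so on.
print(df.describe())
```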

The TDSP makes the automated IDEAR utility available within the environment to help you visualize the data and generate reports.  It is an ideal way to start your exploration: a quick and easy way to dive in and get your first taste of what you are dealing with.
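
If you want to supplement that with a quick manual pass, a few lines of pandas and matplotlib (a sketch of the same idea, not IDEAR’s actual output) cover a lot of ground:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("analytics_staging/survey.csv")  # illustrative file

# Histograms of every numeric column give a fast feel for
# distributions, outliers, and suspicious spikes.
df.hist(figsize=(10, 8), bins=30)
plt.tight_layout()
plt.savefig("first_look_histograms.png")

# Missing-value counts per column show where cleaning effort
# should go next.
print(df.isna().sum().sort_values(ascending=False))
```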

If you find that cleaning your data set has left you with too small a sample, or you have come to distrust the overall integrity of the data, it is time to circle back and get more data.

If you are satisfied with the data after it has been cleaned and organized, it is time to start looking for patterns.  If you can discover patterns or correlations before you start modeling, you increase your odds of hitting your target and providing meaningful solutions.  You want to confirm that the data has a connection to the target, and to find correlations among attributes that strengthen that connection.
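
A cheap first pass at those patterns is to rank every attribute by its correlation with the target.  A small sketch, reusing the assumed loan columns from earlier:

```python
import pandas as pd

df = pd.read_csv("analytics_staging/loan_applications.csv")  # illustrative file

# Correlate each numeric attribute with the target and rank by
# absolute strength: strong values hint at useful predictors,
# near-zero values hint at noise.
target = "defaulted"
correlations = df.corr(numeric_only=True)[target].drop(target)
print(correlations.abs().sort_values(ascending=False))
```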

At this point you should be very familiar with the data and the patterns within it.  Now is a good time to reflect on any other data you would like to see added in order to highlight the patterns and correlations you have found.

Get new data:  Data never sleeps.  New and updated data is always being generated, and if you want your model to be based on the latest information, you need a data pipeline that brings it to you.  Azure Data Factory can help you here.  In developing your pipeline, you will have to choose between a batch-based, streaming, or hybrid approach.
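
Azure Data Factory gives you managed scheduling and connectors for this.  Purely to illustrate the shape of a batch refresh, not ADF’s actual API, a pipeline step boils down to something like:

```python
from datetime import date

import pandas as pd

def refresh_batch(source_path: str, staging_dir: str) -> None:
    """One batch run: pull the latest extract, apply the same cleaning
    rules used during exploration, and land the result in the analytics
    environment ready for scoring."""
    new_data = pd.read_csv(source_path)
    new_data = new_data.dropna(subset=["credit_score"])  # illustrative rule
    new_data.to_csv(f"{staging_dir}/batch_{date.today():%Y%m%d}.csv", index=False)

# A scheduler (an Azure Data Factory trigger, cron, etc.) would call
# this on whatever cadence the business needs:
refresh_batch("exports/loans_latest.csv", "analytics_staging")
```

A streaming approach would replace the scheduled call with an event-driven consumer; the cleaning and landing logic stays much the same either way.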

Artifacts

The deliverables for the Data Acquisition and Understanding stage are:

  • Data Quality Report: If you are using the IDEAR utility, you can generate this report quickly and easily.  You are looking to summarize the data, describe the relationships you discovered between attributes and the target, note correlations where one attribute may strengthen or weaken another attribute’s relationship with the target, and record anything else that you, as the data expert, found curious or important.  (A bare-bones starting point is sketched after this list.)
  • Solution Architecture: This is a description or diagram of the data pipeline that will run scoring or predictions on new data once the model is built, including the pipeline for retraining your model on new data.  Store this document in the TDSP project directory for ease of access.
  • Checkpoint Decision: Now is a good time to determine whether the project should move forward.  If you no longer feel there is a business case for pursuing the target, this is where you would make that recommendation.  On the other hand, if the business case remains strong, you may find that you need more or different data, or that you are ready to proceed to the next step.
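
As referenced above, a Data Quality Report can start life as little more than summary statistics and target correlations written to a file.  A bare-bones sketch, again reusing the assumed loan columns (and not IDEAR’s actual output format):

```python
import pandas as pd

df = pd.read_csv("analytics_staging/loan_applications.csv")  # illustrative file

with open("data_quality_report.txt", "w") as report:
    report.write("== Summary statistics ==\n")
    report.write(df.describe().to_string())
    report.write("\n\n== Missing values per column ==\n")
    report.write(df.isna().sum().to_string())
    report.write("\n\n== Correlation with target (defaulted) ==\n")
    corr = df.corr(numeric_only=True)["defaulted"].drop("defaulted")
    report.write(corr.sort_values(ascending=False).to_string())
```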


In my next blog I will look at Modeling.

