The starting point in the TDSP Lifecycle is Business Understanding. In my last blog, I mentioned that each step in the TDSP Lifecycle comes with goals, instructions, and artifacts. That is the format we will be following throughout this blog series.
Goals:
Obviously, if you are starting a project, you already have some ideas about what it is you want to accomplish. However, as you well know, taking the client's ideas and setting them out as actual deliverables is not always straightforward. That is why we start by setting some goals that can be used to determine the success of the project. Until this is completed, you probably don’t really understand the business goals well enough to ensure success. Here, you want to flesh out two things:
- The key variables that will function as model targets, and
- The relevant data sources that will be used.
Instructions:
The two critical tasks that you will address at this stage are:
Define Objectives: This is where you get to work with the stakeholders to fully understand and identify the business problems. Once you understand the problem to be solved, you can define the business goals that are to be achieved using data science techniques.
Identify Data Sources: With your objectives defined, you now need to determine what data or data sources you need in order to achieve those objectives.
Defining Your Objectives
The first step in defining your objectives is to identify the key business variables that you want to predict. These are called your model targets, and the metrics associated with them will be used to determine the success of the project. Some examples of model targets include whether a customer will return or leave (churn), whether a transaction is fraudulent, which product makes a good upsell candidate, sales forecasts for a given month, or what variables correlate with higher sales.
In order to define your project goals as precisely as possible, you will need to ask questions that are specific and unambiguous (“sharp questions”). In my next blog I will deal with the art of asking questions in a DS process. The definition of your goals is likely to fall within the following five question categories:
- Which category? This is a classification question.
- How much or how many? This is a regression question.
- Which group? This is a clustering question.
- Which option is best? This is a recommendation question.
- Is this weird? This is an anomaly detection question.
You will want to figure out which of these five questions is being asked, and how finding an answer to it will allow you to reach your business goals.
Why are these five questions so important? They correspond to the five questions that machine learning, at this time, can answer. It might surprise you, but there are only five questions that data science answers:
- Is this A or B?
- Is this weird?
- How much – or – How many?
- How is this organized?
- What should I do next?
Each one of these questions is answered by a separate family of machine learning methods, called algorithms.
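To make the link between a sharp question and an algorithm family concrete, here is a minimal sketch in Python. The first table simply restates the five question categories listed earlier in this post. The second table, and the way I assigned each example target to a category, are my own illustrative assumptions and not part of TDSP.

```python
# Question categories and the algorithm families named in this post.
TASK_BY_QUESTION = {
    "Which category?": "classification",
    "How much or how many?": "regression",
    "Which group?": "clustering",
    "Which option is best?": "recommendation",
    "Is this weird?": "anomaly detection",
}

# Hypothetical example targets, each assigned to one question category.
# The assignments are assumptions made for illustration only.
EXAMPLE_TARGETS = {
    "Will this customer churn?": "Which category?",
    "What will sales be next month?": "How much or how many?",
    "Is this transaction fraudulent?": "Which category?",
}


def task_for_target(target: str) -> str:
    """Return the algorithm family for one of the example targets."""
    return TASK_BY_QUESTION[EXAMPLE_TARGETS[target]]


if __name__ == "__main__":
    for target in EXAMPLE_TARGETS:
        print(f"{target} -> {task_for_target(target)}")
```

In a real project the same target can sometimes be framed more than one way (fraud, for example, could also be treated as an "Is this weird?" question), so treat the lookup as a conversation starter with your stakeholders rather than a rule.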
As you are defining your business goals, you will want to create your project team by determining who you will need, and the roles and responsibilities of each person.
With any large project, you must have interim milestones. Without them, you are virtually guaranteed to fail. Create a high-level plan with specific milestones and checkpoints so that you can assimilate new information and make adjustments to the project as needed.
Finally, you will define your success metrics using the SMART method:
- Specific
- Measurable
- Achievable
- Relevant
- Time-bound
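One way to keep yourself honest about SMART is to write each success metric down as a small record with one field per letter. The sketch below is only an illustration: the `SuccessMetric` class, its field names, and the churn numbers are all hypothetical and do not come from the TDSP templates.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class SuccessMetric:
    """A hypothetical record with one field per SMART criterion."""

    description: str    # Specific: what exactly should improve
    measure: str        # Measurable: the quantity you will track
    baseline: float     # where the measure stands today
    target: float       # Achievable: the value agreed with stakeholders
    business_goal: str  # Relevant: the business goal this supports
    deadline: date      # Time-bound: when the target should be reached

    def missing_fields(self) -> list[str]:
        """Return the names of text fields that were left empty."""
        names = ("description", "measure", "business_goal")
        return [n for n in names if not getattr(self, n).strip()]

    def is_met(self, observed: float) -> bool:
        """Check the observed value against the target.

        The direction of improvement is inferred from baseline vs. target.
        """
        if self.target >= self.baseline:
            return observed >= self.target
        return observed <= self.target


# Illustrative values only.
churn_metric = SuccessMetric(
    description="Reduce monthly customer churn",
    measure="monthly churn rate",
    baseline=0.20,
    target=0.15,
    business_goal="Retain existing customers",
    deadline=date(2030, 1, 1),
)
```

Calling `churn_metric.is_met(0.14)` would then tell you whether an observed churn rate of 14% satisfies the agreed target.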
Identify Data Sources
It is now time to find relevant data sources. The data must contain known examples of answers to your sharp questions. Specifically, you want data that is:
- Relevant to the question. It must have measures of the target and features related to the target.
- An accurate measure of the model target and the features of interest.
You may find that you have a readily available source that meets all your needs. Or, you may have to change the data collection currently being done in order to start collecting specific data that is missing from the overall data set. Often you will find multiple data sources that share key fields and that together will make up your data set, and then you will be challenged to determine how to combine them.
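As a rough illustration of combining sources, the sketch below joins two small in-memory tables on a shared key and then checks that the result contains both the model target and a feature. Everything in it is made up for the example: the sources, the `customer_id` key, and the column names. It also assumes the key is unique in the second source. On a real project you would more likely use a database join or a data frame library.

```python
# Hypothetical rows from two sources that share a customer_id key.
crm_rows = [
    {"customer_id": 1, "plan": "basic"},
    {"customer_id": 2, "plan": "premium"},
    {"customer_id": 3, "plan": "basic"},
]
billing_rows = [
    {"customer_id": 1, "churned": False},
    {"customer_id": 2, "churned": True},
]


def inner_join(left, right, key):
    """Combine rows that share the same key value (an inner join).

    Assumes each key value appears at most once in `right`.
    """
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


def has_columns(rows, columns):
    """Check that every row carries all of the required columns."""
    return all(all(col in row for col in columns) for row in rows)


dataset = inner_join(crm_rows, billing_rows, "customer_id")

# The combined data must contain the target ("churned") and a feature ("plan").
print(has_columns(dataset, ["churned", "plan"]))
```

Notice that customer 3 drops out of the combined set because the billing source has no record for that customer. Deciding what to do with rows like that is exactly the kind of challenge you face when combining sources.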
Artifacts:
These are the deliverables for the Business Understanding stage:
- Charter document: Microsoft provides you with a template to work with. This is a living document that is meant to be updated throughout the project. It should be updated regularly as discoveries are made and as the business requirements change. All the stakeholders should be tasked with the responsibility to communicate any discovery or requirements change that may affect the charter. A Project Charter template can be found here: http://bit.ly/2HM3y35
- Data sources: The data sources are recorded in the raw data sources section of the data definitions report, which is found in the TDSP project Data report folder. This section specifies the original and destination locations for the raw data. Scripts designed to move the data to the analytic environment can be added here as they are created. A template for the Data and Feature Definitions report can be found here: http://bit.ly/2FT00LT
- Data dictionaries: The data dictionary will provide descriptions of the data being provided. It may include details such as the data type(s) and any validation rules that may apply. Entity-relation diagrams can also be included.
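To give you a feel for what a data dictionary with data types and validation rules can look like in practice, here is a minimal sketch. The field names, descriptions, and rules are hypothetical, and the format is my own and not the TDSP template.

```python
# A hypothetical data dictionary: one entry per field, with a description,
# an expected Python type, and a simple validation rule.
DATA_DICTIONARY = {
    "customer_id": {
        "description": "Unique customer identifier",
        "type": int,
        "rule": lambda value: value > 0,
    },
    "plan": {
        "description": "Subscription plan name",
        "type": str,
        "rule": lambda value: value in {"basic", "premium"},
    },
    "churned": {
        "description": "Whether the customer has left (the model target)",
        "type": bool,
        "rule": lambda value: True,  # any boolean is acceptable
    },
}


def validate(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, spec in DATA_DICTIONARY.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
        elif not spec["rule"](value):
            problems.append(f"{field}: failed validation rule")
    return problems


if __name__ == "__main__":
    print(validate({"customer_id": 1, "plan": "basic", "churned": False}))
    print(validate({"customer_id": -5, "plan": "gold"}))
```

Keeping the descriptions and the rules side by side means the same dictionary can document the data for stakeholders and check incoming records.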
And that is the first step in a DS project that is organized under the TDSP system. In my next blog, I will address the issue of how to ask a question in a DS project. In that blog, I will also give you an overview of the DS process as a reminder. Then, I will circle back to the specifics by continuing our exploration of the TDSP process.