I’m sorry for the inconvenience. A long while back I started a blog series about data science. I managed to get three blogs up, and then I was interrupted by a very topical subject: Blockchain. Having covered that topic, and hopefully having gotten you exploring Blockchain as a Service on Azure, I now resume your previously scheduled blog series on data science. Feel free to look back to February 16, when the first three blogs in this series started. Hopefully there will be no further interruptions.
. . .
We have all heard the term “rocket scientist”, but what is a rocket scientist? Do they optimize the aerodynamics of rockets? Do they create rocket fuel? Do they calculate the trajectory of the rocket to ensure it ends up where it is supposed to be? Do they create or select the materials used to build the rocket? Do they build the rocket engine? Do they design the launch pad?
The answer to all of those questions is yes and no. The person creating rocket fuel is likely a chemist or chemical engineer. The person calculating the trajectory is likely a physicist or mathematician. But they are all doing scientific work on rockets. So technically, each of them is both a specialist in their own discipline and a rocket scientist.
As for the person designing the launch pad, again it could be argued both ways. Although that person isn’t working on a rocket, good luck launching one without a properly engineered launch pad. This person is an integral part of the field of rocketry, so arguably, also a rocket scientist.
That is where we are when it comes to Data Scientists. The science of data is such a broad field that it encompasses many fields of study. Programmers, analysts, DBAs, and statisticians can all lay claim to the title of Data Scientist. So, what about the sys-admin? Are they not, by analogy, similar to the launch pad engineer?
If you look back a few blog posts ago to my Data Science, and (SIC) Introduction, I gave you the definition of a Data Scientist as set out by the National Science Board. It included a long list of professions and then ended on the key factor: …whose primary activity is to conduct creative inquiry and analysis [on data]. So, a DBA who spends most of their time on back-ups and maintenance would not be a Data Scientist. A DBA who spends most of their time on creative inquiry and analysis of data would be a Data Scientist.
The Team Data Science Process (TDSP) is a package of tools, structure, algorithms, and methodology created by Microsoft. It includes the following:
- A definition of the data science lifecycle.
- A template that standardizes the structure of a data science project.
- Recommended infrastructure (provided on Azure) to help ensure success.
- The provision of tools and scripts to get your team started.
Data Science Lifecycle
The TDSP lifecycle takes you from start to finish, detailing the steps that are usually taken in a data science project. Others have come up with lifecycles for data science and, at a high level, they all share a similar path. If you are already familiar with a data science (“DS”) path, you can still use it while taking advantage of the TDSP tools and infrastructure. The TDSP lifecycle was designed for DS projects tied to intelligent applications that deploy machine learning models for predictive analytics. Other data science projects, while still able to benefit from TDSP, may not need all the steps set out. Either way, it is better to have the step and skip it than to not have the step and need it. So, TDSP may be more comprehensive than you need, but that is OK.
The lifecycle’s major stages can be used on a repetitive basis with each iteration bringing you closer to the desired goal. The stages are defined as follows:
- Business Understanding
- Data Acquisition and Understanding
- Modeling
- Deployment
- Customer Acceptance
Standardized Project Structure
As you well know, Apples and PCs don’t mix well. Standardization keeps everyone interacting and collaborating seamlessly (in theory). Under TDSP, projects share a directory structure and templates in order to avoid confusion. All code and documents produced and stored under TDSP use a version control system that allows for collaboration while avoiding mayhem. Tasks and features are tracked using an agile project tracking system.
All the templates for the folder structure and the recommended documents are included. Using a folder structure and document templates that are consistent across teams and projects makes it easier for everyone to communicate, and even to transfer in new members or transfer a team member out to a different team. The checklist templates ensure that all the “i’s” are dotted and “t’s” are crossed.
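To make the idea of a shared folder structure a little more concrete, here is a minimal Python sketch that scaffolds the same layout for every new project. The folder names in LAYOUT and the scaffold helper are my own illustrative assumptions, not the official TDSP template, so treat this as a rough picture of the approach rather than the real thing.

```python
from pathlib import Path
import tempfile

# Hypothetical layout for illustration only; these names are assumptions,
# not the official TDSP project template.
LAYOUT = {
    "Code": ["Data_Acquisition_and_Understanding", "Modeling", "Deployment"],
    "Docs": ["Project", "Data_Report", "Model"],
    "Sample_Data": [],
}


def scaffold(root, layout=LAYOUT):
    """Create the folder layout under root; return the relative paths, sorted."""
    root = Path(root)
    created = []
    for top, subfolders in layout.items():
        # Create each top-level folder (and root itself, if it is missing).
        (root / top).mkdir(parents=True, exist_ok=True)
        created.append(top)
        for sub in subfolders:
            (root / top / sub).mkdir(exist_ok=True)
            created.append(f"{top}/{sub}")
    return sorted(created)


if __name__ == "__main__":
    # Build the layout in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        for relative_path in scaffold(tmp):
            print(relative_path)
```

Because every project starts from the same call, someone moving between teams finds code, documents, and sample data in the same places each time.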
Infrastructure and Resources
Everything you need to help your team succeed with a DS project is found on Azure in the TDSP cloud. This includes all the things I have blogged about in the past like cloud storage, databases, big data clusters, and machine learning tools.
With data, tools, and processing power all kept in Azure, infrastructure costs are minimized, security is maximized, the testing environment is consistent, and collaboration is easy.
Tools and Scripts
TDSP provides the tools and scripts that allow a team to easily adopt a single process. This lowers the barriers to adoption, increases consistency of adoption across the team, and ensures you are starting with a tested system that consistently leads to successful completion of a DS project. People don’t like to start something if they are unsure of the end result or how they are going to get to that end result. TDSP breaks down those barriers and instills confidence by providing the road map.
Scripts found within TDSP will help automate many of the common tasks in the DS lifecycle.
Clearly TDSP is a large environment and I hope you will continue to follow my blog as I take you on a tour. I will be diving into the many sections of TDSP that I have outlined above and providing you with links to resources so you can continue the exploration on your own.