In my last blog, I said that I would, as a sidebar, look at the art of asking questions in a Data Science (“DS”) process. This is that blog. To keep it in context, I have decided to follow through with an overview of the DS process, so this sidebar will be four blogs long. At the end of this sidebar, I will return to the more specific and guided process provided by Microsoft’s TDSP.
Yes, really: how to ask a question. I know you have been asking questions all your life and you may feel that you have this down pat. Asking questions and getting answers is as basic a function in life and business as it gets. However, there is a world of difference between asking, “How do I retain more customers?” and “Which customer is about to leave for a competitor?”. The first question is broad and open to many answers, some of which may conflict with one another. One answer may be to raise quality. Another may be to lower quality and so make room for a lower price. This sort of open-ended question is not helpful in DS. The second question is precise, and it is the sort of question that machine learning is geared up to answer.
So, although you have been asking questions all your life, you likely haven’t been asking questions that a Data Scientist can ply their trade upon. To get us started, I’d like to introduce you to another diagram that shows the iterative DS process. Asking the right questions is actually step two according to this process. That is because at the heart of a data scientist lies a constant and burning desire for more data!
- Get More Data
This is of course your goal when you start out with your TDSP. What you are looking for is numbers, names, or both. Everything else takes you out of the realm of DS. Some examples:
Names: Product, title, action, image, gender, color, good, broken, etc.
Numbers: Date, price, quantity, time, age, temperature, height, etc.
Anything can be “data”, even something like a video, so long as you are describing it in terms of names or numbers.
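To make the names-versus-numbers split concrete, here is a minimal Python sketch that labels each field of a single record. The record, its field names, and the `kind_of` helper are all made up for illustration; dates are treated as numbers, as in the examples above.

```python
from datetime import date
from numbers import Number


def kind_of(value):
    """Label a single value as a 'number', a 'name', or 'other'."""
    # bool is a subclass of int in Python, but True/False act like labels
    # (good/broken), so treat them as names.
    if isinstance(value, bool):
        return "name"
    # Dates and times count as numbers here, following the list above.
    if isinstance(value, (Number, date)):
        return "number"
    if isinstance(value, str):
        return "name"
    # Anything else (a raw video, say) still has to be described
    # in terms of names or numbers before it is usable.
    return "other"


# Hypothetical record, invented for this example.
record = {
    "product": "kettle",
    "color": "red",
    "condition": "broken",
    "price": 24.99,
    "quantity": 3,
    "sold_on": date(2020, 1, 15),
}

kinds = {field: kind_of(value) for field, value in record.items()}
for field, kind in kinds.items():
    print(f"{field}: {kind}")
```

A field that comes back as "other" is a hint that it still needs to be described with names or numbers before a data scientist can work with it.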
- Ask a Sharp Question
Most people want to be helpful, but we are not generally wired to think in precise terms. Ask someone what the temperature is outside, and they are likely to tell you it is “warm”, “cold”, or “hot”. That kind of answer is usually not helpful to a data scientist. If you ask that same person what the temperature is outside in degrees Fahrenheit, you will get an odd look, but as a data scientist, that is the precise question that leads to a number you can work with.
If you want to know how long a warranty to put on your product, you might ask, “How many hours will this product function before it fails?” and “How many hours per month, on average, do our clients use our product?” The answers to these questions are your “target”. If your data set does not contain data pertaining to these questions (“target data”), then you need to go back to step 1 and get more data.
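As a rough sketch of that check, the Python below looks for target data in a small data set and sends you back to step 1 if any is missing. The rows, the column names (`hours_until_failure`, `monthly_usage_hours`), and the `missing_targets` helper are assumptions invented for this example, not part of any real data set or library.

```python
def missing_targets(rows, target_columns):
    """Return the target columns that have no usable value in any row."""
    missing = []
    for column in target_columns:
        # A column counts as present if at least one row has a value for it.
        has_value = any(row.get(column) is not None for row in rows)
        if not has_value:
            missing.append(column)
    return missing


# Hypothetical data set: usage hours are recorded, time-to-failure is not.
rows = [
    {"serial": "A100", "monthly_usage_hours": 42.0},
    {"serial": "A101", "monthly_usage_hours": 37.5},
]

# One target column per sharp question in the warranty example.
targets = ["hours_until_failure", "monthly_usage_hours"]

gaps = missing_targets(rows, targets)
if gaps:
    print("Back to step 1, get more data for:", ", ".join(gaps))
else:
    print("Target data is present for every question.")
```

In this made-up case the usage question can be answered from the data on hand, but the failure question cannot, so the warranty question as a whole still needs more data.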
In my next blog I will look at steps 3 and 4.