By Michael Burke | November 11, 2019
Data mining, or the process of drawing patterns from an organization’s data, is something of the wild west right now, with new technologies and techniques popping up faster than speeding tickets. So it might surprise you to learn that there is an industry-standard process for data mining, referred to as CRISP-DM (the Cross-Industry Standard Process for Data Mining), intended to be a comprehensive blueprint for conducting a data mining project. CRISP-DM reduces everything to six phases:
Phase One: Business Understanding. As you peruse big data analytics literature, you’ll see all kinds of warnings about not diving into analysis before you know what you want to accomplish. That advice relates back to this first phase of the data mining life cycle, where you work out the business problem you’re trying to solve. It’s also why turning a group of computer science graduates loose without any input from business experts will typically not produce meaningful analytics. During this phase, you determine the business objectives, assess the situation (including what data you have available to analyze), set the project goals (as well as defining what success looks like), and make a game plan. Are you trying to make one of your products or services better? Do you need to better identify customers who will purchase more? The answers to these questions often guide every other phase of the data mining life cycle.
Phase Two: Data Understanding. There’s a lot of talk about simplifying big data analytics with AI so that computer programs just tell us what data means. In reality, data in its raw form, or frankly even in its refined form, requires a human eye to even start mining for insight, much less draw meaningful conclusions. This starts with an analyst taking a high-level look, where they may discover initial insights and start to form a hypothesis. They may also notice problems with data quality. All of this requires a bit of domain expertise. For instance, someone unfamiliar with media metrics would not necessarily know that “audience reach” and “UMV” measure similar things, and that including both in a data model will probably render its insights meaningless. The analyst will also develop a hunch about what is relevant, and spot values that simply defy common sense and are therefore probably either mistakes or anomalies so extreme that they don’t belong in the model (such as a dataset of Los Angeles housing prices showing a home in Beverly Hills selling for $350K).
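To give a feel for what that first high-level look might involve, here is a minimal sketch in pandas. The file name and column names are purely illustrative assumptions, not from any real dataset:

```python
import pandas as pd

# Hypothetical extract of Los Angeles sale records; the file name and
# column names are illustrative only.
df = pd.read_csv("la_housing.csv")

# A first high-level look: shape, types, and summary statistics often
# surface data-quality problems before any modeling starts.
print(df.shape)
print(df.dtypes)
print(df["sale_price"].describe())

# Flag values that defy common sense, such as a Beverly Hills sale under
# $500K; these are likely mistakes or anomalies that don't belong in the model.
suspicious = df[(df["neighborhood"] == "Beverly Hills") & (df["sale_price"] < 500_000)]
print(suspicious)
```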
Phase Three: Data Preparation. Sadly, most datasets are not ready to be fed into analytics or machine learning algorithms. They typically come with all kinds of problems, from inconsistent formatting to missing values, that will make your computer model return an error rather than a result. A single unexpected ‘$’ or ‘%’ can render the model useless. So data scientists, architects, and engineers have to spend significant time rectifying this before they can go any further, a step often referred to as ‘data cleansing’. Beyond this, there’s also the issue of deciding which portions of the data are relevant to the objectives of the project, and understanding what volume of data your tools can handle.
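As a rough sketch of what that cleansing can look like in pandas, consider the toy example below. The column names, the stray ‘$’ and ‘%’ characters, and the fill-with-median strategy are all illustrative assumptions rather than a prescription:

```python
import pandas as pd

# Toy raw data with the kinds of problems described above.
raw = pd.DataFrame({
    "revenue": ["$1,200", "$950", None, "$2,480"],
    "growth": ["5%", "12%", "7%", None],
})

clean = raw.copy()

# Strip '$', ',' and '%' so the columns become numbers instead of strings.
clean["revenue"] = clean["revenue"].str.replace(r"[$,]", "", regex=True).astype(float)
clean["growth"] = clean["growth"].str.rstrip("%").astype(float)

# Handle missing values -- here by filling with the column median,
# though the right strategy depends on the project.
clean = clean.fillna(clean.median(numeric_only=True))
print(clean)
```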
The data that you need might not come in one handy dataset; it might come from dozens of data sources and need to be assembled into a single table to answer your questions. Some of this can be done with SQL, while other steps call for more sophisticated techniques in developer languages like Python.
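Here is a hedged sketch of that assembly step, using pandas in place of SQL and two entirely made-up source tables; a SQL JOIN plus GROUP BY would produce the same single analysis table:

```python
import pandas as pd

# Hypothetical extracts from two different source systems.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["West", "East", "West"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "order_total": [120.0, 80.0, 200.0, 45.0],
})

# Total spend per customer, joined to customer attributes, gives one
# table that downstream modeling can work from.
spend = orders.groupby("customer_id", as_index=False)["order_total"].sum()
analysis_table = customers.merge(spend, on="customer_id", how="left")
print(analysis_table)
```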
Phase Four: Modeling. This is probably the most esoteric phase, where you decide what kind of algorithms, or ‘modeling techniques’, you’ll employ. For the most part, it’s the domain of data scientists and machine learning engineers, and not something that makes its way into business conversations. However, many of the terms thrown around in the popular press, such as ‘deep learning’ and ‘natural language processing’, actually describe modeling techniques.
Actually, you might already be familiar with machine learning modeling at a high level. If you took high school algebra, you have a sense of what it looks like. Remember graphing lines based on linear equations (e.g. y = 2x + 3)? That is essentially a form of machine learning called “regression”, and it’s quite an important one, used to make all kinds of predictions. A regression-based machine learning model would generate the equation, and your prediction, ‘y’, would depend on the value of ‘x’.
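To make that concrete, here is a small illustrative sketch using scikit-learn’s LinearRegression. The synthetic data is generated from the y = 2x + 3 line above with a little noise, so the fitted slope and intercept should land close to 2 and 3:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data sampled from the line y = 2x + 3 plus noise, so the
# "learned" equation should come out close to the one from algebra class.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 2 * x[:, 0] + 3 + rng.normal(0, 0.5, size=100)

model = LinearRegression()
model.fit(x, y)

print("slope:", model.coef_[0])        # close to 2
print("intercept:", model.intercept_)  # close to 3
print("prediction at x=4:", model.predict([[4.0]])[0])  # close to 2*4 + 3 = 11
```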
Phase Five: Evaluation. At this point, the analyst team will evaluate the model, or models, before moving forward, asking such questions as: How predictive is it? Does it answer the questions we sought to answer from the start of the project? Is it accurate enough for our purposes? Could we do anything to make it better? Can the model be repeated and integrated into business processes?
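As a rough illustration of what answering those questions might look like in code, the sketch below uses scikit-learn with synthetic data standing in for a real prepared dataset; the accuracy metric and five-fold cross-validation are just example choices, not the only way to evaluate a model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# Synthetic example data standing in for a real prepared dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 500) > 0).astype(int)

# Hold back data the model never saw, so the evaluation is honest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# "How predictive is it?" and "Is it accurate enough for our purposes?"
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# "Can it be repeated?" -- cross-validation checks whether performance
# holds up across different slices of the data.
print("cross-validated accuracy:", cross_val_score(LogisticRegression(), X, y, cv=5).mean())
```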
Phase Six: Deployment. Most analytics projects involving data mining aren’t intended to be one-offs. Rather, the intent is to integrate them into organizational or business practices. For instance, if you created a model that predicted with 80 percent accuracy when a customer is likely to leave your service for a competitor’s, you’d want to build this into the workflow of your sales and marketing teams, so that they could be automatically notified when a customer is at high risk of churn. To do this, your data science team would work with the software development team to build the capability into your software architecture. This is often referred to as ‘operationalizing’ your analytics model.
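One hedged sketch of what that operationalization could look like is below: a small scoring function that production systems could call on a schedule or behind an API. The threshold, function name, and notification workflow are all hypothetical, not part of any particular product:

```python
# A minimal sketch of "operationalizing" a churn model. In practice the data
# science team would hand production a trained, saved classifier; the
# threshold and field names here are hypothetical.

CHURN_THRESHOLD = 0.8  # flag customers at or above 80% predicted churn risk

def flag_at_risk_customers(model, customer_ids, customer_features):
    """Score customers with a fitted classifier and return the IDs that
    sales and marketing should be notified about."""
    churn_probabilities = model.predict_proba(customer_features)[:, 1]
    return [
        cid for cid, prob in zip(customer_ids, churn_probabilities)
        if prob >= CHURN_THRESHOLD
    ]

# In production this might run nightly, or sit behind an endpoint built with
# the software development team, with flagged IDs pushed into the CRM so the
# sales and marketing teams are notified automatically.
```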
CRISP-DM isn’t the only framework, and alternatives like TDSP (the Team Data Science Process) are gaining steam. For now, though, CRISP-DM remains the dominant process.