Data Mining Demystified - I

Based on past sales, viewing history and customer profile, which item is the consumer most likely to buy next? (Netflix, Amazon)

Which advertisement should be displayed on a web page to maximize the chances of the user clicking it? (Internet advertisers like Google and Yahoo)

What is the value of a house based on historical sales data and other economic factors? (real estate sites like Zillow)

Is this credit card transaction possibly fraudulent? (Credit card companies like American Express)

Knowing answers to questions like these and discovering this information about your business can be a huge competitive advantage.

Data mining processes and techniques can answer these questions and discover hidden patterns in your business data.

Many companies realize the strategic importance and competitive advantage of data mining. Yahoo has world-renowned data mining expert Dr. Usama Fayyad as its Chief Data Officer.

What is data mining?

Data mining refers to the process of finding interesting patterns in data that are not explicitly part of the data (Witten & Frank, 2005).

Another good definition: data mining is the process of automatically discovering useful information in large data repositories (Tan, Steinbach, Kumar).

The data being mined is usually not collected specifically to answer the questions that data mining answers. It could sit in logs, data warehouses, credit card and retail transactions, text documents, web pages, financial systems, etc. It is any data that a company stores to run its business. Data mining techniques and processes discover unexpected, valuable and interesting patterns and insights in this data.

Data mining operates on data that is usually large, has a high number of dimensions, and may or may not be structured. It uses techniques from many disciplines - statistics, AI, machine learning. And, very interestingly, when you use data mining you may not know what you are really looking for. Valuable insights often come as a surprise.

How is data mining different from query/reporting and OLAP?

You can distinguish between query, OLAP and data mining by the kind of question being asked.

Let us take the example of a national grocery store chain. It wants to know the total sales of diapers last quarter. It can run a query on its sales database and get the answer.
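To make the "plain query" case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns, and the rows are all made up for illustration; a real chain's schema would look different.

```python
import sqlite3

# Hypothetical schema and made-up rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("diapers", "Q3", 120.0),
        ("diapers", "Q4", 150.0),
        ("beer", "Q4", 300.0),
        ("diapers", "Q4", 50.0),
    ],
)


def total_sales(conn, item, quarter):
    """Return the total sales amount for one item in one quarter."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales "
        "WHERE item = ? AND quarter = ?",
        (item, quarter),
    ).fetchone()
    return row[0]


# One specific question, one direct answer.
print(total_sales(conn, "diapers", "Q4"))
```

The point is that the question is known in advance and maps directly to a filter and a sum.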

Now let us say the grocery store chain wants to know: how did the sales of diapers vary quarter over quarter for each region in the last year? One way is to run a query over its sales transaction database. This query can take a huge amount of time and resources and can slow down the transactional database. A better way is to run an OLAP query over a sales data mart. OLAP stands for online analytical processing; it supports analysis of data along dimensions such as region and time, trends, and forecasts based on those trends.
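The kind of summary an OLAP tool produces for this question can be sketched in a few lines of Python. This is only an illustration of the idea (aggregate by region and quarter, then compare adjacent quarters), not how an OLAP engine is implemented; the rows and the function names are made up.

```python
from collections import defaultdict

# Made-up rows: (region, quarter, diaper sales amount).
rows = [
    ("East", "Q1", 100.0),
    ("East", "Q2", 120.0),
    ("East", "Q2", 30.0),
    ("West", "Q1", 80.0),
    ("West", "Q2", 60.0),
]


def rollup(rows):
    """Sum sales into a {region: {quarter: total}} table."""
    cube = defaultdict(lambda: defaultdict(float))
    for region, quarter, amount in rows:
        cube[region][quarter] += amount
    return {region: dict(quarters) for region, quarters in cube.items()}


def quarter_over_quarter(cube, quarters):
    """Return {region: [change from each quarter to the next]}."""
    return {
        region: [
            totals.get(later, 0.0) - totals.get(earlier, 0.0)
            for earlier, later in zip(quarters, quarters[1:])
        ]
        for region, totals in cube.items()
    }


cube = rollup(rows)
changes = quarter_over_quarter(cube, ["Q1", "Q2"])
print(changes)
```

A data mart typically stores such pre-aggregated totals, so the comparison does not have to scan every transaction each time.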

Now the store wants to know: what else do customers who buy diapers buy at the same time? This is where data mining comes in - discovering hidden patterns in data and making predictions based on those patterns. As the story goes, a grocery chain mined its data and found that men who buy diapers on Friday or during the weekend also buy beer. Putting the beer aisle near the diaper aisle increased beer sales. This example is actually a famous legend in data mining circles, and it illustrates the power of data mining well.

In my next post on data mining, I will explain data mining processes and techniques with a real-world example.
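One common way to look for "bought together" patterns is to count how often pairs of items appear in the same basket, and then ask how often one item shows up given the other. The sketch below does that on a handful of made-up baskets. It is an illustration of the idea only, not the method used in the diapers-and-beer story, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets, one set of items per transaction.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets with `antecedent` that also hold `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for b in with_antecedent if consequent in b)
    return hits / len(with_antecedent)


counts = pair_counts(baskets)
print(counts[("beer", "diapers")])
print(confidence(baskets, "diapers", "beer"))
```

Notice that nobody had to ask about beer specifically: the pair counts cover every combination, and the interesting ones surface on their own.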
