Most of us have heard of big data at one time or another. Wikipedia defines big data as “data sets so large or complex that they are difficult to process using traditional data processing applications”. In practice, the term refers to very large data sets that can be analyzed to reveal trends and associations, often relating to human behavior and interactions.
Everyone leaves a data trail one way or another. When I use my bank card to pay for groceries, I leave a data trail with the grocery store, which keeps track of what items were sold, and another with the bank that issued the card, which keeps track of how much I spent. The grocery store can use data about my purchases and those of other customers to analyze buying trends, identify products that are often bought together and display those products side by side. The bank, on the other hand, can use the data that other customers and I generated that day to come up with a new product, say a credit card that offers cash back or store credit at that store. When I use an online movie streaming service, I leave a data trail that shows when I’m usually online watching movies, the types of movies I watch and the ratings I give them. The streaming company can use this data to recommend movies to me, to let me know when movies similar to those I’ve rated highly have been added to its list, or to offer me free movie passes to see certain movies based on the choices I’ve made in the past. Companies now invest a lot of money in collecting the data trails we leave every day and in trying to make business sense of them.
All the data stored by companies will be of no use if there is no way of interpreting it. That’s where data mining comes in. Wikipedia defines data mining as the process of exploring large amounts of data in order to find meaningful and useful patterns and relationships.
Data mining uses machine learning algorithms to find useful patterns in data. Of the various machine learning techniques, I’ll only talk about association today. Hopefully I will talk about other algorithms over the course of the week.
Associations are relationships that exist in data sets. These relationships are discovered using association rules. Wikipedia (yeah, I know using Wikipedia isn’t very “academic” in nature but hey, it’s my blog!) defines association rule learning as a method for “discovering interesting relations between variables in large databases”. Okay, let’s break this down. Imagine a database from the grocery store I mentioned in the second paragraph. Part of the big data it’s likely to collect daily is customers’ purchases. Variables in this case refer to the items that customers purchased. My typical weekend grocery list contains apples, carrots, bananas, bread and eggs. These items are the variables. “Interesting relations”, as in the definition, could include the fact that most people who bought product A, say eggs, also bought product B, say bread. Based on this association, the store could place products A and B close to each other, or offer special discounts on product A in the hope that product B will sell as well.
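To make the eggs-and-bread example a bit more concrete, here is a minimal Python sketch. The baskets are made up purely for illustration, and the helper names (`count`, `support`, `confidence`) are my own choices rather than anything from a library. Support and confidence are the two measures most commonly used to decide whether a rule like "eggs, therefore bread" is interesting: support is the share of all baskets that contain both items, and confidence is the share of egg buyers who also bought bread.

```python
# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"apples", "carrots", "bananas", "bread", "eggs"},
    {"eggs", "bread", "milk"},
    {"eggs", "bread"},
    {"apples", "bananas"},
    {"eggs", "carrots"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets)


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)


eggs_bread_support = support({"eggs", "bread"}, baskets)
eggs_to_bread_confidence = confidence({"eggs"}, {"bread"}, baskets)

print(f"support(eggs, bread) = {eggs_bread_support:.2f}")
print(f"confidence(eggs -> bread) = {eggs_to_bread_confidence:.2f}")
```

In this toy data, three of the five baskets contain both eggs and bread, and three of the four baskets with eggs also contain bread. A real store would run the same kind of calculation over millions of baskets and only keep rules whose support and confidence are above some threshold it chooses.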
Association rules are used by many organizations to make recommendations to customers. For example, a movie streaming company could use association rules to discover that customers who watched movie A usually watched movie B as well, and so recommend movie B to customers who have seen A. If I as a customer always get movie recommendations that I love, I’m unlikely to cancel my subscription to such a company. Association rules are also used by several e-commerce sites to subtly “remind” customers at checkout of what to buy, or of which pairs or sets of items are usually bought together. The sole aim of association rules in my opinion (when used in business) is to increase how much and how often customers buy, which in turn means more profit for the organization. :)
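The movie A and movie B idea can be sketched in a few lines of Python too. Everything here is hypothetical: the viewing histories are invented, and `recommend` with its `min_confidence` parameter is just one simple way to turn "customers who watched A usually watched B" into a recommendation, not how any particular streaming company does it.

```python
from collections import Counter

# Hypothetical viewing histories, invented for illustration only.
histories = {
    "ada": {"Movie A", "Movie B", "Movie C"},
    "ben": {"Movie A", "Movie B"},
    "cy": {"Movie A", "Movie D"},
    "dee": {"Movie B", "Movie C"},
}


def recommend(seen_title, histories, min_confidence=0.5):
    """Return (title, confidence) pairs for titles that viewers of seen_title also watched.

    Confidence is the fraction of seen_title's viewers who watched the other
    title. Only titles at or above min_confidence are returned.
    """
    viewers = [h for h in histories.values() if seen_title in h]
    if not viewers:
        return []
    # Count how many of those viewers watched each other title.
    co_watched = Counter(t for h in viewers for t in h if t != seen_title)
    ranked = [(t, n / len(viewers)) for t, n in co_watched.most_common()]
    return [(t, c) for t, c in ranked if c >= min_confidence]


recommendations = recommend("Movie A", histories)
for title, conf in recommendations:
    print(f"Because you watched Movie A: {title} (confidence {conf:.2f})")
```

With these made-up histories, two of the three people who watched Movie A also watched Movie B, so Movie B is the only title that clears the 0.5 threshold.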
For more on association rules, kindly see Margaret Rouse’s post here.