Most of us have heard of big data
at one time or another. Wikipedia defines big data as “data sets so large or
complex that they are difficult to process using traditional data processing
applications”. In practice, the term refers to very
large data sets that can be analyzed to reveal trends and associations, especially those relating
to human behavior and interactions.
Everyone leaves a data trail one way or another. When
I use my bank card to pay for groceries, there is a data trail with the grocery
store, which keeps track of what items were sold, and another data trail with
the bank that issued the card, which keeps track of how much I spent. The
grocery store can use data about my purchases and those of other customers to
analyze buying trends, identify products that are often bought together, and
display those products side by side in the store. The bank, on the other hand,
can use the data that I and other customers generated that day to come up with a
new product, say a credit card that offers cash back or store credit at that
store. When I use a movie streaming service
online, I leave a data trail that shows when I’m usually online watching
movies, the types of movies I watch and the ratings I give those movies. The movie
streaming company can use this data to recommend movies to me,
inform me when movies similar to those I’ve rated highly in the past
have been added to its list, or offer me free movie passes to see certain
movies based on the choices I’ve made in the past. Companies now invest a lot
of money collecting data from the trails we leave daily, trying to make
business sense of it.
All the data stored by companies
would be of no use without a way of interpreting it. That’s where data
mining comes in. Wikipedia defines data mining as the process of exploring
large amounts of data in order to find meaningful and useful patterns and
relationships.
Data mining uses machine learning
algorithms to find useful patterns in data. Of the various machine learning
techniques, I’ll only talk about association today. Hopefully I will talk about
other algorithms over the course of the week.
Associations are relationships
that exist in data sets. These relationships are discovered using association
rules. Wikipedia (yeah, I know using Wikipedia isn’t very “academic” in nature
but hey, it’s my blog!) defines association rule learning as a method for “discovering
interesting relations between variables in large databases”. Okay, let’s break
this down. Imagine a database from the grocery store I mentioned in the second
paragraph. Part of the big data it’s likely to collect daily is a record of
customers’ purchases. Variables in this case refer to the items that were
purchased by customers. For me, a typical weekend grocery list contains
apples, carrots, bananas, bread and eggs. These items are variables. “Interesting
relations”, as in the definition, could include the fact that most people who bought
product A, say eggs, also bought product B, say bread. Based on this
association, the store could place products A and B close to each other or offer
special discounts on product A in the hope that product B will sell as well.
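To make the eggs-and-bread example a bit more concrete, here is a minimal Python sketch. The baskets are made up for illustration, and the function names `support` and `confidence` are my own. They follow the two standard measures used in association rule learning: support (how often the items appear together) and confidence (how often customers who bought A also bought B). Treat it as a rough sketch under those assumptions, not as how any real store does it.

```python
# Hypothetical shopping baskets; each set is one customer's purchase.
transactions = [
    {"apples", "carrots", "bananas", "bread", "eggs"},
    {"eggs", "bread", "milk"},
    {"eggs", "bread"},
    {"apples", "bananas"},
    {"eggs", "carrots"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent. Assumes the antecedent appears at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_support = support({"eggs", "bread"}, transactions)
rule_confidence = confidence({"eggs"}, {"bread"}, transactions)

print(f"support(eggs, bread) = {rule_support:.2f}")
print(f"confidence(eggs -> bread) = {rule_confidence:.2f}")
```

In this made-up data, eggs and bread appear together in 3 of the 5 baskets, and 3 of the 4 baskets that contain eggs also contain bread. A high confidence like that is the kind of "interesting relation" a store might act on.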
Association rules are used by
many organizations to make recommendations to customers. For example, a movie
streaming company using association rules could discover that customers who
watched movie A usually watched movie B as well, and so recommend movie B to
customers who have seen A. If I, as a customer, always get movie recommendations that I love, I’m unlikely to cancel my subscription to that company. Association rules are also used by several e-commerce
sites to subtly “remind” customers of what to buy at checkout or to show which pairs or
sets of items are usually bought together. In my opinion, the sole aim of association rules (when used in business) is to increase how much and how often customers buy, which in turn means more profit for the organization. :)
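The movie example can be sketched in a few lines of Python as well. The viewing histories and the `recommend` function below are hypothetical, and simply counting which movies are watched together is a simplification I'm assuming for illustration. A real streaming service would almost certainly do something more sophisticated.

```python
from collections import Counter

# Hypothetical viewing histories; each set is the movies one customer watched.
histories = [
    {"Movie A", "Movie B", "Movie C"},
    {"Movie A", "Movie B"},
    {"Movie A", "Movie B", "Movie D"},
    {"Movie A", "Movie C"},
    {"Movie C", "Movie D"},
]


def recommend(seen, histories, top_n=1):
    """Suggest the movies most often watched by customers who also watched `seen`."""
    co_watched = Counter()
    for history in histories:
        if seen in history:
            # Count every other movie this customer watched alongside `seen`.
            co_watched.update(history - {seen})
    return [movie for movie, _ in co_watched.most_common(top_n)]


recommendations = recommend("Movie A", histories)
print(recommendations)
```

Here, three of the four customers who watched Movie A also watched Movie B, so Movie B is what gets recommended to someone who has just seen A.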
For more on association rules,
kindly see Margaret Rouse’s post
here.
References
http://en.wikipedia.org/wiki/Big_data
http://simple.wikipedia.org/wiki/Data_mining
http://www.statsoft.com/Textbook/Data-Mining-Techniques
http://en.wikipedia.org/wiki/Association_rule_learning