Sunday, November 20, 2016

NEW DISCOVERY OF ASSOCIATION RULES IN DATA MINING, IMPLEMENTED IN REAL LIFE





What Are Association Rules?

According to Margaret Rouse's blog, association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository. An example of an association rule would be "If a customer buys a dozen eggs, he is 80% likely to also purchase milk."

An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data. A consequent is an item that is found in combination with the antecedent. Association rules are created by analyzing data for frequent if/then patterns and using the criteria of support and confidence to identify the most important relationships. Support indicates how frequently the items appear in the database. Confidence indicates how often the if/then statement has been found to be true.
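To make these two measures concrete, here is a minimal Python sketch with an invented toy basket list, echoing the eggs-and-milk example; support and confidence are computed exactly as defined above:

```python
# Toy transactions, invented for illustration.
transactions = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "butter"},
    {"milk", "bread"},
    {"eggs", "milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent together with consequent) divided by support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"eggs", "milk"}, transactions))       # 0.6
print(confidence({"eggs"}, {"milk"}, transactions))  # 0.75
```

So in this toy data, a customer who buys eggs also buys milk 75% of the time.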



Association Rules to Predict the Weather

I would like to create dummy data for BMKG (Indonesia's Meteorology, Climatology, and Geophysics Agency) that can be used to predict the weather, so that this kind of information can be shared with citizens. Given a set of data, association rules predict the occurrence of an item based on the occurrences of other items.

TID | Items
 1  | Storm, Rainy, Thunder, Drizzle
 2  | Thunder, Rainy, Windy, Storm, Drizzle
 3  | Windy, Rainy, Drizzle, Storm
 4  | Drizzle, Rainy, Thunder
 5  | Thunder
 6  | Thunder, Drizzle
 7  | Rainy, Thunder, Storm
 8  | Drizzle, Windy

Mining Association Rules
There are two steps in mining association rules:
1.      Frequent Itemset Generation
In order to generate the list of frequent itemsets, one should avoid the brute-force approach, because searching the whole data set for the support count of every possible itemset is very expensive. Some of the strategies used to address this problem are:
-  Reduce the number of candidates (Apriori principle): use pruning techniques such as the Apriori principle to eliminate some of the candidate itemsets without counting their support values (a sketch follows this list)
-  Reduce the number of transactions: by combining transactions together we can reduce the total number of transactions
-  Reduce the number of comparisons (FP-Growth): use efficient data structures to store the candidates, thereby eliminating the need to match every candidate against every transaction
2.      Rule Generation
Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Frequent itemset generation is still the computationally expensive step.
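As a rough illustration of the Apriori principle (every subset of a frequent itemset must itself be frequent, so a candidate with any infrequent subset can be pruned without counting it), here is a minimal Python sketch, simplified rather than optimized:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise frequent itemset generation with Apriori pruning."""
    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_count}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune (Apriori principle): skip any candidate that has an
        # infrequent (k-1)-subset; it cannot possibly be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in all_frequent
                             for sub in combinations(c, k - 1))}
        # Count only the surviving candidates against the transactions.
        frequent = {c for c in candidates
                    if sum(1 for t in transactions if c <= t) >= min_count}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

With the eight weather transactions below (represented as sets of items) and min_count = 3, this returns itemsets such as {Drizzle, Thunder} (support count 4) and {Thunder, Rainy, Storm} (support count 3).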

Predicting the Weather from the Dummy Data Using Association Rule Mining
In this case, I would like to use the first step of mining association rules to analyze the dummy weather data. The first step offers three strategies for solving the problem, and I will use two of them (the Apriori principle and FP-Growth).
I will work through the dummy data from the table above, using a minimum support of 30%.
Calculate the minimum support count: 30% × 8 transactions = 2.4, so an item must appear in at least 3 transactions to count as frequent.
Frequency of occurrence (the letters are the labels used in the priority table; the item names follow from the data):

Item        | Frequency
Rainy (A)   | 5
Drizzle (B) | 6
Windy (C)   | 3
Thunder (D) | 6
Storm (E)   | 4
Prioritize the items (rank 1 = most frequent; the Drizzle/Thunder tie is broken in favour of Drizzle):

Item        | Frequency | Priority
Rainy (A)   | 5         | 3
Drizzle (B) | 6         | 1
Windy (C)   | 3         | 5
Thunder (D) | 6         | 2
Storm (E)   | 4         | 4
Order the items in each transaction according to the priority:

TID | Items                                 | Ordered Items
 1  | Storm, Rainy, Thunder, Drizzle        | Drizzle, Thunder, Rainy, Storm
 2  | Thunder, Rainy, Windy, Storm, Drizzle | Drizzle, Thunder, Rainy, Storm, Windy
 3  | Windy, Rainy, Drizzle, Storm          | Drizzle, Rainy, Storm, Windy
 4  | Drizzle, Rainy, Thunder               | Drizzle, Thunder, Rainy
 5  | Thunder                               | Thunder
 6  | Thunder, Drizzle                      | Drizzle, Thunder
 7  | Rainy, Thunder, Storm                 | Thunder, Rainy, Storm
 8  | Drizzle, Windy                        | Drizzle, Windy
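Continuing the sketch above (reusing transactions, freq, min_support, and priority), each transaction is filtered to the frequent items and re-sorted by priority:

```python
# Keep only items meeting the minimum support count, then sort each
# transaction by the priority computed above.
frequent_items = {item for item, count in freq.items() if count >= min_support}

ordered = [
    sorted((item for item in t if item in frequent_items), key=priority.get)
    for t in transactions
]
# ordered[0] -> ['Drizzle', 'Thunder', 'Rainy', 'Storm']
# Here every item clears the 2.4 threshold, so nothing is dropped.
```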
FP-Tree
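A minimal sketch of how the FP-tree is built from the ordered transactions: each transaction is inserted as a path from the root, sharing common prefixes with earlier transactions and incrementing node counts along the way. (A full FP-Growth implementation would also keep a header table linking all nodes of each item.)

```python
class FPNode:
    """One node of the FP-tree: an item, a count, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each priority-ordered transaction as a path from the root."""
    root = FPNode(None)
    for t in ordered_transactions:
        node = root
        for item in t:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
            node.count += 1
    return root

tree = build_fp_tree(ordered)   # `ordered` comes from the previous sketch
# The root ends up with two children: Drizzle (count 6) and Thunder (count 2).
```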
Validation



REFERENCES
http://searchbusinessanalytics.techtarget.com/definition/association-rules-in-data-mining
http://www.belajaringgris.net/weather-vocabulary-3252.html
http://www.hypertextbookshop.com/dataminingbook/working_version/contents/chapters/chapter002/section002/blue/page001.html

Sunday, November 13, 2016

SUMMARY OF WORKSHOP 1: DATA EXCHANGE IN A DIGITAL WORLD


INTRODUCTION TO BIG DATA
Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it's not the amount of data that's important; it's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
The 5 Vs of Big Data
1.      Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter messages, photos, video clips, sensor data etc. we produce and share every second. For example, on Facebook we send 10 billion messages per day, click the "Like" button 4.5 billion times and upload 350 million new pictures each and every day.
2.      Velocity refers to the speed at which new data is generated and the speed at which data moves around. For example, social media messages can go viral in seconds.
3.      Value: it is all well and good having access to big data, but unless we can turn it into value it is useless.
4.      Veracity refers to the messiness or trustworthiness of the data.
5.      Variety refers to the different types of data we can now use.
BENEFITS OF BIG DATA
Research: researching big data in depth and collecting the data to be analyzed.
Business Intelligence: the analyzed data becomes business information that can be used to make decisions.
New Business Opportunities: with more knowledge about business information and the ecosystem of the digital world, there is big potential for new business.

Monday, October 24, 2016

PREDICTION MODEL WITH RAPIDMINER

RAPIDMINER
I use RapidMiner to analyse the election (data pemilu) dataset and make a prediction model of legislative candidate electability (Elektabilitas Caleg).
RapidMiner is a software platform developed by the company of the same name that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and it supports all steps of the data mining process including data preparation, results visualization, validation and optimization. RapidMiner is developed on an open-core model. The free RapidMiner Basic Edition, which is limited to 1 logical processor and 10,000 data rows, is available under the AGPL license.
I am going to create a prediction model of legislative candidate electability using the data set (datapemilukpu.xls) that the lecturer gave us in the chapter 7 slides. There are many algorithms and operators available in RapidMiner, but for this prediction I will use three main algorithms: Decision Tree (C4.5), Naïve Bayes (NB) and k-Nearest Neighbor (k-NN). I am creating the prediction model in order to predict whether a legislative candidate is going to be elected or not.

PREDICTION MODEL

1.      DECISION TREE (C4.5)
For the first algorithm, we will use a Decision Tree as our model; it generates classifications from both nominal and numerical data. In RapidMiner, an attribute with the label role is predicted by the Decision Tree operator. According to the RapidMiner website, each interior node of the tree corresponds to one of the input attributes. The number of edges of a nominal interior node is equal to the number of possible values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled with disjoint ranges. Each leaf node represents a value of the label attribute, given the values of the input attributes represented by the path from the root to the leaf.
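Before looking at the RapidMiner workspace itself, an analogous model can be sketched in Python with scikit-learn. Two hedges: DecisionTreeClassifier is CART-based rather than a true C4.5 (the entropy criterion only approximates its flavour), and the "elected" label column is an invented stand-in for the real field in datapemilukpu.xls:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_excel("datapemilukpu.xls")   # the dataset from the lecture slides

# "elected" is an assumed label column; features are assumed numeric here
# (categorical columns would need encoding first).
X = df.drop(columns=["elected"])
y = df["elected"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
model = DecisionTreeClassifier(criterion="entropy")  # entropy split, C4.5-like
model.fit(X_train, y_train)
print(model.score(X_test, y_test))        # accuracy on the held-out 30%
```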
Example of the decision tree model's result and workspace in RapidMiner




2.      NAIVE BAYES
According to the RapidMiner website, a Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be 'independent feature model'. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class (i.e. attribute) is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
The advantage of the Naive Bayes classifier is that it only requires a small amount of training data to estimate the means and variances of the variables necessary for classification. Because independent variables are assumed, only the variances of the variables for each label need to be determined and not the entire covariance matrix.
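A minimal sketch of that idea in Python, with invented fruit data mirroring the apple example; GaussianNB estimates exactly the per-feature means and variances described above, never a covariance matrix:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented training data: [is_red, is_round, diameter_in_inches].
X = np.array([[1, 1, 4.0],
              [1, 1, 3.8],
              [0, 1, 4.1],
              [1, 0, 6.0],
              [0, 0, 1.0]])
y = np.array([1, 1, 1, 0, 0])      # 1 = apple, 0 = not an apple

clf = GaussianNB().fit(X, y)       # one mean and variance per feature per class
print(clf.predict([[1, 1, 4.2]]))  # [1]: red, round, about 4 inches -> apple
```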


3.      K-Nearest Neighbor (KNN)
From the RapidMiner website: the k-Nearest Neighbor operator generates a model from the input ExampleSet; this model can be a classification or a regression model depending on the input ExampleSet. The k-Nearest Neighbor algorithm is based on learning by analogy, that is, by comparing a given test example with training examples that are similar to it. The training examples are described by n attributes, so each example represents a point in an n-dimensional space, and all of the training examples are stored in an n-dimensional pattern space. When given an unknown example, a k-nearest neighbor algorithm searches the pattern space for the k training examples that are closest to the unknown example. These k training examples are the k "nearest neighbors" of the unknown example. "Closeness" is defined in terms of a distance metric, such as the Euclidean distance.
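A minimal k-NN sketch with invented 2-D points; scikit-learn's KNeighborsClassifier uses Euclidean distance by default, matching the notion of "closeness" above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented points in a 2-dimensional pattern space, two well-separated classes.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest training examples
knn.fit(X, y)                              # "training" just stores the points
print(knn.predict([[4.5, 5.0]]))           # [1]: its 3 nearest neighbours are class 1
```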






REFERENCES

https://en.wikipedia.org/wiki/RapidMiner