Monday, October 24, 2016

PREDICTION MODEL WITH RAPIDMINER

RAPIDMINER
I use RapidMiner to analyse the election dataset (data pemilu) and to build a model that predicts the electability of legislative candidates (elektabilitas caleg).
RapidMiner is a software platform developed by the company of the same name that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and it supports all steps of the data mining process, including data preparation, results visualization, validation and optimization. RapidMiner is developed on an open core model. The free RapidMiner Basic Edition, which is limited to 1 logical processor and 10,000 data rows, is available under the AGPL license.
I am going to build a model that predicts the electability of legislative candidates, using the dataset (datapemilukpu.xls) that our lecturer linked in the chapter 7 slides. RapidMiner offers many algorithms and operators, but for this prediction I will use three main algorithms: Decision Tree (C4.5), Naïve Bayes (NB) and K-Nearest Neighbor (K-NN). The goal of the model is to predict whether a legislative candidate will be elected or not.

PREDICTION MODEL

1.      DECISION TREE (C4.5)
The first algorithm is the Decision Tree, which builds a classification model from both nominal and numerical data. In RapidMiner, the Decision Tree operator predicts the attribute that has the label role. According to the RapidMiner website, each interior node of the tree corresponds to one of the input attributes. The number of edges of a nominal interior node is equal to the number of possible values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled with disjoint ranges. Each leaf node represents a value of the label attribute given the values of the input attributes represented by the path from the root to the leaf.
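To make the tree structure described above more concrete, here is a minimal Python sketch. It is not RapidMiner's implementation and it does not learn a tree from data; it only shows how a finished tree is read from the root to a leaf. The attribute names (incumbent, votes), the threshold and the labels are hypothetical and are not taken from datapemilukpu.xls.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    label: str        # value of the label attribute


@dataclass
class NominalNode:
    attribute: str    # input attribute tested at this node
    children: dict    # one outgoing edge per possible nominal value


@dataclass
class NumericNode:
    attribute: str    # input attribute tested at this node
    threshold: float  # splits the values into two disjoint ranges
    low: object       # subtree for values <= threshold
    high: object      # subtree for values > threshold


def predict(tree, example):
    """Follow the path from the root to a leaf and return the leaf's label."""
    node = tree
    while not isinstance(node, Leaf):
        value = example[node.attribute]
        if isinstance(node, NominalNode):
            node = node.children[value]
        else:
            node = node.low if value <= node.threshold else node.high
    return node.label


# Hypothetical tree: attribute names, threshold and labels are made up.
toy_tree = NominalNode("incumbent", {
    "yes": Leaf("elected"),
    "no": NumericNode("votes", 5000.0,
                      low=Leaf("not elected"),
                      high=Leaf("elected")),
})

print(predict(toy_tree, {"incumbent": "no", "votes": 7200}))  # elected
```

The nominal node has exactly one edge per value ("yes" and "no"), while the numerical node splits its values into the two disjoint ranges "at most 5000" and "more than 5000".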
Example of the Decision Tree model’s result and workspace in RapidMiner

 

 



2.      NAIVE BAYES
According to the RapidMiner website, a Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be ‘independent feature model’. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class (i.e. attribute) is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
The advantage of the Naive Bayes classifier is that it only requires a small amount of training data to estimate the means and variances of the variables necessary for classification. Because independent variables are assumed, only the variances of the variables for each label need to be determined and not the entire covariance matrix.
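The point about means and variances can be illustrated with a small Gaussian Naive Bayes sketch in plain Python. This is only an assumption about how such a classifier could be written for numerical attributes, not the code behind RapidMiner's operator. The two numerical features and the labels "yes"/"no" are invented toy values.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each label."""
    by_label = defaultdict(list)
    for row, label in zip(rows, labels):
        by_label[label].append(row)

    model = {}
    for label, group in by_label.items():
        columns = list(zip(*group))  # one tuple of values per feature
        means = [sum(col) / len(col) for col in columns]
        # Small constant keeps a variance from being exactly zero.
        variances = [
            sum((x - m) ** 2 for x in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(group) / len(rows), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_nb(model, row):
    """Pick the label with the highest log posterior score."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log likelihoods are summed.
        for x, m, v in zip(row, means, variances):
            score += log_gaussian(x, m, v)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data (made up): two numerical features per example.
rows = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2),
        (3.0, 4.0), (3.2, 4.1), (2.8, 3.9)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

model = fit_gaussian_nb(rows, labels)
print(predict_nb(model, (1.1, 2.0)))
print(predict_nb(model, (3.1, 4.0)))
```

For each label the model stores only one mean and one variance per feature. No covariance between features is estimated, which is why a small training set can be enough.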

 

 

 

 

3.      K-NEAREST NEIGHBOR (K-NN)
According to the RapidMiner website, the k-NN operator generates a k-Nearest Neighbor model from the input ExampleSet. This model can be a classification or a regression model, depending on the input ExampleSet. The k-Nearest Neighbor algorithm is based on learning by analogy, that is, by comparing a given test example with training examples that are similar to it. The training examples are described by n attributes. Each example represents a point in an n-dimensional space. In this way, all of the training examples are stored in an n-dimensional pattern space. When given an unknown example, a k-nearest neighbor algorithm searches the pattern space for the k training examples that are closest to the unknown example. These k training examples are the k “nearest neighbors” of the unknown example. “Closeness” is defined in terms of a distance metric, such as the Euclidean distance.
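The search for the k nearest neighbors can be sketched in a few lines of plain Python. This is a simplified illustration for classification with the Euclidean distance and a majority vote, not RapidMiner's implementation. The training points and labels are invented toy values.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set (made up): each example is a point in 2-dimensional space.
train_points = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2),
                (3.0, 4.0), (3.2, 4.1), (2.8, 3.9)]
train_labels = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(train_points, train_labels, (1.0, 2.1), k=3))
print(knn_predict(train_points, train_labels, (3.0, 4.0), k=3))
```

An odd k is used here so that a vote between two labels cannot end in a tie. The sketch sorts all training examples for every query, which is fine for a small dataset but slow for a large one.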






REFERENCES

https://en.wikipedia.org/wiki/RapidMiner