Introduction to R for Data Mining
2013 Webinar Series
Joseph B. Rickert
February 14, 2013
First Polling Question
Revolution Confidential
What is your favorite data mining software tool?
1. R
2. SAS
3. MapReduce
4. Weka
5. Other
My goal for today’s webinar is to convince you that:
Seriously, it is not difficult to learn enough R to do some serious data mining
R is a serious platform for data mining
Revolution R Enterprise is the platform for serious data mining
A word about Data Mining
We assume that you know a little bit about data mining and this is your context for learning R
Data Mining
Applications: Credit Scoring, Fraud Detection, Ad Optimization, Targeted Marketing, Gene Detection, Recommendation systems, Social Networks
Actions: Acquire Data, Prepare, Classify, Predict, Visualize, Optimize, Interpret
Algorithms: CART, Random Forests, SVM, KMeans, Hierarchical clustering, Ensemble Techniques
Getting Orientated
WHAT IS R?
R is:
The way to do statistical computing
A full-blown programming language
The home of nearly every data mining algorithm known to data science
A vibrant worldwide community
R was written in the early 1990s by Robert Gentleman and Ross Ihaka.
Since 1997, a core group of about 20 developers has guided the evolution of the language.
R is organized into libraries of functions called packages
R Package Growth
4,332 packages as of 2/13/13
CRAN: R download (base and recommended packages) and user-contributed packages
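The split between base, recommended, and user-contributed packages can be inspected from any R session. The following is a minimal sketch that assumes only a standard R installation; the variable names are illustrative, and the install.packages call is commented out because it needs network access.

```r
# List installed packages with their priority: "base", "recommended", or NA
ip <- installed.packages()

base_pkgs        <- rownames(ip)[ip[, "Priority"] %in% "base"]
recommended_pkgs <- rownames(ip)[ip[, "Priority"] %in% "recommended"]
contributed_pkgs <- rownames(ip)[is.na(ip[, "Priority"])]

# How many packages of each kind are installed on this machine
c(base        = length(base_pkgs),
  recommended = length(recommended_pkgs),
  contributed = length(contributed_pkgs))

# A user-contributed package is installed from CRAN once, then loaded per session
# install.packages("randomForest")   # needs network access
library(stats)                        # base package, always available
```

The counts will differ from machine to machine, since they depend on what has been installed locally.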
Finding Your Way Around the World of R
Machine Learning, Data Mining, Visualization
Finding Packages: Task Views, crantastic.org
Blogs: Revolutions, R-Bloggers
Getting Help: Quick-R, Inside-R
Finding R People: User Groups worldwide, Twitter: #rstats
Learning R
THE STRUCTURE OF R FACILITATES LEARNING
Learning R?
Levels of R Skill, on the Malcolm Gladwell “Outlier” Scale (10 to 10,000 hours of use):
R aware: Use a GUI
R user: Use R functions
R developer: Write code and algorithms
R contributor: Write an R package
R programmer: Write production grade code
Basic Machine Learning Functions
Category     Function      Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
Cluster      kmeans        stats         Kmeans clustering
Classifiers  glm           stats         Logistic Regression
Classifiers  rpart         rpart         Recursive partitioning and regression trees
Classifiers  ksvm          kernlab       Support Vector Machine
Classifiers  apriori       arules        Rule based classification
Ensemble     ada           ada           Stochastic boosting
Ensemble     randomForest  randomForest  Random Forests classification and regression
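To give a feel for how little code these functions need, here is a minimal sketch that calls four of them on data sets that ship with R. It is not one of the webinar scripts: the choice of iris and mtcars, the model formulas, and the variable names are all assumptions made for illustration. Only the stats and rpart packages are used, since both come with a standard R installation.

```r
library(rpart)   # recommended package shipped with R

features <- iris[, 1:4]   # four numeric measurements per flower

# kmeans: partition the flowers into 3 clusters
set.seed(123)    # kmeans starts from random centers
km <- kmeans(features, centers = 3, nstart = 10)
table(km$cluster, iris$Species)

# hclust: hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(features), method = "complete")
hc_groups <- cutree(hc, k = 3)

# glm: logistic regression of transmission type on car weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit_glm)$coefficients

# rpart: classification tree predicting species from all four measurements
fit_tree <- rpart(Species ~ ., data = iris, method = "class")
tree_pred <- predict(fit_tree, iris, type = "class")
tree_acc <- mean(tree_pred == iris$Species)   # accuracy on the training data
```

The accuracy computed at the end is measured on the same rows used for fitting, so it is optimistic; a held-out test set would give a fairer estimate.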
Noteworthy Data Mining Packages
Package  Comment
caret    Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
rattle   A very intuitive GUI for data mining that produces useful R code
Doing a lot with a little R
TIME TO RUN SOME CODE
Script
1. GETTING STARTED.R
2. ROLL with RATTLE.R
3. IN THE TREES.R
4. INTRO to CARET.R
5. BIG DATA with RevoScaleR.R
6. WORDCLOUD.R
The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529
Second Polling Question
What are your favorite data mining techniques?
1. Clustering techniques such as K-means
2. Single model classifiers such as decision trees or SVMs
3. Ensemble classifiers such as Random Forests or boosting models
4. Text mining techniques
5. Other
Third Polling Question (insert after running script IN THE TREES)
What kind of data do you analyze?
1. Financial data
2. Customer data (e.g. for recommendations)
3. Website data (e.g. for ads)
4. Health Care data
5. Other
Working with Big Data
RevoScaleR and Revolution R Enterprise
Too Big for Open Source R
mortDF <- rxXdfToDataFrame(mdata, maxRowsByCols = 300000000)
model <- glm(default ~ ., data = mortDF, family = "binomial")
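A rough size estimate suggests why the in-memory approach on this slide runs into trouble. The sketch below assumes every cell is stored as an 8-byte double, which is an assumption and not something the slide states; only the 300000000 figure comes from the slide.

```r
cells <- 300000000        # maxRowsByCols value from the slide
bytes_per_cell <- 8       # assumption: every column is a double

# Estimated size of the data frame alone, in GiB
est_gib <- cells * bytes_per_cell / 1024^3
round(est_gib, 2)

# Sanity check on the 8-byte assumption using a small numeric vector
bytes_per_double <- as.numeric(object.size(numeric(1e6))) / 1e6
bytes_per_double
```

This counts only the data frame. glm() also builds a model frame and a model matrix from it, so peak memory use during fitting is higher than this estimate.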
RevoScaleR brings the power of Big Data to R
Distributed Statistical Algorithms: Parallel External Memory Algorithms that are distributed among available compute resources (cores & computers) independent of platform
Communications Framework: Abstracted layer for providing communication between compute nodes in a cluster (MPI, MapReduce, InDatabase)
Data Source API: API for integrating external data sources (files, databases, HDFS) that provides optimized reading of rows and columns in blocks
R Language Interface: Familiar, high-productivity programming paradigm for R users
RevoScaleR PEMAs: Parallel External Memory Algorithms
R based algorithms
Work on blocks of data
Inherently parallel and distributed
Do not require all data to be in memory at one time
Can deal with distributed and streaming data
Read blocks and compute intermediate results in parallel, iterating as necessary
[Figure: an XDF file divided into blocks (Block 1, Block 2, Block i, Block i+1, Block i+2), each producing block results, through to the results from the last block, over a 1st, 2nd, and 3rd pass]
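The block-wise idea behind external memory algorithms can be sketched in base R. The example below is an illustration under stated assumptions and is not RevoScaleR's implementation: the data are simulated, the blocks are slices of an in-memory data frame standing in for blocks of a file on disk, and the names block_stats and partials are hypothetical. It fits a linear regression by computing the cross-products X'X and X'y for each block and summing them, so no step needs more than one block at a time.

```r
set.seed(42)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n, sd = 0.1)

# Row indices grouped into blocks of 100 rows each
block_size <- 100
blocks <- split(seq_len(n), ceiling(seq_len(n) / block_size))

# Intermediate results for one block: X'X and X'y
block_stats <- function(rows) {
  X <- cbind(1, dat$x1[rows], dat$x2[rows])   # design matrix with intercept
  y <- dat$y[rows]
  list(xtx = crossprod(X), xty = crossprod(X, y))
}

# Each block is processed independently; lapply could be replaced by a
# parallel equivalent because no block depends on another
partials <- lapply(blocks, block_stats)

# Combine the intermediate results by summing across blocks
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))
beta_blocks <- drop(solve(xtx, xty))

# Compare with the usual all-in-memory fit
beta_lm <- unname(coef(lm(y ~ x1 + x2, data = dat)))
all.equal(beta_blocks, beta_lm)
```

Linear regression needs only one pass over the blocks. Models fitted iteratively, such as logistic regression, would repeat the block pass once per iteration, which matches the "iterating as necessary" note on the slide.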
More than code, R is a community
WHERE TO GO FROM HERE?
Continuing to Learn R
Resources
RevoJoe: How to Learn R
More R Documentation
The R Journal, Books, Reference Card and more
Examples
Thomson Nguyen on the Heritage Health Prize
Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)
Classes
Coursera
Revolution Analytics
Some Books
The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529