14Feb13 Intro to R for Data Mining


Revolution Confidential

Introduction to R for Data Mining
2013 Webinar Series
Joseph B. Rickert, February 14, 2013


First Polling Question

What is your favorite data mining software tool?
1. R
2. SAS
3. MapReduce
4. Weka
5. Other

My goal for today's webinar is to convince you that:

Seriously, it is not difficult to learn enough R to do some serious data mining

R is a serious platform for data mining

Revolution R Enterprise is the platform for serious data mining


A word about Data Mining
We assume that you know a little bit about data mining and this is your context for learning R


Data Mining

Applications            Actions        Algorithms
Credit Scoring          Acquire Data   CART
Fraud Detection         Prepare        Random Forests
Ad Optimization         Classify       SVM
Targeted Marketing      Predict        KMeans
Gene Detection          Visualize      Hierarchical clustering
Recommendation systems  Optimize       Ensemble Techniques
Social Networks         Interpret


Getting Orientated

WHAT IS R?

R is:

- The way to do statistical computing
- A full-blown programming language
- The home of nearly every data mining algorithm known to data science
- A vibrant worldwide community

R was written in the early 1990s by Robert Gentleman and Ross Ihaka.

Since 1997, a core group of about 20 developers has guided the evolution of the language.

R is organized into libraries of functions called packages

R Package Growth: 4,332 packages as of 2/13/13

- CRAN R download
- Base
- Recommended packages
- User contributed packages
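The split between base, recommended, and user contributed packages can be seen on any R installation. The following is a minimal sketch, assuming only a standard R install; the label "contributed" is a name chosen here for illustration, not something R reports.

```r
# Classify installed packages by the "Priority" field that R records for each one.
# Base and recommended packages ship with the R download; packages with no
# priority are user contributed packages installed separately (e.g. from CRAN).
pkgs <- installed.packages()
priority <- pkgs[, "Priority"]
priority[is.na(priority)] <- "contributed"   # illustrative label, not an R term
print(table(priority))

# Functions from an attached package can be called directly;
# any installed package can also be reached with the package::function form.
library(stats)
x <- c(2, 4, 4, 5)
stats::median(x)
```

The counts will differ from machine to machine, since they depend on what has been installed locally.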

Finding Your Way Around the World of R

- Machine Learning
- Data Mining
- Visualization

Finding Packages: Task Views, crantastic.org
Blogs: Revolutions, R-Bloggers
Getting Help: Quick-R, Inside-R
Finding R People: User Groups worldwide; Twitter: #rstats


Learning R

THE STRUCTURE OF R FACILITATES LEARNING

Learning R?

Levels of R Skill

Level          Skill
R programmer   Write production grade code
R contributor  Write an R package
R developer    Write code and algorithms
R user         Use R functions
R aware        Use a GUI

Hours of use: 10 to 10,000 (the Malcolm Gladwell "Outlier" Scale)

Basic Machine Learning Functions

Group        Function      Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
Cluster      kmeans        stats         Kmeans clustering
Classifiers  glm           stats         Logistic Regression
Classifiers  rpart         rpart         Recursive partitioning and regression trees
Classifiers  ksvm          kernlab       Support Vector Machine
Classifiers  apriori       arules        Rule based classification
Ensemble     ada           ada           Stochastic boosting
Ensemble     randomForest  randomForest  Random Forests classification and regression
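As a rough illustration of how little code these functions need, here is a minimal sketch using only the three functions from the stats package (hclust, kmeans, glm) and the built-in iris data set. The choice of data set, the number of clusters, and the 0.5 cutoff are assumptions made for this example, not part of the webinar scripts.

```r
# Uses only the stats package and the built-in iris data set (both ship with R).
set.seed(42)                              # kmeans starts from random centers
features <- iris[, 1:4]

# Cluster: k-means with 3 clusters, compared against the known species
km <- kmeans(features, centers = 3, nstart = 10)
print(table(cluster = km$cluster, species = iris$Species))

# Cluster: hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(features), method = "complete")
groups <- cutree(hc, k = 3)

# Classifier: logistic regression on a two-class subset of the data
two_class <- droplevels(subset(iris, Species != "setosa"))
fit <- glm(Species ~ Petal.Length + Petal.Width,
           data = two_class, family = binomial)
prob <- predict(fit, type = "response")   # P(second factor level = virginica)
pred <- ifelse(prob > 0.5, "virginica", "versicolor")
print(mean(pred == two_class$Species))    # training accuracy, not a test estimate
```

The other functions in the table follow the same pattern of a formula or data matrix in and a fitted model object out, but they require installing their packages first.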


Noteworthy Data Mining Packages

Package  Comment
caret    Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
rattle   A very intuitive GUI for data mining that produces useful R code

Doing a lot with a little R

TIME TO RUN SOME CODE

Script
1 GETTING STARTED.R
2 ROLL with RATTLE.R
3 IN THE TREES.R
4 INTRO to CARET.R
5 BIG DATA with RevoScaleR.R
6 WORDCLOUD.R

The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529

Second Polling Question

What are your favorite data mining techniques?
1. Clustering techniques such as K-means
2. Single model classifiers such as decision trees, or SVMs
3. Ensemble classifiers such as Random Forests or boosting models
4. Text mining techniques
5. Other

Third Polling Question (insert after running script IN THE TREES)

What kind of data do you analyze?
1. Financial data
2. Customer data (e.g. for recommendations)
3. Website data (e.g. for ads)
4. Health Care data
5. Other

Working with Big Data
RevoScaleR and Revolution R Enterprise

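Too Big for Open Source R

# Pull the entire XDF data set into an in-memory data frame
mortDF <- rxXdfToDataFrame(mdata, maxRowsByCols = 300000000)
# Fit a logistic regression on the in-memory copy of the data
model <- glm(default ~ ., data = mortDF, family = "binomial")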
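Simple arithmetic gives a rough sense of why this approach runs out of room. The sketch below assumes every cell is stored as an 8-byte double, which is an assumption made for illustration; the actual column types of the mortgage data are not stated here.

```r
# Assumption: every cell is a double (8 bytes); real column types may differ.
max_cells <- 300000000                   # the maxRowsByCols value used above
bytes_per_double <- 8
gib <- max_cells * bytes_per_double / 1024^3
print(gib)                               # size of the data frame alone, in GiB

# The per-element cost can be checked on a small vector.
small <- numeric(1e6)
print(as.numeric(object.size(small)) / 1e6)   # roughly 8 bytes per element
```

glm() also builds a model frame and a model matrix in memory, so peak usage is typically several times the size of the data frame itself.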

RevoScaleR brings the power of Big Data to R

Distributed Statistical Algorithms
  Parallel External Memory Algorithms that are distributed among available compute resources (cores & computers), independent of platform

Communications Framework
  Abstracted layer for providing communication between compute nodes in a cluster (MPI, MapReduce, InDatabase)

Data Source API
  API for integrating external data sources (files, databases, HDFS) that provides optimized reading of rows and columns in blocks

R Language Interface
  Familiar, high-productivity programming paradigm for R users

RevoScaleR PEMAs: Parallel External Memory Algorithms

[Figure: an XDF file divided into blocks (Block 1, Block 2, ..., Block i, Block i+1, Block i+2). Blocks are read and intermediate results are computed in parallel, iterating as necessary over a 1st, 2nd, and 3rd pass, ending with the results from the last block.]

- R based algorithms
- Work on blocks of data
- Inherently parallel and distributed
- Do not require all data to be in memory at one time
- Can deal with distributed and streaming data
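The block-wise idea can be sketched in plain R. This is an illustration of the general technique, not RevoScaleR's implementation: the blocks are simulated by splitting an in-memory data frame, the data are synthetic, and all variable names are hypothetical. It fits a linear regression by accumulating small cross-product matrices block by block, so no step needs the full data set at once.

```r
# Illustration only: blocks are simulated by splitting an in-memory data frame.
# A real external memory algorithm would read each block from disk instead.
set.seed(1)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

block_size <- 1000
block_id <- ceiling(seq_len(n) / block_size)
blocks <- split(dat, block_id)            # stand-in for the blocks of a file

# Step 1: compute an intermediate result for each block independently.
# (lapply could be swapped for a parallel map, since blocks do not interact.)
block_stats <- lapply(blocks, function(b) {
  X <- cbind(intercept = 1, x1 = b$x1, x2 = b$x2)
  list(xtx = crossprod(X), xty = crossprod(X, b$y))
})

# Step 2: combine the intermediate results; only small matrices are held here.
xtx <- Reduce(`+`, lapply(block_stats, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(block_stats, `[[`, "xty"))

# Step 3: solve the normal equations for the regression coefficients.
beta_blocks <- drop(solve(xtx, xty))

# Check against the usual in-memory fit.
beta_lm <- coef(lm(y ~ x1 + x2, data = dat))
print(all.equal(unname(beta_blocks), unname(beta_lm)))
```

Linear regression needs only one pass over the blocks. Iterative models such as logistic regression repeat steps 1 and 2 once per iteration, which corresponds to the multiple passes in the figure.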



More than code, R is a community

WHERE TO GO FROM HERE?

Continuing to Learn R

Resources
- RevoJoe: How to Learn R
- More R Documentation
- The R Journal
- Books
- Reference Card and more

Examples
- Thomson Nguyen on the Heritage Health Prize
- Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
- Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
- Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)

Classes
- Coursera
- Revolution Analytics

Some Books


The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529

