Spark Summit 2013: Spark Streaming, Real-Time Big Data Processing

Spark Streaming
Real-time big-data processing


Tathagata Das (TD)


UC BERKELEY
What is Spark Streaming?
• Extends Spark for doing big data stream processing
• Project started in early 2012, alpha released in Spring 2013 with Spark 0.7
• Moving out of alpha in Spark 0.9

[Diagram: the Spark stack, with Spark Streaming, GraphX, Shark, MLlib, and BlinkDB built on the Spark core]
Why Spark Streaming?
Many big-data applications need to process large data streams in real time:
• Website monitoring
• Fraud detection
• Ad monetization
Why Spark Streaming?
Need a framework for big data stream processing that:
• Scales to hundreds of nodes
• Achieves second-scale latencies
• Efficiently recovers from failures
• Integrates with batch and interactive processing

Integration with Batch Processing
• Many environments require processing the same data both in live streaming and in batch post-processing

• Existing frameworks cannot do both
  - Either stream processing of 100s of MB/s with low latency
  - Or batch processing of TBs of data with high latency

• Extremely painful to maintain two different stacks
  - Different programming models
  - Double implementation effort
Stateful Stream Processing
• Traditional model
  - Processing pipeline of nodes
  - Each node maintains mutable state
  - Each input record updates the state and new records are sent out

• Mutable state is lost if a node fails

• Making stateful stream processing fault tolerant is challenging!

[Diagram: input records flow through a pipeline of nodes (node 1, node 2, node 3), each holding mutable state]
Existing Streaming Systems
• Storm
  - Replays a record if it is not processed by a node
  - Processes each record at least once
  - May update mutable state twice!
  - Mutable state can be lost due to failure!

• Trident – uses transactions to update state
  - Processes each record exactly once
  - Per-state transactions to an external database are slow
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches

[Diagram: a live data stream enters Spark Streaming, which chops it into batches of X seconds; Spark processes the batches and emits the processed results]
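As a minimal sketch of this batching model (assuming the Spark 0.8/0.9-era API; the master URL and app name below are placeholders), the batch interval is fixed when the StreamingContext is created:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Chop the live stream into 1-second batches; "local[2]" and the app name
// are stand-ins for this sketch
val ssc = new StreamingContext("local[2]", "BatchingSketch", Seconds(1))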
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
• Batch sizes as low as ½ second, latency of about 1 second
• Potential for combining batch processing and stream processing in the same system

[Diagram: the same pipeline as above: live data stream, batches of X seconds, processed results]
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()

DStream: a sequence of RDDs representing a stream of data

[Diagram: the Twitter Streaming API feeds the tweets DStream; each batch (@ t, t+1, t+2) is stored in memory as an RDD (immutable, distributed)]
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap is applied to each batch of the tweets DStream, creating new RDDs of the hashTags DStream (e.g. [#cat, #dog, …]) for every batch]
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: pushes data to external storage

[Diagram: flatMap then save on each batch; every batch of the hashTags DStream is saved to HDFS]
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })

foreach: do whatever you want with the processed data, e.g. write to a database or update an analytics UI

[Diagram: flatMap then foreach on each batch of the hashTags DStream]
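Putting the example together, a minimal end-to-end sketch (assuming the same ssc.twitterStream() call used on these slides, with a simple hashtag extractor standing in for the getTags helper; saveAsTextFiles is used because it needs no extra output-format parameters):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import twitter4j.Status

// Stand-in for the getTags helper used on the slides
def getTags(status: Status): Seq[String] =
  status.getText.split(" ").filter(_.startsWith("#")).toSeq

val ssc = new StreamingContext("local[4]", "TwitterHashTags", Seconds(1))
val tweets = ssc.twitterStream()                          // DStream of twitter4j Status objects
val hashTags = tweets.flatMap(status => getTags(status))  // DStream of hashtag strings
hashTags.saveAsTextFiles("hdfs://...")                    // one output directory per batch
ssc.start()                                               // nothing runs until start() is called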
Demo
Java Example

Scala

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java

JavaDStream<Status> tweets = ssc.twitterStream();                      // DStream of data
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { });  // function object instead of a closure
hashTags.saveAsHadoopFiles("hdfs://...");
Window-based Transformations

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation: window length of 1 minute, sliding interval of 5 seconds
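To go from these windowed counts to the most popular hashtags (the goal of today's tutorial), one sketch is to sort each window's counts with transform; tagCounts is as defined above, the import supplies the pair-RDD sortByKey, and take(5) is an arbitrary cut-off for illustration:

import org.apache.spark.SparkContext._   // pair-RDD implicits (sortByKey)

// Swap to (count, hashtag) and sort each window's RDD by count, descending
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                          .transform(rdd => rdd.sortByKey(ascending = false))

// Print the top 5 hashtags of every window
sortedTags.foreach(rdd => println(rdd.take(5).mkString(", ")))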
Arbitrary Stateful Computations

Specify a function to generate new state based on the previous state and new data
- Example: Maintain per-user mood as state, and update it with their tweets

def updateMood(newTweets, lastMood) => newMood

moods = tweetsByUser.updateStateByKey(updateMood _)
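A more concrete sketch of the same mechanism, here keeping a running count per hashtag instead of a per-user mood (the Option encodes "no previous state"; names and the checkpoint path placeholder are illustrative):

import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits (updateStateByKey)

// New state = previous count (if any) + number of new occurrences in this batch
def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(runningCount.getOrElse(0) + newValues.sum)

ssc.checkpoint("hdfs://...")   // updateStateByKey requires a checkpoint directory
val totalTagCounts = hashTags.map(tag => (tag, 1)).updateStateByKey(updateCount _)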

Arbitrary Combinations of Batch and Streaming Computations

Inter-mix RDD and DStream operations!
- Example: Join incoming tweets with a spam HDFS file to filter out bad tweets

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
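A sketch of what the elided join might look like, assuming tweets have been keyed by user ID and the spam file is a list of known-spammer IDs (the path placeholder, the key choice, and the tweetsByUser name are illustrative, not from the slides):

import org.apache.spark.SparkContext._   // pair-RDD implicits (leftOuterJoin, mapValues)

// Static RDD of known-spammer IDs, loaded once and keyed for joining
val spamUserIds = ssc.sparkContext.textFile("hdfs://...").map(userId => (userId, true))

// tweetsByUser: DStream of (userId, tweet) pairs
val cleanTweets = tweetsByUser.transform { tweetsRDD =>
  tweetsRDD.leftOuterJoin(spamUserIds)
           .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty }   // keep users not on the spam list
           .mapValues { case (tweet, _) => tweet }
}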



DStreams + RDDs = Power
• Online machine learning
  - Continuously learn and update data models (updateStateByKey and transform)

• Combine live data streams with historical data
  - Generate historical data models with Spark, etc.
  - Use data models to process the live data stream (transform)

• CEP-style processing
  - Window-based operations (reduceByWindow, etc.)
Input Sources
• Out of the box, we provide
  - Kafka, HDFS, Flume, Akka Actors, raw TCP sockets, etc.

• Very easy to write a receiver for your own data source

• Also, generate your own RDDs from Spark, etc. and push them in as a "stream" (see the sketch below)
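A sketch of that last point using queueStream, which turns a queue of hand-made RDDs into a DStream (the data and timing below are arbitrary test values):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// A queue of RDDs that Spark Streaming consumes one batch at a time
val rddQueue = new mutable.Queue[RDD[Int]]()
val queuedStream = ssc.queueStream(rddQueue)
queuedStream.count().print()   // print how many records arrived in each batch

ssc.start()
for (i <- 1 to 5) {
  rddQueue += ssc.sparkContext.makeRDD(1 to 1000)   // push a new "batch" of data
  Thread.sleep(1000)
}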


Fault-tolerance
• Batches of input data are replicated in memory for fault-tolerance

• Data lost due to worker failure can be recomputed from the replicated input data

• All transformations are fault-tolerant and exactly-once

[Diagram: the tweets RDD (input data replicated in memory) feeds the hashTags RDD via flatMap; lost partitions are recomputed on other workers]
Performance
Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency

[Figure: cluster throughput (GB/s) vs. number of nodes in the cluster for WordCount and Grep, with 1-second and 2-second batch sizes]
Comparison with other systems
Higher throughput than Storm
- Spark Streaming: 670k records/sec/node
- Storm: 115k records/sec/node
- Commercial systems: 100-500k records/sec/node

[Figure: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for WordCount and Grep, comparing Spark and Storm]
Fast Fault Recovery
Recovers from faults/stragglers within 1 sec
Mobile Millennium Project
Traffic transit time estimation using online machine learning on GPS observations

• Markov-chain Monte Carlo simulations on GPS observations
• Very CPU intensive, requires dozens of machines for useful computation
• Scales linearly with cluster size

[Figure: GPS observations processed per second vs. number of nodes in the cluster; throughput scales linearly up to 80 nodes]
Advantage of a unified stack
• Explore data interactively to identify problems

• Use the same code in Spark for processing large logs

• Use similar code in Spark Streaming for real-time processing

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = filtered.map(...)
...

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Roadmap
• Spark 0.8.1
  - Marked alpha, but has been quite stable
  - Master fault tolerance – manual recovery
  - Restart the computation from a checkpoint file saved to HDFS (see the sketch below)

• Spark 0.9 in Jan 2014 – out of alpha!
  - Automated master fault recovery
  - Performance optimizations
  - Web UI and better monitoring capabilities
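A sketch of that manual recovery path, assuming the 0.8.1-era API where ssc.checkpoint(dir) enables periodic checkpointing to HDFS and a StreamingContext can be rebuilt from that directory after a master failure (the path is a placeholder):

// First run: enable checkpointing before starting the computation
val ssc = new StreamingContext("local[4]", "CheckpointedApp", Seconds(1))
ssc.checkpoint("hdfs://...")
// ... define DStream operations ...
ssc.start()

// After a master failure: manually restart from the saved checkpoint
val recovered = new StreamingContext("hdfs://...")
recovered.start()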


Roadmap
• Long-term goals
  - Python API
  - MLlib for Spark Streaming
  - Shark Streaming

• Community feedback is crucial!
  - Helps us prioritize the goals

• Contributions are more than welcome!!


Today's Tutorial
• Process the Twitter data stream to find the most popular hashtags over a window

• Requires a Twitter account
  - Need to set up Twitter OAuth keys to access tweets
  - All the instructions are in the tutorial

• Your account will be safe!
  - No need to enter your password anywhere, only the keys
  - Destroy the keys after the tutorial is done



Conclusion

• Streaming programming guide:
  spark.incubator.apache.org/docs/latest/streaming-programming-guide.html

• Research paper:
  tinyurl.com/dstreams




Thank you!
