Spark Streaming
Real-time big-data processing
Tathagata Das (TD)
UC BERKELEY
What is Spark Streaming?
Extends Spark for big data stream processing
Project started in early 2012, alpha released in Spring 2013 with Spark 0.7
Moving out of alpha in Spark 0.9
[Diagram: the Spark stack – Spark Streaming, Shark, MLlib, GraphX, BlinkDB, … all built on the Spark core]
Why Spark Streaming?
Many big-data applications need to process large data streams in real time:
- Website monitoring
- Fraud detection
- Ad monetization
Why Spark Streaming?
Need a framework for big data stream processing that:
- Scales to hundreds of nodes
- Achieves second-scale latencies
- Recovers efficiently from failures
- Integrates with batch and interactive processing
Integration with Batch Processing
Many environments require processing the same data in live streaming as well as in batch post-processing
Existing frameworks cannot do both
- Either stream processing of 100s of MB/s with low latency
- Or batch processing of TBs of data with high latency
Extremely painful to maintain two different stacks
- Different programming models
- Double the implementation effort
Stateful Stream Processing
Traditional model:
- Processing pipeline of nodes
- Each node maintains mutable state
- Each input record updates the state and new records are sent out
Mutable state is lost if a node fails
Making stateful stream processing fault tolerant is challenging!
[Diagram: input records flowing through nodes 1–3, each node holding mutable state]
Existing Streaming Systems
Storm
- Replays a record if it is not processed by a node
- Processes each record at least once
- May update mutable state twice!
- Mutable state can be lost due to failure!
Trident – uses transactions to update state
- Processes each record exactly once
- Per-state transactions to an external database are slow
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
[Diagram: live data stream → Spark Streaming (batches of X seconds) → Spark → processed results]
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
- Batch sizes as low as ½ second, latency of about 1 second
- Potential for combining batch processing and streaming processing in the same system
[Diagram: live data stream → Spark Streaming (batches of X seconds) → Spark → processed results]
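The micro-batch model above can be sketched in plain Scala, with no Spark involved: a "live stream" is a sequence of records, chopped into fixed-size batches, and each batch is processed independently with an ordinary deterministic collection operation, the way Spark Streaming treats each batch as an RDD. The record values and batch size here are made up for illustration.

```scala
// A hypothetical stream of input records.
val liveStream = List("a", "b", "c", "d", "e", "f", "g")

// Chop the stream into batches ("X seconds" becomes 3 records per batch here).
val batches: List[List[String]] = liveStream.grouped(3).toList

// Process every batch with the same deterministic operation,
// producing results batch by batch.
val processedResults: List[List[String]] =
  batches.map(batch => batch.map(_.toUpperCase))
```

Because each batch job is small and deterministic, the same processing code can run over a huge static dataset or over a stream of small batches.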
Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()
DStream: a sequence of RDDs representing a stream of data
[Diagram: Twitter Streaming API feeding the tweets DStream – batch @ t, batch @ t+1, batch @ t+2, each stored in memory as an immutable, distributed RDD]
Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
transformation: modify data in one DStream to create another DStream
[Diagram: flatMap applied to each batch of the tweets DStream, creating new RDDs of the hashTags DStream (e.g. [#cat, #dog, …]) for every batch]
Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: push data to external storage
[Diagram: flatMap on each batch, then save – every batch of the hashTags DStream saved to HDFS]
Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })
foreach: do whatever you want with the processed data
[Diagram: flatMap then foreach on each batch – write to a database, update an analytics UI, do whatever you want]
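The tweets → hashTags pipeline above can be simulated in plain Scala, one batch at a time. `getTags` is a hypothetical helper (it is not shown in the slides); here it simply extracts "#…" tokens, and the tweet texts are made up.

```scala
// Hypothetical stand-in for the getTags helper used in the slides.
def getTags(status: String): Seq[String] =
  status.split("\\s+").filter(_.startsWith("#")).toSeq

// Two micro-batches of tweet statuses (stand-ins for the tweets DStream).
val batchT  = List("I love my #cat", "walking the #dog")
val batchT1 = List("no tags here")

// flatMap is applied to each batch independently, creating a new batch of
// hashtags for every batch of tweets (the hashTags DStream).
val hashTagBatches = List(batchT, batchT1).map(_.flatMap(getTags))

// An "output operation" then does something with every processed batch.
hashTagBatches.foreach(batch => println(batch.mkString(", ")))
```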
Demo
Java Example
Scala
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java
JavaDStream<Status> tweets = ssc.twitterStream()
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
hashTags.saveAsHadoopFiles("hdfs://...")
(a Function object takes the place of the Scala closure; JavaDStream<T> is the DStream of data)
Window-based Transformations
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
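The window above covers the last minute of data and slides every 5 seconds. A plain-Scala sketch of the semantics (no Spark; the window length, batch contents, and tags are made up – with 5-second batches, a 1-minute window would really span 12 batches):

```scala
val windowLength = 3 // batches per window (would be 12 for Minutes(1) over 5s batches)

// A stream of hashtag batches (stand-in for the hashTags DStream).
val tagBatches = List(
  List("#cat", "#dog"),
  List("#cat"),
  List("#dog", "#dog"),
  List("#cat")
)

// For each new batch, recount values over the last `windowLength` batches --
// the sliding-window analogue of countByValue().
val windowedCounts: List[Map[String, Int]] =
  tagBatches.indices.toList.map { i =>
    val window = tagBatches.slice((i - windowLength + 1).max(0), i + 1).flatten
    window.groupBy(identity).map { case (tag, occs) => (tag, occs.size) }
  }
```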
Arbitrary Stateful Computations
Specify a function to generate new state based on the previous state and new data (updateStateByKey)
- Example: maintain per-user mood as state, and update it with their tweets
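The update-function semantics can be sketched in plain Scala: the user supplies a function from (previous state, new data in this batch) to new state, and the system carries per-key state across batches. Here the "state" is a per-user tweet count for simplicity (mood tracking would follow the same shape with a richer state type); users and tweets are made up.

```scala
// User-supplied function: new state from previous state and the batch's data.
def updateState(previous: Int, newTweets: Seq[String]): Int =
  previous + newTweets.size

// Batches of (user, tweet) pairs.
val tweetBatches = List(
  List(("alice", "hi"), ("bob", "yo")),
  List(("alice", "again"))
)

// Fold the batches through the update function, carrying state across batches.
val finalState: Map[String, Int] =
  tweetBatches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
    val grouped = batch.groupBy(_._1).map { case (u, ps) => (u, ps.map(_._2)) }
    val touched = grouped.map { case (u, tweets) =>
      (u, updateState(state.getOrElse(u, 0), tweets))
    }
    state ++ touched
  }
```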
DStreams + RDDs = Power
Online machine learning
- Continuously learn and update data models (updateStateByKey and transform)
Combine live data streams with historical data
- Generate historical data models with Spark, etc.
- Use data models to process the live data stream (transform)
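The "combine live streams with historical data" point can be sketched in plain Scala: a model computed offline (here a hypothetical spam word list) is applied to every micro-batch, the way DStream.transform lets you run arbitrary RDD-to-RDD functions (such as joins against a historical RDD). The model contents and messages are made up.

```scala
// A model built offline with a batch Spark job (hypothetical contents).
val spamModel: Set[String] = Set("buy-now", "free-money")

// Batches of incoming messages (stand-in for the live DStream).
val messageBatches = List(
  List("hello", "buy-now"),
  List("free-money", "lunch?")
)

// "transform" each batch with an arbitrary function that uses the model.
val cleanBatches = messageBatches.map(batch => batch.filterNot(spamModel.contains))
```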
Input Sources
Out of the box, we provide
- Kafka, HDFS, Flume, Akka Actors, raw TCP sockets, etc.
Very easy to write a receiver for your own data source
Also, generate your own RDDs from Spark, etc. and push them in as a "stream"
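The "push your own data in as a stream" idea can be sketched with a plain-Scala queue: a producer enqueues batches, and the consumer drains them one batch per interval, as a stream would (Spark Streaming offers a similar queue-of-RDDs input, often used for testing). The batch contents here are made up.

```scala
import scala.collection.mutable

// Producer side: generate batches and push them into the "stream".
val queue = mutable.Queue[List[Int]]()
queue.enqueue(List(1, 2, 3))
queue.enqueue(List(4, 5))

// Consumer side: drain the queue batch by batch, processing each batch
// (here: summing it) as a streaming job would.
val batchSums = mutable.ListBuffer[Int]()
while (queue.nonEmpty) {
  val batch = queue.dequeue()
  batchSums += batch.sum
}
```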
Fault-tolerance
Batches of input data are replicated in memory for fault-tolerance
Data lost due to worker failure can be recomputed from the replicated input data
All transformations are fault-tolerant and provide exactly-once semantics
[Diagram: tweets RDD (input data replicated in memory) → flatMap → hashTags RDD; lost partitions recomputed on other workers]
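The recovery idea can be sketched in plain Scala: because every transformation is deterministic, a lost output can be rebuilt by re-running the same function on the surviving in-memory replica of the input. The input records and `getTags` helper are made up for illustration.

```scala
// Hypothetical deterministic transformation (same as in the earlier example).
def getTags(s: String): Seq[String] =
  s.split("\\s+").filter(_.startsWith("#")).toSeq

// Input batch, replicated in memory on two "workers".
val inputReplica1 = List("a #x", "b #y")
val inputReplica2 = inputReplica1 // replica held by another worker

// Original computation on worker 1.
val hashTags = inputReplica1.flatMap(getTags)

// Worker 1 fails and its output is lost; because the transformation is
// deterministic, recomputing from the surviving replica yields the same result.
val recomputed = inputReplica2.flatMap(getTags)
```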
Performance
Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency
[Chart: WordCount – cluster throughput (GB/s, 0–3.5) vs # nodes in cluster (0–100), for 1 sec and 2 sec batch intervals]
[Chart: Grep – cluster throughput (GB/s, 0–7) vs # nodes in cluster (0–100), for 1 sec and 2 sec batch intervals]
Comparison with other systems
Higher throughput than Storm
- Spark Streaming: 670k records/sec/node
- Storm: 115k records/sec/node
- Commercial systems: 100–500k records/sec/node
[Chart: WordCount – throughput per node (MB/s) vs record size (100–1000 bytes), Spark vs Storm]
[Chart: Grep – throughput per node (MB/s) vs record size (100–1000 bytes), Spark vs Storm]
Fast Fault Recovery
Recovers from faults/stragglers within 1 sec
Mobile Millennium Project
Traffic transit time estimation using online machine learning on GPS observations
- Markov-chain Monte Carlo simulations on GPS observations
- Very CPU intensive, requires dozens of machines for useful computation
- Scales linearly with cluster size
[Chart: GPS observations per sec (0–2000) vs # nodes in cluster (0–80)]
Advantage of a unified stack
- Explore data interactively to identify problems
- Use the same code in Spark for processing large logs
- Use similar code in Spark Streaming for real-time processing

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = filtered.map(...)
...

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(...)
    val stream = ssc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
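The "same code" point can be made concrete in plain Scala: define the processing logic once as a function, then apply it both to one large static dataset (the batch job) and to every micro-batch of a stream. Collections stand in for RDDs/DStreams, and the log lines are made up.

```scala
// Processing logic defined once, shared by batch and streaming paths.
def processLogs(lines: Seq[String]): Seq[String] =
  lines.filter(_.contains("ERROR")).map(_.toUpperCase)

// Batch: one large dataset.
val productionLogs = List("ERROR disk full", "INFO ok", "ERROR timeout")
val batchResult = processLogs(productionLogs)

// Streaming: the very same function applied to each incoming batch.
val liveBatches = List(List("INFO boot", "ERROR oops"), List("ERROR again"))
val streamResults = liveBatches.map(processLogs)
```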
Roadmap
Spark 0.8.1
- Marked alpha, but has been quite stable
- Master fault tolerance – manual recovery
- Restart computation from a checkpoint file saved to HDFS
Spark 0.9 in Jan 2014 – out of alpha!
- Automated master fault recovery
- Performance optimizations
- Web UI, and better monitoring capabilities
Roadmap
Long term goals
- Python API
- MLlib for Spark Streaming
- Shark Streaming
Community feedback is crucial!
- Helps us prioritize the goals
Contributions are more than welcome!!
Today’s Tutorial
Process the Twitter data stream to find the most popular hashtags over a window
Requires a Twitter account
- Need to set up Twitter OAuth keys to access tweets
- All the instructions are in the tutorial
Your account will be safe!
- No need to enter your password anywhere, only the keys
- Destroy the keys after the tutorial is done