Understanding Apache Flink
By Ivan Mushketyk
Course: https://app.pluralsight.com/library/courses/understanding-apache-flink/table-of-contents
Apache Flink is a "framework for stream and batch processing"
- Has Java, Scala, and Python APIs
- Features
- Batch as a special case of streaming
- Flexible windowing
- Iteration support
- Memory management
- High throughput, low latency
- Exactly-once processing
- Query optimization
Batch Processing
- Operations
- map
- filter
- flatMap
- groupBy
- Operations on multiple datasets
- join
- outerJoin
- cross
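A short sketch of join on two DataSets, assuming the Flink 1.x DataSet API; the (id, name) and (id, city) tuples are made up for illustration:

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> users = env.fromElements(
                Tuple2.of(1, "Alice"), Tuple2.of(2, "Bob"));
        DataSet<Tuple2<Integer, String>> cities = env.fromElements(
                Tuple2.of(1, "Kyiv"), Tuple2.of(2, "Lviv"));

        // join(): match elements of both datasets on the first tuple field
        DataSet<String> joined = users
                .join(cities)
                .where(0)     // key field of the left dataset
                .equalTo(0)   // key field of the right dataset
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, String>() {
                    @Override
                    public String join(Tuple2<Integer, String> user, Tuple2<Integer, String> city) {
                        return user.f1 + " lives in " + city.f1;
                    }
                });

        joined.print(); // print() on a DataSet triggers execution of the batch job
    }
}
```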
- Write Data
- writeAsText
- writeAsCsv
- print / printToErr
- write
- output
- collect
- A simple processing flow
- readCsvFile()
- map()
- filter()
- writeAsText()
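A minimal sketch of the flow above, assuming the Flink 1.x DataSet API; the CSV path, the (name, age) schema, and the output path are hypothetical:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class BatchFlow {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // readCsvFile(): parse each row into a (name, age) tuple
        DataSet<Tuple2<String, Integer>> people = env
                .readCsvFile("/tmp/people.csv")   // hypothetical input path
                .types(String.class, Integer.class);

        // filter(): keep adults only; map(): extract and normalize the name
        DataSet<String> adults = people
                .filter(p -> p.f1 >= 18)
                .map(p -> p.f0.toUpperCase());

        // writeAsText(): one result per line; the job only runs on execute()
        adults.writeAsText("/tmp/adults.txt");    // hypothetical output path
        env.execute("Simple batch flow");
    }
}
```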
Stream processing
- Get a StreamExecutionEnvironment
- Operations
- readTextFile
- fromCollection / fromElements
- socketTextStream
- addSource
- map
- filter
- flatMap
- union
- split
- connect
- writeAsText / writeAsCsv
- print / printToErr
- addSink
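A small sketch of union, assuming the Flink 1.x DataStream API; the log lines are made up, and the point is simply that two streams of the same type merge into one:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnionExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> errors = env.fromElements("error: disk full", "error: timeout");
        DataStream<String> warnings = env.fromElements("warn: high latency");

        // union(): interleave both streams into a single DataStream<String>
        DataStream<String> logs = errors.union(warnings);

        logs.print();
        env.execute("Union example");
    }
}
```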
- A simple processing flow
- addSource()
- map()
- filter()
- print()
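A minimal sketch of this streaming flow, assuming the Flink 1.x DataStream API and a hypothetical custom source that emits a few integers:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class StreamFlow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // addSource(): plug in a custom SourceFunction
        DataStream<Integer> numbers = env.addSource(new SourceFunction<Integer>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<Integer> ctx) throws Exception {
                for (int i = 0; i < 100 && running; i++) {
                    ctx.collect(i);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        });

        // map(): square each value; filter(): keep even results; print(): stdout sink
        numbers
                .map(n -> n * n)
                .filter(n -> n % 2 == 0)
                .print();

        env.execute("Simple stream flow");
    }
}
```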
- Windows
- Non-keyed: non-parallel processing of a single stream
- Keyed windows: "groups" stream elements to process in parallel
- Window Assigner (how to split data into windows)
- Tumbling (time)
- Tumbling (count)
- Sliding
- Session
- Global
- Operations in Windows
- reduce/fold
- min/max/sum
- apply
- Late Events – Events that arrive after a window has already been processed. By default, late events are dropped; allowed lateness defines how long to keep waiting for them.
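A sketch of a keyed tumbling event-time window with reduce and allowed lateness, assuming the Flink 1.x DataStream API; the (eventType, timestamp) tuples and the window sizes are made up for illustration:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<Tuple2<String, Long>> events = env
                .fromElements(
                        Tuple2.of("click", 1000L),
                        Tuple2.of("click", 2000L),
                        Tuple2.of("view", 1500L))
                // Use the second tuple field as the event timestamp and
                // tolerate events up to 2 seconds out of order.
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(2)) {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> e) {
                                return e.f1;
                            }
                        });

        events
                .keyBy(0)                           // keyed window: one window per event type
                .timeWindow(Time.seconds(5))        // tumbling event-time window
                .allowedLateness(Time.seconds(10))  // keep the window around for late events
                .reduce((a, b) -> a.f1 >= b.f1 ? a : b) // keep the most recent event per key and window
                .print();

        env.execute("Tumbling window example");
    }
}
```

With allowed lateness configured, a window fires again for each late-but-not-too-late event instead of silently dropping it.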
- State in Stream Processing
- State types
- Keyed
- ValueState
- ListState
- ReducingState
- FoldingState
- Operator (non-keyed, scoped to a parallel operator instance)
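A sketch of keyed ValueState, assuming the Flink 1.x DataStream API: a RichFlatMapFunction that keeps a running count per key; the (key, value) input is hypothetical:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class KeyedStateExample {

    // Emits (key, countSoFar) for every incoming (key, value) pair.
    public static class CountPerKey
            extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
            Long current = count.value();   // null for the first element of a key
            long updated = (current == null ? 0L : current) + 1;
            count.update(updated);
            out.collect(Tuple2.of(in.f0, updated));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 7L), Tuple2.of("a", 3L))
                .keyBy(0)                   // state is scoped to this key
                .flatMap(new CountPerKey())
                .print();

        env.execute("Keyed state example");
    }
}
```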
Other topics to explore
- Table API
- Graph processing (Gelly)
- Machine learning library (FlinkML)
- Complex event processing (CEP)
- Checkpoints and savepoints
- Apache Kafka integration