A Flink streaming application consists of four key programming constructs.
1. Stream execution environment - Every Flink streaming application requires an environment in which the streaming program is executed.
Flink framework provides the class org.apache.flink.streaming.api.environment.StreamExecutionEnvironment as this environment context. It also provides sub-classes such as LocalStreamEnvironment, which executes the program in a local JVM, and RemoteStreamEnvironment, which executes the program on a remote cluster.
Usually you just call the static getExecutionEnvironment() method on StreamExecutionEnvironment, which returns the appropriate environment based on the context in which the program is executed.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
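If you need to pin a specific environment (for example, in tests), you can also create one explicitly. A minimal sketch, assuming the host, port and jar path below are placeholders for your own cluster:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Runs the program in an embedded, local Flink mini-cluster (handy for tests).
final StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment();

// Submits the program to a remote cluster; host, port and jar path are illustrative only.
final StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment("jobmanager-host", 8081, "/path/to/your-job.jar");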
2. Data sources - Data sources are the applications or data stores from which the Flink application reads its input data.
In a Flink application you attach a data source to the stream execution environment by using the addSource() method - env.addSource(sourceFunction)
Flink framework comes with support for several built-in data sources, such as the following (a usage sketch follows this list):
file-based - env.readFile(...)
socket-based - env.socketTextStream(...)
collection-based - env.fromCollection(...)
You can also attach a custom source, such as a source for Apache Kafka - env.addSource(new FlinkKafkaConsumer<>(...))
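A minimal sketch attaching built-in and custom sources, assuming env is the StreamExecutionEnvironment obtained above and that the flink-connector-kafka dependency is on the classpath (the hosts, port and topic name are placeholders):

import java.util.Arrays;
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

// Collection-based source - handy for quick local tests.
DataStream<String> fromCollection = env.fromCollection(Arrays.asList("a", "b", "c"));

// Socket-based source - reads text lines from a TCP socket.
DataStream<String> fromSocket = env.socketTextStream("localhost", 9999);

// Custom source - a Kafka consumer reading string records from a topic.
Properties consumerProps = new Properties();
consumerProps.setProperty("bootstrap.servers", "localhost:9092");
consumerProps.setProperty("group.id", "flink-example");
DataStream<String> fromKafka = env.addSource(
        new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), consumerProps));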
3. Data streams and transformation operations - Input data from data sources are read by the Flink application in the form of data streams. Flink framework provides the org.apache.flink.streaming.api.datastream.DataStream class, on which operations can be applied to transform the data.
Flink framework provides many transformation functions that can be applied to the data stream, such as the following (a usage sketch follows this list):
mapping functions - dataStream.map(...)
filtering functions - dataStream.filter(...)
partitioning functions - dataStream.keyBy(...)
reducing functions - keyedStream.reduce(...)
and more ...
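A minimal sketch chaining several of these transformations into a small word-count pipeline, assuming words is a DataStream<String> of individual words (the logic is illustrative only):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

// Map each word to a (word, 1) pair; the returns(...) hint is needed because
// lambdas lose the Tuple2 generic types to erasure.
DataStream<Tuple2<String, Integer>> pairs = words
        .map(word -> Tuple2.of(word, 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT));

// Keep only non-empty words.
DataStream<Tuple2<String, Integer>> nonEmpty = pairs.filter(pair -> !pair.f0.isEmpty());

// Partition the stream by the word, then sum the counts per key.
DataStream<Tuple2<String, Integer>> counts = nonEmpty
        .keyBy(pair -> pair.f0)
        .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));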
4. Data sinks - The transformed data from data streams are consumed by data sinks, which write the data out to external systems. Flink framework comes with support for a variety of built-in output formats, such as the following (a usage sketch follows this list):
file-based - dataStream.writeAsText()
socket-based - dataStream.writeToSocket(...)
and more ...
You can also attach a custom sink, such as a sink for Apache Kafka - dataStream.addSink(...)
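A minimal sketch attaching sinks, assuming counts is the DataStream<Tuple2<String, Integer>> produced in the transformation sketch above and that the flink-connector-kafka dependency is available (the path, broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

// File-based sink - writes each element's toString() to text files.
counts.writeAsText("/tmp/word-counts");

// Custom sink - a Kafka producer writing string records to a topic.
Properties producerProps = new Properties();
producerProps.setProperty("bootstrap.servers", "localhost:9092");
counts.map(Tuple2::toString)
      .addSink(new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), producerProps));

// Nothing runs until the job is submitted for execution.
env.execute("word-count example");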
What are the main architectural components of Flink?
JobManager and TaskManager are the main components of Flink.
JobManager - The Flink client triggers the execution of a Flink application by preparing a dataflow and sending it to the JobManager. The JobManager is responsible for coordinating the distributed execution of Flink applications: deciding when to schedule the next tasks, reacting to finished tasks, coordinating recovery from task failures, coordinating checkpoints, and so on.
JobManager contains three different components.
ResourceManager
Dispatcher
JobMaster
TaskManager - TaskManager is responsible for executing the tasks of the dataflow. There can be one or more TaskManagers. TaskManagers connect to the JobManager, announce themselves as available, and are assigned tasks to execute.
A TaskManager contains one or more task slots, with each task slot executing a task (consisting of one or more sub-tasks) in a separate thread.
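As a rough sketch of how slots relate to scheduling: with Flink's default slot sharing, a job typically needs as many task slots as the highest parallelism used in the program. The snippet below only illustrates setting that parallelism; the number of slots each TaskManager offers is configured separately (for example via taskmanager.numberOfTaskSlots in the Flink configuration):

// With default slot sharing, this job will need roughly 4 task slots,
// regardless of how many operators the pipeline contains.
env.setParallelism(4);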