Unified execution engine
Unified engine for real-time and batch data processing
What will we learn in this article?
This article explores the unified execution engine in data processing and how Joule addresses the challenges of real-time and batch data processing within a single, integrated execution engine.
We will gain insight into Joule’s architecture and how it manages continuous and periodic data without distinction between batch and streaming.
A unified approach overcomes issues with traditional data processing frameworks by eliminating the requirement for data to be complete at the time of ingestion. Joule achieves this with components such as stream joins, filters, enrichments, windows and transformations.
What is a unified execution engine?
A unified execution engine processes both unbounded (continuous) and bounded (finite) data without requiring developers to differentiate between them. This article does not cover unbounded and bounded data in depth; it gives only an overview of the concepts.
Ideally we would like to have insights as soon as the initial event has occurred. However, this depends entirely on the frequency at which data is presented to the processing engine and on the needs of the actual use case.
Low-latency event feeds generally produce a faster time to insight, whereas bounded data delivers snapshots that reflect a point-in-time view. Therefore, when processing both types of data, we need to balance the expectation of an ideal state against the actual requirements of reality.
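As a minimal sketch of this idea (plain Python, not Joule's API), the same pipeline function below is applied to a finite list and to an endless generator. The names `pipeline`, `bounded` and `unbounded` are hypothetical and exist only for illustration.

```python
from itertools import count, islice
from typing import Iterable, Iterator


def pipeline(events: Iterable[dict]) -> Iterator[dict]:
    """Filter and transform events one at a time.

    The function never asks whether the source is finite or endless.
    """
    for event in events:
        if event["value"] >= 0:  # filter: drop negative readings
            yield {**event, "doubled": event["value"] * 2}  # transformation


# Bounded source: a finite list, as a batch file might provide.
bounded = [{"id": i, "value": v} for i, v in enumerate([3, -1, 5])]


def unbounded() -> Iterator[dict]:
    """Unbounded source: an endless generator standing in for a live feed."""
    for i in count():
        yield {"id": i, "value": i - 1}


batch_result = list(pipeline(bounded))
# Take only the first three results; the stream itself never ends.
stream_result = list(islice(pipeline(unbounded()), 3))
```

The pipeline code is identical in both calls; only the caller decides how much of the unbounded source to consume.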
In an ideal world
Processing = event creation
Processing would happen instantly as events are created.
In reality
Processing ≠ event creation
Event processing must wait for new events to enter the pipeline before it can start producing results to act upon.
Joule does not differentiate between bounded and unbounded data
Adapting to continuous data with a flexible unified model
Unlike traditional data processing frameworks, which assume data will eventually become complete, a unified model operates on the assumption that new data may always arrive.
This approach enables flexibility by not tying data infrastructure to specific execution engines and by providing consistency across both unbounded and bounded datasets.
Joule's processors allow developers to specify when to emit output results for a given period of time, enabling responsive processing even in continuous workflows.
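To illustrate emitting results per period of time, here is a minimal sketch of a tumbling window in plain Python. It is not Joule's window implementation: the function name `tumbling_sum`, the `(timestamp, value)` event shape and the rule that a window closes when a later event arrives are all assumptions made for this example.

```python
from typing import Iterable, Iterator, Tuple


def tumbling_sum(
    events: Iterable[Tuple[int, int]], window_seconds: int
) -> Iterator[Tuple[int, int]]:
    """Yield (window_start, sum_of_values) each time a window closes.

    Events are (timestamp, value) pairs assumed to arrive in timestamp order.
    """
    window_start = None
    total = 0
    for ts, value in events:
        start = ts - (ts % window_seconds)  # window this event belongs to
        if window_start is None:
            window_start = start
        elif start != window_start:
            # A later window has begun, so emit the finished one.
            yield (window_start, total)
            window_start, total = start, 0
        total += value
    if window_start is not None:
        # A bounded source ended: flush the last open window.
        yield (window_start, total)


events = [(0, 1), (2, 2), (5, 3), (7, 4), (11, 5)]
results = list(tumbling_sum(events, window_seconds=5))
```

Results are produced as each window closes rather than after all data has been read, so the same function works whether `events` is a finite list or a generator that never ends.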
For example, with Joule you can spin up multiple processors, each with its own scheduler, to run independently within the same environment. Unlike traditional setups where a single scheduler manages all tasks, Joule enables separate, decoupled processing for different use cases.
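The decoupling described above can be sketched with a simulated clock. This is an illustrative model only, not Joule's scheduler: the `Processor` class, its `interval` field and the `run` driver are hypothetical names, and each processor simply decides for itself when to emit.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Processor:
    """A processor with its own emit schedule, expressed in clock ticks."""

    name: str
    interval: int  # emit every `interval` ticks
    buffer: List[int] = field(default_factory=list)
    emitted: List[Tuple[int, int]] = field(default_factory=list)

    def on_event(self, value: int) -> None:
        self.buffer.append(value)

    def on_tick(self, tick: int) -> None:
        # Each processor checks its own schedule; no shared scheduler decides.
        if tick % self.interval == 0 and self.buffer:
            self.emitted.append((tick, sum(self.buffer)))
            self.buffer.clear()


def run(processors: List[Processor], ticks: int) -> None:
    """Feed one event per tick (value == tick number) to every processor."""
    for tick in range(1, ticks + 1):
        for processor in processors:
            processor.on_event(tick)
            processor.on_tick(tick)


fast = Processor(name="fast", interval=2)
slow = Processor(name="slow", interval=3)
run([fast, slow], ticks=6)
```

Both processors see the same events, yet `fast` emits every two ticks and `slow` every three, which mirrors how independent schedules can serve different use cases side by side.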
Understanding data processing with Joule
Joule does not differentiate between batch and stream processing
Batch processing
Batch processing in a unified engine handles large, finite datasets that are processed periodically. Joule manages batch data by applying a micro-batching method, so that the data appears internally as a stream of events.
Joule uses Apache Arrow processing techniques to handle large files while managing memory efficiently.
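A rough sketch of micro-batching, assuming nothing about Joule's internals: a finite dataset is read in small chunks and then replayed row by row, so downstream code sees an ordinary stream of events. The helper names `micro_batches` and `as_event_stream` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Split a finite dataset into small batches without loading it all at once."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


def as_event_stream(rows: Iterable[dict], batch_size: int) -> Iterator[dict]:
    """Present the micro-batches to downstream code as individual events."""
    for batch in micro_batches(rows, batch_size):
        yield from batch


rows = [{"id": i} for i in range(7)]
batch_sizes = [len(batch) for batch in micro_batches(rows, batch_size=3)]
events = list(as_event_stream(rows, batch_size=3))
```

Only one micro-batch is held in memory at a time, and the consumer of `as_event_stream` cannot tell whether the events came from a file or a live feed.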
With unified execution, batch jobs can also include real-time triggers, allowing them to operate seamlessly alongside streaming data.
Stream processing
Stream processing operates on unbounded datasets, offering near-real-time analytics as new data continuously arrives. Joule uses DuckDB as its internal data store, enabling high-performance analytics.
This approach allows streaming data to be processed as it flows, without waiting for all data to arrive, which avoids the delays that waiting for a complete dataset would cause.
How does this work in Joule?
Joule treats batch and streaming data consistently, allowing seamless processing across different data sources and formats.
Joule’s architecture supports modular pipelines, real-time observability and extensibility. This makes Joule adaptable to diverse data processing demands.
Unified engine architecture