Batch and stream processing

Unified platform for batch and stream processing

What will we learn in this article?

This article introduces the foundational concepts of data processing within the Joule platform, highlighting its seamless integration of batch and stream processing into a unified system.

By the end, we will gain a clear understanding of Joule's processing capabilities and how they connect to its broader functionality.

Joule treats stream and batch processing as the same

Processing types

Batch processing

Batch processing operates on bounded data, handling it in discrete chunks or segments. It is ideal for scenarios such as periodic reporting, where large datasets are processed at once.

Joule adopts a micro-batching method, treating batch data as a stream of events internally. This unified approach allows batch jobs to operate seamlessly alongside real-time streams.

To ensure efficient large-scale data handling, Joule leverages Apache Arrow for memory optimization and fast file access. With unified execution, batch jobs can also include real-time triggers, enabling dynamic, mixed-mode operations.
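As a rough illustration of the micro-batching idea (this is not Joule's internal implementation), the sketch below reads a bounded Parquet file as a stream of Arrow record batches with PyArrow; the file name and batch size are assumptions made for the example.

```python
# Illustrative sketch of micro-batching with PyArrow, not Joule's internal API.
# The file name and batch size are assumptions.
import pyarrow.parquet as pq


def iter_micro_batches(path: str, batch_size: int = 10_000):
    """Yield a bounded Parquet file as a stream of Arrow record batches."""
    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        # Each RecordBatch can be handed to downstream processors exactly
        # like a window of streamed events.
        yield batch


if __name__ == "__main__":
    for batch in iter_micro_batches("historical_orders.parquet"):
        print(f"processing {batch.num_rows} events")
```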

Stream processing

Stream processing operates on unbounded datasets, handling data in real time, event by event.

Joule’s core strength lies in its ability to process data streams dynamically, enabling tasks like transformations, enrichment and predictive analytics with low latency.

This offers near-real-time analytics as data continuously arrives. Joule relies on DuckDB for high-performance internal data storage, ensuring efficient streaming analytics.

This approach allows data to flow through pipelines without delays caused by waiting for complete datasets, making it ideal for low-latency use cases like predictive analytics or real-time dashboards.
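To make the event-by-event pattern concrete, here is a minimal sketch that stores arriving events in DuckDB and queries them as they accumulate; the table, column and sensor names are assumptions for illustration and do not describe Joule's internal storage layout.

```python
# Minimal sketch: event-by-event ingestion and near-real-time aggregation
# with DuckDB. Table, column and sensor names are assumptions, not Joule internals.
from datetime import datetime

import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE events (sensor VARCHAR, reading DOUBLE, ts TIMESTAMP)")


def on_event(sensor: str, reading: float, ts: datetime) -> None:
    """Store a single arriving event."""
    con.execute("INSERT INTO events VALUES (?, ?, ?)", [sensor, reading, ts])


# Simulated arrivals from an unbounded source
on_event("pump-1", 0.92, datetime(2024, 1, 1, 10, 0, 0))
on_event("pump-1", 0.97, datetime(2024, 1, 1, 10, 0, 5))

# Query whatever has arrived so far, without waiting for a complete dataset
print(con.execute(
    "SELECT sensor, avg(reading) AS avg_reading FROM events GROUP BY sensor"
).fetchall())
```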

How is this applied in Joule?

Data ingestion

Joule connects seamlessly to a variety of event sources, enabling continuous or periodic data ingestion.

These sources include:

  • Kafka and RabbitMQ for streaming data

  • MinIO S3 for cloud-based object storage

  • File Watcher for monitoring file changes

  • REST APIs for lightweight integrations

  • MQTT

This flexibility ensures that Joule can integrate with diverse systems to gather the necessary input for executing pipelines.
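For a feel of what continuous ingestion from one of these sources involves, below is a generic sketch using the kafka-python client; the topic name and broker address are assumptions, and this is not Joule's connector configuration.

```python
# Generic illustration of continuous ingestion from a Kafka topic using
# kafka-python. The topic name and broker address are assumptions; this is
# not Joule's connector configuration.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-readings",                   # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # Hand each event to the processing pipeline as it arrives
    print(f"ingested event: {event}")
```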

Stream processors

At the heart of Joule’s functionality are its stream processors, which perform distinct tasks such as:

  1. Data enrichment

  2. Transformations

  3. Real-time predictions

  4. Event window analytics

These processors can be chained together into modular pipelines, allowing businesses to design workflows tailored to specific needs.

For example, processors can normalise incoming data, aggregate trends over time, or generate predictive insights, enabling the creation of scalable and flexible event-driven use cases.
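The sketch below shows this chaining pattern in plain Python, independent of Joule's actual processor API; the processor functions and field names are assumptions used only to illustrate how steps compose.

```python
# Sketch of chaining processors into a pipeline. Illustrative of the pattern
# only; function and field names are assumptions, not Joule's processor API.
from typing import Callable, Dict, Iterable, List

Event = Dict[str, float]
Processor = Callable[[Event], Event]


def enrich(event: Event) -> Event:
    """Attach a derived Fahrenheit reading to the event."""
    return {**event, "reading_f": event["reading_c"] * 9 / 5 + 32}


def normalise(event: Event) -> Event:
    """Scale the raw reading into the range 0..1 (assumes a 0-100 range)."""
    return {**event, "reading_norm": event["reading_c"] / 100.0}


def run_pipeline(events: Iterable[Event], processors: List[Processor]):
    for event in events:
        for processor in processors:
            event = processor(event)
        yield event


for processed in run_pipeline([{"reading_c": 21.5}, {"reading_c": 40.0}],
                              [enrich, normalise]):
    print(processed)
```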

Data delivery

Once data is processed, Joule integrates with downstream systems using its flexible data sinks.

These include:

  • SQL databases for structured storage

  • InfluxDB for time-series analytics

  • Kafka for redistributing processed streams

  • WebSocket systems for real-time dashboards

  • File outputs for exporting data in custom formats

This ensures that the processed data is delivered to the right systems to provide maximum business value.
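As a rough illustration of delivering the same processed events to more than one destination, the sketch below writes them to a SQLite table and a JSON-lines file as stand-ins for SQL and file sinks; it does not reflect Joule's actual sink configuration.

```python
# Sketch of fanning processed events out to multiple sinks. SQLite and a
# JSON-lines file stand in for SQL and file sinks; this is not Joule's
# sink configuration.
import json
import sqlite3


def deliver(events):
    con = sqlite3.connect("processed.db")
    con.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, value REAL)")
    with open("processed.jsonl", "a", encoding="utf-8") as file_sink:
        for event in events:
            con.execute(
                "INSERT INTO readings VALUES (?, ?)",
                (event["sensor"], event["value"]),
            )
            file_sink.write(json.dumps(event) + "\n")
    con.commit()
    con.close()


deliver([{"sensor": "pump-1", "value": 0.97}])
```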

Unified processing

Joule’s unified engine enables seamless integration of batch and stream processing within a single platform. Mixed-mode pipelines allow businesses to process historical data and live streams simultaneously.

For instance, a batch job could generate periodic reports from historical datasets while triggering real-time alerts based on live data. This combination enhances operational flexibility, making Joule suitable for a wide range of applications, from real-time analytics to long-term trend reporting.
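The sketch below captures the idea in miniature: a bounded historical source and a live source (truncated here) feed the same processing function. It is purely illustrative and does not show how Joule schedules mixed-mode jobs.

```python
# Purely illustrative mixed-mode sketch: historical records and live events
# flow through the same processing function.
import itertools


def historical_batch():
    """Bounded source, e.g. yesterday's records."""
    yield from [{"source": "batch", "value": 10}, {"source": "batch", "value": 12}]


def live_stream():
    """Unbounded in practice; truncated here for the example."""
    yield from [{"source": "stream", "value": 11}]


def process(event):
    # The same logic applies regardless of where the event came from
    return {**event, "doubled": event["value"] * 2}


for event in map(process, itertools.chain(historical_batch(), live_stream())):
    print(event)
```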

Extensibility

Joule’s Processor SDK allows developers to build custom processors, extending its capabilities to meet unique business requirements.
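As a purely hypothetical sketch of what such a processor might look like (the class and method names below are assumptions, not the actual Processor SDK interface):

```python
# Hypothetical shape of a custom processor; class and method names are
# assumptions for illustration and do not reflect the actual Processor SDK.
class ThresholdAlertProcessor:
    """Flags events whose reading exceeds a configured threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def process(self, event: dict) -> dict:
        event["alert"] = event.get("reading", 0.0) > self.threshold
        return event


processor = ThresholdAlertProcessor(threshold=0.95)
print(processor.process({"reading": 0.97}))  # {'reading': 0.97, 'alert': True}
```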


By combining advanced processing techniques, extensibility and unified execution, Joule offers a comprehensive solution for managing complex data workflows, empowering businesses to handle everything from real-time event streams to large-scale batch processing with ease.
