Outer stream joins

Immediately pass the first event received and initialised downstream processors

Objective

Outer joins are different from inner joins by immediately passing the first event received on the outer stream into the processing pipeline, which can be useful for initialising downstream processors.

Any additional events that meet the join criteria are also emitted as they arrive.

Outer stream joins are extremely useful to tackle the cold start problem in stream recommendation engines.

Uses

It is extremely useful for ideal to prime processing.

Priming can improve performance by reducing the latency to get initial results, which are then refined or updated as more data becomes available.

What is priming?

In stream processing, priming refers to initialising or setting up a process to start working as soon as possible with partial or incomplete data, often before all necessary data has arrived.

For example, with an outer join in stream processing, as soon as an event arrives on the outer stream, it’s immediately processed and sent downstream. This priming action allows downstream processors to begin working with the initial data, while waiting for more events to join and complete the full picture.

It’s especially useful for processes that depend on getting an initial seed of data to start calculations, aggregations, or other operations quickly.

Example

This code defines a join between the sitevisits and webpage.adclicks streams based on customerId.

  1. expression It matches all customerId values from sitevisits to webpage.adclicks.

  2. merge events Matching events are merged.

  3. left policy sitevisits events expire after 30 minutes.

  4. right policy Matched webpage.adclicks events are deleted after the join.

New sitevisits events trigger with a new customerId initial processing and subsequent matching webpage.adclicks events continue until expiration, with cleanup after joining.

Outer joins are set by *=

streams join:
  expression: "sitevisits.customerId *= webpage.adclicks.customerId"
  merge events: true
  left policy:
    time to live: 30 minutes
  right policy:
    delete on join: true

Last updated