About Reference Data

Reference data is often a key requirement in stream processing use cases because it supplies the context needed to process events. Its applications include event enrichment, transformation, and advanced analytics.


Overview

Reference data is a critical element of stream processing platforms: it gives context to live events and drives complex event processing logic. It can be used to enrich, transform, and validate event attributes through pre-defined patterns, predictive analytics (e.g., machine learning models), and dynamic feature engineering, or to compute real-time KPIs.

As an example, consider a telco (telecommunications company) that wants insight into localised network performance for connected mobile devices. By grouping mobile phone telemetry, enriching each event with the device manufacturer and model, and mapping connected cell towers to postal areas, the telco can analyse network performance by region. This analysis helps the telco identify areas where network performance is strong or weak, so it can take appropriate action to improve the quality of service for its customers.
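The telco scenario above can be sketched in a few lines of plain Java. This is an illustration only, not Joule's API: the class name, the event attribute names (`deviceCode`, `towerId`, `signalDbm`), and every lookup value are hypothetical. It enriches each telemetry event from two in-memory reference maps and then averages signal strength per postal area.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TelcoEnrichment {
    // Hypothetical reference data: device code -> manufacturer and model.
    static final Map<String, String> DEVICE_MODELS = Map.of(
            "dev-1", "ManufacturerA ModelX",
            "dev-2", "ManufacturerB ModelY");

    // Hypothetical reference data: cell tower id -> postal area.
    static final Map<String, String> TOWER_POSTAL_AREAS = Map.of(
            "tower-001", "SW1",
            "tower-002", "E14");

    // Builds a telemetry event as a simple attribute map.
    static Map<String, Object> event(String deviceCode, String towerId, int signalDbm) {
        Map<String, Object> e = new HashMap<>();
        e.put("deviceCode", deviceCode);
        e.put("towerId", towerId);
        e.put("signalDbm", signalDbm);
        return e;
    }

    // Returns a copy of the event with device and postal-area context attached.
    static Map<String, Object> enrich(Map<String, Object> event) {
        Map<String, Object> enriched = new HashMap<>(event);
        enriched.put("deviceModel",
                DEVICE_MODELS.getOrDefault((String) event.get("deviceCode"), "unknown"));
        enriched.put("postalArea",
                TOWER_POSTAL_AREAS.getOrDefault((String) event.get("towerId"), "unknown"));
        return enriched;
    }

    // Averages signal strength per postal area over the enriched events.
    static Map<String, Double> averageSignalByPostalArea(List<Map<String, Object>> events) {
        Map<String, double[]> sums = new TreeMap<>(); // value holds [sum, count]
        for (Map<String, Object> event : events) {
            Map<String, Object> enriched = enrich(event);
            double[] acc = sums.computeIfAbsent(
                    (String) enriched.get("postalArea"), k -> new double[2]);
            acc[0] += ((Number) enriched.get("signalDbm")).doubleValue();
            acc[1] += 1;
        }
        Map<String, Double> averages = new TreeMap<>();
        sums.forEach((area, acc) -> averages.put(area, acc[0] / acc[1]));
        return averages;
    }

    public static void main(String[] args) {
        System.out.println(averageSignalByPostalArea(List.of(
                event("dev-1", "tower-001", -70),
                event("dev-2", "tower-001", -90),
                event("dev-1", "tower-002", -100))));
    }
}
```

The reference maps never change while events flow through, which is the property that makes local lookup cheap compared with calling an external service per event.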

The same process applies to real-time prediction, because the underlying machine learning model can be treated as a static data structure. Reference data is slow-moving compared to its fast-moving event cousin, so it is treated differently: it is held in localised, possibly cached, data stores within the processing context. Joule provides processors with the implementation interfaces required to access and apply this data within a localised stream processing context.

Examples of reference data

There are many forms of static data assets. They can be defined by ISO or industry standards, supplied as pre-computed models and variables, or maintained as organisational key data elements. Some examples are:

  • Postal codes

  • Mobile manufacturer models

  • ISO 3166 country codes

  • Capital market exchange codes

  • Currency codes

  • Machine learning models

  • Pre-computed static variables

  • Charge codes

  • Car 17-character VINs (vehicle identification numbers)

  • Regex patterns, e.g. telephone number patterns
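As an illustration of the last item, regex patterns can be held as reference data keyed by country code and compiled once, then reused to validate every event. The sketch below is hypothetical: the class name is invented and the patterns are deliberately simplified, so they do not reflect real national numbering plans.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class PhonePatternReference {
    // Hypothetical, simplified patterns keyed by country code.
    // Compiled once at load time so per-event validation stays cheap.
    static final Map<String, Pattern> PHONE_PATTERNS = Map.of(
            "US", Pattern.compile("\\+1\\d{10}"),
            "GB", Pattern.compile("\\+44\\d{10}"));

    // Returns true only when a pattern exists for the country
    // and the whole number matches it.
    static boolean isValidPhone(String countryCode, String number) {
        Pattern pattern = PHONE_PATTERNS.get(countryCode);
        return pattern != null && pattern.matcher(number).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPhone("US", "+12025550123"));
        System.out.println(isValidPhone("US", "+1202555"));
    }
}
```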

Architecture

The data source interface is accessible within every processor. On startup, the Joule runtime connects to each data source, binds it to a logical name, and adds it to the set of available data sources. A single configuration file defines the required reference data stores; see the Configuration section.
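The bind-by-logical-name step can be pictured as a small registry. The sketch below is an assumption about the general pattern, not Joule's actual interfaces: `ReferenceStore` and `ReferenceStoreRegistry` are hypothetical names introduced here for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal view of a connected reference data source.
interface ReferenceStore {
    Optional<Object> get(Object key);
}

// Hypothetical registry: the runtime binds each store once at startup,
// and processors resolve it later by logical name.
final class ReferenceStoreRegistry {
    private final Map<String, ReferenceStore> stores = new ConcurrentHashMap<>();

    // Binds a connected store to a logical name; rejects duplicate names.
    void bind(String logicalName, ReferenceStore store) {
        if (stores.putIfAbsent(logicalName, store) != null) {
            throw new IllegalStateException("Store already bound: " + logicalName);
        }
    }

    // Resolves a store by logical name; fails fast on a misconfigured name.
    ReferenceStore lookup(String logicalName) {
        ReferenceStore store = stores.get(logicalName);
        if (store == null) {
            throw new IllegalArgumentException("Unknown store: " + logicalName);
        }
        return store;
    }
}
```

Failing fast on an unknown name surfaces configuration mistakes at startup or first use rather than as silently unenriched events.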

When dealing with high event throughput, it is crucial to consider the data source implementation carefully, because a slow lookup path degrades the performance of the whole processing pipeline. To optimise performance and reduce out-of-process I/O calls, cache reference data within the process, especially in read-heavy scenarios.
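One common way to keep lookups in-process is a read-through cache: the first request for a key calls the external store, and later requests are served from memory. The class below is a minimal sketch under that assumption and is not part of Joule. The `loader` function stands in for whatever out-of-process call the data source would make, and the call counter exists only to make the saving visible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stands in for the out-of-process lookup
    private final AtomicInteger remoteCalls = new AtomicInteger();

    ReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Serves from memory when possible; otherwise loads once and remembers the result.
    // A loader that returns null leaves no entry, so that key is looked up again next time.
    V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            remoteCalls.incrementAndGet();
            return loader.apply(k);
        });
    }

    // Number of times the loader was actually invoked.
    int remoteCalls() {
        return remoteCalls.get();
    }

    public static void main(String[] args) {
        ReadThroughCache<String, String> cache = new ReadThroughCache<>(String::toUpperCase);
        cache.get("aapl");
        cache.get("aapl");
        System.out.println(cache.remoteCalls());
    }
}
```

The sketch has no eviction or refresh, which suits slow-moving data but means a changed source value is never picked up; a production cache would need an expiry or update mechanism.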

Configuration

The reference data requirements are defined using a single configuration file, which allows for easy management and customisation. This configuration file specifies one or more reference data sources that are required for the system. Currently, the system supports Apache Geode as a cached reference data source, chosen for its low-latency characteristics, and MinIO for S3 objects.

Geode example

The example below binds Apache Geode, a distributed caching solution that provides external data, into the platform. Read the Geode connector documentation for a detailed explanation of how to use this feature.

reference data:
  name: banking market data 
  data sources:
    - geode stores:
        name: us markets
        connection:
          locator address: 192.168.86.39
          locator port: 41111
        stores:
          nasdaqIndexCompanies:
            region: nasdaq-companies
            keyClass : java.lang.String
            gii: true
          holidays:
            region: us-holidays
            keyClass : java.lang.Integer

Application

To add reference data structures to an event, the enricher processor provides the declarative logic to look up the reference data object and attach it to the StreamEvent. Read the reference data enricher documentation for further details on how this is performed.

enricher:
  fields:
    companyInformation:
      key: symbol
      using: nasdaqIndex
      
    stores:
      nasdaqIndex:
        store name: nasdaqIndexCompanies
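One plausible reading of this configuration, expressed as plain Java: for each entry under fields, take the event attribute named by key, look it up in the store aliased by using, and attach the result under the field name. This is a sketch of the declared behaviour as described above, not Joule's enricher implementation. `EnricherSketch` and `FieldSpec` are hypothetical names, and the event and stores are modelled as simple maps.

```java
import java.util.HashMap;
import java.util.Map;

final class EnricherSketch {
    // Hypothetical mirror of one entry under 'fields'.
    static final class FieldSpec {
        final String key;   // event attribute holding the lookup key, e.g. "symbol"
        final String using; // alias of the store to query, e.g. "nasdaqIndex"

        FieldSpec(String key, String using) {
            this.key = key;
            this.using = using;
        }
    }

    // Returns a copy of the event with one attribute added per field whose lookup succeeds.
    static Map<String, Object> enrich(Map<String, Object> event,
                                      Map<String, FieldSpec> fields,
                                      Map<String, Map<String, Object>> stores) {
        Map<String, Object> enriched = new HashMap<>(event);
        for (Map.Entry<String, FieldSpec> field : fields.entrySet()) {
            FieldSpec spec = field.getValue();
            Map<String, Object> store = stores.get(spec.using);
            Object keyValue = event.get(spec.key);
            if (store == null || keyValue == null) {
                continue; // nothing to look up for this field
            }
            Object referenceObject = store.get(keyValue);
            if (referenceObject != null) {
                enriched.put(field.getKey(), referenceObject);
            }
        }
        return enriched;
    }
}
```

In this sketch a failed lookup leaves the event unchanged for that field; how the real enricher treats missing keys is covered by its own documentation.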

Core Attributes

Attribute     Description                                                            Data Type                         Required

name          Reference data store namespace.                                        String

data sources  List of data sources to connect to and bind into the Joule processor.  List of connector configurations
