About Reference Data
Reference data is a key requirement for many stream processing use cases, providing processing context. Its applications include event enrichment, transformation, and advanced analytics.
Reference data is a critical element within stream processing platforms as it provides context to live events, enabling complex event processing logic. This type of data can be utilised to enrich, transform, and validate event attributes through pre-defined patterns, predictive analytics (e.g., machine learning models), and dynamic feature engineering, or to compute real-time KPIs.
As an example, consider a telco (telecommunications company) that wants insight into localised network performance for connected mobile devices. By grouping mobile phone telemetry, enriching events with the device manufacturer and model, and mapping connected cell towers to postal areas, the telco can effectively analyse network performance by region. This analysis helps identify areas where network performance is strong or weak, allowing the telco to take appropriate action to improve the quality of service for its customers.
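The enrichment step in this example amounts to a lookup join against a reference map. The sketch below is illustrative only; the tower-to-postcode mapping, event shape, and method names are assumptions, not Joule's actual enricher implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative enrichment: attach a postal area to a telemetry event
// by looking up the connected cell tower in a reference map.
// The tower ids and postcodes below are hypothetical sample data.
public class TelemetryEnrichment {
    static final Map<String, String> TOWER_TO_POSTCODE = new HashMap<>();
    static {
        TOWER_TO_POSTCODE.put("tower-017", "SW1A");
        TOWER_TO_POSTCODE.put("tower-042", "EC2V");
    }

    // Enrich a mutable event map in place; events from unknown towers
    // pass through without a postalArea attribute.
    static void enrich(Map<String, String> event) {
        String postcode = TOWER_TO_POSTCODE.get(event.get("cellTowerId"));
        if (postcode != null) {
            event.put("postalArea", postcode);
        }
    }
}
```

Once events carry a postal area, downstream analytics can group network quality metrics by region rather than by individual tower.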
This same process applies to real-time prediction, since the underlying machine learning model can be treated as a static data structure. Reference data is therefore considered slow-moving compared to its fast-moving event cousin. This means reference data is treated differently: it is served from localised, possibly cached, data stores within the processing context. Joule provides processors with the implementation interfaces required to access and apply this data within a localised stream processing context.
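The localised, cached data store pattern can be sketched as below. `CachedReferenceStore` and its loader function are illustrative names, not Joule's actual interfaces, and the out-of-process fetch is stubbed:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal sketch of an in-process cached reference data lookup.
// On a cache miss the loader performs the expensive out-of-process
// fetch (e.g. a remote store get); subsequent reads are served locally.
public class CachedReferenceStore<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public CachedReferenceStore(Function<K, V> loader) {
        this.loader = loader;
    }

    // Serve from the local cache; fall back to the loader on a miss.
    public V lookup(K key) {
        return cache.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        // Hypothetical device-model lookup standing in for a remote store
        CachedReferenceStore<String, String> models =
                new CachedReferenceStore<>(imei -> "Acme Phone X1");
        System.out.println(models.lookup("356938035643809"));
    }
}
```

Because reference data is slow-moving, a simple read-through cache like this keeps the hot path free of remote I/O; a production implementation would also handle eviction and refresh.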
There are many forms of static data assets, which can be defined by ISO or industry standards, as pre-computed models and variables, or as organisational key data elements. Some examples are:
- Postal codes
- Mobile manufacturer models
- ISO 3166 country codes
- Capital market exchange codes
- Currency codes
- Machine learning models
- Pre-computed static variables
- Charge codes
- Car 17-character VINs
- Regex patterns, e.g. telephone number patterns
The data source interface is accessible within every processor. On startup the Joule runtime connects to each data source, binds it to a logical name, and adds it to the set of available data sources. A single configuration file defines the required reference data stores; see the Configuration section.
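The startup binding step can be sketched as a simple name-to-source registry. The class and method names below are illustrative assumptions, not Joule's actual runtime types:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of startup binding: each configured data source is registered
// under its logical name so processors can resolve it later by name alone,
// without knowing the underlying store technology.
public class ReferenceDataRegistry {
    private final Map<String, Object> sources = new HashMap<>();

    // Called once per configured source during runtime startup.
    public void bind(String logicalName, Object source) {
        sources.put(logicalName, source);
    }

    // Called by processors to resolve a source within their context.
    public Object lookup(String logicalName) {
        Object src = sources.get(logicalName);
        if (src == null) {
            throw new IllegalArgumentException("Unknown data source: " + logicalName);
        }
        return src;
    }
}
```

Resolving sources by logical name decouples processor logic from the physical store, so a cached store can be swapped in without changing processor code.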
Streaming Prediction example
When dealing with high event throughput, it is crucial to consider the data source implementation carefully. Neglecting this aspect can adversely affect the performance of the processing pipeline. To optimise performance and reduce out-of-process I/O calls, it is recommended to cache reference data within the process, especially for read-heavy scenarios.
The reference data requirements are defined in a single configuration file, which allows for easy management and customisation. This configuration file specifies one or more reference data sources required by the system. Currently, the system supports Apache Geode as a cached reference data source, due to its low-latency characteristics, and MinIO for S3 objects.
```yaml
name: banking market data
geode stores:
  - name: us markets
    locator address: 192.168.86.39
    locator port: 41111
    keyClass: java.lang.String
    valueClass: java.lang.Integer
```
To attach reference data structures to an event, the enricher processor provides the declarative logic to look up and bind the reference data object to the StreamEvent. Read the reference data enricher documentation for further details on how this is performed.