Scaling

Normalise data with various scaling methods

Objective

Feature scaling normalises data to improve model accuracy and comparability across features.

Key methods include:

  1. Max scaler: scales features between -1 and 1 by dividing by the absolute maximum; sensitive to outliers.

  2. Min-Max scaler: scales features to a specified range, e.g. [0,1]; also sensitive to outliers.

  3. Robust scaler: uses the median and interquartile range, making it resistant to outliers.

  4. Standard scaler: centres data around 0 with a standard deviation of 1, ideal for normally distributed data.

These methods offer flexibility by allowing custom variables to be supplied for optimal data scaling.

Max scaler

The max scaler maps data into the range -1 to 1 by dividing each value by the feature's absolute maximum. Because that maximum is itself influenced by extreme values, this scaler is not suitable for data containing outliers.

Such data needs pre-processing, for example outlier handling, before this scaler is applied.

Attributes

| Attribute    | Description                            | Type   | Required |
| ------------ | -------------------------------------- | ------ | -------- |
| absolute max | The column absolute max of the feature | Double |          |

Example

feature engineering:
  ...
  features:
    compute:
      scaled_price:
        function:
          max scaler:
            source field: price
            variables:
              absolute max: 12.78
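
To make the arithmetic concrete, here is a minimal Python sketch of max scaling, separate from the Joule DSL above. The function name and sample prices are illustrative assumptions; only the absolute max of 12.78 comes from the example.

def max_scale(value, absolute_max):
    # Divide by the absolute maximum so results fall within [-1, 1]
    return value / absolute_max

prices = [10.00, -6.39, 12.78]   # hypothetical sample values
print([round(max_scale(p, 12.78), 3) for p in prices])
# [0.782, -0.5, 1.0]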

Min-Max scaler

Transforms features by scaling each feature to a given range.

This scaler rescales and translates each feature individually so that it lies within the given range, e.g. between zero and one. If the data contains negative values, it is shrunk into the range -1 to 1.

The range can be set to [0,1], [0,5], [-1,1], and so on. This scaler responds well when the standard deviation is small and the distribution is not Gaussian.

Like the max scaler, it is sensitive to outliers.

Attributes schema

| Attribute | Description                                          | Type         | Required |
| --------- | ---------------------------------------------------- | ------------ | -------- |
| min       | The column min of the feature                        | Double array |          |
| max       | The column max of the feature                        | Double array |          |
| interval  | The target range for the scaled feature, e.g. [0,1]  | Double array |          |

Example

feature engineering:
  ...
  features:
    compute:
      scaled_price:
        function:
          minmax scaler:
            source field: price
            variables:
              min: 10.00
              max: 12.78
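
As an illustration of the underlying formula rather than the Joule implementation, here is a minimal Python sketch of min-max scaling. The interval argument and the sample price are assumptions; min 10.00 and max 12.78 come from the example above.

def min_max_scale(value, col_min, col_max, interval=(0.0, 1.0)):
    # Rescale to [0, 1], then shift and stretch into the target interval
    low, high = interval
    scaled = (value - col_min) / (col_max - col_min)
    return scaled * (high - low) + low

print(min_max_scale(11.39, 10.00, 12.78))             # ≈ 0.5
print(min_max_scale(11.39, 10.00, 12.78, (-1, 1)))    # ≈ 0.0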

Robust scaler

The robust scaler is a median-based scaling method.

The formula of the robust scaler is (Xi - Xmedian) / IQR, where IQR is the interquartile range. This scaler is largely unaffected by outliers.

Since it uses the interquartile range, it absorbs the effect of outliers while scaling. The interquartile range (Q3 - Q1) contains the middle half of the data points. If outliers might skew the results or statistics and you do not want to remove them, the robust scaler is the best choice.

Attributes schema

| Attribute | Description                                                | Type   | Required |
| --------- | ---------------------------------------------------------- | ------ | -------- |
| median    | The column median of the feature                           | Double |          |
| q1        | The column lower quartile (Q1) of the feature              | Double |          |
| q3        | The column upper quartile (Q3) of the feature              | Double |          |
| iqr       | The interquartile range, the difference between q3 and q1  | Double |          |

Example

feature engineering:
  ...
  features:
    compute:
      scaled_price:
        function:
          robust scaler:
            source field: price
            variables:
              median: 8.78
              q3: 11.78 
              q1: 7.67
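
For reference, a minimal Python sketch of the robust-scaling formula (x - median) / (Q3 - Q1). The sample price is hypothetical; the median, q1 and q3 values come from the example above.

def robust_scale(value, median, q1, q3):
    # Centre on the median and divide by the interquartile range (Q3 - Q1)
    return (value - median) / (q3 - q1)

print(robust_scale(11.78, 8.78, 7.67, 11.78))   # (11.78 - 8.78) / 4.11 ≈ 0.73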

Standard scaler

The standard scaler assumes the data within each feature is normally distributed and scales it so that the distribution is centred around 0 with a standard deviation of 1.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.

If data is not normally distributed, this is not the best scaler to use.

Attributes schema

| Attribute       | Description                                              | Type   | Required |
| --------------- | -------------------------------------------------------- | ------ | -------- |
| population mean | The column population mean of the feature                | Double |          |
| population std  | The column population standard deviation of the feature  | Double |          |

Example

feature engineering:
  ...
  features:
    compute:
      scaled_price:
        function:
          standard scaler:
            source field: price
            variables:
              population mean: 11.15
              population std: 1.48
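
Finally, a minimal Python sketch of standardisation for comparison. The sample price is hypothetical; the population mean and standard deviation come from the example above.

def standard_scale(value, mean, std):
    # Centre on the population mean and divide by the population standard deviation
    return (value - mean) / std

print(standard_scale(12.63, 11.15, 1.48))   # (12.63 - 11.15) / 1.48 ≈ 1.0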