# Scaling

## Objective

Feature scaling normalises data to **improve model accuracy and comparability** across features.

Key methods include:

1. [<mark style="color:green;">**Max scaler**</mark>](#max-scaler)\
   Scales features to between -1 and 1 using the absolute maximum; sensitive to outliers.
2. [<mark style="color:green;">**Min-Max scaler**</mark>](#min-max-scaler)\
   Scales features to a specified range, e.g., \[0,1]; sensitive to outliers.
3. [<mark style="color:green;">**Robust scaler**</mark>](#robust-scaler)\
   Uses median and interquartile range, resistant to outliers.
4. [<mark style="color:green;">**Standard scaler**</mark>](#standard-scaler)\
   Centers data around 0 with a standard deviation of 1, ideal for normally distributed data.

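Each scaler takes the statistics it needs (for example, the column min and max) as explicit `variables`, so values computed on the training data can be supplied directly in the configuration.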
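The variables used throughout this page are plain column statistics. As a rough sketch of how they might be computed before being copied into the configuration, the snippet below uses Python's standard `statistics` module. The function name `scaler_variables`, the sample data, and the choice of the `inclusive` quartile method are assumptions made for illustration; other tools may compute quartiles slightly differently.

```python
import statistics


def scaler_variables(column: list[float]) -> dict[str, float]:
    """Compute the column statistics that the scalers take as variables."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median) and Q3.
    # The "inclusive" method is an assumption; quartile conventions vary.
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    return {
        "absolute max": max(abs(v) for v in column),
        "min": min(column),
        "max": max(column),
        "median": statistics.median(column),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "population mean": statistics.fmean(column),
        "population std": statistics.pstdev(column),
    }


# Hypothetical training column, for illustration only.
training_prices = [7.0, 8.0, 9.0, 10.0, 11.0]
for name, value in scaler_variables(training_prices).items():
    print(f"{name}: {value}")
```

The keys mirror the attribute names in the tables on this page, so each value can be pasted under the matching `variables` entry.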

## Max scaler

The max scaler maps the data to between -1 and 1 by scaling it according to the feature's absolute maximum. Because a single extreme value determines that maximum, this scaler is not suitable for data with outliers.

The data needs pre-processing first, such as handling outliers.
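To make the behaviour concrete, here is a minimal Python sketch of max-abs scaling under the conventional definition, where each value is divided by the absolute maximum. The function name `max_abs_scale` and the sample prices are hypothetical, and the exact computation performed by the platform is assumed rather than confirmed.

```python
def max_abs_scale(value: float, absolute_max: float) -> float:
    """Divide by the absolute maximum so training values land in [-1, 1]."""
    if absolute_max <= 0:
        raise ValueError("absolute_max must be positive")
    return value / absolute_max


# Hypothetical prices; 12.78 is the absolute max seen during training.
absolute_max = 12.78
for price in [-6.39, 3.195, 12.78, 25.56]:
    print(price, "->", max_abs_scale(price, absolute_max))
```

The last value, 25.56, is larger than the absolute max seen during training, so it scales to 2.0 and falls outside the -1 to 1 range. This is why outliers should be handled before the absolute max is computed.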

### Attributes

<table><thead><tr><th width="176">Attribute</th><th width="301">Description</th><th width="130">Type</th><th data-type="checkbox">Required</th></tr></thead><tbody><tr><td>absolute max</td><td>The column absolute max of the feature</td><td>Double</td><td>true</td></tr></tbody></table>

### Example

```yaml
feature engineering:
  ...
  features:
    compute:
      scaled_price:
        function:
          max scaler:
            source field: price
            variables:
              absolute max: 12.78
```

## Min-Max scaler

Transforms features by scaling each feature to a given range.

This scaler scales and translates each feature individually so that, on the training set, it falls within the given range, for example between zero and one. If the feature contains negative values, a range of -1 to 1 can be used instead.

The range can be set to, for example, \[0,1], \[0,5] or \[-1,1]. This scaler responds well if the standard deviation is small and the distribution is not Gaussian.

This scaler is sensitive to outliers.
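The description above corresponds to the usual min-max formula, sketched below in Python. The function name `min_max_scale` is hypothetical, and treating `interval` as a lower and upper bound of the target range is an assumption based on the attribute name.

```python
def min_max_scale(
    value: float,
    col_min: float,
    col_max: float,
    interval: tuple[float, float] = (0.0, 1.0),
) -> float:
    """Map [col_min, col_max] linearly onto the target interval."""
    if col_max == col_min:
        raise ValueError("col_max must differ from col_min")
    low, high = interval
    # First rescale to [0, 1], then stretch and shift to [low, high].
    unit = (value - col_min) / (col_max - col_min)
    return low + unit * (high - low)


# Column statistics taken from training data (illustrative values).
print(min_max_scale(10.00, 10.00, 12.78))                  # lower bound of [0, 1]
print(min_max_scale(12.78, 10.00, 12.78))                  # upper bound of [0, 1]
print(min_max_scale(11.39, 10.00, 12.78, (-1.0, 1.0)))     # midpoint of [-1, 1]
```

Values below the training min or above the training max map outside the target interval, which is the outlier sensitivity noted above.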

### Attributes schema

<table><thead><tr><th>Attribute</th><th width="334">Description</th><th width="136">Type</th><th data-type="checkbox">Required</th></tr></thead><tbody><tr><td>min</td><td>The column min of the feature</td><td>Double array</td><td>true</td></tr><tr><td>max</td><td>The column max of the feature</td><td>Double array</td><td>true</td></tr><tr><td>interval</td><td>The target range of the scaled feature, given as a lower and an upper bound</td><td>Double array</td><td>true</td></tr></tbody></table>

### Example

```yaml
feature engineering:
  ...
  features:
    compute:
      scaled_price:
        function:
          minmax scaler:
            source field: price
            variables:
              min: 10.00
              max: 12.78
```

## Robust scaler

The robust scaler is a median-based scaling method.

The formula of `robust scaler` is `(Xi - Xmedian) / Xiqr`, where `Xiqr` is the interquartile range. This scaler is largely unaffected by outliers.

Since it uses the interquartile range, it absorbs the effects of outliers while scaling. The interquartile range (Q3 - Q1) covers the middle half of the data points. If the data contains outliers that might affect the results or statistics and you do not want to remove them, `robust scaler` is a good choice.
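Reading the formula as the distance from the median divided by the interquartile range, a minimal Python sketch looks like this. The function name `robust_scale` is hypothetical; deriving `iqr` from `q3 - q1` when it is not supplied follows the attribute description for `iqr`, but the platform's exact behaviour is assumed.

```python
from typing import Optional


def robust_scale(
    value: float,
    median: float,
    q1: float,
    q3: float,
    iqr: Optional[float] = None,
) -> float:
    """Center on the median and divide by the interquartile range."""
    if iqr is None:
        iqr = q3 - q1  # fall back to the difference of the quartiles
    if iqr <= 0:
        raise ValueError("iqr must be positive")
    return (value - median) / iqr


# Illustrative statistics: median 8.78, Q1 7.67, Q3 11.78.
for price in [7.67, 8.78, 11.78, 50.0]:
    print(price, "->", robust_scale(price, median=8.78, q1=7.67, q3=11.78))
```

An extreme value such as 50.0 still produces a large scaled value, but it does not change the median or the quartiles, so the scaling of the remaining values is unaffected.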

### Attributes schema

<table><thead><tr><th>Attribute</th><th width="338">Description</th><th width="152">Type</th><th data-type="checkbox">Required</th></tr></thead><tbody><tr><td>median</td><td>The column median of the feature</td><td>Double</td><td>true</td></tr><tr><td>q1</td><td>The column first quartile (Q1) of the feature</td><td>Double</td><td>true</td></tr><tr><td>q3</td><td>The column third quartile (Q3) of the feature</td><td>Double</td><td>true</td></tr><tr><td>iqr</td><td>The interquartile range, calculated as the difference of q3 and q1</td><td>Double</td><td>false</td></tr></tbody></table>

### Example

```yaml
feature engineering:
  ...
  features:
    compute:
      scaled_price:
        function:
          robust scaler:
            source field: price
            variables:
              median: 8.78
              q3: 11.78 
              q1: 7.67
```

## Standard scaler

The standard scaler assumes the data is normally distributed within each feature and scales each feature so that its distribution is centered around 0, with a standard deviation of 1.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.

{% hint style="warning" %}
If data is not normally distributed, this is not the best scaler to use.
{% endhint %}
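Standardisation is conventionally computed as a z-score: subtract the mean, then divide by the standard deviation. The Python sketch below assumes that definition; the function name `standard_scale` and the sample values are hypothetical.

```python
def standard_scale(value: float, population_mean: float, population_std: float) -> float:
    """Return the z-score of value for the given mean and standard deviation."""
    if population_std <= 0:
        raise ValueError("population_std must be positive")
    return (value - population_mean) / population_std


# Illustrative statistics: mean 11.15, standard deviation 1.48.
for price in [9.67, 11.15, 12.63]:
    print(price, "->", standard_scale(price, 11.15, 1.48))
```

A value equal to the mean scales to 0, and a value one standard deviation above or below the mean scales to approximately 1 or -1.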

### Attributes schema

<table><thead><tr><th>Attribute</th><th width="329">Description</th><th width="150">Type</th><th data-type="checkbox">Required</th></tr></thead><tbody><tr><td>population mean</td><td>The column population mean of the feature</td><td>Double</td><td>true</td></tr><tr><td>population std</td><td>The column population standard deviation of the feature</td><td>Double</td><td>true</td></tr></tbody></table>

### Example

```yaml
feature engineering:
  ...
  features:
    compute:
      scaled_price:
        function:
          standard scaler:
            source field: price
            variables:
              population mean: 11.15
              population std: 1.48
```
