# File processing

## Overview

Under the hood, Joule uses [Apache Arrow](https://arrow.apache.org/) to read files, which enables efficient handling of large files and out-of-the-box support for standard file formats. The classes that perform this work are exposed to developers as `Callable` tasks.

Two key classes are provided:

1. [FileProcessingTask](#fileprocessingtask)
2. [ReferenceDataFileProcessingTask](#referencedatafileprocessingtask)

### Supported file formats

* PARQUET
* ORC
* CSV
* JSON
* ARROW\_IPC

{% hint style="success" %}
The provided classes can be found under the SDK package

```
com.fractalworks.streams.sdk.util.file
```

{% endhint %}

## FileProcessingTask

This task reads a file's contents and automatically converts each logical row into a `StreamEvent` object. It uses micro-batch processing, which reduces memory and processing overhead while sustaining stream processing throughput.
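The micro-batch idea can be illustrated with a minimal, self-contained sketch. It is not the SDK implementation: `MicroBatchSketch` and `dispatchInBatches` are hypothetical names, and a plain `Consumer` stands in for the listener. It only shows how rows can be grouped into bounded batches so that no more than one batch is held in memory at a time.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: MicroBatchSketch is a hypothetical class, not part of the Joule SDK.
class MicroBatchSketch {

    // Drains the row source in chunks of at most batchSize and hands each chunk
    // to the listener. Returns the total number of rows dispatched.
    static <T> long dispatchInBatches(Iterator<T> rows, int batchSize, Consumer<Collection<T>> listener) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long total = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (rows.hasNext()) {
            batch.add(rows.next());
            if (batch.size() == batchSize) {
                listener.accept(batch);
                total += batch.size();
                // Start a fresh list so the listener may safely keep the previous one
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            // Flush the final, partially filled batch
            listener.accept(batch);
            total += batch.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        long total = dispatchInBatches(List.of("a", "b", "c", "d", "e").iterator(), 2, b -> sizes.add(b.size()));
        System.out.println(total + " rows in batches of " + sizes); // prints "5 rows in batches of [2, 2, 1]"
    }
}
```

A smaller batch size lowers peak memory at the cost of more listener invocations. The batch size used by the SDK is not stated here.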

### Example

The example below loads a file and counts the events delivered to the listener.

```java
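// JDK imports; Joule SDK imports are omitted here
import java.io.File;
import java.util.Collection;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
// await() is assumed to come from the Awaitility test library
import static org.awaitility.Awaitility.await;

// Loads the file and counts the events delivered to the listener
FileProcessStatus consumeFile(String eventType, String filename, String absoluteFilePath, FileFormat fileFormat, AtomicLong counter) throws Exception {
    // Count each batch of events as it arrives
    var listener = new TransportListener() {
        @Override
        public void onEvent(Collection<StreamEvent> events) {
            counter.addAndGet(events.size());
        }
    };

    File file = new File(absoluteFilePath);
    FileProcessingTask task = new FileProcessingTask(eventType, filename, file.getAbsolutePath(), fileFormat, listener);
    FileProcessStatus status = task.call();

    // Poll every 100 ms until all expected events have been received
    await()
            .pollInterval(100, TimeUnit.MILLISECONDS)
            .until(checkForEvents(counter));
    return status;
}

// Returns true once the expected number of events has been received.
// NUM_EVENTS is assumed to be a constant defined by the enclosing test class.
private Callable<Boolean> checkForEvents(AtomicLong eventsSeen) {
    return () -> (eventsSeen.get() == NUM_EVENTS);
}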
```
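Because the task is surfaced as a `Callable`, it can also be handed to a standard `ExecutorService` rather than invoked directly with `call()`. The sketch below illustrates that pattern with a stand-in `Callable` that returns a string. `CallableTaskSketch` and `runInBackground` are hypothetical names, and substituting a real `FileProcessingTask` assumes it implements `Callable<FileProcessStatus>`, as the example above suggests.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative only: the Callable used here is a stand-in for a file processing task.
class CallableTaskSketch {

    // Runs any Callable on a single worker thread and waits up to
    // timeoutSeconds for its result.
    static <V> V runInBackground(Callable<V> task, long timeoutSeconds) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<V> future = executor.submit(task);
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } finally {
            // Always release the worker thread, even on timeout or failure
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Callable<String> standIn = () -> "COMPLETED";
        System.out.println(runInBackground(standIn, 5)); // prints "COMPLETED"
    }
}
```

Running the task off the calling thread keeps the caller responsive while a large file is read. The timeout bounds how long the caller waits for the result.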

## ReferenceDataFileProcessingTask

This task reads a reference data file and automatically converts each logical row into a [ReferenceData](/joule/developer-guides/builder-sdk/data-types/referencedataobject.md) object. It uses micro-batch processing to reduce memory footprint and processing overhead, which allows large files to be loaded into memory.

ReferenceData objects are held in an in-memory data store to reduce retrieval latency and I/O overhead.
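To make the in-memory store idea concrete, here is a minimal sketch of a keyed store backed by a `ConcurrentHashMap`. It is an assumption-laden illustration: `InMemoryStoreSketch` is a hypothetical class that does not implement the SDK `Store` interface, and the key and value types are placeholders.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: InMemoryStoreSketch is hypothetical and does not
// implement the SDK Store interface.
class InMemoryStoreSketch<K, V> {
    // Thread-safe map so loaders and readers can run concurrently
    private final Map<K, V> entries = new ConcurrentHashMap<>();

    // Adds or replaces the entry for the given key
    void put(K key, V value) {
        entries.put(key, value);
    }

    // Looks up an entry without touching disk; empty when the key is unknown
    Optional<V> get(K key) {
        return Optional.ofNullable(entries.get(key));
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        InMemoryStoreSketch<String, String> store = new InMemoryStoreSketch<>();
        store.put("tower-1", "51.50,-0.12");
        System.out.println(store.get("tower-1").orElse("missing")); // prints "51.50,-0.12"
        System.out.println(store.get("tower-2").orElse("missing")); // prints "missing"
    }
}
```

Lookups are served from memory, so no file I/O happens at query time. The trade-off is that the whole data set must fit in the heap.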

### Example

This example is taken from the `CellTowerCSVParserTest` test class in the [fractalworks-geospatial-processor](https://gitlab.com/joule-platform/fractalworks-stream-processors/-/tree/master/fractalworks-geospatial-processor) project.

```java
// Create an in-memory store that implements the Store interface
CellTowerStore cellTowerStore = new CellTowerStore(250, 250);
cellTowerStore.setMaxElementsPerLevel(10000);
cellTowerStore.setTreeLevels(5);
cellTowerStore.initialize();

// Load the cell tower file contents into the store
File f = new File(CELLTOWER_FILE);
var task = new ReferenceDataFileProcessingTask<CellTower>((Store)cellTowerStore, f.getName(), f.getAbsolutePath(), FileFormat.CSV);
task.setMoveFileAfterProcessing(false);
task.setParser(new CellTowerArrowParser());
var status = task.call();
```

