Cluster

The Cluster task in Gaio DataOS applies clustering algorithms to group records with similar characteristics. It's ideal for use cases such as customer segmentation, pattern recognition, and data-driven decision-making based on behavioral or structural profiles.

Gaio identifies groups with the K-Means technique. The calculations run in H2O; see the H2O documentation for details on the algorithm.
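For readers new to K-Means, the core loop can be sketched in plain Python. This is a minimal illustration of the general technique (Lloyd's algorithm), not Gaio's or H2O's implementation. The function name `kmeans`, the sample points, and the naive "first k points" initialisation are assumptions made for the example.

```python
import math

def kmeans(points, k, max_iter=100):
    """Group points into k clusters with Lloyd's algorithm (illustrative only)."""
    # Naive initialisation (an assumption): the first k points become centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up two-dimensional records forming two obvious groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1)]
labels, centroids = kmeans(points, k=2)
```

Each entry in `labels` is the cluster index of the corresponding record, which is conceptually what the task writes to the result table.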

How to Use the Cluster Task

1. Open the Cluster Task

In the Studio, go to the Tasks panel.
Under the Analytics section, select Cluster.

2. Configure the Task

Task label: (optional) Name for identifying this step in your flow.
Result table: Output table that will contain the clustered results. Example: cluster_campaign.
Table name: Automatically populated with the selected table (e.g., new_sales).

3. Exclude Columns (Optional)

In the Exclude columns field, add columns that should not be considered in the clustering process, such as unique IDs (e.g., cod_cliente).
This helps avoid bias or noise in the algorithm.
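To see why an identifier adds noise, consider how a distance-based algorithm such as K-Means compares rows. The sketch below uses invented customer rows: `cod_cliente` comes from the example above, while the `age` and `monthly_spend` columns and all values are hypothetical. It compares Euclidean distances with and without the ID.

```python
import math

# Hypothetical rows: (cod_cliente, age, monthly_spend)
a = (1001, 34, 250.0)
b = (9870, 35, 255.0)  # behaves like a, but has a very different ID
c = (1002, 62, 900.0)  # neighbouring ID, very different behaviour

def distance(x, y, skip_id=False):
    """Euclidean distance, optionally skipping the ID in position 0."""
    start = 1 if skip_id else 0
    return math.dist(x[start:], y[start:])

# With the ID included, a looks closer to c than to b.
id_misleads = distance(a, b) > distance(a, c)

# With the ID excluded, a is correctly closest to b.
behaviour_wins = distance(a, b, skip_id=True) < distance(a, c, skip_id=True)
```

Scaling the columns would shrink the effect but not remove it, because the ID would still contribute variation that carries no behavioral meaning.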

4. Adjust Execution Settings

Execution time

Defines the maximum runtime of the clustering algorithm (in seconds).
Recommended: between 20 and 60 seconds, depending on dataset size and complexity.

Max cluster size

Sets the maximum number of clusters the algorithm can create.
Example: if set to 3, the output will contain up to 3 distinct groups.

Automatic clusters size

When enabled, Gaio will automatically determine the ideal number of clusters based on the data's variability.
When disabled, it will strictly follow the manual limit set in Max cluster size.
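Because the calculations run in H2O, these settings plausibly correspond to parameters of H2O's K-Means estimator. The mapping below is an assumption and is not documented on this page: `max_runtime_secs`, `k`, `estimate_k`, and `ignored_columns` are existing H2O K-Means parameters, but how Gaio passes values to them internally is not stated. The helper name `to_h2o_kmeans_params` is hypothetical.

```python
def to_h2o_kmeans_params(execution_time, max_cluster_size,
                         automatic_clusters_size, exclude_columns=()):
    """Translate Cluster task settings into H2O K-Means keyword arguments.

    Assumed mapping for illustration; Gaio's internal call may differ.
    """
    if execution_time <= 0:
        raise ValueError("execution_time must be a positive number of seconds")
    if max_cluster_size < 1:
        raise ValueError("max_cluster_size must be at least 1")
    return {
        "max_runtime_secs": execution_time,      # Execution time
        "k": max_cluster_size,                   # Max cluster size
        "estimate_k": automatic_clusters_size,   # Automatic clusters size
        "ignored_columns": list(exclude_columns),  # Exclude columns
    }

params = to_h2o_kmeans_params(30, 3, True, ["cod_cliente"])
# With the h2o package installed and an H2O cluster running, a dictionary
# like this could be unpacked into h2o.estimators.H2OKMeansEstimator(**params).
```

In H2O, enabling `estimate_k` treats `k` as an upper bound on the number of clusters, which matches the behavior described for Automatic clusters size.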

5. Save and Run

Click Save to confirm the task configuration.
Run the flow. The output table will contain your clustered data.

Output

The resulting table will include:

All original columns (excluding those set to be ignored)
A new column indicating the assigned cluster ID for each row
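The shape of the result can be pictured with a small sketch. Everything here is illustrative: the rows are invented, and the output column name `cluster` is a placeholder, since this page does not state the name Gaio assigns to the new column.

```python
def build_output(rows, cluster_ids, exclude_columns=()):
    """Drop excluded columns and append a cluster ID to every row."""
    output = []
    for row, cluster_id in zip(rows, cluster_ids):
        kept = {col: val for col, val in row.items()
                if col not in exclude_columns}
        kept["cluster"] = cluster_id  # placeholder column name
        output.append(kept)
    return output

# Hypothetical input rows and cluster assignments.
rows = [
    {"cod_cliente": 1001, "age": 34, "monthly_spend": 250.0},
    {"cod_cliente": 1002, "age": 62, "monthly_spend": 900.0},
]
result = build_output(rows, cluster_ids=[0, 1], exclude_columns=["cod_cliente"])
```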

Best Practices

Run the Sample task beforehand to reduce the number of rows, or Principal Component Analysis (PCA) to reduce dimensionality. Both improve performance.
Remove irrelevant or high-cardinality columns that could distort clustering results.
Leverage clustering to personalize campaigns, identify customer profiles, detect anomalies, or support retention strategies.
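One way to spot high-cardinality columns before running the task is to compare the number of distinct values in each column with the number of rows. The sketch below assumes the data is available as a list of dictionaries. The function name and the 0.9 threshold are arbitrary choices for illustration, not Gaio settings.

```python
def high_cardinality_columns(rows, threshold=0.9):
    """Return columns whose share of distinct values meets the threshold."""
    if not rows:
        return []
    flagged = []
    for col in rows[0]:
        distinct = len({row[col] for row in rows})
        if distinct / len(rows) >= threshold:
            flagged.append(col)
    return flagged

# Hypothetical sample of the input table.
rows = [
    {"cod_cliente": 1, "region": "south", "segment": "A"},
    {"cod_cliente": 2, "region": "south", "segment": "B"},
    {"cod_cliente": 3, "region": "north", "segment": "A"},
    {"cod_cliente": 4, "region": "north", "segment": "B"},
]
candidates = high_cardinality_columns(rows)
```

Continuous numeric measures also score high on this ratio, so treat the result as a list of candidates to review rather than columns to exclude automatically.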

