Cluster

The Cluster task in Gaio DataOS applies clustering algorithms to group records with similar characteristics. It's ideal for use cases such as customer segmentation, pattern recognition, and data-driven decision-making based on behavioral or structural profiles.

Gaio uses the K-Means algorithm to identify groups. The calculations run in H2O; refer to the H2O documentation for details on the algorithm.
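To make the idea behind K-Means concrete, here is a minimal sketch of the classic algorithm (Lloyd's iteration) in plain Python. It is an illustration only, not the H2O implementation that Gaio runs; the function name `kmeans`, its parameters, and the sample points are all assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iterations=100, seed=0):
    """Illustrative Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen records
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical records with two numeric features each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

The first three records end up sharing one label and the last three the other. Which group receives label 0 depends on the random starting centroids.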


How to Use the Cluster Task


1. Open the Cluster Task

  • In the Studio, go to the Tasks panel.

  • Under the Analytics section, select Cluster.


2. Configure the Task

  • Task label: (optional) Name for identifying this step in your flow.

  • Result table: Output table that will contain the clustered results. Example: cluster_campaign.

  • Table name: Automatically populated with the selected table (e.g., new_sales).


3. Exclude Columns (Optional)

  • In the Exclude columns field, add columns that should not be considered in the clustering process, such as unique IDs (e.g., cod_cliente).

  • This helps avoid bias or noise in the algorithm.
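The effect of an ID column on distance-based grouping can be sketched as follows. The rows, the column names other than cod_cliente, and the helper functions are hypothetical, and the sketch compares raw values without any scaling, which is a simplification of what a clustering engine does.

```python
import math

# Hypothetical customer rows: cod_cliente is an ID, the rest describe behavior.
rows = [
    {"cod_cliente": 1001, "monthly_spend": 120.0, "visits": 4},
    {"cod_cliente": 9875, "monthly_spend": 125.0, "visits": 5},
    {"cod_cliente": 1002, "monthly_spend": 900.0, "visits": 30},
]

def feature_vector(row, excluded):
    """Return the numeric values of a row, skipping excluded columns."""
    return [float(value) for name, value in row.items() if name not in excluded]

def distance(a, b, excluded=()):
    """Euclidean distance between two rows over the non-excluded columns."""
    return math.dist(feature_vector(a, excluded), feature_vector(b, excluded))

# With the ID included, customer 1001 looks closer to 1002 (adjacent IDs)
# than to 9875, although 9875 behaves almost identically.
with_id_similar = distance(rows[0], rows[1])
with_id_different = distance(rows[0], rows[2])

# With the ID excluded, the behaviorally similar customers are closest.
without_id_similar = distance(rows[0], rows[1], excluded={"cod_cliente"})
without_id_different = distance(rows[0], rows[2], excluded={"cod_cliente"})
```

Because the ID carries no behavioral meaning, any closeness it produces is accidental, which is the bias the Exclude columns field is meant to prevent.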


4. Adjust Execution Settings

Execution time

  • Defines the maximum runtime of the clustering algorithm (in seconds).

  • Recommended: between 20 and 60 seconds, depending on dataset size and complexity.
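A runtime limit of this kind can be pictured as a deadline around an iterative refinement loop. The sketch below is a generic illustration under that assumption; `run_with_time_budget` and the step functions are hypothetical names and do not describe how Gaio or H2O enforce the limit internally.

```python
import time

def run_with_time_budget(step, max_seconds):
    """Call step() until it reports convergence or the time budget is spent.

    Returns (iterations, converged).
    """
    deadline = time.monotonic() + max_seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
        if step():  # step() returns True once nothing changes anymore
            return iterations, True
    return iterations, False

# A hypothetical refinement step that stabilizes on its fifth call.
calls = {"count": 0}

def converges_after_five():
    calls["count"] += 1
    return calls["count"] >= 5

iterations, converged = run_with_time_budget(converges_after_five, max_seconds=5.0)

# A step that never stabilizes is cut off when the budget runs out.
_, converged_in_time = run_with_time_budget(lambda: False, max_seconds=0.05)
```

A larger budget only helps when the algorithm has not yet stabilized; once it converges, the remaining time is simply unused.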

Max cluster size

  • Sets the maximum number of clusters the algorithm can create.

  • Example: if set to 3, the output will contain up to 3 distinct groups.

Automatic clusters size

  • When enabled, Gaio will automatically determine the ideal number of clusters based on the data's variability.

  • When disabled, it will strictly follow the manual limit set in Max cluster size.
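One way to think about choosing the number of clusters automatically is to keep adding clusters while each one still explains a meaningful share of the data's variability, never exceeding the configured maximum. The sketch below implements that simple heuristic; it is an assumption for illustration and not the rule Gaio or H2O applies. The function names, the `min_gain` threshold, and the sample points are hypothetical.

```python
import math

def farthest_first(points, k):
    """Deterministic start: each new centroid is the point farthest from those chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans_wcss(points, k, max_iterations=100):
    """Run K-Means and return the within-cluster sum of squared distances."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def choose_k(points, max_k, min_gain=0.05):
    """Largest k <= max_k for which each added cluster removed at least
    min_gain of the total (k = 1) variability."""
    total = kmeans_wcss(points, 1)
    previous, chosen = total, 1
    for k in range(2, max_k + 1):
        current = kmeans_wcss(points, k)
        if total == 0 or (previous - current) / total < min_gain:
            break  # an extra cluster no longer explains much variability
        previous, chosen = current, k
    return chosen

# Three well-separated hypothetical groups of three records each.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]
auto_k = choose_k(points, max_k=5)    # 3: a fourth cluster adds almost nothing
capped_k = choose_k(points, max_k=2)  # 2: the maximum caps the result
```

The second call shows why Max cluster size still matters with automatic selection enabled: the data suggests three groups, but the limit of two is respected.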


5. Save and Run

  • Click Save to confirm the task configuration.

  • Run the flow. The output table will contain your clustered data.


Output

The resulting table will include:

  • All original columns (excluding those set to be ignored)

  • A new column indicating the assigned cluster ID for each row
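The shape of the result can be sketched as follows: excluded columns are dropped and a cluster ID is appended to every row. The column name `cluster`, the helper `build_output`, and the sample rows are assumptions for illustration; the actual name of the new column in Gaio may differ.

```python
def build_output(rows, labels, excluded, cluster_column="cluster"):
    """Drop excluded columns and append each row's cluster ID."""
    output = []
    for row, label in zip(rows, labels):
        kept = {name: value for name, value in row.items() if name not in excluded}
        kept[cluster_column] = label  # hypothetical column name
        output.append(kept)
    return output

# Hypothetical input rows and the cluster IDs a clustering step assigned to them.
rows = [
    {"cod_cliente": 1001, "monthly_spend": 120.0, "visits": 4},
    {"cod_cliente": 1002, "monthly_spend": 900.0, "visits": 30},
]
labels = [0, 1]

result = build_output(rows, labels, excluded={"cod_cliente"})
```

Each row of `result` keeps `monthly_spend` and `visits`, gains a `cluster` value, and no longer contains `cod_cliente`.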


Best Practices

  • Run a Sample task beforehand to reduce the number of rows, or a Principal Component Analysis (PCA) task to reduce dimensionality; both can improve performance.

  • Remove irrelevant or high-cardinality columns that could distort clustering results.

  • Leverage clustering to personalize campaigns, identify customer profiles, detect anomalies, or support retention strategies.
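As a rough aid for spotting high-cardinality columns before clustering, the sketch below flags columns in which almost every row has a distinct value. The helper names, the 0.9 threshold, and the sample rows are hypothetical. The check is meant for identifier and categorical columns, since continuous numeric measures are naturally close to unique.

```python
def cardinality_ratio(rows, column):
    """Share of distinct values in a column (1.0 means every row is unique)."""
    values = [row[column] for row in rows]
    return len(set(values)) / len(values)

def high_cardinality_columns(rows, candidate_columns, threshold=0.9):
    """Return the candidate columns whose distinct-value share reaches the threshold."""
    return [c for c in candidate_columns if cardinality_ratio(rows, c) >= threshold]

# Hypothetical rows with an ID and two categorical columns.
rows = [
    {"cod_cliente": "C001", "region": "south", "plan": "basic"},
    {"cod_cliente": "C002", "region": "south", "plan": "premium"},
    {"cod_cliente": "C003", "region": "north", "plan": "basic"},
    {"cod_cliente": "C004", "region": "north", "plan": "basic"},
]

to_exclude = high_cardinality_columns(rows, ["cod_cliente", "region", "plan"])
```

Here only `cod_cliente` is flagged, making it a candidate for the Exclude columns field.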
