AutoML

Gaio uses H2O AutoML (Automatic Machine Learning) to create predictive models. Gaio handles the data connection and data processing, passes the training data and modeling directives to H2O AutoML, retrieves the results of the run, and presents them in a user-friendly interface. This entire process can be automated within Gaio.

How to Use AutoML

1. Access the "AutoML" Task

In the left-side menu, go to Analytics and select the AutoML task.

2. Configure the Model

In the configuration screen:

Model Name (optional): Enter a name for your model (e.g., auto_ML).
Table: Select the data source table.
Target: Choose the variable you want to predict (e.g., status).
Columns to remove: If there are columns that should be excluded (such as IDs), list them here.
Training Time (Seconds): Estimated time the system will use to train the models.
Rows limit: By default, Gaio uses up to 100,000 rows to train the model. You can adjust this, but higher values may overload the server.

The modeling process is often memory and processing intensive, so pay close attention to the number of rows in the table you use. A good sample is an excellent strategy: it generally represents the entire data set well, allows more models to be created in less time, and avoids overloading the server. For large datasets, use the Sample task first to reduce volume and optimize performance.

By default, Gaio limits training to 100,000 rows. You can raise this value, but be aware of the impact: doing so is only worthwhile when the server has ample memory and processing capacity.
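To illustrate what sampling down to the row limit means, here is a minimal sketch in plain Python. It is not how Gaio's Sample task is implemented; the function name `sample_rows`, the fixed seed, and the example rows are assumptions made for illustration only.

```python
import random

def sample_rows(rows, limit=100_000, seed=42):
    """Return at most `limit` rows, chosen uniformly at random without replacement."""
    if len(rows) <= limit:
        # Small tables are used in full.
        return list(rows)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(rows, limit)

# Hypothetical table with 250,000 rows.
rows = [{"id": i, "status": "R1" if i % 4 == 0 else "R0"} for i in range(250_000)]
sampled = sample_rows(rows)
print(len(sampled))  # 100000
```

A uniform random sample keeps roughly the same proportions of each target value as the full table, which is why it usually represents the data well.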

Click Save and Train to begin the process.

3. Track the Progress

While training, the interface displays two progress bars:

Preparation: Data preprocessing stage.
Training: Model construction and testing.

4. Techniques

Several techniques are used in the automatic modeling process. Each technique listed below is described in the official H2O documentation:

GLM: Generalized Linear Model.
XGBoost: Gradient-boosted decision trees, with tree construction parallelized for speed.
GBM: Gradient Boosting Machine.
DeepLearning: Neural networks.
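For readers familiar with H2O's Python client, the sketch below shows roughly how the form fields and the techniques above could map onto an H2O AutoML run. Gaio's actual internal call is not documented on this page, so the mapping in `automl_settings`, the `train` helper, and the seed value are assumptions. Running `train` requires the `h2o` package and a reachable H2O instance.

```python
def automl_settings(training_time_secs):
    """Hypothetical mapping from the Gaio form to H2O AutoML arguments."""
    return {
        "max_runtime_secs": training_time_secs,  # Training Time (Seconds)
        "nfolds": 5,                             # 5-fold cross-validation
        "include_algos": ["GLM", "XGBoost", "GBM", "DeepLearning"],
        "seed": 1,                               # assumed, for reproducibility
    }

def train(csv_path, target, columns_to_remove, training_time_secs):
    """Train with H2O AutoML and return the leaderboard of models."""
    # Imported lazily so the settings above can be inspected without h2o installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    frame = h2o.import_file(csv_path)
    # Use every column except the target and the excluded ones as predictors.
    predictors = [
        c for c in frame.columns
        if c != target and c not in columns_to_remove
    ]
    aml = H2OAutoML(**automl_settings(training_time_secs))
    aml.train(x=predictors, y=target, training_frame=frame)
    return aml.leaderboard

settings = automl_settings(120)
print(settings["include_algos"])
```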

Training and validation criteria are applied. Gaio uses cross-validation to evaluate how well the models predict. 5-fold cross-validation splits the data into 5 random samples of the same size; each model is trained on 4 of them and validated on the remaining one, rotating through all 5, as shown in the image below:

The criterion for prioritizing the model is Accuracy.
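The 5-fold scheme can be sketched in a few lines of plain Python. This is a generic illustration of k-fold splitting, not H2O's implementation; the function name `five_fold_splits` and the seed are assumptions.

```python
import random

def five_fold_splits(n_rows, k=5, seed=0):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # random assignment of rows to folds
    folds = [indices[i::k] for i in range(k)]  # k folds of (nearly) equal size
    for i in range(k):
        validation = folds[i]
        # Train on every fold except the one held out for validation.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(five_fold_splits(100))
for training, validation in splits:
    print(len(training), len(validation))  # 80 20 on each of the 5 lines
```

Every row is used for validation exactly once, so the averaged score reflects performance on data the model did not see during training.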

Both categorical (text) and numeric response variables are accepted. A numeric response variable is always treated as a number to predict, not as an event whose probability should be estimated.

If the response variable is, for example, Service Cancellation with values 0 or 1, transform the values in this column into text labels such as R0 and R1. In this case you want the probability that the customer cancels (1) and the probability that the customer does not cancel (0). Because the column is numeric, however, Gaio assumes the intention is to predict a number, such as the amount a customer might purchase. The two types of response variable use different techniques and produce different results.
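The relabeling itself is simple. A minimal sketch in plain Python follows; the function `relabel_binary_target` and the column names are hypothetical, and inside Gaio you would apply the same logic with whichever transformation task you normally use.

```python
def relabel_binary_target(rows, column, prefix="R"):
    """Return copies of the rows with a numeric 0/1 target turned into text labels."""
    relabeled = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows are left untouched
        new_row[column] = f"{prefix}{int(row[column])}"
        relabeled.append(new_row)
    return relabeled

# Hypothetical customer rows with a numeric cancellation flag.
customers = [
    {"customer_id": 1, "cancelled": 0},
    {"customer_id": 2, "cancelled": 1},
    {"customer_id": 3, "cancelled": 0},
]
labeled = relabel_binary_target(customers, "cancelled")
print([row["cancelled"] for row in labeled])  # ['R0', 'R1', 'R0']
```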

5. Review the Results

Once completed, the system will display a full report including:

Summary: A summary of the automatic model building process is generated, and the overall quality of the model is reported.
- Model Accuracy: Shows the accuracy of the best model created.
- ROC Curve: A visual representation of model performance.
- Most Important Variables: Lists the top predictive features in order of importance.
Models: Lists every model created within the configured training time, with quality statistics for each.
Supporting Tables:
- Cross Validation
- Confusion Matrix
- Gain Table
- Maximum Metrics
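To make the accuracy figure and the confusion matrix concrete, the sketch below computes both from a handful of made-up predictions. It is a generic illustration, not Gaio's or H2O's reporting code; the labels and values are invented for the example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of rows where the predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Made-up labels for six customers.
actual = ["R1", "R0", "R1", "R0", "R0", "R1"]
predicted = ["R1", "R0", "R0", "R0", "R1", "R1"]

matrix = confusion_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a} predicted={p} count={count}")
print(round(accuracy(actual, predicted), 3))  # 0.667
```

The diagonal pairs (actual equals predicted) are the correct predictions; accuracy is their share of all rows.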

6. Apply the Model

The trained model is saved and can be reused through the Scoring task to apply predictions to new data.

