
Python




This task lets you run scripts in the Python language; the version used can be chosen from the versions made available by your Gaio administrator. Libraries can be installed and managed by Gaio developers. In addition, we provide a class called bucket that lets you extract data from, and export data to, the ClickHouse database that your application has permission to use.

Memory Limit

By default, the Python task in Gaio is limited to a maximum of 80% of the machine's memory. If a script exceeds this limit, it returns a memory limit error.
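One way to stay under the limit is to read only the columns you actually need instead of copying an entire table, using the bucket.query_df method described later on this page. A minimal sketch; wide_table and its columns are hypothetical names:

# Pull only the needed columns instead of the whole (hypothetical) wide_table,
# so the resulting pandas DataFrame uses far less memory
df = bucket.query_df('select customer_id, revenue from wide_table')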


How to Configure the Python Task

First we will walk through the task interface, and then develop a simple script to serve as an example.


1. Open the Python Task

  • In the Studio, go to the Tasks panel.

  • Under the Analytics section, select Python.


2. Fill in the Required Fields

The first page is the task's main page. On the left, in a blue theme, is the space where you write the script; on the right, in a dark theme, is the console, where you can view the script's output. To run your script, simply click the "run" button and the result will be displayed in the console.

Files generated by the script, such as JPEG, PNG, MP4, or PKL files, can be saved to a folder named assets.

There are three folders you can use from the Python task: the inputs, outputs, and assets folders of your application. Their paths are stored in the variables app_inputs, app_outputs, and app_assets.

Below is an example of how to build a path in the outputs folder so you can download a generated image.

path = app_outputs + "image_name.png"
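Similarly, a minimal sketch of reading a file from the inputs folder; the file name data.csv is hypothetical:

import pandas as pd

# Read a CSV file placed in the application's inputs folder
df = pd.read_csv(app_inputs + "data.csv")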

On the environment page, you must write in the text box the exact name of each library you want to install, one per line (just the name, without any other characters). After choosing the Python version and the libraries, simply click the "Install" button for your configuration to be applied.
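For example, to install the libraries used in the practical example later on this page, the text box would contain:

pandas
scikit-learn
matplotlib
joblib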

As previously mentioned, we have a class called bucket, which connects to ClickHouse in an encapsulated way and provides the query_df, select_df, command, insert_df, and create_df methods.

Examples

query_df runs a ClickHouse select statement and returns the result as a pandas DataFrame in Python.

df = bucket.query_df("select columnA, columnB from table where columnB = 'active'")

select_df copies the indicated ClickHouse table into a pandas DataFrame.

df = bucket.select_df('new_table')

On the first line, create_df creates a table in ClickHouse with the same structure as your pandas DataFrame; on the second line, insert_df inserts the data from your pandas DataFrame into that ClickHouse table.

bucket.create_df('new_table', df)
bucket.insert_df('new_table', df)

Note that the insert_df function requires your pandas DataFrame to have the same structure as the target ClickHouse table.
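The remaining method, command, is for statements that return no result set. A minimal sketch, assuming command simply executes the given ClickHouse statement; the table name is hypothetical:

# Assumption: command runs a ClickHouse statement without returning rows
bucket.command('drop table if exists tmp_old_results')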

Practical example

In this practical example we will bring the data into Python, run a clustering algorithm, save an image in PNG format, save the model file, and create and populate the final table in ClickHouse.

First, let's import the libraries that will be used

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import joblib

For this example we will use the famous iris table, provided by several libraries such as scikit-learn; the table is already in the ClickHouse database within Gaio. We will use the select_df function to bring it into Python, and then apply the k-means algorithm provided by the scikit-learn library.

# Bring data into Python
data = bucket.select_df('iris_table')

# Apply the K-Means algorithm with 3 clusters (number chosen arbitrarily);
# this assumes all columns of iris_table are numeric
kmeans = KMeans(n_clusters=3)
data['cluster'] = kmeans.fit_predict(data)

# Evaluate the result - for example, viewing the means of each cluster
cluster_means = data.groupby('cluster').mean()
print(cluster_means)  # the output appears in the console

In this next step, we will visualize the groups found by the model and save the figure in the assets folder.

# Plot the clusters on a graph (considering only the first two columns)
plt.scatter(data['sepal_length_cm_'], data['sepal_width_cm_'], c=data['cluster'], cmap='viridis')
plt.xlabel('sepal_length_cm_')
plt.ylabel('sepal_width_cm_')

# Save the chart in png format
plt.savefig('assets/cluster_iris.png')

Now let's save the model so it can be reused later; for this we will use the joblib library.

# Save the model
joblib.dump(kmeans, 'assets/modelo_kmeans_iris.joblib')
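To reuse the model later, for example in another Python task, it can be loaded back with joblib. A minimal sketch, where new_data stands for a hypothetical DataFrame with the same columns the model was fitted on:

# Load the saved model back into memory
kmeans = joblib.load('assets/modelo_kmeans_iris.joblib')

# Assign clusters to new observations (new_data is hypothetical)
labels = kmeans.predict(new_data)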

Now we can send the DataFrame with the new column generated by the model to ClickHouse, so it can be used by other Gaio tasks. For this we will use create_df and insert_df.

# Create a table in clickhouse similar to your dataframe
bucket.create_df('tmp_iris_clusterizada', data)

# Insert data from your dataframe into the clickhouse table
bucket.insert_df('tmp_iris_clusterizada', data)
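As a quick check, the new table can be read back to confirm the size of each cluster. A minimal sketch using the query_df method shown above:

# Count how many rows ended up in each cluster
counts = bucket.query_df('select cluster, count() as n from tmp_iris_clusterizada group by cluster')
print(counts)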
(Images: Python Task Code Page and Python Task Environment Page)