Introduction to Byzer machine learning (without Python)

May 30, 2021

Byzer can run an end-to-end machine learning pipeline in SQL-like code. Here are the steps this example walks through:

Load data (possibly from different data sources)
Data processing
Model training (supports multiple sets of parameters, model version control, etc.)
Batch prediction
Model evaluation
ML Serving

Now, let’s start the journey.

First, load the data from parquet files:

We load the training, validation, and test data separately. Because the features column is an array, it must be converted into a vector, which we do with the vec_dense UDF:

We already know this is a classification problem, so we need to choose a classification algorithm. We plan to use random forest, but first we need to find out whether Byzer supports it:

Searching with the keyword Random finds it. However, we do not know how to use this module yet, so we look it up with the following command:

The black part is a usage example, and the doc gives a brief introduction to the algorithm.

Byzer also provides a way to show the parameters of the target algorithm:

We follow the same pattern and write code like the following:

In this example, we configure two sets of parameters. The system trains both models at the same time and uses the validation dataset to determine which model performs better. The complete output reports the result of this comparison.
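To make the comparison step concrete, here is a minimal sketch in plain Python of choosing between two candidate models by their F1 score on validation data. This is not Byzer code and not how Byzer implements it; Byzer does this for you inside the training step. The function names, the labels, and both prediction lists are made up for illustration, and the sketch uses a simple binary F1, whereas Byzer's own metric may be computed differently (for example, as a multi-class variant).

```python
# Plain-Python illustration only; every name and value below is hypothetical.

def f1_score(labels, predictions, positive=1):
    """Binary F1 for the class `positive`: harmonic mean of precision and recall."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0, so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_best(candidate_predictions, validation_labels):
    """Return (index, score) of the candidate with the highest validation F1."""
    scores = [f1_score(validation_labels, preds) for preds in candidate_predictions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Made-up validation labels and the predictions of two hypothetical models,
# standing in for the two parameter sets.
validation_labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_0_predictions = [1, 1, 0, 0, 0, 1, 1, 0]
model_1_predictions = [1, 0, 1, 0, 0, 0, 1, 1]

best_index, best_f1 = pick_best(
    [model_0_predictions, model_1_predictions], validation_labels
)
print(best_index, round(best_f1, 2))  # prints: 1 0.75
```

With this toy data the second candidate (index 1) wins with an F1 of 0.75 against 0.5 for the first. The numbers are unrelated to the result reported in this article.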

The section in the blue box shows that the F1 score of the second model is 76%. Suppose we accept this result. We then take our still-warm model to the test dataset to verify its performance, which we can do like this:

Just specify the dataset in the predict statement. At this point we only see the predicted value (prediction) and the actual value (label); we still do not know how many predictions are right and how many are wrong. We use a built-in module to evaluate the result, like the following:

ConfusionMatrix calculates many metrics, such as PPV. The meaning of each metric is described in the desc field. Generally, we focus on accuracy.
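For readers who want to see what these metrics mean, here is a minimal sketch in plain Python that derives accuracy and PPV (positive predictive value, also called precision) from pairs of actual and predicted values. It is an illustration of the arithmetic only, not Byzer code and not the ConfusionMatrix module's implementation. The helper names and the sample values are hypothetical; only the column names label and prediction come from the text above.

```python
from collections import Counter

# Plain-Python illustration only; helper names and data are hypothetical.

def confusion_counts(labels, predictions):
    """Count (actual, predicted) pairs: a confusion matrix in sparse form."""
    return Counter(zip(labels, predictions))


def accuracy(counts):
    """Share of rows where the predicted class equals the actual class."""
    correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)
    return correct / sum(counts.values())


def ppv(counts, cls):
    """Positive predictive value for `cls`: correct `cls` predictions / all `cls` predictions."""
    predicted_as_cls = sum(n for (_, predicted), n in counts.items() if predicted == cls)
    if predicted_as_cls == 0:
        return 0.0
    return counts[(cls, cls)] / predicted_as_cls


# Made-up stand-ins for the label and prediction columns of a prediction table.
label = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
prediction = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]

counts = confusion_counts(label, prediction)
print(accuracy(counts))          # prints: 0.8
print(round(ppv(counts, 1), 3))  # prints: 0.833
```

Here 8 of 10 rows are predicted correctly, so accuracy is 0.8, and 5 of the 6 rows predicted as class 1 really are class 1, so the PPV for class 1 is about 0.833.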