Introduction to Byzer machine learning (without Python)

祝海林
May 30, 2021

Byzer can run an end-to-end machine learning pipeline using SQL-like code. Here are the steps this example walks through:

  1. Load data (possibly from different data sources)
  2. Process data
  3. Model training (supports multiple sets of parameters, model version control, etc.)
  4. Batch prediction
  5. Model evaluation
  6. ML Serving

Now, let’s start the journey.

First, load the data from parquet files:

We load the training, validation, and test data separately. Because the features column is an array, it must be converted into a vector using the vec_dense UDF:

We already know this is a classification problem, so we need a classification algorithm. We plan to use random forest, but first we need to check whether Byzer supports it:

Searching with the keyword Random finds it. However, we don't know how to use this module yet, so we can check with the following command:

The black part is a usage example, and the doc gives a brief introduction to the algorithm.

Byzer also provides a way to show the parameters of the target algorithm:

Following the same pattern, we write code like the following:

In this example, we configure two sets of parameters. The system trains both models at the same time and uses the validation dataset to determine which model performs better. The following is the complete output:

The section in the blue box shows that the F1 score of the second model is 76%. Suppose we accept this result. Next, we take our still-warm model to the test dataset to verify its performance. We can do this:

Just specify the dataset in the predict statement. For now we see only the predicted value (prediction) and the actual value (label); we still don't know how many predictions are right and how many are wrong. We use a built-in module to evaluate the result:

ConfusionMatrix calculates many metrics, such as PPV. The meaning of each metric is described in the desc field. Generally, we focus on accuracy:

Because the label has two categories, we get two accuracy rates. Reaching 80% is not bad.
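For readers who want to see what such per-category figures mean, here is a minimal sketch outside Byzer, written in plain Python purely as an illustration. It treats each label as the positive class in turn and derives PPV (precision), recall, and F1 from the confusion counts. The function name `per_class_metrics` and the sample labels are made up for this sketch, and which of these figures Byzer's ConfusionMatrix reports as its per-category accuracy is an assumption not confirmed here.

```python
from collections import Counter


def per_class_metrics(actual, predicted):
    """Treat each label as the positive class in turn and compute
    PPV (precision), recall, and F1 from the confusion counts."""
    pairs = Counter(zip(actual, predicted))  # (actual, predicted) -> count
    labels = sorted(set(actual) | set(predicted))
    metrics = {}
    for label in labels:
        tp = pairs[(label, label)]
        # Predicted as this label but actually something else
        fp = sum(n for (a, p), n in pairs.items() if p == label and a != label)
        # Actually this label but predicted as something else
        fn = sum(n for (a, p), n in pairs.items() if a == label and p != label)
        ppv = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * ppv * recall / (ppv + recall) if ppv + recall else 0.0
        metrics[label] = {"ppv": ppv, "recall": recall, "f1": f1}
    return metrics


# Made-up labels for illustration only, not the data used in this article
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
metrics = per_class_metrics(actual, predicted)

for label, values in metrics.items():
    print(label, values)
```

With two label categories the result has two entries, one per category, which is why a binary problem yields two rates.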

Since the result is good, we can deploy the model to serve predictions online. The approach is also very simple:

We deploy the model to the specified service through the deployModel annotation and the register statement.

Now we can get predictions by making an HTTP request:

The above statement is equivalent to sending a POST request.
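As a rough illustration of what such a POST request could look like from a client, the Python sketch below builds one without sending it. Everything specific in it is an assumption made for this sketch and should be checked against your own deployment: the endpoint URL, the form field names (`dataType`, `data`, `sql`), and the registered function name `rf_predict`. Only `vec_dense` comes from the steps above.

```python
import json
from urllib import parse, request

# Hypothetical endpoint; replace with the address of your own serving instance
ENDPOINT = "http://127.0.0.1:9003/model/predict"


def build_predict_request(features, endpoint=ENDPOINT):
    """Build (but do not send) a form-encoded POST request for one row."""
    form = {
        # Field names below are assumptions, not confirmed API
        "dataType": "row",
        "data": json.dumps([{"features": features}]),
        # rf_predict is a hypothetical name for the registered model function
        "sql": "select rf_predict(vec_dense(features)) as prediction",
    }
    body = parse.urlencode(form).encode("utf-8")
    return request.Request(endpoint, data=body, method="POST")


req = build_predict_request([5.1, 3.5, 1.4, 0.2])
print(req.get_method(), req.full_url)
# request.urlopen(req) would actually send it; omitted so the sketch runs offline
```

Building the request separately from sending it makes it easy to inspect the payload before pointing it at a live service.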

Done. Our model is now serving thousands of households.
