Introduction to MLSQL deep learning (3) - Feature engineering

祝海林
5 min read · Jun 16, 2021

All code examples in this article are based on the latest version of MLSQL Engine 2.1.0-SNAPSHOT

This article uses a notebook demonstration of the MLSQL Console.

List of series articles:

  1. MLSQL Machine Learning Minimalist Tutorial (No Python Required!)
  2. Introduction to MLSQL deep learning [1]
  3. Introduction to MLSQL deep learning [2] - Distributed model training
  4. Introduction to MLSQL deep learning [3] - Feature engineering
  5. Introduction to MLSQL deep learning [4] - Serving

For environmental requirements, please refer to the first article: Introduction to MLSQL deep learning [1].

In this chapter, we will introduce how to use MLSQL to do feature engineering on a dataset.

Requirements:

  1. Ray is required for this article.
  2. OpenCV (cv2) must be installed on every node of the Ray cluster. Installation command: pip install opencv-python

Upload image data

This time, we will use the cifar dataset. The first step is to download the cifar dataset I put on GitHub, unzip it locally, and then repackage it with tar (note that it must be in the tar format). After packaging, it can be uploaded through the upload tab in the web console:
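The repackaging step can be sketched in Python with the standard library's tarfile module. The directory and archive paths below are hypothetical stand-ins for wherever you unpacked the dataset:

```python
import os
import tarfile

# Hypothetical local directory holding the unpacked cifar images.
src_dir = "/tmp/tempo"
os.makedirs(src_dir, exist_ok=True)

# Pack the directory into a plain (uncompressed) tar archive,
# since the upload tab expects the tar format.
with tarfile.open("/tmp/cifar.tar", "w") as tar:
    tar.add(src_dir, arcname="tempo")
```

You can of course achieve the same thing with the tar command-line tool.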

The system will automatically decompress the tar package, so after the upload we can see that the pictures are already on the object storage:

Note that because I packed the tempo directory (the cifar pictures are inside it), you see the tempo directory here.

Read image data

We can read the image data in binary format:

load binaryFile.`/tmp/upload/tempo/*.png` as cifar10;

The results are as follows:

Processing pictures

In MLSQL, we also support using Python for ETL, because in many scenarios the per-row processing logic can be extraordinarily complicated, and the built-in SQL UDFs are often cumbersome or do not satisfy the requirements. Although we support writing custom UDFs in Scala, using Python is obviously simpler.

In this example, in order to improve the robustness of the algorithm, we need to make some changes to the pictures, such as rotating them or converting them to a uniform size. Suppose we want to reduce the picture size to 28 * 28. Before that, let’s look at the current table schema:

We only care about the content and path fields; the other two fields are not needed and can be ignored. Now, we can do the following configuration:

!python env "PYTHON_ENV=source /Users/allwefantasy/opt/anaconda3/bin/activate ray1.3.0";
!python conf "runIn=driver";
!python conf "schema=st(field(content,binary),field(path,string))";
!python conf "dataMode=data";

It is worth noting that dataMode is set to data here, which tells the system that we are going to do distributed data ETL. Now let’s roll up our sleeves and use OpenCV to process the pictures. As usual, we first write a few comments to tell the current cell what type of code it runs and what the input and output tables are.

--%python
--%input=cifar10
--%output=cifar10_resize
--%cache=true

Next, we define a resize_image Python method:

def resize_image(row):
    new_row = {}
    image_bin = row["content"]
    # Decode the raw PNG bytes into an OpenCV image.
    oriimg = cv2.imdecode(np.frombuffer(image_bin, np.uint8), 1)
    # Scale down to 28 x 28.
    newimage = cv2.resize(oriimg, (28, 28))
    # Re-encode as PNG and put the bytes back into the row.
    is_success, buffer = cv2.imencode(".png", newimage)
    io_buf = io.BytesIO(buffer)
    new_row["content"] = io_buf.getvalue()
    new_row["path"] = row["path"]
    return new_row

Finally, pass the method to the ray_context object:

ray_context.foreach(resize_image)

Execute this Cell, and the final output is as follows:

The complete code for this Cell is at the end of the article.

Save the processed picture back to object storage

Normally, we can save the data directly into a table. However, in some cases we need to save the pictures back to the file system after the processing is completed. This can be done with the following instructions:
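The MLSQL instruction itself appears in the notebook; conceptually, the step amounts to the following Python sketch, where each processed row’s PNG bytes are written back out under a target directory (the rows, bytes, and directory below are hypothetical):

```python
import os

# Hypothetical processed rows as they come out of the resize step:
# each row carries the PNG bytes and the original path.
rows = [
    {"content": b"\x89PNG...", "path": "/tmp/upload/tempo/cat_1.png"},
]

out_dir = "/tmp/cifar10_resize"
os.makedirs(out_dir, exist_ok=True)

for row in rows:
    # Keep only the file name from the original path.
    name = os.path.basename(row["path"])
    with open(os.path.join(out_dir, name), "wb") as f:
        f.write(row["content"])
```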

Let’s take a look at the last saved results:

Get category

Here we use some built-in UDF functions to extract the category from the file path; the code is as follows:
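As a rough sketch of that path-to-category extraction, assuming a file-name convention like dog_100.png (the actual dataset’s naming may differ):

```python
# Assumed file-name convention: <label>_<index>.png (hypothetical).
def category_from_path(path):
    file_name = path.split("/")[-1]   # e.g. dog_100.png
    return file_name.split("_")[0]    # e.g. dog

print(category_from_path("/tmp/upload/tempo/dog_100.png"))  # dog
```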

However, we hope that the label can be converted into a one-hot vector, and this job is also very simple. Let’s look at the number of categories first:

Because the category is a string, we need to convert the categories into numbers:

As you can see, the mapping has indeed been done, and this mapping relationship will be saved to the object storage to facilitate reuse at prediction time. Now you can use the onehot UDF to implement category vectorization:
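Conceptually, the string-to-number mapping plus the onehot UDF amount to the following Python sketch (the category list here is hypothetical):

```python
# Map string categories to integer indices, then one-hot encode,
# mirroring what the saved mapping table and the onehot UDF do.
categories = ["airplane", "bird", "cat", "dog"]  # hypothetical label set
index = {c: i for i, c in enumerate(sorted(set(categories)))}

def onehot(label, size):
    vec = [0.0] * size
    vec[index[label]] = 1.0
    return vec

print(onehot("cat", len(index)))  # [0.0, 0.0, 1.0, 0.0]
```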

The label has indeed been vectorized.

Split dataset into training/test

Usually after we get the data, we also need to split it into training and test sets. MLSQL provides a module to complete this job as well.

RateSampler automatically adds a field called __split__ to the table. In this example, we want to cut the data into two parts with a ratio of 0.9/0.1 (you can also cut it into any number of parts). __split__ will then be marked as 0 and 1 respectively.

Now, you can get the training and test data sets respectively:
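To illustrate what RateSampler’s 0.9/0.1 split produces, here is a plain-Python sketch that tags hypothetical rows with a __split__ field (0 = train, 1 = test); RateSampler itself runs distributed inside MLSQL:

```python
import random

random.seed(42)

# Hypothetical rows; a 0.9/0.1 split assigns each row a __split__ tag.
rows = [{"path": f"img_{i}.png"} for i in range(1000)]
for row in rows:
    row["__split__"] = 0 if random.random() < 0.9 else 1

train = [r for r in rows if r["__split__"] == 0]
test = [r for r in rows if r["__split__"] == 1]
```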

Bonus

In most cases, MLSQL feature engineering can be done directly with SQL UDFs and built-in SQL modules. We also provide a very convenient way for users to reuse Python’s ecosystem: in this article, users only needed to define a single Python method to get distributed data processing and complete the image scaling.

Image processing code

#%python
#%input=cifar10
#%output=cifar10_resize
#%cache=true
import io
import cv2
import numpy as np
from pyjava.api.mlsql import RayContext

ray_context = RayContext.connect(globals(), "127.0.0.1:10001")

def resize_image(row):
    new_row = {}
    image_bin = row["content"]
    oriimg = cv2.imdecode(np.frombuffer(image_bin, np.uint8), 1)
    newimage = cv2.resize(oriimg, (28, 28))
    is_success, buffer = cv2.imencode(".png", newimage)
    io_buf = io.BytesIO(buffer)
    new_row["content"] = io_buf.getvalue()
    new_row["path"] = row["path"]
    return new_row

ray_context.foreach(resize_image)

Complete notebook
