Real-time Azure Synapse Analytics with Azure Machine Learning

Shashanka M
6 min read · Feb 28, 2021


What is Azure Synapse Analytics?

Azure Synapse brings together the best of the SQL technologies used in enterprise data warehousing, the Spark technologies used for big data, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, Azure Cosmos DB, and Azure Machine Learning.

With this service, data scientists can query both relational and non-relational data using the familiar SQL language (using either serverless on-demand queries for data exploration and ad hoc analysis, or provisioned resources for your most demanding data warehousing needs). It is also now possible to connect Azure Synapse Analytics to more user-friendly ETL and BI data management solutions.

What is Azure Machine Learning in Azure Synapse Analytics?

You may be familiar with what a typical data science process looks like. It’s a well-known process that most machine learning projects follow.

The process contains the following steps:

  • Data collection.
  • Data analysis.
  • Data preparation.
  • Choosing a model.
  • Training (train models on Spark pools with MLlib; a short sketch follows this list).
  • Hyperparameter tuning.
  • Evaluation.
  • Prediction.
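
To make the training and prediction steps concrete, here is a minimal MLlib sketch you could run in a notebook attached to a Synapse Spark pool. The table and column names are hypothetical placeholders, not something created earlier in this article.

# A minimal training sketch for a Synapse Spark pool notebook using MLlib.
# Assumption: a Spark table called nyc_taxi with the columns below already
# exists; swap in your own table and column names.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()   # in a Synapse notebook this returns the existing session

df = spark.read.table("nyc_taxi")
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)    # simple train/test split

assembler = VectorAssembler(
    inputCols=["passengerCount", "tripDistance"],          # hypothetical feature columns
    outputCol="features",
)
lr = LinearRegression(featuresCol="features", labelCol="fareAmount", maxIter=50)

model = Pipeline(stages=[assembler, lr]).fit(train_df)     # training on the Spark pool

predictions = model.transform(test_df)
predictions.select("fareAmount", "prediction").show(5)     # quick evaluation/prediction check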

Steps to start with Azure Synapse Analytics:

Step 1: Create and set up a Synapse workspace

Step 2: Analyse using a dedicated SQL pool

Step 3: Analyse using Apache Spark

Step 4: Analyse using a serverless SQL pool

Step 5: Analyse data in a storage account

Step 6: Orchestrate with pipelines

Step 7: Visualize data with Power BI

Before we get started, log in to the Azure portal.

Let's dive into the steps.

Step 1: Create and set up a Synapse workspace

  • Open the Azure portal, and at the top search for Synapse.
  • In the search results, under Services, select Azure Synapse Analytics.
  • Select Add to create a workspace.
  • In the Basics tab, under Project Details, enter your preferred Subscription, Resource group, and Region, and then choose a workspace name.
  • Under Select Data Lake Storage Gen2, select From subscription.
  • For Account name, click Create New and name the new storage account.
  • For File system name, click Create New and name it users. This will create a storage container called users.
  • The workspace will use this storage account as the “primary” storage account for Spark tables and Spark application logs.
  • Check the “Assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account” box.
  • Select Review + create > Create. Your workspace will be ready in a few minutes.

You can create the Azure Synapse Analytics workspace from the portal as above, or create it with the Azure CLI or the management SDK, as sketched below.
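
If you prefer to script the workspace, the Azure CLI (az synapse workspace create) or the azure-mgmt-synapse Python SDK can do it. A rough, hedged sketch with the SDK follows; every name, region and secret below is a placeholder, and property names can vary slightly between SDK versions.

# Hedged sketch: create a Synapse workspace with the azure-mgmt-synapse SDK.
# All names, regions and secrets below are placeholders; check your SDK version
# for the exact model/property names.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.synapse.models import (
    DataLakeStorageAccountDetails,
    ManagedIdentity,
    Workspace,
)

client = SynapseManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.workspaces.begin_create_or_update(
    "<resource-group>",
    "<workspace-name>",
    Workspace(
        location="westeurope",
        identity=ManagedIdentity(type="SystemAssigned"),
        default_data_lake_storage=DataLakeStorageAccountDetails(
            account_url="https://<storage-account>.dfs.core.windows.net",
            filesystem="users",                      # the "users" container from the steps above
        ),
        sql_administrator_login="sqladminuser",
        sql_administrator_login_password="<strong-password>",
    ),
)
workspace = poller.result()                          # blocks until provisioning finishes
print(workspace.name, workspace.provisioning_state)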

Step 2: Analyse using a dedicated SQL pool

  • In Synapse Studio, on the left-side pane, select Manage > SQL pools.
  • Select New.
  • For SQL pool name, enter SQLPOOL1.
  • For Performance level, choose DW100C.
  • Select Review + create > Create. Your dedicated SQL pool will be ready in a few minutes. Your dedicated SQL pool is associated with a dedicated SQL pool database that’s also called SQLPOOL1.

A dedicated SQL pool consumes billable resources as long as it’s active. You can pause the pool later to reduce costs.
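
Once SQLPOOL1 is online, it behaves like any other SQL Server endpoint. Here is a minimal sketch connecting from Python with pyodbc; the workspace name, login and password are placeholders, and the ODBC driver must be installed locally.

# Minimal sketch: connect to the dedicated SQL pool and list its tables.
# <workspace-name>, the login and the password are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<workspace-name>.sql.azuresynapse.net,1433;"
    "Database=SQLPOOL1;"
    "Uid=sqladminuser;Pwd=<password>;"
    "Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT name, create_date FROM sys.tables;")   # sanity check: list user tables
for name, create_date in cursor.fetchall():
    print(name, create_date)

conn.close()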

Step 3: Analyse using Apache Spark

Create an Apache Spark pool:

  • In Synapse Studio, on the left-side pane, select Manage > Apache Spark pools.
  • Select New.
  • For Apache Spark pool name, enter Spark1.
  • For Node size, select Small.
  • For Number of nodes, set both the minimum and the maximum to 3.
  • Select Review + create > Create. Your Apache Spark pool will be ready in a few seconds.

The Spark pool tells Azure Synapse how many Spark resources to use. You only pay for the resources that you use. When you stop actively using the pool, the resources automatically time out and are recycled.

Step 4: Analyse using a serverless SQL pool

Every Synapse workspace also comes with a built-in serverless SQL pool (called Built-in). There is nothing to provision; you are billed only for the data processed by the queries you run directly against the files in your data lake.
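
Because the serverless pool queries files in the data lake directly, you can point OPENROWSET at a folder of Parquet files without creating any tables first. A minimal sketch via pyodbc against the on-demand endpoint follows; the storage path and credentials are placeholders.

# Minimal sketch: use the built-in serverless SQL pool to query Parquet files
# sitting in the data lake. The '-ondemand' endpoint name and storage path are
# placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<workspace-name>-ondemand.sql.azuresynapse.net,1433;"
    "Database=master;"
    "Uid=sqladminuser;Pwd=<password>;"
    "Encrypt=yes;"
)

query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/users/data/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
"""

cursor = conn.cursor()
for row in cursor.execute(query).fetchall():
    print(row)

conn.close()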

Step 5: Analyse data in a storage account

The data sitting in the workspace’s primary storage account can be analysed directly, either from a Spark notebook or with the serverless SQL pool, as in the sketch below.
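
From a notebook attached to the Spark1 pool, the files can be read straight from their abfss:// path into a DataFrame. A minimal sketch, with the account, container and file path as placeholders:

# Minimal sketch: read a Parquet file from the workspace's primary ADLS Gen2
# storage account into a Spark DataFrame and run a quick aggregation.
# The abfss:// path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "abfss://users@<storage-account>.dfs.core.windows.net/data/trips.parquet"
df = spark.read.parquet(path)

df.printSchema()
df.groupBy("passengerCount").avg("fareAmount").show()   # e.g. average fare per passenger count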

Step 6: Orchestrate with pipelines

How do you link an Azure Machine Learning workspace to the Synapse workspace?

Create a new Azure Machine Learning linked service in Synapse.

Create the Apache Spark and dedicated SQL pools

We will use the dedicated SQL pool to store the data we want to enhance with our ML model.

Upload and run the Spark data notebook to create the Spark database and the SQL test database

Once the Spark and SQL pools are up and running, we can ingest our data and set up the Spark and SQL tables for training and testing, respectively.

Once the notebook is uploaded, change the sql_pool_name value to match the name of your SQL pool, attach the desired Spark pool, and click Run all.
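
To give an idea of what that notebook does, here is a rough sketch of its Spark half: load a slice of the public NYC Taxi sample data and register it as a Spark table for AutoML training. The open-data path, database and table names are assumptions, so defer to the actual notebook.

# Rough sketch of the data-preparation notebook's Spark half: pull a sample of
# the public NYC Taxi (yellow cab) open data and save it as a Spark table.
# The wasbs:// path, database and table names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

blob_path = (
    "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/"
    "yellow/puYear=2019/puMonth=6/*.parquet"
)
df = spark.read.parquet(blob_path).limit(100000)          # keep the sample small

spark.sql("CREATE DATABASE IF NOT EXISTS nyc_taxi_db")
df.write.mode("overwrite").saveAsTable("nyc_taxi_db.nyc_taxi")   # training table for AutoML

The rest of the notebook then writes a held-out split into the dedicated SQL pool so the model can be scored there later.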

Follow the steps below to train your model.

Make sure to set the target column to fareamount and enable the ONNX model compatibility option.
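
The Studio wizard handles this for you, but a code-first equivalent makes the two settings that matter explicit. This is a hedged sketch using the azureml-sdk AutoML API against the linked Azure ML workspace; the workspace, dataset and column names are placeholders.

# Hedged sketch of the code-first equivalent of the "Train a new model" wizard:
# an AutoML regression run with ONNX-compatible models enabled.
# Workspace, dataset and column names are placeholders.
from azureml.core import Dataset, Experiment, Workspace
from azureml.train.automl import AutoMLConfig

ws = Workspace.get(
    name="<azureml-workspace>",                 # the Azure ML workspace linked to Synapse
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
)
training_data = Dataset.get_by_name(ws, "nyc_taxi_training")   # hypothetical registered dataset

automl_config = AutoMLConfig(
    task="regression",
    training_data=training_data,
    label_column_name="fareAmount",             # the fare-amount target column; match your table
    primary_metric="spearman_correlation",
    experiment_timeout_hours=1,
    enable_onnx_compatible_models=True,         # required so T-SQL PREDICT can use the model later
)

run = Experiment(ws, "nyc-taxi-automl").submit(automl_config)
run.wait_for_completion(show_output=True)

best_run, onnx_model = run.get_output(return_onnx_model=True)  # best run and its ONNX model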

Enhance the SQL table with the trained AutoML model

Once we have the best model, we can evaluate it on our test SQL table using our SQL pool.

First, we need to select the table we want to enhance with the model we just trained.

Enrich with existing model

Then we select our new AutoML model, map our input table columns to what the model is expecting, and choose or create a table for storing our model locally.

This will generate a T-SQL script that evaluates our model against the test data and outputs the predictions.

T-SQL script
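
The generated script follows the T-SQL PREDICT pattern with the ONNX runtime. For reference, here is a hedged sketch of running something similar from Python; the table, column and model-ID values are placeholders, and the script Synapse generates for you is the authoritative one.

# Hedged sketch of the kind of scoring script Synapse generates: T-SQL PREDICT
# with the ONNX runtime, run against the test table in the dedicated SQL pool.
# Table, column and model-ID values are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<workspace-name>.sql.azuresynapse.net,1433;"
    "Database=SQLPOOL1;"
    "Uid=sqladminuser;Pwd=<password>;"
    "Encrypt=yes;"
)

predict_sql = """
SELECT d.*, p.predicted_fareAmount
FROM PREDICT(
        MODEL = (SELECT model FROM dbo.models WHERE id = '<model-id>'),
        DATA = dbo.nyc_taxi_test AS d,
        RUNTIME = ONNX
     ) WITH (predicted_fareAmount FLOAT) AS p;
"""

cursor = conn.cursor()
for row in cursor.execute(predict_sql).fetchall():
    print(row)

conn.close()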

Step 7: Visualize data with Power BI

Create a Power BI workspace

  • Sign in to powerbi.microsoft.com.
  • Click on Workspaces, then select Create a workspace. Create a new Power BI workspace named MachineLearningmodel or similar, since this name must be unique.

Link your Azure Synapse workspace to your new Power BI workspace

  • In Synapse Studio, go to Manage > Linked Services.
  • Select New > Connect to Power BI.
  • Set Name to MachineLearningmodel.
  • Set Workspace name to the Power BI workspace you created above (MachineLearningmodel or similar). Select Create.

Create a Power BI dataset that uses data in your Azure Synapse workspace

  • In Synapse Studio, go to Develop > Power BI.
  • Go to MachineLearningmodel > Power BI datasets and select New Power BI dataset. Click Start. Select the SQLPOOL1 data source and click Continue.
  • Click Download to download the MachineLearningmodel.pbids file. Click Continue.
  • Open the downloaded .pbids file. Power BI Desktop opens and automatically connects to SQLPOOL1 in your Azure Synapse workspace.
  • In the Publish to Power BI window, under Select a destination, choose your MachineLearningmodel workspace, and then click Select.
  • Wait for publishing to finish.

Happy tinkering!

References:

https://docs.microsoft.com/en-us/azure/synapse-analytics/

Footnote

Congratulations! You made it till the end! I do hope you enjoyed the ride into Software Engineering for Data Science and Azure!

I regularly write about Technology & Data on Medium — if you would like to read my future posts then please ‘Follow’ me!

Connect with me here on LinkedIn: Shashanka M
