Optimized management of machine learning workflows is vital for organizations seeking agility and innovation in today’s data-driven world.

Kubeflow, combined with Kubernetes, offers the tools and frameworks necessary to streamline ML operations and accelerate model deployment.

In this blog post, we will delve into Kubeflow on Azure Kubernetes Service (AKS), examining its role in enhancing ML workflows and providing a practical guide to get you started with this powerful platform.

Machine Learning

Before we dive into the technology and tooling, it is essential that you have a good idea of what machine learning is (and isn't). By understanding what machine learning truly is, you can better leverage its potential to drive innovation and efficiency. Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit instructions. Instead, these systems learn and improve from experience by being exposed to data. ML algorithms learn patterns from data, allowing them to make predictions or decisions based on new data inputs. ML automates analytical model building, making it possible to analyze larger, more complex datasets quickly and accurately. Additionally, ML models can adapt to new data, making them useful for dynamic environments where conditions and data inputs frequently change. It is used in various applications, such as speech recognition, image classification, recommendation systems, and predictive analytics.

ML-model lifecycle

While ML is powerful, there are several misconceptions and myths that need clarification:

  1. ML is not a magic solution that can solve all problems without human intervention. It requires well-defined problems, quality data, and continuous tuning. Additionally, ML models require regular updates and maintenance to remain accurate and relevant as new data becomes available.
  2. ML does not produce instant results. Training models can be time-consuming and computationally intensive, often requiring substantial experimentation and refinement. ML relies heavily on data. Without sufficient quality data, models cannot learn effectively.
  3. ML is not about completely replacing human jobs. Instead, it is about augmenting human capabilities by automating repetitive tasks and providing insights, allowing humans to focus on more complex and creative work. Not all problems are suitable for ML. Problems need to have patterns and data that an ML algorithm can learn from.
  4. ML is not a one-size-fits-all approach. Different problems require different algorithms and models, and there’s no universal solution applicable to all scenarios. There are many different types of ML algorithms, each suited to different types of tasks (e.g., classification, regression, clustering).
  5. ML models are not always 100% accurate. They make predictions based on patterns in data, and their accuracy depends on the quality and representativeness of the data they were trained on.

Kubeflow

Kubeflow is an open-source machine learning (ML) platform designed to simplify the deployment of ML workflows on Kubernetes, while ensuring they are scalable and portable. It empowers data scientists and machine learning engineers to seamlessly develop, orchestrate, deploy, and manage ML workloads across diverse environments, whether on-premises, in the cloud, or hybrid.

Key to Kubeflow’s functionality is its support for a wide array of ML frameworks like TensorFlow, PyTorch, and XGBoost, facilitating flexibility in model development. It has integrations with popular ML tools and services that enhance data processing, training, and deployment capabilities. Core components of Kubeflow include interactive Jupyter Notebooks within Kubernetes clusters for iterative model development, robust Pipelines for managing end-to-end ML workflows, and KFServing for efficient model deployment and scaling. Additionally, Katib automates hyperparameter tuning, while specialized Training Operators optimize model training with frameworks like TensorFlow and PyTorch. Kubeflow Fairing extends these capabilities to hybrid cloud environments, enabling seamless model building, training, and deployment.

Kubeflow

The platform ensures consistency between development and production environments, fostering collaboration among data scientists, ML engineers, and DevOps teams. Efficient resource management leverages Kubernetes' capabilities, while its extensibility supports integration with new tools and frameworks as required. Kubeflow abstracts Kubernetes complexity, empowering ML practitioners to leverage its comprehensive features without extensive Kubernetes expertise. This integration streamlines the ML lifecycle, from initial experimentation through to scalable production deployment.

Getting started with Kubeflow

Having explored the key aspects and capabilities of Kubeflow, it’s time to dive into practical steps to get started with this powerful machine learning platform. In the following section, we’ll walk through the basic steps to set up and begin using Kubeflow on Azure Kubernetes Service (AKS), empowering you to harness its capabilities effectively.

In this guide, we'll deploy an Azure Kubernetes Service (AKS) cluster along with several Azure services including Container Registry, Managed Identity, and Key Vault using Azure CLI and Bicep (as per the official Kubeflow tutorial, leveraging the AKS construction project). Following this setup, we'll proceed to install Kubeflow. This deployment approach also allows using TLS with a self-signed certificate and an ingress controller. For production environments, you can replace the self-signed certificate with your own CA certificates to meet security standards.

Step 1: If you haven’t installed Azure CLI and/or Git yet, this is the time to do it.

Step 2: Log in to the Azure CLI using the az login command. If you have more than one subscription, you may need to run the command below to work with your preferred subscription:

az account set --subscription <NAME_OR_ID_OF_SUBSCRIPTION>
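
If you're not sure which subscriptions your account has access to, a quick way to check (assuming the Azure CLI is already installed) is to list them after logging in:

az login
az account list --query "[].{Name:name, ID:id, Default:isDefault}" -o table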

Step 3: Download the repository using the command below, which includes the Azure/AKS-Construction and kubeflow/manifests:

git clone --recurse-submodules https://github.com/Azure/kubeflow-aks.git

Change directory to this newly cloned directory, setting it as your working directory, using the cd kubeflow-aks command.

Step 4: Obtain the ID of the signed-in user to ensure administrative access to the cluster we’re going to create, by using the command below:

SIGNEDINUSER=$(az ad signed-in-user show --query id --out tsv)

Step 5: Run the commands below to create a resource group in the North Europe region. You can adjust these parameters to your liking.

RGNAME=rg-p-kubeflow-001
az group create -n $RGNAME -l northeurope

Step 6: Deploy the resources using the Bicep template (main.bicep) from the cloned repository using the command below, while passing the signed-in user ID we retrieved from step 4 as a parameter:

DEP=$(az deployment group create -g $RGNAME --parameters signedinuser=$SIGNEDINUSER -f main.bicep -o json)

💡 Note: The variable DEP stores critical deployment information needed for later steps. You can save it to a file with echo $DEP > test.json and restore it later with export DEP=$(cat test.json).
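
For example, to persist the deployment output to a file and reload it in a new shell session:

echo $DEP > test.json
export DEP=$(cat test.json)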

Once the deployment process has completed, you'll find the following resources in the Azure portal:

Resources

Step 7: Extract the names of essential resources created during our deployment, such as the Key Vault, the AKS cluster, the Azure tenant ID and the name of the Azure Container Registry (ACR), by using the commands below:

KVNAME=$(echo $DEP | jq -r '.properties.outputs.kvAppName.value')
AKSCLUSTER=$(echo $DEP | jq -r '.properties.outputs.aksClusterName.value')
TENANTID=$(az account show --query tenantId -o tsv)
ACRNAME=$(az acr list -g $RGNAME --query "[0].name" -o tsv)

Step 8: Install kubelogin if you don’t have it installed on your machine yet.
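
One convenient option, assuming you already have the Azure CLI available, is the command below, which installs both kubectl and kubelogin (you may need elevated permissions depending on the install location):

az aks install-cli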

Step 9: We're now going to download the kubeconfig file and convert it for use with kubelogin. Log into the deployed AKS cluster by running the commands below:

az aks get-credentials --resource-group $RGNAME \
  --name $AKSCLUSTER

kubelogin convert-kubeconfig -l azurecli

If you are prompted for your Azure credentials afterwards, you’ll need to enter them to complete the login. It is important that you log into the cluster to avoid running into issues at a later point. To validate that you’re properly logged in, run the kubectl get nodes command.

Step 10: Install kustomize by following the appropriate instructions for your operating system, if you don't have it installed on your machine yet.
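
As a sketch, on macOS or Linux you could install it via a package manager or the official install script; check the kubeflow/manifests documentation for the kustomize version it expects:

brew install kustomize
# or, via the official install script:
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash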

Step 11: We're now going to generate a new password/hash combination using bcrypt. You can do this using Python, but for demonstration purposes we will be using an online bcrypt hash generator, such as the one at bcrypt.online or coderstool.com.
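
If you prefer to generate the hash locally instead of using an online tool, a minimal sketch using the Python bcrypt package looks like this:

# replace <YOUR_PASSWORD> with the password you want to hash (requires: pip install bcrypt)
python3 -c 'import bcrypt; print(bcrypt.hashpw(b"<YOUR_PASSWORD>", bcrypt.gensalt()).decode())'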

⚠️ Note: A self-signed certificate is used here for demonstration purposes and does not adhere to best practices for production scenarios. You can easily swap this self-signed certificate with your own CA certificate.

In the plain text input field, enter a password to your liking, then click on the Generate button. Copy the output value and paste it as the hash value in the deployments/tls/dex-config-map.yaml file on line 22. You can also update the default email address, username and userid in this file, but I left them at their default values. However, if you did update these default values, make sure you also update the appropriate values in the manifests/common/user-namespace/base/params.env file.

Step 12: Update your auth.md file with the plain text password (not the hash). Additionally, if you changed the default email address in the earlier step, update it here as well.

⚠️ Note: Entering passwords as plain text in files is done here for demonstration purposes and does not adhere to best practices for production scenarios. In production scenarios you should store secrets in a more secure way.
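
For instance, since the deployment already includes an Azure Key Vault, you could keep the password there instead of in a plain text file, assuming your account has permission to set secrets on that vault:

# "kubeflow-password" is an illustrative secret name; pick your own
az keyvault secret set --vault-name $KVNAME --name kubeflow-password --value '<YOUR_PASSWORD>'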

Step 13: Copy your updated TLS deployment configuration (deployments/tls) into the kubeflow manifests folder. Using the command below, we'll make sure the deployment includes our config changes:

cp -a deployments/tls manifests/tls

Step 14: Change directory to the manifests folder, setting it as your working directory, using the cd manifests command. Next, install all of the components by running the command below:

while ! kustomize build tls | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

This might take a few minutes to complete. Once the process is done running, validate if all of your pods are ready by running the commands below:

kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com
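
Alternatively, instead of checking each namespace by hand, you could wait for all pods to become ready with kubectl wait, for example:

for ns in cert-manager istio-system auth knative-eventing knative-serving kubeflow kubeflow-user-example-com; do kubectl wait --for=condition=Ready pods --all -n $ns --timeout=600s; done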

Step 15: We're now going to restart the dex deployment. Without a restart, dex would keep using any previously loaded password (including the default password 12341234) from the time the service is exposed via the LoadBalancer until dex is restarted. Restart the dex pod by running the command below:

kubectl rollout restart deployment dex -n auth
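
To confirm the rollout has finished before moving on, you can watch it with:

kubectl rollout status deployment dex -n auth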

Step 16: To configure TLS, we first need to retrieve the IP address of the Istio ingress gateway. Run the command below to list all the services in the istio-system namespace, and copy the EXTERNAL-IP value of the istio-ingressgateway service. This should be the only LoadBalancer service running in this namespace.

kubectl get services -n istio-system

Replace the IP address in the manifests/tls/certificate.yaml file on line 13 with the IP address of the istio gateway you just copied to your clipboard. Don’t forget to save your file.
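
If you prefer to grab the address programmatically rather than copying it from the service listing, a jsonpath query along these lines should work:

kubectl get service istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'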

Step 17: Deploy the certificate manifest file by running the command below:

kubectl apply -f tls/certificate.yaml

💡 Note: You could also give the LoadBalancer an Azure sub-domain and use that, via the annotation in manifests/common/istio-1-16/istio-install/base/patches/service.yaml.
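
As a sketch, Azure assigns a <label>.<region>.cloudapp.azure.com name to a public LoadBalancer that carries the DNS label annotation. The manifest referenced above is the place to set it permanently, but for a quick test you could also annotate the running service directly:

# "my-kubeflow" is a placeholder DNS label
kubectl annotate service istio-ingressgateway -n istio-system "service.beta.kubernetes.io/azure-dns-label-name=my-kubeflow"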

We've now successfully completed the deployment. You can access the Kubeflow dashboard by entering the EXTERNAL-IP value in your browser. You'll probably get a warning about the connection being unsafe. This is expected behavior, as we're using a self-signed certificate. You can log in using the email address and password in the auth.md file.

Kubeflow dashboard

Closing words

By now you should have a solid understanding of how to install Kubeflow on an AKS cluster, leaving you ready to continue your journey of streamlining your machine learning workflows, enhancing collaboration, and ensuring scalable, reproducible model deployments. The true power of machine learning lies in its ability to transform vast amounts of data into actionable insights, driving innovation and efficiency across various domains. Kubeflow, with its robust platform and seamless integration with Kubernetes, empowers organizations to unlock this potential, making machine learning accessible, scalable, and operationally efficient.

To learn more about the material we've covered in this blog article, and to continue your journey with Kubeflow and machine learning, you can start by reading some of the resources below:

Thank you for taking the time to go through this post and making it to the end. Stay tuned, because we'll keep providing more content on topics like this in the future.