Arrikto Enterprise Kubeflow Documentation
Open Source (OSS) Kubeflow enables you to operationalize much of an ML workflow on top of Kubernetes. It comprises a number of ML components and services; SDKs and APIs; integrated development environments (IDEs); and libraries for data science.
The Arrikto Enterprise Kubeflow (EKF) distribution introduces important additional features to address gaps in OSS Kubeflow and commonly expressed needs of MLOps engineers and data scientists.
- Automation: With Arrikto EKF you can orchestrate an end-to-end ML workflow from your IDE. Start by tagging cells in Jupyter Notebooks to define pipeline steps, hyperparameter tuning, GPU usage, and metrics tracking. Then click a button to define the necessary Kubernetes services, run a scalable ML pipeline, and serve the best model. Alternatively, use the EKF Kale SDK to do all of the above within your preferred IDE.
- Reproducibility: Snapshot pipeline code, libraries, and data for every step with the Arrikto Rok data management platform. Roll back to any machine learning pipeline step at its exact execution state for easy debugging. Collaborate with other data scientists through a Git-style publish/subscribe versioning workflow.
- Portability: Arrikto EKF enables you to deploy and upgrade a Kubeflow environment using GitOps processes across all major public clouds and on-prem infrastructure. Move ML workflows seamlessly across these environments with Rok Registry.
- Security: Arrikto EKF security features enable you to manage teams and user access through GitLab, or through any identity provider using Istio/OIDC. Isolate each user's ML data within their own namespace while still enabling notebook and pipeline collaboration in shared namespaces. Manage secrets and credentials securely and efficiently.
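To make the cell-tagging idea from the Automation item more concrete, here is a minimal sketch of how tagged notebook cells could be grouped into pipeline steps. It is not Kale's implementation and does not use the Kale SDK. The tag names (`imports`, `block:<step>`, `prev:<step>`), the sample notebook, and the `cells_to_steps` helper are all assumptions made for illustration.

```python
# Hypothetical notebook content in the nbformat style: each cell carries
# a list of tags under metadata. The tag vocabulary is an assumption.
NOTEBOOK = {
    "cells": [
        {"cell_type": "code", "source": "import pandas as pd",
         "metadata": {"tags": ["imports"]}},
        {"cell_type": "code", "source": "df = load()",
         "metadata": {"tags": ["block:load_data"]}},
        {"cell_type": "code", "source": "df = clean(df)",
         "metadata": {"tags": []}},
        {"cell_type": "code", "source": "model = fit(df)",
         "metadata": {"tags": ["block:train", "prev:load_data"]}},
    ]
}


def cells_to_steps(notebook):
    """Group code cells into named steps, keeping notebook order.

    A cell tagged "block:<name>" starts (or continues) step <name>.
    Untagged cells join the most recent step. Cells with only other
    tags, such as "imports", are left out of every step in this sketch.
    """
    steps = {}
    current = None
    for cell in notebook["cells"]:
        if cell.get("cell_type") != "code":
            continue
        tags = cell.get("metadata", {}).get("tags", [])
        block = next(
            (t[len("block:"):] for t in tags if t.startswith("block:")), None
        )
        if block is not None:
            current = block
            step = steps.setdefault(current, {"source": [], "depends_on": []})
            step["depends_on"] += [
                t[len("prev:"):] for t in tags if t.startswith("prev:")
            ]
        elif tags or current is None:
            # Special-purpose cells and leading untagged cells are skipped.
            continue
        steps[current]["source"].append(cell["source"])
    return steps


steps = cells_to_steps(NOTEBOOK)
for name, step in steps.items():
    print(name, "depends on", step["depends_on"], "with", len(step["source"]), "cell(s)")
```

In this sketch the untagged `df = clean(df)` cell is folded into the `load_data` step, and `train` records `load_data` as its dependency. A real pipeline compiler would additionally need to pass data between steps and handle parameters and metrics.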
Getting Started
The easiest way to start with EKF is to follow one or more of the tutorials below!
- Tutorial 1: Kaggle’s Titanic Disaster Machine Learning Example
- Tutorial 2: Udacity’s Dog Breed Classification Example
- Tutorial 3: Kaggle’s Covid-19 OpenVaccine Machine Learning Example
- Tutorial 4: Kaggle’s Blue Book for Bulldozers Machine Learning Example
- Tutorial 5: Distributed PyTorch with Kubeflow and Kale
Installation
- Install
- Prepare Management Environment
- Create Virtual Private Cloud
- Create Kubernetes Cluster
- Deploy Rok
- Deploy Rok Scheduler
- Deploy Kubeflow
- Deploy Rok Registry
- Deploy Cluster Autoscaler
- Deploy NVIDIA Device Plugin
- Deploy Kiwi (Arrikto vGPU)
- Expose EKF
- Expose Serving
- Troubleshooting FAQ
- Air gapped Deployments
- Features
- GitOps
- rok-deploy
- Pipelines
- Hyperparameter Tuning
- Kubeflow Notebooks
- Rok Snapshotting
- Kale Kubeflow Pipeline and Rok Snapshots
- Kubeflow Pipeline and Initial Rok Snapshot
- Kubeflow Pipeline Steps and Rok Snapshots
- Rok Snapshot Creation and Rok Buckets
- Rok Snapshots Outside of I/O Path
- Rok Snapshots & Environment Restoration
- Rok Snapshots and Volume Restoration
- Rok Snapshots and Pipeline Restoration
- Scaling
- Kiwi
- Operations Guide
- Automated Deployments
- Manage Your EKS Cluster
- Manage Your GKE Cluster
- Manage Your AKS Cluster
- Manage Rok etcd
- Manage Authentication
- Configure Default Retention Policy for New Buckets
- Update Retention Policy of all Buckets
- Create Privileged Notebook Server
- Migrate EKF Cluster
- Manage Your Kubeflow Deployment
- Manage Security Policies With Kyverno
- Manage Networking
- Manage Your Rok Registry Cluster
- Add an internal GitHub repository as a backup GitOps remote
- Set Up Cluster-Wide Authenticated Access to a Docker Registry
- Disable Automatic Profile Creation
- Scale In Kubernetes Cluster
- Protect Pods from OOM conditions and CPU starvation
- Add Static Users in Dex
- Hot-Patch an Arbitrary Image in Your Deployment
- Expose TokenRequest API for External Clients
- Configure Syncing
- Trust Custom CA
- Add Extra Resources To All User Namespaces
- Gather Logs for Troubleshooting
- Recover RWX Volume After Node Failure
- Recover Pods From Out of Space Errors
- Manage Your Rok Monitoring Stack
- Handle Degraded Nodes
- Manage Your Kiwi-enabled GPUs