Thursday, May 7, 2026

Exploring Automation and Self-Service Enhancements in VMware Cloud Foundation 9.1

With every new release, VMware Cloud Foundation continues to improve how organizations consume and operate private cloud infrastructure. In the recently announced VCF 9.1 release, one of the major focus areas is automation and self-service capabilities designed to simplify private cloud operations and improve deployment efficiency.

As highlighted in the official VMware Cloud Foundation 9.1 Automation announcement, the new release introduces several enhancements around runtime services, Kubernetes lifecycle management, faster provisioning workflows, and tenant networking automation.

In this blog, I will walk through the key automation and self-service improvements introduced with VMware Cloud Foundation 9.1.

Runtime Services Architecture in VCF 9.1

One of the important architectural updates in VCF 9.1 is the introduction of three dedicated runtime service options:

  • VM Service
  • Container Service
  • VMware vSphere Kubernetes Service (VKS)

This runtime service segmentation provides a more structured and service-oriented approach for private cloud consumption. Instead of managing all workloads through a single runtime layer, administrators can now align services based on workload and operational requirements.

The update enables organizations to consume virtualization and Kubernetes services independently while continuing to operate under the VMware Cloud Foundation platform. From an operational perspective, this model also improves clarity for infrastructure teams managing different workload types across the environment.

Additionally, VCF 9.1 simplifies container adoption by offering a dedicated Container Service with lifecycle management capabilities. Organizations can deploy and manage containers without requiring deep Kubernetes expertise, while still having a clear migration path toward full Kubernetes-based platforms using VKS.

Container Service Lifecycle Management

Another major enhancement highlighted in the VCF Automation 9.1 announcement is the addition of lifecycle management capabilities for Container Service directly from the automation interface.

According to the published blog, administrators can now perform the following operations through the interface:

  • Deploy containers
  • Configure container environments
  • Monitor container workloads
  • Upgrade container deployments
  • Delete container environments

This provides a centralized operational experience for container lifecycle management inside VMware Cloud Foundation.

Instead of relying on multiple management workflows, administrators can now perform lifecycle operations from a unified automation platform.

The enhancement is focused on improving operational consistency while simplifying day-to-day container management activities.

Fast Deploy Capability for VM and VKS Provisioning

Provisioning speed is another area where VCF 9.1 introduces significant improvements.

The release adds Fast Deploy capabilities for both VM provisioning and VMware vSphere Kubernetes Service (VKS) cluster deployments.

For organizations deploying Kubernetes environments at scale, deployment time and upgrade windows are critical operational factors. VMware has highlighted substantial improvements in both deployment and upgrade workflows for VKS clusters.

VKS Cluster Deployment Improvements

According to the official announcement:

  • VKS cluster deployment time has been reduced from 37 minutes to 11 minutes.
  • This represents a 69% improvement in deployment speed.

Reducing cluster deployment time helps accelerate infrastructure readiness for Kubernetes-based workloads and development environments.

Faster provisioning also improves operational agility for infrastructure teams handling frequent cluster requests.

VKS Cluster Upgrade Improvements

VCF 9.1 also introduces major improvements in cluster upgrade workflows.

As published in the official blog:

  • VKS cluster upgrade time has been reduced from 6.9 hours to 1.7 hours.
  • This delivers approximately a 75% improvement in upgrade efficiency.

Cluster upgrades are often one of the more time-consuming operational activities in Kubernetes environments. Reducing upgrade duration can help simplify lifecycle operations and reduce maintenance windows for infrastructure administrators.
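The published figures can be sanity-checked with simple arithmetic. The deployment numbers work out to roughly a 70% reduction, in line with the announced ~69%, and the upgrade numbers to roughly 75%:

```python
# Quick check of the published VKS improvement figures.
deploy_before, deploy_after = 37, 11      # minutes
upgrade_before, upgrade_after = 6.9, 1.7  # hours

deploy_gain = (deploy_before - deploy_after) / deploy_before
upgrade_gain = (upgrade_before - upgrade_after) / upgrade_before

print(f"Deployment time reduced by {deploy_gain:.0%}")  # ~70%, in line with the quoted ~69%
print(f"Upgrade time reduced by {upgrade_gain:.0%}")    # ~75%
```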

Self-Service Networking and Tenant Automation Enhancements

Along with runtime and provisioning improvements, VCF 9.1 also expands networking automation and tenant self-service capabilities.

The release introduces several new networking-related automation features, including:

  • Tenant IP address pre-allocation
  • Multiple external connections
  • Multiple transit gateways per tenant
  • Direct data center access
  • VPN deployment
  • Gateway firewall support
  • Shared subnet capabilities
  • VLAN extension support

These enhancements are designed to provide additional flexibility for tenant networking and private cloud connectivity requirements.

Tenant IP Address Pre-Allocation

VCF 9.1 introduces tenant IP address pre-allocation capabilities as part of the self-service networking enhancements.

This helps streamline IP management workflows during tenant provisioning and deployment operations.
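The idea behind pre-allocation can be illustrated with a small standard-library sketch: carve non-overlapping per-tenant blocks out of a supernet before any tenant is provisioned. The supernet, prefix length, and tenant names here are assumptions for the example, not VCF defaults.

```python
# Illustrative sketch of pre-allocating per-tenant IP blocks from a
# supernet. Supernet, prefix size, and tenant names are assumptions.
import ipaddress

supernet = ipaddress.ip_network("10.20.0.0/16")
tenants = ["tenant-a", "tenant-b", "tenant-c"]

# Carve one /24 per tenant, in order, so allocations can never overlap.
allocations = dict(zip(tenants, supernet.subnets(new_prefix=24)))

for tenant, block in allocations.items():
    print(f"{tenant}: {block}")
# → tenant-a: 10.20.0.0/24
#   tenant-b: 10.20.1.0/24
#   tenant-c: 10.20.2.0/24
```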

Multiple External Connections

The release also adds support for multiple external connections.

This enhancement provides additional flexibility for connectivity requirements across different tenant or application environments.

Multiple Transit Gateways Per Tenant

Another networking enhancement introduced in VCF 9.1 is support for multiple transit gateways per tenant.

This capability expands networking design flexibility for environments requiring segmented or multi-path connectivity models.

VPN Deployment and Gateway Firewall Support

VCF 9.1 further expands networking automation with support for:

  • VPN deployment
  • Gateway firewall capabilities

These additions enhance networking configuration and connectivity management directly through the automation platform.

Shared Subnets and VLAN Extensions

The release also introduces support for:

  • Shared subnets
  • VLAN extensions

These capabilities further improve networking flexibility for tenant environments and workload connectivity scenarios.

The VMware Cloud Foundation 9.1 release continues to enhance automation and self-service capabilities across private cloud environments.

Based on the official VMware announcement, the release focuses on:

  • Runtime service separation
  • Container lifecycle management
  • Faster VM and VKS provisioning workflows
  • Improved VKS upgrade efficiency
  • Expanded tenant networking automation capabilities

The Fast Deploy enhancements for VMware vSphere Kubernetes Service (VKS) are one of the key highlights of this release, especially with the significant reduction in deployment and upgrade times.

At the same time, the additional networking automation capabilities continue to improve flexibility for self-service private cloud operations within VMware Cloud Foundation environments.

Thursday, April 23, 2026

Designing Supervisor Zone Architecture in VMware Kubernetes Service

As organizations modernize their infrastructure to support cloud-native applications, Kubernetes has become a foundational platform. With VMware Kubernetes Service running natively on vSphere, enterprises can now seamlessly integrate Kubernetes into their existing virtualized environments.

However, a successful deployment is not just about enabling Kubernetes—it requires careful architectural planning. One of the most critical design aspects is the Supervisor Zone Model, which determines how control plane components and workloads are distributed across the infrastructure.

This blog provides a structured view of Supervisor Zone architecture, key design principles, and alignment with enterprise deployments.

Understanding Supervisor Zones

A Supervisor Zone represents a logical failure domain within the vSphere environment. It groups compute, storage, and networking resources to provide:

  • Fault isolation
  • High availability
  • Predictable workload placement

These zones are conceptually similar to availability zones in public cloud platforms but are tightly integrated with on-prem infrastructure managed through vCenter Server and VMware NSX.

Supervisor Deployment Models

Depending on availability and isolation requirements, the Supervisor can be deployed using one of the following models:

1. Single Management Zone – Combined Workloads

In this model, both the Supervisor control plane and workloads run within the same zone.

Characteristics:

  • Simplified deployment
  • Shared resources
  • Single failure domain

Use Case:
Suitable for lab environments, proof-of-concepts, or small-scale deployments.

2. Single Management Zone – Isolated Workloads

The Supervisor control plane is deployed in one zone, while workloads run in separate zones.

Characteristics:

  • Logical separation of workloads
  • Improved resource isolation
  • Control plane remains single zone

Use Case:
Appropriate for environments requiring workload segmentation without complex infrastructure.

3. Three Management Zones – Combined Workloads

The control plane is distributed across three zones, while workloads share the same zones.

Characteristics:

  • High availability for control plane
  • Balanced resource utilization
  • Simplified workload placement

Use Case:
Recommended for production environments where availability is a priority.

4. Three Management Zones – Isolated Workloads

The control plane spans three zones, and workloads are deployed in separate, dedicated zones.

Characteristics:

  • Maximum resilience
  • Strong isolation
  • Enhanced performance predictability

Use Case:
Ideal for enterprise-scale, multi-tenant, and mission-critical environments.

Design Considerations

Zone Scalability

  • A single Supervisor supports up to 30 zones
  • Zones should align with physical or logical boundaries such as racks or availability domains

Networking and Load Balancing

All deployment models support flexible networking and load balancing options.

Networking Models:

  • VPC-based networking
  • NSX-backed segments
  • VLAN-backed networking

Load Balancer Options:

  • NSX Load Balancer
  • Avi Load Balancer
  • VCF-integrated load balancing

These capabilities are enabled through VMware NSX, ensuring consistent networking and security policies.

Platform Constraints

  • All zones must be managed by a single vCenter Server
  • Networking must be provided by a single VMware NSX instance
  • Control plane virtual machines remain within management zones and cannot move across workload zones

These constraints should be considered early during the design phase to avoid rework.

VMware Cloud Foundation Alignment

In environments built on VMware Cloud Foundation, Supervisor architecture aligns with the concept of Workload Domains.

Mapping Overview

  • Workload Domain → Infrastructure boundary
  • Supervisor Cluster → Kubernetes control plane
  • vSphere Cluster → Zone
  • NSX → Networking and security layer

Deployment Lifecycle

Day-0 Deployment:

  • Supervisor is enabled during workload domain creation
  • Limited to a single management zone

Day-2 Operations:

  • Addition of zones
  • Expansion to multi-zone architecture
  • Load balancer and networking adjustments

This staged approach highlights the importance of planning for future scalability.

Networking Considerations

Proper IP planning is essential for successful deployment.

Key elements include:

  • Management network CIDR
  • Pod CIDR
  • Service CIDR
  • External IP pools

In VPC-based environments, communication between Supervisor and workload clusters relies on external IP allocation, making IP planning a critical design step.
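A simple pre-deployment check like the following can catch overlapping ranges before Supervisor enablement. The example CIDRs are placeholder assumptions, not recommended values:

```python
# Minimal pre-deployment check that Supervisor CIDR ranges do not
# overlap. The example ranges are assumptions, not recommended values.
import ipaddress
from itertools import combinations

cidrs = {
    "management": ipaddress.ip_network("192.168.10.0/24"),
    "pods":       ipaddress.ip_network("10.96.0.0/16"),
    "services":   ipaddress.ip_network("10.97.0.0/16"),
    "external":   ipaddress.ip_network("172.16.40.0/24"),
}

for (name_a, net_a), (name_b, net_b) in combinations(cidrs.items(), 2):
    if net_a.overlaps(net_b):
        raise ValueError(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")

print("No overlapping CIDR ranges")
```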

Operations and Access

VCF CLI

The VCF CLI is used for:

  • Authentication
  • Managing Supervisor contexts
  • Generating kubeconfig files

This simplifies cluster access and operational workflows.

SSH Access

  • Direct SSH access via external IP is not supported
  • Access is enabled through:
    • Credentials retrieved from vCenter Server
    • Supervisor management network

Best Practices

  • Prefer three management zones for production environments
  • Use isolated workload zones for better security and performance
  • Align zones with physical infrastructure design
  • Plan networking and CIDR ranges in advance
  • Use Day-2 operations to scale architecture as needed

Supervisor Zone design plays a critical role in determining the success of Kubernetes deployments on vSphere.

While single-zone deployments offer simplicity, multi-zone architectures provide the resilience and scalability required for enterprise workloads. By aligning Supervisor design with infrastructure capabilities and business requirements, organizations can build a robust and future-ready Kubernetes platform.

With platforms like VMware Kubernetes Service and VMware Cloud Foundation, enterprises are well-positioned to deliver consistent, scalable, and secure cloud-native environments.

Wednesday, March 25, 2026

NVMe Memory Tiering in VMware Cloud Foundation 9

In almost every infrastructure design discussion, there comes a point where things stop being elegant.

It usually starts with confidence.
You size your clusters carefully. CPU is balanced. Storage is optimized. Everything aligns with best practices.

And then comes the reality check.

Memory begins to run out.

Not dramatically. Not all at once. But gradually: new workloads, growing applications, increasing user demand. And suddenly, the most expensive component in your design becomes the limiting factor.

So the solution feels obvious.

Add more DRAM.

But that solution comes with a cost—one that grows faster than most teams expect. And over time, a question starts to form:

Are we scaling infrastructure… or just scaling cost?

A Different Way to Think About Memory

This is where NVMe Memory Tiering in VMware Cloud Foundation (VCF) 9 introduces a subtle but powerful shift.

It doesn’t try to replace DRAM.
It doesn’t compromise performance.
It simply changes how memory is used.

At its core lies a simple realization:

Not all allocated memory is actively used at the same time.

Some memory pages are constantly accessed—critical to performance.
Others sit idle for long periods, quietly consuming expensive DRAM.

Traditional systems treat both the same. NVMe Memory Tiering does not.

With NVMe Memory Tiering, memory evolves from a static pool into a dynamic, self-optimizing system.

Instead of relying entirely on DRAM, the system introduces a second layer:

  • DRAM – fast, responsive, and reserved for active workloads
  • NVMe SSD – slightly slower, but highly cost-efficient, used for less active data

What makes this powerful is not the existence of two tiers—but the intelligence that connects them.

The hypervisor continuously observes memory behavior. It identifies which pages are actively used and which are not. Based on this, it quietly reorganizes memory in real time.

Active data remains in DRAM. Inactive data is moved to NVMe.
And if something becomes active again, it is seamlessly brought back.

All of this happens without disruption, without manual tuning, and without the virtual machine ever being aware.

Not a Workaround—A Smarter Design

It is important to understand what NVMe Memory Tiering is not.

It is not swapping.
It is not memory compression.

Those mechanisms react to memory pressure after it occurs.

This is different.

This is proactive.

Instead of waiting for memory to become a problem, the system ensures that:

  • High-performance memory is always available where it matters
  • Lower-cost memory absorbs what does not need speed

It’s a shift from reacting to optimizing.

Expanding Capacity Without Expanding Cost

One of the most compelling outcomes of this approach is its impact on scalability.

Because NVMe storage is significantly more cost-effective than DRAM, it can be used to extend memory capacity in a meaningful way.

A system configured with 512 GB of DRAM can effectively support workloads as if it had close to double that capacity—without physically doubling DRAM.

This is not an illusion.
It is the result of using memory more efficiently.
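The "close to double" claim follows from simple sizing arithmetic. The 1:1 NVMe:DRAM ratio below is an assumption chosen to match that claim; in practice the tier ratio is configurable.

```python
# Back-of-the-envelope sizing for the 512 GB example above.
# The 1:1 NVMe:DRAM ratio is an assumption matching the text's
# "close to double" claim; the ratio is configurable in practice.
dram_gb = 512
nvme_ratio = 1.0                          # NVMe tier sized at 100% of DRAM
effective_gb = dram_gb * (1 + nvme_ratio)

print(f"Effective memory capacity: {effective_gb:.0f} GB")  # → 1024 GB
```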

The Balance That Makes It Work

Despite its elegance, NVMe Memory Tiering is not magic. It follows a very important rule:

DRAM must always be sufficient to hold the active working set.

This is the foundation of good design.

If active memory exceeds DRAM capacity, the system is forced to rely more heavily on NVMe. While NVMe is fast, it is still not DRAM. Over time, this imbalance can introduce latency that applications may begin to feel.

This is why understanding workload behaviour is critical.

The success of NVMe Memory Tiering is not defined by how much memory you allocate—but by how well you understand what is actively used.

Where It Truly Delivers Value

When aligned with the right workloads, NVMe Memory Tiering can feel transformative.

In VDI environments, where user activity fluctuates and large portions of memory remain idle, it dramatically improves density and cost efficiency.

In development and testing environments, where systems are often over-provisioned, it brings balance without sacrificing flexibility.

In mixed workload clusters, it introduces a level of intelligence that allows infrastructure to adapt naturally to changing demands.

However, in environments where latency is critical—such as real-time systems or large in-memory databases—DRAM remains irreplaceable. These workloads demand consistency above all else.

Understanding this distinction is what defines a mature design.

Designing with Insight, Not Assumption

The most effective use of NVMe Memory Tiering begins long before it is enabled.

It begins with observation.

How much memory is truly active?
When do workloads peak?
How much of what is allocated is used?

These are the questions that shape a successful design.

Because ultimately, NVMe Memory Tiering is not about adding capacity.
It is about unlocking unused potential.

A Shift in How We Build Infrastructure

If you step back and look at the bigger picture, NVMe Memory Tiering represents something more fundamental.

For years, infrastructure scaling has been tied directly to hardware:

  • More demand meant more resources
  • More resources meant higher cost

But that model is changing.

We are moving toward systems that:

  • Understand usage patterns
  • Adapt in real time
  • Optimize themselves without constant intervention

This is the essence of modern, software-defined infrastructure.


There is something quietly powerful about a system that improves efficiency without demanding attention.

No complexity exposed to the user.
No disruption to applications.
No constant tuning required.

Just a smarter way of using what already exists.


Wednesday, December 10, 2025

Live Patching in VMware Cloud Foundation 9 – A Major Leap in Zero-Downtime Lifecycle Management

With VMware Cloud Foundation 9, Live Patching has evolved from a promising feature into a truly powerful capability that transforms how infrastructure teams manage ESXi hosts at scale. In previous releases, Live Patch was mainly limited to the VM execution layer. But with VCF 9, the technology has matured significantly — expanding the scope of what can be patched without downtime and delivering deeper integration with the SDDC Manager lifecycle workflows.

This is a major step toward a future where critical infrastructure stays continuously available while staying continuously updated.

What’s New With Live Patching in VCF 9

VCF 9 introduces enhanced Live Patch capabilities across the ESXi host stack, making patching even more seamless:

1. Expanded Patch Coverage

Earlier releases focused primarily on the VMX/Virtual Machine execution component.
In VCF 9, Live Patch now supports updating:

  • Key vmkernel components
  • Select user-space daemons
  • Additional management agents
  • Newer security and stability modules

This means more patches can be applied without rebooting the host or impacting workloads.

2. Deep Integration With SDDC Manager

Lifecycle Manager in VCF 9 automatically identifies whether a patch is live-patchable or requires a traditional reboot workflow.
Admins now get:

  • Automated compatibility checks
  • Integrated “Live Patch Eligible” flag in LCM workflows
  • No need to manually track which patches need downtime

This tight integration helps ensure that clusters stay compliant without manual planning or human error.

3. Improved Fast-Suspend-Resume (FSR) Reliability

Live Patch still uses VMware’s Fast-Suspend-Resume mechanism, but VCF 9 includes:

  • Faster switchover to patched components
  • Better support for larger clusters
  • Reduced risk of VM interruptions
  • Improved handling of parallel patching operations

The result is even lower operational impact during patch transitions.

Why Live Patching in VCF 9 Is a Game-Changer

Zero Downtime for More Patch Types

With a much broader set of components eligible for Live Patch, maintenance windows become rare.
Most security fixes — even those in core components — can now be applied live.

Stronger Security Posture

Organizations can respond to vulnerabilities immediately. No delays. No dependency on host evacuations or cluster capacity.

Perfect for Large, High-Density Environments

In large VCF workload domains, draining hosts or performing rolling reboots is time-consuming and sometimes impractical.
Live Patching keeps workloads steady and reduces cluster churn.

Automated & Consistent Lifecycle Management

SDDC Manager orchestrates the entire live patching process, eliminating guesswork and ensuring compliance across all hosts in a domain.

Significant Operational Savings

Less downtime planning.
Fewer after-hours changes.
Lower admin overhead.
Higher SLA compliance.

Considerations in VCF 9

Even with expanded coverage, Live Patch is not universal:

  • Certain driver updates, hardware-dependent modules, storage controllers, and NIC firmware still require reboots.
  • VMs using FT, DirectPath I/O, or unsupported workloads may not participate in FSR.
  • All hosts in the domain must meet the required ESXi baseline before enabling Live Patch cycles.

VCF 9 clearly labels these cases and routes them through a traditional maintenance mode workflow.

Where Customers Benefit Most

Live Patching in VCF 9 is ideal for:

  • Mission-critical workloads with strict uptime requirements
  • Customers running large clusters or multiple workload domains
  • Cloud providers and MSPs managing hundreds of hosts
  • Financial, telecom, and healthcare environments
  • AI/ML and GPU-heavy workloads where host evacuations are costly


Live Patching in VCF 9 represents the next level of VMware’s commitment to continuous, resilient, and automated infrastructure operations. By expanding live-patchable components and integrating the feature seamlessly into SDDC Manager, VMware has made it possible for organizations to stay secure and compliant without sacrificing uptime.

This is not just an enhancement — it is a redefinition of how lifecycle management should work in modern datacenters.

Live Patching in VMware Cloud Foundation 9 – A Major Leap in Zero-Downtime Lifecycle Management

 

With VMware Cloud Foundation 9, Live Patching has evolved from a promising feature into a truly powerful capability that transforms how infrastructure teams manage ESXi hosts at scale. In previous releases, Live Patch was mainly limited to the VM execution layer. But with VCF 9, the technology has matured significantly — expanding the scope of what can be patched without downtime and delivering deeper integration with the SDDC Manager lifecycle workflows.

This is a major step toward a future where critical infrastructure stays continuously available while staying continuously updated.

What’s New With Live Patching in VCF 9

VCF 9 introduces enhanced Live Patch capabilities across the ESXi host stack, making patching even more seamless:

1. Expanded Patch Coverage

Earlier releases focused primarily on the VMX/Virtual Machine execution component.
In VCF 9, Live Patch now supports updating:

  • Key vmkernel components
  • Select user-space daemons
  • Additional management agents
  • Newer security and stability modules

This means more patches can be applied without rebooting the host or impacting workloads.

2. Deep Integration With SDDC Manager

Lifecycle Manager in VCF 9 automatically identifies whether a patch is live-patchable or requires a traditional reboot workflow.
Admins now get:

  • Automated compatibility checks
  • Integrated “Live Patch Eligible” flag in LCM workflows
  • No need to manually track which patches need downtime

This tight integration helps ensure that clusters stay compliant without manual planning or human error.

3. Improved Fast-Suspend-Resume (FSR) Reliability

Live Patch still uses VMware’s Fast-Suspend-Resume mechanism, but VCF 9 includes:

  • Faster switchover to patched components
  • Better support for larger clusters
  • Reduced risk of VM interruptions
  • Improved handling of parallel patching operations

The result is even lower operational impact during patch transitions.

Why Live Patching in VCF 9 Is a Game-Changer

Zero Downtime for More Patch Types

With a much broader set of components eligible for Live Patch, maintenance windows become rare.
Most security fixes — even those in core components — can now be applied live.

Stronger Security Posture

Organizations can respond to vulnerabilities immediately. No delays. No dependency on host evacuations or cluster capacity.

Perfect for Large, High-Density Environments

In large VCF workload domains, draining hosts or performing rolling reboots is time-consuming and sometimes impractical.
Live Patching keeps workloads steady and reduces cluster churn.

 Automated & Consistent Lifecycle Management

SDDC Manager orchestrates the entire live patching process, eliminating guesswork and ensuring compliance across all hosts in a domain.

Significant Operational Savings

Less downtime planning.
Fewer after-hours changes.
Lower admin overhead.
Higher SLA compliance.

Considerations in VCF 9

Even with expanded coverage, Live Patch is not universal:

  • Certain driver updates, hardware-dependent modules, storage controllers, and NIC firmware still require reboots.
  • VMs using FT, DirectPath I/O, or unsupported workloads may not participate in FSR.
  • All hosts in the domain must meet the required ESXi baseline before enabling Live Patch cycles.

VCF 9 clearly labels these cases and routes them through a traditional maintenance mode workflow.
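The VM-level restriction can be expressed the same way. Below is a hypothetical sketch for flagging VMs that would sit out of FSR; the attribute names are invented for illustration and are not a real vSphere API.

```python
# Hypothetical sketch: attribute names ("fault_tolerance", "directpath_io")
# are invented for illustration. Per the considerations above, VMs using
# FT or DirectPath I/O may not participate in Fast-Suspend-Resume.

def fsr_ineligible(vms):
    """Return names of VMs that cannot take part in FSR
    during a live patch cycle."""
    return [vm["name"] for vm in vms
            if vm.get("fault_tolerance") or vm.get("directpath_io")]
```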

Where Customers Benefit Most

Live Patching in VCF 9 is ideal for:

  • Mission-critical workloads with strict uptime requirements
  • Customers running large clusters or multiple workload domains
  • Cloud providers and MSPs managing hundreds of hosts
  • Financial, telecom, and healthcare environments
  • AI/ML and GPU-heavy workloads where host evacuations are costly

Live Patching in VCF 9 represents the next level of VMware’s commitment to continuous, resilient, and automated infrastructure operations. By expanding live-patchable components and integrating the feature seamlessly into SDDC Manager, VMware has made it possible for organizations to stay secure and compliant without sacrificing uptime.

This is not just an enhancement; it is a redefinition of how lifecycle management should work in modern data centres.

Saturday, December 6, 2025

Upgrading a vSphere 8.x Environment to VMware Cloud Foundation 9.0 – Real-World Journey


The release of VMware Cloud Foundation (VCF) 9.0 marks a major shift in how modern private cloud platforms are engineered and managed. For organizations operating a vSphere 8.x environment, the path to VCF 9.0 introduces a more modular architecture, improved lifecycle management, stronger security baselines, and support for next-generation workloads.

This guide provides a deep, end-to-end walkthrough of the upgrade journey—from preparation and compatibility validation through the actual upgrade sequencing and post-upgrade verification. The goal is to help architects and administrators execute this transition confidently, with clarity on each critical step.








Why Move From vSphere 8.x to VCF 9.0

Although the vSphere 8.x setup was stable and well-structured—with multiple clusters operating reliably across compute-only hosts, vSAN-based nodes, and some NSX-integrated workloads—it still carried several limitations typical of a growing data centre. The environment functioned well day to day, but the underlying operational challenges signaled the need for a more unified and automated cloud platform.

  • Lifecycle management tasks were still manual and time-consuming
  • Host upgrades required extended maintenance windows
  • Network configuration consistency differed across clusters
  • Governance and policy enforcement weren’t unified
  • Operational tooling was fragmented across different systems

At the same time, there was a clear goal to achieve:

  • A private cloud experience aligned with hyperscaler standards
  • Automated, streamlined operations
  • Centralized lifecycle management for the entire stack
  • A foundation ready for Kubernetes and modern application platforms

VCF 9.0 delivered exactly the kind of integrated, automated, and future-ready platform needed to address these requirements.

The First Step: Understanding What We’re Actually Changing

VCF 9.0 is not like “upgrading vCenter from 8.0 to 8.0U3.”
It’s a platform-level transformation.

When you transition from vanilla vSphere to VCF, three things change dramatically:

1. Your infrastructure becomes governed by a Fleet (VCF Fleet Management)

Everything — ESXi hosts, vCenter, NSX, vSAN, certificates, operations — begins to live under a unified lifecycle management engine.

2. Your management architecture gets an entire redesign

VCF 9 introduces Fleet, Operations, and Automation components that work together. This simplifies operations but changes how things are deployed and updated.

3. Your cluster upgrade model becomes image-based only

No more baselines.
No more VUM.
This was a big shift for the customer.

Understanding these changes helped set the right expectations before touching anything.

 

Pre-Upgrade Checklist: What I Checked (and Double-Checked)

I’ve done enough upgrades to know: 70% of failures happen due to missing prerequisites.

So here’s what I validated before even thinking of VCF:

Hardware compatibility (HCL)

  • CPU family supported for ESXi 9.x
  • NIC/FW/HBA firmware compatibility
  • vSAN ESA readiness (for their vSAN-enabled clusters)

Networking: MTU, VLANs, TEP readiness

VCF 9 doesn’t enforce NSX overlay for every cluster, but if you want it, you need MTU 1600+.

Even if you don’t want overlay now — plan for it.
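A practical way to validate TEP MTU end to end is a don't-fragment ping sized to exactly fill the target MTU. The sketch below computes the ICMP payload (MTU minus the 20-byte IPv4 header and 8-byte ICMP header) and builds a vmkping-style command; verify the exact flags against your ESXi build before relying on them.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header

def ping_payload_for_mtu(mtu):
    """ICMP payload size that exactly fills an IPv4 frame of the given MTU."""
    return mtu - IP_HEADER - ICMP_HEADER

def vmkping_command(target_ip, mtu=1600, stack="vxlan"):
    # Example ESXi-style invocation: "-d" sets don't-fragment, "-s" the
    # payload size. Confirm flag support on your specific ESXi version.
    return ["vmkping", "++netstack=" + stack, "-d",
            "-s", str(ping_payload_for_mtu(mtu)), target_ip]
```

For MTU 1600 this yields a 1572-byte payload; if that ping fails with don't-fragment set while a smaller one succeeds, the path is not carrying the MTU you think it is.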

DNS, NTP, Certificates

VCF is extremely sensitive to:

  • forward/reverse lookups,
  • certificate mismatches,
  • expired PSC/SSO certs.
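A small pre-check script can catch forward/reverse mismatches before VCF's own validation does. This is a minimal sketch; the resolver functions are injectable so the logic can be exercised without live DNS, and any FQDN/IP you pass in are your own values.

```python
import socket

def check_forward_reverse(fqdn, expected_ip,
                          forward=socket.gethostbyname,
                          reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """Verify fqdn -> expected_ip and expected_ip -> fqdn.
    Resolvers are injectable so the check is testable without live DNS."""
    problems = []
    try:
        ip = forward(fqdn)
        if ip != expected_ip:
            problems.append(f"forward lookup returned {ip}, expected {expected_ip}")
    except OSError as exc:
        problems.append(f"forward lookup failed: {exc}")
    try:
        name = reverse(expected_ip)
        if name.rstrip(".").lower() != fqdn.rstrip(".").lower():
            problems.append(f"reverse lookup returned {name}, expected {fqdn}")
    except OSError as exc:
        problems.append(f"reverse lookup failed: {exc}")
    return problems
```

Running this against every management FQDN before deployment turns the most common VCF failure mode into a five-minute fix.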

Backup of all management components

Rule: If it boots, back it up.
vCenter, NSX Manager, Aria components — everything.

Operations tools version readiness

If the customer had older versions of:

  • Aria Operations,
  • Aria Operations for Logs,
  • Aria Automation,

…they must be upgraded before joining the VCF 9 Fleet.

Licensing

A surprisingly common delay.
We pre-validated VCF licenses before starting.

 

My Upgrade Strategy: Breaking It into Logical Phases

Instead of treating this as one giant upgrade, I approached it in four major phases:

Phase 1 — Stabilize and Upgrade the Existing vSphere 8.x Environment

This includes:

  • Upgrading vCenter to a version supported by VCF Installer
  • Making sure ESXi hosts are healthy
  • Ensuring NSX Managers (if present) are compatible

For vCenter, I chose the “reduced downtime” upgrade path.
It creates a new appliance and copies over config — safer and cleaner.

For ESXi hosts, I started preparing the shift from baseline to image-based lifecycle, because VCF will enforce image compliance later anyway.

This phase established the foundation for everything that followed.

Phase 2 — Upgrade or Deploy VCF Operations

This was the first moment where I really saw the shift from “vSphere admin” to “cloud admin.”

We had two options:

Option A: Upgrade existing Aria Suite to versions supported by VCF

or

Option B: Deploy VCF Operations fresh

I chose Option A because I had existing dashboards and compliance packs I wanted to retain.

A few notes from this phase:

  • Operations upgrade pre-checks are extremely strict
  • Old credentials stored in Aria can break registration workflows
  • Time sync (NTP) must be perfect between all appliances
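The NTP point is worth automating. Here is a tiny sketch that compares clock offsets collected from each appliance (for example from `chronyc tracking` or `ntpq`) and flags drift beyond a tolerance; the 1-second default is my placeholder, so pick whatever tolerance your pre-checks actually require.

```python
def max_clock_drift(offsets):
    """offsets: appliance name -> NTP offset in seconds.
    Returns the largest pairwise drift between any two appliances."""
    values = list(offsets.values())
    return max(values) - min(values) if values else 0.0

def drift_ok(offsets, tolerance=1.0):
    # Placeholder tolerance: substitute the value your platform
    # pre-checks enforce before trusting this gate.
    return max_clock_drift(offsets) <= tolerance
```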

Once Aria was upgraded, we registered it properly with SDDC Manager.

 

Phase 3 — Deploy VCF Installer (The New Heart of Everything)

VCF 9 doesn’t use Cloud Builder. Instead, everything begins with the VCF Installer.

This step felt like “building a new control tower” while the airport is still active.

Steps I took:

1. Deployed the VCF Installer OVA

Simple enough, but ensure:

  • DNS resolution is perfect
  • IP addresses are reserved
  • FQDN matches forward/reverse

2. Configured online/offline bundle access

The environment had strict firewall restrictions, so we used an offline bundle depot hosted on an internal web server.

This avoided any internet dependency.

3. Connected Installer to the existing vSphere 8 environment

Here, I selected:

  • Using the existing vCenter
  • Using existing ESXi hosts
  • Using upgraded Aria components

4. Performed pre-checks

VCF pre-checks are extensive.
They will catch:

  • DNS mismatches
  • MTU inconsistencies
  • NTP drift
  • Host hardware issues
  • Missing drivers
  • Certificate chain trust problems

I spent the most time here.

But honestly — fixing issues before deploying Fleet saved us hours later.

Phase 4 — Converging Into a VCF 9 Fleet

This was the most exciting part.

VCF Fleet Management discovers your environment and begins standardizing it.

The Installer automatically:

  • creates the Fleet database,
  • sets up SDDC Manager,
  • registers Aria Operations & Logging,
  • connects to vCenter,
  • establishes governance,
  • and prepares workload domains.

After this, the environment officially becomes VCF 9. It felt like everything clicked into place.

 

Post-Upgrade Work: What I Did to Finalize Everything

Upgrading isn't over until the environment is stable and integrated.

I focused on:

1. Verifying Fleet inventory

Checking that the following were all correctly discovered:

  • hosts
  • clusters
  • vCenter
  • NSX Managers
  • Aria tools
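That inventory check is easy to script as a set comparison between what you expected the Fleet to discover and what SDDC Manager actually reports. A minimal sketch; the component names used in the example are placeholders.

```python
def inventory_gaps(expected, discovered):
    """Compare expected components against what the Fleet discovered.
    Returns (missing, unexpected), each sorted for stable reporting."""
    expected, discovered = set(expected), set(discovered)
    return sorted(expected - discovered), sorted(discovered - expected)
```

Anything in the "missing" list means discovery did not complete; anything "unexpected" is worth investigating before you call the environment converged.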

2. Validating image compliance

VCF now enforces image-based lifecycle. I created cluster images and remediated any drift.
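Drift detection itself boils down to comparing the desired cluster image against what each host reports. Below is a hedged sketch with invented component names; a real image spec carries far more detail, but the shape of the comparison is the same.

```python
def drifted_hosts(desired_image, hosts):
    """desired_image: component -> version in the cluster image.
    hosts: host name -> {component: version} as reported by the host.
    Returns hosts whose versions differ, with (desired, reported) pairs."""
    out = {}
    for host, reported in hosts.items():
        diffs = {c: (v, reported.get(c))
                 for c, v in desired_image.items()
                 if reported.get(c) != v}
        if diffs:
            out[host] = diffs
    return out
```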

3. Running operational sanity checks

  • vMotion
  • DRS behaviour
  • vSAN health
  • Host remediation testing
  • Backup tool integration
  • Logging ingestion

4. Re-validating integrations

  • AD/LDAP
  • Certificate authority
  • Syslog
  • Monitoring tools
  • Backup vendors

5. Documenting everything

Always, always document:

  • build versions
  • IP/FQDN mapping
  • upgrade decisions
  • rollback plan
  • cluster design
  • lifecycle policy

This helps both you and the admins who come after you.

What I Learned From This Upgrade

1. VCF 9 is not “just an upgrade” — it’s a platform transition

It changes how you operate your data center.

2. Lifecycle management becomes dramatically easier

Once Fleet is in place, upgrades feel like cloud updates.

3. Pre-checks decide your success

If pre-checks are green, the rest of the journey becomes smooth.

4. DNS, MTU, and certificates are the silent killers

Almost every deployment issue traces back to one of these.

5. Documentation gaps matter

I documented every decision, so the next person doesn’t struggle.

Upgrading from vSphere 8.x to VMware Cloud Foundation 9.0 is one of the most meaningful modernization steps you can take in a private cloud environment. It brings consistency, automation, lifecycle uniformity, and long-term stability.

But it’s not a “click next” upgrade.
It requires thoughtful planning, clear understanding, and methodical execution.

If you understand the journey, prepare thoroughly, and respect the dependencies, the upgrade becomes smooth — and honestly, rewarding.


This was my journey, and I hope sharing it helps someone preparing for theirs.

