VCF 9.1: The Features That Actually Matter If You Run VCF < ProVirtualzone

This post is not a walkthrough of every new feature in VCF 9.1. It is a look at the changes that actually affect how you run the platform, from someone who does.

VCF 9.1 shipped on May 5, and if you read the release notes end to end, you will find well over eighty individual changes. Every VMware blog already walks through all of them, pillar by pillar, in the same order as the release notes present them. I am not going to do that, because a list of eighty features in the vendor’s own order tells you nothing about what to actually care about.

What follows is how I read this release after years of running VCF in production. I have grouped the changes around the operational themes that actually matter when you run the platform, rather than around the Compute, Storage, Network, and Kubernetes structure used in the documentation. Some of what Broadcom put front and center barely registers for a working operator, and some of what they buried in the middle of the release notes is the best thing in the whole version. This is my read, and yours may differ depending on your environment. That is fine. But if you want an operator’s opinion rather than a feature summary, here it is.

This Is the Release That Finally Attacks Downtime

If there is one theme that ties the best parts of 9.1 together, it is the reduction of downtime and maintenance effort. Broadcom scattered these changes across the Compute and Operations sections of the release, but when you put them together, they tell a single story, and it is the story I care about most. Anyone who runs VCF at scale spends a large part of their life planning and executing maintenance windows, and this release goes after that work directly.

vCenter Quick Patch

This is the change I am happiest about, and it is not close. Until now, applying a vCenter security patch meant taking vCenter down. Every single fix, no matter how small, needed a full maintenance window because the appliance had to go offline to be patched. When you run multiple vCenters across multiple datacenters, it adds up to a genuinely painful amount of scheduled downtime over the year, and security patches are not optional. Quick Patch changes the model so that only the services actually affected by the patch need to be restarted, and vCenter continues to operate throughout the process. In practice, that means many vCenter patches move from a planned after-hours window to something you can do with little or no disruption. That is not a flashy feature, and it will never headline a launch, but it removes one of the most tedious recurring tasks in the entire platform. For me, this alone is a strong argument for 9.1. The only caution is that not every patch will qualify for the reduced-downtime path, so you still need to read what each patch actually requires before you assume it is a no-impact operation.

ESX Live Patch, on by default

The host equivalent of the same idea, and just as welcome. ESX Live Patch lets you apply certain host patches without the full evacuate, patch, reboot, and re-populate cycle that has defined host maintenance for as long as I can remember. In 9.1, it is enabled by default, and coverage has been expanded to include TPM support. The reason this matters is entirely a matter of scale. Patching ten hosts the old way is an afternoon. Patching several hundred hosts the old way is a multi-day rolling operation that consumes an engineer’s attention, forces DRS to shuffle workloads across the whole estate, and carries real risk every time a host comes back up. Anything that lets a host take a patch without a full reboot cycle removes a large amount of that effort and risk. The honest caveat, same as with vCenter, is that not all patches are live-patchable, so you will still have reboot-required fixes in the mix. But moving the default to live patching and widening what qualifies is exactly the right direction, and at scale, the time saved is measured in engineer-days, not minutes.

Per-host upgrade selection and faster parallel upgrades

This one fixes something that has genuinely frustrated me in production. In the old model, a cluster upgrade was all-or-nothing. If one host in the cluster threw an error partway through, the whole upgrade stalled, leaving you either troubleshooting that host under time pressure in the middle of your window or rolling the entire operation back and starting again. Neither is a good place to be at two in the morning. In version 9.1, you can select specific hosts for the upgrade and skip those with issues, so a single misbehaving host no longer holds the rest of the cluster hostage. You finish the healthy hosts, close the window, and deal with the outlier as a separate, unhurried task. On top of that, Broadcom claims a fourfold improvement in parallel cluster upgrade operations, and the prechecks have been reworked to run natively rather than leaning entirely on SDDC Manager, with results you can export to CSV. I will believe the parallelism number when I measure it in my own environment, but the per-host control is a concrete, obvious improvement that changes how much you can safely attempt in a single maintenance window.
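As a rough illustration of the per-host workflow, the sketch below takes an exported precheck CSV and splits a cluster into hosts to upgrade now and hosts to defer. The column names (`host`, `check`, `status`) and the `PASS` value are assumptions made up for this example. The real export layout in 9.1 may look different, and nothing here calls a VCF API.

```python
import csv
import io


def split_hosts_by_precheck(csv_text):
    """Return (ready, deferred) host lists from a precheck CSV export.

    Assumes hypothetical columns 'host' and 'status'. A host is ready
    only if every one of its precheck rows has status PASS.
    """
    failed = set()
    seen = []  # preserves the order hosts first appear in the export
    for row in csv.DictReader(io.StringIO(csv_text)):
        host = row["host"].strip()
        if host not in seen:
            seen.append(host)
        if row["status"].strip().upper() != "PASS":
            failed.add(host)
    ready = [h for h in seen if h not in failed]
    deferred = [h for h in seen if h in failed]
    return ready, deferred


if __name__ == "__main__":
    # Hypothetical export: esx02 fails one check, the others are clean.
    sample = (
        "host,check,status\n"
        "esx01,disk_space,PASS\n"
        "esx01,hardware_compat,PASS\n"
        "esx02,disk_space,FAIL\n"
        "esx03,disk_space,PASS\n"
    )
    ready, deferred = split_hosts_by_precheck(sample)
    print("upgrade now:", ready)
    print("handle later:", deferred)
```

The point is the shape of the decision rather than the parsing: the healthy hosts go into the window, and the outlier becomes its own ticket.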

DRS maintenance evacuation and parallel vMotion

Two DRS improvements that belong together because they both improve maintenance operations. The first change is how a host enters maintenance mode. Previously, evacuating a heavily loaded host could create resource contention across the rest of the cluster because DRS would push workloads out regardless of whether the remaining hosts could comfortably absorb them. In 9.1, DRS can delay the evacuation if compute demand cannot be met and rebalance the remaining capacity first, so you do not trade a clean maintenance operation for a performance problem everywhere else.
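To make the idea concrete, here is a minimal sketch of the kind of question involved: before evacuating a host, can the remaining hosts absorb its demand without being pushed past a utilisation ceiling? The function name, the single-resource model, and the `headroom` threshold are all assumptions for illustration. The actual DRS logic is not described in the release material I have read and is certainly more involved.

```python
def can_evacuate(hosts, target, headroom=0.9):
    """Return True if the other hosts can absorb the target's demand.

    hosts maps host name -> (capacity, demand) in the same unit, e.g. GHz.
    headroom is an assumed maximum utilisation allowed after evacuation.
    This is an aggregate check: it ignores whether individual VMs fit
    on individual hosts.
    """
    _, moving_demand = hosts[target]
    others = {name: v for name, v in hosts.items() if name != target}
    if not others:
        return False  # nowhere to move the workloads
    usable = sum(cap * headroom for cap, _ in others.values())
    current = sum(dem for _, dem in others.values())
    return current + moving_demand <= usable


if __name__ == "__main__":
    # Illustrative numbers only.
    busy = {"esx01": (100, 70), "esx02": (100, 80), "esx03": (100, 75)}
    quiet = {"esx01": (100, 40), "esx02": (100, 50), "esx03": (100, 45)}
    print(can_evacuate(busy, "esx01"))   # evacuation should wait
    print(can_evacuate(quiet, "esx01"))  # evacuation can proceed
```

When the check fails, the old behaviour was effectively to evacuate anyway. The new behaviour, as described, is to hold off and rebalance first.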

The second change allows DRS to process vMotion tasks in parallel, starting new migrations as soon as one completes, rather than waiting for an entire batch to finish. For any operation that moves a lot of VMs at once (host evacuations, rebalancing, maintenance prep), that parallelism cuts the total time meaningfully. Neither of these is glamorous. Both make the single most common operational task, moving workloads around safely, faster and less risky. At high consolidation ratios, which is where most of us run to control cost, they matter more than any headline feature in the release.
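The difference between the two scheduling styles is easy to see in a toy model. The sketch below compares total time when each batch must fully finish before the next one starts against total time when a new migration starts the moment a slot frees up. The durations and the slot count are invented for illustration, and this models only the scheduling idea, not how DRS is actually implemented.

```python
import heapq


def batch_makespan(durations, slots):
    """Total time when each batch of `slots` migrations must fully
    finish before the next batch starts."""
    total = 0
    for i in range(0, len(durations), slots):
        total += max(durations[i:i + slots])
    return total


def rolling_makespan(durations, slots):
    """Total time when a new migration starts as soon as any slot frees up."""
    finish_times = [0] * slots  # min-heap: when each slot becomes free
    heapq.heapify(finish_times)
    for d in durations:
        start = heapq.heappop(finish_times)  # earliest free slot
        heapq.heappush(finish_times, start + d)
    return max(finish_times)


if __name__ == "__main__":
    # Illustrative vMotion durations in minutes: a mix of large and small VMs.
    durations = [10, 2, 2, 10, 2, 2]
    print("batch:  ", batch_makespan(durations, slots=2))
    print("rolling:", rolling_makespan(durations, slots=2))
```

In the batch model, every small VM that shares a batch with a large one leaves a slot idle until the large migration finishes. The rolling model fills that idle time, which is where the saving comes from when VM sizes vary a lot.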

The Management Stack Was Rebuilt Underneath You

This is the part of 9.1 that nobody markets and everybody who runs VCF needs to understand. A set of structural changes to the management components will not show up as a feature you use, but they will show up the moment you open a runbook or an architecture diagram written for 9.0. I would read this section before planning any upgrade, because these are the changes most likely to surprise you.

VCF Operations consolidation and the license server

In 9.0, the Fleet Management Appliance was a distinct component in the management domain. In 9.1, it is gone, replaced by a fleet lifecycle capability folded directly into VCF Operations, which now handles installation, upgrades, patching, backups, and restores of the management components from a single place. At the same time, your licenses have moved out of VCF Operations and into a dedicated license server that installs automatically alongside it. Individually, these sound like plumbing. Together, they change the shape of the management stack you deploy and operate. If your design documents, your runbooks, or your recovery procedures reference the Fleet Management Appliance, they are now out of date, and the time to discover that is during planning for the upgrade, not during execution.

I am generally in favor of this direction because consolidating lifecycle management into fewer moving parts is the right call for a platform as complex as this. But it is the kind of change that quietly invalidates institutional knowledge, and teams that treat 9.1 as a simple version bump will get caught out. Treat the management-stack changes as their own workstream in your upgrade planning.

The Identity Broker replaces VIDM

This one deserves attention from anyone upgrading from an older release, because identity is where upgrades go wrong quietly. VIDM has been deprecated in the 9.x line and replaced by the Identity Broker, which is now deployed as part of a 9.1 upgrade. If you are coming from 5.x, there is a scripted workflow to migrate users and groups non-disruptively once the base upgrade is done, and you can run the broker either embedded in vCenter or as a separate three-node cluster for better availability. The reason I flag this is that authentication is the thing that, when it breaks during an upgrade, breaks everything at once and generates the loudest phone calls.

The migration path being scripted and non-disruptive is good, but you still need to plan it, test it, and understand which mode you want the broker deployed in before you start. Moving the broker out of the management domain constraint and allowing a proper clustered deployment is a real availability improvement, but it is also a change to a component you cannot afford to get wrong. Do not let this one ride along unnoticed inside the broader upgrade.

Storage Grows Up for People Who Outgrew HCI

For years, the vSAN story in VCF was fundamentally an HCI story: compute and storage together in the same hosts. Much of the storage work in 9.1 is aimed at people who have moved past that model, or want to, and it is the clearest signal yet that VCF is serious about disaggregated storage. If you run traditional local vSAN, you will not notice most of this. If you run compute and storage separately, or plan to, this is the release where that architecture becomes meaningfully more capable.

Remote vSAN datastores across vCenter boundaries

The change that stands out most to me is the ability of vSAN storage clusters to provide storage across vCenter boundaries, both between workload domains within a single VCF installation and across separate VCF deployments. Add to that mixed-mode support, where a cluster can mount both vSAN OSA and ESA datastores at the same time, and you have real flexibility that simply did not exist before. This matters because the old vCenter boundary was a hard architectural wall for vSAN, and designing around it forced awkward compromises in larger environments. Being able to share a storage cluster across that boundary enables designs that were previously impossible or would have required duplicating capacity. It also makes migrating data between OSA and ESA far less painful, which is a real concern for anyone still sitting on OSA and planning the move to ESA.

My one reservation is operational: cross-vCenter storage dependencies add a failure-domain question you did not have before, and I would want to understand exactly how these remote mounts behave during a site or vCenter failure before I relied on them for anything critical. The capability is excellent. The design discipline it demands is higher, not lower.

Data-in-transit encryption and global deduplication

Two storage changes matter most in exactly the disaggregated setups that the remote datastore feature enables. The first is data-in-transit encryption between the client clusters that mount a datastore and the storage cluster that provides it. Once you separate compute from storage, your storage traffic crosses the network in ways it never did on a self-contained HCI host, and encrypting that traffic is not a nice-to-have; it is table stakes for many compliance regimes. Combining it with data-at-rest encryption gives you genuine end-to-end protection for disaggregated vSAN, closing a real gap.

The second is that vSAN ESA global deduplication has gone GA and now works alongside data-at-rest encryption without wrecking your data reduction ratios, with deduplication supported across three to sixty-four hosts. For anyone whose storage costs are driven by capacity, dedup that survives encryption is a direct saving. Neither of these is exciting to write about, but both are capabilities that quietly make a disaggregated design defensible to a security team and a finance team at the same time, which is usually where these projects live or die.

The Multi-Tenant Networking Rework

There is a substantial block of networking changes in 9.1 that only matter if you run a multi-tenant platform, but if you do, they matter a lot. This is provider territory, and it is where a lot of the real engineering in this release went. If you run a single-tenant enterprise environment, you can read this section for interest and move on. If you run tenants, read it twice.

Transit gateways and network services without edges

The networking work’s headline is decoupling network services from the traditional edge model. Virtual Network Appliances now run services for distributed external connections, so much of the L2, L3, and external IP traffic can be distributed rather than funneled through edge nodes, with only NAT and load balancer traffic redirected. On top of that, transit gateways gained real flexibility: HA mode per gateway with the gateway decoupled from the Tier-0, multiple gateways per project, and multiple external connections per gateway. For a provider, this is the difference between fighting the platform to build the tenant topologies you actually need and having the platform natively support them. The old edge-centric model forced design compromises in dense multi-tenant environments, particularly around how many service routers you could reasonably run and how you scaled external connectivity per tenant. Distributing services and giving each tenant gateway proper independence removes a real ceiling.

I want to spend serious time testing the distributed model at scale before I move production tenants onto it, because networking changes of this depth always have edge cases that only show up under load, but the direction is exactly what a provider platform needs. This is the most consequential part of 9.1 for anyone in the hosting business.


Connectivity policy for tenant isolation

Alongside the transit gateway work is a new connectivity policy model that governs how tenant VPCs communicate, without requiring a firewall for basic isolation. Each VPC can be assigned one of three policies: community, where it talks to peers in its group; promiscuous, where it can reach any VPC; or isolated, where it can reach only promiscuous VPCs. The value here is operational simplicity. A lot of tenant isolation requirements are actually simple in intent: keep these tenants apart, let these ones talk. Expressing that through a policy rather than through firewall rules removes a whole category of configuration effort and a whole category of mistakes. In a multi-tenant platform, firewall rule sprawl is a genuine operational burden and a genuine source of outages when someone gets a rule wrong. Being able to express common isolation patterns as a simple policy in the VPC is the kind of change that reduces both effort and risk. As always with anything that controls tenant separation, I would test the boundary behavior carefully, because the whole point is isolation and you need to be certain it holds. But conceptually, this is a clean, sensible model that aligns with how providers actually think about tenant connectivity.
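The three policies are small enough to write down as a truth table. The sketch below encodes them as I read the description. The `Vpc` class, the `group` field, and the assumption that reachability is symmetric (a promiscuous VPC on either side is enough) are mine, not taken from the product, so treat it as a way to reason about test cases rather than a statement of how NSX evaluates the policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vpc:
    name: str
    policy: str                  # "community", "promiscuous", or "isolated"
    group: Optional[str] = None  # only meaningful for community VPCs


def can_communicate(a, b):
    """Apply the three policies as described in the text.

    Assumption: reachability is symmetric, so a promiscuous VPC on
    either side is enough to allow traffic.
    """
    if a.policy == "promiscuous" or b.policy == "promiscuous":
        return True
    if a.policy == "community" and b.policy == "community":
        return a.group is not None and a.group == b.group
    return False  # isolated-to-isolated and isolated-to-community


if __name__ == "__main__":
    # Hypothetical tenant layout.
    shared = Vpc("shared-services", "promiscuous")
    a_app = Vpc("tenant-a-app", "community", "tenant-a")
    a_db = Vpc("tenant-a-db", "community", "tenant-a")
    b_app = Vpc("tenant-b-app", "community", "tenant-b")
    locked = Vpc("tenant-c", "isolated")

    for left, right in [(a_app, a_db), (a_app, b_app),
                        (locked, shared), (locked, a_app)]:
        print(left.name, "<->", right.name, can_communicate(left, right))
```

A table like this is also a reasonable starting point for the boundary testing mentioned above: every pair that the model says should be denied is a pair worth probing in a lab.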

What Broadcom Wants You to Look At, and What You Should Actually Look At

Every release has a marketing centerpiece, and for 9.1, it is the AI story. I want to be fair to it and also honest about where it sits for a working operator, because the gap between the two is wide.

The AI-ready platform work lets VCF Operations integrate with RAG and MCP servers and connect its operational data to LLM clients. It is technically interesting, and I can see a future where querying your infrastructure state through a language model is genuinely useful for troubleshooting and for surfacing insights you would otherwise have to dig for. But that future is not today for most of the people actually keeping production platforms healthy. Right now, this is a capability to be aware of and experiment with in a lab, not something that changes the daily job of running VCF. I would not let it influence an upgrade decision one way or the other. If it matures the way the marketing suggests, it will be worth revisiting in a release or two, and I will judge it then on whether it actually saves time rather than on the pitch.

The same caution applies to technology preview items, such as native S3-compatible object storage on vSAN. Preview means preview. It is worth knowing the direction, because object storage on the same vSAN cluster as your block and file is a genuinely useful idea for a provider, but it is not something to build a production plan around until it goes GA. And a large share of the VCF Automation enhancements only matter if you run Automation as a self-service platform for application teams. If you do, features like appstack formations and self-service namespace creation are relevant. If you do not, they are not part of your world. The one Automation item I would flag for a specific audience is the vCloud Director migration path, because if you are a current VCD shop, that is a real, formalized route to the newer model, and it belongs on your radar even if nothing else in this section does.

Closing Thoughts

VCF 9.1 is a solid release. Not because of the long feature list, but because a handful of the changes genuinely improve how you run the platform every day. Let me call out the three that I think matter most and why.

The patching changes are the ones I am most happy about. vCenter Quick Patch and ESX Live Patch together attack the single most tedious part of running VCF at scale: the endless cycle of maintenance windows just to apply security fixes. When you run hundreds of hosts, every patch that used to require a full evacuate-and-reboot was hours of planning, coordination, and babysitting. Being able to live-patch a host, or apply a vCenter security fix with little or no downtime, is not a flashy feature. It is the kind of thing that gives you back real hours in your week and reduces the number of after-hours maintenance windows your team has to sit through. This is the improvement I would mention first to anyone asking why 9.1 is worth it.

The cluster upgrade changes are the second. The per-host upgrade selection, which lets you skip a problem host and finish the rest of the cluster, fixes something that has frustrated me more than once. The old all-or-nothing model meant a single misbehaving host could stall an entire cluster upgrade, and you were stuck either fixing it under time pressure in the middle of the window or rolling the whole thing back. Being able to isolate the outlier, finish the healthy hosts, and deal with the problem host separately is exactly how upgrades should have worked all along. Combined with the claimed 4x improvement in parallel cluster upgrades, this genuinely changes how much you can get done in a single maintenance window.

On the storage side, the vSAN changes are more situational, but if you run disaggregated vSAN, they are significant. Being able to share vSAN storage across vCenter boundaries and separate VCF deployments, with data-in-transit encryption between the client and storage clusters, enables designs that were previously awkward or impossible. For a traditional HCI setup with local vSAN, you will not notice most of this. But for anyone running compute and storage separately, or anyone planning a disaggregated architecture, this is the release where that model gets meaningfully more flexible. I would want to test the cross-vCenter datastore mounting carefully before relying on it in production, but the direction is right.

Everything else is context: valuable if it matches your environment, interesting if you are tracking where the platform is heading, and safe to skip if it does not apply to you. The skill in reading a release this large is not absorbing all eighty features. It is knowing which ten matter for your platform and ignoring the rest until they do.

You can check the official Broadcom/VMware overview here: What’s New in VMware Cloud Foundation 9.1.

Note: Some diagrams in this article are original recreations created by ProVirtualzone for explanation and commentary purposes. They are based on publicly available VMware by Broadcom technical materials, product documentation, and architectural concepts. VMware, VMware Cloud Foundation, vSAN, NSX, and related product names are trademarks of Broadcom. All original VMware/Broadcom materials remain the property of Broadcom. This article is independent commentary and is not affiliated with or endorsed by Broadcom.

Share this article if you think it is worth sharing. If you have any questions or comments, comment here, or contact me on Twitter (yes, for me it is still Twitter, not X) or LinkedIn.
