Building Resilient SAP Infrastructure on VCF < ProVirtualzone

Part 1: The Architecture

Series: Building Resilient SAP Infrastructure on VCF

If you’ve been reading my blog, you know my background is in VMware, virtualization infrastructure, and backup. That’s where I’ve spent most of my career. SAP is not my primary area, and I won’t pretend it is. However, years ago I worked extensively with Oracle databases and high-availability designs, and that experience gave me a solid foundation in how enterprise database workloads operate, including replication, clustering, failover, and the associated storage requirements.

Recently, I started thinking about how I would design a fully high-availability architecture for an SAP landscape running on VMware Cloud Foundation (VCF), stretched across multiple data centers. I drew on what I knew from the VMware and storage side, combined it with what I remembered from my Oracle days, and then spent considerable time researching SAP HANA System Replication, Pacemaker clustering for SAP, and how these technologies fit together in a modern infrastructure design.

This blog series is the result of that work. I won’t walk through button clicks or step-by-step installation guides; there’s plenty of documentation for that. Instead, I want to focus on the architecture decisions, the reasoning behind each layer of the design, the alternatives I evaluated and rejected, and the pitfalls I found along the way. If you are a VMware or infrastructure architect who gets pulled into SAP projects and needs to understand how the infrastructure side fits together, this series is for you.

Since the design covers multiple technology domains, including VMware Cloud Foundation, NetApp storage, SAP HANA, and Linux clustering, I decided to split this into a series of posts, each one focusing on a specific layer:

  • Part 1 (this post): The scenario, the design principles, and the overall architecture
  • Part 2: Compute and storage, VCF stretched cluster with NetApp MetroCluster
  • Part 3: SAP HA, HANA System Replication, Pacemaker, and ENSA2
  • Part 4: Disaster recovery with Veeam, ransomware attack and recovery scenario, and lessons learned

All posts in this series are tagged Building Resilient SAP Infrastructure on VCF.

Let’s start with the big picture.

The Scenario

Imagine you need to design a production SAP environment running on VMware Cloud Foundation. The landscape includes S/4HANA, BW/4HANA, SAP Process Orchestration, and several supporting systems, such as Web Dispatcher, Content Server, and Solution Manager. Alongside the SAP workloads, there are other enterprise applications, such as middleware, file services, and monitoring tools, that share the same infrastructure and require the same level of protection.

The requirements from the business are clear:

  • Near-zero downtime for all production SAP systems
  • Zero data loss (RPO = 0) for the HANA databases
  • Automatic failover with no manual intervention during a site failure
  • Ransomware resilience built into the infrastructure, not bolted on as an afterthought
  • A separate disaster recovery site as the last line of defense, accepting longer recovery times, and serving as the guaranteed clean recovery point in case of a ransomware attack

You have four datacenter sites available. Two of them are in the same metro area, connected by a low-latency dark-fiber backbone. A third site, in a different region, is used for asynchronous replication and quorum services. A fourth site, geographically separated, serves purely as a disaster recovery location.


Why Latency Changes Everything

Before designing anything, you need to understand one number: the round-trip latency between your two primary sites. This single measurement determines which technologies are available to you and which ones are off the table.

In this design, the two primary sites are connected at sub-1ms round-trip latency over a dedicated DWDM (Dense Wavelength Division Multiplexing) metro backbone. That sub-1ms number is critical because it opens up the full spectrum of synchronous technologies:

  • NetApp MetroCluster requires 10ms RTT or less, but the lower the better because every synchronous write pays the latency penalty
  • VCF Stretched Cluster is supported by VMware at up to 10ms RTT when using external storage like MetroCluster, and up to 5ms for vSAN stretched clusters
  • SAP HANA System Replication SYNC mode is recommended by SAP at sub-1ms latency. At higher latencies, you’re forced into async or sync-mem mode, which means accepting a data loss window

At sub-1ms, you don’t have to compromise. You can run full synchronous replication at every layer (storage, database, and application) without any meaningful performance impact. If your latency were 5ms or 10ms, you’d be making tradeoffs at every layer, and this design would look completely different.

This is the first lesson: measure your inter-site latency before you design anything. Don’t assume, measure. The difference between 0.5ms and 5ms is the difference between a fully synchronous architecture and one full of compromises.
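The thresholds above can be turned into a quick sanity check. Here is a minimal sketch (the peer address is a placeholder, and the cutoffs are simply the numbers quoted in this post, not an official support matrix):

```shell
# Quick sanity check against the RTT thresholds discussed above.
# Measure first, e.g.:  ping -c 100 <peer-site-ip>  and take the average RTT.
check_rtt() {
  awk -v r="$1" 'BEGIN {
    print (r <  1  ? "HANA HSR SYNC: ok"                : "HANA HSR SYNC: syncmem/async only")
    print (r <= 5  ? "vSAN stretched cluster: ok"        : "vSAN stretched cluster: out of range")
    print (r <= 10 ? "MetroCluster / VCF stretched: ok"  : "MetroCluster / VCF stretched: out of range")
  }'
}

check_rtt 0.5   # sub-1ms: every synchronous layer is available
```

Run it with your measured average RTT; anything that comes back "out of range" should be off the table for your design.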

The Four-Layer HA Architecture

The core of this design is four independent high-availability mechanisms, each one protecting against a different type of failure. No single technology covers everything, and that’s the point. You want defense in depth.

Layer 1: VCF Stretched Cluster

VMware Cloud Foundation provides the full software-defined datacenter stack, including vSphere for compute, NSX for networking, and SDDC Manager for lifecycle management. In this design, a single VCF workload domain stretches across both primary sites. All ESXi hosts at both sites are members of the same vSphere cluster, and VMware HA treats the entire cluster as one failure domain. If a host fails at Site A, VMs restart automatically on any available host, including hosts at Site B. vMotion works seamlessly between sites, so you can rebalance workloads or evacuate a site for planned maintenance without any downtime.

Using VCF rather than standalone vSphere gives you NSX for consistent network policy across both sites and SDDC Manager for standardized host provisioning and patching. This matters in a stretched cluster because you need both sites to be identically configured at all times.

The key enabler for a stretched cluster is shared storage visible from both sites, which brings us to Layer 2.

Layer 2: NetApp MetroCluster (Synchronous Storage Mirroring)

NetApp MetroCluster places a matched pair of storage controllers at each site and synchronously mirrors every write between them. From the ESXi hosts’ perspective, the NFS datastores are available at both sites simultaneously. If Site A loses its storage controller, Site B’s controller takes over automatically, and the VMs don’t even notice.

This is what makes the stretched cluster work. Without shared storage across both sites, VMware HA can only restart VMs on the site where their datastores live. With MetroCluster, the storage is everywhere, so the VMs can restart anywhere.

A MetroCluster Mediator runs at Site C (the async/quorum site) to handle automatic unattended switchover (AUSO) in the event of a complete site failure. This third-site witness is essential. Without it, MetroCluster cannot automatically determine which site should survive a network partition.

Note: The Mediator is not the only option here. NetApp also offers MetroCluster Tiebreaker, a separate monitoring application that can likewise trigger automatic switchover, although you cannot use both with the same MetroCluster configuration. For MetroCluster IP, the Mediator is the recommended choice because it verifies that both SyncMirror and DR mirroring are synchronized before initiating a switchover, reducing the risk of data loss; the Tiebreaker in active mode can trigger a switchover even when mirrors are out of sync. Without either the Mediator or the Tiebreaker, any switchover after a site failure would require manual intervention, which defeats the purpose of an automatic HA design. I will go deeper into this in Part 2.
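For a sense of what the third-site witness setup looks like on MetroCluster IP, here is a sketch of the ONTAP CLI workflow (the Mediator address is a placeholder for your Site C host):

```shell
# Register the ONTAP Mediator running at Site C with the MetroCluster IP
# configuration (run from one of the two clusters; prompts for credentials).
metrocluster configuration-settings mediator add -mediator-address 10.30.0.10

# Confirm the Mediator is reachable and quorum is established.
metrocluster configuration-settings mediator show
```

Once the Mediator is registered and healthy, automatic unattended switchover is armed for a complete site failure.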

One important thing to keep in mind with synchronous mirroring: MetroCluster faithfully replicates everything, including damage. NetApp’s own documentation confirms that MetroCluster duplicates data on a transaction-by-transaction basis. This means if ransomware encrypts your production data on Site A, MetroCluster will mirror those encrypted blocks to Site B in real time. Both sites are now compromised. This is not a flaw in MetroCluster; it is how synchronous replication works by design, and it applies to any synchronous mirroring technology.

This is why the architecture includes additional layers of ransomware defense at the storage level. NetApp ONTAP Autonomous Ransomware Protection (ARP), supported on MetroCluster configurations since ONTAP 9.10.1, uses machine learning to detect unusual encryption patterns and file activity in real time. When ARP detects a potential attack, it automatically creates a locked snapshot of the pre-attack data. That snapshot is tamperproof and cannot be deleted even by a compromised administrator account. So even though MetroCluster continues mirroring the encrypted data, the clean recovery point is preserved on the local controller.
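Turning ARP on is a per-volume operation in ONTAP. A sketch, with placeholder SVM and volume names (ARP initially runs in a learning mode before you move it to active protection):

```shell
# Enable Autonomous Ransomware Protection on the HANA data volume
# (svm_sap and hana_data are placeholder names).
security anti-ransomware volume enable -vserver svm_sap -volume hana_data

# Check the ARP state and any suspected attack events for the volume.
security anti-ransomware volume show -vserver svm_sap -volume hana_data
```

The important point for this design is that ARP runs on each local controller, so its locked snapshots survive even while MetroCluster keeps mirroring whatever is written to the volume.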

Why does this matter for SAP specifically? Because SAP systems are actively being targeted. In 2025, multiple critical vulnerabilities in SAP S/4HANA and NetWeaver were exploited in the wild by ransomware groups and nation-state threat actors, compromising hundreds of SAP systems across multiple industries. A critical S/4HANA code injection vulnerability (CVE-2025-42957, CVSS 9.9) was confirmed under active exploitation, with security researchers warning that successful exploitation can lead to full system compromise and ransomware deployment with minimal effort. This is not a theoretical risk.

I will cover the full ransomware attack and recovery scenario in detail in Part 4 of this series, including how ARP, tamperproof snapshots, async replication to Site C, and Veeam immutable backups at Site D work together to get you back to a clean state.

Layer 3: SAP HANA System Replication (SYNC Mode)

While MetroCluster protects the storage, HANA has its own built-in replication mechanism. HANA System Replication (HSR) maintains a full in-memory copy of the database on a secondary system at Site B. In SYNC mode, every transaction is confirmed on both sides before being acknowledged to the application, so RPO is truly zero.
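Setting up HSR in SYNC mode is conceptually a two-step handshake. A hedged sketch with `hdbnsutil` (hostnames, site names, and the instance number are placeholders; run as the `<sid>adm` user):

```shell
# On the primary at Site A: enable system replication.
hdbnsutil -sr_enable --name=SITEA

# On the secondary at Site B (with HANA stopped): register against the primary.
hdbnsutil -sr_register --remoteHost=hana-a --remoteInstance=00 \
  --replicationMode=sync --operationMode=logreplay --name=SITEB

# Verify: both sites should report replication mode "sync" and active status.
hdbnsutil -sr_state
```

The `logreplay` operation mode is what keeps the secondary’s in-memory image continuously current, which is exactly what enables the fast takeover discussed below.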

Why do you need this if MetroCluster already mirrors the storage? Because HANA is an in-memory database. If a HANA host crashes, the database needs to reload from disk. With HSR, the secondary already has the data preloaded in memory, so the takeover takes 1 to 3 minutes instead of the 15 to 30 minutes it would take to reload from storage.

A Pacemaker cluster manages the HANA takeover. It monitors the HANA processes, detects failures, promotes the secondary to primary, and floats the virtual IP address to the new primary.
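As an illustration of what that Pacemaker configuration looks like on a RHEL-style cluster with `pcs` (SID, instance number, IP, and scores are placeholder values, not a tested configuration; SUSE syntax differs):

```shell
# Topology agent: monitors HSR state on both nodes.
pcs resource create SAPHanaTopology_HDB_00 ocf:heartbeat:SAPHanaTopology \
  SID=HDB InstanceNumber=00 clone meta interleave=true

# Promotable SAPHana resource: promotes the HSR secondary on failure.
pcs resource create SAPHana_HDB_00 ocf:heartbeat:SAPHana \
  SID=HDB InstanceNumber=00 PREFER_SITE_TAKEOVER=true AUTOMATED_REGISTER=true \
  promotable meta notify=true interleave=true

# Virtual IP that floats to whichever node holds the primary role.
pcs resource create vip_HDB ocf:heartbeat:IPaddr2 ip=192.168.10.50
pcs constraint colocation add vip_HDB with Promoted SAPHana_HDB_00-clone
pcs constraint order SAPHanaTopology_HDB_00-clone then SAPHana_HDB_00-clone
```

The colocation constraint is what implements the “floating” virtual IP: clients always connect to the same address, regardless of which site currently holds the primary.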

Layer 4: SAP Application HA (ENSA2 + Pacemaker)

The SAP application layer has its own single point of failure: the ABAP Central Services (ASCS) instance, which manages the enqueue lock table. If ASCS goes down, all SAP users are affected.

SAP’s answer is ENSA2 (Standalone Enqueue Server 2), which runs a replica of the lock table on a second node. Pacemaker manages the failover between the ASCS primary and the ERS (Enqueue Replication Server) secondary across both sites. Lock table failover happens in 30 to 60 seconds.
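The ASCS/ERS pair is typically modeled with the SAPInstance resource agent. A hedged `pcs` sketch (SID, instance numbers, virtual hostnames, and profile paths are placeholders):

```shell
# ASCS instance group (holds the enqueue lock table).
pcs resource create ascs_S4H ocf:heartbeat:SAPInstance \
  InstanceName=S4H_ASCS00_sapascs \
  START_PROFILE=/sapmnt/S4H/profile/S4H_ASCS00_sapascs \
  AUTOMATIC_RECOVER=false meta resource-stickiness=5000 --group grp_ascs

# ERS instance group (holds the lock table replica).
pcs resource create ers_S4H ocf:heartbeat:SAPInstance \
  InstanceName=S4H_ERS10_sapers \
  START_PROFILE=/sapmnt/S4H/profile/S4H_ERS10_sapers \
  AUTOMATIC_RECOVER=false --group grp_ers

# Keep ASCS and ERS on different nodes so a single host failure
# never takes out both the lock table and its replica.
pcs constraint colocation add grp_ers with grp_ascs -5000
```

One nice property of ENSA2 versus the older ENSA1 is that a failed ASCS no longer has to restart on the exact node running the ERS; it can start anywhere in the cluster and pull the lock table from the replicator over the network.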

How the Four Layers Work Together

The key insight is that each layer protects against a different failure domain.

No single layer alone covers all scenarios. VMware HA can restart VMs, but it doesn’t understand HANA or SAP. MetroCluster protects storage but doesn’t manage application failover. HANA SR handles the database but not the application layer. You need all four.


The Fourth Site: Disaster Recovery and Ransomware Recovery with Veeam


Sites A, B, and C form the high-availability and quorum infrastructure. But what if you lose both primary sites? Or, more realistically, in today’s threat landscape, what if a ransomware attack compromises your entire production environment across both metro sites?

That’s where Site D comes in, and this is arguably the most important site in the entire design.

Site D is powered entirely by Veeam, with Object First Ootbi as the primary backup repository. Ootbi (Out-of-the-Box Immutability) is a purpose-built backup storage appliance designed specifically for Veeam, with immutability enabled by default upon deployment. There’s no MetroCluster, no stretched cluster, no Pacemaker at this site, just Veeam replicas and backups stored on Ootbi’s S3-compatible immutable object storage that cannot be encrypted, deleted, or tampered with.

Of course, to get data to Site D, you need Veeam infrastructure at the source. Veeam backup servers and proxies run at Sites A and B as part of the VCF workload domain, with backup and replication jobs sending data over the WAN to the Ootbi appliance at Site D. I will go into the full Veeam architecture, job design, and retention policies in Part 4.

This separation is critical. As I mentioned in the MetroCluster section, synchronous mirroring replicates damage as faithfully as it replicates good data. If ransomware encrypts your HANA databases and SAP systems on Site A, Site B has the same encrypted data within milliseconds. Even the NetApp ONTAP ransomware detection and immutable snapshots on Sites A and B, while extremely valuable, are still part of the same storage infrastructure that was attacked. Site D running Veeam with Ootbi on a completely independent technology stack gives you a recovery point that is architecturally isolated from whatever compromised your production environment. Ootbi’s immutability is not a setting you can disable or a policy an attacker can modify. It is built into the appliance at the hardware and firmware level, which is exactly what you need when you assume your admin credentials may be compromised.

Given the wave of real-world attacks targeting SAP systems throughout 2025, including a major global campaign that compromised hundreds of SAP installations and ransomware groups actively exploiting S/4HANA vulnerabilities, this is not a theoretical exercise. If your organization runs SAP, you need to assume your SAP landscape is a target, and your disaster recovery design should account for a scenario in which your entire production storage infrastructure is simultaneously compromised.

Recovery from Site D is measured in hours, not minutes. You restore VMs from Veeam replicas stored on Ootbi, bring up HANA from the last backup, and manually rebuild the SAP cluster. But you are recovering clean data from an untouched infrastructure, and in a ransomware scenario, that is the only thing that matters.

For those of you who work with Veeam regularly, this is a good example of how Veeam, paired with a purpose-built immutable appliance like Object First (Ootbi), fits into a larger HA design. Not as a replacement for synchronous replication, but as the independent, immutable safety net that works even when everything else has been compromised. If you want to learn more about Ootbi, I have covered it in previous posts on this blog. I will cover the full ransomware attack and recovery scenario in detail in Part 4.

Designs I Evaluated and Rejected

Before arriving at the four-layer architecture, I evaluated several alternatives. Understanding why they don’t work is as important as understanding why the chosen design does.

HANA Scale-Out Across Datacenters

In a HANA Scale-Out configuration, multiple HANA worker nodes share the same database across an NFS file system. SAP recommends a maximum of 1ms round-trip latency for inter-node communication in a scale-out cluster and requires high-bandwidth dedicated networking between nodes. Even though our metro link delivers sub-1ms latency, distributing Scale-Out nodes across two physical data centers introduces unacceptable risk of network partitioning. If the inter-site link goes down even briefly, the Scale-Out system loses nodes, and the database goes down. This is why all major platform providers, including AWS, Azure, and Google Cloud, deploy HANA scale-out nodes within a single availability zone or placement group, never across sites.

HANA System Replication is SAP’s recommended approach for cross-datacenter HA. Scale-Out is designed for scaling within a single site.

Standalone Storage Without MetroCluster

You could put an independent NetApp at each site and rely solely on HANA System Replication for cross-site protection. This works for the HANA databases, but it leaves all your non-HANA production VMs, such as Web Dispatcher, Content Server, PO, and middleware, without any automatic cross-site recovery during a site failure.

At sub-1ms latency, MetroCluster incurs no meaningful performance penalty and protects every VM, not just those running HANA. Giving up that protection to save on MetroCluster licensing is, in my opinion, not worth the risk.

vSAN Stretched Cluster Instead of MetroCluster

VMware vSAN stretched clusters are limited to 5ms RTT between sites. While our sub-1ms latency technically qualifies, I chose MetroCluster over vSAN for several reasons. MetroCluster provides synchronous mirroring at the storage layer without consuming ESXi host CPU and memory resources. It supports automatic unattended switchover with a mediator. And NetApp’s ONTAP ransomware detection, with automatic snapshots, adds a data-protection layer that vSAN doesn’t provide.

This four-layer approach, combined with an architecturally isolated DR site, provides a design that handles everything from a single-host failure to a full ransomware attack across both metro sites.

What’s Next

In Part 2: Compute and Storage, I’ll go deeper into the compute and storage layers, including how to size the ESXi hosts for N+1 redundancy (and the capacity planning trap that catches most people), how VCF stretched clusters actually work across two sites, how MetroCluster synchronous mirroring is configured, and the role of the third site for quorum.

In Part 3: SAP HA, I’ll cover the SAP-specific HA layer, including HANA System Replication in SYNC mode, how Pacemaker manages the failover with fence agents for VMware (this is where things get interesting), and ENSA2 for the application layer.

In Part 4: DR, Ransomware, and Lessons Learned, I’ll walk through the full failure-scenario matrix, the Veeam DR design at Site D with Object First Ootbi as the immutable backup target, and a detailed ransomware attack-and-recovery scenario. This last part is where it all comes together, because it forces you to answer the hard question: what happens when your synchronous replication faithfully mirrors encrypted data to both metro sites, and how does the architecture get you back to a clean state using Veeam and Ootbi on a completely independent infrastructure? I’ll also cover the lessons learned and gotchas I discovered during this research, including some that could easily break your entire HA design if you’re not aware of them.

Stay tuned, and as always, if you have any questions or want to discuss this design, leave a comment or reach out to me on social media.


Share this article if you think it is worth sharing. If you have any questions or comments, leave them here or contact me on Twitter (yes, for me it’s not X, but still Twitter) or LinkedIn, since I am getting off Twitter.
