[AMDGPU] Update gfx90a memory model support

Update AMDGPU gfx90a memory model to make coarse grain memory allocations consistent when fine grained system scope atomic acquire and release is performed. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D105137
2024-11-25 04:02:41 +01:00 · 2021-05-07 20:55:23 +00:00 · 2021-05-07 20:55:23 +00:00 · 47ca9dba51
commit 47ca9dba51
parent 16bf4a4717
7 changed files with 625 additions and 51 deletions
--- a/docs/AMDGPUUsage.rst
+++ b/docs/AMDGPUUsage.rst
@ -6093,10 +6093,10 @@ For GFX90A:
    ensures a previous vector memory operation has completed before executing a
    subsequent vector memory or LDS operation and so can be used to meet the
    requirements of acquire and release.
-  * The L2 cache of one agent can be kept coherent with other agents by using
-    the MTYPE CC (cache-coherent) with the PTE C-bit for memory local to the L2,
-    and MTYPE UC (uncached) with the PTE C-bit set for memory not local to the
-    L2.
+  * The L2 cache of one agent can be kept coherent with other agents by:
+    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
+    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
+    the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.

    * Any local memory cache lines will be automatically invalidated by writes
      from CUs associated with other L2 caches, or writes from the CPU, due to
@ -6108,13 +6108,21 @@ For GFX90A:
      the CPU cache due to the L2 probe filter and and the PTE C-bit being set.
    * Since all work-groups on the same agent share the same L2, no L2
      invalidation or writeback is required for coherence.
-    * Since local memory reads and writes of work-groups in different agents
-      access memory using MTYPE CC, no L2 invalidate or writeback is required
-      for coherence. MTYPE CC causes write through to DRAM and local reads to be
-      invalidated by remote writes with with the PTE C-bit.
-    * Since remote memory reads and writes of work-groups in different agents
-      access memory using MTYPE UC, no L2 invalidate or writeback is required
-      for coherence. MTYPE UC causes direct accesses to DRAM.
+    * To ensure coherence of local and remote memory writes of work-groups in
+      different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
+      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
+      ()used for remote coarse grain memory). Note that MTYPE CC (used for local
+      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
+      remote fine grain memory) bypasses the L2, so both will never result in
+      dirty L2 cache lines.
+    * To ensure coherence of local and remote memory reads of work-groups in
+      different agents a ``buffer_invl2`` is required. It will invalidate L2
+      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
+      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
+      coarse memory) cause local reads to be invalidated by remote writes with
+      with the PTE C-bit so these cache lines are not invalidated. Note that
+      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
+      never result in L2 cache lines that need to be invalidated.

  * PCIe access from the GPU to the CPU memory is kept coherent by using the
    MTYPE UC (uncached) which bypasses the L2.
@ -6384,14 +6392,15 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
-                                                             following
+                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the load
                                                             has completed
                                                             before invalidating
                                                             the cache.

-                                                         3. buffer_wbinvl1_vol
+                                                         3. buffer_invl2;
+                                                            buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
@ -6401,7 +6410,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
+                                                             nor see stale L2 MTYPE
+                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -6444,13 +6455,15 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
+                                                             buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the flat_load
                                                             has completed
                                                             before invalidating
                                                             the caches.

-                                                         3. buffer_wbinvl1_vol
+                                                         3. buffer_invl2;
+                                                            buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
@ -6459,8 +6472,10 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
-                                                             L1 loads will not see
-                                                             stale global data.
+                                                             loads will not see
+                                                             stale L1 global data,
+                                                             nor see stale L2 MTYPE
+                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -6579,7 +6594,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                         2. s_waitcnt vmcnt(0)

                                                           - Must happen before
-                                                             following
+                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
@ -6587,7 +6602,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             invalidating the
                                                             caches.

-                                                         3. buffer_wbinvl1_vol
+                                                         3. buffer_invl2;
+                                                            buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
@ -6597,8 +6613,10 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
-                                                             MTYPE RW and CC L2 memory
+                                                             stale L1 global data,
+                                                             nor see stale L2 MTYPE
+                                                             NC global data.
+                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.

@ -6641,6 +6659,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
+                                                             buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
@ -6648,7 +6667,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             invalidating the
                                                             caches.

-                                                         3. buffer_wbinvl1_vol
+                                                         3. buffer_invl2;
+                                                            buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
@ -6658,7 +6678,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
+                                                             nor see stale L2 MTYPE
+                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -6734,7 +6756,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             value read by the
                                                             fence-paired-atomic.

-                                                         3. buffer_wbinvl1_vol
+                                                         2. buffer_wbinvl1_vol

                                                           - If not TgSplit execution
                                                             mode, omit.
@ -6872,7 +6894,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
-                                                             the following
+                                                             the following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures that the
                                                             fence-paired atomic
@ -6887,7 +6909,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             the
                                                             fence-paired-atomic.

-                                                         2. buffer_wbinvl1_vol
+                                                         2. buffer_invl2;
+                                                            buffer_wbinvl1_vol

                                                           - Must happen before any
                                                             following global/generic
@ -6897,7 +6920,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
+                                                             nor see stale L2 MTYPE
+                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -6991,8 +7016,18 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             released.

                                                         2. buffer/global/flat_store
-     store atomic release      - system       - global   1. s_waitcnt lgkmcnt(0) &
-                                              - generic     vmcnt(0)
+     store atomic release      - system       - global   1. buffer_wbl2
+                                              - generic
+                                                           - Must happen before
+                                                             following s_waitcnt.
+                                                           - Performs L2 writeback to
+                                                             ensure previous
+                                                             global/generic
+                                                             store/atomicrmw are
+                                                             visible at system scope.
+
+                                                         2. s_waitcnt lgkmcnt(0) &
+                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
@ -7035,7 +7070,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             store that is being
                                                             released.

-                                                         2. buffer/global/flat_store
+                                                         3. buffer/global/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
@ -7123,8 +7158,18 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             is being released.

                                                         2. buffer/global/flat_atomic
-     atomicrmw    release      - system       - global   1. s_waitcnt lgkmcnt(0) &
-                                              - generic     vmcnt(0)
+     atomicrmw    release      - system       - global   1. buffer_wbl2
+                                              - generic
+                                                           - Must happen before
+                                                             following s_waitcnt.
+                                                           - Performs L2 writeback to
+                                                             ensure previous
+                                                             global/generic
+                                                             store/atomicrmw are
+                                                             visible at system scope.
+
+                                                         2. s_waitcnt lgkmcnt(0) &
+                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
@ -7165,7 +7210,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             store that is being
                                                             released.

-                                                         2. buffer/global/flat_atomic
+                                                         3. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
@ -7298,7 +7343,20 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             following
                                                             fence-paired-atomic.

-     fence        release      - system       *none*     1. s_waitcnt lgkmcnt(0) &
+     fence        release      - system       *none*     1. buffer_wbl2
+
+                                                           - If OpenCL and
+                                                             address space is
+                                                             local, omit.
+                                                           - Must happen before
+                                                             following s_waitcnt.
+                                                           - Performs L2 writeback to
+                                                             ensure previous
+                                                             global/generic
+                                                             store/atomicrmw are
+                                                             visible at system scope.
+
+                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
@ -7588,7 +7646,17 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             will not see stale
                                                             global data.

-     atomicrmw    acq_rel      - system       - global   1. s_waitcnt lgkmcnt(0) &
+     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
+
+                                                           - Must happen before
+                                                             following s_waitcnt.
+                                                           - Performs L2 writeback to
+                                                             ensure previous
+                                                             global/generic
+                                                             store/atomicrmw are
+                                                             visible at system scope.
+
+                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
@ -7629,11 +7697,11 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             atomicrmw that is
                                                             being released.

-                                                         2. buffer/global_atomic
-                                                         3. s_waitcnt vmcnt(0)
+                                                         3. buffer/global_atomic
+                                                         4. s_waitcnt vmcnt(0)

                                                           - Must happen before
-                                                             following
+                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
@ -7641,7 +7709,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             invalidating the
                                                             caches.

-                                                         4. buffer_wbinvl1_vol
+                                                         5. buffer_invl2;
+                                                            buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
@ -7651,7 +7720,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
+                                                             nor see stale L2 MTYPE
+                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -7726,7 +7797,17 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             will not see stale
                                                             global data.

-     atomicrmw    acq_rel      - system       - generic  1. s_waitcnt lgkmcnt(0) &
+     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
+
+                                                           - Must happen before
+                                                             following s_waitcnt.
+                                                           - Performs L2 writeback to
+                                                             ensure previous
+                                                             global/generic
+                                                             store/atomicrmw are
+                                                             visible at system scope.
+
+                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
@ -7767,8 +7848,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             atomicrmw that is
                                                             being released.

-                                                         2. flat_atomic
-                                                         3. s_waitcnt vmcnt(0) &
+                                                         3. flat_atomic
+                                                         4. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)

                                                           - If TgSplit execution mode,
@ -7776,7 +7857,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
-                                                             following
+                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
@ -7784,7 +7865,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             invalidating the
                                                             caches.

-                                                         4. buffer_wbinvl1_vol
+                                                         5. buffer_invl2;
+                                                            buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
@ -7794,7 +7876,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
+                                                             nor see stale L2 MTYPE
+                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -7902,7 +7986,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             the
                                                             acquire-fence-paired-atomic.

-                                                         3. buffer_wbinvl1_vol
+                                                         2. buffer_wbinvl1_vol

                                                           - If not TgSplit execution
                                                             mode, omit.
@ -8007,7 +8091,20 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             requirements of
                                                             acquire.

-     fence        acq_rel      - system       *none*     1. s_waitcnt lgkmcnt(0) &
+     fence        acq_rel      - system       *none*     1. buffer_wbl2
+
+                                                           - If OpenCL and
+                                                             address space is
+                                                             local, omit.
+                                                           - Must happen before
+                                                             following s_waitcnt.
+                                                           - Performs L2 writeback to
+                                                             ensure previous
+                                                             global/generic
+                                                             store/atomicrmw are
+                                                             visible at system scope.
+
+                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)

                                                           - If TgSplit execution mode,
@ -8048,7 +8145,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
-                                                             the following
+                                                             the following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures that the
                                                             preceding
@ -8087,7 +8184,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             requirements of
                                                             release.

-                                                         2.  buffer_wbinvl1_vol
+                                                         3.  buffer_invl2;
+                                                             buffer_wbinvl1_vol

                                                           - Must happen before
                                                             any following
@ -8098,7 +8196,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
+                                                             nor see stale L2 MTYPE
+                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
--- a/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@ -452,6 +452,12 @@ public:
                     SIAtomicScope Scope,
                     SIAtomicAddrSpace AddrSpace,
                     Position Pos) const override;
+
+  bool insertRelease(MachineBasicBlock::iterator &MI,
+                     SIAtomicScope Scope,
+                     SIAtomicAddrSpace AddrSpace,
+                     bool IsCrossAddrSpaceOrdering,
+                     Position Pos) const override;
 };

 class SIGfx10CacheControl : public SIGfx7CacheControl {
@ -1265,9 +1271,26 @@ bool SIGfx90ACacheControl::insertAcquire(MachineBasicBlock::iterator &MI,

  bool Changed = false;

+  MachineBasicBlock &MBB = *MI->getParent();
+  DebugLoc DL = MI->getDebugLoc();
+
+  if (Pos == Position::AFTER)
+    ++MI;
+
  if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {
    switch (Scope) {
    case SIAtomicScope::SYSTEM:
+      // Ensures that following loads will not see stale remote VMEM data or
+      // stale local VMEM data with MTYPE NC. Local VMEM data with MTYPE RW and
+      // CC will never be stale due to the local memory probes.
+      BuildMI(MBB, MI, DL, TII->get(AMDGPU::BUFFER_INVL2));
+      // Inserting a "S_WAITCNT vmcnt(0)" after is not required because the
+      // hardware does not reorder memory operations by the same wave with
+      // respect to a preceding "BUFFER_INVL2". The invalidate is guaranteed to
+      // remove any cache lines of earlier writes by the same wave and ensures
+      // later reads by the same wave will refetch the cache lines.
+      Changed = true;
+      break;
    case SIAtomicScope::AGENT:
      // Same as GFX7.
      break;
@ -1297,11 +1320,62 @@ bool SIGfx90ACacheControl::insertAcquire(MachineBasicBlock::iterator &MI,

  /// Other address spaces do not have a cache.

+  if (Pos == Position::AFTER)
+    --MI;
+
  Changed |= SIGfx7CacheControl::insertAcquire(MI, Scope, AddrSpace, Pos);

  return Changed;
 }

+bool SIGfx90ACacheControl::insertRelease(MachineBasicBlock::iterator &MI,
+                                         SIAtomicScope Scope,
+                                         SIAtomicAddrSpace AddrSpace,
+                                         bool IsCrossAddrSpaceOrdering,
+                                         Position Pos) const {
+  bool Changed = false;
+
+  MachineBasicBlock &MBB = *MI->getParent();
+  DebugLoc DL = MI->getDebugLoc();
+
+  if (Pos == Position::AFTER)
+    ++MI;
+
+  if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {
+    switch (Scope) {
+    case SIAtomicScope::SYSTEM:
+      // Inserting a "S_WAITCNT vmcnt(0)" before is not required because the
+      // hardware does not reorder memory operations by the same wave with
+      // respect to a following "BUFFER_WBL2". The "BUFFER_WBL2" is guaranteed
+      // to initiate writeback of any dirty cache lines of earlier writes by the
+      // same wave. A "S_WAITCNT vmcnt(0)" is needed after to ensure the
+      // writeback has completed.
+      BuildMI(MBB, MI, DL, TII->get(AMDGPU::BUFFER_WBL2));
+      // Followed by same as GFX7, which will ensure the necessary "S_WAITCNT
+      // vmcnt(0)" needed by the "BUFFER_WBL2".
+      Changed = true;
+      break;
+    case SIAtomicScope::AGENT:
+    case SIAtomicScope::WORKGROUP:
+    case SIAtomicScope::WAVEFRONT:
+    case SIAtomicScope::SINGLETHREAD:
+      // Same as GFX7.
+      break;
+    default:
+      llvm_unreachable("Unsupported synchronization scope");
+    }
+  }
+
+  if (Pos == Position::AFTER)
+    --MI;
+
+  Changed |=
+      SIGfx7CacheControl::insertRelease(MI, Scope, AddrSpace,
+                                        IsCrossAddrSpaceOrdering, Pos);
+
+  return Changed;
+}
+
 bool SIGfx10CacheControl::enableLoadCacheBypass(
    const MachineBasicBlock::iterator &MI,
    SIAtomicScope Scope,
--- a/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll
+++ b/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll
@ -424,9 +424,11 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat(double addrspace(1)*
 ; GFX90A-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], 4.0
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[0:1], v4, v[0:3], s[0:1] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
@ -470,9 +472,11 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat_system(double addrsp
 ; GFX90A-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], 4.0
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[0:1], v4, v[0:3], s[0:1] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
@ -526,9 +530,11 @@ define double @global_atomic_fadd_f64_rtn_pat(double addrspace(1)* %ptr, double
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[4:5], 4.0
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5], off glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
 ; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
@ -571,9 +577,11 @@ define double @global_atomic_fadd_f64_rtn_pat_system(double addrspace(1)* %ptr,
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[4:5], 4.0
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5], off glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
 ; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
@ -655,9 +663,11 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat(double* %ptr) #1 {
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], 4.0
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
@ -702,9 +712,11 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_system(double* %ptr) #
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], 4.0
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
@ -730,9 +742,11 @@ define double @flat_atomic_fadd_f64_rtn_pat(double* %ptr) #1 {
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[4:5], 4.0
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    flat_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
 ; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
@ -775,9 +789,11 @@ define double @flat_atomic_fadd_f64_rtn_pat_system(double* %ptr) #1 {
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[4:5], 4.0
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    flat_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
--- a/test/CodeGen/AMDGPU/global-atomics-fp.ll
+++ b/test/CodeGen/AMDGPU/global-atomics-fp.ll
@ -70,9 +70,11 @@ define amdgpu_kernel void @global_atomic_fadd_ret_f32(float addrspace(1)* %ptr)
 ; GFX90A-NEXT:    v_mov_b32_e32 v1, v0
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX90A-NEXT:    v_add_f32_e32 v0, 4.0, v1
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap v0, v2, v[0:1], s[0:1] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v1
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
@ -527,9 +529,11 @@ define amdgpu_kernel void @global_atomic_fadd_ret_f32_system(float addrspace(1)*
 ; GFX90A-NEXT:    v_mov_b32_e32 v1, v0
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX90A-NEXT:    v_add_f32_e32 v0, 4.0, v1
+; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap v0, v2, v[0:1], s[0:1] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v1
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
--- a/test/CodeGen/AMDGPU/memory-legalizer-fence.ll
+++ b/test/CodeGen/AMDGPU/memory-legalizer-fence.ll
@ -1275,13 +1275,17 @@ define amdgpu_kernel void @system_acquire_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_acquire_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_acquire_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1319,11 +1323,13 @@ define amdgpu_kernel void @system_release_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_release_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_release_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1367,13 +1373,17 @@ define amdgpu_kernel void @system_acq_rel_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_acq_rel_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_acq_rel_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1417,13 +1427,17 @@ define amdgpu_kernel void @system_seq_cst_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_seq_cst_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_seq_cst_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1467,13 +1481,17 @@ define amdgpu_kernel void @system_one_as_acquire_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_one_as_acquire_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_one_as_acquire_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1511,11 +1529,13 @@ define amdgpu_kernel void @system_one_as_release_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_one_as_release_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_one_as_release_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1559,13 +1579,17 @@ define amdgpu_kernel void @system_one_as_acq_rel_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_one_as_acq_rel_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_one_as_acq_rel_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1609,13 +1633,17 @@ define amdgpu_kernel void @system_one_as_seq_cst_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_one_as_seq_cst_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_one_as_seq_cst_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
+; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
+; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
--- a/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll
+++ b/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll
--- a/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll
+++ b/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll