[AMDGPU] Update gfx90a memory model support

Update AMDGPU gfx90a memory model to make coarse grain memory allocations consistent when fine grained system scope atomic acquire and release is performed.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D105137
2024-11-25 04:02:41 +01:00 · 2021-05-07 20:55:23 +00:00 · 2021-05-07 20:55:23 +00:00 · 47ca9dba51
commit 47ca9dba51
parent 16bf4a4717
7 changed files with 625 additions and 51 deletions
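The MTYPE rules that the documentation change below describes can be summarized in a small model. This is a minimal sketch, not code from the patch: the `MType` enum and both helper functions are hypothetical names introduced for illustration, and the model only encodes what the updated AMDGPUUsage.rst text states about `buffer_wbl2` and `buffer_invl2`.

```cpp
#include <cassert>

// Hypothetical model for illustration only; these names are not LLVM or
// hardware definitions.
//   RW: local coarse grain memory    NC: remote coarse grain memory
//   CC: local fine grain memory      UC: remote fine grain memory
enum class MType { RW, NC, CC, UC };

// True if L2 lines of this MTYPE can be dirty, so a buffer_wbl2 has
// something to write back. CC writes through to DRAM and UC bypasses the
// L2, so neither leaves dirty lines.
constexpr bool mayLeaveDirtyL2Lines(MType M) {
  return M == MType::RW || M == MType::NC;
}

// True if L2 lines of this MTYPE can be stale, so a buffer_invl2 is needed
// before reading them. RW and CC lines are invalidated by remote writes
// through the memory probes, and UC bypasses the L2.
constexpr bool mayHoldStaleL2Lines(MType M) {
  return M == MType::NC;
}

// Runtime spot check of the model above.
inline void checkMTypeModel() {
  assert(mayLeaveDirtyL2Lines(MType::RW));
  assert(!mayLeaveDirtyL2Lines(MType::UC));
  assert(mayHoldStaleL2Lines(MType::NC));
  assert(!mayHoldStaleL2Lines(MType::CC));
}
```

Under this reading, a system scope release needs the writeback because of MTYPE RW and NC lines, and a system scope acquire needs the invalidate because of MTYPE NC lines only.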
--- a/docs/AMDGPUUsage.rst
+++ b/docs/AMDGPUUsage.rst
@ -6093,10 +6093,10 @@ For GFX90A:
    ensures a previous vector memory operation has completed before executing a
    subsequent vector memory or LDS operation and so can be used to meet the
    requirements of acquire and release.
-  * The L2 cache of one agent can be kept coherent with other agents by using
+  * The L2 cache of one agent can be kept coherent with other agents by:
-    the MTYPE CC (cache-coherent) with the PTE C-bit for memory local to the L2,
+    using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
-    and MTYPE UC (uncached) with the PTE C-bit set for memory not local to the
+    C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
-    L2.
+    the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
    * Any local memory cache lines will be automatically invalidated by writes
      from CUs associated with other L2 caches, or writes from the CPU, due to
@ -6108,13 +6108,21 @@ For GFX90A:
      the CPU cache due to the L2 probe filter and the PTE C-bit being set.
    * Since all work-groups on the same agent share the same L2, no L2
      invalidation or writeback is required for coherence.
-    * Since local memory reads and writes of work-groups in different agents
+    * To ensure coherence of local and remote memory writes of work-groups in
-      access memory using MTYPE CC, no L2 invalidate or writeback is required
+      different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
-      for coherence. MTYPE CC causes write through to DRAM and local reads to be
+      cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
-      invalidated by remote writes with with the PTE C-bit.
+      (used for remote coarse grain memory). Note that MTYPE CC (used for local
-    * Since remote memory reads and writes of work-groups in different agents
+      fine grain memory) causes write through to DRAM, and MTYPE UC (used for
-      access memory using MTYPE UC, no L2 invalidate or writeback is required
+      remote fine grain memory) bypasses the L2, so both will never result in
-      for coherence. MTYPE UC causes direct accesses to DRAM.
+      dirty L2 cache lines.
    * To ensure coherence of local and remote memory reads of work-groups in
      different agents a ``buffer_invl2`` is required. It will invalidate L2
      cache lines with MTYPE NC (used for remote coarse grain memory). Note that
      MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
      coarse memory) cause local reads to be invalidated by remote writes with
      the PTE C-bit so these cache lines are not invalidated. Note that
      MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
      never result in L2 cache lines that need to be invalidated.
  * PCIe access from the GPU to the CPU memory is kept coherent by using the
    MTYPE UC (uncached) which bypasses the L2.
@ -6384,14 +6392,15 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                         2. s_waitcnt vmcnt(0)
                                                           - Must happen before
-                                                             following
+                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the load
                                                             has completed
                                                             before invalidating
                                                             the cache.
-                                                         3. buffer_wbinvl1_vol
+                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol
                                                           - Must happen before
                                                             any following
@ -6401,7 +6410,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
                                                             nor see stale L2 MTYPE
                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -6444,13 +6455,15 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the flat_load
                                                             has completed
                                                             before invalidating
                                                             the caches.
-                                                         3. buffer_wbinvl1_vol
+                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol
                                                           - Must happen before
                                                             any following
@ -6459,8 +6472,10 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             atomic/atomicrmw.
                                                           - Ensures that
                                                             following
-                                                             L1 loads will not see
+                                                             loads will not see
-                                                             stale global data.
+                                                             stale L1 global data,
                                                             nor see stale L2 MTYPE
                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -6579,7 +6594,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                         2. s_waitcnt vmcnt(0)
                                                           - Must happen before
-                                                             following
+                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
@ -6587,7 +6602,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             invalidating the
                                                             caches.
-                                                         3. buffer_wbinvl1_vol
+                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol
                                                           - Must happen before
                                                             any following
@ -6597,8 +6613,10 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
-                                                             MTYPE RW and CC L2 memory
+                                                             nor see stale L2 MTYPE
                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -6641,6 +6659,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             lgkmcnt(0).
                                                           - Must happen before
                                                             following
                                                             buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
@ -6648,7 +6667,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             invalidating the
                                                             caches.
-                                                         3. buffer_wbinvl1_vol
+                                                         3. buffer_invl2;
                                                            buffer_wbinvl1_vol
                                                           - Must happen before
                                                             any following
@ -6658,7 +6678,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
                                                             nor see stale L2 MTYPE
                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -6734,7 +6756,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             value read by the
                                                             fence-paired-atomic.
-                                                         3. buffer_wbinvl1_vol
+                                                         2. buffer_wbinvl1_vol
                                                           - If not TgSplit execution
                                                             mode, omit.
@ -6872,7 +6894,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             termed the
                                                             fence-paired-atomic).
                                                           - Must happen before
-                                                             the following
+                                                             the following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures that the
                                                             fence-paired atomic
@ -6887,7 +6909,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             the
                                                             fence-paired-atomic.
-                                                         2. buffer_wbinvl1_vol
+                                                         2. buffer_invl2;
                                                            buffer_wbinvl1_vol
                                                           - Must happen before any
                                                             following global/generic
@ -6897,7 +6920,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
                                                             nor see stale L2 MTYPE
                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -6991,8 +7016,18 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             released.
                                                         2. buffer/global/flat_store
-     store atomic release      - system       - global   1. s_waitcnt lgkmcnt(0) &
+     store atomic release      - system       - global   1. buffer_wbl2
-                                              - generic     vmcnt(0)
+                                              - generic
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)
                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
@ -7035,7 +7070,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             store that is being
                                                             released.
-                                                         2. buffer/global/flat_store
+                                                         3. buffer/global/flat_store
     atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
     atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
@ -7123,8 +7158,18 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             is being released.
                                                         2. buffer/global/flat_atomic
-     atomicrmw    release      - system       - global   1. s_waitcnt lgkmcnt(0) &
+     atomicrmw    release      - system       - global   1. buffer_wbl2
-                                              - generic     vmcnt(0)
+                                              - generic
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)
                                                           - If TgSplit execution mode,
                                                             omit lgkmcnt(0).
@ -7165,7 +7210,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             store that is being
                                                             released.
-                                                         2. buffer/global/flat_atomic
+                                                         3. buffer/global/flat_atomic
     fence        release      - singlethread *none*     *none*
                               - wavefront
     fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
@ -7298,7 +7343,20 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             following
                                                             fence-paired-atomic.
-     fence        release      - system       *none*     1. s_waitcnt lgkmcnt(0) &
+     fence        release      - system       *none*     1. buffer_wbl2
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit.
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)
                                                           - If TgSplit execution mode,
@ -7588,7 +7646,17 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             will not see stale
                                                             global data.
-     atomicrmw    acq_rel      - system       - global   1. s_waitcnt lgkmcnt(0) &
+     atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)
                                                           - If TgSplit execution mode,
@ -7629,11 +7697,11 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             atomicrmw that is
                                                             being released.
-                                                         2. buffer/global_atomic
+                                                         3. buffer/global_atomic
-                                                         3. s_waitcnt vmcnt(0)
+                                                         4. s_waitcnt vmcnt(0)
                                                           - Must happen before
-                                                             following
+                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
@ -7641,7 +7709,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             invalidating the
                                                             caches.
-                                                         4. buffer_wbinvl1_vol
+                                                         5. buffer_invl2;
                                                            buffer_wbinvl1_vol
                                                           - Must happen before
                                                             any following
@ -7651,7 +7720,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
                                                             nor see stale L2 MTYPE
                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -7726,7 +7797,17 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             will not see stale
                                                             global data.
-     atomicrmw    acq_rel      - system       - generic  1. s_waitcnt lgkmcnt(0) &
+     atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)
                                                           - If TgSplit execution mode,
@ -7767,8 +7848,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             atomicrmw that is
                                                             being released.
-                                                         2. flat_atomic
+                                                         3. flat_atomic
-                                                         3. s_waitcnt vmcnt(0) &
+                                                         4. s_waitcnt vmcnt(0) &
                                                            lgkmcnt(0)
                                                           - If TgSplit execution mode,
@ -7776,7 +7857,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - If OpenCL, omit
                                                             lgkmcnt(0).
                                                           - Must happen before
-                                                             following
+                                                             following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures the
                                                             atomicrmw has
@ -7784,7 +7865,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             invalidating the
                                                             caches.
-                                                         4. buffer_wbinvl1_vol
+                                                         5. buffer_invl2;
                                                            buffer_wbinvl1_vol
                                                           - Must happen before
                                                             any following
@ -7794,7 +7876,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
                                                             nor see stale L2 MTYPE
                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
@ -7902,7 +7986,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             the
                                                             acquire-fence-paired-atomic.
-                                                         3. buffer_wbinvl1_vol
+                                                         2. buffer_wbinvl1_vol
                                                           - If not TgSplit execution
                                                             mode, omit.
@ -8007,7 +8091,20 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             requirements of
                                                             acquire.
-     fence        acq_rel      - system       *none*     1. s_waitcnt lgkmcnt(0) &
+     fence        acq_rel      - system       *none*     1. buffer_wbl2
                                                           - If OpenCL and
                                                             address space is
                                                             local, omit.
                                                           - Must happen before
                                                             following s_waitcnt.
                                                           - Performs L2 writeback to
                                                             ensure previous
                                                             global/generic
                                                             store/atomicrmw are
                                                             visible at system scope.
                                                         2. s_waitcnt lgkmcnt(0) &
                                                            vmcnt(0)
                                                           - If TgSplit execution mode,
@ -8048,7 +8145,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             atomic/store
                                                             atomic/atomicrmw.
                                                           - Must happen before
-                                                             the following
+                                                             the following buffer_invl2 and
                                                             buffer_wbinvl1_vol.
                                                           - Ensures that the
                                                             preceding
@ -8087,7 +8184,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                             requirements of
                                                             release.
-                                                         2.  buffer_wbinvl1_vol
+                                                         3.  buffer_invl2;
                                                             buffer_wbinvl1_vol
                                                           - Must happen before
                                                             any following
@ -8098,7 +8196,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
                                                           - Ensures that
                                                             following
                                                             loads will not see
-                                                             stale L1 global data.
+                                                             stale L1 global data,
+                                                             nor see stale L2 MTYPE
+                                                             NC global data.
                                                             MTYPE RW and CC memory will
                                                             never be stale in L2 due to
                                                             the memory probes.
--- a/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@ -452,6 +452,12 @@ public:
                     SIAtomicScope Scope,
                     SIAtomicAddrSpace AddrSpace,
                     Position Pos) const override;
  bool insertRelease(MachineBasicBlock::iterator &MI,
                     SIAtomicScope Scope,
                     SIAtomicAddrSpace AddrSpace,
                     bool IsCrossAddrSpaceOrdering,
                     Position Pos) const override;
 };
 class SIGfx10CacheControl : public SIGfx7CacheControl {
@ -1265,9 +1271,26 @@ bool SIGfx90ACacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
  bool Changed = false;
  MachineBasicBlock &MBB = *MI->getParent();
  DebugLoc DL = MI->getDebugLoc();
  if (Pos == Position::AFTER)
    ++MI;
  if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {
    switch (Scope) {
    case SIAtomicScope::SYSTEM:
      // Ensures that following loads will not see stale remote VMEM data or
      // stale local VMEM data with MTYPE NC. Local VMEM data with MTYPE RW and
      // CC will never be stale due to the local memory probes.
      BuildMI(MBB, MI, DL, TII->get(AMDGPU::BUFFER_INVL2));
      // Inserting a "S_WAITCNT vmcnt(0)" after is not required because the
      // hardware does not reorder memory operations by the same wave with
      // respect to a preceding "BUFFER_INVL2". The invalidate is guaranteed to
      // remove any cache lines of earlier writes by the same wave and ensures
      // later reads by the same wave will refetch the cache lines.
      Changed = true;
      break;
    case SIAtomicScope::AGENT:
      // Same as GFX7.
      break;
@ -1297,11 +1320,62 @@ bool SIGfx90ACacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
  /// Other address spaces do not have a cache.
  if (Pos == Position::AFTER)
    --MI;
  Changed |= SIGfx7CacheControl::insertAcquire(MI, Scope, AddrSpace, Pos);
  return Changed;
 }
 bool SIGfx90ACacheControl::insertRelease(MachineBasicBlock::iterator &MI,
                                         SIAtomicScope Scope,
                                         SIAtomicAddrSpace AddrSpace,
                                         bool IsCrossAddrSpaceOrdering,
                                         Position Pos) const {
  bool Changed = false;
  MachineBasicBlock &MBB = *MI->getParent();
  DebugLoc DL = MI->getDebugLoc();
  if (Pos == Position::AFTER)
    ++MI;
  if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {
    switch (Scope) {
    case SIAtomicScope::SYSTEM:
      // Inserting a "S_WAITCNT vmcnt(0)" before is not required because the
      // hardware does not reorder memory operations by the same wave with
      // respect to a following "BUFFER_WBL2". The "BUFFER_WBL2" is guaranteed
      // to initiate writeback of any dirty cache lines of earlier writes by the
      // same wave. A "S_WAITCNT vmcnt(0)" is needed after to ensure the
      // writeback has completed.
      BuildMI(MBB, MI, DL, TII->get(AMDGPU::BUFFER_WBL2));
      // Followed by same as GFX7, which will ensure the necessary "S_WAITCNT
      // vmcnt(0)" needed by the "BUFFER_WBL2".
      Changed = true;
      break;
    case SIAtomicScope::AGENT:
    case SIAtomicScope::WORKGROUP:
    case SIAtomicScope::WAVEFRONT:
    case SIAtomicScope::SINGLETHREAD:
      // Same as GFX7.
      break;
    default:
      llvm_unreachable("Unsupported synchronization scope");
    }
  }
  if (Pos == Position::AFTER)
    --MI;
  Changed |=
      SIGfx7CacheControl::insertRelease(MI, Scope, AddrSpace,
                                        IsCrossAddrSpaceOrdering, Pos);
  return Changed;
 }
 bool SIGfx10CacheControl::enableLoadCacheBypass(
    const MachineBasicBlock::iterator &MI,
    SIAtomicScope Scope,
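The two gfx90a overrides above add exactly one cache instruction each at system scope and otherwise defer to the GFX7 implementation. A minimal standalone sketch of that decision is given below. The `Scope` and `Hook` enums and the `gfx90aExtraOps` function are hypothetical stand-ins for illustration, not LLVM APIs, and the model leaves out everything the GFX7 base class adds afterwards (the `s_waitcnt` and `buffer_wbinvl1_vol`).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for SIAtomicScope and for the two hooks; these are
// not the real LLVM definitions.
enum class Scope { SingleThread, Wavefront, Workgroup, Agent, System };
enum class Hook { Acquire, Release };

// Returns the instruction the gfx90a override inserts before delegating to
// the GFX7 behaviour, or an empty list when it adds nothing of its own.
// TouchesGlobal models "(AddrSpace & GLOBAL) != NONE" from the patch.
std::vector<std::string> gfx90aExtraOps(Hook H, Scope S, bool TouchesGlobal) {
  std::vector<std::string> Ops;
  if (!TouchesGlobal)
    return Ops; // Other address spaces do not have a cache.
  switch (S) {
  case Scope::System:
    // Acquire invalidates L2 so later loads refetch; release writes back
    // dirty L2 lines so earlier stores become visible at system scope.
    Ops.push_back(H == Hook::Acquire ? "buffer_invl2" : "buffer_wbl2");
    return Ops;
  case Scope::Agent:
  case Scope::Workgroup:
  case Scope::Wavefront:
  case Scope::SingleThread:
    return Ops; // Same as GFX7.
  }
  assert(false && "unsupported synchronization scope");
  return Ops;
}
```

Under these assumptions, `gfx90aExtraOps(Hook::Release, Scope::System, true)` yields `buffer_wbl2`, while any narrower scope yields an empty list.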
--- a/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll
+++ b/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll
@ -424,9 +424,11 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat(double addrspace(1)*
 ; GFX90A-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], 4.0
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[0:1], v4, v[0:3], s[0:1] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
@ -470,9 +472,11 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat_system(double addrsp
 ; GFX90A-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GFX90A-NEXT:    v_mov_b32_e32 v4, 0
 ; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], 4.0
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[0:1], v4, v[0:3], s[0:1] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
@ -526,9 +530,11 @@ define double @global_atomic_fadd_f64_rtn_pat(double addrspace(1)* %ptr, double
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[4:5], 4.0
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5], off glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
 ; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
@ -571,9 +577,11 @@ define double @global_atomic_fadd_f64_rtn_pat_system(double addrspace(1)* %ptr,
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[4:5], 4.0
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5], off glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
 ; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
@ -655,9 +663,11 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat(double* %ptr) #1 {
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], 4.0
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
@ -702,9 +712,11 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_system(double* %ptr) #
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    v_add_f64 v[0:1], v[2:3], 4.0
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
@ -730,9 +742,11 @@ define double @flat_atomic_fadd_f64_rtn_pat(double* %ptr) #1 {
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[4:5], 4.0
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    flat_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
 ; GFX90A-NEXT:    s_or_b64 s[4:5], vcc, s[4:5]
@ -775,9 +789,11 @@ define double @flat_atomic_fadd_f64_rtn_pat_system(double* %ptr) #1 {
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
 ; GFX90A-NEXT:    v_add_f64 v[2:3], v[4:5], 4.0
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    flat_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX90A-NEXT:    v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
--- a/test/CodeGen/AMDGPU/global-atomics-fp.ll
+++ b/test/CodeGen/AMDGPU/global-atomics-fp.ll
@ -70,9 +70,11 @@ define amdgpu_kernel void @global_atomic_fadd_ret_f32(float addrspace(1)* %ptr)
 ; GFX90A-NEXT:    v_mov_b32_e32 v1, v0
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX90A-NEXT:    v_add_f32_e32 v0, 4.0, v1
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap v0, v2, v[0:1], s[0:1] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v1
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
@ -527,9 +529,11 @@ define amdgpu_kernel void @global_atomic_fadd_ret_f32_system(float addrspace(1)*
 ; GFX90A-NEXT:    v_mov_b32_e32 v1, v0
 ; GFX90A-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX90A-NEXT:    v_add_f32_e32 v0, 4.0, v1
 ; GFX90A-NEXT:    buffer_wbl2
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    global_atomic_cmpswap v0, v2, v[0:1], s[0:1] glc
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NEXT:    buffer_invl2
 ; GFX90A-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NEXT:    v_cmp_eq_u32_e32 vcc, v0, v1
 ; GFX90A-NEXT:    s_or_b64 s[2:3], vcc, s[2:3]
--- a/test/CodeGen/AMDGPU/memory-legalizer-fence.ll
+++ b/test/CodeGen/AMDGPU/memory-legalizer-fence.ll
@ -1275,13 +1275,17 @@ define amdgpu_kernel void @system_acquire_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_acquire_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_acquire_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1319,11 +1323,13 @@ define amdgpu_kernel void @system_release_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_release_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_release_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1367,13 +1373,17 @@ define amdgpu_kernel void @system_acq_rel_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_acq_rel_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_acq_rel_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1417,13 +1427,17 @@ define amdgpu_kernel void @system_seq_cst_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_seq_cst_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_seq_cst_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1467,13 +1481,17 @@ define amdgpu_kernel void @system_one_as_acquire_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_one_as_acquire_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_one_as_acquire_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1511,11 +1529,13 @@ define amdgpu_kernel void @system_one_as_release_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_one_as_release_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_one_as_release_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1559,13 +1579,17 @@ define amdgpu_kernel void @system_one_as_acq_rel_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_one_as_acq_rel_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_one_as_acq_rel_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
@ -1609,13 +1633,17 @@ define amdgpu_kernel void @system_one_as_seq_cst_fence() {
 ;
 ; GFX90A-NOTTGSPLIT-LABEL: system_one_as_seq_cst_fence:
 ; GFX90A-NOTTGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-NOTTGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-NOTTGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-NOTTGSPLIT-NEXT:    s_endpgm
 ;
 ; GFX90A-TGSPLIT-LABEL: system_one_as_seq_cst_fence:
 ; GFX90A-TGSPLIT:       ; %bb.0: ; %entry
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbl2
 ; GFX90A-TGSPLIT-NEXT:    s_waitcnt vmcnt(0)
 ; GFX90A-TGSPLIT-NEXT:    buffer_invl2
 ; GFX90A-TGSPLIT-NEXT:    buffer_wbinvl1_vol
 ; GFX90A-TGSPLIT-NEXT:    s_endpgm
 entry:
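The fence checks above all pin the same relative order: `buffer_wbl2`, then an `s_waitcnt`, then `buffer_invl2`, then `buffer_wbinvl1_vol`. As an illustration only, that ordering can be expressed as a subsequence test over instruction mnemonics. The helper name `containsInOrder` is hypothetical and this is not how FileCheck itself matches lines; it is a rough sketch of the property the CHECK-NEXT lines enforce more strictly.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true when every entry of Expected appears as a prefix of some
// instruction in Insts, in the same relative order. Unrelated instructions
// may appear in between, unlike CHECK-NEXT, which requires adjacency.
bool containsInOrder(const std::vector<std::string> &Insts,
                     const std::vector<std::string> &Expected) {
  std::size_t Next = 0;
  for (const std::string &I : Insts) {
    // rfind(prefix, 0) == 0 is a prefix test: the match must start at 0.
    if (Next < Expected.size() && I.rfind(Expected[Next], 0) == 0)
      ++Next;
  }
  return Next == Expected.size();
}
```

Feeding it the five instructions of the `system_seq_cst_fence` body with the four mnemonics listed above would return true; swapping `buffer_wbl2` and the `s_waitcnt` would return false.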
--- a/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll
+++ b/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll
--- a/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll
+++ b/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll