diff --git a/docs/AMDGPUUsage.rst b/docs/AMDGPUUsage.rst index f4436045571..5aecdfc3c22 100644 --- a/docs/AMDGPUUsage.rst +++ b/docs/AMDGPUUsage.rst @@ -78,141 +78,170 @@ names from both the *Processor* and *Alternative Processor* can be used. .. table:: AMDGPU Processors :name: amdgpu-processor-table - =========== =============== ============ ===== ========== ======= ====================== - Processor Alternative Target dGPU/ Target ROCm Example - Processor Triple APU Features Support Products + =========== =============== ============ ===== ================= ======= ====================== + Processor Alternative Target dGPU/ Target ROCm Example + Processor Triple APU Features Support Products Architecture Supported [Default] - =========== =============== ============ ===== ========== ======= ====================== + =========== =============== ============ ===== ================= ======= ====================== **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_ - ---------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------- ``r600`` ``r600`` dGPU ``r630`` ``r600`` dGPU ``rs880`` ``r600`` dGPU ``rv670`` ``r600`` dGPU **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_ - ---------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------- ``rv710`` ``r600`` dGPU ``rv730`` ``r600`` dGPU ``rv770`` ``r600`` dGPU **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_ - ---------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------- ``cedar`` ``r600`` dGPU ``cypress`` ``r600`` dGPU ``juniper`` ``r600`` dGPU ``redwood`` ``r600`` dGPU ``sumo`` ``r600`` dGPU **Radeon HD 6000 Series (Northern 
Islands)** [AMD-RADEON-HD-6000]_ - ---------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------- ``barts`` ``r600`` dGPU ``caicos`` ``r600`` dGPU ``cayman`` ``r600`` dGPU ``turks`` ``r600`` dGPU **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_ - ---------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------- ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU ``gfx601`` - ``hainan`` ``amdgcn`` dGPU - ``oland`` - ``pitcairn`` - ``verde`` **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_ - ---------------------------------------------------------------------------------------- - ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000 - - A6 Pro-7050B - - A8-7100 - - A8 Pro-7150B - - A10-7300 - - A10 Pro-7350B - - FX-7500 - - A8-7200P - - A10-7400P - - FX-7600P - ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100 - - FirePro W9100 - - FirePro S9150 - - FirePro S9170 - ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290 - - Radeon R9 290x - - Radeon R390 - - Radeon R390x - ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100 - - ``mullins`` - E1-2200 - - E1-2500 - - E2-3000 - - E2-3800 - - A4-5000 - - A4-5100 - - A6-5200 - - A4 Pro-3340B - ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790 - - Radeon HD 8770 - - R7 260 - - R7 260X + ----------------------------------------------------------------------------------------------- + ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000 + - A6 Pro-7050B + - A8-7100 + - A8 Pro-7150B + - A10-7300 + - A10 Pro-7350B + - FX-7500 + - A8-7200P + - A10-7400P + - FX-7600P + ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100 + - FirePro W9100 + - FirePro S9150 + - FirePro S9170 + ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290 + - Radeon R9 290x + - Radeon R390 + - Radeon R390x + ``gfx703`` - 
``kabini`` ``amdgcn`` APU - E1-2100 + - ``mullins`` - E1-2200 + - E1-2500 + - E2-3000 + - E2-3800 + - A4-5000 + - A4-5100 + - A6-5200 + - A4 Pro-3340B + ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790 + - Radeon HD 8770 + - R7 260 + - R7 260X **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_ - ---------------------------------------------------------------------------------------- - ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P - [on] - Pro A6-8500B - - A8-8600P - - Pro A8-8600B - - FX-8800P - - Pro A12-8800B - \ ``amdgcn`` APU - xnack ROCm - A10-8700P - [on] - Pro A10-8700B - - A10-8780P - \ ``amdgcn`` APU - xnack - A10-9600P - [on] - A10-9630P - - A12-9700P - - A12-9730P - - FX-9800P - - FX-9830P - \ ``amdgcn`` APU - xnack - E2-9010 - [on] - A6-9210 - - A9-9410 - ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150 - - ``tonga`` [off] - FirePro S7100 - - FirePro W7100 - - Radeon R285 - - Radeon R9 380 - - Radeon R9 385 - - Mobile FirePro - M7170 - ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano - [off] - Radeon R9 Fury - - Radeon R9 FuryX - - Radeon Pro Duo - - FirePro S9300x2 - - Radeon Instinct MI8 - \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470 - [off] - Radeon RX 480 - - Radeon Instinct MI6 - \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460 + ----------------------------------------------------------------------------------------------- + ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P + [on] - Pro A6-8500B + - A8-8600P + - Pro A8-8600B + - FX-8800P + - Pro A12-8800B + \ ``amdgcn`` APU - xnack ROCm - A10-8700P + [on] - Pro A10-8700B + - A10-8780P + \ ``amdgcn`` APU - xnack - A10-9600P + [on] - A10-9630P + - A12-9700P + - A12-9730P + - FX-9800P + - FX-9830P + \ ``amdgcn`` APU - xnack - E2-9010 + [on] - A6-9210 + - A9-9410 + ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150 + - ``tonga`` [off] - FirePro S7100 + - FirePro W7100 + - Radeon 
R285 + - Radeon R9 380 + - Radeon R9 385 + - Mobile FirePro + M7170 + ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano + [off] - Radeon R9 Fury + - Radeon R9 FuryX + - Radeon Pro Duo + - FirePro S9300x2 + - Radeon Instinct MI8 + \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470 + [off] - Radeon RX 480 + - Radeon Instinct MI6 + \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460 [off] ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack [on] **GCN GFX9** [AMD-GCN-GFX9]_ - ---------------------------------------------------------------------------------------- - ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega - [off] Frontier Edition - - Radeon RX Vega 56 - - Radeon RX Vega 64 - - Radeon RX Vega 64 - Liquid - - Radeon Instinct MI25 - ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G - [on] - Ryzen 5 2400G - ``gfx904`` ``amdgcn`` dGPU - xnack *TBA* + ----------------------------------------------------------------------------------------------- + ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega + [off] Frontier Edition + - Radeon RX Vega 56 + - Radeon RX Vega 64 + - Radeon RX Vega 64 + Liquid + - Radeon Instinct MI25 + ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G + [on] - Ryzen 5 2400G + ``gfx904`` ``amdgcn`` dGPU - xnack *TBA* [off] - .. TODO - Add product - names. - ``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50 - [off] - Radeon Instinct MI60 - ``gfx909`` ``amdgcn`` APU - xnack *TBA* (Raven Ridge 2) + .. TODO + Add product + names. + ``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50 + [off] - Radeon Instinct MI60 + ``gfx909`` ``amdgcn`` APU - xnack *TBA* (Raven Ridge 2) [on] - .. TODO - Add product - names. - =========== =============== ============ ===== ========== ======= ====================== + .. TODO + Add product + names. 
+ **GCN GFX10** [AMD-GCN-GFX10]_ + ----------------------------------------------------------------------------------------------- + ``gfx1010`` ``amdgcn`` dGPU - xnack *TBA* + [off] + - wavefrontsize64 + [off] + - cumode + [off] + .. TODO + Add product + names. + ``gfx1011`` ``amdgcn`` dGPU - xnack *TBA* + [off] + - wavefrontsize64 + [off] + - cumode + [off] + .. TODO + Add product + names. + ``gfx1012`` ``amdgcn`` dGPU - xnack *TBA* + [off] + - wavefrontsize64 + [off] + - cumode + [off] + .. TODO + Add product + names. + =========== =============== ============ ===== ================= ======= ====================== .. _amdgpu-target-features: @@ -243,26 +272,38 @@ For example: .. table:: AMDGPU Target Features :name: amdgpu-target-feature-table - =============== ================================================== - Target Feature Description - =============== ================================================== - -m[no-]xnack Enable/disable generating code that has - memory clauses that are compatible with - having XNACK replay enabled. + ====================== ================================================== + Target Feature Description + ====================== ================================================== + -m[no-]xnack Enable/disable generating code that has + memory clauses that are compatible with + having XNACK replay enabled. - This is used for demand paging and page - migration. If XNACK replay is enabled in - the device, then if a page fault occurs - the code may execute incorrectly if the - ``xnack`` feature is not enabled. Executing - code that has the feature enabled on a - device that does not have XNACK replay - enabled will execute correctly, but may - be less performant than code with the - feature disabled. - -m[no-]sram-ecc Enable/disable generating code that assumes SRAM - ECC is enabled/disabled. - =============== ================================================== + This is used for demand paging and page + migration. 
If XNACK replay is enabled in + the device, then if a page fault occurs + the code may execute incorrectly if the + ``xnack`` feature is not enabled. Executing + code that has the feature enabled on a + device that does not have XNACK replay + enabled will execute correctly, but may + be less performant than code with the + feature disabled. + + -m[no-]sram-ecc Enable/disable generating code that assumes SRAM + ECC is enabled/disabled. + + -m[no-]wavefrontsize64 Control the default wavefront size used when + generating code for kernels. When disabled, + native wavefront size 32 is used; when enabled, + wavefront size 64 is used. + + -m[no-]cumode Control the default wavefront execution mode used + when generating code for kernels. When disabled, + native WGP wavefront execution mode is used; + when enabled, CU wavefront execution mode is used + (see :ref:`amdgpu-amdhsa-memory-model`). + ====================== ================================================== .. _amdgpu-address-spaces: @@ -635,6 +676,10 @@ The AMDGPU backend uses the following ELF header: ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906`` *reserved* 0x030 Reserved. ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909`` + *reserved* 0x032 Reserved. + ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010`` + ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011`` + ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012`` ================================= ========== ============================= Sections @@ -1492,12 +1537,12 @@ non-AMD key names should be prefixed by "*vendor-name*.". "NumSGPRs" integer Required Number of scalar registers used by a wavefront for - GFX6-GFX9. This + GFX6-GFX10. This includes the special SGPRs for VCC, Flat - Scratch (GFX7-GFX9) + Scratch (GFX7-GFX10) and XNACK (for - GFX8-GFX9). It does + GFX8-GFX10). It does not include the 16 SGPR added if a trap handler is @@ -1508,7 +1553,7 @@ non-AMD key names should be prefixed by "*vendor-name*.".
"NumVGPRs" integer Required Number of vector registers used by each work-item for - GFX6-GFX9 + GFX6-GFX10 "MaxFlatWorkGroupSize" integer Required Maximum flat work-group size supported by the @@ -2060,10 +2105,10 @@ the scratch buffer descriptor and per wavefront scratch offset, by the scratch instructions, or by flat instructions. If each lane of a wavefront accesses the same private address, the interleaving results in adjacent dwords being accessed and hence requires fewer cache lines to be fetched. Multi-dword access is not -supported except by flat and scratch instructions in GFX9. +supported except by flat and scratch instructions in GFX9-GFX10. The generic address space uses the hardware flat address support available in -GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and +GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and local appertures), that are outside the range of addressible global memory, to map from a flat address to a private or local address. @@ -2078,7 +2123,7 @@ To convert between a segment address and a flat address the base address of the appertures address can be used. For GFX7-GFX8 these are available in the :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For -GFX9 the appature base addresses are directly available as inline constant +GFX9-GFX10 the aperture base addresses are directly available as inline constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32 which makes it easier to convert from flat to segment or segment to flat. @@ -2120,14 +2165,14 @@ A kernel descriptor consists of the information needed by CP to initiate the execution of a kernel, including the entry point address of the machine code that implements the kernel.
-Kernel Descriptor for GFX6-GFX9 -+++++++++++++++++++++++++++++++ +Kernel Descriptor for GFX6-GFX10 +++++++++++++++++++++++++++++++++ CP microcode requires the Kernel descriptor to be allocated on 64 byte alignment. - .. table:: Kernel Descriptor for GFX6-GFX9 - :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table + .. table:: Kernel Descriptor for GFX6-GFX10 + :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table ======= ======= =============================== ============================ Bits Size Field Name Description @@ -2157,22 +2202,32 @@ alignment. entry point instruction which must be 256 byte aligned. - 383:192 24 Reserved, must be 0. + 351:192 20 Reserved, must be 0. bytes + 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9 + Reserved, must be 0. + GFX10 + Compute Shader (CS) + program settings used by + CP to set up + ``COMPUTE_PGM_RSRC3`` + configuration + register. See + :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`. 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS) program settings used by CP to set up ``COMPUTE_PGM_RSRC1`` configuration register. See - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS) program settings used by CP to set up ``COMPUTE_PGM_RSRC2`` configuration register. See - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the _BUFFER SGPR user data registers (see @@ -2192,15 +2247,24 @@ alignment. 453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT *see above* 454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT *see above* _SIZE - 455 1 bit Reserved, must be 0. - 511:456 8 bytes Reserved, must be 0. + 457:455 3 bits Reserved, must be 0. + 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9 + Reserved, must be 0. + GFX10 + - If 0 execute in + wavefront size 64 mode. + - If 1 execute in + native wavefront size + 32 mode.
+ 463:459 5 bits Reserved, must be 0. + 511:464 6 bytes Reserved, must be 0. 512 **Total size 64 bytes.** ======= ==================================================================== .. - .. table:: compute_pgm_rsrc1 for GFX6-GFX9 - :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table + .. table:: compute_pgm_rsrc1 for GFX6-GFX10 + :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table ======= ======= =============================== =========================================================================== Bits Size Field Name Description @@ -2213,6 +2277,12 @@ alignment. GFX6-GFX9 - vgprs_used 0..256 - max(0, ceil(vgprs_used / 4) - 1) + GFX10 (wavefront size 64) + - max_vgpr 1..256 + - max(0, ceil(vgprs_used / 4) - 1) + GFX10 (wavefront size 32) + - max_vgpr 1..256 + - max(0, ceil(vgprs_used / 8) - 1) Where vgprs_used is defined as the highest VGPR number @@ -2244,6 +2314,10 @@ alignment. GFX9 - sgprs_used 0..112 - 2 * max(0, ceil(sgprs_used / 16) - 1) + GFX10 + Reserved, must be 0. + (128 SGPRs always + allocated.) Where sgprs_used is defined as the highest @@ -2407,7 +2481,7 @@ alignment. ``COMPUTE_PGM_RSRC1.CDBG_USER``. 26 1 bit FP16_OVFL GFX6-GFX8 Reserved, must be 0. - GFX9 + GFX9-GFX10 Wavefront starts execution with specified fp16 overflow mode. @@ -2423,14 +2497,60 @@ alignment. Used by CP to set up ``COMPUTE_PGM_RSRC1.FP16_OVFL``. - 31:27 5 bits Reserved, must be 0. + 28:27 2 bits Reserved, must be 0. + 29 1 bit WGP_MODE GFX6-GFX9 + Reserved, must be 0. + GFX10 + - If 0 execute work-groups in + CU wavefront execution mode. + - If 1 execute work-groups + in WGP wavefront execution mode. + + See :ref:`amdgpu-amdhsa-memory-model`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.WGP_MODE``. + 30 1 bit MEM_ORDERED GFX6-GFX9 + Reserved, must be 0. + GFX10 + Controls the behavior of the + waitcnt's vmcnt and vscnt + counters.
+ + - If 0 vmcnt reports completion + of load and atomic with return + out of order with sample + instructions, and the vscnt + reports the completion of + store and atomic without + return in order. + - If 1 vmcnt reports completion + of load, atomic with return + and sample instructions in + order, and the vscnt reports + the completion of store and + atomic without return in order. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.MEM_ORDERED``. + 31 1 bit FWD_PROGRESS GFX6-GFX9 + Reserved, must be 0. + GFX10 + - If 0 execute SIMD wavefronts + using oldest first policy. + - If 1 execute SIMD wavefronts to + ensure wavefronts will make some + forward progress. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``. 32 **Total size 4 bytes** ======= =================================================================================================================== .. - .. table:: compute_pgm_rsrc2 for GFX6-GFX9 - :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table + .. table:: compute_pgm_rsrc2 for GFX6-GFX10 + :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table ======= ======= =============================== =========================================================================== Bits Size Field Name Description @@ -2549,7 +2669,7 @@ alignment. GFX6: roundup(lds-size / (64 * 4)) - GFX7-GFX9: + GFX7-GFX10: roundup(lds-size / (128 * 4)) 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution @@ -2580,6 +2700,21 @@ alignment. 32 **Total size 4 bytes.** ======= =================================================================================================================== +.. + + ..
table:: compute_pgm_rsrc3 for GFX10 + :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table + + ======= ======= =============================== =========================================================================== + Bits Size Field Name Description + ======= ======= =============================== =========================================================================== + 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120. + compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64. + 31:4 28 Reserved, must be 0. + bits + 32 **Total size 4 bytes.** + ======= =================================================================================================================== + .. .. table:: Floating Point Rounding Mode Enumeration Values @@ -2749,7 +2884,7 @@ SGPR register initial state is defined in it once avoids loading it at the beginning of every wavefront. - GFX9 + GFX9-GFX10 This is the 64 bit base address of the per SPI scratch backing @@ -2787,7 +2922,7 @@ SGPR register initial state is defined in GFX7-GFX8 since it is the same value as the second SGPR of Flat Scratch Init. However, it - may be needed for GFX9 which + may be needed for GFX9-GFX10 which changes the meaning of the Flat Scratch Init value. then Grid Work-Group Count X 1 32 bit count of the number of @@ -2889,8 +3024,8 @@ Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit value to the hardware required SGPRn-3 and SGPRn-4 respectively. The global segment can be accessed either using buffer instructions (GFX6 which -has V# 64 bit address support), flat instructions (GFX7-GFX9), or global -instructions (GFX9). +has V# 64 bit address support), flat instructions (GFX7-GFX10), or global +instructions (GFX9-GFX10). If buffer operations are used then the compiler can generate a V# with the following properties: @@ -2918,7 +3053,7 @@ GFX6-GFX8 available in dispatch packet. 
For M0, it is also possible to use maximum possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for GFX7-GFX8). -GFX9 +GFX9-GFX10 The M0 register is not used for range checking LDS accesses and so does not need to be initialized in the prolog. @@ -2951,7 +3086,7 @@ GFX7-GFX8 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH SIZE. -GFX9 +GFX9-GFX10 The Flat Scratch Init is the 64 bit address of the base of scratch backing memory being managed by SPI for the queue executing the kernel dispatch. The prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH @@ -2972,7 +3107,7 @@ The AMDGPU backend supports the memory synchronization scopes specified in :ref:`amdgpu-memory-scopes`. The code sequences used to implement the memory model are defined in table -:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`. +:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`. The sequences specify the order of instructions that a single thread must execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect @@ -3010,7 +3145,8 @@ termed vector memory operations. For GFX6-GFX9: -* Each agent has multiple compute units (CU). +* Each agent has multiple shader arrays (SA). +* Each SA has multiple compute units (CU). * Each CU has multiple SIMDs that execute wavefronts. * The wavefronts for a single work-group are executed in the same CU but may be executed by different SIMDs. @@ -3056,8 +3192,79 @@ For GFX6-GFX9: * The L2 cache can be kept coherent with other agents on some targets, or ranges of virtual addresses can be set up to bypass it to ensure system coherence. +For GFX10: + +* Each agent has multiple shader arrays (SA). +* Each SA has multiple work-group processors (WGP). +* Each WGP has multiple compute units (CU). +* Each CU has multiple SIMDs that execute wavefronts. +* The wavefronts for a single work-group are executed in the same + WGP. 
In CU wavefront execution mode the wavefronts may be executed by + different SIMDs in the same CU. In WGP wavefront execution mode the + wavefronts may be executed by different SIMDs in different CUs in the same + WGP. +* Each WGP has a single LDS memory shared by the wavefronts of the work-groups + executing on it. +* All LDS operations of a WGP are performed as wavefront wide operations in a + global order and involve no caching. Completion is reported to a wavefront in + execution order. +* The LDS memory has multiple request queues shared by the SIMDs of a + WGP. Therefore, the LDS operations performed by different wavefronts of a work-group + can be reordered relative to each other, which can result in reordering the + visibility of vector memory operations with respect to LDS operations of other + wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to + ensure synchronization between LDS operations and vector memory operations + between wavefronts of a work-group, but not between operations performed by the + same wavefront. +* The vector memory operations are performed as wavefront wide operations. + Completion of load/store/sample operations are reported to a wavefront in + execution order of other load/store/sample operations performed by that + wavefront. +* The vector memory operations access a vector L0 cache. There is a single L0 + cache per CU. Each SIMD of a CU accesses the same L0 cache. + Therefore, no special action is required for coherence between the lanes of a + single wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence + between wavefronts executing in the same work-group as they may be executing on + SIMDs of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also + required for coherence between wavefronts executing in different work-groups as + they may be executing on different WGPs. +* The scalar memory operations access a scalar L0 cache shared by all wavefronts + on a WGP. 
The scalar and vector L0 caches are not coherent. However, scalar + operations are used in a restricted way so do not impact the memory model. See + :ref:`amdgpu-amdhsa-memory-spaces`. +* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on + the same SA. Therefore, no special action is required for coherence between + the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is + required for coherence between wavefronts executing in different work-groups as + they may be executing on different SAs that access different L1s. +* The L1 caches have independent quadrants to service disjoint ranges of virtual + addresses. +* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the + vector and scalar memory operations performed by different wavefronts, whether + executing in the same or different work-groups (which may be executing on + different CUs accessing different L0s), can be reordered relative to each + other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure synchronization + between vector memory operations of different wavefronts. It ensures a previous + vector memory operation has completed before executing a subsequent vector + memory or LDS operation and so can be used to meet the requirements of acquire, + release and sequential consistency. +* The L1 caches use an L2 cache shared by all SAs on the same agent. +* The L2 cache has independent channels to service disjoint ranges of virtual + addresses. +* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 + quadrant has a separate request queue per L2 channel. Therefore, the vector + and scalar memory operations performed by wavefronts executing in different + work-groups (which may be executing on different SAs) of an agent can be + reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is + required to ensure synchronization between vector memory operations of + different SAs. 
It ensures a previous vector memory operation has completed + before executing a subsequent vector memory operation and so can be used to meet the + requirements of acquire, release and sequential consistency. +* The L2 cache can be kept coherent with other agents on some targets, or ranges + of virtual addresses can be set up to bypass it to ensure system coherence. + Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8), -or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the +or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread is accessing the memory, atomic memory orderings are not meaningful and all accesses are treated as non-atomic. @@ -3100,285 +3307,428 @@ future wavefront that uses the same scratch area, or a function call that create frame at the same address, respectively. There is no need for a ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. -Scratch backing memory (which is used for the private address space) +For GFX6-GFX9, scratch backing memory (which is used for the private address space) is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space is only accessed by a single thread, and is always write-before-read, there is never a need to invalidate these entries from the L1 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the volatile cache lines. +For GFX10, scratch backing memory (which is used for the private address space) +is accessed with MTYPE NC (non-coherent). Since the private address space is +only accessed by a single thread, and is always write-before-read, there is +never a need to invalidate these entries from the L0 or L1 caches. + +For GFX10, wavefronts are executed in native mode with in-order reporting of loads +and sample instructions.
In this mode vmcnt reports completion of load, atomic +with return and sample instructions in order, and the vscnt reports the +completion of store and atomic without return in order. See ``MEM_ORDERED`` field +in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + +In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode: + +* In WGP wavefront execution mode the wavefronts of a work-group are executed + on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per + CU L0 caches is required for work-group synchronization. Also accesses to L1 at + work-group scope need to be explicitly ordered as the accesses from different + CUs are not ordered. +* In CU wavefront execution mode the wavefronts of a work-group are executed on + the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by + the work-group access the same L0 which in turn ensures L1 accesses are + ordered and so do not require explicit management of the caches for + work-group synchronization. + +See ``WGP_MODE`` field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` +and :ref:`amdgpu-target-features`. + On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing -to invalidate the L2 cache. This also causes it to be treated as +to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC -(cache coherent) and so the L2 cache will coherent with the CPU and other +(cache coherent) and so the L2 cache will be coherent with the CPU and other agents. - .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 - :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table + ..
table:: AMDHSA Memory Model Code Sequences GFX6-GFX10 + :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table - ============ ============ ============== ========== =============================== - LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code - Ordering Sync Scope Address + ============ ============ ============== ========== =============================== ================================== + LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code AMDGPU Machine Code + Ordering Sync Scope Address GFX6-9 GFX10 Space - ============ ============ ============== ========== =============================== + ============ ============ ============== ========== =============================== ================================== **Non-Atomic** - ----------------------------------------------------------------------------------- - load *none* *none* - global - !volatile & !nontemporal + ---------------------------------------------------------------------------------------------------------------------- + load *none* *none* - global - !volatile & !nontemporal - !volatile & !nontemporal - generic - - private 1. buffer/global/flat_load + - private 1. buffer/global/flat_load 1. buffer/global/flat_load - constant - - volatile & !nontemporal + - volatile & !nontemporal - volatile & !nontemporal - 1. buffer/global/flat_load - glc=1 + 1. buffer/global/flat_load 1. buffer/global/flat_load + glc=1 glc=1 dlc=1 - - nontemporal + - nontemporal - nontemporal - 1. buffer/global/flat_load - glc=1 slc=1 + 1. buffer/global/flat_load 1. buffer/global/flat_load + glc=1 slc=1 slc=1 - load *none* *none* - local 1. ds_load - store *none* *none* - global - !nontemporal + load *none* *none* - local 1. ds_load 1. ds_load + store *none* *none* - global - !nontemporal - !nontemporal - generic - - private 1. buffer/global/flat_store + - private 1. buffer/global/flat_store 1. buffer/global/flat_store - constant - - nontemporal + - nontemporal - nontemporal - 1. 
buffer/global/flat_stote - glc=1 slc=1 + 1. buffer/global/flat_store 1. buffer/global/flat_store + glc=1 slc=1 slc=1 - store *none* *none* - local 1. ds_store + store *none* *none* - local 1. ds_store 1. ds_store **Unordered Atomic** - ----------------------------------------------------------------------------------- - load atomic unordered *any* *any* *Same as non-atomic*. - store atomic unordered *any* *any* *Same as non-atomic*. - atomicrmw unordered *any* *any* *Same as monotonic - atomic*. + ---------------------------------------------------------------------------------------------------------------------- + load atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. + store atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. + atomicrmw unordered *any* *any* *Same as monotonic *Same as monotonic + atomic*. atomic*. **Monotonic Atomic** - ----------------------------------------------------------------------------------- - load atomic monotonic - singlethread - global 1. buffer/global/flat_load + ---------------------------------------------------------------------------------------------------------------------- + load atomic monotonic - singlethread - global 1. buffer/global/flat_load 1. buffer/global/flat_load - wavefront - generic - - workgroup - load atomic monotonic - singlethread - local 1. ds_load + load atomic monotonic - workgroup - global 1. buffer/global/flat_load 1. buffer/global/flat_load + - generic glc=1 + + - If CU wavefront execution mode, omit glc=1. + + load atomic monotonic - singlethread - local 1. ds_load 1. ds_load - wavefront - workgroup - load atomic monotonic - agent - global 1. buffer/global/flat_load - - system - generic glc=1 - store atomic monotonic - singlethread - global 1. buffer/global/flat_store + load atomic monotonic - agent - global 1. buffer/global/flat_load 1. buffer/global/flat_load + - system - generic glc=1 glc=1 dlc=1 + store atomic monotonic - singlethread - global 1.
buffer/global/flat_store 1. buffer/global/flat_store - wavefront - generic - workgroup - agent - system - store atomic monotonic - singlethread - local 1. ds_store + store atomic monotonic - singlethread - local 1. ds_store 1. ds_store - wavefront - workgroup - atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic + atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 1. buffer/global/flat_atomic - wavefront - generic - workgroup - agent - system - atomicrmw monotonic - singlethread - local 1. ds_atomic + atomicrmw monotonic - singlethread - local 1. ds_atomic 1. ds_atomic - wavefront - workgroup **Acquire Atomic** - ----------------------------------------------------------------------------------- - load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load + ---------------------------------------------------------------------------------------------------------------------- + load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 1. buffer/global/ds/flat_load - wavefront - local - generic - load atomic acquire - workgroup - global 1. buffer/global/flat_load - load atomic acquire - workgroup - local 1. ds_load - 2. s_waitcnt lgkmcnt(0) + load atomic acquire - workgroup - global 1. buffer/global/flat_load 1. buffer/global_load glc=1 - - If OpenCL, omit. - - Must happen before - any following - global/generic - load/load - atomic/store/store - atomic/atomicrmw. - - Ensures any - following global - data read is no - older than the load - atomic value being - acquired. - load atomic acquire - workgroup - generic 1. flat_load - 2. s_waitcnt lgkmcnt(0) + - If CU wavefront execution mode, omit glc=1. - - If OpenCL, omit. - - Must happen before - any following - global/generic - load/load - atomic/store/store - atomic/atomicrmw. - - Ensures any - following global - data read is no - older than the load - atomic value being - acquired. - load atomic acquire - agent - global 1. 
buffer/global/flat_load - - system glc=1 - 2. s_waitcnt vmcnt(0) + 2. s_waitcnt vmcnt(0) - - Must happen before - following - buffer_wbinvl1_vol. - - Ensures the load - has completed - before invalidating - the cache. + - If CU wavefront execution mode, omit. + - Must happen before + the following buffer_gl0_inv + and before any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. - 3. buffer_wbinvl1_vol + 3. buffer_gl0_inv - - Must happen before - any following - global/generic - load/load - atomic/atomicrmw. - - Ensures that - following - loads will not see - stale global data. + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. - load atomic acquire - agent - generic 1. flat_load glc=1 - - system 2. s_waitcnt vmcnt(0) & - lgkmcnt(0) + load atomic acquire - workgroup - local 1. ds_load 1. ds_load + 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) - - If OpenCL omit - lgkmcnt(0). - - Must happen before - following - buffer_wbinvl1_vol. - - Ensures the flat_load - has completed - before invalidating - the cache. + - If OpenCL, omit. - If OpenCL, omit. + - Must happen before - Must happen before + any following the following buffer_gl0_inv + global/generic and before any following + load/load global/generic load/load + atomic/store/store atomic/store/store + atomic/atomicrmw. atomic/atomicrmw. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the load older than the load + atomic value being atomic value being + acquired. acquired. - 3. buffer_wbinvl1_vol + 3. buffer_gl0_inv - - Must happen before - any following - global/generic - load/load - atomic/atomicrmw. - - Ensures that - following loads - will not see stale - global data. + - If CU wavefront execution mode, omit. + - If OpenCL, omit. + - Ensures that + following + loads will not see + stale data. - atomicrmw acquire - singlethread - global 1.
buffer/global/ds/flat_atomic + load atomic acquire - workgroup - generic 1. flat_load 1. flat_load glc=1 + + - If CU wavefront execution mode, omit glc=1. + + 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) & + vmcnt(0) + + - If CU wavefront execution mode, omit vmcnt. + - If OpenCL, omit. - If OpenCL, omit + lgkmcnt(0). + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv and any + load/load following global/generic + atomic/store/store load/load + atomic/atomicrmw. atomic/store/store + atomic/atomicrmw. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the load older than the load + atomic value being atomic value being + acquired. acquired. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + load atomic acquire - agent - global 1. buffer/global/flat_load 1. buffer/global_load + - system glc=1 glc=1 dlc=1 + 2. s_waitcnt vmcnt(0) 2. s_waitcnt vmcnt(0) + + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the load - Ensures the load + has completed has completed + before invalidating before invalidating + the cache. the caches. + + 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following following + loads will not see loads will not see + stale global data. stale global data. + + load atomic acquire - agent - generic 1. flat_load glc=1 1. flat_load glc=1 dlc=1 + - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vmcnt(0) & + lgkmcnt(0) lgkmcnt(0) + + - If OpenCL omit - If OpenCL omit + lgkmcnt(0). lgkmcnt(0). + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. 
buffer_gl*_inv. + - Ensures the flat_load - Ensures the flat_load + has completed has completed + before invalidating before invalidating + the cache. the caches. + + 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. + + atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic - wavefront - local - generic - atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic - atomicrmw acquire - workgroup - local 1. ds_atomic - 2. waitcnt lgkmcnt(0) + atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic 1. buffer/global_atomic + 2. s_waitcnt vm/vscnt(0) - - If OpenCL, omit. - - Must happen before - any following - global/generic + - If CU wavefront execution mode, omit. + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + - Must happen before + the following buffer_gl0_inv + and before any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acquire - workgroup - local 1. ds_atomic 1. ds_atomic + 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) + + - If OpenCL, omit. - If OpenCL, omit. + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv. load/load atomic/store/store atomic/atomicrmw. - - Ensures any - following global - data read is no - older than the - atomicrmw value - being acquired.
+ - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the older than the + atomicrmw value atomicrmw value + being acquired. being acquired. - atomicrmw acquire - workgroup - generic 1. flat_atomic - 2. waitcnt lgkmcnt(0) + 3. buffer_gl0_inv - - If OpenCL, omit. - - Must happen before - any following - global/generic + - If OpenCL, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acquire - workgroup - generic 1. flat_atomic 1. flat_atomic + 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) & + vm/vscnt(0) + + - If CU wavefront execution mode, omit vm/vscnt. + - If OpenCL, omit. - If OpenCL, omit + waitcnt lgkmcnt(0). + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv. load/load atomic/store/store atomic/atomicrmw. - - Ensures any - following global - data read is no - older than the - atomicrmw value - being acquired. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the older than the + atomicrmw value atomicrmw value + being acquired. being acquired. - atomicrmw acquire - agent - global 1. buffer/global/flat_atomic - - system 2. s_waitcnt vmcnt(0) + 3. buffer_gl0_inv - - Must happen before - following - buffer_wbinvl1_vol. - - Ensures the - atomicrmw has - completed before - invalidating the - cache. + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. - 3. buffer_wbinvl1_vol + atomicrmw acquire - agent - global 1. buffer/global/flat_atomic 1. buffer/global_atomic + - system 2. s_waitcnt vmcnt(0) 2. s_waitcnt vm/vscnt(0) - - Must happen before - any following - global/generic - load/load - atomic/atomicrmw. - - Ensures that - following loads - will not see stale - global data.
+ - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the - Ensures the + atomicrmw has atomicrmw has + completed before completed before + invalidating the invalidating the + cache. caches. - atomicrmw acquire - agent - generic 1. flat_atomic - - system 2. s_waitcnt vmcnt(0) & - lgkmcnt(0) + 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; + buffer_gl1_inv - - If OpenCL, omit - lgkmcnt(0). - - Must happen before - following - buffer_wbinvl1_vol. - - Ensures the - atomicrmw has - completed before - invalidating the - cache. + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. - 3. buffer_wbinvl1_vol + atomicrmw acquire - agent - generic 1. flat_atomic 1. flat_atomic + - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vm/vscnt(0) & + lgkmcnt(0) lgkmcnt(0) - - Must happen before - any following - global/generic - load/load - atomic/atomicrmw. - - Ensures that - following loads - will not see stale - global data. + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the - Ensures the + atomicrmw has atomicrmw has + completed before completed before + invalidating the invalidating the + cache. caches. - fence acquire - singlethread *none* *none* + 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw.
+ - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. + + fence acquire - singlethread *none* *none* *none* - wavefront - fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) + fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) - - If OpenCL and - address space is - not generic, omit. - - However, since LLVM - currently has no - address space on - the fence need to - conservatively - always generate. If - fence had an - address space then - set to address - space of OpenCL - fence flag, or to - generic if both - local and global - flags are - specified. + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit. not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate. If always generate. If + fence had an fence had an + address space then address space then + set to address set to address + space of OpenCL space of OpenCL + fence flag, or to fence flag, or to + generic if both generic if both + local and global local and global + flags are flags are + specified. specified. - Must happen after any preceding local/generic load @@ -3402,22 +3752,95 @@ agents. older than the value read by the fence-paired-atomic. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. 
+ - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + atomicrmw-no-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + buffer_gl0_inv. + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + cache. Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. - fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) + 3. buffer_gl0_inv - - If OpenCL and - address space is - not generic, omit - lgkmcnt(0). - - However, since LLVM - currently has no - address space on - the fence need to - conservatively - always generate - (see comment for - previous fence). + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit not generic, omit + lgkmcnt(0). lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). 
+ - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate always generate + (see comment for (see comment for + previous fence). previous fence). - Could be split into separate s_waitcnt vmcnt(0) and @@ -3466,863 +3889,1555 @@ agents. the value read by the fence-paired-atomic. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + atomicrmw-no-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + buffer_gl*_inv. + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + caches. Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. - 2. buffer_wbinvl1_vol + 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; + buffer_gl1_inv - - Must happen before any - following global/generic - load/load - atomic/store/store - atomic/atomicrmw. - - Ensures that - following loads - will not see stale - global data. 
+ - Must happen before any - Must happen before any + following global/generic following global/generic + load/load load/load + atomic/store/store atomic/store/store + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. **Release Atomic** - ----------------------------------------------------------------------------------- - store atomic release - singlethread - global 1. buffer/global/ds/flat_store + ---------------------------------------------------------------------------------------------------------------------- + store atomic release - singlethread - global 1. buffer/global/ds/flat_store 1. buffer/global/ds/flat_store - wavefront - local - generic - store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) + store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) - - If OpenCL, omit. + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + lgkmcnt(0). - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw. - - Must happen before - the following - store. - - Ensures that all - memory operations - to local have - completed before - performing the - store that is being - released. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. 
+ - Must happen before - Must happen before + the following the following + store. store. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + store that is being store that is being + released. released. - 2. buffer/global/flat_store - store atomic release - workgroup - local 1. ds_store - store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) + 2. buffer/global/flat_store 2. buffer/global_store + store atomic release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) - - If OpenCL, omit. + - If CU wavefront execution mode, omit. + - If OpenCL, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - Must happen before + the following + store. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 1. ds_store 2. ds_store + store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + lgkmcnt(0). - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw. - - Must happen before - the following - store. - - Ensures that all - memory operations - to local have - completed before - performing the - store that is being - released. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. 
+ - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load/store/load + atomic/store atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + store. store. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + store that is being store that is being + released. released. - 2. flat_store - store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & - - system - generic vmcnt(0) + 2. flat_store 2. flat_store + store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) vmcnt(0) & vscnt(0) - - If OpenCL, omit - lgkmcnt(0). - - Could be split into - separate s_waitcnt - vmcnt(0) and - s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic - load/store/load - atomic/store - atomic/atomicrmw. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - the following - store. - - Ensures that all - memory operations - to memory have - completed before - performing the - store that is being - released. + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). 
+ - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt vscnt(0) + s_waitcnt and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load + atomic/store atomic/ + atomic/atomicrmw. atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + store. store. + - Ensures that all - Ensures that all + memory operations memory operations + to memory have to memory have + completed before completed before + performing the performing the + store that is being store that is being + released. released. - 2. buffer/global/ds/flat_store - atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic + 2. buffer/global/ds/flat_store 2. buffer/global/ds/flat_store + atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic - wavefront - local - generic - atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) + atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + - If CU wavefront execution mode, omit vmcnt and + vscnt. - If OpenCL, omit. + - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw. 
- - Must happen before - the following - atomicrmw. - - Ensures that all - memory operations - to local have - completed before - performing the - atomicrmw that is - being released. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. - 2. buffer/global/flat_atomic - atomicrmw release - workgroup - local 1. ds_atomic - atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) + 2. buffer/global/flat_atomic 2. buffer/global_atomic + atomicrmw release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) - - If OpenCL, omit. + - If CU wavefront execution mode, omit. + - If OpenCL, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - Must happen before + the following + store. 
+ - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 1. ds_atomic 2. ds_atomic + atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + waitcnt lgkmcnt(0). - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw. - - Must happen before - the following - atomicrmw. - - Ensures that all - memory operations - to local have - completed before - performing the - atomicrmw that is - being released. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load/store/load + atomic/store atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. - 2. flat_atomic - atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & - - system - generic vmcnt(0) + 2. flat_atomic 2. flat_atomic + atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) vmcnt(0) & vscnt(0) - - If OpenCL, omit - lgkmcnt(0).
- - Could be split into - separate s_waitcnt - vmcnt(0) and - s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic - load/store/load - atomic/store + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load atomic/ + atomic/store atomicrmw-with-return-value. atomic/atomicrmw. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - the following - atomicrmw. - - Ensures that all - memory operations - to global and local - have completed - before performing - the atomicrmw that - is being released. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to global and local to global and local + have completed have completed + before performing before performing + the atomicrmw that the atomicrmw that + is being released. is being released. - 2. 
buffer/global/ds/flat_atomic - fence release - singlethread *none* *none* + 2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic + fence release - singlethread *none* *none* *none* - wavefront - fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) + fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) - - If OpenCL and - address space is - not generic, omit. - - However, since LLVM - currently has no - address space on - the fence need to - conservatively - always generate. If - fence had an - address space then - set to address - space of OpenCL - fence flag, or to - generic if both - local and global - flags are - specified. + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit. not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate. If always generate. If + fence had an fence had an + address space then address space then + set to address set to address + space of OpenCL space of OpenCL + fence flag, or to fence flag, or to + generic if both generic if both + local and global local and global + flags are flags are + specified. specified. - Must happen after any preceding local/generic load/load atomic/store/store atomic/atomicrmw. - - Must happen before - any following store - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). - - Ensures that all - memory operations - to local have - completed before - performing the - following - fence-paired-atomic. 
+ - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store atomic/ + atomicrmw. + - Must happen before - Must happen before + any following store any following store + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + fence-paired-atomic). fence-paired-atomic). + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + following following + fence-paired-atomic. fence-paired-atomic. - fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) + fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) - - If OpenCL and - address space is - not generic, omit - lgkmcnt(0). - - If OpenCL and - address space is - local, omit - vmcnt(0). - - However, since LLVM - currently has no - address space on - the fence need to - conservatively - always generate. If - fence had an - address space then - set to address - space of OpenCL - fence flag, or to - generic if both - local and global - flags are - specified. - - Could be split into - separate s_waitcnt - vmcnt(0) and - s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. 
- - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic - load/store/load - atomic/store + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit not generic, omit + lgkmcnt(0). lgkmcnt(0). + - If OpenCL and - If OpenCL and + address space is address space is + local, omit local, omit + vmcnt(0). vmcnt(0) and vscnt(0). + - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate. If always generate. If + fence had an fence had an + address space then address space then + set to address set to address + space of OpenCL space of OpenCL + fence flag, or to fence flag, or to + generic if both generic if both + local and global local and global + flags are flags are + specified. specified. + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load atomic/ + atomic/store atomicrmw-with-return-value. atomic/atomicrmw. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - any following store - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - fence-paired-atomic). - - Ensures that all - memory operations - have - completed before - performing the - following - fence-paired-atomic. 
+ - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + any following store any following store + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + fence-paired-atomic). fence-paired-atomic). + - Ensures that all - Ensures that all + memory operations memory operations + have have + completed before completed before + performing the performing the + following following + fence-paired-atomic. fence-paired-atomic. **Acquire-Release Atomic** - ----------------------------------------------------------------------------------- - atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic + ---------------------------------------------------------------------------------------------------------------------- + atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic - wavefront - local - generic - atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) + atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) - - If OpenCL, omit. + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + s_waitcnt lgkmcnt(0). + - Must happen after - Must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. 
+ - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load/store/load + atomic/store atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. + + 2. buffer/global/flat_atomic 2. buffer/global_atomic + 3. s_waitcnt vm/vscnt(0) + + - If CU wavefront execution mode, omit vm/vscnt. + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + waitcnt lgkmcnt(0). + - Must happen before + the following + buffer_gl0_inv. + - Ensures any + following global + data read is no + older than the + atomicrmw value + being acquired. + + 4. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acq_rel - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit. + - If OpenCL, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. 
+ - Must happen before + the following + store. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 1. ds_atomic 2. ds_atomic + 2. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. - If OpenCL, omit. + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv. + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the load older than the load + atomic value being atomic value being + acquired. acquired. + + 4. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - If OpenCL, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + waitcnt lgkmcnt(0). - Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw. - - Must happen before - the following - atomicrmw. - - Ensures that all - memory operations - to local have - completed before - performing the - atomicrmw that is - being released. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load/store/load + atomic/store atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw.
+ - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. - 2. buffer/global/flat_atomic - atomicrmw acq_rel - workgroup - local 1. ds_atomic - 2. s_waitcnt lgkmcnt(0) + 2. flat_atomic 2. flat_atomic + 3. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) & + vm/vscnt(0) - - If OpenCL, omit. - - Must happen before - any following - global/generic + - If CU wavefront execution mode, omit vm/vscnt. + - If OpenCL, omit. - If OpenCL, omit + waitcnt lgkmcnt(0). + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv. load/load atomic/store/store atomic/atomicrmw. - - Ensures any - following global - data read is no - older than the load - atomic value being - acquired. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the load older than the load + atomic value being atomic value being + acquired. acquired. - atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) + 3. buffer_gl0_inv - - If OpenCL, omit. - - Must happen after - any preceding - local/generic - load/store/load - atomic/store + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. 
+ - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load atomic/ + atomic/store atomicrmw-with-return-value. atomic/atomicrmw. - - Must happen before - the following - atomicrmw. - - Ensures that all - memory operations - to local have - completed before - performing the - atomicrmw that is - being released. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to global have to global have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. - 2. flat_atomic - 3. s_waitcnt lgkmcnt(0) + 2. buffer/global/flat_atomic 2. buffer/global_atomic + 3. s_waitcnt vmcnt(0) 3. s_waitcnt vm/vscnt(0) - - If OpenCL, omit. - - Must happen before - any following - global/generic - load/load - atomic/store/store + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + waitcnt lgkmcnt(0). + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the - Ensures the + atomicrmw has atomicrmw has + completed before completed before + invalidating the invalidating the + cache. caches. + + 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. 
+ - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. + + atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load atomic/ + atomic/store atomicrmw-with-return-value. atomic/atomicrmw. - - Ensures any - following global - data read is no - older than the load - atomic value being - acquired. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to global have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. - atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) + 2. flat_atomic 2. flat_atomic + 3. s_waitcnt vmcnt(0) & 3. s_waitcnt vm/vscnt(0) & + lgkmcnt(0) lgkmcnt(0) - - If OpenCL, omit - lgkmcnt(0).
- - Could be split into - separate s_waitcnt - vmcnt(0) and - s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic - load/store/load - atomic/store - atomic/atomicrmw. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - the following - atomicrmw. - - Ensures that all - memory operations - to global have - completed before - performing the - atomicrmw that is - being released. + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the - Ensures the + atomicrmw has atomicrmw has + completed before completed before + invalidating the invalidating the + cache. caches. - 2. buffer/global/flat_atomic - 3. s_waitcnt vmcnt(0) + 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; + buffer_gl1_inv - - Must happen before - following - buffer_wbinvl1_vol. - - Ensures the - atomicrmw has - completed before - invalidating the - cache. + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. - 4. buffer_wbinvl1_vol - - - Must happen before - any following - global/generic - load/load - atomic/atomicrmw. - - Ensures that - following loads - will not see stale - global data. - - atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) - - - If OpenCL, omit - lgkmcnt(0). 
- - Could be split into - separate s_waitcnt - vmcnt(0) and - s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic - load/store/load - atomic/store - atomic/atomicrmw. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - the following - atomicrmw. - - Ensures that all - memory operations - to global have - completed before - performing the - atomicrmw that is - being released. - - 2. flat_atomic - 3. s_waitcnt vmcnt(0) & - lgkmcnt(0) - - - If OpenCL, omit - lgkmcnt(0). - - Must happen before - following - buffer_wbinvl1_vol. - - Ensures the - atomicrmw has - completed before - invalidating the - cache. - - 4. buffer_wbinvl1_vol - - - Must happen before - any following - global/generic - load/load - atomic/atomicrmw. - - Ensures that - following loads - will not see stale - global data. - - fence acq_rel - singlethread *none* *none* + fence acq_rel - singlethread *none* *none* *none* - wavefront - fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) + fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) - - If OpenCL and - address space is - not generic, omit. - - However, - since LLVM - currently has no - address space on - the fence need to - conservatively - always generate - (see comment for - previous fence). + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit. not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). 
+ - However, - However, + since LLVM since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate always generate + (see comment for (see comment for + previous fence). previous fence). - Must happen after any preceding local/generic load/load atomic/store/store atomic/atomicrmw. - - Must happen before - any following - global/generic - load/load - atomic/store/store - atomic/atomicrmw. - - Ensures that all - memory operations - to local have - completed before - performing any - following global - memory operations. - - Ensures that the - preceding - local/generic load - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - acquire-fence-paired-atomic - ) has completed - before following - global memory - operations. This - satisfies the - requirements of - acquire. - - Ensures that all - previous memory - operations have - completed before a - following - local/generic store - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - release-fence-paired-atomic - ). This satisfies the - requirements of - release. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store atomic/ + atomicrmw. 
+ - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/store/store atomic/store/store + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing any performing any + following global following global + memory operations. memory operations. + - Ensures that the - Ensures that the + preceding preceding + local/generic load local/generic load + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + acquire-fence-paired-atomic acquire-fence-paired-atomic + ) has completed ) has completed + before following before following + global memory global memory + operations. This operations. This + satisfies the satisfies the + requirements of requirements of + acquire. acquire. + - Ensures that all - Ensures that all + previous memory previous memory + operations have operations have + completed before a completed before a + following following + local/generic store local/generic store + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + release-fence-paired-atomic release-fence-paired-atomic + ). This satisfies the ). This satisfies the + requirements of requirements of + release. release. + - Must happen before + the following + buffer_gl0_inv. + - Ensures that the + acquire-fence-paired + atomic has completed + before invalidating + the + cache. Therefore + any following + locations read must + be no older than + the value read by + the + acquire-fence-paired-atomic. 
- fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & - - system vmcnt(0) + 3. buffer_gl0_inv - - If OpenCL and - address space is - not generic, omit - lgkmcnt(0). - - However, since LLVM - currently has no - address space on - the fence need to - conservatively - always generate - (see comment for - previous fence). - - Could be split into - separate s_waitcnt - vmcnt(0) and - s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - s_waitcnt vmcnt(0) - must happen after - any preceding - global/generic - load/store/load - atomic/store - atomic/atomicrmw. - - s_waitcnt lgkmcnt(0) - must happen after - any preceding - local/generic - load/store/load - atomic/store - atomic/atomicrmw. - - Must happen before - the following - buffer_wbinvl1_vol. - - Ensures that the - preceding - global/local/generic - load - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - acquire-fence-paired-atomic - ) has completed - before invalidating - the cache. This - satisfies the - requirements of - acquire. - - Ensures that all - previous memory - operations have - completed before a - following - global/local/generic - store - atomic/atomicrmw - with an equal or - wider sync scope - and memory ordering - stronger than - unordered (this is - termed the - release-fence-paired-atomic - ). This satisfies the - requirements of - release. + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. - 2. buffer_wbinvl1_vol + fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) - - Must happen before - any following - global/generic - load/load - atomic/store/store - atomic/atomicrmw. - - Ensures that - following loads - will not see stale - global data. This - satisfies the - requirements of - acquire. 
+ - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit not generic, omit + lgkmcnt(0). lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate always generate + (see comment for (see comment for + previous fence). previous fence). + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load + atomic/store atomic/ + atomic/atomicrmw. atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + buffer_wbinvl1_vol. buffer_gl*_inv. 
+ - Ensures that the - Ensures that the + preceding preceding + global/local/generic global/local/generic + load load + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + acquire-fence-paired-atomic acquire-fence-paired-atomic + ) has completed ) has completed + before invalidating before invalidating + the cache. This the caches. This + satisfies the satisfies the + requirements of requirements of + acquire. acquire. + - Ensures that all - Ensures that all + previous memory previous memory + operations have operations have + completed before a completed before a + following following + global/local/generic global/local/generic + store store + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + release-fence-paired-atomic release-fence-paired-atomic + ). This satisfies the ). This satisfies the + requirements of requirements of + release. release. + + 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/store/store atomic/store/store + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. This global data. This + satisfies the satisfies the + requirements of requirements of + acquire. acquire. 
**Sequential Consistent Atomic** - ----------------------------------------------------------------------------------- - load atomic seq_cst - singlethread - global *Same as corresponding - - wavefront - local load atomic acquire, - - generic except must generated - all instructions even - for OpenCL.* - load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) - - generic - - Must - happen after - preceding - global/generic load - atomic/store - atomic/atomicrmw - with memory - ordering of seq_cst - and with equal or - wider sync scope. - (Note that seq_cst - fences have their - own s_waitcnt - lgkmcnt(0) and so do - not need to be - considered.) - - Ensures any - preceding - sequential - consistent local - memory instructions - have completed - before executing - this sequentially - consistent - instruction. This - prevents reordering - a seq_cst store - followed by a - seq_cst load. (Note - that seq_cst is - stronger than - acquire/release as - the reordering of - load acquire - followed by a store - release is - prevented by the - waitcnt of - the release, but - there is nothing - preventing a store - release followed by - load acquire from - competing out of - order.) + ---------------------------------------------------------------------------------------------------------------------- + load atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding + - wavefront - local load atomic acquire, load atomic acquire, + - generic except must generate except must generate + all instructions even all instructions even + for OpenCL.* for OpenCL.* + load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + - generic vmcnt(0) & vscnt(0) - 2. *Following - instructions same as - corresponding load - atomic acquire, - except must generated - all instructions even - for OpenCL.* + - If CU wavefront execution mode, omit vmcnt and + vscnt.
+ - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - Must - waitcnt lgkmcnt(0) must + happen after happen after + preceding preceding + global/generic load local load + atomic/store atomic/store + atomic/atomicrmw atomic/atomicrmw + with memory with memory + ordering of seq_cst ordering of seq_cst + and with equal or and with equal or + wider sync scope. wider sync scope. + (Note that seq_cst (Note that seq_cst + fences have their fences have their + own s_waitcnt own s_waitcnt + lgkmcnt(0) and so do lgkmcnt(0) and so do + not need to be not need to be + considered.) considered.) + - waitcnt vmcnt(0) + Must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vmcnt(0) and so do + not need to be + considered.) + - waitcnt vscnt(0) + Must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vscnt(0) and so do + not need to be + considered.) + - Ensures any - Ensures any + preceding preceding + sequential sequential + consistent local consistent global/local + memory instructions memory instructions + have completed have completed + before executing before executing + this sequentially this sequentially + consistent consistent + instruction. This instruction. This + prevents reordering prevents reordering + a seq_cst store a seq_cst store + followed by a followed by a + seq_cst load. (Note seq_cst load. 
(Note + that seq_cst is that seq_cst is + stronger than stronger than + acquire/release as acquire/release as + the reordering of the reordering of + load acquire load acquire + followed by a store followed by a store + release is release is + prevented by the prevented by the + waitcnt of waitcnt of + the release, but the release, but + there is nothing there is nothing + preventing a store preventing a store + release followed by release followed by + load acquire from load acquire from + competing out of competing out of + order.) order.) + + 2. *Following 2. *Following + instructions same as instructions same as + corresponding load corresponding load + atomic acquire, atomic acquire, + except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* load atomic seq_cst - workgroup - local *Same as corresponding load atomic acquire, except must generated all instructions even for OpenCL.* - load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & - - system - generic vmcnt(0) - - Could be split into - separate s_waitcnt - vmcnt(0) - and s_waitcnt - lgkmcnt(0) to allow - them to be - independently moved - according to the - following rules. - - waitcnt lgkmcnt(0) - must happen after - preceding - global/generic load - atomic/store - atomic/atomicrmw - with memory - ordering of seq_cst - and with equal or - wider sync scope. - (Note that seq_cst - fences have their - own s_waitcnt - lgkmcnt(0) and so do - not need to be - considered.) - - waitcnt vmcnt(0) - must happen after - preceding - global/generic load - atomic/store - atomic/atomicrmw - with memory - ordering of seq_cst - and with equal or - wider sync scope. - (Note that seq_cst - fences have their - own s_waitcnt - vmcnt(0) and so do - not need to be - considered.) - - Ensures any - preceding - sequential - consistent global - memory instructions - have completed - before executing - this sequentially - consistent - instruction. 
This - prevents reordering - a seq_cst store - followed by a - seq_cst load. (Note - that seq_cst is - stronger than - acquire/release as - the reordering of - load acquire - followed by a store - release is - prevented by the - waitcnt of - the release, but - there is nothing - preventing a store - release followed by - load acquire from - competing out of - order.) + 1. s_waitcnt vmcnt(0) & vscnt(0) - 2. *Following - instructions same as - corresponding load - atomic acquire, - except must generated - all instructions even - for OpenCL.* - store atomic seq_cst - singlethread - global *Same as corresponding - - wavefront - local store atomic release, - - workgroup - generic except must generated - all instructions even - for OpenCL.* - store atomic seq_cst - agent - global *Same as corresponding - - system - generic store atomic release, - except must generated - all instructions even - for OpenCL.* - atomicrmw seq_cst - singlethread - global *Same as corresponding - - wavefront - local atomicrmw acq_rel, - - workgroup - generic except must generated - all instructions even - for OpenCL.* - atomicrmw seq_cst - agent - global *Same as corresponding - - system - generic atomicrmw acq_rel, - except must generated - all instructions even - for OpenCL.* - fence seq_cst - singlethread *none* *Same as corresponding - - wavefront fence acq_rel, - - workgroup except must generated - - agent all instructions even - - system for OpenCL.* - ============ ============ ============== ========== =============================== + - If CU wavefront execution mode, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - waitcnt vmcnt(0) + Must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. 
+ (Note that seq_cst + fences have their + own s_waitcnt + vmcnt(0) and so do + not need to be + considered.) + - waitcnt vscnt(0) + Must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vscnt(0) and so do + not need to be + considered.) + - Ensures any + preceding + sequential + consistent global + memory instructions + have completed + before executing + this sequentially + consistent + instruction. This + prevents reordering + a seq_cst store + followed by a + seq_cst load. (Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + waitcnt of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + competing out of + order.) + + 2. *Following + instructions same as + corresponding load + atomic acquire, + except must generated + all instructions even + for OpenCL.* + + load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) vmcnt(0) & vscnt(0) + + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) vmcnt(0), s_waitcnt + and s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - waitcnt lgkmcnt(0) - waitcnt lgkmcnt(0) + must happen after must happen after + preceding preceding + global/generic load local load + atomic/store atomic/store + atomic/atomicrmw atomic/atomicrmw + with memory with memory + ordering of seq_cst ordering of seq_cst + and with equal or and with equal or + wider sync scope. wider sync scope. 
+ (Note that seq_cst (Note that seq_cst + fences have their fences have their + own s_waitcnt own s_waitcnt + lgkmcnt(0) and so do lgkmcnt(0) and so do + not need to be not need to be + considered.) considered.) + - waitcnt vmcnt(0) - waitcnt vmcnt(0) + must happen after must happen after + preceding preceding + global/generic load global/generic load + atomic/store atomic/ + atomic/atomicrmw atomicrmw-with-return-value + with memory with memory + ordering of seq_cst ordering of seq_cst + and with equal or and with equal or + wider sync scope. wider sync scope. + (Note that seq_cst (Note that seq_cst + fences have their fences have their + own s_waitcnt own s_waitcnt + vmcnt(0) and so do vmcnt(0) and so do + not need to be not need to be + considered.) considered.) + - waitcnt vscnt(0) + Must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vscnt(0) and so do + not need to be + considered.) + - Ensures any - Ensures any + preceding preceding + sequential sequential + consistent global consistent global + memory instructions memory instructions + have completed have completed + before executing before executing + this sequentially this sequentially + consistent consistent + instruction. This instruction. This + prevents reordering prevents reordering + a seq_cst store a seq_cst store + followed by a followed by a + seq_cst load. (Note seq_cst load. 
(Note + that seq_cst is that seq_cst is + stronger than stronger than + acquire/release as acquire/release as + the reordering of the reordering of + load acquire load acquire + followed by a store followed by a store + release is release is + prevented by the prevented by the + waitcnt of waitcnt of + the release, but the release, but + there is nothing there is nothing + preventing a store preventing a store + release followed by release followed by + load acquire from load acquire from + competing out of competing out of + order.) order.) + + 2. *Following 2. *Following + instructions same as instructions same as + corresponding load corresponding load + atomic acquire, atomic acquire, + except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + store atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding + - wavefront - local store atomic release, store atomic release, + - workgroup - generic except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + store atomic seq_cst - agent - global *Same as corresponding *Same as corresponding + - system - generic store atomic release, store atomic release, + except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + atomicrmw seq_cst - singlethread - global *Same as corresponding *Same as corresponding + - wavefront - local atomicrmw acq_rel, atomicrmw acq_rel, + - workgroup - generic except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + atomicrmw seq_cst - agent - global *Same as corresponding *Same as corresponding + - system - generic atomicrmw acq_rel, atomicrmw acq_rel, + except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + fence seq_cst - singlethread *none* *Same as 
corresponding *Same as corresponding + - wavefront fence acq_rel, fence acq_rel, + - workgroup except must generated except must generated + - agent all instructions even all instructions even + - system for OpenCL.* for OpenCL.* + ============ ============ ============== ========== =============================== ==================================
 
 The memory order also adds the single thread optimization constraints defined
 in table
-:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
+:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`.
 
-  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
-     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
+  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10
+     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table
 
     ============ ==============================================================
     LLVM Memory  Optimization Constraints
 
@@ -4597,7 +5712,7 @@
 Assembler
 ---------
 
 AMDGPU backend has LLVM-MC based assembler which is currently in development.
-It supports AMDGCN GFX6-GFX9.
+It supports AMDGCN GFX6-GFX10.
 
 This section describes general syntax for instructions and operands.
 
@@ -4615,6 +5730,9 @@ Instructions
 
    AMDGPUInstructionSyntax
    AMDGPUInstructionNotation
 
+.. TODO
+   AMDGPUAsmGFX10
+
 An instruction has the following :doc:`syntax`:
 
 ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,... <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
 
@@ -4632,7 +5750,8 @@
 Note that features under development are not included in this description.
 
 For more information about instructions, their semantics and supported
 combinations of operands, refer to one of instruction set architecture manuals
-[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
+[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and
+[AMD-GCN-GFX10]_.
 
 Operands
 ~~~~~~~~
@@ -4929,16 +6048,24 @@
 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
 amd_kernel_code_t values that are unspecified a default value will be used.
 The default value for all keys is 0, with the following exceptions:
 
-- *kernel_code_version_major* defaults to 1.
-- *machine_kind* defaults to 1.
-- *machine_version_major*, *machine_version_minor*, and
-  *machine_version_stepping* are derived from the value of the -mcpu option
+- *amd_code_version_major* defaults to 1.
+- *amd_kernel_code_version_minor* defaults to 2.
+- *amd_machine_kind* defaults to 1.
+- *amd_machine_version_major*, *amd_machine_version_minor*, and
+  *amd_machine_version_stepping* are derived from the value of the -mcpu option
   that is passed to the assembler.
 - *kernel_code_entry_byte_offset* defaults to 256.
-- *wavefront_size* defaults to 6.
+- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
+  onwards defaults to 6 if target feature ``wavefrontsize64`` is enabled,
+  otherwise 5. Note that wavefront size is specified as a power of two, so a
+  value of **n** means a size of 2^ **n**.
+- *call_convention* defaults to -1.
 - *kernarg_segment_alignment*, *group_segment_alignment*, and
   *private_segment_alignment* default to 4. Note that alignments are specified
   as a power of 2, so a value of **n** means an alignment of 2^ **n**.
+- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
+  GFX10 onwards.
+- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
 
 The *.amd_kernel_code_t* directive must be placed immediately after the
 function label and before any instructions.
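The power-of-two encoding described above (a stored value of **n** denotes an actual size or alignment of 2^\ **n**) can be sketched as follows. This is an editor's illustration only; the helper name ``decode_pow2_field`` is hypothetical and not part of the assembler, and only the n-to-2^n rule comes from the text:

```python
# Sketch of the power-of-two encoding used by amd_kernel_code_t fields
# such as wavefront_size and the *_alignment keys: a stored value n
# denotes an actual size/alignment of 2**n.

def decode_pow2_field(n: int) -> int:
    """Decode a power-of-two encoded amd_kernel_code_t field value."""
    return 1 << n

# wavefront_size: 6 encodes 64 lanes (wave64); 5 encodes 32 lanes
# (the GFX10 default when ``wavefrontsize64`` is disabled).
print(decode_pow2_field(6))  # 64
print(decode_pow2_field(5))  # 32

# The *_alignment keys default to 4, i.e. a 16-byte alignment.
print(decode_pow2_field(4))  # 16
```
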
@@ -4976,9 +6103,9 @@ Here is an example of a minimal assembly source file, defining one HSA kernel: compute_pgm_rsrc1_vgprs = 0 compute_pgm_rsrc1_sgprs = 0 compute_pgm_rsrc2_user_sgpr = 2 - kernarg_segment_byte_size = 8 - wavefront_sgpr_count = 2 - workitem_vgpr_count = 3 + compute_pgm_rsrc1_wgp_mode = 0 + compute_pgm_rsrc1_mem_ordered = 0 + compute_pgm_rsrc1_fwd_progress = 1 .end_amd_kernel_code_t s_load_dwordx2 s[0:1], s[0:1] 0x0 @@ -5095,95 +6222,107 @@ terminated by an ``.end_amdhsa_kernel`` directive. .. table:: AMDHSA Kernel Assembler Directives :name: amdhsa-kernel-directives-table - ======================================================== ================ ============ =================== - Directive Default Supported On Description - ======================================================== ================ ============ =================== - ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX9 Controls GROUP_SEGMENT_FIXED_SIZE in - :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. - ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX9 Controls PRIVATE_SEGMENT_FIXED_SIZE in - :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. - ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in - :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. - ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_DISPATCH_PTR in - :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. - ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_QUEUE_PTR in - :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. - ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in - :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. - ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX9 Controls ENABLE_SGPR_DISPATCH_ID in - :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. 
- ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX9 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in - :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. - ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in - :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`. - ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_X in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_Y in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_Z in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_INFO in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX9 Controls ENABLE_VGPR_WORKITEM_ID in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - Possible values are defined in - :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. - ``.amdhsa_next_free_vgpr`` Required GFX6-GFX9 Maximum VGPR number explicitly referenced, plus one. - Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. - ``.amdhsa_next_free_sgpr`` Required GFX6-GFX9 Maximum SGPR number explicitly referenced, plus one. - Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. - ``.amdhsa_reserve_vcc`` 1 GFX6-GFX9 Whether the kernel may use the special VCC SGPR. - Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. 
- ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX9 Whether the kernel may use flat instructions to access - scratch memory. Used to calculate - GRANULATED_WAVEFRONT_SGPR_COUNT in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. - ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX9 Whether the kernel may trigger XNACK replay. - Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in - Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. + ======================================================== =================== ============ =================== + Directive Default Supported On Description + ======================================================== =================== ============ =================== + ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. 
+ ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in + Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + Specific + (-wavefrontsize64) + ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. + ``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one. + Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one. + Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR. 
+ Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access + scratch memory. Used to calculate + GRANULATED_WAVEFRONT_SGPR_COUNT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. + Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in + Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. (+xnack) - ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX9 Controls FLOAT_ROUND_MODE_32 in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. - Possible values are defined in - :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. - ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX9 Controls FLOAT_ROUND_MODE_16_64 in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. - Possible values are defined in - :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. - ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX9 Controls FLOAT_DENORM_MODE_32 in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. - Possible values are defined in - :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. - ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX9 Controls FLOAT_DENORM_MODE_16_64 in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. - Possible values are defined in - :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. - ``.amdhsa_dx10_clamp`` 1 GFX6-GFX9 Controls ENABLE_DX10_CLAMP in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. - ``.amdhsa_ieee_mode`` 1 GFX6-GFX9 Controls ENABLE_IEEE_MODE in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. - ``.amdhsa_fp16_overflow`` 0 GFX9 Controls FP16_OVFL in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. 
- ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in - :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. - ======================================================== ================ ============ =================== + ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. + ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. + ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. 
+ ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. + ``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in + Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + Specific + (-cumode) + ``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. 
+ ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ======================================================== =================== ============ ===================
 
 .amdgpu_metadata
 ++++++++++++++++
 
@@ -5334,6 +6473,9 @@ Additional Documentation
 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA `_
 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture `__
 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture `__
+.. [AMD-GCN-GFX10] AMD "Navi" Instruction Set Architecture *TBA*
+.. TODO
+   ttye Add link when made public.
 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing `__
 .. [AMD-ROCm-github] `ROCm github `__
 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation `__