llvm-mirror

mirror of https://github.com/RPCS3/llvm-mirror.git synced 2025-01-31 20:51:52 +01:00

Author	SHA1	Message	Date
Craig Topper	82f73ac58e	[X86] Copy the tuning features and scheduler model from pentium4/x86-64 to generic This is preparation for making clang default to -mtune=generic when no -march is specified. This will allow the default tuning to be "generic" even though our default march is "pentium4" or "x86-64". To avoid llc lit test regressions, if no mcpu is specified, I've defaulted tune to use i586 to match the old tuning settings of no CPU. Some tests explicitly used -mcpu=generic which I've removed so they instead get this default of architecture features from generic and tune from i586. I updated one llvm-mca test to check a different CPU since generic has a scheduler model now Differential Revision: https://reviews.llvm.org/D86312	2020-08-24 14:47:10 -07:00
Craig Topper	ee765be827	[X86] Add mayLoad/mayStore flags to some X87 instructions that don't have isel patterns to infer them from. Should remove part of the differences in D81833 due to some of these getting isel patterns.	2020-06-23 23:40:30 -07:00
David Green	680cf8ff46	[ARM] Mark more integer instructions as not having side effects. LDRD and STRD along with UBFX and SBFX are selected from DAGToDAG transforms, so do not have tblgen patterns. Without patterns to infer from, they are conservatively treated as having side effects, so cannot be scheduled as efficiently as you would like. This specifically marks them as not having side effects. Differential Revision: https://reviews.llvm.org/D82358	2020-06-23 22:45:51 +01:00
David Green	b31f621535	[ARM] Cortex-M4 integer instructions scheduler info test. NFC Most useful at the moment for showing where unpredictable instructions are.	2020-06-23 22:26:23 +01:00
Wang, Pengfei	48ab1244bc	[X86][llvm-mc] Make the suffix matcher more accurate. Summary: Some instructions, like VPMULDQ, are NOT variants of VPMULD but new instructions. So we should make sure the suffix matcher only works for memory variants that have the same size as the suffix. Currently we only check for SSE/AVX* instructions, because many legacy instructions didn't declare the alias instructions of their variants. Differential Revision: https://reviews.llvm.org/D80608	2020-05-27 14:45:17 +08:00
Andrea Di Biagio	79ed672fa3	[MCA][InstrBuilder] Correctly mark reserved resources in initializeUsedResources. This fixes a bug reported by Alex Renda on LLVMDev where mca did not correctly mark a resource group as "reserved". (See http://lists.llvm.org/pipermail/llvm-dev/2020-May/141485.html). The issue was caused by a wrong check in function `initializeUsedResources`. As a consequence of this, a resource group was left unreserved, and its field `NumUnits` incorrectly reported an unrealistic number of consumed resource units. This patch fixes the issue with the handling of reserved resources in the InstrBuilder class, and adds a simple test for it. Ideally, as suggested by Andy Trick, most of these problems will disappear if in the future we introduce an (optional) DelayCycles vector for SchedWriteRes.	2020-05-10 19:25:54 +01:00
Craig Topper	f02fd28101	[X86] Remove the mayLoad and mayStore flags from vzeroupper/vzeroall. But leave the hasUnmodelledSideEffects flag.	2020-05-08 12:47:20 -07:00
Andrea Di Biagio	3f70cc7e72	Forgot to add a -mtriple to a test. NFC This should unbreak the clang-ppc64be-linux buildbot.	2020-05-05 10:48:00 +01:00
Andrea Di Biagio	52f56e2249	[MCA] Fixed a bug where loads and stores were sometimes incorrectly marked as dependent. Fixes PR45793. This fixes a regression introduced by a very old commit 280ac1fd1dc35 (was llvm-svn 361950). Commit 280ac1fd1dc35 redesigned the logic in the LSUnit with the goal of speeding up isReady() queries, and stabilising the LSUnit API (while also making the load store unit more customisable). The concept of MemoryGroup (effectively an alias set) was added by that commit to better describe and track dependencies between memory operations. However, that concept was not just used for alias dependencies, but it was also used for describing memory "order" dependencies (enforced by the memory consistency model). Instructions of the same memory group were considered "equivalent" as in: independent operations that can potentially execute in parallel. The problem was that the cost of a dependency (in terms of number of cycles) should have been different for "order" dependency. Instructions in an order dependency simply have to wait until their predecessors are "issued" to an underlying pipeline (rather than having to wait until predecessors have been fully executed). For simple "order" dependencies, this was effectively introducing an artificial delay on the "issue" of independent loads and stores. This patch fixes the issue and adds a new test named 'independent-load-stores.s' to a bunch of x86 targets. That test contains the reproducible posted by Fabian Ritter on PR45793. I had to rerun the update-mca-tests script on several files. To avoid expected regressions on some Exynos tests, I have added a -noalias=false flag (to match the old strict behavior on latencies). Some tests for processor Barcelona are improved/fixed by this change and they now show better results. In a few tests we were incorrectly counting the time spent by instructions in a scheduler queue. In one case in particular we now correctly see a store executed out of order. That test was affected by the same underlying issue reported as PR45793. Reviewers: mattd Differential Revision: https://reviews.llvm.org/D79351	2020-05-05 10:25:36 +01:00
Georgii Rymar	463ae4125d	[tools][tests] - Use --check-prefixes instead of multiple --check-prefix. NFCI. There is no need to use `--check-prefix` multiple times. It helps to improve readability/test maintainability. This patch does it for all tools at once. Differential revision: https://reviews.llvm.org/D78217	2020-04-17 12:35:25 +03:00
Craig Topper	ed198c45c9	[X86] Match vpmullq latency to uops.info. Correct port usage for 512-bit memory form uops.info says these should be 15 cycle instructions. Uops.info also shows the 512-bit form uses port 0 and 5 for both register and memory. We had memory using 0 and 1. Differential Revision: https://reviews.llvm.org/D75549	2020-03-03 12:16:03 -08:00
Craig Topper	11c8ae2f75	[X86] Increase latency of port5 masked compares and kshift/kadd/kunpck instructions in SKX scheduler model Uops.info shows these as 4 cycle latency.	2020-02-16 16:59:37 -08:00
Craig Topper	04d909f9da	[X86] Add more avx512 instructions to llvm-mca resource tests	2020-02-16 16:59:36 -08:00
Craig Topper	341faaf09a	[X86] Raise the latency for VectorImul from 4 to 5 in Skylake scheduler models Based on uops.info these should have 5 cycle latency as they did on Haswell/Broadwell. I have no additional internal information from Intel. This was also shown as a discrepancy in the spreadsheet that was sent with an early llvm-dev post about llvm-exegesis. It also matches Agner Fog. Differential Revision: https://reviews.llvm.org/D74357	2020-02-11 11:24:25 -08:00
Craig Topper	f4ddf70574	[X86] Improve the gather scheduler models for SkylakeClient and SkylakeServer The load ports need a cycle for each potentially loaded element just like Haswell and Broadwell. Unlike Haswell and Broadwell, the number of uops does not scale with the number of elements. Instead the load uops run for multiple cycles. I've taken the latency number from uops.info. The port binding for the non-load uops is taken from the original IACA data I have. Differential Revision: https://reviews.llvm.org/D74000	2020-02-05 13:26:47 -08:00
Simon Pilgrim	b6af7e47c1	[X86] Fix missing load latencies (PR36894) We weren't accounting for load latencies in the SSE42/AES/CLMUL schedule classes	2020-02-05 11:53:16 +00:00
Simon Pilgrim	dea8bc5779	[X86] Fix missing load latencies (PR36894) We weren't accounting for load latencies in the SSE42/AES/CLMUL schedule classes	2020-02-04 18:18:29 +00:00
Craig Topper	3c613b53f1	[X86] Update the haswell and broadwell scheduler information for gather instructions Broadwell was missing half the gather instructions. Both models had some mixups in the resource costs and number of uops. I've updated here based on what I think the original IACA source says with some cross checking against the microcode. I'm not sure about latency as the IACA source I have doesn't have that information. So I'm using the latency from uops.info. I plan to update Skylake models as well, but I'll do that in a separate patch. Differential Revision: https://reviews.llvm.org/D73844	2020-02-03 17:57:48 -08:00
Clement Courbet	56c810e79e	[X86][Sched] A bunch of fixes to the Zen2 sched model latencies. Summary: As determined with `llvm-exegesis`. Some of these look like typos/misunderstandings of the sched model td spec: - latency defaults to `1` when not set => Maybe we can avoid having a default? - problems with regexps not being anchored by default (XCHG matching CMPXCHG) Note that this is not complete, it fixes only the most obvious mistakes, and only for latency (not uops). Reviewers: RKSimon, GGanesh Subscribers: hiraditya, jfb, mstojanovic, hfinkel, craig.topper, andreadb, lebedev.ri, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D73172	2020-01-30 10:20:31 +01:00
Roman Lebedev	daa0abe865	[X86][BdVer2] Polish LEA instruction scheduling info Based on exhaustive llvm-exegesis measurements. There may still be some imperfections for LEA16r/LEA32r. Much like was observed in D68646, I'm also measuring some outliers with some specific registers.	2020-01-26 22:17:27 +03:00
Roman Lebedev	399f253bb6	[NFC][MCA] Re-autogenerate all check lines in all X86 MCA tests Some whitespace issues have crept in, and some znver2 check lines were missing.	2020-01-26 22:17:26 +03:00
Clement Courbet	4267e3f97c	[llvm-mca][NFC] Regenerate tests @HEAD. For Zen2.	2020-01-22 14:50:52 +01:00
Diogo Sampaio	69646a28e6	[ARM][Thumb2] Fix ADD/SUB invalid writes to SP Summary: This patch fixes pr23772 [ARM] r226200 can emit illegal thumb2 instruction: "sub sp, r12, #80". The violation was that SUB and ADD (reg, immediate) instructions can only write to SP if the source register is also SP. So the above instruction was unpredictable. To enforce that the instruction t2(ADD\|SUB)ri does not write to SP we now enforce the destination register to be rGPR (which excludes PC and SP). Unlike the ARM specification, which defines one instruction that can read from SP and one that can't, here we inserted one that can't write to SP and another that can only write to SP, so as to reuse most of the hard-coded size optimizations. When performing this change, it uncovered that emitting Thumb2 Reg plus Immediate could not emit all variants of ADD SP, SP #imm instructions before so it was refactored to be able to. (see test/CodeGen/Thumb2/mve-stacksplot.mir where we use a subw sp, sp, Imm12 variant) It also uncovered a disassembly issue of adr.w instructions, that were only written as SUBW instructions (see llvm/test/MC/Disassembler/ARM/thumb2.txt). Reviewers: eli.friedman, dmgreen, carwil, olista01, efriedma, andreadb Reviewed By: efriedma Subscribers: gbedwell, john.brawn, efriedma, ostannard, kristof.beyls, hiraditya, dmgreen, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D70680	2020-01-14 11:47:19 +00:00
Ganesh Gopalasubramanian	a887b33e36	[X86] AMD Znver2 (Rome) Scheduler enablement The patch gives out the details of the znver2 scheduler model. There are a few improvements with respect to execution units, latencies and throughput when compared with znver1. The tests that were present for znver1 for llvm-mca tool were replicated. The latencies, execution units, timeline and throughput information are updated for znver2. Reviewers: craig.topper, Simon Pilgrim Differential Revision: https://reviews.llvm.org/D66088	2020-01-10 00:44:59 +05:30
Evandro Menezes	9e85d58e98	[MCA] Fix test cases (NFC) Fix the test cases for Exynos M5 that break under Darwin.	2019-11-22 16:19:58 -06:00
Evandro Menezes	db1708480a	[AArch64] Add the pipeline model for Exynos M5 Add the scheduling and cost models for Exynos M5.	2019-11-22 15:09:17 -06:00
Eric Christopher	3d68f1e222	Revert "[AArch64] Add the pipeline model for Exynos M5" as it's causing test failures in llvm-mca. This reverts commit 9bdfee2a3bd13d405ce1592930182f23849d2897.	2019-11-20 16:04:52 -08:00
Evandro Menezes	4ab9ed6b8f	[AArch64] Add the pipeline model for Exynos M5 Add the scheduling and cost models for Exynos M5.	2019-11-20 16:56:07 -06:00
Simon Pilgrim	7d4b7cf3b9	[X86] Fix SLM v2i64 ADD/Sub/CMPEQ instruction schedules Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2i64 ops. Numbers taken from Intel AOM (+ checked against Agner)	2019-11-06 19:08:15 +00:00
Simon Pilgrim	02871604f5	[X86] Fix SLM v2f64 ADD/MUL + FP BLEND/HADD instruction schedules Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2f64/v2i64 ops.	2019-11-06 19:08:15 +00:00
Evandro Menezes	d9e3dc7008	[mca] Fix test case (NFC) Fix test case for Darwin builds.	2019-10-31 16:44:52 -05:00
Evandro Menezes	8cd41b3ebd	[AArch64] Update for Exynos Fix the costs of `add` and `orr` with an immediate operand.	2019-10-31 15:25:22 -05:00
Evandro Menezes	7348b241c1	[clang][llvm] Obsolete Exynos M1 and M2	2019-10-30 15:02:59 -05:00
Andrea Di Biagio	4a9308eaf8	[X86][BtVer2] Improved latency and throughput of float/vector loads and stores. This patch introduces the following changes to the btver2 scheduling model: - The number of micro opcodes for YMM loads and stores is now 2 (it was incorrectly set to 1 for both aligned and misaligned loads/stores). - Increased the number of AGU resource cycles for YMM loads and stores to 2cy (instead of 1cy). - Removed JFPU01 and JFPX from the list of resources consumed by pure float/vector loads (no MMX). I verified with llvm-exegesis that pure XMM/YMM loads are no-pipe. Those are dispatched to the FPU but not really issued on JFPU01. Differential Revision: https://reviews.llvm.org/D68871 llvm-svn: 374765	2019-10-14 11:12:18 +00:00
Roman Lebedev	f72d6fc559	[MCA] Show aggregate over Average Wait times for the whole snippet (PR43219) Summary: As discussed in https://bugs.llvm.org/show_bug.cgi?id=43219, I believe it may be somewhat useful to show //some// aggregates over all the sea of statistics provided. Example: ``` Average Wait times (based on the timeline view): [0]: Executions [1]: Average time spent waiting in a scheduler's queue [2]: Average time spent waiting in a scheduler's queue while ready [3]: Average time elapsed from WB until retire stage [0] [1] [2] [3] 0. 3 1.0 1.0 4.7 vmulps %xmm0, %xmm1, %xmm2 1. 3 2.7 0.0 2.3 vhaddps %xmm2, %xmm2, %xmm3 2. 3 6.0 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 3 3.2 0.3 2.3 <total> ``` I.e. we average the averages. Reviewers: andreadb, mattd, RKSimon Reviewed By: andreadb Subscribers: gbedwell, arphaman, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68714 llvm-svn: 374361	2019-10-10 14:46:21 +00:00
Andrea Di Biagio	13160fb6a6	[MCA][LSUnit] Track loads and stores until retirement. Before this patch, loads and stores were only tracked by their corresponding queues in the LSUnit from dispatch until execute stage. In practice we should be more conservative and assume that memory opcodes leave their queues at retirement stage. Basically, loads should leave the load queue only when they have completed and delivered their data. We conservatively assume that a load is completed when it is retired. Stores should be tracked by the store queue from dispatch until retirement. In practice, stores can only leave the store queue if their data can be written to the data cache. This is mostly a mechanical change. With this patch, the retire stage notifies the LSUnit when a memory instruction is retired. That triggers the release of LDQ/STQ entries. The only visible change is in memory tests for the bdver2 model. That is because bdver2 is the only model that defines the load/store queue size. This patch partially addresses PR39830. Differential Revision: https://reviews.llvm.org/D68266 llvm-svn: 374034	2019-10-08 10:46:01 +00:00
David Green	3bac3332a1	[llvm-mca] Add a -mattr flag This adds a -mattr flag to llvm-mca, for cases where the -mcpu option does not contain all optional features. Differential Revision: https://reviews.llvm.org/D68190 llvm-svn: 373358	2019-10-01 17:41:38 +00:00
Andrea Di Biagio	935e89a564	[MCA] Improved cost computation for loop carried dependencies in the bottleneck analysis. This patch introduces a cut-off threshold for dependency edge frequencies with the goal of simplifying the critical sequence computation. This patch also removes the cost normalization for loop carried dependencies. We didn't really need to artificially amplify the cost of loop-carried dependencies since it is already computed as the integral over time of the delay (in cycles). In the absence of backend stalls there is no need for computing a critical sequence. With this patch we early exit from the critical sequence computation if no bottleneck was reported during the simulation. llvm-svn: 372337	2019-09-19 16:05:11 +00:00
Andrea Di Biagio	08b9b0a5ac	[X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions. On BtVer2 conditional SIMD stores are heavily microcoded. The latency is directly proportional to the number of packed elements extracted from the input vector. Also, according to micro-benchmarks, most of the computation seems to be done in the integer unit. Only a minority of the uOPs are executed by the FPU. The observed behaviour on the FPU looks similar to this: - The input MASK value is moved to the Integer Unit -- [ a VMOVMSK-like uOP-executed on JFPU0]. - In parallel, each element of the input XMM/YMM is extracted and then sent to the IntegerUnit through JFPU1. As expected, a (conditional) store is executed for every extracted element. Interestingly, a (speculative) load is executed for every extracted element too. It is as if a "LOAD - BIT_EXTRACT- CMOV" sequence of uOPs is repeated by the integer unit for every conditionally stored element. VMASKMOVDQU is a special case: the number of speculative loads is always 2 (presumably, one load per quadword). That means extra shifts and masking are performed on (one of) the loaded quadwords before each conditional store (that also explains the big number of non-FP uOPs retired). This patch replaces the existing writes for conditional SIMD stores (i.e. WriteFMaskedStore, and WriteFMaskedStoreY) with the following new writes: WriteFMaskedStore32 [ XMM Packed Single ] WriteFMaskedStore32Y [ YMM Packed Single ] WriteFMaskedStore64 [ XMM Packed Double ] WriteFMaskedStore64Y [ YMM Packed Double ] Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe both RM and MR variants for conditional SIMD moves in a single tablegen definition. Instances of that class are then passed in input to multiclass avx_movmask_rm when constructing MASKMOVPS/PD definitions. Since this patch introduces new writes, I had to update all the X86 scheduling models. Differential Revision: https://reviews.llvm.org/D66801 llvm-svn: 370649	2019-09-02 12:32:28 +00:00
Andrea Di Biagio	56ca3f6977	[X86][BtVer2] Add a read-advance to every implicit register use of CMPXCHG8B/16B. This is a follow up of r369642. This patch assigns a ReadAfterLd to every implicit register use of instruction CMPXCHG8B and instruction CMPXCHG16B. Perf micro-benchmarks show that implicit registers are read after 3cy from the start of execution. llvm-svn: 369750	2019-08-23 12:19:45 +00:00
Andrea Di Biagio	25ee718e07	[X86][BtVer2] Fix latency of ALU RMW instructions. Excluding ADC/SBB and the bit-test instructions (BTR/BTS/BTC), the observed latency of all other RMW integer arithmetic/logic instructions is 6cy and not 5cy. Example (ADD): ``` addb $0, (%rsp) # Latency: 6cy addb $7, (%rsp) # Latency: 6cy addb %sil, (%rsp) # Latency: 6cy addw $0, (%rsp) # Latency: 6cy addw $511, (%rsp) # Latency: 6cy addw %si, (%rsp) # Latency: 6cy addl $0, (%rsp) # Latency: 6cy addl $511, (%rsp) # Latency: 6cy addl %esi, (%rsp) # Latency: 6cy addq $0, (%rsp) # Latency: 6cy addq $511, (%rsp) # Latency: 6cy addq %rsi, (%rsp) # Latency: 6cy ``` The same latency profile applies to SUB/AND/OR/XOR/INC/DEC. The observed latency of ADC/SBB is 7-8cy. So we need a different write to model those. Latency of BTS/BTR/BTC is not fixed by this patch (they are much slower than what the model for btver2 currently reports). Differential Revision: https://reviews.llvm.org/D66636 llvm-svn: 369748	2019-08-23 11:34:10 +00:00
Andrea Di Biagio	38b78fd66c	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions. Single operand MUL instructions that implicitly set EAX have the following latency/throughput profile (see below): imul %cl # latency: 3cy - uOPs: 1 - 1 JMul imul %cx # latency: 3cy - uOPs: 3 - 3 JMul imul %ecx # latency: 3cy - uOPs: 2 - 2 JMul imul %rcx # latency: 6cy - uOPs: 2 - 4 JMul mul %cl # latency: 3cy - uOPs: 1 - 1 JMul mul %cx # latency: 3cy - uOPs: 3 - 3 JMul mul %ecx # latency: 3cy - uOPs: 2 - 2 JMul mul %rcx # latency: 6cy - uOPs: 2 - 4 JMul Excluding the 64bit variant, which has a latency of 6cy, every other instruction has a latency of 3cy. However, the number of decoded macro-opcodes (as well as the resource cycles) depends on the MUL size. The two operand MULs have a more predictable profile (see below): imul %dx, %dx # latency: 3cy - uOPs: 1 - 1 JMul imul %edx, %edx # latency: 3cy - uOPs: 1 - 1 JMul imul %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul imul $3, %dx, %dx # latency: 4cy - uOPs: 2 - 2 JMul imul $3, %ecx, %ecx # latency: 3cy - uOPs: 1 - 1 JMul imul $3, %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul This patch updates the values in the Jaguar scheduling model and regenerates llvm-mca tests. Differential Revision: https://reviews.llvm.org/D66547 llvm-svn: 369661	2019-08-22 15:20:16 +00:00
Andrea Di Biagio	fae9f9f261	[X86][BtVer2] Fix latency and throughput of XCHG and XADD. On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes. Maximum throughput for XCHG is 1 IPC. The byte exchange has worse latency and decodes to 1 extra uOP; maximum observed throughput is 0.5 IPC. ``` xchgb %cl, %dl # Latency: 2cy - uOPs: 3 - 2 ALU xchgw %cx, %dx # Latency: 1cy - uOPs: 2 - 2 ALU xchgl %ecx, %edx # Latency: 1cy - uOPs: 2 - 2 ALU xchgq %rcx, %rdx # Latency: 1cy - uOPs: 2 - 2 ALU ``` The reg-mem forms of XCHG are atomic operations with an observed latency of 16cy. The resource usage is similar to the XCHGrr variants. The biggest difference is obviously the bus-locking, which prevents the LS from issuing other memory uOPs in parallel until the unlocking store uOP is executed. ``` xchgb %cl, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgw %cx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgl %ecx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgq %rcx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy ``` The exchanged in/out register operand becomes available after 11cy from the start of execution. Added test xchg.s to verify that we correctly see that register write committed in 11cy (and not 16cy). Reg-reg XADD instructions have the same latency/throughput as the byte exchange (register-register variant). ``` xaddb %cl, %dl # latency: 2cy - uOPs: 3 - 3 ALU xaddw %cx, %dx # latency: 2cy - uOPs: 3 - 3 ALU xaddl %ecx, %edx # latency: 2cy - uOPs: 3 - 3 ALU xaddq %rcx, %rdx # latency: 2cy - uOPs: 3 - 3 ALU ``` The non-atomic RM variants have a latency of 11cy, and decode to 4 macro-opcodes. They still consume 2 ALU pipes, and the exchange in/out register operand becomes available in 3cy (it matches the 'load-to-use latency'). ``` xaddb %cl, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddw %cx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddl %ecx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddq %rcx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU ``` The atomic XADD variants execute in 16cy. The in/out register operand is available after 11cy from the start of execution. ``` lock xaddb %cl, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddw %cx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddl %ecx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddq %rcx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy ``` Added test xadd.s to verify those latencies as well as read-advance values. Differential Revision: https://reviews.llvm.org/D66535 llvm-svn: 369642	2019-08-22 11:32:47 +00:00
Andrea Di Biagio	60bf5e7c65	[X86][BtVer2] Use ReadAfterLd entries for the register operands of CMPXCHG. This is a follow-up of r369365. llvm-svn: 369412	2019-08-20 17:05:56 +00:00
Andrea Di Biagio	fd00d5a846	[X86][BtVer2] Fix latency and throughput of atomic INC/DEC/NEG/NOT. Latency and throughput of LOCK INC/DEC/NEG/NOT is always 19cy. Number of uOPs is still 1. Differential Revision: https://reviews.llvm.org/D66469 llvm-svn: 369388	2019-08-20 14:31:27 +00:00
Simon Pilgrim	aa50d0d398	[MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops D66424 adds the base support for LOCK so we should be able to add special case support for all these cases in future patches llvm-svn: 369367	2019-08-20 11:13:20 +00:00
Andrea Di Biagio	d5dd3f579a	[X86][Btver2] Fix latency and throughput of CMPXCHG instructions. On Jaguar, CMPXCHG has a latency of 11cy, and a maximum throughput of 0.33 IPC. Throughput is capped at 0.33 because of the implicit in/out dependency on register EAX. In the case of repeated non-atomic CMPXCHG with the same memory location, store-to-load forwarding occurs and values for subsequent loads are quickly forwarded from the store buffer. Interestingly, the functionality in LLVM that computes the reciprocal throughput doesn't seem to know about RMW instructions. That functionality only looks at the "consumed resource cycles" for the throughput computation. It should be fixed/improved by a future patch. In particular, for RMW instructions, that logic should also take into account the write latency of in/out register operands. An atomic CMPXCHG has a latency of ~17cy. Throughput is also limited to ~17cy/inst due to cache locking, which prevents other memory uOPs from starting to execute before the "lock releasing" store uOP. CMPXCHG8rr and CMPXCHG8rm are treated specially because they decode to one less macro opcode. Their latency tends to be the same as the other RR/RM variants. RR variants are relatively fast at 3cy (but still microcoded - 5 macro opcodes). CMPXCHG8B is 11cy and unfortunately doesn't seem to benefit from store-to-load forwarding. That means throughput is clearly limited by the in/out dependency on GPR registers. The uOP composition is sadly unknown (due to the lack of PMCs for the Integer pipes). I have reused the same mix of consumed resource from the other CMPXCHG instructions for CMPXCHG8B too. LOCK CMPXCHG8B is instead 18cycles. CMPXCHG16B is 32cycles. Up to 38cycles when the LOCK prefix is specified. Due to the in/out dependencies, throughput is limited to 1 instruction every 32 (or 38) cycles depending on whether the LOCK prefix is specified or not. I wouldn't be surprised if the microcode for CMPXCHG16B is similar to 2x microcode from CMPXCHG8B. So, I have speculatively set the JALU01 consumption to 2x the resource cycles used for CMPXCHG8B. The two new hasLockPrefix() functions are used by the btver2 scheduling model to check if an MCInst/MachineInst has a LOCK prefix. Calls to hasLockPrefix() have been encoded in predicates of variant scheduling classes that describe lat/thr of CMPXCHG. Differential Revision: https://reviews.llvm.org/D66424 llvm-svn: 369365	2019-08-20 10:23:55 +00:00
Andrea Di Biagio	7f44363da1	[X86] Move scheduling tests for CMPXCHG to the corresponding resources-x86_64.s files. NFC In D66424 it has been requested to move all the new tests added by r369278 into resources-x86_64.s. That is because only the 8b/16 ops should be tested by resources-cmpxchg.s. This partially reverts r369278. llvm-svn: 369288	2019-08-19 18:20:30 +00:00
Andrea Di Biagio	f2f9d97508	[X86] Added extensive scheduling model tests for all the CMPXCHG variants. NFC Addresses a review comment in D66424 llvm-svn: 369279	2019-08-19 17:07:26 +00:00
Andrea Di Biagio	9bbf3a5aeb	[MCA] Add flag -show-encoding to llvm-mca. Flag -show-encoding enables the printing of instruction encodings as part of the instruction info view. Example (with flags -mtriple=x86_64-- -mcpu=btver2): Instruction Info: [1]: #uOps [2]: Latency [3]: RThroughput [4]: MayLoad [5]: MayStore [6]: HasSideEffects (U) [7]: Encoding Size [1] [2] [3] [4] [5] [6] [7] Encodings: Instructions: 1 2 1.00 4 c5 f0 59 d0 vmulps %xmm0, %xmm1, %xmm2 1 4 1.00 4 c5 eb 7c da vhaddps %xmm2, %xmm2, %xmm3 1 4 1.00 4 c5 e3 7c e3 vhaddps %xmm3, %xmm3, %xmm4 In this example, column Encoding Size is the size in bytes of the instruction encoding. Column Encodings reports the actual instruction encodings as byte sequences in hex (objdump style). The computation of encodings is done by a utility class named mca::CodeEmitter. In future, I plan to expose the CodeEmitter to the instruction builder, so that information about instruction encoding sizes can be used by the simulator. That would be a first step towards simulating the throughput from the decoders in the hardware frontend. Differential Revision: https://reviews.llvm.org/D65948 llvm-svn: 368432	2019-08-09 11:26:27 +00:00
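
The `<total>` row in the example from commit f72d6fc559 (D68714) is described there as "we average the averages". A minimal sketch of that arithmetic follows, using only the three per-instruction rows quoted in that commit message. The function name `total_row` and the tuple layout are hypothetical and chosen for illustration; this is not the llvm-mca implementation, and it assumes every instruction has the same execution count, as in the example.

```python
# Per-instruction rows copied from the example in commit f72d6fc559:
# (executions, avg wait in queue, avg wait while ready, avg WB-to-retire)
rows = [
    (3, 1.0, 1.0, 4.7),  # vmulps  %xmm0, %xmm1, %xmm2
    (3, 2.7, 0.0, 2.3),  # vhaddps %xmm2, %xmm2, %xmm3
    (3, 6.0, 0.0, 0.0),  # vhaddps %xmm3, %xmm3, %xmm4
]


def total_row(rows):
    """Average each wait-time column across instructions (an average of averages).

    Hypothetical helper: assumes all rows share the same execution count.
    """
    n = len(rows)
    executions = rows[0][0]
    means = [round(sum(r[i] for r in rows) / n, 1) for i in (1, 2, 3)]
    return (executions, *means)


print(total_row(rows))  # (3, 3.2, 0.3, 2.3), matching the <total> row
```

Each column is averaged over instructions rather than over individual executions; the two coincide here only because every instruction executes three times.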

418 Commits