llvm-mirror

mirror of https://github.com/RPCS3/llvm-mirror.git synced 2025-02-01 05:01:59 +01:00

Author	SHA1	Message	Date
Craig Topper	ed198c45c9	[X86] Match vpmullq latency to uops.info. Correct port usage for 512-bit memory form uops.info says these should be 15 cycle instructions. Uops.info also shows the 512-bit form uses port 0 and 5 for both register and memory. We had memory using 0 and 1. Differential Revision: https://reviews.llvm.org/D75549	2020-03-03 12:16:03 -08:00
Craig Topper	11c8ae2f75	[X86] Increase latency of port5 masked compares and kshift/kadd/kunpck instructions in SKX scheduler model Uops.info shows these as 4 cycle latency.	2020-02-16 16:59:37 -08:00
Craig Topper	04d909f9da	[X86] Add more avx512 instructions to llvm-mca resource tests	2020-02-16 16:59:36 -08:00
Craig Topper	341faaf09a	[X86] Raise the latency for VectorImul from 4 to 5 in Skylake scheduler models Based on uops.info these should have 5 cycle latency as they did on Haswell/Broadwell. I have no additional internal information from Intel. This was also shown as a discrepancy in the spreadsheet that was sent with an early llvm-dev post about llvm-exegesis. It also matches Agner Fog. Differential Revision: https://reviews.llvm.org/D74357	2020-02-11 11:24:25 -08:00
Craig Topper	f4ddf70574	[X86] Improve the gather scheduler models for SkylakeClient and SkylakeServer The load ports need a cycle for each potentially loaded element just like Haswell and Skylake. Unlike Haswell and Broadwell, the number of uops does not scale with the number of elements. Instead the load uops run for multiple cycles. I've taken the latency number from the uops.info. The port binding for the non-load uops is taken from the original IACA data I have. Differential Revision: https://reviews.llvm.org/D74000	2020-02-05 13:26:47 -08:00
Simon Pilgrim	b6af7e47c1	[X86] Fix missing load latencies (PR36894) We weren't accounting for load latencies in the SSE42/AES/CLMUL schedule classes	2020-02-05 11:53:16 +00:00
Simon Pilgrim	dea8bc5779	[X86] Fix missing load latencies (PR36894) We weren't accounting for load latencies in the SSE42/AES/CLMUL schedule classes	2020-02-04 18:18:29 +00:00
Craig Topper	3c613b53f1	[X86] Update the haswell and broadwell scheduler information for gather instructions Broadwell was missing half the gather instructions. Both models had some mixups in the resource costs and number of uops. I've updated here based on what I think the original IACA source says with some cross checking against the microcode. I'm not sure about latency as the IACA source I have doesn't have that information. So I'm using the latency from uops.info. I plan to update Skylake models as well, but I'll do that in a separate patch. Differential Revision: https://reviews.llvm.org/D73844	2020-02-03 17:57:48 -08:00
Clement Courbet	56c810e79e	[X86][Sched] A bunch of fixes to the Zen2 sched model latencies. Summary: As determined with `llvm-exegesis`. Some of these look like typos/misunderstandings of the sched model td spec: - latency defaults to `1` when not set => Maybe we can avoid having a default ? - problems with regexps not being anchored by default (XCHG matching CMPXCHG) Note that this is not complete, it fixes only the most obvious mistakes, and only for latency (not uops). Reviewers: RKSimon, GGanesh Subscribers: hiraditya, jfb, mstojanovic, hfinkel, craig.topper, andreadb, lebedev.ri, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D73172	2020-01-30 10:20:31 +01:00
Roman Lebedev	daa0abe865	[X86][BdVer2] Polish LEA instruction scheduling info Based on exhaustive llvm-exegesis measurements. There may still be some imperfections for LEA16r/LEA32r. Much like was observed in D68646, i'm also measuring some outliers with some specific registers.	2020-01-26 22:17:27 +03:00
Roman Lebedev	399f253bb6	[NFC][MCA] Re-autogenerate all check lines in all X86 MCA tests Some whitespace issues have crept in, and some znver2 check lines were missing.	2020-01-26 22:17:26 +03:00
Clement Courbet	4267e3f97c	[llvm-mca][NFC] Regenerate tests @HEAD. For Zen2.	2020-01-22 14:50:52 +01:00
Diogo Sampaio	69646a28e6	[ARM][Thumb2] Fix ADD/SUB invalid writes to SP Summary: This patch fixes pr23772 [ARM] r226200 can emit illegal thumb2 instruction: "sub sp, r12, #80". The violation was that SUB and ADD (reg, immediate) instructions can only write to SP if the source register is also SP. So the above instruction was unpredictable. To enforce that the instruction t2(ADD\|SUB)ri does not write to SP we now enforce the destination register to be rGPR (which excludes PC and SP). Different than the ARM specification, that defines one instruction that can read from SP, and one that can't, here we inserted one that can't write to SP, and other that can only write to SP as to reuse most of the hard-coded size optimizations. When performing this change, it uncovered that emitting Thumb2 Reg plus Immediate could not emit all variants of ADD SP, SP #imm instructions before so it was refactored to be able to. (see test/CodeGen/Thumb2/mve-stacksplot.mir where we use a subw sp, sp, Imm12 variant ) It also uncovered a disassembly issue of adr.w instructions, that were only written as SUBW instructions (see llvm/test/MC/Disassembler/ARM/thumb2.txt). Reviewers: eli.friedman, dmgreen, carwil, olista01, efriedma, andreadb Reviewed By: efriedma Subscribers: gbedwell, john.brawn, efriedma, ostannard, kristof.beyls, hiraditya, dmgreen, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D70680	2020-01-14 11:47:19 +00:00
Ganesh Gopalasubramanian	a887b33e36	[X86] AMD Znver2 (Rome) Scheduler enablement The patch gives out the details of the znver2 scheduler model. There are a few improvements with respect to execution units, latencies and throughput when compared with znver1. The tests that were present for znver1 for llvm-mca tool were replicated. The latencies, execution units, timeline and throughput information are updated for znver2. Reviewers: craig.topper, Simon Pilgrim Differential Revision: https://reviews.llvm.org/D66088	2020-01-10 00:44:59 +05:30
Evandro Menezes	9e85d58e98	[MCA] Fix test cases (NFC) Fix the test cases for Exynos M5 that break under Darwin.	2019-11-22 16:19:58 -06:00
Evandro Menezes	db1708480a	[AArch64] Add the pipeline model for Exynos M5 Add the scheduling and cost models for Exynos M5.	2019-11-22 15:09:17 -06:00
Eric Christopher	3d68f1e222	Revert "[AArch64] Add the pipeline model for Exynos M5" as it's causing test failures in llvm-mca. This reverts commit 9bdfee2a3bd13d405ce1592930182f23849d2897.	2019-11-20 16:04:52 -08:00
Evandro Menezes	4ab9ed6b8f	[AArch64] Add the pipeline model for Exynos M5 Add the scheduling and cost models for Exynos M5.	2019-11-20 16:56:07 -06:00
Simon Pilgrim	7d4b7cf3b9	[X86] Fix SLM v2i64 ADD/Sub/CMPEQ instruction schedules Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2i64 ops. Numbers taken from Intel AOM (+ checked against Agner)	2019-11-06 19:08:15 +00:00
Simon Pilgrim	02871604f5	[X86] Fix SLM v2f64 ADD/MUL + FP BLEND/HADD instruction schedules Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2f64/v2i64 ops.	2019-11-06 19:08:15 +00:00
Evandro Menezes	d9e3dc7008	[mca] Fix test case (NFC) Fix test case for Darwin builds.	2019-10-31 16:44:52 -05:00
Evandro Menezes	8cd41b3ebd	[AArch64] Update for Exynos Fix the costs of `add` and `orr` with an immediate operand.	2019-10-31 15:25:22 -05:00
Evandro Menezes	7348b241c1	[clang][llvm] Obsolete Exynos M1 and M2	2019-10-30 15:02:59 -05:00
Andrea Di Biagio	4a9308eaf8	[X86][BtVer2] Improved latency and throughput of float/vector loads and stores. This patch introduces the following changes to the btver2 scheduling model: - The number of micro opcodes for YMM loads and stores is now 2 (it was incorrectly set to 1 for both aligned and misaligned loads/stores). - Increased the number of AGU resource cycles for YMM loads and stores to 2cy (instead of 1cy). - Removed JFPU01 and JFPX from the list of resources consumed by pure float/vector loads (no MMX). I verified with llvm-exegesis that pure XMM/YMM loads are no-pipe. Those are dispatched to the FPU but not really issued on JFPU01. Differential Revision: https://reviews.llvm.org/D68871 llvm-svn: 374765	2019-10-14 11:12:18 +00:00
Roman Lebedev	f72d6fc559	[MCA] Show aggregate over Average Wait times for the whole snippet (PR43219) Summary: As discussed in https://bugs.llvm.org/show_bug.cgi?id=43219, i believe it may be somewhat useful to show //some// aggregates over all the sea of statistics provided. Example: ``` Average Wait times (based on the timeline view): [0]: Executions [1]: Average time spent waiting in a scheduler's queue [2]: Average time spent waiting in a scheduler's queue while ready [3]: Average time elapsed from WB until retire stage [0] [1] [2] [3] 0. 3 1.0 1.0 4.7 vmulps %xmm0, %xmm1, %xmm2 1. 3 2.7 0.0 2.3 vhaddps %xmm2, %xmm2, %xmm3 2. 3 6.0 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 3 3.2 0.3 2.3 <total> ``` I.e. we average the averages. Reviewers: andreadb, mattd, RKSimon Reviewed By: andreadb Subscribers: gbedwell, arphaman, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68714 llvm-svn: 374361	2019-10-10 14:46:21 +00:00
Andrea Di Biagio	13160fb6a6	[MCA][LSUnit] Track loads and stores until retirement. Before this patch, loads and stores were only tracked by their corresponding queues in the LSUnit from dispatch until execute stage. In practice we should be more conservative and assume that memory opcodes leave their queues at retirement stage. Basically, loads should leave the load queue only when they have completed and delivered their data. We conservatively assume that a load is completed when it is retired. Stores should be tracked by the store queue from dispatch until retirement. In practice, stores can only leave the store queue if their data can be written to the data cache. This is mostly a mechanical change. With this patch, the retire stage notifies the LSUnit when a memory instruction is retired. That triggers the release of LDQ/STQ entries. The only visible change is in memory tests for the bdver2 model. That is because bdver2 is the only model that defines the load/store queue size. This patch partially addresses PR39830. Differential Revision: https://reviews.llvm.org/D68266 llvm-svn: 374034	2019-10-08 10:46:01 +00:00
David Green	3bac3332a1	[llvm-mca] Add a -mattr flag This adds a -mattr flag to llvm-mca, for cases where the -mcpu option does not contain all optional features. Differential Revision: https://reviews.llvm.org/D68190 llvm-svn: 373358	2019-10-01 17:41:38 +00:00
Andrea Di Biagio	935e89a564	[MCA] Improved cost computation for loop carried dependencies in the bottleneck analysis. This patch introduces a cut-off threshold for dependency edge frequencies with the goal of simplifying the critical sequence computation. This patch also removes the cost normalization for loop carried dependencies. We didn't really need to artificially amplify the cost of loop-carried dependencies since it is already computed as the integral over time of the delay (in cycles). In the absence of backend stalls there is no need for computing a critical sequence. With this patch we early exit from the critical sequence computation if no bottleneck was reported during the simulation. llvm-svn: 372337	2019-09-19 16:05:11 +00:00
Andrea Di Biagio	08b9b0a5ac	[X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions. On BtVer2 conditional SIMD stores are heavily microcoded. The latency is directly proportional to the number of packed elements extracted from the input vector. Also, according to micro-benchmarks, most of the computation seems to be done in the integer unit. Only a minority of the uOPs is executed by the FPU. The observed behaviour on the FPU looks similar to this: - The input MASK value is moved to the Integer Unit -- [ a VMOVMSK-like uOP-executed on JFPU0]. - In parallel, each element of the input XMM/YMM is extracted and then sent to the IntegerUnit through JFPU1. As expected, a (conditional) store is executed for every extracted element. Interestingly, a (speculative) load is executed for every extracted element too. It is as if a "LOAD - BIT_EXTRACT- CMOV" sequence of uOPs is repeated by the integer unit for every conditionally stored element. VMASKMOVDQU is a special case: the number of speculative loads is always 2 (presumably, one load per quadword). That means, extra shifts and masking is performed on (one of) the loaded quadwords before each conditional store (that also explains the big number of non-FP uOPs retired). This patch replaces the existing writes for conditional SIMD stores (i.e. WriteFMaskedStore, and WriteFMaskedStoreY) with the following new writes: WriteFMaskedStore32 [ XMM Packed Single ] WriteFMaskedStore32Y [ YMM Packed Single ] WriteFMaskedStore64 [ XMM Packed Double ] WriteFMaskedStore64Y [ YMM Packed Double ] Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe both RM and MR variants for conditional SIMD moves in a single tablegen definition. Instances of that class are then passed in input to multiclass avx_movmask_rm when constructing MASKMOVPS/PD definitions. Since this patch introduces new writes, I had to update all the X86 scheduling models. 
Differential Revision: https://reviews.llvm.org/D66801 llvm-svn: 370649	2019-09-02 12:32:28 +00:00
Andrea Di Biagio	56ca3f6977	[X86][BtVer2] Add a read-advance to every implicit register use of CMPXCHG8B/16B. This is a follow up of r369642. This patch assigns a ReadAfterLd to every implicit register use of instruction CMPXCHG8B and instruction CMPXCHG16B. Perf micro-benchmarks show that implicit registers are read after 3cy from the start of execution. llvm-svn: 369750	2019-08-23 12:19:45 +00:00
Andrea Di Biagio	25ee718e07	[X86][BtVer2] Fix latency of ALU RMW instructions. Excluding ADC/SBB and the bit-test instructions (BTR/BTS/BTC), the observed latency of all other RMW integer arithmetic/logic instructions is 6cy and not 5cy. Example (ADD): ``` addb $0, (%rsp) # Latency: 6cy addb $7, (%rsp) # Latency: 6cy addb %sil, (%rsp) # Latency: 6cy addw $0, (%rsp) # Latency: 6cy addw $511, (%rsp) # Latency: 6cy addw %si, (%rsp) # Latency: 6cy addl $0, (%rsp) # Latency: 6cy addl $511, (%rsp) # Latency: 6cy addl %esi, (%rsp) # Latency: 6cy addq $0, (%rsp) # Latency: 6cy addq $511, (%rsp) # Latency: 6cy addq %rsi, (%rsp) # Latency: 6cy ``` The same latency profile applies to SUB/AND/OR/XOR/INC/DEC. The observed latency of ADC/SBB is 7-8cy. So we need a different write to model those. Latency of BTS/BTR/BTC is not fixed by this patch (they are much slower than what the model for btver2 currently reports). Differential Revision: https://reviews.llvm.org/D66636 llvm-svn: 369748	2019-08-23 11:34:10 +00:00
Andrea Di Biagio	38b78fd66c	[X86][BtVer2] Fix latency/throughput of scalar integer MUL instructions. Single operand MUL instructions that implicitly set EAX have the following latency/throughput profile (see below): imul %cl # latency: 3cy - uOPs: 1 - 1 JMul imul %cx # latency: 3cy - uOPs: 3 - 3 JMul imul %ecx # latency: 3cy - uOPs: 2 - 2 JMul imul %rcx # latency: 6cy - uOPs: 2 - 4 JMul mul %cl # latency: 3cy - uOPs: 1 - 1 JMul mul %cx # latency: 3cy - uOPs: 3 - 3 JMul mul %ecx # latency: 3cy - uOPs: 2 - 2 JMul mul %rcx # latency: 6cy - uOPs: 2 - 4 JMul Excluding the 64bit variant, which has a latency of 6cy, every other instruction has a latency of 3cy. However, the number of decoded macro-opcodes (as well as the resource cycles) depends on the MUL size. The two operand MULs have a more predictable profile (see below): imul %dx, %dx # latency: 3cy - uOPs: 1 - 1 JMul imul %edx, %edx # latency: 3cy - uOPs: 1 - 1 JMul imul %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul imul $3, %dx, %dx # latency: 4cy - uOPs: 2 - 2 JMul imul $3, %ecx, %ecx # latency: 3cy - uOPs: 1 - 1 JMul imul $3, %rdx, %rdx # latency: 6cy - uOPs: 1 - 4 JMul This patch updates the values in the Jaguar scheduling model and regenerates llvm-mca tests. Differential Revision: https://reviews.llvm.org/D66547 llvm-svn: 369661	2019-08-22 15:20:16 +00:00
Andrea Di Biagio	fae9f9f261	[X86][BtVer2] Fix latency and throughput of XCHG and XADD. On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes. Maximum throughput for XCHG is 1 IPC. The byte exchange has worse latency and decodes to 1 extra uOP; maximum observed throughput is 0.5 IPC. ``` xchgb %cl, %dl # Latency: 2cy - uOPs: 3 - 2 ALU xchgw %cx, %dx # Latency: 1cy - uOPs: 2 - 2 ALU xchgl %ecx, %edx # Latency: 1cy - uOPs: 2 - 2 ALU xchgq %rcx, %rdx # Latency: 1cy - uOPs: 2 - 2 ALU ``` The reg-mem forms of XCHG are atomic operations with an observed latency of 16cy. The resource usage is similar to the XCHGrr variants. The biggest difference is obviously the bus-locking, which prevents the LS from issuing other memory uOPs in parallel until the unlocking store uOP is executed. ``` xchgb %cl, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgw %cx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgl %ecx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy xchgq %rcx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy ``` The exchanged in/out register operand becomes available after 11cy from the start of execution. Added test xchg.s to verify that we correctly see that register write committed in 11cy (and not 16cy). Reg-reg XADD instructions have the same latency/throughput as the byte exchange (register-register variant). ``` xaddb %cl, %dl # latency: 2cy - uOPs: 3 - 3 ALU xaddw %cx, %dx # latency: 2cy - uOPs: 3 - 3 ALU xaddl %ecx, %edx # latency: 2cy - uOPs: 3 - 3 ALU xaddq %rcx, %rdx # latency: 2cy - uOPs: 3 - 3 ALU ``` The non-atomic RM variants have a latency of 11cy, and decode to 4 macro-opcodes. They still consume 2 ALU pipes, and the exchange in/out register operand becomes available in 3cy (it matches the 'load-to-use latency'). 
``` xaddb %cl, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddw %cx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddl %ecx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU xaddq %rcx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU ``` The atomic XADD variants execute in 16cy. The in/out register operand is available after 11cy from the start of execution. ``` lock xaddb %cl, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddw %cx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddl %ecx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy lock xaddq %rcx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy ``` Added test xadd.s to verify those latencies as well as read-advance values. Differential Revision: https://reviews.llvm.org/D66535 llvm-svn: 369642	2019-08-22 11:32:47 +00:00
Andrea Di Biagio	60bf5e7c65	[X86][BtVer2] Use ReadAfterLd entries for the register operands of CMPXCHG. This is a follow-up of r369365. llvm-svn: 369412	2019-08-20 17:05:56 +00:00
Andrea Di Biagio	fd00d5a846	[X86][BtVer2] Fix latency and throughput of atomic INC/DEC/NEG/NOT. Latency and throughput of LOCK INC/DEC/NEG/NOT is always 19cy. Number of uOPs is still 1. Differential Revision: https://reviews.llvm.org/D66469 llvm-svn: 369388	2019-08-20 14:31:27 +00:00
Simon Pilgrim	aa50d0d398	[MCA][X86] Add tests for LOCK variants of standard X86 arithmetic ops D66424 adds the base support for LOCK so we should be able to add special case support for all these cases in future patches llvm-svn: 369367	2019-08-20 11:13:20 +00:00
Andrea Di Biagio	d5dd3f579a	[X86][Btver2] Fix latency and throughput of CMPXCHG instructions. On Jaguar, CMPXCHG has a latency of 11cy, and a maximum throughput of 0.33 IPC. Throughput is capped at 0.33 because of the implicit in/out dependency on register EAX. In the case of repeated non-atomic CMPXCHG with the same memory location, store-to-load forwarding occurs and values for subsequent loads are quickly forwarded from the store buffer. Interestingly, the functionality in LLVM that computes the reciprocal throughput doesn't seem to know about RMW instructions. That functionality only looks at the "consumed resource cycles" for the throughput computation. It should be fixed/improved by a future patch. In particular, for RMW instructions, that logic should also take into account the write latency of in/out register operands. An atomic CMPXCHG has a latency of ~17cy. Throughput is also limited to ~17cy/inst due to cache locking, which prevents other memory uOPs from starting execution before the "lock releasing" store uOP. CMPXCHG8rr and CMPXCHG8rm are treated specially because they decode to one less macro opcode. Their latency tends to be the same as the other RR/RM variants. RR variants are relatively fast 3cy (but still microcoded - 5 macro opcodes). CMPXCHG8B is 11cy and unfortunately doesn't seem to benefit from store-to-load forwarding. That means, throughput is clearly limited by the in/out dependency on GPR registers. The uOP composition is sadly unknown (due to the lack of PMCs for the Integer pipes). I have reused the same mix of consumed resource from the other CMPXCHG instructions for CMPXCHG8B too. LOCK CMPXCHG8B is instead 18cycles. CMPXCHG16B is 32cycles. Up to 38cycles when the LOCK prefix is specified. Due to the in/out dependencies, throughput is limited to 1 instruction every 32 (or 38) cycles depending on whether the LOCK prefix is specified or not. 
I wouldn't be surprised if the microcode for CMPXCHG16B is similar to 2x microcode from CMPXCHG8B. So, I have speculatively set the JALU01 consumption to 2x the resource cycles used for CMPXCHG8B. The two new hasLockPrefix() functions are used by the btver2 scheduling model to check if a MCInst/MachineInst has a LOCK prefix. Calls to hasLockPrefix() have been encoded in predicates of variant scheduling classes that describe lat/thr of CMPXCHG. Differential Revision: https://reviews.llvm.org/D66424 llvm-svn: 369365	2019-08-20 10:23:55 +00:00
Andrea Di Biagio	7f44363da1	[X86] Move scheduling tests for CMPXCHG to the corresponding resources-x86_64.s files. NFC In D66424 it has been requested to move all the new tests added by r369278 into resources-x86_64.s. That is because only the 8b/16 ops should be tested by resources-cmpxchg.s. This partially reverts r369278. llvm-svn: 369288	2019-08-19 18:20:30 +00:00
Andrea Di Biagio	f2f9d97508	[X86] Added extensive scheduling model tests for all the CMPXCHG variants. NFC Addresses a review comment in D66424 llvm-svn: 369279	2019-08-19 17:07:26 +00:00
Andrea Di Biagio	9bbf3a5aeb	[MCA] Add flag -show-encoding to llvm-mca. Flag -show-encoding enables the printing of instruction encodings as part of the instruction info view. Example (with flags -mtriple=x86_64-- -mcpu=btver2): Instruction Info: [1]: #uOps [2]: Latency [3]: RThroughput [4]: MayLoad [5]: MayStore [6]: HasSideEffects (U) [7]: Encoding Size [1] [2] [3] [4] [5] [6] [7] Encodings: Instructions: 1 2 1.00 4 c5 f0 59 d0 vmulps %xmm0, %xmm1, %xmm2 1 4 1.00 4 c5 eb 7c da vhaddps %xmm2, %xmm2, %xmm3 1 4 1.00 4 c5 e3 7c e3 vhaddps %xmm3, %xmm3, %xmm4 In this example, column Encoding Size is the size in bytes of the instruction encoding. Column Encodings reports the actual instruction encodings as byte sequences in hex (objdump style). The computation of encodings is done by a utility class named mca::CodeEmitter. In future, I plan to expose the CodeEmitter to the instruction builder, so that information about instruction encoding sizes can be used by the simulator. That would be a first step towards simulating the throughput from the decoders in the hardware frontend. Differential Revision: https://reviews.llvm.org/D65948 llvm-svn: 368432	2019-08-09 11:26:27 +00:00
Craig Topper	7b7ab0396a	[X86] Limit vpermil2pd/vpermil2ps immediates to 4 bits in the assembly parser. The upper 4 bits of the immediate byte are used to encode a register. We need to limit the explicit immediate to fit in the remaining 4 bits. Fixes PR42899. llvm-svn: 368123	2019-08-07 05:34:27 +00:00
Andrea Di Biagio	66df8adc20	[MCA] Add support for printing immediate values as hex. Also enable lexing of masm binary and hex literals. This patch adds a new llvm-mca flag named -print-imm-hex. By default, the instruction printer prints immediate operands as decimals. Flag -print-imm-hex enables the instruction printer to print those operands in hex. This patch also adds support for MASM binary and hex literal numbers (example 0FFh, 101b). Added tests to verify the behavior of the new flag. Tests also verify that masm numeric literal operands are now recognized. Differential Revision: https://reviews.llvm.org/D65588 llvm-svn: 367671	2019-08-02 10:38:25 +00:00
Andrea Di Biagio	2d1bc962de	Set an explicit x86 triple for test bottleneck-analysis.s added by my r364045. NFC This should unbreak the ppc64 buildbots. llvm-svn: 364048	2019-06-21 14:05:58 +00:00
Andrea Di Biagio	88551367a3	[MCA][Bottleneck Analysis] Teach how to compute a critical sequence of instructions based on the simulation. This patch teaches the bottleneck analysis how to identify and print the most expensive sequence of instructions according to the simulation. Fixes PR37494. The goal is to help users identify the sequence of instructions which is most critical for performance. A dependency graph is internally used by the bottleneck analysis to describe data dependencies and processor resource interferences between instructions. There is one node in the graph for every instruction in the input assembly sequence. The number of nodes in the graph is independent from the number of iterations simulated by the tool. It means that a single node of the graph represents all the possible instances of a same instruction contributed by the simulated iterations. Edges are dynamically "discovered" by the bottleneck analysis by observing instruction state transitions and "backend pressure increase" events generated by the Execute stage. Information from the events is used to identify critical dependencies, and materialize edges in the graph. A dependency edge is uniquely identified by a pair of node identifiers plus an instance of struct DependencyEdge::Dependency (which provides more details about the actual dependency kind). The bottleneck analysis internally ranks dependency edges based on their impact on the runtime (see field DependencyEdge::Dependency::Cost). To this end, each edge of the graph has an associated cost. By default, the cost of an edge is a function of its latency (in cycles). In practice, the cost of an edge is also a function of the number of cycles where the dependency has been seen as 'contributing to backend pressure increases'. The idea is that the higher the cost of an edge, the higher is the impact of the dependency on performance. To put it in another way, the cost of an edge is a measure of criticality for performance. 
Note how a same edge may be found in multiple iterations of the simulated loop. The logic that adds new edges to the graph checks if an equivalent dependency already exists (duplicate edges are not allowed). If an equivalent dependency edge is found, field DependencyEdge::Frequency of that edge is incremented by one, and the new cost is cumulatively added to the existing edge cost. At the end of simulation, costs are propagated to nodes through the edges of the graph. The goal is to identify a critical sequence from a node of the root-set (composed of the nodes of the graph with no predecessors) to a 'sink node' with no successors. Note that the graph is intentionally kept acyclic to minimize the complexity of the critical sequence computation algorithm (complexity is currently linear in the number of nodes in the graph). The critical path is finally computed as a sequence of dependency edges. For edges describing processor resource interferences, the view also prints a so-called "interference probability" value (by dividing field DependencyEdge::Frequency by the total number of iterations). Examples of critical sequence computations can be found in tests added/modified by this patch. On output streams that support colored output, instructions from the critical sequence are rendered with a different color. Strictly speaking the analysis conducted by the bottleneck analysis view is not a critical path analysis. The cost of an edge doesn't only depend on the dependency latency. More importantly, the cost of a same edge may be computed differently by different iterations. The number of dependencies is discovered dynamically based on the events generated by the simulator. However, their number is not fixed. This is especially true for edges that model processor resource interferences; an interference may not occur in every iteration. For that reason, it makes sense to also print out a "probability of interference". 
By construction, the accuracy of this analysis (as always) is strongly dependent on the simulation (and therefore the quality of the information available in the scheduling model). That being said, the critical sequence effectively identifies a performance criticality. Instructions from that sequence are expected to have a very big impact on performance. So, users can take advantage of this information to focus their attention on specific interactions between instructions. In my experience, it works quite well in practice, and produces useful output (in a reasonable amount of time). Differential Revision: https://reviews.llvm.org/D63543 llvm-svn: 364045	2019-06-21 13:32:54 +00:00
Clement Courbet	d31b782266	Fix r363773: Update Barcelona MCA tests. llvm-svn: 363781	2019-06-19 10:00:36 +00:00
Roman Lebedev	526767bb3d	[NFC][X86][MCA] Barcelona: add load/store/load-store-throughput tests llvm-svn: 363775	2019-06-19 08:53:34 +00:00
Roman Lebedev	960197082d	[NFC][X86][MCA] BdVer2: add load-store-throughput test llvm-svn: 363774	2019-06-19 08:53:28 +00:00
Clement Courbet	ff802cb3e4	[X86] Add missing properties on llvm.x86.sse.{st,ld}mxcsr Summary: llvm.x86.sse.stmxcsr only writes to memory. llvm.x86.sse.ldmxcsr only reads from memory, and might generate an FPE. Reviewers: craig.topper, RKSimon Subscribers: llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D62896 llvm-svn: 363773	2019-06-19 08:44:31 +00:00
Fangrui Song	921cf42d37	[lit] Delete empty lines at the end of lit.local.cfg NFC llvm-svn: 363538	2019-06-17 09:51:07 +00:00
Roman Lebedev	eca9899a89	[NFC][MCA][X86] Add one more 'clear super register' pattern - movss/movsd load clears high XMM bits llvm-svn: 363498	2019-06-15 16:12:13 +00:00
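
The message of commit 88551367a3 describes a dependency graph in which duplicate edges are merged (their frequency is bumped and their cost accumulated) and a critical sequence is then found in an acyclic graph. The following is only a rough Python sketch of that idea, not the llvm-mca implementation: the class `DependencyGraph`, its methods, and the rule that edges must go from a lower to a higher node index (used here to keep the graph acyclic) are all assumptions made for illustration.

```python
class DependencyGraph:
    """Toy dependency graph; one node per instruction index (hypothetical model)."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        # (src, dst, kind) -> [cumulative cost, frequency]
        self.edges = {}

    def add_dependency(self, src, dst, kind, cost):
        # Assumption of this sketch: only forward edges, so the graph stays acyclic.
        if not 0 <= src < dst < self.num_nodes:
            raise ValueError("only forward edges are accepted")
        # An equivalent edge is merged: cost accumulates, frequency is bumped.
        entry = self.edges.setdefault((src, dst, kind), [0, 0])
        entry[0] += cost
        entry[1] += 1

    def interference_probability(self, key, iterations):
        # Frequency of the edge divided by the number of simulated iterations.
        return self.edges[key][1] / iterations

    def critical_sequence(self):
        """Return (total cost, list of edge keys) of the most expensive path."""
        best = [0] * self.num_nodes    # best cumulative cost reaching each node
        via = [None] * self.num_nodes  # edge used to reach each node at that cost
        # Visiting edges by destination index is a valid topological order here,
        # because every edge goes from a lower to a higher index.
        for key in sorted(self.edges, key=lambda k: k[1]):
            src, dst, _ = key
            candidate = best[src] + self.edges[key][0]
            if candidate > best[dst]:
                best[dst], via[dst] = candidate, key
        end = max(range(self.num_nodes), key=best.__getitem__)
        total = best[end]
        sequence = []
        node = end
        while via[node] is not None:  # walk back towards a root node
            sequence.append(via[node])
            node = via[node][0]
        sequence.reverse()
        return total, sequence


graph = DependencyGraph(3)
graph.add_dependency(0, 1, "register", 4)   # seen in one iteration
graph.add_dependency(0, 1, "register", 4)   # seen again: merged into the same edge
graph.add_dependency(1, 2, "register", 3)
graph.add_dependency(0, 2, "resource", 5)
print(graph.critical_sequence())
```

In this sketch the path through the merged register dependency wins because its accumulated cost outweighs the single resource interference, which mirrors the commit's point that an edge's cost reflects more than its latency.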


408 Commits