llvm-mirror

mirror of https://github.com/RPCS3/llvm-mirror.git synced 2024-11-25 04:02:41 +01:00

Author	SHA1	Message	Date
Simon Pilgrim	436c202090	[CostModel][X86] Adjust fptosi/fptoui SSE/AVX legalized costs based on llvm-mca reports. Update (mainly) vXf32/vXf64 -> vXi8/vXi16 fptosi/fptoui costs based on the worst case costs from the script in D103695. Move to using legalized types wherever possible, which allows us to prune the cost tables.	2021-07-12 20:38:25 +01:00
Craig Topper	01311414db	[X86] Teach X86FloatingPoint's handleCall to only erase the FP stack if there is a regmask operand that clobbers the FP stack. There are some calls to functions like `__alloca` that are missing a regmask operand. Lack of a regmask operand means that all registers that aren't mentioned by def operands are preserved. __alloca only updates EAX and ESP and has def operands for them so this is ok. Because there is no regmask the register allocator won't spill the FP registers across the call. Assuming we want to keep the FP stack untoched across these calls, we need to handle this is in the FP stackifier. We might want to add a proper regmask operand to the code that creates these calls to indicate all registers are preserved, but we'd still need this change to the FP stackifier to know to preserve the FP stack for such a regmask. The test is kind of long, but bugpoint wasn't able to reduce it any further. Fixes PR50782 Reviewed By: pengfei Differential Revision: https://reviews.llvm.org/D105762	2021-07-12 10:15:38 -07:00
Simon Pilgrim	063de0d715	[CostModel][X86] Adjust truncate SSE/AVX legalized costs based on llvm-mca reports. Update truncation costs based on the worst case costs from the script in D103695. Move to using legalized types wherever possible, which allows us to prune the cost tables.	2021-07-12 13:50:43 +01:00
Simon Pilgrim	605a389334	[X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))	2021-07-12 09:56:59 +01:00
Jeremy Morse	c82840bcfa	[X86] Return src/dest register from stack spill/restore recogniser LLVM provides target hooks to recognise stack spill and restore instructions, such as isLoadFromStackSlot, and it also provides post frame elimination versions such as isLoadFromStackSlotPostFE. These are supposed to return the store-source and load-destination registers; unfortunately on X86, the PostFE recognisers just return "1", apparently to signify "yes it's a spill/load". This patch alters the hooks to correctly return the store-source and load-destination registers: This is really useful for debug-info as we it helps follow variable values as they move on/off the stack. There should be no codegen changes: the only other users of these PostFE target hooks are MachineInstr::getRestoreSize and MachineInstr::getSpillSize, which don't attempt to interpret the returned register location. While we're here, delete the (InstrRef) LiveDebugValues heuristic that tries to find the spill source register by looking for a killed reg -- we should be able to rely on the target hooks for that. This involves temporarily turning off a n InstrRef LivedDebugValues test on aarch64 (patch to re-enable it is in D104521). Differential Revision: https://reviews.llvm.org/D105428	2021-07-09 18:12:30 +01:00
Simon Pilgrim	eeea5b4c4a	[X86] ReplaceNodeResults - fp_to_sint/uint - manually widen v2i32 results to let us add AssertSext/AssertZext Its proving tricky to move this to the generic legalizer code, so manually insert the v2i32 subvector into v4i32, insert the AssertSext/AssertZext node, then extract the subvector again. This avoids masks in the truncation/pack code, which means we avoid a PSHUFB in the fp_to_sint/uint code for sub-128 bit types (specific targets can still combine the packs to a pshufb if they have fast variable per-lane shuffles). This was noticed when I was trying to improve fp_to_sint/uint costs with D103695 (and some targets had very high fp_to_sint costs due to the PSHUFB), so we can then update the fp_to_uint codegen from D89697.	2021-07-09 12:07:33 +01:00
David Green	fd61052e59	[TTI] Remove IsPairwiseForm from getArithmeticReductionCost This patch removes the IsPairwiseForm flag from the Reduction Cost TTI hooks, along with some accompanying code for pattern matching reductions from trees starting at extract elements. IsPairWise is now assumed to be false, which was the predominant way that the value was used from both the Loop and SLP vectorizers. Since the adjustments such as D93860, the SLP vectorizer has not relied upon this distinction between paiwise and non-pairwise reductions. This also removes some code that was detecting reductions trees starting from extract elements inside the costmodel. This case was double-counting costs though, adding the individual costs on the individual instruction _and_ the total cost of the reduction. Removing it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to not double count. The cost of reduction intrinsics is still tested through the various tests in llvm/test/Analysis/CostModel/X86/reduce-xyz.ll. Differential Revision: https://reviews.llvm.org/D105484	2021-07-09 11:51:16 +01:00
Alexey Bataev	9770438876	[SLP][COST][X86]Improve cost model for masked gather. Revived D101297 in its original form + added some changes in X86 legalization cehcking for masked gathers. This solution is the most stable and the most correct one. We have to check the legality before trying to build the masked gather in SLP. Without this check we have incorrect cost (for SLP) in case if the masked gather is not legal/slower than the gather. And we're missing some vectorization opportunities. This can be fixed in the cost model, but in this case we need to add special checks for the cost of GEPs for ScatterVectorize node, add special check for small trees, etc., i.e. there are a lot of corner cases here and there, which insrease code base and make it harder to maintain the code. > Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed. The question from D101297. Actually, no, it can't. Actually, simple gather may give us better result, especially after we started vectorization of insertelements. Plus, like I said before, the cost for non-legal masked gathers leads to missed vectorization opportunities. Differential Revision: https://reviews.llvm.org/D105042	2021-07-08 11:53:30 -07:00
Matt Arsenault	fc47c36984	GlobalISel: Track original argument index in ArgInfo SelectionDAG's equivalents in ISD::InputArg/OutputArg track the original argument index. Mips relies on this, and its currently reinventing its own parallel CallLowering infrastructure which tracks these indexes on the side. Add this to help move towards deleting the custom mips handling.	2021-07-08 13:39:02 -04:00
Simon Pilgrim	5d69b0490b	[CostModel][X86] Account for older SSE targets with slow fp->int conversions Both the conversion cost and the xmm->gpr transfer cost tend to be a lot higher on early SSE targets	2021-07-08 18:08:24 +01:00
Jeremy Morse	d7cf7abb78	[DebugInfo][InstrRef][4/4] Support DBG_INSTR_REF through all backend passes This is a cleanup patch -- we're now able to support all flavours of variable location in instruction referencing mode. This patch updates various tests for debug instructions to be broader: numerous code paths try to ignore debug isntructions, and they now have to ignore the additional DBG_PHI and DBG_INSTR_REFs that we can generate. A small amount of rework happens for LiveDebugVariables: as we don't need to track live intervals through regalloc any more, we can get away with unlinking debug instructions before regalloc, then re-inserting them after. Note that this isn't (yet) true of DBG_VALUE_LISTs, they still have to go through live interval tracking. In SelectionDAG, add a helper lambda that emits half-formed DBG_INSTR_REFs for arguments in instr-ref mode, DBG_VALUE otherwise. This is one of the final locations where DBG_VALUEs are emitted for vreg arguments. X86InstrInfo now un-sets the debug instr number on SUB instructions that get mutated into CMP instructions. As the instruction no longer computes a subtraction, we can't use it for variable locations. Differential Revision: https://reviews.llvm.org/D88898	2021-07-08 16:42:24 +01:00
Simon Pilgrim	5668a526d1	[X86][Atom] Fix vector fp<->int resource/throughputs Match whats documented in the Intel AOM - almost all the conversion instructions requires BOTH ports (apart from the MMX cvtpi2ps/cvtpi2ps instructions which we already override) - this was being incorrectly modelled as EITHER port. Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.	2021-07-07 16:52:34 +01:00
Simon Pilgrim	8dde5bc762	[CostModel][X86] Adjust sext/zext SSE/AVX legalized costs based on llvm-mca reports. Update costs based on the worst case costs from the script in D103695. Move to using legalized types wherever possible, which allows us to prune the cost tables.	2021-07-07 13:58:27 +01:00
Simon Pilgrim	929bb9374e	[CostModel][X86] Adjust sitofp/uitofp SSE/AVX legalized costs based on llvm-mca reports. Update (mainly) vXi8/vXi16 -> vXf32/vXf64 sitofp/uitofp costs based on the worst case costs from the script in D103695. Move to using legalized types wherever possible, which allows us to prune the cost tables.	2021-07-07 12:03:45 +01:00
Simon Pilgrim	b8bec02bc7	[CostModel][X86] fptosi/fptoui to i8/i16 are truncated from fptosi to i32 Provide a generic fallback that performs the fptosi to i32 types, then truncates to sub-i32 scalars. These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.	2021-07-06 17:28:03 +01:00
Simon Pilgrim	1787dc0460	[CostModel][X86] i8/i16 sitofp/uitofp are sext/zext to i32 for sitofp Provide a generic fallback that extends sub-i32 scalars before using the existing sitofp instructions. These numbers can be tweaked for specific sse levels, but we should get the default handling in place first. We get the extension for free for non-vector loads.	2021-07-06 13:58:52 +01:00
Wang, Pengfei	cd7ebd1f50	[X86] Twist shuffle mask when fold HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y)) This patch fixes PR50823. The shuffle mask should be twisted twice before gotten the correct one due to the difference between inner HOP and outer. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D104903	2021-07-05 21:29:42 +08:00
Simon Pilgrim	8ed3c99d18	[CostModel][X86] Handle costs for insert/extractelement with non-immediate indices via stack Determine the insert/extractelement costs when performing this as a sequence of aliased loads+stores via the stack.	2021-07-05 13:26:53 +01:00
Simon Pilgrim	4d75d8ef44	[CostModel][X86] Adjust i32/i64 to f32/f64 scalar based on llvm-mca reports (+ Agner). Older SSE targets have slower gpr->fpu scalar conversions - we also need to account for uitofp i32 > f32/f64 being lowered as sitofp i64 -> f32/f64	2021-07-05 13:26:53 +01:00
Nikita Popov	ecd2dc975e	[IRBuilder] Add type argument to CreateMaskedLoad/Gather Same as other CreateLoad-style APIs, these need an explicit type argument to support opaque pointers. Differential Revision: https://reviews.llvm.org/D105395	2021-07-04 12:17:59 +02:00
Amir Ayupov	faad37c262	[X86] Modify LOOP, HLT control flow attributes Add missing control flow attributes: - LOOP: isBranch, isTerminator - HLT: isTerminator This helps downstream disassemblers (such as BOLT) reconstruct the control flow graph more accurately. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D102297	2021-07-02 10:34:29 -07:00
Simon Pilgrim	3fc71bb70f	[X86][SLM] Keep similar scheduler costs types together. NFCI. The SLM model is inconsistent about where it kept its 'unsupported' schedule classes - better to keep them close to similar classes. I'm not sure why some ymm classes are defined and others are unsupported though (but I haven't altered them) - the only SLM-like CPU supporting any ymm is KNL and that currently uses the HSW model.	2021-07-02 14:50:24 +01:00
Simon Pilgrim	4d7e8eba5b	[CostModel][X86] Update comment describing source of costs - we now use llvm-mca more than IACA	2021-07-02 14:29:32 +01:00
Simon Pilgrim	5e6cee0948	[CostModel][X86] Drop some hard coded fp<->int scalarization costs Scalarization costs handling is a lot better now, and the hard coded costs were higher than the worse case numbers from the script in D103695	2021-07-02 14:29:32 +01:00
Simon Pilgrim	1a043ee5ed	[CostModel][X86] Find AVX conversion costs using legalized types if custom types didn't match Building on rG2a1ef8784ad9a, fallback to attempting to match against legalized types like we do for SSE targets.	2021-07-02 13:49:31 +01:00
Simon Pilgrim	c54e15d18e	[CostModel][X86] Adjust uitofp(vXi64) SSE/AVX legalized costs based on llvm-mca reports. Update v4i64 -> v4f32/v4f64 uitofp costs based on the worst case costs from the script in D103695. Fixes a few regressions before we start adding AVX costs for legalized types.	2021-07-02 13:09:00 +01:00
Matt Arsenault	20d89b9242	GlobalISel: Use LLT in call lowering callbacks This preserves the memory type so the lowerings can rely on them.	2021-07-01 12:15:54 -04:00
Simon Pilgrim	f0a2f7f0f9	[CostModel][X86] Adjust fp<->int vXi32 SSE legalized costs based on llvm-mca reports. Building on rG2a1ef8784ad9a, adjust the SSE cost tables to use the legalized types based on the worst case costs from the script in D103695. To account for different numbers of src/dst legalized type registers we must scale the cost by maximum of the src/dst, not just use src	2021-07-01 15:34:20 +01:00
Simon Pilgrim	b53b3c8d1c	[CostModel][X86] getCastInstrCost - attempt to match custom cast/conversion before legalized types. Move the (SSE-only) generic, legalized type conversion matching after the specific,custom conversion cases, allowing us to properly provide cost overrides. The next step will be to clean up some of the weird existing costs and then to enable AVX+ legalized costs, which will let us strip out a lot of the cost tables entries.	2021-07-01 12:06:40 +01:00
Jeremy Morse	7a8e30eba0	[DebugInfo][InstrRef][1/4] Support transformations that widen values Very late in compilation, backends like X86 will perform optimisations like this: $cx = MOV16rm $rax, ... -> $rcx = MOV64rm $rax, ... Widening the load from 16 bits to 64 bits. SEeing how the lower 16 bits remain the same, this doesn't affect execution. However, any debug instruction reference to the defined operand now refers to a 64 bit value, nto a 16 bit one, which might be unexpected. Elsewhere in codegen, there's often this pattern: CALL64pcrel32 @foo, implicit-def $rax %0:gr64 = COPY $rax %1:gr32 = COPY %0.sub_32bit Where we want to refer to the definition of $eax by the call, but don't want to refer the copies (they don't define values in the way LiveDebugValues sees it). To solve this, add a subregister field to the existing "substitutions" facility, so that we can describe a field within a larger value definition. I would imagine that this would be used most often when a value is widened, and we need to refer to the original, narrower definition. Differential Revision: https://reviews.llvm.org/D88891	2021-07-01 11:19:27 +01:00
Simon Pilgrim	f6f886a614	[X86] Canonicalize SGT/UGT compares with constants to use SGE/UGE to reduce the number of EFLAGs reads. (PR48760) This demonstrates a possible fix for PR48760 - for compares with constants, canonicalize the SGT/UGT condition code to use SGE/UGE which should reduce the number of EFLAGs bits we need to read. As discussed on PR48760, some EFLAG bits are treated independently which can require additional uops to merge together for certain CMOVcc/SETcc/etc. modes. I've limited this to cases where the constant increment doesn't result in a larger encoding or additional i64 constant materializations. Differential Revision: https://reviews.llvm.org/D101074	2021-06-30 18:46:50 +01:00
Simon Pilgrim	3ab40c9e92	[CostModel][X86] Adjust fp<->int vXi32 AVX1+ costs based on llvm-mca reports Based off the worse case numbers generated by D103695, the AVX1/2/512 sitofp/uitofp/fptosi/fptoui costs were higher than necessary (based off instruction counts instead of actual throughput). The SSE costs still need further fixes, but I hit an issue with the order in which SSE costs are checked - we need to check CUSTOM costs (with non-legal types) first, and then fallback to LEGALIZED types. I'm looking at this now, and this should let us start thinning out a lot of the duplicates in the costs tables. Then we can finally start work on vXi64 / vXi16 / vXi8 / vXi1 integers, which should let us look at sub-128-bit vectorization (D103925).	2021-06-30 15:23:34 +01:00
Craig Topper	72e0f3ae04	[X86] Tighten up some inline assembly constraint handling. Don't allow vectors to split into GPRs for 'r' and other scalar constraints. Prevents assertion in getCopyToPartsVector. Makes PR50907 give a better error instead of crashing.	2021-06-26 22:57:22 -07:00
Eric Astor	7492c6b2fb	[ms] [llvm-ml] Disable C-style comments	2021-06-25 23:09:13 -04:00
Luo, Yuanke	b033502dd5	[X86] Selecting fld0 for undefined value in fast ISEL. When set opt-bisect-limit to some value that is less than ISel pass in command line and CurBisectNum expired, "DAG to DAG" pass lower its opt level to O0. However "processimpdefs" and "X86 FP Stackifier" is not stopped due to the CurBisectNum expiration. So undefined fp0 is generated. This cause crash in the "X86 FP Stackifier" pass, because Stackifier doesn't expect any undefined fp value. Here is the scenario that cause compiler crash. successors: %bb.26 liveins: $r14 ST_FPrr $st0, implicit-def $fpsw, implicit $fpcw renamable $rdi = MOV64ri @.str.3.16422 renamable $rdx = LEA64r %stack.6, 1, $noreg, 0, $noreg ADJCALLSTACKDOWN64 0, 0, 0, implicit-def $rsp, implicit-def dead $eflags, implicit-def $ssp, implicit $rsp, implicit $ssp dead $esi = MOV32r0 implicit-def dead $eflags, implicit-def $rsi CALL64pcrel32 @foo, implicit $rsp, implicit $ssp, implicit $rdi, implicit $rsi, implicit $rdx, implicit-def dead $fp0 renamable $xmm0 = MOVSDrm_alt %stack.10, 1, $noreg, 0, $noreg :: (load 8 from %stack.10) ADJCALLSTACKUP64 0, 0, implicit-def $rsp, implicit-def dead $eflags, implicit-def $ssp, implicit $rsp, implicit $ssp renamable $fp2 = CHS_Fp80 killed undef renamable $fp0, implicit-def $fpsw JMP_1 %bb.26 The CALL64pcrel32 mark fp0 dead, so llvm free the stack slot for fp0 and the stack become empty. In the late instruction CHS_Fp80, it use undefined register fp0, the original code assume there must be a stack slot for the src register (fp0) without respecting it is undefined, so llvm report error. We have some discussion in https://reviews.llvm.org/D104440 and we decide to fix it in fast ISel. The fix is to lower undefined fp value to zero value, so that it release the burden of "X86 FP Stackifier" pass. Thank Craig for the suggestion and the initial patch to fix it. Differential Revision: https://reviews.llvm.org/D104678	2021-06-26 08:43:09 +08:00
Craig Topper	40a736ff72	[X86] Simplify part of the isel for X86ISD::FCMP/STRICT_FCMP/STRICT_FCMPS. We don't need to have the compare output a value and then copy it to FPSW for use by FNSTSW. Instead we can just have the compare output Glue and glue the FNSTSW to it. InstrEmitter effectively performed this optimization when emitting the Machine IR. Doing it directly simplifies the codes and reduces the work in InstrEmitter. There's no change in the machine IR at the end of isel before and after this change.	2021-06-25 11:39:01 -07:00
Serge Pavlov	1badfbbb03	[X86] Add description of FXAM instruction Previously this instruction could be used only in assembler. This change makes it available for compiler also. Scheduling information was copied from FTST instruction, hopefully this can be a satisfactory approximation. Differential Revision: https://reviews.llvm.org/D104853	2021-06-25 12:26:51 +07:00
Martin Storsjö	9d14adb9f6	[llvm] Rename StringRef _lower() method calls to _insensitive() This is a mechanical change. This actually also renames the similarly named methods in the SmallString class, however these methods don't seem to be used outside of the llvm subproject, so this doesn't break building of the rest of the monorepo.	2021-06-25 00:22:01 +03:00
Florian Hahn	bd6b477d08	[X86] Exclude invalid element types for bitcast/broadcast folding. It looks like the fold introduced in 63f3383ece25efa can cause crashes if the type of the bitcasted value is not a valid vector element type, like x86_mmx. To resolve the crash, reject invalid vector element types. The way it is done in the patch is a bit clunky. Perhaps there's a better way to check? Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D104792	2021-06-24 12:39:01 +01:00
Simon Pilgrim	479c2a6aed	[X86] Fold nested select_cc to select (cmp*ge/le Cond0, Cond1), LHS, Y) select (cmpeq Cond0, Cond1), LHS, (select (cmpugt Cond0, Cond1), LHS, Y) --> (select (cmpuge Cond0, Cond1), LHS, Y) etc, We already perform this fold in DAGCombiner for MVT::i1 comparison results, but these can still appear after legalization (in x86 case with MVT::i8 results), where we need to be more careful about generating new comparison codes. Pulled out of D101074 to help address the remaining regressions. Differential Revision: https://reviews.llvm.org/D104707	2021-06-24 11:27:57 +01:00
Sander de Smalen	ac11cfc716	[GlobalISel] NFC: Change LLT::vector to take ElementCount. This also adds new interfaces for the fixed- and scalable case: * LLT::fixed_vector * LLT::scalable_vector The strategy for migrating to the new interfaces was as follows: * If the new LLT is a (modified) clone of another LLT, taking the same number of elements, then use LLT::vector(OtherTy.getElementCount()) or if the number of elements is halfed/doubled, it uses .divideCoefficientBy(2) or operator. That is because there is no reason to specifically restrict the types to 'fixed_vector'. If the algorithm works on the number of elements (as unsigned), then just use fixed_vector. This will need to be fixed up in the future when modifying the algorithm to also work for scalable vectors, and will need then need additional tests to confirm the behaviour works the same for scalable vectors. * If the test used the '/Scalable=/true` flag of LLT::vector, then this is replaced by LLT::scalable_vector. Reviewed By: aemerson Differential Revision: https://reviews.llvm.org/D104451	2021-06-24 11:26:12 +01:00
Eli Friedman	0c81356419	Rename MachineMemOperand::getOrdering -> getSuccessOrdering. Since this method can apply to cmpxchg operations, make sure it's clear what value we're actually retrieving. This will help ensure we don't accidentally ignore the failure ordering of cmpxchg in the future. We could potentially introduce a getOrdering() method on AtomicSDNode that asserts the operation isn't cmpxchg, but not sure that's worthwhile. Differential Revision: https://reviews.llvm.org/D103338	2021-06-21 16:49:27 -07:00
Fangrui Song	bb7326c8ea	[AArch64][X86] Allow 64-bit label differences lower to IMAGE_REL_*_REL32 `IMAGE_REL_ARM64_REL64/IMAGE_REL_AMD64_REL64` do not exist and `.quad a - .` is currently not representable. For instrumentation, `.quad a - .` is useful representing a cross-section reference in a metadata section, to allow ELF medium/large code models. The COFF limitation makes such generic instrumentations inconvenient. I plan to make a PGO/coverage metadata section field relative in D104556. Differential Revision: https://reviews.llvm.org/D104564	2021-06-21 14:32:25 -07:00
Roman Lebedev	a8e6eca719	[NFC] AMD Zen 3: fix typo in a comment	2021-06-19 22:05:17 +03:00
Roman Lebedev	2b8fbae5a3	[X86] AMD Zen 3: don't confuse shift and shuffle, NFC These proc res groups occupy the exact same pipes, so this doesn't affect the modelling, but it's confusing nontheless.	2021-06-17 21:07:35 +03:00
Simon Pilgrim	1bc07d1287	[X86] combineSelect - refactor MIN/MAX detection code to make it easier to add additional select(setcc,x,y) folds. NFCI. I need to add some additional handling to address some of the regressions from D101074	2021-06-17 13:50:59 +01:00
Roman Lebedev	67009dec50	[NFC][X86] lowerVECTOR_SHUFFLE(): drop FIXME about widening to i128 (YMM half) element type As per the discussion in D103818, so far, this does not appear to be worthwhile. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D103818	2021-06-16 10:24:33 +03:00
Saleem Abdulrasool	1bd45c37e0	X86: balance the frame prologue and epilogue on Win64 This was broken in ba1509da7b89c850c89f0f98afbab375794cd3c8. The Win64 frame would not perform the setup of the Swift async context parameter but would tear down the setup in the epilogue resulting in crashes. This ensures that we do the full setup when we do the tear down. Although this is non-conforming to the Win64 calling convention, it corrects the setup and exposes the actual issue that the change introduced: incorrect frame setup. Reviewed By: rnk Differential Revision: https://reviews.llvm.org/D104246	2021-06-15 20:13:52 -07:00
Bob Haarman	99d0b11cd5	[X86] avoid assert with varargs, soft float, and no-implicit-float Fixes: - PR36507 Floating point varargs are not handled correctly with -mno-implicit-float - PR48528 __builtin_va_start assumes it can pass SSE registers when using -Xclang -msoft-float -Xclang -no-implicit-float On x86_64, floating-point parameters are normally passed in XMM registers. For va_start, we spill those to memory so va_arg can find them. There is an interaction here with -msoft-float and -no-implicit-float: When -msoft-float is in effect, instead of passing floating-point parameters in XMM registers, they are passed in general-purpose registers. When -no-implicit-float is in effect, it "disables implicit floating-point instructions" (per the LangRef). The intended effect is to not have the compiler generate floating-point code unless explicit floating-point operations are present in the source code, but what exactly counts as an explicit floating-point operation is not specified. The existing behavior of LLVM here has led to some surprises and PRs. This change modifies the behavior as follows: \| soft \| no-implicit \| old behavior \| new behavior \| \| no \| no \| spill XMM regs \| spill XMM regs \| \| yes \| no \| don't spill XMM \| don't spill XMM \| \| no \| yes \| don't spill XMM \| spill XMM regs \| \| yes \| yes \| assert \| don't spill XMM \| In particular, this avoids the assert that happens when -msoft-float and -no-implicit-float are both in effect. This seems like a perfectly reasonable combination: If we don't want to rely on hardware floating-point support, we want to both avoid using float registers to pass parameters and avoid having the compiler generate floating-point code that wasn't in the original program. Instead of crashing the compiler, the new behavior is to not synthesize floating-point code in this case. This fixes PR48528. The other interesting case is when -no-implicit-float is in effect, but -msoft-float is not. In that case, any floating-point parameters that are present will be in XMM registers, and so we have to spill them to correctly handle those. This fixes PR36507. The spill is conditional on %al indicating that parameters are present in XMM registers, so no floating-point code will be executed unless the function is called with floating-point parameters. Reviewed By: rnk Differential Revision: https://reviews.llvm.org/D104001	2021-06-15 11:27:35 -07:00
Roman Lebedev	586aaeabf1	[X86] Schedule-model second (mask) output of GATHER instruction Much like `mulx`'s `WriteIMulH`, there are two outputs of AVX2 GATHER instructions. This was changed back in rL160110, but the sched model change wasn't present. So right now, for sched models that are marked as complete (`znver3` only now), codegen'ning `GATHER` results in a crash: ``` DefIdx 1 exceeds machine model writes for early-clobber renamable $ymm3, dead early-clobber renamable $ymm2 = VPGATHERDDYrm killed renamable $ymm3(tied-def 0), undef renamable $rax, 4, renamable $ymm0, 0, $noreg, killed renamable $ymm2(tied-def 1) :: (load 32, align 1) ``` https://godbolt.org/z/Ks7zW7WGh I'm guessing we need to deal with this like we deal with `WriteIMulH`. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D104205	2021-06-15 12:04:33 +03:00
Craig Topper	e3c98d1dad	[X86] Use EVT::getVectorVT instead of changeVectorElementType in reduceVMULWidth. Changing vector element type doesn't work for v6i32->v6i16 now that v6i32 is an MVT and v6i16 is not. I would like to fix this in changeVectorElementType, but you need a LLVMContext to call getVectorVT which we can't get from an MVT. Fixes PR50709.	2021-06-14 22:07:04 -07:00
Saleem Abdulrasool	a4fccf0a10	X86: pass swift_async context in R14 on Win64 Pass swift_async context in a callee-saved register rather than as a regular parameter. This is similar to the Swift `self` and `error` parameters.	2021-06-14 11:02:21 -07:00
Eric Astor	d15098fdc8	[ms] [llvm-ml] When parsing MASM, "jmp short" instructions are case insensitive Handle "short" in a case-insensitive fashion in MASM. Required to correctly parse z_Windows_NT-586_asm.asm from the OpenMP runtime. Reviewed By: thakis Differential Revision: https://reviews.llvm.org/D104195	2021-06-13 18:36:00 -04:00
Eric Astor	3e051c60b8	Fix misspelled instruction in X86 assembly parser Did not correctly handle "jecxz short <address>". Discovered while working on LLVM-ML; shows up in z_Windows_NT-586_asm.asm from the OpenMP runtime Reviewed By: MaskRay Differential Revision: https://reviews.llvm.org/D104194	2021-06-13 18:34:15 -04:00
Luo, Yuanke	426a9ac6ee	[X86] Check immediate before get it. For CMP imm instruction, when the operand 1 is symbol address we should check if it is immediate first. Here is the example code. `CMP64mi32 $noreg, 8, killed renamable $rcx, @d, $noreg, @a, implicit-def $eflags` Many thanks to Craig, Topper for the test case to reproduce this issue. Differential Revision: https://reviews.llvm.org/D104037	2021-06-13 15:40:52 +08:00
Luo, Yuanke	c9854fe645	Revert "[X86] Check immediate before get it." This reverts commit 9eb2f723c24523194b833779d20b027bf89a4f55.	2021-06-13 13:55:38 +08:00
Luo, Yuanke	4f7d0be5fe	[X86] Check immediate before get it. For CMP imm instruction, when the operand 1 is symbol address we should check if it is immediate first. Here is the example code. `CMP64mi32 $noreg, 8, killed renamable $rcx, @d, $noreg, @a, implicit-def $eflags` Many thanks to Craig, Topper for the test case to reproduce this issue. Differential Revision: https://reviews.llvm.org/D104037	2021-06-13 09:08:40 +08:00
Craig Topper	e263936592	[X86] Add ISD::FREEZE and ISD::AssertAlign to the list of opcodes that don't guarantee upper 32 bits are zero. The freeze issue was reported here https://llvm.discourse.group/t/bug-or-feature-freeze-instruction/3639 I don't have a test for AssertAlign. I just noticed it was missing and assume it should be similar to the other two Asserts. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D104178	2021-06-12 09:52:29 -07:00
Florian Hahn	f2662d35c8	Revert "[X86FixupLEAs] Transform the sequence LEA/SUB to SUB/SUB" This reverts commit 1b748faf2bae246e2fc77d88420df13c2e60f4df because it breaks building the llvm-test-suite with -verify-machineinstrs on X86: http://green.lab.llvm.org/green/job/test-suite-verify-machineinstrs-x86_64-O3/9585/ Running llc -verify-machineinstr on X86 crashes on the IR below: target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" %struct.widget = type { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [16 x [16 x i16]], [6 x [32 x i32]], [16 x [16 x i32]], [4 x [12 x [4 x [4 x i32]]]], [16 x i32], i8, i32, i32*, i32, i32, i32, i32, i32, %struct.baz, %struct.wobble.1, i32, i32, i32, i32, i32, i32, %struct.quux.2, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [3 x i32], i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32**, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [3 x [2 x i32]], [3 x [2 x i32]], i32, i32, i64, i64, %struct.zot.3, %struct.zot.3, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.baz = type { i32, i32, i32, i32, i32, i32, i32, i32, i32, %struct.snork, %struct.wombat.0, %struct.wobble, i32, i32, i32, i32, i32, i32, i32, i32, i32 (%struct.widget, %struct.eggs), i32, i32, i32, i32 } %struct.snork = type { %struct.spam, %struct.zot, i32 (%struct.wombat, %struct.widget, %struct.snork) } %struct.spam = type { i32, i32, i32, i32, i8, i32 } %struct.zot = type { i32, i32, i32, i32, i32, i8, i32* } %struct.wombat = type { i32, i32, i32, i32, i32, i32, i32, i32, void (i32, i32, i32, i32), void (%struct.wombat, %struct.widget, %struct.zot)* } %struct.wombat.0 = type { [4 x [11 x %struct.quux]], [2 x [9 x %struct.quux]], [2 x [10 x %struct.quux]], [2 x [6 x %struct.quux]], [4 x %struct.quux], [4 x %struct.quux], [3 x %struct.quux] } %struct.quux = type { i16, i8 } %struct.wobble = type { [2 x %struct.quux], [4 x %struct.quux], [3 x [4 x %struct.quux]], [10 x [4 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [5 x %struct.quux]], [10 x [5 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [15 x %struct.quux]] } %struct.eggs = type { [1000 x i8], [1000 x i8], [1000 x i8], i32, i32, i32, i32, i32, i32, i32, i32 } %struct.wobble.1 = type { i32, [2 x i32], i32, i32, %struct.wobble.1, %struct.wobble.1, i32, [2 x [4 x [4 x [2 x i32]]]], i32, i64, i64, i32, i32, [4 x i8], [4 x i8], i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.quux.2 = type { i32, i32, i32, i32, i32, %struct.quux.2* } %struct.zot.3 = type { i64, i16, i16, i16 } define void @blam(%struct.widget* %arg, i32 %arg1) local_unnamed_addr { bb: %tmp = load i32, i32* undef, align 4 %tmp2 = sdiv i32 %tmp, 6 %tmp3 = sdiv i32 undef, 6 %tmp4 = load i32, i32* undef, align 4 %tmp5 = icmp eq i32 %tmp4, 4 %tmp6 = select i1 %tmp5, i32 %tmp3, i32 %tmp2 %tmp7 = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 0, i64 0 %tmp8 = zext i16 undef to i32 %tmp9 = zext i16 undef to i32 %tmp10 = load i16, i16* undef, align 2 %tmp11 = zext i16 %tmp10 to i32 %tmp12 = zext i16 undef to i32 %tmp13 = zext i16 undef to i32 %tmp14 = zext i16 undef to i32 %tmp15 = load i16, i16* undef, align 2 %tmp16 = zext i16 %tmp15 to i32 %tmp17 = zext i16 undef to i32 %tmp18 = sub nsw i32 %tmp8, %tmp9 %tmp19 = shl nsw i32 undef, 1 %tmp20 = add nsw i32 %tmp19, %tmp18 %tmp21 = sub nsw i32 %tmp11, %tmp12 %tmp22 = shl nsw i32 undef, 1 %tmp23 = add nsw i32 %tmp22, %tmp21 %tmp24 = sub nsw i32 %tmp13, %tmp14 %tmp25 = shl nsw i32 undef, 1 %tmp26 = add nsw i32 %tmp25, %tmp24 %tmp27 = sub nsw i32 %tmp16, %tmp17 %tmp28 = shl nsw i32 undef, 1 %tmp29 = add nsw i32 %tmp28, %tmp27 %tmp30 = sub nsw i32 %tmp20, %tmp29 %tmp31 = sub nsw i32 %tmp23, %tmp26 %tmp32 = shl nsw i32 %tmp30, 1 %tmp33 = add nsw i32 %tmp32, %tmp31 store i32 %tmp33, i32* undef, align 4 %tmp34 = mul nsw i32 %tmp31, -2 %tmp35 = add nsw i32 %tmp34, %tmp30 store i32 %tmp35, i32* undef, align 4 %tmp36 = select i1 %tmp5, i32 undef, i32 undef br label %bb37 bb37: ; preds = %bb %tmp38 = load i32, i32* undef, align 4 %tmp39 = ashr i32 %tmp38, %tmp6 %tmp40 = load i32, i32* undef, align 4 %tmp41 = sdiv i32 %tmp39, %tmp40 store i32 %tmp41, i32* undef, align 4 ret void }	2021-06-12 11:41:38 +01:00
Florian Hahn	daff62b038	Revert "[X86FixupLEAs] Sub register usage of LEA dest should block LEA/SUB optimization" This reverts commit f35bcea1d4748889b8240defdf00cb7a71cbe070 because it depends on 1b748faf2bae246e2fc77d88420df13c2e60f4df, which breaks building the llvm-test-suite with -verify-machineinstrs on X86. See 154adc0f135cff3f8a8861c335d2b88c8049d098 for more details.	2021-06-12 11:40:47 +01:00
Guozhi Wei	0b1bb7b700	[X86FixupLEAs] Sub register usage of LEA dest should block LEA/SUB optimization In function searchALUInst, sub register usage of LEA dest should also block LEA/SUB optimization, otherwise the sub register usage gets an undefined value. This patch fixes https://bugs.llvm.org/show_bug.cgi?id=50615. Differential Revision: https://reviews.llvm.org/D103922	2021-06-11 09:45:56 -07:00
Simon Pilgrim	165132af1b	[ADT] Remove APInt/APSInt toString() std::string variants <string> is currently the highest impact header in a clang+llvm build: https://commondatastorage.googleapis.com/chromium-browser-clang/llvm-include-analysis.html One of the most common places this is being included is the APInt.h header, which needs it for an old toString() implementation that returns std::string - an inefficient method compared to the SmallString versions that it actually wraps. This patch replaces these APInt/APSInt methods with a pair of llvm::toString() helpers inside StringExtras.h, adjusts users accordingly and removes the <string> from APInt.h - I was hoping that more of these users could be converted to use the SmallString methods, but it appears that most end up creating a std::string anyhow. I avoided trying to use the raw_ostream << operators as well as I didn't want to lose having the integer radix explicit in the code. Differential Revision: https://reviews.llvm.org/D103888	2021-06-11 13:19:15 +01:00
Bing1 Yu	c7250ce1db	[X86] Support __tile_stream_loadd intrinsic for new AMX interface Adding support for __tile_stream_loadd intrinsic. Reviewed By: LuoYuanke Differential Revision: https://reviews.llvm.org/D103784	2021-06-11 17:28:43 +08:00
Luo, Yuanke	83aebff64b	[X86][NFC] Fix typo.	2021-06-10 22:49:11 +08:00
Craig Topper	c39f9ef639	[X86] Check destination element type before forming VTRUNCS/VTRUNCUS in combineTruncateWithSat. Fixes crash reported here https://reviews.llvm.org/D73607 Using a store to keep the trunc intact. Returning v16i24 would cause the trunc to be optimized away in SelectionDAGBuilder. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D103940	2021-06-09 07:08:17 -07:00
Simon Pilgrim	91a6df2ba4	[X86][SLM] Adjust XMM non-PMULLD throughput costs to half rate. Match what's reported in the costs table, Agner's tables and the Intel AOM	2021-06-09 13:51:40 +01:00
Nick Desaulniers	42052632ff	reland [IR] make -stack-alignment= into a module attr Relands commit 433c8d950cb3a1fa0977355ce0367e8c763a3f13 with fixes for MIPS. Similar to D102742, specifying the stack alignment via CodegenOpts means that this flag gets dropped during LTO, unless the command line is re-specified as a plugin opt. Instead, encode this information as a module level attribute so that we don't have to expose this llvm internal flag when linking the Linux kernel with LTO. Looks like external dependencies might need a fix: * https://github.com/llvm-hs/llvm-hs/issues/345 * https://github.com/halide/Halide/issues/6079 Link: https://github.com/ClangBuiltLinux/linux/issues/1377 Reviewed By: tejohnson Differential Revision: https://reviews.llvm.org/D103048	2021-06-08 10:59:46 -07:00
Nick Desaulniers	579f298a64	Revert "[IR] make -stack-alignment= into a module attr" This reverts commit 433c8d950cb3a1fa0977355ce0367e8c763a3f13. Breaks the MIPS build.	2021-06-08 08:55:50 -07:00
Nick Desaulniers	5c936095e3	[IR] make -stack-alignment= into a module attr Similar to D102742, specifying the stack alignment via CodegenOpts means that this flag gets dropped during LTO, unless the command line is re-specified as a plugin opt. Instead, encode this information as a module level attribute so that we don't have to expose this llvm internal flag when linking the Linux kernel with LTO. Looks like external dependencies might need a fix: * https://github.com/llvm-hs/llvm-hs/issues/345 * https://github.com/halide/Halide/issues/6079 Link: https://github.com/ClangBuiltLinux/linux/issues/1377 Reviewed By: tejohnson Differential Revision: https://reviews.llvm.org/D103048	2021-06-08 08:31:04 -07:00
Simon Pilgrim	b21f333bb4	[CostModel][X86] Improve AVX1/AVX2 truncation costs Based off the worse case numbers generated by D103695, we were overestimating the cost of a number of vector truncations: AVX2: v2i32->v2i8, v2i64->v2i16 + v4i64->v4i32 AVX1: v2i32->v2i8, v4i64->v4i16 + v16i16->v16i8 Once we have a working set of conversion costs, the intention is to cleanup the tables and use legalized types a lot more to reduce the number of entries we currently have.	2021-06-08 10:41:03 +01:00
Harald van Dijk	9e8b1c8ad9	[X32] Add Triple::isX32(), use it. So far, support for x86_64-linux-gnux32 has been handled by explicit comparisons of Triple.getEnvironment() to GNUX32. This worked as long as x86_64-linux-gnux32 was the only X32 environment to worry about, but we now have x86_64-linux-muslx32 as well. To support this, this change adds an isX32() function and uses it. It replaces all checks for GNUX32 or MuslX32 by isX32(), except for the following: - Triple::isGNUEnvironment() and Triple::isMusl() are supposed to treat GNUX32 and MuslX32 differently. - computeTargetTriple() needs to be able to transform triples to add or remove X32 from the environment and needs to map GNU to GNUX32, and Musl to MuslX32. - getMultiarchTriple() completely lacks any Musl support and retains the explicit check for GNUX32 as it can only return x86_64-linux-gnux32. Reviewed By: MaskRay Differential Revision: https://reviews.llvm.org/D103777	2021-06-07 20:48:39 +01:00
Simon Pilgrim	779c968967	[CostModel][X86] Add 512-bit bswap costs	2021-06-06 22:36:34 +01:00
Simon Pilgrim	3d02a9f278	[CostModel][X86] Improve AVX512 FDIV costs Add missing v16f32/v8f64 costs and adjust other costs as well based off the SkylakeServer model	2021-06-06 21:41:05 +01:00
Simon Pilgrim	0e6ebc4153	[X86][SSE] LowerFP_TO_INT - remove dead code. NFCI. Non-Strict v2f32->v2i64 cases have already early-returned to be handled by legalization.	2021-06-06 20:04:15 +01:00
Simon Pilgrim	2cad20b386	[X86][SSE] combineVectorTruncation - simplify PSHUFB-is-better logic. NFCI. OutSVT is guaranteed to be i8/i16 and we accept any InSVT that isn't i64	2021-06-06 20:04:14 +01:00
Simon Pilgrim	ee630c4d10	X86MachObjectWriter.cpp - silence null deference warnings. NFCI. The MCSymbol data should always be present for non-absolute sections so assert that it is to silence static analysis warnings.	2021-06-06 15:33:47 +01:00
Nikita Popov	6fa17c5cd4	[TargetLowering] Use IRBuilderBase instead of IRBuilder<> (NFC) Don't require a specific kind of IRBuilder for TargetLowering hooks. This allows us to drop the IRBuilder.h include from TargetLowering.h. Differential Revision: https://reviews.llvm.org/D103759	2021-06-06 16:29:50 +02:00
Simon Pilgrim	1271ac390e	X86Operand.h - fix uninitialized variable warnings in constructor. NFCI.	2021-06-06 15:25:03 +01:00
Nikita Popov	a84c0ba160	[CodeGen] Add missing includes (NFC) These currently rely on the IRBuilder.h include in TargetLowering.h. Make them explicit.	2021-06-06 15:48:27 +02:00
Fangrui Song	0d92547dcb	Fix some -Wunused-but-set-variable in -DLLVM_ENABLE_ASSERTIONS=off build	2021-06-04 23:34:43 -07:00
Roman Lebedev	55213cd08b	[X86] AMD Zen 3: double the LoopMicroOpBufferSize While the IndVars issue (PR50384) has been resolved, and the compile performance improved, a new blocker emerged, the codegen machine instruction scheduling is also quadratic. So we still can't really specify the right value here. Filed PR50584.	2021-06-05 01:23:58 +03:00
Rong Xu	559805b594	[SampleFDO] New hierarchical discriminator for FS SampleFDO (llvm-profdata part) This patch was split from https://reviews.llvm.org/D102246 [SampleFDO] New hierarchical discriminator for Flow Sensitive SampleFDO This is for llvm-profdata part of change. It sets the bit masks for the profile reader in llvm-profdata. Also add an internal option "-fs-discriminator-pass" for show and merge command to process the profile offline. This patch also moved setDiscriminatorMaskedBitFrom() to SampleProfileReader::create() to simplify the interface. Differential Revision: https://reviews.llvm.org/D103550	2021-06-04 11:22:06 -07:00
Fangrui Song	dfa955b8df	[MC] Delete unneeded MCAsmParser &Parser	2021-06-02 16:10:18 -07:00
Fangrui Song	5ddfda0ee0	[MC] Change "unexpected tokens" to "expected newline" and remove unneeded "in .xxx directive"	2021-06-02 16:08:05 -07:00
Simon Pilgrim	7b5702378b	[X86][SSE] combineScalarToVector - only reuse broadcasts for scalar_to_vector if the source operands scalar types match We were hitting an issue when the scalar_to_vector source was being implicitly truncated (in this case to i8 to vXi1) but we were also using the i8 source in a broadcast to a vXi8 value. Fixes PR50374	2021-06-02 22:05:40 +01:00
Min-Yih Hsu	fc07b73b83	[CodeGen][NFC] Remove unused virtual function `TargetFrameLowering::emitCalleeSavedFrameMoves` with 4 arguments is not used anywhere in CodeGen. Thus it shouldn't be exposed as a virtual function. NFC. Differential Revision: https://reviews.llvm.org/D103328	2021-06-02 13:11:12 -07:00
Rong Xu	f505b894a2	[SampleFDO] New hierarchical discriminator for FS SampleFDO (ProfileData part) This patch was split from https://reviews.llvm.org/D102246 [SampleFDO] New hierarchical discriminator for Flow Sensitive SampleFDO This is mainly for ProfileData part of change. It will load FS Profile when such profile is detected. For an extbinary format profile, create_llvm_prof tool will add a flag to profile summary section. For other format profiles, the users need to use an internal option (-profile-isfs) to tell the compiler that the profile uses FS discriminators. This patch also simplified the bit API used by FS discriminators. Differential Revision: https://reviews.llvm.org/D103041	2021-06-02 10:32:52 -07:00
Michael Benfield	f8d5955717	[various] Remove or use variables which are unused but set. This is in preparation for the -Wunused-but-set-variable warning. Differential Revision: https://reviews.llvm.org/D102942	2021-06-01 15:38:48 -07:00
Daniel Sanders	71a22fb7f8	[globalisel][legalizer] Separate the deprecated LegalizerInfo from the current one It's still in use in a few places so we can't delete it yet but there's not many at this point. Differential Revision: https://reviews.llvm.org/D103352	2021-06-01 13:23:48 -07:00
Guozhi Wei	b2dfe60e88	[X86FixupLEAs] Transform the sequence LEA/SUB to SUB/SUB This patch transforms the sequence lea (reg1, reg2), reg3 sub reg3, reg4 to two sub instructions sub reg1, reg4 sub reg2, reg4 Similar optimization can also be applied to LEA/ADD sequence. The modifications to TwoAddressInstructionPass is to ensure the operands of ADD instruction has expected order (the dest register of LEA should be src register of ADD). Differential Revision: https://reviews.llvm.org/D101970	2021-06-01 10:31:30 -07:00
Roman Lebedev	0573b00888	[X86] AMD Zen 3 has fast variable per-lane shuffles ... but lane-crossing shuffles are slow.	2021-06-01 10:46:05 +03:00
Roman Lebedev	19a9e819da	[X86] Split FeatureFastVariableShuffle tuning into Lane-Crossing and Per-Lane variants Currently, X86 backend only has a global one-size-fits-all `FeatureFastVariableShuffle` feature, which controls profitability of both the cross-lane and per-lane variable shuffles. I guess, this has been fine so far. But at least on AMD Zen 3, while per-line variable shuffles (e.g. `VPSHUFB`) are as fast as as shuffles with fixed/immediate mask, while lane-crossing shuffles, e.g. `VPERMPS` is performing worse. So to get the benefits of variable-mask shuffles, but not the drawbacks of lane-crossing shuffles, as suggested by @RKSimon, split the feature flag into two. Differential Revision: https://reviews.llvm.org/D103274	2021-06-01 10:39:36 +03:00
Tim Northover	859ff3505c	SwiftTailCC: teach verifier musttail rules applicable to this CC. SwiftTailCC has a different set of requirements than the C calling convention for a tail call. The exact argument sequence doesn't have to match, but fewer ABI-affecting attributes are allowed. Also make sure the musttail diagnostic triggers if a musttail call isn't actually a tail call.	2021-05-28 11:12:00 +01:00
Simon Pilgrim	0c78aae183	[CostModel][X86] Improve accuracy of sext/zext to 256-bit vector costs on AVX1 targets Determined from llvm-mca analysis (btver2 vs bdver2 vs sandybridge), the split+extends+concat sequence on AVX1 capable targets are cheaper than the #ops that the cost was previously based on.	2021-05-27 18:17:50 +01:00
Craig Topper	3180a39f2d	[X86] Fold (shift undef, X)->0 for vector shifts by immediate. We could previously do this by accident through the later call to getTargetConstantBitsFromNode I think, but that only worked if N0 had a single use. This patch makes it explicit for undef and doesn't have a use count check. I think this is needed to move the (shl X, 1)->(add X, X) fold to isel for PR50468. We need to be sure X won't be IMPLICIT_DEF which might prevent the same vreg from being used for both operands. Differential Revision: https://reviews.llvm.org/D103192	2021-05-27 09:31:47 -07:00
Simon Pilgrim	154058d3b3	[CostModel][X86] AVX512 truncation ops are slower than cost models indicate. The SkylakeServer model (and later IceLake/TigerLake targets according to Agner) have the PMOV truncations as uops=2, rthroughput=2 instructions. Noticed while trying to reduce the diffs between cost tables and llvm-mca analysis.	2021-05-27 16:07:42 +01:00
Luo, Yuanke	47e83818b4	[X86][AMX] Fix a bug on tile config. The previous code detect if a MBB is bottom block to determine if it is a backedge of a loop. We should check latch block instead of bottom block and we should check the header and the bottom block are in the same loop. Differential Revision: https://reviews.llvm.org/D103145	2021-05-26 21:57:49 +08:00
Simon Pilgrim	6128b86b1d	[X86][SLM] Fix vector PSHUFB + variable shift resource/throughputs Match whats documented in the Intel AOM (+Agner) - PSHUFB xmm is really slow, and mmx/xmm vector shifts are half rate. Noticed while working to get the cost tables to more closely match llvm-mca analysis, in this case for shifts and truncations.	2021-05-26 11:14:21 +01:00
Simon Pilgrim	f18fae2383	[X86][Atom] Fix vector variable shift resource/throughputs Match whats documented in the Intel AOM - the non-immediate variants of the PSLL/PSRA/PSRL* shift instructions requires BOTH ports - this was being incorrectly modelled as EITHER port. Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.	2021-05-26 10:30:59 +01:00
Roman Lebedev	395eb3f2e5	[NFC][X86] clang-format X86TTIImpl::getInterleavedMemoryOpCostAVX2() I plan to make changes to it, and undoing formatting each time is not going to be fun.	2021-05-26 12:27:47 +03:00

1 2 3 4 5 ...

21745 Commits