llvm-mirror

mirror of https://github.com/RPCS3/llvm-mirror.git synced 2025-02-01 13:11:39 +01:00

Author	SHA1	Message	Date
Nicholas Guy	f4899fdead	[ARM] Rearrange SizeReduction when using -Oz Move the Thumb2SizeReduce pass to before IfConversion when optimising for minimal code size. Running the Thumb2SizeReduction pass before IfConversion allows T1 instructions to propagate to the final output, rather than the ifConverter modifying T2 instructions and preventing them from being reduced later. This change does introduce a regression regarding execution time, so it's only applied when optimising for size. Running the LLVM Test Suite with this change produces a geomean difference of -0.1% for the size..text metric. Differential Revision: https://reviews.llvm.org/D82439	2020-07-02 09:19:38 +01:00
Sam Parker	dd087b6361	[NFC][ARM] Add test.	2020-07-01 09:28:56 +01:00
Sam Parker	e85afea7aa	[ARM][LowOverheadLoops] Handle reductions While validating live-out values, record instructions that look like a reduction. This will comprise a vector op (for now only vadd), a vorr (vmov) which stores the previous value of vadd and then a vpsel in the exit block which is predicated upon a vctp. This vctp will combine the last two iterations using the vmov and vadd into a vector which can then be consumed by a vaddv. Once we have determined that it's safe to perform tail-predication, we need to change this sequence of instructions so that the predication doesn't produce incorrect code. This involves changing the register allocation of the vadd so it updates itself and the predication on the final iteration will not update the falsely predicated lanes. This mimics what the vmov, vctp and vpsel do and so we then don't need any of those instructions. Differential Revision: https://reviews.llvm.org/D75533	2020-07-01 08:31:49 +01:00
Samuel Tebbs	dcd6d8787a	[ARM] Allow the fabs intrinsic to be tail predicated This patch stops the fabs intrinsic from blocking tail predication. Differential Revision: https://reviews.llvm.org/D82570	2020-06-30 17:27:28 +01:00
Samuel Tebbs	d908315744	[ARM] Allow the usub_sat and ssub_sat intrinsics to be tail predicated This patch stops the usub_sat and ssub_sat intrinsics from blocking tail predication. Differential Revision: https://reviews.llvm.org/D82571	2020-06-30 17:16:58 +01:00
Sjoerd Meijer	4fb902bfc8	[ARM][MVE] Tail-predication: clean-up of unused code After the rewrite of this pass (D79175) I missed one thing: the inserted VCTP intrinsic can be cloned to exit blocks if there are instructions present in it that perform the same operation, but this wasn't triggering anymore. However, it turns out that for handling reductions, see D75533, it's actually easier not to have the VCTP in exit blocks, so this removes that code. This was possible because it turned out that some other code that depended on this, rematerialization of the trip count enabling more dead code removal later, wasn't doing much anymore due to more aggressive dead code removal that was added to the low-overhead loops pass. Differential Revision: https://reviews.llvm.org/D82773	2020-06-30 17:09:36 +01:00
Samuel Tebbs	4fc642e8a5	[ARM] Allow rounding intrinsics to be tail predicated This patch stops the trunc, rint, round, floor and ceil intrinsics from blocking tail predication. Differential Revision: https://reviews.llvm.org/D82553	2020-06-30 16:52:25 +01:00
Sam Parker	5b21582cc5	[NFC][ARM] Tail predication reduction tests	2020-06-30 13:27:22 +01:00
David Green	6b973734b0	[ARM] Better reductions MVE has native reductions for integer add and min/max. The others need to be expanded to a series of extracts and scalar operators to reduce the vector into a single scalar. The default codegen for that expands the reduction into a series of in-order operations. This modifies that to something more suitable for MVE. The basic idea is to use vector operations until there are 4 remaining items then switch to pairwise operations. For example a v8f16 fadd reduction would become: Y = VREV X; Z = ADD(X, Y); z0 = Z[0] + Z[1]; z1 = Z[2] + Z[3]; return z0 + z1. The awkwardness (there is always some) comes in from something like a v4f16, which is first legalized by adding identity values to the extra lanes of the reduction. These cannot then be optimized away through the vrev; fadd combo, so the inserts remain. I've made sure they custom lower so that we can produce the pairwise additions before the extra values are added. Differential Revision: https://reviews.llvm.org/D81397	2020-06-29 16:04:13 +01:00
David Green	33b3ee4e0a	[ARM] VCVTT fpround instruction selection Similar to the recent patch for fpext, this adds vcvtb and vcvtt with insert into vector instruction selection patterns for fptruncs. This helps clear up a lot of register shuffling that we would otherwise do. Differential Revision: https://reviews.llvm.org/D81637	2020-06-26 10:24:06 +01:00
David Green	6ea3fc012b	[ARM] VCVTT instruction selection We currently extract and convert from a top lane of a f16 vector using a VMOVX;VCVTB pair. We can simplify that to use a single VCVTT. The pattern is mostly copied from a vector extract pattern, but produces a VCVTTHS f32 directly. This had to move some code around so that ARMInstrVFP had access to the required pattern frags that were previously part of ARMInstrNEON. Differential Revision: https://reviews.llvm.org/D81556	2020-06-26 08:58:55 +01:00
Sjoerd Meijer	ef2fc8e627	[SelectionDAG] Lower @llvm.get.active.lane.mask to setcc This lowers intrinsic @llvm.get.active.lane.mask to a setcc node, i.e. an icmp ule, and creates vectors for its 2 arguments on which the comparison is performed. Differential Revision: https://reviews.llvm.org/D82292	2020-06-26 07:46:38 +01:00
Sjoerd Meijer	9e7cf4b604	[ARM] Don't revert get.active.lane.mask in ARM Tail-Predication pass Don't revert intrinsic get.active.lane.mask here, this is moved to isel legalization in D82292. Differential Revision: https://reviews.llvm.org/D82105	2020-06-26 07:42:39 +01:00
David Green	bc2c8f2f3b	[ARM] Split FPExt loads This extends PerformSplittingToWideningLoad to also handle FP_Ext, as well as sign and zero extends. It uses an integer extending load followed by a VCVTL on the bottom lanes to efficiently perform an fpext on a smaller than legal type. The existing code had to be rewritten a little to not just split the node in two and let legalization handle it from there, but to actually split into legal chunks. Differential Revision: https://reviews.llvm.org/D81340	2020-06-25 21:55:13 +01:00
David Green	f40cf5ffb9	[ARM] MVE VCVT lowering for f16->f32 extends This adds code to lower f16 to f32 fp_exts using MVE VCVT instructions, similar to a recent patch for fp_trunc. Again it goes through the lowering of a BUILD_VECTOR, but is slightly simpler, only having to deal with interleaved indices. It adds a VCVTL node to lower to, similar to VCVTN. Differential Revision: https://reviews.llvm.org/D81339	2020-06-25 20:54:26 +01:00
David Green	5a95ea8473	[ARM] Add FP_ROUND handling to splitting MVE stores This splits MVE vector stores of a fp_trunc in the same way that we do for standard trunc's. It extends PerformSplittingToNarrowingStores to handle fp_round, splitting the store into pieces and adding a VCVTNb to perform the actual fp_round. The actual store is then converted to an integer store so that it can truncate bottom lanes of the result. Differential Revision: https://reviews.llvm.org/D81141	2020-06-25 19:37:15 +01:00
David Green	6e22f03ef5	[ARM] MVE VCVT lowering for f32->f16 truncs This adds code to lower f32 to f16 fp_truncs using a pair of MVE VCVT instructions. Due to v4f16 not being legal, fp_rounds are often split up fairly early. So this reconstructs the vcvts from a buildvector of fp_rounds from two vector inputs. Something like: BUILDVECTOR(FP_ROUND(EXTRACT_ELT(X, 0)), FP_ROUND(EXTRACT_ELT(Y, 0)), FP_ROUND(EXTRACT_ELT(X, 1)), FP_ROUND(EXTRACT_ELT(Y, 1)), ...) It adds a VCVTN node to handle this, which like VMOVN or VQMOVN lowers into the top/bottom lanes of an MVE instruction. Differential Revision: https://reviews.llvm.org/D81139	2020-06-25 15:59:36 +01:00
Sam Tebbs	2aff9bba46	[ARM] Allow tail predication on sadd_sat and uadd_sat intrinsics This patch stops the sadd_sat and uadd_sat intrinsics from blocking tail predication. Differential revision: https://reviews.llvm.org/D82377	2020-06-25 11:54:29 +01:00
David Green	680cf8ff46	[ARM] Mark more integer instructions as not having side effects. LDRD and STRD along with UBFX and SBFX are selected from DAGToDAG transforms, so do not have tblgen patterns. They therefore don't get marked as not having side effects, so cannot be scheduled as efficiently as you would like. This specifically marks them as not having side effects. Differential Revision: https://reviews.llvm.org/D82358	2020-06-23 22:45:51 +01:00
Tres Popp	e6cca551e7	Revert "[CGP] Enable CodeGenPrepares phi type convertion." This reverts commit 67121d7b82ed78a47ea32f0c87b7317e2b469ab2. This is causing compile times to be 2x slower on some large binaries.	2020-06-22 13:06:18 +02:00
David Green	2825171bc9	[CGP] Enable CodeGenPrepares phi type convertion.	2020-06-21 16:46:16 +01:00
Sjoerd Meijer	c43371a1bf	[ARM][MVE] tail-predication: renamed internal option. Renamed -force-tail-predication to -force-mve-tail-predication because that's more descriptive and consistent.	2020-06-19 15:07:06 +01:00
Lucas Prates	a515304bf7	[ARM] Supporting lowering of half-precision FP arguments and returns in AArch32's backend Summary: Half-precision floating point arguments and returns are currently promoted to either float or int32 in clang's CodeGen and there's no existing support for the lowering of `half` arguments and returns from IR in AArch32's backend. Such frontend coercions, implemented as coercion through memory in clang, can cause a series of issues in argument lowering, such as causing arguments to be stored on the wrong bits on big-endian architectures and missing overflow detection in the return of certain functions. This patch introduces the handling of half-precision arguments and returns in the backend using the actual "half" type on the IR. Using the "half" type the backend is able to properly enforce the AAPCS' directions for those arguments, making sure they are stored on the proper bits of the registers and performing the necessary floating point conversions. Reviewers: rjmccall, olista01, asl, efriedma, ostannard, SjoerdMeijer Reviewed By: ostannard Subscribers: stuij, hiraditya, dmgreen, llvm-commits, chill, dnsampaio, danielkiss, kristof.beyls, cfe-commits Tags: #clang, #llvm Differential Revision: https://reviews.llvm.org/D75169	2020-06-18 13:15:13 +01:00
Sjoerd Meijer	0d40769e87	[ARM] Reimplement MVE Tail-Predication pass using @llvm.get.active.lane.mask To set up a tail-predicated loop, we need to calculate the number of elements processed by the loop. We can now use intrinsic @llvm.get.active.lane.mask() to do this, which is emitted by the vectoriser in D79100. This intrinsic generates a predicate for the masked loads/stores, and consumes the Backedge Taken Count (BTC) as its second argument. We can now use that to reconstruct the loop tripcount, instead of the IR pattern match approach we were using before. Many thanks to Eli Friedman and Sam Parker for all their help with this work. This also adds overflow checks for the different, new expressions that we create: the loop tripcount, and the sub expression that calculates the remaining elements to be processed. For the latter, SCEV is not able to calculate precise enough bounds, so we work around that at the moment, but this is not entirely correct yet; it's conservative. The overflow checks can be overruled with a force flag, which is thus potentially unsafe (but not really because the vectoriser is the only place where this intrinsic is emitted at the moment). It's also good to mention that the tail-predication pass is not yet enabled by default. We will follow up to see if we can implement these overflow checks better, either by a change in SCEV or we may want to revise the definition of llvm.get.active.lane.mask. Differential Revision: https://reviews.llvm.org/D79175	2020-06-17 15:17:42 +01:00
David Green	9c4e61f00c	[LSR] Filter for postinc formulae In more complicated loops we can easily hit the complexity limits of loop strength reduction. If we do and filtering occurs, it's all too easy to remove the wrong formulae for post-inc preferring accesses due to it attempting to maximise register re-use. The patch adds an alternative filtering step when the target is preferring postinc to pick postinc formulae instead, hopefully lowering the complexity to below the limit so that aggressive filtering is not needed. There is also a change in here to stop considering existing addrecs as free under postinc. We should already be modelling them as a reg so don't want it to cause us to get the cost wrong. (I'm not sure that code makes sense in general, but there are X86 tests specifically for it where it seems to be helping so have left it around for the standard non-post-inc case). Differential Revision: https://reviews.llvm.org/D80273	2020-06-17 12:32:04 +01:00
David Green	3175ed5612	[ARM] Fix crash trying to generate i1 immediates These code patterns attempt to call isVMOVModifiedImm on a splat of i1 values, leading to an unreachable being hit. I've guarded the call on a more specific set of sizes, as i1 vectors are legal under MVE. Differential Revision: https://reviews.llvm.org/D81860	2020-06-16 12:27:24 +01:00
David Green	8a5c6f3865	[ARM] Add some MVE vecreduce tests. NFC	2020-06-09 12:07:19 +01:00
David Green	499648daa2	[ARM] VQMOVN demand bits analysis Similar to VMOVN, a VQMOVN will only demand the top/bottom lanes of its first input. However, unlike VMOVN it will need access to the entire second argument, as that value is saturated, not just moved in place. Differential Revision: https://reviews.llvm.org/D80515	2020-06-05 18:41:02 +01:00
David Green	5cd7010367	[ARM] FP16 conversion tests. NFC	2020-06-04 13:13:56 +01:00
David Green	9a7508e506	[ARM] Extra MVE VMLAV reduction patterns These patterns for i8 and i16 VMLA's were missing. They end up from legalized vector.reduce.add.v8i16 and vector.reduce.add.v16i8, and although the instruction works differently (the mul and add are performed in a higher precision), I believe it is OK because only an i8/i16 are demanded from them, and so the results will be the same. At least, they pass any testing I can think to run on them. There are some tests that end up looking worse, but are quite artificial due to passing half vector types through a call boundary. I would not expect the vmull to realistically come up like that, and a vmlava is likely better a lot of the time. Differential Revision: https://reviews.llvm.org/D80524	2020-05-29 16:23:24 +01:00
David Green	6307317d50	[ARM] More tests for MVE LSR and float issues. NFC	2020-05-28 22:04:12 +01:00
Victor Campos	8541a94229	[ARM] Fix rewrite of frame index in Thumb2's address mode i8s4 Summary: In Thumb2's frame index rewriting process, the address mode i8s4, which is used by LDRD and STRD instructions, is handled by taking the immediate offset operand and multiplying it by 4. This behaviour is wrong, however. In this specific address mode, the MachineInstr's immediate operand is already in the expected form. Consequently, multiplying it once more by 4 yields a flawed offset value, four times greater than it should be. Differential Revision: https://reviews.llvm.org/D80557	2020-05-27 13:09:13 +01:00
David Green	5d1d2066e7	[ARM] MVE VMINV/VMAXV test additions. NFC	2020-05-26 14:00:14 +01:00
David Green	92f08097a7	[ARM] VMULH tests for when other parts are working. NFC	2020-05-25 12:46:18 +01:00
Jean-Michel Gorius	2d66ce0e5e	Revert "[CodeGen] Add support for multiple memory operands in MachineInstr::mayAlias" This temporarily reverts commit 7019cea26dfef5882c96f278c32d0f9c49a5e516. It seems that, for some targets, there are instructions with a lot of memory operands (probably more than would be expected). This causes a lot of buildbots to timeout and notify failed builds. While investigations are ongoing to find out why this happens, revert the changes.	2020-05-22 21:26:46 +02:00
Jean-Michel Gorius	b6e158e140	[CodeGen] Add support for multiple memory operands in MachineInstr::mayAlias Summary: To support all targets, the mayAlias member function needs to support instructions with multiple operands. This revision also changes the order of the emitted instructions in some test cases. Reviewers: efriedma, hfinkel, craig.topper, dmgreen Reviewed By: efriedma Subscribers: MatzeB, dmgreen, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D80161	2020-05-21 23:02:54 +02:00
Sjoerd Meijer	a5cc4d095a	[HardwareLoops] llvm.loop.decrement.reg definition This is split off from D80316, slightly tightening the definition of overloaded hardwareloop intrinsic llvm.loop.decrement.reg, specifying that both operands and its result have the same type.	2020-05-21 10:48:16 +01:00
Pierre-vh	66d4cf9f6d	[Target][ARM] Make Low Overhead Loops coexist with VPT blocks. Previously, the LowOverheadLoops pass couldn't handle VPT blocks with conditions, or with multiple VCTPs. This patch improves the LowOverheadLoops pass so it can handle those cases. It also adds support for VCMPs before the VCTP. Differential Revision: https://reviews.llvm.org/D78206	2020-05-20 12:24:55 +01:00
Sam Parker	c326fd4f01	[NFC][ARM] Add more tail predication tests	2020-05-19 14:01:10 +01:00
David Green	e11168f2ca	[ARM] Patterns for VQSHRN Given a VQMOVN(VSHR), we can fold that into a VQSHRN simply enough using a few tablegen patterns. Differential Revision: https://reviews.llvm.org/D77720	2020-05-16 17:46:43 +01:00
David Green	c1a15ae1a8	[ARM] Combines for VMOVN This adds two combines for VMOVN, one to fold VMOVN[tb](c, VQMOVNb(a, b)) => VQMOVN[tb](c, b) The other to perform demand bits analysis on the lanes of a VMOVN. We know that only the bottom lanes of the second operand and the top or bottom lanes of the Qd operand are needed in the result, depending on if the VMOVN is bottom or top. Differential Revision: https://reviews.llvm.org/D77718	2020-05-16 15:13:16 +01:00
David Green	4120e7a927	[ARM] MVE saturating truncates This adds some custom lowering for VQMOVN, an instruction that can be used to perform saturating truncates from a pair of min(max(X, -0x8000), 0x7fff), providing those constants are correct. This leaves a VQMOVNBs which saturates the value and inserts that into the bottom lanes of an existing vector. We then need to do something with the other lanes, extending the value using a vmovlb. Ideally, as will often be the case, only the bottom lane of what remains will be demanded, allowing the vmovlb to be removed. Which should mean the instruction is either equal or a win most of the time, and allows some extra follow-up folding to happen. Differential Revision: https://reviews.llvm.org/D77590	2020-05-16 15:10:20 +01:00
David Green	b6aab3138f	[ARM] Extra VQMOVN/VQSHRN tests. NFC	2020-05-16 14:23:26 +01:00
David Green	4c9b5189ca	[ARM] Change more triples to arm-none-none-eabi. NFC	2020-05-15 22:53:07 +01:00
Anna Welker	8fb628942b	[ARM][MVE] Add support for incrementing scatters Adds support to build pre-incrementing scatters. If the increment (i.e., add instruction) that is merged into the scatter is the loop increment, an incrementing write-back scatter can be built, which then assumes the role of the loop increment. Differential Revision: https://reviews.llvm.org/D79859	2020-05-15 17:02:00 +01:00
David Green	439830ee0e	[ARM] Convert floating point splats to integer Under MVE a vdup will always take a gpr register, not a floating point value. During DAG combine we convert the types to a bitcast to an integer in an attempt to fold the bitcast into other instructions. This is OK, but only works inside the same basic block. To do the same trick across a basic block boundary we need to convert the type in codegenprepare, before the splat is sunk into the loop. This adds a convertSplatType function to codegenprepare to do that, putting bitcasts around the splat to force the type to an integer. There is then some adjustment to the code in shouldSinkOperands to handle the extra bitcasts. Differential Revision: https://reviews.llvm.org/D78728	2020-05-13 15:24:16 +01:00
David Green	0021a951bd	[ARM] Sink splats to fma intrinsics Similar to fmul/fadd, we can sink a splat into a loop containing a fma in order to use more register instruction variants. For that there are also adjustments to the sinking code to handle more than 2 arguments. Differential Revision: https://reviews.llvm.org/D78386	2020-05-13 14:58:30 +01:00
Pierre-vh	39a7b5b535	[LSR][ARM] Add new TTI hook to mark some LSR chains as profitable This patch adds a new TTI hook to allow targets to tell LSR that a chain including some instruction is already profitable and should not be optimized. This patch also adds an implementation of this TTI hook for ARM so LSR doesn't optimize chains that include the VCTP intrinsic. Differential Revision: https://reviews.llvm.org/D79418	2020-05-13 14:18:28 +01:00
Pierre-vh	90e5c93ad7	[Target][ARM] Replace re-uses of old VPR values with VPNOTs Differential Revision: https://reviews.llvm.org/D76847	2020-05-12 12:09:57 +01:00
Eli Friedman	f704804dd2	[SelectionDAG] Don't promote the alignment of allocas beyond the stack alignment. allocas in LLVM IR have a specified alignment. When that alignment is specified, the alloca has at least that alignment at runtime. If the specified type of the alloca has a higher preferred alignment, SelectionDAG currently ignores that specified alignment, and increases the alignment. It does this even if it would trigger stack realignment. I don't think this makes sense, so this patch changes that. I was looking into this for SVE in particular: for SVE, overaligning vscale'ed types is extra expensive because it requires realigning the stack multiple times, or using dynamic allocation. (This currently isn't implemented.) I updated the expected assembly for a couple tests; in particular, for arg-copy-elide.ll, the optimization in question does not increase the alignment the way SelectionDAG normally would. For the rest, I just increased the specified alignment on the allocas to match what SelectionDAG was inferring. Differential Revision: https://reviews.llvm.org/D79532	2020-05-11 17:39:00 -07:00
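
The reduction scheme described in commit 6b973734b0 above can be illustrated with a small Python model. This is only a sketch, not the backend's lowering code: the helper names `swap_adjacent` and `reduce_add_v8` are hypothetical, and it assumes that the reverse step swaps adjacent lanes, so that after the vector add the four partial sums sit in the even lanes (the commit message writes them simply as Z[0] to Z[3]). Integers are used so the result is exact.

```python
def swap_adjacent(lanes):
    """Model of the reverse step: swap each pair of neighbouring lanes.

    Assumption for illustration only; the real instruction and its lane
    granularity are chosen by the backend.
    """
    out = list(lanes)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def reduce_add_v8(x):
    """Add-reduce an 8-lane vector: one vector add, then pairwise scalar adds."""
    if len(x) != 8:
        raise ValueError("this sketch only models 8-lane vectors")
    y = swap_adjacent(x)                  # Y = VREV X
    z = [a + b for a, b in zip(x, y)]     # Z = ADD(X, Y)
    partial = z[0::2]                     # four remaining partial sums
    z0 = partial[0] + partial[1]          # pairwise stage
    z1 = partial[2] + partial[3]
    return z0 + z1


print(reduce_add_v8([1, 2, 3, 4, 5, 6, 7, 8]))
```

Note that for floating-point inputs this association order differs from a strict in-order expansion, so the rounding of the result can differ as well.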

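Commit ef2fc8e627 above describes lowering @llvm.get.active.lane.mask to an unsigned less-than-or-equal comparison between two vectors built from its arguments. The following Python model is a hedged sketch of that idea only: the function name `active_lane_mask` and its parameters are hypothetical, the first argument is assumed to be the index of the first lane, the second the backedge-taken count (as described in commit 0d40769e87), and integer overflow is deliberately not modelled.

```python
def active_lane_mask(base, btc, num_lanes):
    """Per-lane predicate: lane i is active when base + i <= btc.

    base      -- index of the first element handled in this iteration (assumed)
    btc       -- backedge-taken count, i.e. the last valid index (assumed)
    num_lanes -- vector width
    Overflow of base + i is not modelled in this sketch.
    """
    induction = [base + i for i in range(num_lanes)]  # first argument plus lane index
    limit = [btc] * num_lanes                         # splat of the second argument
    return [a <= b for a, b in zip(induction, limit)] # icmp ule, lane by lane


# Ten elements (indices 0..9) processed four at a time: the last
# iteration starts at index 8 and only its first two lanes are active.
print(active_lane_mask(8, 9, 4))
```

For the earlier iterations (base 0 and base 4 in this example) every lane satisfies the comparison, so the predicate only has an effect in the tail of the loop.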

1078 Commits