
[llvm-mca][docs] Improve the CommandLine documentation.

This patch replaces all the remaining occurrences of string "MCA" with
":program:`llvm-mca`".  Somehow I missed those strings when I committed r338394.

This patch also improves section "Instruction Dispatch".

llvm-svn: 338881
Author: Andrea Di Biagio
Date:   2018-08-03 12:44:56 +00:00
Commit: 1aca2c2e82 (parent e27c5b613c)


@@ -454,8 +454,8 @@ The ``-all-stats`` command line option enables extra statistics and performance
 counters for the dispatch logic, the reorder buffer, the retire control unit,
 and the register file.

-Below is an example of ``-all-stats`` output generated by MCA for the
-dot-product example discussed in the previous sections.
+Below is an example of ``-all-stats`` output generated by :program:`llvm-mca`
+for the dot-product example discussed in the previous sections.

 .. code-block:: none
@@ -514,17 +514,16 @@ SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
 logic is unable to dispatch a group of two instructions because the scheduler's
 queue is full.

-Looking at the *Dispatch Logic* table, we see that the pipeline was only able
-to dispatch two instructions 51.5% of the time. The dispatch group was limited
-to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The
+Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
+dispatch two instructions 51.5% of the time. The dispatch group was limited to
+one instruction 44.6% of the cycles, which corresponds to 272 cycles. The
 dispatch statistics are displayed by either using the command option
 ``-all-stats`` or ``-dispatch-stats``.

 The next table, *Schedulers*, presents a histogram displaying a count,
 representing the number of instructions issued on some number of cycles. In
-this case, of the 610 simulated cycles, single
-instructions were issued 306 times (50.2%) and there were 7 cycles where
-no instructions were issued.
+this case, of the 610 simulated cycles, single instructions were issued 306
+times (50.2%) and there were 7 cycles where no instructions were issued.

 The *Scheduler's queue usage* table shows that the maximum number of buffer
 entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
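As a point of reference, the statistics discussed above do not require the full ``-all-stats`` report; each view has its own flag. A minimal sketch of both invocations, assuming the dot-product kernel from the earlier sections is saved as ``dot-product.s`` and the AMD Jaguar (btver2) model is targeted:

.. code-block:: bash

  # Full set of extra statistics, as used for the report above.
  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -all-stats dot-product.s

  # Only the dispatch statistics view.
  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -dispatch-stats dot-product.s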
@@ -543,28 +542,28 @@ A full scheduler queue is either caused by data dependency chains or by a
 sub-optimal usage of hardware resources. Sometimes, resource pressure can be
 mitigated by rewriting the kernel using different instructions that consume
 different scheduler resources. Schedulers with a small queue are less resilient
-to bottlenecks caused by the presence of long data dependencies.
-The scheduler statistics are displayed by
-using the command option ``-all-stats`` or ``-scheduler-stats``.
+to bottlenecks caused by the presence of long data dependencies. The scheduler
+statistics are displayed by using the command option ``-all-stats`` or
+``-scheduler-stats``.

 The next table, *Retire Control Unit*, presents a histogram displaying a count,
 representing the number of instructions retired on some number of cycles. In
-this case, of the 610 simulated cycles, two instructions were retired during
-the same cycle 399 times (65.4%) and there were 109 cycles where no
-instructions were retired. The retire statistics are displayed by using the
-command option ``-all-stats`` or ``-retire-stats``.
+this case, of the 610 simulated cycles, two instructions were retired during the
+same cycle 399 times (65.4%) and there were 109 cycles where no instructions
+were retired. The retire statistics are displayed by using the command option
+``-all-stats`` or ``-retire-stats``.

 The last table presented is *Register File statistics*. Each physical register
 file (PRF) used by the pipeline is presented in this table. In the case of AMD
-Jaguar, there are two register files, one for floating-point registers
-(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of
-the 900 instructions processed, there were 900 mappings created. Since this
-dot-product example utilized only floating point registers, the JFPuPRF was
-responsible for creating the 900 mappings. However, we see that the pipeline
-only used a maximum of 35 of 72 available register slots at any given time. We
-can conclude that the floating point PRF was the only register file used for
-the example, and that it was never resource constrained. The register file
-statistics are displayed by using the command option ``-all-stats`` or
+Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
+and one for integer registers (JIntegerPRF). The table shows that of the 900
+instructions processed, there were 900 mappings created. Since this dot-product
+example utilized only floating point registers, the JFPuPRF was responsible for
+creating the 900 mappings. However, we see that the pipeline only used a
+maximum of 35 of 72 available register slots at any given time. We can conclude
+that the floating point PRF was the only register file used for the example, and
+that it was never resource constrained. The register file statistics are
+displayed by using the command option ``-all-stats`` or
 ``-register-file-stats``.

 In this example, we can conclude that the IPC is mostly limited by data
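The per-view options named above are independent flags, so they can presumably be combined in a single run when only a subset of the ``-all-stats`` output is of interest; a sketch under the same assumptions as before:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 \
      -scheduler-stats -retire-stats -register-file-stats dot-product.s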
@@ -572,8 +571,8 @@ dependencies, and not by resource pressure.
 Instruction Flow
 ^^^^^^^^^^^^^^^^

-This section describes the instruction flow through MCA's default out-of-order
-pipeline, as well as the functional units involved in the process.
+This section describes the instruction flow through the default pipeline of
+:program:`llvm-mca`, as well as the functional units involved in the process.

 The default pipeline implements the following sequence of stages used to
 process instructions.
@@ -585,9 +584,9 @@ process instructions.

 The default pipeline only models the out-of-order portion of a processor.
 Therefore, the instruction fetch and decode stages are not modeled. Performance
-bottlenecks in the frontend are not diagnosed. MCA assumes that instructions
-have all been decoded and placed into a queue. Also, MCA does not model branch
-prediction.
+bottlenecks in the frontend are not diagnosed. :program:`llvm-mca` assumes that
+instructions have all been decoded and placed into a queue before the simulation
+start. Also, :program:`llvm-mca` does not model branch prediction.

 Instruction Dispatch
 """"""""""""""""""""
@@ -607,19 +606,19 @@ An instruction can be dispatched if:
 * The schedulers are not full.

 Scheduling models can optionally specify which register files are available on
-the processor. MCA uses that information to initialize register file
-descriptors. Users can limit the number of physical registers that are
+the processor. :program:`llvm-mca` uses that information to initialize register
+file descriptors. Users can limit the number of physical registers that are
 globally available for register renaming by using the command option
-``-register-file-size``. A value of zero for this option means *unbounded*.
-By knowing how many registers are available for renaming, MCA can predict
-dispatch stalls caused by the lack of registers.
+``-register-file-size``. A value of zero for this option means *unbounded*. By
+knowing how many registers are available for renaming, the tool can predict
+dispatch stalls caused by the lack of physical registers.

 The number of reorder buffer entries consumed by an instruction depends on the
-number of micro-opcodes specified by the target scheduling model. MCA's
-reorder buffer's purpose is to track the progress of instructions that are
-"in-flight," and to retire instructions in program order. The number of
-entries in the reorder buffer defaults to the `MicroOpBufferSize` provided by
-the target scheduling model.
+number of micro-opcodes specified for that instruction by the target scheduling
+model. The reorder buffer is responsible for tracking the progress of
+instructions that are "in-flight", and retiring them in program order. The
+number of entries in the reorder buffer defaults to the value specified by field
+`MicroOpBufferSize` in the target scheduling model.

 Instructions that are dispatched to the schedulers consume scheduler buffer
 entries. :program:`llvm-mca` queries the scheduling model to determine the set
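As a rough illustration of the ``-register-file-size`` behaviour described above, the two invocations below contrast the unbounded case with an arbitrary cap of 64 physical registers (the value 64 is illustrative, not a Jaguar parameter; the input file is the same hypothetical ``dot-product.s``):

.. code-block:: bash

  # A value of zero means an unbounded number of physical registers.
  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -register-file-size=0 dot-product.s

  # Cap register renaming at 64 physical registers; dispatch stalls caused by
  # the lack of physical registers show up in the dispatch statistics.
  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -register-file-size=64 \
      -dispatch-stats dot-product.s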