Changes between Version 20 and Version 21 of GcnTimings


Timestamp:
06/11/16 16:00:23 (8 years ago)
Author:
trac
Comment:

--

  • GcnTimings

    v20 v21  
    44<h2>GCN ISA Instruction Timings</h2>
    55<h3>Preliminary explanations</h3>
    6 <p>The almost instructions are executed within 4 cycles (scalar and vector). Hence, to
    7 achieve maximum performance, 4 wavefront per compute units must be ran.</p>
    8 <p>NOTE: Simple single dword (4-byte) instruction is executed in 4 cycles (thanks fast
    9 dispatching from cache). However, 2 dword instruction can require 4 extra cycles
    10 to execution due to bigger size in memory and limits of instruction dispatching.
     6<p>Almost all instructions (scalar and vector) are executed within 4 cycles. Hence, to
     7achieve maximum performance, 4 wavefronts should be executed per compute unit.</p>
     8<p>NOTE: a simple single dword (4-byte) instruction is executed in 4 cycles (thanks to fast
     9dispatching from cache). However, a 2 dword (8-byte) instruction may require 4 extra cycles
     10for execution due to bigger size in memory and limits of instruction dispatching.
    1111To achieve best performance, we recommend to use single dword instructions.</p>
    12 <p>In some tables present DPFACTOR term. This term indicates that number of cycles depends
    13 on the model of GPU as follows:</p>
     12<p>A DPFACTOR term is present in some tables; it indicates that the number of cycles depends
     13on the model of the GPU as follows:</p>
    1414<table>
    1515<thead>
     
    127127</tbody>
    128128</table>
    129 <p>Waves - number of concurrent waves that can be computed by single SIMD unit<br />
    130 SGPRs - number of maximum SGPRs that can be allocated that occupancy<br />
    131 VPGRs - number of maximum VGPRs that can be allocated that occupancy<br />
    132 LdsW/I - Maximum amount of LDS space per vector lane per wavefront in dwords<br />
    133 Issue - number of maximum instruction per clock that can be processed  </p>
    134 <p>Each compute unit partitioned into four SIMD units. So, maximum number of waves per
     129<p>Waves - number of concurrent waves that can be computed by a single SIMD unit
     130SGPRs - number of maximum SGPRs that can be allocated at that occupancy
     131VGPRs - number of maximum VGPRs that can be allocated at that occupancy
     132LdsW/I - maximum amount of LDS space per vector lane per wavefront in dwords
     133Issue - maximum number of instructions per clock</p>
     134<p>Each compute unit is partitioned into four SIMD units. So, the maximum number of waves per
    135135compute unit is 40.</p>
    136136<h3>Instruction alignment</h3>
     
    139139<li>any penalty costs 4 cycles</li>
    140140<li>program divided by in 32-byte blocks</li>
    141 <li>only first 3 places (dwords) in 32-byte block is free (no penalty). Any 2-dword
    142 instruction outside these first 3 dwords adds single penalty.</li>
    143 <li>if instructions is longer (more than four cycles) then last cycles/4 dwords are free</li>
    144 <li>if 16 or more cycle 2-dword instruction and 2 dword instruction in 4 dword, then
    145 no penalty for second 2-dword instruction.</li>
    146 <li>best place to jump is 5 first dwords in 32-byte block. Jump to rest of dwords causes
    147 1-3 penalties, depending on number of dword (N-4, where N is number of dword). This rule
     141<li>only the first 3 dwords in the 32-byte block incur no penalty. Any 2-dword
     142instruction outside these first 3 dwords adds a single penalty.</li>
     143<li>if the instruction is longer (more than four cycles), then the last cycles/4 dwords are free</li>
     144<li>if 16 or more cycle 2-dword instruction and 2-dword instruction in 4 dword, then there is
     145no penalty for the second 2-dword instruction.</li>
     146<li>best place to jump is the 5 first dwords in the 32-byte block. Jump to rest of the dwords causes
     1471-3 penalties, depending on number of dwords (N-4, where N is the dword number). This rule
    148148does not apply to backward jumps (???)</li>
    149 <li>any conditional jump instruction should be in first half of 32-byte block, otherwise
    150 1-4 penalties will be added if jump was not taken, depending on number of dword
    151 (N-3, where N is number of dword).</li>
     149<li>any conditional jump instruction should be in first half of the 32-byte block, otherwise
     1501-4 penalties are added if jump is not taken, depending on dword number (N-3, where N is dword number).</li>
    152151</ul>
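<p>The two jump rules above (N-4 for jump targets, N-3 for untaken conditional jumps) can be sketched as a small calculator. This is only an illustration: the function names are hypothetical, and it assumes dwords inside a 32-byte block are numbered from 0, which is the numbering that reproduces the stated ranges of 1-3 and 1-4 penalties.</p>

```python
DWORDS_PER_BLOCK = 8   # a 32-byte block holds eight 4-byte dwords
PENALTY_CYCLES = 4     # "any penalty costs 4 cycles"


def dword_index(byte_offset):
    """0-based dword position of a byte offset inside its 32-byte block (assumed numbering)."""
    return (byte_offset // 4) % DWORDS_PER_BLOCK


def jump_target_penalties(target_offset):
    """Penalties for a forward jump landing at target_offset (N-4 rule).

    The first 5 dwords (0..4) are free; dwords 5..7 cost 1..3 penalties.
    """
    return max(0, dword_index(target_offset) - 4)


def untaken_cond_jump_penalties(jump_offset):
    """Penalties for a conditional jump at jump_offset that is not taken (N-3 rule).

    The first half of the block (dwords 0..3) is free; dwords 4..7 cost 1..4 penalties.
    """
    return max(0, dword_index(jump_offset) - 3)


def penalty_cycles(penalties):
    """Convert a penalty count into cycles."""
    return penalties * PENALTY_CYCLES
```

<p>For example, under these assumptions a jump target at byte offset 28 (dword 7) would incur 3 penalties, i.e. 12 extra cycles. Backward jumps are not modelled, since the text marks that case as uncertain.</p>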
    153 <p>IMPORTANT: If occupancy is greater than 1 wave per compute unit, then penalties,
     152<p>IMPORTANT: If the occupancy is greater than 1 wave per compute unit, then the penalties,
    154153branches, and scalar instructions will be masked while executing
    155154more waves than 4*CUs. For best results is recommended to execute many waves
     
    157156<h3>Instruction scheduling</h3>
    158157<ul>
    159 <li>if many wavefront executed in single CU (if many wavefronts) then scalar, vector and
    160 data-share, memory (???) execution units can run independently (parallely) way,
     158<li>if many wavefronts are executed in a single CU, then the scalar, vector,
     159data-share, and memory (???) execution units can run independently in parallel,
    161160achieving many instructions per cycle.</li>
    162 <li>between any integer V_ADD*, V_SUB*, V_FIRSTREADLINE_B32, V_READLANE_B32 operation
    163 and any scalar ALU instruction is 16-cycle delay. Masked if more waves than 4*CUs</li>
    164 <li>any conditional jump directly that checks VCCZ or EXECZ after instruction that changes
    165 VCC or EXEC adds single penalty (4 cycles). Masked if more waves than 4*CUs</li>
    166 <li>any conditional jump directly that checks SCC after instruction that changes SCC,
    167 EXEC, VCC adds single penalty (4 cycles). Masked if more waves than 4*CUs</li>
     161<li>between any integer V_ADD*, V_SUB*, V_FIRSTREADLINE_B32, V_READLANE_B32 operations
     162and any scalar ALU instructions there is 16-cycle delay. Masked if there are more waves than 4*CUs.</li>
     163<li>any conditional jump that directly checks VCCZ or EXECZ after an instruction that changes
     164VCC or EXEC adds a single penalty (4 cycles). Masked if there are more waves than 4*CUs.</li>
     165<li>any conditional jump that directly checks SCC after an instruction that changes SCC,
     166EXEC, VCC adds a single penalty (4 cycles). Masked if there are more waves than 4*CUs.</li>
    168167</ul>
    169168<h3>SOP2 Instruction timings</h3>
    170 <p>All SOP2 instructions (S_CBRANCH_G_FORK not checked) takes 4 cycles.</p>
     169<p>All SOP2 instructions (S_CBRANCH_G_FORK not checked) take 4 cycles.</p>
    171170<h3>SOPK Instruction timings</h3>
    172 <p>All SOPK instructions (S_CBRANCH_I_FORK  not checked) takes 4 cycles.
    173 S_SETREG_B32 and S_SETREG_IMM32_B32 takes 8 cycles.</p>
     171<p>All SOPK instructions (S_CBRANCH_I_FORK  not checked) take 4 cycles.
     172S_SETREG_B32 and S_SETREG_IMM32_B32 take 8 cycles.</p>
    174173<h3>SOP1 Instruction timings</h3>
    175 <p>The S_*_SAVEEXEC_B64 instructions takes 8 cycles. Other ALU instructions (except
     174<p>The S_*_SAVEEXEC_B64 instructions take 8 cycles. Other ALU instructions (except
    176175S_MOV_REGRD_B32, S_CBRANCH_JOIN, S_RFE_B64) take 4 cycles.</p>
    177176<h3>SOPC Instruction timings</h3>
    178 <p>All comparison and bit checking instructions takes 4 cycles.</p>
     177<p>All comparison and bit checking instructions take 4 cycles.</p>
    179178<h3>SOPP Instruction timings</h3>
    180 <p>Jumps costs 4 (no jump) or 20 cycles (???) if jump will performed.</p>
     179<p>Jumps cost 4 cycles (no jump) or 20 cycles (???) if the jump is performed.</p>
    181180<h3>SMRD Instruction timings</h3>
    182181<p>Timings of SMRD instructions includes only time to fetch and execute instruction without
     
    237236</table>
    238237<h3>VOP2 Instruction timings</h3>
    239 <p>All VOP2 instructions takes 4 cycles. All instruction can achieve throughput 1 instruction
     238<p>All VOP2 instructions take 4 cycles. All instructions can achieve a throughput of 1 instruction
    240239per cycle.</p>
    241240<h3>VOP1 Instruction timings</h3>
    242 <p>Maximum throughput of these instruction can be calculated by using expression
    243 <code>(1/(CYCLES/4))</code> - for 4 cycles is 1 instruction per cycle, for 8 cycles is 1/2 instruction
    244 per cycle and etc.
     241<p>Maximum throughput of these instructions can be calculated by using the expression
     242<code>(1/(CYCLES/4))</code> - for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction
     243per cycle, etc.
    245244Timings of VOP1 instructions are in this table:</p>
    246245<table>
     
    455454</table>
    456455<h3>VOPC Instruction timings</h3>
    457 <p>Maximum throughput of these instruction can be calculated by using expression
    458 <code>(1/(CYCLES/4))</code> - for 4 cycles is 1 instruction per cycle, for 8 cycles is 1/2 instruction
    459 per cycle and etc.
    460 All 32-bit comparison instructions takes 4 cycles. All 64-bit comparison instructions takes
     456<p>Maximum throughput of these instructions can be calculated by using expression
     457<code>(1/(CYCLES/4))</code> - for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction
     458per cycle, etc.
     459All 32-bit comparison instructions take 4 cycles. All 64-bit comparison instructions take
    461460DPFACTOR*4 cycles.</p>
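<p>The throughput expression <code>(1/(CYCLES/4))</code> and the VOPC cycle rule can be written out as follows. This is a minimal sketch with hypothetical function names; the DPFACTOR value must come from the DPFACTOR table for the GPU model in question, and the value used in the usage note is purely illustrative.</p>

```python
from fractions import Fraction


def max_throughput(cycles):
    """Maximum throughput in instructions per cycle: 1/(CYCLES/4) = 4/CYCLES."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return Fraction(4, cycles)


def vopc_cycles(operand_bits, dpfactor):
    """Cycle count of a VOPC comparison.

    32-bit comparisons take 4 cycles, 64-bit comparisons take DPFACTOR*4 cycles.
    dpfactor is supplied by the caller (model dependent).
    """
    if operand_bits == 32:
        return 4
    if operand_bits == 64:
        return dpfactor * 4
    raise ValueError("operand_bits must be 32 or 64")
```

<p>With an illustrative DPFACTOR of 2, <code>vopc_cycles(64, 2)</code> gives 8 cycles, and <code>max_throughput(8)</code> gives 1/2 instruction per cycle.</p>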
    462461<h3>VOP3 Instruction timings</h3>
    463 <p>Maximum throughput of these instruction can be calculated by using expression
    464 <code>(1/(CYCLES/4))</code> - for 4 cycles is 1 instruction per cycle, for 8 cycles is 1/2 instruction
     462<p>Maximum throughput of these instructions can be calculated by using expression
     463<code>(1/(CYCLES/4))</code> - for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction
    465464per cycle and etc.
    466465Timings of VOP3 instructions are in this table:</p>
     
    653652<h3>DS Instruction timings</h3>
    654653<p>Timings of DS instructions includes only execution without waiting for completing
    655 LDS/GDS memory access on single wavefront. Throughput indicates maximal possible
     654LDS/GDS memory access on a single wavefront. Throughput indicates maximal possible
    656655throughput that excludes any other delays and penalties.
    657656Timings of DS instructions are in this table:</p>
     
    13671366</tbody>
    13681367</table>
    1369 <p>About bank conflict: The LDS memory is partitioned by 32 banks. The bank number is in
    1370 2-6 bit of the address. Bank conflict encounters when two addresses have this same
    1371 bank, but are not equal begins from 7 bit address
    1372 (the first 2 bits of addresses doesn't matter).
    1373 Any bank conflict adds penalty to timing and throughput. In worst case, the throughput
    1374 can be not greater 1/32 request per cycle.</p>
     1368<p>About bank conflicts: The LDS memory is partitioned into 32 banks. The bank number is in
     1369bits 2-6 of the address. A bank conflict occurs when two addresses hit the same
     1370bank but differ from bit 7 upward
     1371(the first 2 bits of the address do not matter).
     1372Any bank conflict adds a penalty to timing and throughput. In the worst case, the throughput
     1373is no greater than 1/32 request per cycle.</p>
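<p>The bank rule described above can be checked with a few lines of code. This is a sketch under the stated rule only (bank number in bits 2-6, conflict when the banks match but the addresses differ from bit 7 upward); the helper names are hypothetical and no actual timing is modelled.</p>

```python
from collections import defaultdict

NUM_BANKS = 32  # LDS is partitioned into 32 banks


def lds_bank(addr):
    """Bank number: bits 2-6 of the byte address."""
    return (addr >> 2) & (NUM_BANKS - 1)


def is_bank_conflict(a, b):
    """True if two byte addresses hit the same bank but differ from bit 7 upward."""
    return lds_bank(a) == lds_bank(b) and (a >> 7) != (b >> 7)


def worst_conflict_degree(addrs):
    """Largest number of distinct conflicting accesses that fall into one bank.

    A result of 1 means no conflicts; 32 corresponds to the worst case
    mentioned in the text (at most 1/32 request per cycle).
    """
    rows_per_bank = defaultdict(set)
    for addr in addrs:
        rows_per_bank[lds_bank(addr)].add(addr >> 7)
    return max((len(rows) for rows in rows_per_bank.values()), default=0)
```

<p>For instance, 32 accesses with a stride of 4 bytes land in 32 different banks (degree 1), while a stride of 128 bytes puts every access into bank 0 (degree 32).</p>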
    13751374<h3>MUBUF Instruction timings</h3>
    13761375<p>Timings of MUBUF instructions includes only execution without waiting for completing
    1377 main memory access on single wavefront. Additional GLCX adds X cycles to instruction
    1378 if instruction uses GLC modifier. Timings of MUBUF instructions are in this table:</p>
     1376main memory access on a single wavefront. Additional GLCX adds X cycles to instruction
     1377if the instruction uses the GLC modifier. Timings of MUBUF instructions are in this table:</p>
    13791378<table>
    13801379<thead>