Changes between Version 20 and Version 21 of GcnTimings
- Timestamp: 06/11/16 16:00:23 (8 years ago)
GcnTimings
 v20   v21
   4     4  <h2>GCN ISA Instruction Timings</h2>
   5     5  <h3>Preliminary explanations</h3>
   6        <p>The almost instructions are executed within 4 cycles (scalar and vector). Hence, to
   7        achieve maximum performance, 4 wavefront per compute units must be ran.</p>
   8        <p>NOTE: Simple single dword (4-byte) instruction is executed in 4 cycles (thanks fast
   9        dispatching from cache). However, 2 dword instruction can require 4 extra cycles
  10        to execution due to bigger size in memory and limits of instruction dispatching.
         6  <p>Almost all instructions (scalar and vector) are executed within 4 cycles. Hence, to
         7  achieve maximum performance, 4 wavefronts should be executed per compute unit.</p>
         8  <p>NOTE: a simple single dword (4-byte) instruction is executed in 4 cycles (thanks to fast
         9  dispatching from cache). However, a 2 dword (8-byte) instruction may require 4 extra cycles
        10  for execution due to bigger size in memory and limits of instruction dispatching.
  11    11  To achieve best performance, we recommend to use single dword instructions.</p>
  12        <p>In some tables present DPFACTOR term. This term indicates that number of cycles depends
  13        on the model of GPU as follows:</p>
        12  <p>A DPFACTOR term is present in some tables; it indicates that the number of cycles depends
        13  on the model of the GPU as follows:</p>
  14    14  <table>
  15    15  <thead>
   …     …
 127   127  </tbody>
 128   128  </table>
 129        <p>Waves - number of concurrent waves that can be computed by single SIMD unit<br />
 130        SGPRs - number of maximum SGPRs that can be allocated that occupancy<br />
 131        VPGRs - number of maximum VGPRs that can be allocated that occupancy<br />
 132        LdsW/I - Maximum amount of LDS space per vector lane per wavefront in dwords<br />
 133        Issue - number of maximum instruction per clock that can be processed</p>
 134        <p>Each compute unit partitioned into four SIMD units. So, maximum number of waves per
       129  <p>Waves - number of concurrent waves that can be computed by a single SIMD unit
       130  SGPRs - number of maximum SGPRs that can be allocated at that occupancy
       131  VPGRs - number of maximum VGPRs that can be allocated at that occupancy
       132  LdsW/I - maximum amount of LDS space per vector lane per wavefront in dwords
       133  Issue - maximum number of instructions per clock</p>
       134  <p>Each compute unit is partitioned into four SIMD units. So, the maximum number of waves per
 135   135  compute unit is 40.</p>
 136   136  <h3>Instruction alignment</h3>
   …     …
 139   139  <li>any penalty costs 4 cycles</li>
 140   140  <li>program divided by in 32-byte blocks</li>
 141        <li>only first 3 places (dwords) in 32-byte block is free (no penalty). Any 2-dword
 142        instruction outside these first 3 dwords adds single penalty.</li>
 143        <li>if instructions is longer (more than four cycles) then last cycles/4 dwords are free</li>
 144        <li>if 16 or more cycle 2-dword instruction and 2 dword instruction in 4 dword, then
 145        no penalty for second 2-dword instruction.</li>
 146        <li>best place to jump is 5 first dwords in 32-byte block. Jump to rest of dwords causes
 147        1-3 penalties, depending on number of dword (N-4, where N is number of dword). This rule
       141  <li>only the first 3 dwords in the 32-byte block incur no penalty. Any 2-dword
       142  instruction outside these first 3 dwords adds a single penalty.</li>
       143  <li>if the instructions is longer (more than four cycles) then the last cycles/4 dwords are free</li>
       144  <li>if 16 or more cycle 2-dword instruction and 2-dword instruction in 4 dword, then there is
       145  no penalty for the second 2-dword instruction.</li>
       146  <li>best place to jump is the 5 first dwords in the 32-byte block. Jump to rest of the dwords causes
       147  1-3 penalties, depending on number of dwords (N-4, where N is the dword number). This rule
 148   148  does not apply to backward jumps (???)</li>
 149        <li>any conditional jump instruction should be in first half of 32-byte block, otherwise
 150        1-4 penalties will be added if jump was not taken, depending on number of dword
 151        (N-3, where N is number of dword).</li>
       149  <li>any conditional jump instruction should be in first half of the 32-byte block, otherwise
       150  1-4 penalties are added if jump is not taken, depending on dword number (N-3, where N is dword number).</li>
 152   151  </ul>
 153        <p>IMPORTANT: If occupancy is greater than 1 wave per compute unit, then penalties,
       152  <p>IMPORTANT: If the occupancy is greater than 1 wave per compute unit, then the penalties,
 154   153  branches, and scalar instructions will be masked while executing
 155   154  more waves than 4*CUs. For best results is recommended to execute many waves
   …     …
 157   156  <h3>Instruction scheduling</h3>
 158   157  <ul>
 159        <li>if many wavefront executed in single CU (if many wavefronts) then scalar, vector and
 160        data-share, memory (???) execution units can run independently (parallely) way,
       158  <li>if many wavefronts are executed in a single CU (if many wavefronts) then scalar, vector and
       159  data-share, memory (???) execution units can run independently in parallel,
 161   160  achieving many instructions per cycles.</li>
 162        <li>between any integer V_ADD*, V_SUB*, V_FIRSTREADLINE_B32, V_READLANE_B32 operation
 163        and any scalar ALU instruction is 16-cycle delay. Masked if more waves than 4*CUs</li>
 164        <li>any conditional jump directly that checks VCCZ or EXECZ after instruction that changes
 165        VCC or EXEC adds single penalty (4 cycles). Masked if more waves than 4*CUs</li>
 166        <li>any conditional jump directly that checks SCC after instruction that changes SCC,
 167        EXEC, VCC adds single penalty (4 cycles). Masked if more waves than 4*CUs</li>
       161  <li>between any integer V_ADD*, V_SUB*, V_FIRSTREADLINE_B32, V_READLANE_B32 operations
       162  and any scalar ALU instructions there is 16-cycle delay. Masked if there are more waves than 4*CUs.</li>
       163  <li>any conditional jump that directly checks VCCZ or EXECZ after an instruction that changes
       164  VCC or EXEC adds a single penalty (4 cycles). Masked if there are more waves than 4*CUs.</li>
       165  <li>any conditional jump that directly checks SCC after an instruction that changes SCC,
       166  EXEC, VCC adds a single penalty (4 cycles). Masked if there are more waves than 4*CUs.</li>
 168   167  </ul>
 169   168  <h3>SOP2 Instruction timings</h3>
 170        <p>All SOP2 instructions (S_CBRANCH_G_FORK not checked) takes 4 cycles.</p>
       169  <p>All SOP2 instructions (S_CBRANCH_G_FORK not checked) take 4 cycles.</p>
 171   170  <h3>SOPK Instruction timings</h3>
 172        <p>All SOPK instructions (S_CBRANCH_I_FORK not checked) takes 4 cycles.
 173        S_SETREG_B32 and S_SETREG_IMM32_B32 takes 8 cycles.</p>
       171  <p>All SOPK instructions (S_CBRANCH_I_FORK not checked) take 4 cycles.
       172  S_SETREG_B32 and S_SETREG_IMM32_B32 take 8 cycles.</p>
 174   173  <h3>SOP1 Instruction timings</h3>
 175        <p>The S_*_SAVEEXEC_B64 instructions takes 8 cycles. Other ALU instructions (except
       174  <p>The S_*_SAVEEXEC_B64 instructions take 8 cycles. Other ALU instructions (except
 176   175  S_MOV_REGRD_B32, S_CBRANCH_JOIN, S_RFE_B64) take 4 cycles.</p>
 177   176  <h3>SOPC Instruction timings</h3>
 178        <p>All comparison and bit checking instructions takes 4 cycles.</p>
       177  <p>All comparison and bit checking instructions take 4 cycles.</p>
 179   178  <h3>SOPP Instruction timings</h3>
 180        <p>Jumps costs 4 (no jump) or 20 cycles (???) if jump will performed.</p>
       179  <p>Jumps cost 4 cycle (no jump) or 20 cycles (???) if jump is performed.</p>
 181   180  <h3>SMRD Instruction timings</h3>
 182   181  <p>Timings of SMRD instructions includes only time to fetch and execute instruction without
   …     …
 237   236  </table>
 238   237  <h3>VOP2 Instruction timings</h3>
 239        <p>All VOP2 instructions takes 4 cycles. All instruction can achieve throughput 1 instruction
       238  <p>All VOP2 instructions take 4 cycles. All instruction can achieve throughput 1 instruction
 240   239  per cycle.</p>
 241   240  <h3>VOP1 Instruction timings</h3>
 242        <p>Maximum throughput of these instruction can be calculated by using expression
 243        <code>(1/(CYCLES/4))</code> - for 4 cycles is 1 instruction per cycle, for 8 cycles is 1/2 instruction
 244        per cycle and etc.
       241  <p>Maximum throughput of these instructions can be calculated by using the expression
       242  <code>(1/(CYCLES/4))</code> - for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction
       243  per cycle, etc.
 245   244  Timings of VOP1 instructions are in this table:</p>
 246   245  <table>
   …     …
 455   454  </table>
 456   455  <h3>VOPC Instruction timings</h3>
 457        <p>Maximum throughput of these instruction can be calculated by using expression
 458        <code>(1/(CYCLES/4))</code> - for 4 cycles is 1 instruction per cycle, for 8 cycles is 1/2 instruction
 459        per cycle and etc.
 460        All 32-bit comparison instructions takes 4 cycles. All 64-bit comparison instructions takes
       456  <p>Maximum throughput of these instructions can be calculated by using expression
       457  <code>(1/(CYCLES/4))</code> - for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction
       458  per cycle, etc.
       459  All 32-bit comparison instructions take 4 cycles. All 64-bit comparison instructions take
 461   460  DPFACTOR*4 cycles.</p>
 462   461  <h3>VOP3 Instruction timings</h3>
 463        <p>Maximum throughput of these instruction can be calculated by using expression
 464        <code>(1/(CYCLES/4))</code> - for 4 cycles is 1 instruction per cycle, for 8 cycles is 1/2 instruction
       462  <p>Maximum throughput of these instructions can be calculated by using expression
       463  <code>(1/(CYCLES/4))</code> - for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction
 465   464  per cycle and etc.
 466   465  Timings of VOP3 instructions are in this table:</p>
   …     …
 653   652  <h3>DS Instruction timings</h3>
 654   653  <p>Timings of DS instructions includes only execution without waiting for completing
 655        LDS/GDS memory access on single wavefront. Throughput indicates maximal possible
       654  LDS/GDS memory access on a single wavefront. Throughput indicates maximal possible
 656   655  throughput that excludes any other delays and penalties.
 657   656  Timings of DS instructions are in this table:</p>
   …     …
1367  1366  </tbody>
1368  1367  </table>
1369        <p>About bank conflict: The LDS memory is partitioned by 32 banks. The bank number is in
1370        2-6 bit of the address. Bank conflict encounters when two addresses have this same
1371        bank, but are not equal begins from 7 bit address
1372        (the first 2 bits of addresses doesn't matter).
1373        Any bank conflict adds penalty to timing and throughput. In worst case, the throughput
1374        can be not greater 1/32 request per cycle.</p>
      1368  <p>About bank conflicts: The LDS memory is partitioned in 32 banks. The bank number is in
      1369  bits 2-6 of the address. A bank conflict occurs when two addresses hit the same
      1370  bank, but the addresses are different starting from the 7bit
      1371  (the first 2 bits of the address doesn't matter).
      1372  Any bank conflict adds penalty to timing and throughput. In the worst case, the throughput
      1373  can be not greater 1/32 requests per cycle.</p>
1375  1374  <h3>MUBUF Instruction timings</h3>
1376  1375  <p>Timings of MUBUF instructions includes only execution without waiting for completing
1377        main memory access on single wavefront. Additional GLCX adds X cycles to instruction
1378        if instruction uses GLC modifier. Timings of MUBUF instructions are in this table:</p>
      1376  main memory access on a single wavefront. Additional GLCX adds X cycles to instruction
      1377  if the instruction uses the GLC modifier. Timings of MUBUF instructions are in this table:</p>
1379  1378  <table>
1380  1379  <thead>