GCN ISA Instruction Timings
Preliminary explanations
Almost all instructions, scalar and vector, execute within 4 cycles. Hence, to achieve maximum performance, 4 wavefronts per compute unit must be running.
NOTE: A simple single-dword (4-byte) instruction executes in 4 cycles (thanks to fast dispatching from the cache). However, a 2-dword instruction can require 4 extra cycles to execute because of its larger size in memory and the limits of instruction dispatching. For best performance, we recommend using single-dword instructions.
Some tables use the term DPFACTOR. It indicates that the number of cycles depends on the GPU model, as follows:
DPFACTOR | DP speed | GPU subfamily |
---|---|---|
1 | 1/2 | Professional Hawaii |
2 | 1/4 | High-end Tahiti: Radeon HD7970 |
4 | 1/8 | High-end Hawaii: R9 290 |
8 | 1/16 | Other GPUs |
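As a small illustration of how DPFACTOR entries in the timing tables resolve to concrete cycle counts, the sketch below encodes the table above as a lookup. The dictionary keys and the function names `dp_cycles` and `dp_speed` are hypothetical labels chosen for this example; the relation between the DP speed column and DPFACTOR, 1/(2*DPFACTOR), is read off the four rows of the table.

```python
from fractions import Fraction

# DPFACTOR values copied from the table above. The keys are informal
# labels invented for this sketch, not official product names.
DPFACTOR = {
    "professional_hawaii": 1,
    "highend_tahiti": 2,
    "highend_hawaii": 4,
    "other": 8,
}

def dp_cycles(multiplier, gpu="other"):
    """Cycles for a table entry written as DPFACTOR*multiplier."""
    return DPFACTOR[gpu] * multiplier

def dp_speed(gpu="other"):
    """DP speed column of the table, which equals 1/(2*DPFACTOR)."""
    return Fraction(1, 2 * DPFACTOR[gpu])

if __name__ == "__main__":
    # An entry listed as DPFACTOR*8, evaluated for each subfamily.
    for gpu in DPFACTOR:
        print(gpu, dp_cycles(8, gpu), dp_speed(gpu))
```

For example, an instruction listed as DPFACTOR*4 resolves to 8 cycles with DPFACTOR 2 and to 32 cycles with DPFACTOR 8.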
Occupancy table
Waves | SGPRs | VGPRs | LdsW/I | Issue |
---|---|---|---|---|
1 | 128 | 256 | 64 | 1 |
2 | 128 | 128 | 32 | 2 |
3 | 128 | 84 | 21 | 3 |
4 | 128 | 64 | 16 | 4 |
5 | 96 | 48 | 12 | 5 |
6 | 80 | 40 | 10 | 5 |
7 | 72 | 36 | 9 | 5 |
8 | 64 | 32 | 8 | 5 |
9 | 56 | 28 | 7 | 5 |
10 | 48 | 24 | 6 | 5 |
Waves - number of concurrent waves that a single SIMD unit can execute
SGPRs - maximum number of SGPRs that can be allocated at that occupancy
VGPRs - maximum number of VGPRs that can be allocated at that occupancy
LdsW/I - maximum amount of LDS space per vector lane per wavefront, in dwords
Issue - maximum number of instructions per clock that can be processed
Each compute unit is partitioned into four SIMD units, so the maximum number of waves per compute unit is 40.
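The occupancy table can be read as a lookup from a kernel's register usage to the number of waves that fit. The following is a minimal sketch of that lookup, assuming only the SGPR and VGPR columns matter (the LDS limit is ignored here); the function names are hypothetical.

```python
# Rows copied from the occupancy table: (waves, max SGPRs, max VGPRs).
OCCUPANCY = [
    (1, 128, 256), (2, 128, 128), (3, 128, 84), (4, 128, 64), (5, 96, 48),
    (6, 80, 40), (7, 72, 36), (8, 64, 32), (9, 56, 28), (10, 48, 24),
]
SIMDS_PER_CU = 4  # each compute unit has four SIMD units

def max_waves_per_simd(sgprs, vgprs):
    """Largest wave count whose SGPR and VGPR limits both fit the kernel.

    Returns 0 if the kernel does not fit even at one wave.
    """
    best = 0
    for waves, max_sgprs, max_vgprs in OCCUPANCY:
        if sgprs <= max_sgprs and vgprs <= max_vgprs:
            best = waves
    return best

def max_waves_per_cu(sgprs, vgprs):
    """Waves per compute unit: waves per SIMD times the four SIMD units."""
    return SIMDS_PER_CU * max_waves_per_simd(sgprs, vgprs)

if __name__ == "__main__":
    print(max_waves_per_simd(40, 30), max_waves_per_cu(40, 30))
```

A kernel using 40 SGPRs and 30 VGPRs is limited by its VGPRs to 8 waves per SIMD, that is, 32 waves per compute unit.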
Instruction alignment
Alignment rules for 2-dword instructions (GCN 1.0/1.1):
- any penalty costs 4 cycles
- the program is divided into 32-byte blocks
- only the first 3 places (dwords) in a 32-byte block are free (no penalty). Any 2-dword instruction outside these first 3 dwords adds a single penalty.
- if an instruction is longer (takes more than four cycles), then the last cycles/4 dwords are free
- if a 2-dword instruction of 16 or more cycles and a second 2-dword instruction are in 4 dwords, then there is no penalty for the second 2-dword instruction.
- the best place to jump to is the first 5 dwords in a 32-byte block. A jump to the remaining dwords causes 1-3 penalties, depending on the dword number (N-4, where N is the dword number). This rule does not apply to backward jumps (???)
- any conditional jump instruction should be in the first half of a 32-byte block; otherwise 1-4 penalties are added if the jump is not taken, depending on the dword number (N-3, where N is the dword number).
IMPORTANT: If occupancy is greater than 1 wave per compute unit, then penalties, branches, and scalar instructions are masked when executing more waves than 4*CUs. For best results, run many waves (a multiple of 4*CUs) with occupancy greater than 1.
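The position-based alignment rules can be sketched as small helper functions. This is only an illustration under stated assumptions: dwords within a 32-byte block are numbered from 0 (which makes the N-4 and N-3 formulas give the 1-3 and 1-4 ranges quoted above), a 2-dword instruction is classified by the dword where it starts, and the exceptions for long-running instructions and backward jumps are ignored. All function names are hypothetical.

```python
BLOCK_DWORDS = 8     # a 32-byte block holds eight 4-byte dwords
PENALTY_CYCLES = 4   # "any penalty costs 4 cycles"

def dword_in_block(byte_offset):
    """Zero-based index (0-7) of the dword at byte_offset in its block."""
    return (byte_offset // 4) % BLOCK_DWORDS

def two_dword_penalties(byte_offset):
    """Penalties for a 2-dword instruction starting at byte_offset.

    Assumption: only a start within the first 3 dwords of a block is free.
    """
    return 0 if dword_in_block(byte_offset) < 3 else 1

def jump_target_penalties(byte_offset):
    """Penalties for jumping to byte_offset: first 5 dwords free, then N-4."""
    return max(0, dword_in_block(byte_offset) - 4)

def untaken_cond_jump_penalties(byte_offset):
    """Penalties for a not-taken conditional jump located at byte_offset.

    The first half of the block (dwords 0-3) is free, then N-3.
    """
    n = dword_in_block(byte_offset)
    return 0 if n < 4 else n - 3

if __name__ == "__main__":
    # A jump target at the last dword of a block: 3 penalties, 12 cycles.
    print(jump_target_penalties(28) * PENALTY_CYCLES)
```

With these assumptions, padding a jump target forward to the next 32-byte boundary removes its penalty, at the cost of the extra padding instructions.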
Instruction scheduling
- there is a 16-cycle delay between any integer V_ADD*, V_SUB*, V_READFIRSTLANE_B32, or V_READLANE_B32 operation and any scalar ALU instruction. Masked if there are more waves than 4*CUs
- any conditional jump that checks VCCZ or EXECZ directly after an instruction that changes VCC or EXEC adds a single penalty (4 cycles). Masked if there are more waves than 4*CUs
- any conditional jump that checks SCC directly after an instruction that changes SCC, EXEC, or VCC adds a single penalty (4 cycles). Masked if there are more waves than 4*CUs
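The two conditional-jump rules above lend themselves to a simple scan over an instruction sequence. The sketch below assumes a simplified, hypothetical program representation (a list of pairs of mnemonic and set of written registers) and does not model the 16-cycle rule or the masking by additional waves. It treats the VCCZ/EXECZ rule as applying to both the Z and NZ branch forms, which is an interpretation rather than something the rules state.

```python
PENALTY_CYCLES = 4

# Status flag tested by each conditional branch mnemonic.
BRANCH_READS = {
    "s_cbranch_vccz": "vcc", "s_cbranch_vccnz": "vcc",
    "s_cbranch_execz": "exec", "s_cbranch_execnz": "exec",
    "s_cbranch_scc0": "scc", "s_cbranch_scc1": "scc",
}

# Registers whose change by the directly preceding instruction
# triggers the penalty, per tested flag.
HAZARD_WRITES = {
    "vcc": {"vcc", "exec"},
    "exec": {"vcc", "exec"},
    "scc": {"scc", "exec", "vcc"},
}

def branch_penalty_cycles(program):
    """Sum the 4-cycle penalties for branches placed directly after a
    flag-changing instruction.

    program: list of (mnemonic, written_registers) in issue order.
    """
    total = 0
    for (_, prev_writes), (name, _) in zip(program, program[1:]):
        flag = BRANCH_READS.get(name)
        if flag is not None and set(prev_writes) & HAZARD_WRITES[flag]:
            total += PENALTY_CYCLES
    return total

if __name__ == "__main__":
    tight = [("v_cmp_lt_u32", {"vcc"}), ("s_cbranch_vccz", set())]
    spaced = [("v_cmp_lt_u32", {"vcc"}), ("s_nop", set()),
              ("s_cbranch_vccz", set())]
    print(branch_penalty_cycles(tight), branch_penalty_cycles(spaced))
```

Placing an independent instruction between the comparison and the branch, as in `spaced`, avoids the penalty in this model.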
SOP2 Instruction timings
All SOP2 instructions (S_CBRANCH_G_FORK not checked) take 4 cycles.
SOPK Instruction timings
All SOPK instructions (S_CBRANCH_I_FORK not checked) take 4 cycles, except S_SETREG_B32 and S_SETREG_IMM32_B32, which take 8 cycles.
SOP1 Instruction timings
The S_*_SAVEEXEC_B64 instructions take 8 cycles. Other ALU instructions (except S_MOV_REGRD_B32, S_CBRANCH_JOIN, S_RFE_B64) take 4 cycles.
SOPC Instruction timings
All comparison and bit-checking instructions take 4 cycles.
SOPP Instruction timings
Jumps cost 4 cycles (jump not taken) or 20 cycles (???) if the jump is taken.
SMRD Instruction timings
Timings of SMRD instructions include only the time to fetch and execute the instruction, without loading data from memory. The timings of SMRD instructions are in this table:
Instruction | Cycles | Instruction | Cycles |
---|---|---|---|
S_BUFFER_LOAD_DWORD | 4 | S_LOAD_DWORD | 4 |
S_BUFFER_LOAD_DWORDX2 | 4 | S_LOAD_DWORDX2 | 4 |
S_BUFFER_LOAD_DWORDX4 | 4 | S_LOAD_DWORDX4 | 4 |
S_BUFFER_LOAD_DWORDX8 | 8 | S_LOAD_DWORDX8 | 8 |
S_BUFFER_LOAD_DWORDX16 | 16-24 | S_LOAD_DWORDX16 | 16-24 |
S_DCACHE_INV | 4 | S_MEMTIME | 4 |
S_DCACHE_INV_VOL | 4 |
VOP2 Instruction timings
All VOP2 instructions take 4 cycles. All of them can achieve a throughput of 1 instruction per cycle.
VOP1 Instruction timings
The maximum throughput of these instructions can be calculated using the expression (1/(CYCLES/4)): for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction per cycle, and so on.
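The throughput expression is simple enough to check by hand, but a short helper makes it easy to apply to whole columns of the timing tables. This is a minimal sketch; the function name `max_throughput` is hypothetical.

```python
from fractions import Fraction

def max_throughput(cycles):
    """Maximum instructions per cycle, 1/(CYCLES/4), as an exact fraction."""
    return Fraction(4, cycles)

if __name__ == "__main__":
    for cycles in (4, 8, 16):
        print(cycles, max_throughput(cycles))
```

For a 16-cycle instruction this gives 1/4 instruction per cycle.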
Timings of VOP1 instructions are in this table:
Instruction | Cycles | Instruction | Cycles |
---|---|---|---|
V_BFREV_B32 | 4 | V_FREXP_EXP_I32_F32 | 4 |
V_CEIL_F32 | 4 | V_FREXP_EXP_I32_F64 | DPFACTOR*4 |
V_CEIL_F64 | DPFACTOR*4 | V_FREXP_MANT_F32 | 4 |
V_CLREXCP | 4 | V_FREXP_MANT_F64 | DPFACTOR*4 |
V_COS_F32 | 16 | V_LOG_CLAMP_F32 | 16 |
V_CVT_F16_F32 | 4 | V_LOG_F32 | 16 |
V_CVT_F32_F16 | 4 | V_LOG_LEGACY_F32 | 16 |
V_CVT_F32_F64 | DPFACTOR*4 | V_MOVRELD_B32 | 4 |
V_CVT_F32_I32 | 4 | V_MOVRELSD_B32 | 4 |
V_CVT_F32_U32 | 4 | V_MOVRELS_B32 | 4 |
V_CVT_F32_UBYTE0 | 4 | V_MOV_B32 | 4 |
V_CVT_F32_UBYTE1 | 4 | V_MOV_FED_B32 | 4 |
V_CVT_F32_UBYTE2 | 4 | V_NOP | 4 |
V_CVT_F32_UBYTE3 | 4 | V_NOT_B32 | 4 |
V_CVT_F64_F32 | DPFACTOR*4 | V_RCP_CLAMP_F32 | 16 |
V_CVT_F64_I32 | DPFACTOR*4 | V_RCP_CLAMP_F64 | DPFACTOR*8 |
V_CVT_F64_U32 | DPFACTOR*4 | V_RCP_F32 | 16 |
V_CVT_FLR_I32_F32 | 4 | V_RCP_F64 | DPFACTOR*8 |
V_CVT_I32_F32 | 4 | V_RCP_IFLAG_F32 | 16 |
V_CVT_I32_F64 | DPFACTOR*4 | V_RCP_LEGACY_F32 | 16 |
V_CVT_OFF_F32_I4 | 4 | V_READFIRSTLANE_B32 | 4 |
V_CVT_RPI_I32_F32 | 4 | V_RNDNE_F32 | 4 |
V_CVT_U32_F32 | 4 | V_RNDNE_F64 | DPFACTOR*4 |
V_CVT_U32_F64 | DPFACTOR*4 | V_RSQ_CLAMP_F32 | 16 |
V_EXP_F32 | 16 | V_RSQ_CLAMP_F64 | DPFACTOR*8 |
V_EXP_LEGACY_F32 | 16 | V_RSQ_F32 | 16 |
V_FFBH_I32 | 4 | V_RSQ_F64 | DPFACTOR*8 |
V_FFBH_U32 | 4 | V_RSQ_LEGACY_F32 | 16 |
V_FFBL_B32 | 4 | V_SIN_F32 | 16 |
V_FLOOR_F32 | 4 | V_SQRT_F32 | 16 |
V_FLOOR_F64 | DPFACTOR*4 | V_SQRT_F64 | DPFACTOR*8 |
V_FRACT_F32 | 4 | V_TRUNC_F32 | 4 |
V_FRACT_F64 | DPFACTOR*4 | V_TRUNC_F64 | DPFACTOR*4 |
VOPC Instruction timings
The maximum throughput of these instructions can be calculated using the expression (1/(CYCLES/4)): for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction per cycle, and so on.
All 32-bit comparison instructions take 4 cycles. All 64-bit comparison instructions take DPFACTOR*4 cycles.
VOP3 Instruction timings
The maximum throughput of these instructions can be calculated using the expression (1/(CYCLES/4)): for 4 cycles it is 1 instruction per cycle, for 8 cycles it is 1/2 instruction per cycle, and so on.
Timings of VOP3 instructions are in this table:
Instruction | Cycles | Instruction | Cycles |
---|---|---|---|
V_ADD_F64 | DPFACTOR*4 | V_MAD_U64_U32 | 16 |
V_ALIGNBIT_B32 | 4 | V_MAX3_F32 | 4 |
V_ALIGNBYTE_B32 | 4 | V_MAX3_I32 | 4 |
V_ASHR_I64 | DPFACTOR*4 | V_MAX3_U32 | 4 |
V_BFE_I32 | 4 | V_MAX_F64 | DPFACTOR*4 |
V_BFE_U32 | 4 | V_MED3_F32 | 4 |
V_BFI_B32 | 4 | V_MED3_I32 | 4 |
V_CUBEID_F32 | 4 | V_MED3_U32 | 4 |
V_CUBEMA_F32 | 4 | V_MIN3_F32 | 4 |
V_CUBESC_F32 | 4 | V_MIN3_I32 | 4 |
V_CUBETC_F32 | 4 | V_MIN3_U32 | 4 |
V_CVT_PK_U8_F32 | 4 | V_MIN_F64 | DPFACTOR*4 |
V_DIV_FIXUP_F32 | 16 | V_MQSAD_PK_U16_U8 | 16 |
V_DIV_FIXUP_F64 | DPFACTOR*4 | V_MQSAD_U32_U8 | 16 |
V_DIV_FMAS_F32 | 16 | V_MQSAD_U8 | 16 |
V_DIV_FMAS_F64 | DPFACTOR*8 | V_MSAD_U8 | 4 |
V_DIV_SCALE_F32 | 16 | V_MULLIT_F32 | 4 |
V_DIV_SCALE_F64 | DPFACTOR*4 | V_MUL_F64 | DPFACTOR*8 |
V_FMA_F32 | 16 | V_MUL_HI_I32 | 16 |
V_FMA_F64 | DPFACTOR*8 | V_MUL_HI_U32 | 16 |
V_LDEXP_F64 | DPFACTOR*4 | V_MUL_LO_I32 | 16 |
V_LERP_U8 | 4 | V_MUL_LO_U32 | 16 |
V_LSHL_B64 | DPFACTOR*4 | V_QSAD_PK_U16_U8 | 16 |
V_LSHR_B64 | DPFACTOR*4 | V_QSAD_U8 | 16 |
V_MAD_F32 | 4 | V_SAD_HI_U8 | 4 |
V_MAD_I32_I24 | 4 | V_SAD_U16 | 4 |
V_MAD_I64_I32 | 16 | V_SAD_U32 | 4 |
V_MAD_LEGACY_F32 | 4 | V_SAD_U8 | 4 |
V_MAD_U32_U24 | 4 | V_TRIG_PREOP_F64 | DPFACTOR*8 |
DS Instruction timings
Timings of DS instructions include only execution, without waiting for the LDS/GDS memory access to complete, on a single wavefront. The timings of DS instructions are in this table:
Instruction | Cycles | Throughput |
---|---|---|
DS_ADD_RTN_U32 | 8 | 1/4 |
DS_ADD_RTN_U64 | 12 | 1/6 |
DS_ADD_SRC2_U32 | 4 | 1/4 |
DS_ADD_SRC2_U64 | 8 | 1/8 |
DS_ADD_U32 | 8 | 1/4 |
DS_ADD_U64 | 12 | 1/6 |
DS_AND_B32 | 8 | 1/4 |
DS_AND_B64 | 12 | 1/6 |
DS_AND_RTN_B32 | 8 | 1/4 |
DS_AND_RTN_B64 | 12 | 1/6 |
DS_AND_SRC2_B32 | 4 | 1/4 |
DS_AND_SRC2_B64 | 8 | 1/8 |
DS_APPEND | 4 | ? |
DS_CMPST_B32 | 12 | 1/6 |
DS_CMPST_B64 | 20 | 1/10 |
DS_CMPST_F32 | 12 | 1/6 |
DS_CMPST_F64 | 20 | 1/10 |
DS_CMPST_RTN_B32 | 12 | 1/6 |
DS_CMPST_RTN_B64 | 20 | 1/10 |
DS_CMPST_RTN_F32 | 12 | 1/6 |
DS_CMPST_RTN_F64 | 20 | 1/10 |
DS_CONDXCHG32_RTN_B128 | ? | ? |
DS_CONDXCHG32_RTN_B64 | ? | ? |
DS_CONSUME | 4 | ? |
DS_DEC_RTN_U32 | 8 | 1/4 |
DS_DEC_RTN_U64 | 12 | 1/6 |
DS_DEC_SRC2_U32 | 4 | 1/4 |
DS_DEC_SRC2_U64 | 8 | 1/8 |
DS_DEC_U32 | 8 | 1/4 |
DS_DEC_U64 | 12 | 1/6 |
DS_GWS_BARRIER | ? | ? |
DS_GWS_INIT | ? | ? |
DS_GWS_SEMA_BR | ? | ? |
DS_GWS_SEMA_P | ? | ? |
DS_GWS_SEMA_RELEASE_ALL | ? | ? |
DS_GWS_SEMA_V | ? | ? |
DS_INC_RTN_U32 | 8 | 1/4 |
DS_INC_RTN_U64 | 12 | 1/6 |
DS_INC_SRC2_U32 | 4 | 1/4 |
DS_INC_SRC2_U64 | 8 | 1/8 |
DS_INC_U32 | 8 | 1/4 |
DS_INC_U64 | 12 | 1/6 |
DS_MAX_F32 | 8 | 1/4 |
DS_MAX_F64 | 12 | 1/6 |
DS_MAX_I32 | 8 | 1/4 |
DS_MAX_I64 | 12 | 1/6 |
DS_MAX_RTN_F32 | 8 | 1/4 |
DS_MAX_RTN_F64 | 12 | 1/6 |
DS_MAX_RTN_I32 | 8 | 1/4 |
DS_MAX_RTN_I64 | 12 | 1/6 |
DS_MAX_RTN_U32 | 8 | 1/4 |
DS_MAX_RTN_U64 | 12 | 1/6 |
DS_MAX_SRC2_F32 | 4 | 1/4 |
DS_MAX_SRC2_F64 | 8 | 1/8 |
DS_MAX_SRC2_I32 | 4 | 1/4 |
DS_MAX_SRC2_I64 | 8 | 1/8 |
DS_MAX_SRC2_U32 | 4 | 1/4 |
DS_MAX_SRC2_U64 | 8 | 1/8 |
DS_MAX_U32 | 8 | 1/4 |
DS_MAX_U64 | 12 | 1/6 |
DS_MIN_F32 | 8 | 1/4 |
DS_MIN_F64 | 12 | 1/6 |
DS_MIN_I32 | 8 | 1/4 |
DS_MIN_I64 | 12 | 1/6 |
DS_MIN_RTN_F32 | 8 | 1/4 |
DS_MIN_RTN_F64 | 12 | 1/6 |
DS_MIN_RTN_I32 | 8 | 1/4 |
DS_MIN_RTN_I64 | 12 | 1/6 |
DS_MIN_RTN_U32 | 8 | 1/4 |
DS_MIN_RTN_U64 | 12 | 1/6 |
DS_MIN_SRC2_F32 | 4 | 1/4 |
DS_MIN_SRC2_F64 | 8 | 1/8 |
DS_MIN_SRC2_I32 | 4 | 1/4 |
DS_MIN_SRC2_I64 | 8 | 1/8 |
DS_MIN_SRC2_U32 | 4 | 1/4 |
DS_MIN_SRC2_U64 | 8 | 1/8 |
DS_MIN_U32 | 8 | 1/4 |
DS_MIN_U64 | 12 | 1/6 |
DS_MSKOR_B32 | 12 | 1/6 |
DS_MSKOR_B64 | 20 | 1/10 |
DS_MSKOR_RTN_B32 | 12 | 1/6 |
DS_MSKOR_RTN_B64 | 20 | 1/10 |
DS_NOP | 4 | ? |
DS_ORDERED_COUNT (???) | ? | ? |
DS_OR_B32 | 8 | 1/4 |
DS_OR_B64 | 12 | 1/6 |
DS_OR_RTN_B32 | 8 | 1/4 |
DS_OR_RTN_B64 | 12 | 1/6 |
DS_OR_SRC2_B32 | 4 | 1/4 |
DS_OR_SRC2_B64 | 8 | 1/8 |
DS_READ2ST64_B32 | 8 | 1/4 |
DS_READ2ST64_B64 | 16 | 1/8 |
DS_READ2_B32 | 8 | 1/4 |
DS_READ2_B64 | 16 | 1/8 |
DS_READ_B128 | 16 | 1/8 |
DS_READ_B32 | 4 | 1/2 |
DS_READ_B64 | 8 | 1/4 |
DS_READ_B96 | 16 | 1/8 |
DS_READ_I16 | 4 | 1/2 |
DS_READ_I8 | 4 | 1/2 |
DS_READ_U16 | 4 | 1/2 |
DS_READ_U8 | 4 | 1/2 |
DS_RSUB_RTN_U32 | 8 | 1/4 |
DS_RSUB_RTN_U64 | 12 | 1/6 |
DS_RSUB_SRC2_U32 | 4 | 1/4 |
DS_RSUB_SRC2_U64 | 8 | 1/8 |
DS_RSUB_U32 | 8 | 1/4 |
DS_RSUB_U64 | 12 | 1/6 |
DS_SUB_RTN_U32 | 8 | 1/4 |
DS_SUB_RTN_U64 | 12 | 1/6 |
DS_SUB_SRC2_U32 | 4 | 1/4 |
DS_SUB_SRC2_U64 | 8 | 1/8 |
DS_SUB_U32 | 8 | 1/4 |
DS_SUB_U64 | 12 | 1/6 |
DS_SWIZZLE_B32 | 4 | 1/2 |
DS_WRAP_RTN_B32 | ? | ? |
DS_WRITE2ST64_B32 | 12 | 1/6 |
DS_WRITE2ST64_B64 | 20 | 1/10 |
DS_WRITE2_B32 | 12 | 1/6 |
DS_WRITE2_B64 | 20 | 1/10 |
DS_WRITE_B128 | 20 | 1/10 |
DS_WRITE_B16 | 8 | 1/4 |
DS_WRITE_B32 | 8 | 1/4 |
DS_WRITE_B64 | 12 | 1/8 |
DS_WRITE_B8 | 8 | 1/4 |
DS_WRITE_B96 | 16 | 1/10 |
DS_WRITE_SRC2_B32 | 12 | 1/4 |
DS_WRITE_SRC2_B64 | 20 | 1/8 |
DS_WRXCHG2ST64_RTN_B32 | 12 | 1/6 |
DS_WRXCHG2ST64_RTN_B64 | 20 | 1/12 |
DS_WRXCHG2_RTN_B32 | 12 | 1/6 |
DS_WRXCHG2_RTN_B64 | 20 | 1/12 |
DS_WRXCHG_RTN_B32 | 8 | 1/4 |
DS_WRXCHG_RTN_B64 | 12 | 1/6 |
DS_XOR_B32 | 8 | 1/4 |
DS_XOR_B64 | 12 | 1/6 |
DS_XOR_RTN_B32 | 8 | 1/4 |
DS_XOR_RTN_B64 | 12 | 1/6 |
DS_XOR_SRC2_B32 | 4 | 1/4 |
DS_XOR_SRC2_B64 | 8 | 1/8 |