GcnTimings – CLRadeonExtender

Version 19 (modified by trac, 8 years ago)

GCN ISA Instruction Timings

Preliminary explanations

Almost all instructions, scalar and vector, execute in 4 cycles. Hence, to achieve maximum performance, 4 wavefronts per compute unit must be running.

NOTE: A simple single-dword (4-byte) instruction executes in 4 cycles (thanks to fast dispatching from cache). However, a 2-dword instruction can require 4 extra cycles to execute because of its larger size in memory and the limits of instruction dispatching. For best performance, we recommend using single-dword instructions.

Some tables use the term DPFACTOR. It indicates that the number of cycles depends on the GPU model, as follows:

DPFACTOR	DP speed	GPU subfamily
1	1/2	Professional Hawaii
2	1/4	High-end Tahiti: Radeon HD7970
4	1/8	High-end Hawaii: R9 290
8	1/16	Other GPUs
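
As a rough illustration of how the DPFACTOR*N entries in the timing tables can be evaluated, here is a minimal Python sketch. The dictionary keys and the function name `dp_cycles` are illustrative names chosen for this example, not part of CLRadeonExtender; the factor values are copied from the table above.

```python
# DPFACTOR values copied from the table above; the key names are illustrative.
DPFACTOR = {
    "professional_hawaii": 1,  # DP speed 1/2
    "highend_tahiti": 2,       # DP speed 1/4 (Radeon HD7970)
    "highend_hawaii": 4,       # DP speed 1/8 (R9 290)
    "other": 8,                # DP speed 1/16
}


def dp_cycles(multiplier, gpu):
    """Evaluate a table entry of the form DPFACTOR*multiplier for a GPU subfamily."""
    return DPFACTOR[gpu] * multiplier


# V_RCP_F64 is listed as DPFACTOR*8, V_ADD_F64 as DPFACTOR*4.
print(dp_cycles(8, "highend_tahiti"))  # 16
print(dp_cycles(4, "other"))           # 32
```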

Occupancy table

Waves	SGPRs	VGPRs	LdsW/I	Issue
1	128	256	64	1
2	128	128	32	2
3	128	84	21	3
4	128	64	16	4
5	96	48	12	5
6	80	40	10	5
7	72	36	9	5
8	64	32	8	5
9	56	28	7	5
10	48	24	6	5

Waves - number of concurrent waves that can be computed by a single SIMD unit
SGPRs - maximum number of SGPRs that can be allocated at that occupancy
VGPRs - maximum number of VGPRs that can be allocated at that occupancy
LdsW/I - maximum amount of LDS space per vector lane per wavefront, in dwords
Issue - maximum number of instructions per clock that can be processed

Each compute unit is partitioned into four SIMD units, so the maximum number of waves per compute unit is 40.
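
The occupancy table can be read as a lookup from register usage to wave count. The following Python sketch shows one way to do that lookup. It only considers the SGPR and VGPR columns and ignores LDS usage, and the names `OCCUPANCY` and `max_waves_per_simd` are hypothetical helpers introduced here for illustration.

```python
# Rows copied from the occupancy table above: waves -> (max SGPRs, max VGPRs).
OCCUPANCY = {
    1: (128, 256), 2: (128, 128), 3: (128, 84), 4: (128, 64), 5: (96, 48),
    6: (80, 40), 7: (72, 36), 8: (64, 32), 9: (56, 28), 10: (48, 24),
}
SIMDS_PER_CU = 4  # each compute unit is partitioned into four SIMD units


def max_waves_per_simd(sgprs, vgprs):
    """Return the highest wave count whose register limits fit the kernel, or 0."""
    best = 0
    for waves, (max_sgprs, max_vgprs) in OCCUPANCY.items():
        if sgprs <= max_sgprs and vgprs <= max_vgprs:
            best = max(best, waves)
    return best


# A kernel using 60 SGPRs and 40 VGPRs (made-up numbers for the example).
waves = max_waves_per_simd(sgprs=60, vgprs=40)
print(waves, waves * SIMDS_PER_CU)  # 6 24
```

In this example the VGPR count is the limiting resource: 40 VGPRs fit the 6-wave row but not the 7-wave row.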

Instruction alignment

Alignment rules for 2-dword instructions (GCN 1.0/1.1):

any penalty costs 4 cycles
the program is divided into 32-byte blocks
only the first 3 places (dwords) in a 32-byte block are free (no penalty). Any 2-dword instruction outside these first 3 dwords adds a single penalty.
if an instruction takes longer than 4 cycles, then the last cycles/4 dwords are free
if a 2-dword instruction taking 16 or more cycles and another 2-dword instruction lie within 4 dwords, then there is no penalty for the second 2-dword instruction.
the best jump targets are the first 5 dwords of a 32-byte block. A jump to any of the remaining dwords causes 1-3 penalties, depending on the dword number (N-4, where N is the dword number). This rule does not apply to backward jumps (???)
any conditional jump instruction should be placed in the first half of a 32-byte block; otherwise 1-4 penalties are added if the jump is not taken, depending on the dword number (N-3, where N is the dword number).
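
The two jump-related rules above can be sketched in Python as follows. This is only one reading of the rules: it assumes that dwords within a 32-byte block are numbered from 0 (so that N-4 gives 1-3 and N-3 gives 1-4), and the function names are hypothetical. The uncertainty about backward jumps noted above is not modelled.

```python
PENALTY_CYCLES = 4  # any penalty costs 4 cycles
BLOCK_BYTES = 32
DWORD_BYTES = 4


def dword_in_block(byte_offset):
    """0-based dword index of a code offset within its 32-byte block (assumed numbering)."""
    return (byte_offset % BLOCK_BYTES) // DWORD_BYTES


def jump_target_penalties(target_offset):
    """Penalties for jumping to target_offset: N-4 beyond the first 5 dwords."""
    return max(0, dword_in_block(target_offset) - 4)


def untaken_cond_jump_penalties(jump_offset):
    """Penalties for a not-taken conditional jump outside the first half: N-3."""
    return max(0, dword_in_block(jump_offset) - 3)


# Jump target at byte offset 0x5C is dword 7 of its block: 3 penalties.
print(jump_target_penalties(0x5C) * PENALTY_CYCLES)        # 12
# Conditional jump placed at byte offset 0x30 is dword 4: 1 penalty if not taken.
print(untaken_cond_jump_penalties(0x30) * PENALTY_CYCLES)  # 4
```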

IMPORTANT: If occupancy is greater than 1 wave per compute unit, then penalties, branches, and scalar instructions are masked when more waves than 4*CUs are executing. For best results, it is recommended to execute many waves (a multiple of 4*CUs) with occupancy greater than 1.
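
To pick a wave count that is a multiple of 4*CUs, a launch size can simply be rounded up. The helper below is a small Python sketch of that arithmetic; the name `round_up_waves` and the CU count used in the example are assumptions made for illustration.

```python
def round_up_waves(waves, compute_units):
    """Round a wave count up to the next multiple of 4*CUs."""
    step = 4 * compute_units
    return -(-waves // step) * step  # ceiling division, then scale back


# Example with a hypothetical device that has 32 compute units (step is 128).
print(round_up_waves(1000, 32))  # 1024
```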

Instruction scheduling

if many wavefronts are executed on a single CU, then the scalar, vector, data-share, and memory (???) execution units can run independently (in parallel), achieving many instructions per cycle.
there is a 16-cycle delay between any integer V_ADD*, V_SUB*, V_READFIRSTLANE_B32, or V_READLANE_B32 operation and any scalar ALU instruction. Masked if there are more waves than 4*CUs.
any conditional jump that checks VCCZ or EXECZ directly after an instruction that changes VCC or EXEC adds a single penalty (4 cycles). Masked if there are more waves than 4*CUs.
any conditional jump that checks SCC directly after an instruction that changes SCC, EXEC, or VCC adds a single penalty (4 cycles). Masked if there are more waves than 4*CUs.

SOP2 Instruction timings

All SOP2 instructions (S_CBRANCH_G_FORK not checked) take 4 cycles.

SOPK Instruction timings

All SOPK instructions (S_CBRANCH_I_FORK not checked) take 4 cycles, except S_SETREG_B32 and S_SETREG_IMM32_B32, which take 8 cycles.

SOP1 Instruction timings

The S_*_SAVEEXEC_B64 instructions take 8 cycles. Other ALU instructions (except S_MOV_REGRD_B32, S_CBRANCH_JOIN, S_RFE_B64) take 4 cycles.

SOPC Instruction timings

All comparison and bit-checking instructions take 4 cycles.

SOPP Instruction timings

Jumps cost 4 cycles if the jump is not taken, or 20 cycles (???) if the jump is taken.

SMRD Instruction timings

Timings of SMRD instructions include only the time to fetch and execute the instruction, without loading data from memory. Timings of SMRD instructions are in this table:

Instruction	Cycles	Instruction	Cycles
S_BUFFER_LOAD_DWORD	4	S_LOAD_DWORD	4
S_BUFFER_LOAD_DWORDX2	4	S_LOAD_DWORDX2	4
S_BUFFER_LOAD_DWORDX4	4	S_LOAD_DWORDX4	4
S_BUFFER_LOAD_DWORDX8	8	S_LOAD_DWORDX8	8
S_BUFFER_LOAD_DWORDX16	16-24	S_LOAD_DWORDX16	16-24
S_DCACHE_INV	4	S_MEMTIME	4
S_DCACHE_INV_VOL	4

VOP2 Instruction timings

All VOP2 instructions take 4 cycles. Each of them can achieve a throughput of 1 instruction per cycle.

VOP1 Instruction timings

The maximum throughput of these instructions can be calculated with the expression (1/(CYCLES/4)): 4 cycles gives 1 instruction per cycle, 8 cycles gives 1/2 instruction per cycle, and so on. Timings of VOP1 instructions are in this table:

Instruction	Cycles	Instruction	Cycles
V_BFREV_B32	4	V_FREXP_EXP_I32_F32	4
V_CEIL_F32	4	V_FREXP_EXP_I32_F64	DPFACTOR*4
V_CEIL_F64	DPFACTOR*4	V_FREXP_MANT_F32	4
V_CLREXCP	4	V_FREXP_MANT_F64	DPFACTOR*4
V_COS_F32	16	V_LOG_CLAMP_F32	16
V_CVT_F16_F32	4	V_LOG_F32	16
V_CVT_F32_F16	4	V_LOG_LEGACY_F32	16
V_CVT_F32_F64	DPFACTOR*4	V_MOVRELD_B32	4
V_CVT_F32_I32	4	V_MOVRELSD_B32	4
V_CVT_F32_U32	4	V_MOVRELS_B32	4
V_CVT_F32_UBYTE0	4	V_MOV_B32	4
V_CVT_F32_UBYTE1	4	V_MOV_FED_B32	4
V_CVT_F32_UBYTE2	4	V_NOP	4
V_CVT_F32_UBYTE3	4	V_NOT_B32	4
V_CVT_F64_F32	DPFACTOR*4	V_RCP_CLAMP_F32	16
V_CVT_F64_I32	DPFACTOR*4	V_RCP_CLAMP_F64	DPFACTOR*8
V_CVT_F64_U32	DPFACTOR*4	V_RCP_F32	16
V_CVT_FLR_I32_F32	4	V_RCP_F64	DPFACTOR*8
V_CVT_I32_F32	4	V_RCP_IFLAG_F32	16
V_CVT_I32_F64	DPFACTOR*4	V_RCP_LEGACY_F32	16
V_CVT_OFF_F32_I4	4	V_READFIRSTLANE_B32	4
V_CVT_RPI_I32_F32	4	V_RNDNE_F32	4
V_CVT_U32_F32	4	V_RNDNE_F64	DPFACTOR*4
V_CVT_U32_F64	DPFACTOR*4	V_RSQ_CLAMP_F32	16
V_EXP_F32	16	V_RSQ_CLAMP_F64	DPFACTOR*8
V_EXP_LEGACY_F32	16	V_RSQ_F32	16
V_FFBH_I32	4	V_RSQ_F64	DPFACTOR*8
V_FFBH_U32	4	V_RSQ_LEGACY_F32	16
V_FFBL_B32	4	V_SIN_F32	16
V_FLOOR_F32	4	V_SQRT_F32	16
V_FLOOR_F64	DPFACTOR*4	V_SQRT_F64	DPFACTOR*8
V_FRACT_F32	4	V_TRUNC_F32	4
V_FRACT_F64	DPFACTOR*4	V_TRUNC_F64	DPFACTOR*4
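
The throughput expression 1/(CYCLES/4) given before the table can be evaluated exactly with rational arithmetic. The Python sketch below does that; the function name `throughput` is illustrative only.

```python
from fractions import Fraction


def throughput(cycles):
    """Maximum throughput in instructions per cycle, per the expression 1/(CYCLES/4)."""
    return Fraction(4, cycles)


print(throughput(4))   # 1
print(throughput(8))   # 1/2
print(throughput(16))  # 1/4
# V_RCP_F64 is listed as DPFACTOR*8; with DPFACTOR 8 that is 64 cycles.
print(throughput(8 * 8))  # 1/16
```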

VOPC Instruction timings

VOP3 Instruction timings

Instruction	Cycles	Instruction	Cycles
V_ADD_F64	DPFACTOR*4	V_MAD_U64_U32	16
V_ALIGNBIT_B32	4	V_MAX3_F32	4
V_ALIGNBYTE_B32	4	V_MAX3_I32	4
V_ASHR_I64	DPFACTOR*4	V_MAX3_U32	4
V_BFE_I32	4	V_MAX_F64	DPFACTOR*4
V_BFE_U32	4	V_MED3_F32	4
V_BFI_B32	4	V_MED3_I32	4
V_CUBEID_F32	4	V_MED3_U32	4
V_CUBEMA_F32	4	V_MIN3_F32	4
V_CUBESC_F32	4	V_MIN3_I32	4
V_CUBETC_F32	4	V_MIN3_U32	4
V_CVT_PK_U8_F32	4	V_MIN_F64	DPFACTOR*4
V_DIV_FIXUP_F32	16	V_MQSAD_PK_U16_U8	16
V_DIV_FIXUP_F64	DPFACTOR*4	V_MQSAD_U32_U8	16
V_DIV_FMAS_F32	16	V_MQSAD_U8	16
V_DIV_FMAS_F64	DPFACTOR*8	V_MSAD_U8	4
V_DIV_SCALE_F32	16	V_MULLIT_F32	4
V_DIV_SCALE_F64	DPFACTOR*4	V_MUL_F64	DPFACTOR*8
V_FMA_F32	16	V_MUL_HI_I32	16
V_FMA_F64	DPFACTOR*8	V_MUL_HI_U32	16
V_LDEXP_F64	DPFACTOR*4	V_MUL_LO_I32	16
V_LERP_U8	4	V_MUL_LO_U32	16
V_LSHL_B64	DPFACTOR*4	V_QSAD_PK_U16_U8	16
V_LSHR_B64	DPFACTOR*4	V_QSAD_U8	16
V_MAD_F32	4	V_SAD_HI_U8	4
V_MAD_I32_I24	4	V_SAD_U16	4
V_MAD_I64_I32	16	V_SAD_U32	4
V_MAD_LEGACY_F32	4	V_SAD_U8	4
V_MAD_U32_U24	4	V_TRIG_PREOP_F64	DPFACTOR*8

DS Instruction timings

Timings of DS instructions include only execution on a single wavefront, without waiting for the LDS/GDS memory access to complete. Throughput indicates the maximum possible throughput, excluding any other delays and penalties. Timings of DS instructions are in this table:

Instruction	Cycles	Throughput
DS_ADD_RTN_U32	8	1/4
DS_ADD_RTN_U64	12	1/6
DS_ADD_SRC2_U32	4	1/4
DS_ADD_SRC2_U64	8	1/8
DS_ADD_U32	8	1/4
DS_ADD_U64	12	1/6
DS_AND_B32	8	1/4
DS_AND_B64	12	1/6
DS_AND_RTN_B32	8	1/4
DS_AND_RTN_B64	12	1/6
DS_AND_SRC2_B32	4	1/4
DS_AND_SRC2_B64	8	1/8
DS_APPEND	4	?
DS_CMPST_B32	12	1/6
DS_CMPST_B64	20	1/10
DS_CMPST_F32	12	1/6
DS_CMPST_F64	20	1/10
DS_CMPST_RTN_B32	12	1/6
DS_CMPST_RTN_B64	20	1/10
DS_CMPST_RTN_F32	12	1/6
DS_CMPST_RTN_F64	20	1/10
DS_CONDXCHG32_RTN_B128	?	?
DS_CONDXCHG32_RTN_B64	?	?
DS_CONSUME	4	?
DS_DEC_RTN_U32	8	1/4
DS_DEC_RTN_U64	12	1/6
DS_DEC_SRC2_U32	4	1/4
DS_DEC_SRC2_U64	8	1/8
DS_DEC_U32	8	1/4
DS_DEC_U64	12	1/6
DS_GWS_BARRIER	?	?
DS_GWS_INIT	?	?
DS_GWS_SEMA_BR	?	?
DS_GWS_SEMA_P	?	?
DS_GWS_SEMA_RELEASE_ALL	?	?
DS_GWS_SEMA_V	?	?
DS_INC_RTN_U32	8	1/4
DS_INC_RTN_U64	12	1/6
DS_INC_SRC2_U32	4	1/4
DS_INC_SRC2_U64	8	1/8
DS_INC_U32	8	1/4
DS_INC_U64	12	1/6
DS_MAX_F32	8	1/4
DS_MAX_F64	12	1/6
DS_MAX_I32	8	1/4
DS_MAX_I64	12	1/6
DS_MAX_RTN_F32	8	1/4
DS_MAX_RTN_F64	12	1/6
DS_MAX_RTN_I32	8	1/4
DS_MAX_RTN_I64	12	1/6
DS_MAX_RTN_U32	8	1/4
DS_MAX_RTN_U64	12	1/6
DS_MAX_SRC2_F32	4	1/4
DS_MAX_SRC2_F64	8	1/8
DS_MAX_SRC2_I32	4	1/4
DS_MAX_SRC2_I64	8	1/8
DS_MAX_SRC2_U32	4	1/4
DS_MAX_SRC2_U64	8	1/8
DS_MAX_U32	8	1/4
DS_MAX_U64	12	1/6
DS_MIN_F32	8	1/4
DS_MIN_F64	12	1/6
DS_MIN_I32	8	1/4
DS_MIN_I64	12	1/6
DS_MIN_RTN_F32	8	1/4
DS_MIN_RTN_F64	12	1/6
DS_MIN_RTN_I32	8	1/4
DS_MIN_RTN_I64	12	1/6
DS_MIN_RTN_U32	8	1/4
DS_MIN_RTN_U64	12	1/6
DS_MIN_SRC2_F32	4	1/4
DS_MIN_SRC2_F64	8	1/8
DS_MIN_SRC2_I32	4	1/4
DS_MIN_SRC2_I64	8	1/8
DS_MIN_SRC2_U32	4	1/4
DS_MIN_SRC2_U64	8	1/8
DS_MIN_U32	8	1/4
DS_MIN_U64	12	1/6
DS_MSKOR_B32	12	1/6
DS_MSKOR_B64	20	1/10
DS_MSKOR_RTN_B32	12	1/6
DS_MSKOR_RTN_B64	20	1/10
DS_NOP	4	?
DS_ORDERED_COUNT (???)	?	?
DS_OR_B32	8	1/4
DS_OR_B64	12	1/6
DS_OR_RTN_B32	8	1/4
DS_OR_RTN_B64	12	1/6
DS_OR_SRC2_B32	4	1/4
DS_OR_SRC2_B64	8	1/8
DS_READ2ST64_B32	8	1/4
DS_READ2ST64_B64	16	1/8
DS_READ2_B32	8	1/4
DS_READ2_B64	16	1/8
DS_READ_B128	16	1/8
DS_READ_B32	4	1/2
DS_READ_B64	8	1/4
DS_READ_B96	16	1/8
DS_READ_I16	4	1/2
DS_READ_I8	4	1/2
DS_READ_U16	4	1/2
DS_READ_U8	4	1/2
DS_RSUB_RTN_U32	8	1/4
DS_RSUB_RTN_U64	12	1/6
DS_RSUB_SRC2_U32	4	1/4
DS_RSUB_SRC2_U64	8	1/8
DS_RSUB_U32	8	1/4
DS_RSUB_U64	12	1/6
DS_SUB_RTN_U32	8	1/4
DS_SUB_RTN_U64	12	1/6
DS_SUB_SRC2_U32	4	1/4
DS_SUB_SRC2_U64	8	1/8
DS_SUB_U32	8	1/4
DS_SUB_U64	12	1/6
DS_SWIZZLE_B32	4	1/2
DS_WRAP_RTN_B32	?	?
DS_WRITE2ST64_B32	12	1/6
DS_WRITE2ST64_B64	20	1/10
DS_WRITE2_B32	12	1/6
DS_WRITE2_B64	20	1/10
DS_WRITE_B128	20	1/10
DS_WRITE_B16	8	1/4
DS_WRITE_B32	8	1/4
DS_WRITE_B64	12	1/8
DS_WRITE_B8	8	1/4
DS_WRITE_B96	16	1/10
DS_WRITE_SRC2_B32	12	1/4
DS_WRITE_SRC2_B64	20	1/8
DS_WRXCHG2ST64_RTN_B32	12	1/6
DS_WRXCHG2ST64_RTN_B64	20	1/12
DS_WRXCHG2_RTN_B32	12	1/6
DS_WRXCHG2_RTN_B64	20	1/12
DS_WRXCHG_RTN_B32	8	1/4
DS_WRXCHG_RTN_B64	12	1/6
DS_XOR_B32	8	1/4
DS_XOR_B64	12	1/6
DS_XOR_RTN_B32	8	1/4
DS_XOR_RTN_B64	12	1/6
DS_XOR_SRC2_B32	4	1/4
DS_XOR_SRC2_B64	8	1/8

MUBUF Instruction timings

Timings of MUBUF instructions include only execution on a single wavefront, without waiting for the main memory access to complete. An entry of the form +GLCX means that X extra cycles are added if the instruction uses the GLC modifier. Timings of MUBUF instructions are in this table:

Instruction	Cycles
BUFFER_ATOMIC_ADD	16+GLC1
BUFFER_ATOMIC_ADD_X2	16+GLC2
BUFFER_ATOMIC_AND	16+GLC1
BUFFER_ATOMIC_AND_X2	16
BUFFER_ATOMIC_CMPSWAP	32
BUFFER_ATOMIC_CMPSWAP_X2	32
BUFFER_ATOMIC_DEC	16+GLC1
BUFFER_ATOMIC_DEC_X2	16+GLC2
BUFFER_ATOMIC_FCMPSWAP	32
BUFFER_ATOMIC_FCMPSWAP_X2	32
BUFFER_ATOMIC_FMAX	16+GLC1
BUFFER_ATOMIC_FMAX_X2	16+GLC2
BUFFER_ATOMIC_FMIN	16+GLC1
BUFFER_ATOMIC_FMIN_X2	16+GLC2
BUFFER_ATOMIC_INC	16+GLC1
BUFFER_ATOMIC_INC_X2	16+GLC2
BUFFER_ATOMIC_OR	16+GLC1
BUFFER_ATOMIC_OR_X2	16+GLC2
BUFFER_ATOMIC_RSUB	16+GLC1
BUFFER_ATOMIC_RSUB_X2	16+GLC2
BUFFER_ATOMIC_SMAX	16+GLC1
BUFFER_ATOMIC_SMAX_X2	16+GLC2
BUFFER_ATOMIC_SMIN	16+GLC1
BUFFER_ATOMIC_SMIN_X2	16+GLC2
BUFFER_ATOMIC_SUB	16+GLC1
BUFFER_ATOMIC_SUB_X2	16+GLC2
BUFFER_ATOMIC_SWAP	16+GLC1
BUFFER_ATOMIC_SWAP_X2	16+GLC2
BUFFER_ATOMIC_UMAX	16+GLC1
BUFFER_ATOMIC_UMAX_X2	16+GLC2
BUFFER_ATOMIC_UMIN	16+GLC1
BUFFER_ATOMIC_UMIN_X2	16+GLC2
BUFFER_ATOMIC_XOR	16+GLC1
BUFFER_ATOMIC_XOR_X2	16+GLC2
BUFFER_LOAD_DWORD	8
BUFFER_LOAD_DWORDX2	18
BUFFER_LOAD_DWORDX3	16
BUFFER_LOAD_DWORDX4	16
BUFFER_LOAD_FORMAT_X	8
BUFFER_LOAD_FORMAT_XY	18?
BUFFER_LOAD_FORMAT_XYZ	16
BUFFER_LOAD_FORMAT_XYZW	16
BUFFER_LOAD_SBYTE	8
BUFFER_LOAD_SSHORT	8
BUFFER_LOAD_UBYTE	8
BUFFER_LOAD_USHORT	8
BUFFER_STORE_BYTE	16
BUFFER_STORE_DWORD	16
BUFFER_STORE_DWORDX2	16
BUFFER_STORE_DWORDX3	16
BUFFER_STORE_DWORDX4	16
BUFFER_STORE_FORMAT_X	16
BUFFER_STORE_FORMAT_XY	16
BUFFER_STORE_FORMAT_XYZ	16
BUFFER_STORE_FORMAT_XYZW	16
BUFFER_STORE_SHORT	16
BUFFER_WBINVL1	?
BUFFER_WBINVL1_SC	?
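
The Cycles entries in the MUBUF table can be evaluated mechanically. The Python sketch below parses an entry such as 16+GLC2 and adds the extra cycles only when the GLC modifier is used. The function name `mubuf_cycles` is hypothetical, and treating an entry with a trailing question mark (such as 18?) as its numeric value is an assumption of this sketch.

```python
import re


def mubuf_cycles(entry, glc=False):
    """Evaluate a Cycles entry such as '16+GLC1' or '32'; return None if unknown."""
    # Base cycle count, optional +GLCX part, optional trailing '?' (uncertain value).
    match = re.fullmatch(r"(\d+)(?:\+GLC(\d+))?\??", entry)
    if match is None:
        return None  # a bare '?' entry: no timing is given in the table
    base = int(match.group(1))
    extra = int(match.group(2)) if match.group(2) else 0
    return base + (extra if glc else 0)


print(mubuf_cycles("16+GLC1", glc=True))   # 17
print(mubuf_cycles("16+GLC2", glc=False))  # 16
print(mubuf_cycles("32", glc=True))        # 32
print(mubuf_cycles("?"))                   # None
```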
