Changes between Version 9 and Version 10 of GcnInstrsFlat

Timestamp: 11/28/17 22:00:30
Author: trac
Comment: --

GcnInstrsFlat

-                      v9
+                      v10
 <p>These instructions allow access to main memory, LDS and the scratch buffer.
 FLAT instructions fetch the address from 2 vector registers that hold a 64-bit address.
 FLAT instruction presents only in GCN 1.1 or later architecture.</p>
 <p>List of fields for the FLAT encoding (GCN 1.1 - 1.4):</p>
+FLAT instructions are present only in GCN 1.1 or later architectures.</p>
+<p>List of fields for the FLAT encoding (GCN 1.1/1.2):</p>
 <table>
 <thead>
 …
 <td>NV</td>
 <td>Non-Volatile (GCN 1.4)</td>
+</tr>
+<tr>
+<td>56-63</td>
+<td>VDST</td>
+<td>Vector destination register</td>
+</tr>
+</tbody>
+</table>
+<p>List of fields for the FLAT encoding (GCN 1.4):</p>
+<table>
+<thead>
+<tr>
+<th>Bits</th>
+<th>Name</th>
+<th>Description</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>0-12</td>
+<td>OFFSET</td>
+<td>Byte offset</td>
+</tr>
+<tr>
+<td>13</td>
+<td>LDS</td>
+<td>transfer DATA to LDS and memory</td>
+</tr>
+<tr>
+<td>14-15</td>
+<td>SEG</td>
+<td>Memory segment (instruction type)</td>
+</tr>
+<tr>
+<td>16</td>
+<td>GLC</td>
+<td>Operation globally coherent</td>
+</tr>
+<tr>
+<td>17</td>
+<td>SLC</td>
+<td>System level coherent</td>
+</tr>
+<tr>
+<td>18-24</td>
+<td>OPCODE</td>
+<td>Operation code</td>
+</tr>
+<tr>
+<td>26-31</td>
+<td>ENCODING</td>
+<td>Encoding type. Must be 0b110111</td>
+</tr>
+<tr>
+<td>32-39</td>
+<td>VADDR</td>
+<td>Vector address registers</td>
+</tr>
+<tr>
+<td>40-47</td>
+<td>VDATA</td>
+<td>Vector data register</td>
+</tr>
+<tr>
+<td>48-54</td>
+<td>SADDR</td>
+<td>Scalar SGPR offset (for GLOBAL/SCRATCH) (0x7f value disables it)</td>
+</tr>
+<tr>
+<td>55</td>
+<td>NV</td>
+<td>Non-Volatile</td>
 </tr>
 <tr>
 …
 SCRATCH instruction syntax: INSTRUCTION VADDR(2), VDATA, SADDR|OFF [MODIFIERS]</p>
 <p>Modifiers can be supplied in any order. Modifiers list: SLC, GLC, TFE,
 LDS, NV, OFFSET:OFFSET. The TFE flag requires additional the VDATA register.
+LDS, NV, INST_OFFSET:OFFSET. The TFE flag requires an additional VDATA register.
 LDS, NV and OFFSET are available only in GCN 1.4 architecture.</p>
 <p>FLAT instructions can complete out of order with each other, because they can
 load from or store to different resources. A FLAT instruction increments VMCNT if it accesses
 main memory, or LGKMCNT if it accesses LDS.</p>
 <p>OFFSET can be 13-bit signed for GLOBAL_* and SCRATCH_* instructions or
 -bit unsigned for FLAT_* instructions.</p>
+<p>OFFSET (INST_OFFSET modifier) can be 13-bit signed for GLOBAL_* and SCRATCH_*
+instructions or 12-bit unsigned for FLAT_* instructions.</p>
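The GCN 1.4 field layout and the OFFSET rule above can be sketched as a small decoder. This is a minimal illustration under stated assumptions, not code from any assembler: the names `decode_flat_gcn14` and `inst_offset` are hypothetical, the caller states whether the instruction is FLAT_* instead of deriving it from SEG (whose values are not listed here), ENCODING is read from bits 26-31 on the assumption that the 6-bit value 0b110111 occupies the top six bits, and VDST is assumed to stay at bits 56-63 as in the GCN 1.1/1.2 table.

```python
def decode_flat_gcn14(word):
    """Split a 64-bit GCN 1.4 FLAT/GLOBAL/SCRATCH instruction word into raw fields."""
    def bits(lo, hi):
        # Extract the inclusive bit range lo..hi.
        return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

    return {
        "OFFSET": bits(0, 12),
        "LDS": bits(13, 13),
        "SEG": bits(14, 15),
        "GLC": bits(16, 16),
        "SLC": bits(17, 17),
        "OPCODE": bits(18, 24),
        # Assumption: 0b110111 is six bits wide, so it is read from bits 26-31.
        "ENCODING": bits(26, 31),
        "VADDR": bits(32, 39),
        "VDATA": bits(40, 47),
        "SADDR": bits(48, 54),
        "NV": bits(55, 55),
        # Assumption: VDST keeps the position it has in the GCN 1.1/1.2 table.
        "VDST": bits(56, 63),
    }


def inst_offset(raw, is_flat):
    """Interpret the raw 13-bit OFFSET field as INST_OFFSET."""
    if is_flat:
        return raw & 0xFFF  # 12-bit unsigned for FLAT_*
    # 13-bit signed (two's complement) for GLOBAL_* and SCRATCH_*
    return raw - 0x2000 if raw & 0x1000 else raw


# Hand-built example word: opcode 0x42, GLC set, OFFSET all ones, SADDR disabled (0x7f).
word = (0x1FFF | (1 << 16) | (0x42 << 18) | (0b110111 << 26)
        | (4 << 32) | (7 << 40) | (0x7F << 48) | (9 << 56))
fields = decode_flat_gcn14(word)
print(fields["OPCODE"], inst_offset(fields["OFFSET"], is_flat=False))
```

The same raw OFFSET bits give different results depending on the instruction family, which is why the interpretation is kept separate from the field extraction.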
 <h3>Instructions by opcode</h3>
 <p>List of the FLAT instructions by opcode (GCN 1.1/1.2):</p>
 …
 </tbody>
 </table>
+<p>List of the FLAT/GLOBAL/SCRATCH instructions by opcode (GCN 1.4):</p>
+<table>
+<thead>
+<tr>
+<th>Opcode</th>
+<th>FLAT</th>
+<th>GLOBAL</th>
+<th>SCRATCH</th>
+<th>Mnemonic</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>16 (0x10)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_UBYTE</td>
+</tr>
+<tr>
+<td>17 (0x11)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_SBYTE</td>
+</tr>
+<tr>
+<td>18 (0x12)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_USHORT</td>
+</tr>
+<tr>
+<td>19 (0x13)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_SSHORT</td>
+</tr>
+<tr>
+<td>20 (0x14)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_DWORD</td>
+</tr>
+<tr>
+<td>21 (0x15)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_DWORDX2</td>
+</tr>
+<tr>
+<td>22 (0x16)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_DWORDX3</td>
+</tr>
+<tr>
+<td>23 (0x17)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_DWORDX4</td>
+</tr>
+<tr>
+<td>24 (0x18)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_STORE_BYTE</td>
+</tr>
+<tr>
+<td>25 (0x19)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_STORE_BYTE_D16_HI</td>
+</tr>
+<tr>
+<td>26 (0x1a)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_STORE_SHORT</td>
+</tr>
+<tr>
+<td>27 (0x1b)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_STORE_SHORT_D16_HI</td>
+</tr>
+<tr>
+<td>28 (0x1c)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_STORE_DWORD</td>
+</tr>
+<tr>
+<td>29 (0x1d)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_STORE_DWORDX2</td>
+</tr>
+<tr>
+<td>30 (0x1e)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_STORE_DWORDX3</td>
+</tr>
+<tr>
+<td>31 (0x1f)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_STORE_DWORDX4</td>
+</tr>
+<tr>
+<td>32 (0x20)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_UBYTE_D16</td>
+</tr>
+<tr>
+<td>33 (0x21)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_UBYTE_D16_HI</td>
+</tr>
+<tr>
+<td>34 (0x22)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_SBYTE_D16</td>
+</tr>
+<tr>
+<td>35 (0x23)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_SBYTE_D16_HI</td>
+</tr>
+<tr>
+<td>36 (0x24)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_SHORT_D16</td>
+</tr>
+<tr>
+<td>37 (0x25)</td>
+<td>✓</td>
+<td>✓</td>
+<td>✓</td>
+<td>*_LOAD_SHORT_D16_HI</td>
+</tr>
+<tr>
+<td>64 (0x40)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_SWAP</td>
+</tr>
+<tr>
+<td>65 (0x41)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_CMPSWAP</td>
+</tr>
+<tr>
+<td>66 (0x42)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_ADD</td>
+</tr>
+<tr>
+<td>67 (0x43)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_SUB</td>
+</tr>
+<tr>
+<td>68 (0x44)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_SMIN</td>
+</tr>
+<tr>
+<td>69 (0x45)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_UMIN</td>
+</tr>
+<tr>
+<td>70 (0x46)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_SMAX</td>
+</tr>
+<tr>
+<td>71 (0x47)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_UMAX</td>
+</tr>
+<tr>
+<td>72 (0x48)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_AND</td>
+</tr>
+<tr>
+<td>73 (0x49)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_OR</td>
+</tr>
+<tr>
+<td>74 (0x4a)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_XOR</td>
+</tr>
+<tr>
+<td>75 (0x4b)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_INC</td>
+</tr>
+<tr>
+<td>76 (0x4c)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_DEC</td>
+</tr>
+<tr>
+<td>96 (0x60)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_SWAP_X2</td>
+</tr>
+<tr>
+<td>97 (0x61)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_CMPSWAP_X2</td>
+</tr>
+<tr>
+<td>98 (0x62)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_ADD_X2</td>
+</tr>
+<tr>
+<td>99 (0x63)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_SUB_X2</td>
+</tr>
+<tr>
+<td>100 (0x64)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_SMIN_X2</td>
+</tr>
+<tr>
+<td>101 (0x65)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_UMIN_X2</td>
+</tr>
+<tr>
+<td>102 (0x66)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_SMAX_X2</td>
+</tr>
+<tr>
+<td>103 (0x67)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_UMAX_X2</td>
+</tr>
+<tr>
+<td>104 (0x68)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_AND_X2</td>
+</tr>
+<tr>
+<td>105 (0x69)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_OR_X2</td>
+</tr>
+<tr>
+<td>106 (0x6a)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_XOR_X2</td>
+</tr>
+<tr>
+<td>107 (0x6b)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_INC_X2</td>
+</tr>
+<tr>
+<td>108 (0x6c)</td>
+<td>✓</td>
+<td>✓</td>
+<td></td>
+<td>*_ATOMIC_DEC_X2</td>
+</tr>
+</tbody>
+</table>
+<p>The '*' stands for the instruction prefix (FLAT, GLOBAL or SCRATCH).</p>
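How the prefix combines with the table rows can be sketched as a lookup. This is only an illustration: the `OPCODES` dictionary holds a hand-picked subset of the GCN 1.4 rows above, and the `mnemonic` helper is a hypothetical name. The boolean records whether the SCRATCH column is ticked for that row.

```python
# Subset of the GCN 1.4 opcode table: opcode -> (name without prefix, has SCRATCH form)
OPCODES = {
    0x14: ("LOAD_DWORD", True),
    0x1C: ("STORE_DWORD", True),
    0x42: ("ATOMIC_ADD", False),
    0x62: ("ATOMIC_ADD_X2", False),
}


def mnemonic(prefix, opcode):
    """Return the full mnemonic, or None if the table has no such instruction."""
    entry = OPCODES.get(opcode)
    if entry is None:
        return None
    name, has_scratch = entry
    if prefix == "SCRATCH" and not has_scratch:
        return None  # atomic rows have no SCRATCH_* form in the table
    return f"{prefix}_{name}"


print(mnemonic("GLOBAL", 0x14))
print(mnemonic("SCRATCH", 0x42))
```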
 <h3>Instruction set</h3>
 <p>Alphabetically sorted instruction list:</p>
 <h4>FLAT_ATOMIC_ADD</h4>
 <p>Opcode: 50 (0x32) for GCN 1.1; 66 (0x42) for GCN 1.2<br />
+<p>Opcode: 50 (0x32) for GCN 1.1; 66 (0x42) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_ADD VDST, VADDR(2), VDATA<br />
 Description: Add VDATA to value of VADDR address, and store result to this address.
+Description: Add VDATA to value of memory address, and store result to this address.
 If GLC flag is set then return previous value from this address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = *VM + VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_ADD_X2</h4>
 <p>Opcode: 82 (0x52) for GCN 1.1; 98 (0x62) for GCN 1.2<br />
+<p>Opcode: 82 (0x52) for GCN 1.1; 98 (0x62) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_ADD_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Add 64-bit VDATA to 64-bit value of VADDR address, and store result
+Description: Add 64-bit VDATA to 64-bit value of memory address, and store result
 to this address. If GLC flag is set then return previous value from address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = *VM + VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_AND</h4>
 <p>Opcode: 57 (0x39) for GCN 1.1; 72 (0x48) for GCN 1.2<br />
+<p>Opcode: 57 (0x39) for GCN 1.1; 72 (0x48) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_AND VDST, VADDR(2), VDATA<br />
 Description: Do bitwise AND on VDATA and value of VADDR address,
+Description: Do bitwise AND on VDATA and value of memory address,
 and store result to this address. If GLC flag is set then return previous value
 from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = *VM &amp; VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_AND_X2</h4>
 <p>Opcode: 89 (0x59) for GCN 1.1; 104 (0x68) for GCN 1.2<br />
+<p>Opcode: 89 (0x59) for GCN 1.1; 104 (0x68) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_AND_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Do 64-bit bitwise AND on VDATA and value of VADDR address,
+Description: Do 64-bit bitwise AND on VDATA and value of memory address,
 and store result to this address. If GLC flag is set then return previous value
 from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = *VM &amp; VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_CMPSWAP</h4>
 <p>Opcode: 49 (0x31) for GCN 1.1; 65 (0x41) for GCN 1.2<br />
+<p>Opcode: 49 (0x31) for GCN 1.1; 65 (0x41) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_CMPSWAP VDST, VADDR(2), VDATA(2)<br />
 Description: Store lower VDATA dword into VADDR address  if previous value
+Description: Store lower VDATA dword into memory address if previous value
 from that address is equal to VDATA&gt;&gt;32, otherwise keep old value from address.
 If GLC flag is set then return previous value from address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = *VM==(VDATA&gt;&gt;32) ? VDATA&amp;0xffffffff : *VM // part of atomic
 VDST = (GLC) ? P : VDST // last part of atomic</code></p>
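The packing of the compare and swap values into the two VDATA registers can be sketched for a single lane as follows. This is a rough emulation of the pseudo-code above, not hardware behaviour: `flat_atomic_cmpswap` is a hypothetical helper, memory is modelled as a plain dictionary from byte address to dword, and no real atomicity is involved.

```python
MASK32 = 0xFFFFFFFF


def flat_atomic_cmpswap(mem, addr, vdata, vdst, glc):
    """Emulate the pseudo-code for one lane; returns the new VDST value."""
    prev = mem[addr]
    compare = (vdata >> 32) & MASK32  # upper dword of VDATA
    swap = vdata & MASK32             # lower dword of VDATA
    if prev == compare:
        mem[addr] = swap
    # With GLC set the previous memory value is returned, otherwise VDST is kept.
    return prev if glc else vdst


mem = {0x1000: 5}
# The compare value 5 matches, so 9 is stored and the old value comes back (GLC set).
returned = flat_atomic_cmpswap(mem, 0x1000, (5 << 32) | 9, vdst=0, glc=True)
print(returned, mem[0x1000])
```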
 <h4>FLAT_ATOMIC_CMPSWAP_X2</h4>
 <p>Opcode: 81 (0x51) for GCN 1.1; 97 (0x61) for GCN 1.2<br />
+<p>Opcode: 81 (0x51) for GCN 1.1; 97 (0x61) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_CMPSWAP_X2 VDST(2), VADDR(2), VDATA(4)<br />
 Description: Store lower VDATA 64-bit word into VADDR address if previous value
+Description: Store lower VDATA 64-bit word into memory address if previous value
 from address is equal to VDATA&gt;&gt;64, otherwise keep old value from this address.
 If GLC flag is set then return previous value from this address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = *VM==(VDATA[2:3]) ? VDATA[0:1] : *VM // part of atomic
 VDST = (GLC) ? P : VDST // last part of atomic</code></p>
 <h4>FLAT_ATOMIC_DEC</h4>
 <p>Opcode: 61 (0x3d) for GCN 1.1; 76 (0x4c) for GCN 1.2<br />
+<p>Opcode: 61 (0x3d) for GCN 1.1; 76 (0x4c) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_DEC VDST, VADDR(2), VDATA<br />
 Description: Compare value from VADDR address and if less or equal than VDATA
 and this value is not zero, then decrement value from VADDR address,
+Description: Compare value from memory address and if it is less than or equal to VDATA
+and this value is not zero, then decrement value from memory address,
 otherwise store VDATA to this address. If GLC flag is set then return previous value
 from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = (*VM &lt;= VDATA &amp;&amp; *VM!=0) ? *VM-1 : VDATA // atomic
 VDST = (GLC) ? P : VDST // atomic</code></p>
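The wrap-around behaviour of the decrement is easier to see when the update rule is applied repeatedly. The sketch below only mirrors the pseudo-code's memory update; `atomic_dec_step` and `run_dec` are hypothetical names, and the GLC/VDST part is left out.

```python
def atomic_dec_step(value, vdata):
    """New memory value after one FLAT_ATOMIC_DEC, following the pseudo-code."""
    if value <= vdata and value != 0:
        return value - 1
    return vdata  # value was zero or above the limit: reload VDATA


def run_dec(start, vdata, steps):
    """Apply the update `steps` times and collect the memory values."""
    values = []
    value = start
    for _ in range(steps):
        value = atomic_dec_step(value, vdata)
        values.append(value)
    return values


print(run_dec(2, 3, 5))
```

Starting from 2 with VDATA equal to 3, the stored value counts down to zero and then wraps back to VDATA.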
 <h4>FLAT_ATOMIC_DEC_X2</h4>
 <p>Opcode: 93 (0x5d) for GCN 1.1; 108 (0x6c) for GCN 1.2<br />
+<p>Opcode: 93 (0x5d) for GCN 1.1; 108 (0x6c) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_DEC_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Compare 64-bit value from VADDR address and if less or equal than VDATA
 and this value is not zero, then decrement value from VADDR address,
+Description: Compare 64-bit value from memory address and if it is less than or equal to VDATA
+and this value is not zero, then decrement value from memory address,
 otherwise store VDATA to this address. If GLC flag is set then return previous value
 from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = (*VM &lt;= VDATA &amp;&amp; *VM!=0) ? *VM-1 : VDATA // atomic
 VDST = (GLC) ? P : VDST // atomic</code></p>
 …
 <p>Opcode: 62 (0x3e) for GCN 1.1<br />
 Syntax: FLAT_ATOMIC_FCMPSWAP VDST, VADDR(1:2), VDATA(2)<br />
 Description: Store lower VDATA dword into VADDR address if previous single floating point
+Description: Store lower VDATA dword into memory address if previous single floating point
 value from address is equal to the single floating point value VDATA&gt;&gt;32,
 otherwise keep old value from VADDR address.
+otherwise keep old value from memory address.
 If GLC flag is set then return previous value from this address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>FLOAT* VM = (FLOAT*)VADDR
+<code>FLOAT* VM = (FLOAT*)(VADDR + INST_OFFSET)
 FLOAT P = *VM; *VM = *VM==ASFLOAT(VDATA&gt;&gt;32) ? VDATA&amp;0xffffffff : *VM // part of atomic
 VDST[0] = (GLC) ? P : VDST // last part of atomic</code></p>
 …
 <p>Opcode: 94 (0x5e) for GCN 1.1<br />
 Syntax: FLAT_ATOMIC_FCMPSWAP_X2 VDST(2), VADDR(2), VDATA(4)<br />
 Description: Store lower VDATA 64-bit word into VADDR address if previous double
+Description: Store lower VDATA 64-bit word into memory address if previous double
 floating point value from address is equal to the double floating point value VDATA&gt;&gt;64,
 otherwise keep old value from VADDR address.
+otherwise keep old value from memory address.
 If GLC flag is set then return previous value from address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>DOUBLE* VM = (DOUBLE*)VMADDR
+<code>DOUBLE* VM = (DOUBLE*)(VADDR + INST_OFFSET)
 DOUBLE P = *VM; *VM = *VM==ASDOUBLE(VDATA[2:3]) ? VDATA[0:1] : *VM // part of atomic
 VDST = (GLC) ? P : VDST // last part of atomic</code></p>
 …
 Syntax: FLAT_ATOMIC_FMAX VDST, VADDR(2), VDATA<br />
 Description: Choose greatest single floating point value from VDATA and from
 VADDR address, and store result to this address.
+memory address, and store result to this address.
 If GLC flag is set then return previous value from address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>FLOAT* VM = (FLOAT*)VADDR
+<code>FLOAT* VM = (FLOAT*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = MAX(*VM, ASFLOAT(VDATA)); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_FMAX_X2</h4>
 …
 Syntax: FLAT_ATOMIC_FMAX_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Choose greatest double floating point value from VDATA and from
 VADDR address, and store result to this address.
+memory address, and store result to this address.
 If GLC flag is set then return previous value from address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>DOUBLE* VM = (DOUBLE*)VADDR
+<code>DOUBLE* VM = (DOUBLE*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = MAX(*VM, ASDOUBLE(VDATA)); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_FMIN</h4>
 …
 Syntax: FLAT_ATOMIC_FMIN VDST, VADDR(2), VDATA<br />
 Description: Choose smallest single floating point value from VDATA and from
 VADDR address, and store result to this address.
+memory address, and store result to this address.
 If GLC flag is set then return previous value from address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>FLOAT* VM = (FLOAT*)VADDR
+<code>FLOAT* VM = (FLOAT*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = MIN(*VM, ASFLOAT(VDATA)); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_FMIN_X2</h4>
 …
 Syntax: FLAT_ATOMIC_FMIN_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Choose smallest double floating point value from VDATA and from
 VADDR address, and store result to this address.
+memory address, and store result to this address.
 If GLC flag is set then return previous value from address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>DOUBLE* VM = (DOUBLE*)VADDR
+<code>DOUBLE* VM = (DOUBLE*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = MIN(*VM, ASDOUBLE(VDATA)); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_INC</h4>
 <p>Opcode: 60 (0x3c) for GCN 1.1; 75 (0x4b) for GCN 1.2<br />
+<p>Opcode: 60 (0x3c) for GCN 1.1; 75 (0x4b) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_INC VDST, VADDR(2), VDATA<br />
 Description: Compare value from VADDR address and if less than VDATA,
+Description: Compare value from memory address and if less than VDATA,
 then increment value from address, otherwise store zero to address.
 If GLC flag is set then return previous value from this address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = (*VM &lt; VDATA) ? *VM+1 : 0; VDST = (GLC) ? P : VDST // atomic</code></p>
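Applied repeatedly, the increment rule behaves like a counter that wraps to zero once it reaches VDATA. The sketch below mirrors only the memory update from the pseudo-code; `atomic_inc_step` and `run_inc` are hypothetical names, and the GLC/VDST part is left out.

```python
def atomic_inc_step(value, vdata):
    """New memory value after one FLAT_ATOMIC_INC, following the pseudo-code."""
    return value + 1 if value < vdata else 0


def run_inc(start, vdata, steps):
    """Apply the update `steps` times and collect the memory values."""
    values = []
    value = start
    for _ in range(steps):
        value = atomic_inc_step(value, vdata)
        values.append(value)
    return values


print(run_inc(0, 3, 5))
```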
 <h4>FLAT_ATOMIC_INC_X2</h4>
 <p>Opcode: 92 (0x5c) for GCN 1.1; 107 (0x6b) for GCN 1.2<br />
+<p>Opcode: 92 (0x5c) for GCN 1.1; 107 (0x6b) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_INC_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Compare 64-bit value from VADDR address and if less than VDATA,
+Description: Compare 64-bit value from memory address and if less than VDATA,
 then increment value from address, otherwise store zero to address.
 If GLC flag is set then return previous value from this address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = (*VM &lt; VDATA) ? *VM+1 : 0; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_OR</h4>
 <p>Opcode: 58 (0x3a) for GCN 1.1; 73 (0x49) for GCN 1.2<br />
+<p>Opcode: 58 (0x3a) for GCN 1.1; 73 (0x49) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_OR VDST, VADDR(2), VDATA<br />
 Description: Do bitwise OR on VDATA and value of VADDR address,
+Description: Do bitwise OR on VDATA and value of memory address,
 and store result to this address. If GLC flag is set then return previous value
 from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = *VM | VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_OR_X2</h4>
 <p>Opcode: 90 (0x5a) for GCN 1.1; 105 (0x69) for GCN 1.2<br />
+<p>Opcode: 90 (0x5a) for GCN 1.1; 105 (0x69) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_OR_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Do 64-bit bitwise OR on VDATA and value of VADDR address,
+Description: Do 64-bit bitwise OR on VDATA and value of memory address,
 and store result to this address. If GLC flag is set then return previous value
 from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = *VM | VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_SMAX</h4>
 <p>Opcode: 55 (0x37) for GCN 1.1; 70 (0x46) for GCN 1.2<br />
+<p>Opcode: 55 (0x37) for GCN 1.1; 70 (0x46) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_SMAX VDST, VADDR(2), VDATA<br />
 Description: Choose greatest signed 32-bit value from VDATA and from VADDR address,
+Description: Choose greatest signed 32-bit value from VDATA and from memory address,
 and store result to this address.
 If GLC flag is set then return previous value from this address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>INT32* VM = (INT32*)VADDR
+<code>INT32* VM = (INT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = MAX(*VM, (INT32)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_SMAX_X2</h4>
 <p>Opcode: 87 (0x57) for GCN 1.1; 102 (0x66) for GCN 1.2<br />
+<p>Opcode: 87 (0x57) for GCN 1.1; 102 (0x66) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_SMAX_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Choose greatest signed 64-bit value from VDATA and from VADDR address,
+Description: Choose greatest signed 64-bit value from VDATA and from memory address,
 and store result to this address.
 If GLC flag is set then return previous value from this address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>INT64* VM = (INT64*)VADDR
+<code>INT64* VM = (INT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = MAX(*VM, (INT64)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_SMIN</h4>
 <p>Opcode: 53 (0x35) for GCN 1.1; 68 (0x44) for GCN 1.2<br />
+<p>Opcode: 53 (0x35) for GCN 1.1; 68 (0x44) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_SMIN VDST, VADDR(2), VDATA<br />
 Description: Choose smallest signed 32-bit value from VDATA and from VADDR address,
+Description: Choose smallest signed 32-bit value from VDATA and from memory address,
 and store result to this address.
 If GLC flag is set then return previous value from this address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>INT32* VM = (INT32*)VADDR
+<code>INT32* VM = (INT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = MIN(*VM, (INT32)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_SMIN_X2</h4>
 <p>Opcode: 85 (0x55) for GCN 1.1; 100 (0x64) for GCN 1.2<br />
+<p>Opcode: 85 (0x55) for GCN 1.1; 100 (0x64) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_SMIN_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Choose smallest signed 64-bit value from VDATA and from VADDR address,
+Description: Choose smallest signed 64-bit value from VDATA and from memory address,
 and store result to this address.
 If GLC flag is set then return previous value from this address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>INT64* VM = (INT64*)VADDR
+<code>INT64* VM = (INT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = MIN(*VM, (INT64)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_SUB</h4>
 <p>Opcode: 51 (0x33) for GCN 1.1; 67 (0x43) for GCN 1.2<br />
+<p>Opcode: 51 (0x33) for GCN 1.1; 67 (0x43) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_SUB VDST, VADDR(2), VDATA<br />
 Description: Subtract VDATA from value of VADDR address, and store result to this address.
+Description: Subtract VDATA from value of memory address, and store result to this address.
 If GLC flag is set then return previous value from this address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = *VM - VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_SUB_X2</h4>
 <p>Opcode: 83 (0x53) for GCN 1.1; 99 (0x63) for GCN 1.2<br />
+<p>Opcode: 83 (0x53) for GCN 1.1; 99 (0x63) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_SUB_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Subtract 64-bit VDATA from 64-bit value of VADDR address, and store result
+Description: Subtract 64-bit VDATA from 64-bit value of memory address, and store result
 to this address. If GLC flag is set then return previous value from address to VDST,
 otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = *VM - VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_SWAP</h4>
 <p>Opcode: 48 (0x30) for GCN 1.1; 64 (0x40) for GCN 1.2<br />
 Syntax: FLAT_ATOMIC_SWAP VDST, VADDR(2), VDATA
 Description: Store VDATA dword into VADDR address. If GLC flag is set then
 return previous value from VADDR address to VDST, otherwise keep old value from VDST.
+<p>Opcode: 48 (0x30) for GCN 1.1; 64 (0x40) for GCN 1.2/1.4<br />
+Syntax: FLAT_ATOMIC_SWAP VDST, VADDR(2), VDATA<br />
+Description: Store VDATA dword into memory address. If GLC flag is set then
+return previous value from memory address to VDST, otherwise keep old value from VDST.
 Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_SWAP_X2</h4>
 <p>Opcode: 80 (0x50) for GCN 1.1; 96 (0x60) for GCN 1.2<br />
 Syntax: FLAT_ATOMIC_SWAP_X2 VDST(2), VADDR(2), VDATA(2)
 Description: Store VDATA 64-bit word into VADDR address. If GLC flag is set then
 return previous value from VADDR address to VDST, otherwise keep old value from VDST.
+<p>Opcode: 80 (0x50) for GCN 1.1; 96 (0x60) for GCN 1.2/1.4<br />
+Syntax: FLAT_ATOMIC_SWAP_X2 VDST(2), VADDR(2), VDATA(2)<br />
+Description: Store VDATA 64-bit word into memory address. If GLC flag is set then
+return previous value from memory address to VDST, otherwise keep old value from VDST.
 Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_UMAX</h4>
 <p>Opcode: 56 (0x38) for GCN 1.1; 71 (0x47) for GCN 1.2<br />
+<p>Opcode: 56 (0x38) for GCN 1.1; 71 (0x47) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_UMAX VDST, VADDR(2), VDATA<br />
 Description: Choose greatest unsigned 32-bit value from VDATA and from VADDR address,
+Description: Choose greatest unsigned 32-bit value from VDATA and from memory address,
 and store result to this address.
 If GLC flag is set then return previous value from this address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = MAX(*VM, (UINT32)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_UMAX_X2</h4>
 <p>Opcode: 88 (0x58) for GCN 1.1; 103 (0x67) for GCN 1.2<br />
+<p>Opcode: 88 (0x58) for GCN 1.1; 103 (0x67) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_UMAX_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Choose greatest unsigned 64-bit value from VDATA and from VADDR address,
+Description: Choose greatest unsigned 64-bit value from VDATA and from memory address,
 and store result to this address.
 If GLC flag is set then return previous value from this address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = MAX(*VM, (UINT64)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_UMIN</h4>
 <p>Opcode: 54 (0x36) for GCN 1.1; 69 (0x45) for GCN 1.2<br />
+<p>Opcode: 54 (0x36) for GCN 1.1; 69 (0x45) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_UMIN VDST, VADDR(2), VDATA<br />
 Description: Choose smallest unsigned 32-bit value from VDATA and from VADDR address,
+Description: Choose smallest unsigned 32-bit value from VDATA and from memory address,
 and store result to this address.
 If GLC flag is set then return previous value from this address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = MIN(*VM, (UINT32)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_UMIN_X2</h4>
 <p>Opcode: 86 (0x56) for GCN 1.1; 101 (0x65) for GCN 1.2<br />
+<p>Opcode: 86 (0x56) for GCN 1.1; 101 (0x65) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_UMIN_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Choose smallest unsigned 64-bit value from VDATA and from VADDR address,
+Description: Choose smallest unsigned 64-bit value from VDATA and from memory address,
 and store result to this address.
 If GLC flag is set then return previous value from this address to VDST, otherwise keep
 VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = MIN(*VM, (UINT64)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_XOR</h4>
 <p>Opcode: 59 (0x3b) for GCN 1.1; 74 (0x4a) for GCN 1.2<br />
+<p>Opcode: 59 (0x3b) for GCN 1.1; 74 (0x4a) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_XOR VDST, VADDR(2), VDATA<br />
 Description: Do bitwise XOR on VDATA and value of VADDR address,
+Description: Do bitwise XOR on VDATA and value of memory address,
 and store result to this address. If GLC flag is set then return previous value
 from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT32* VM = (UINT32*)VADDR
+<code>UINT32* VM = (UINT32*)(VADDR + INST_OFFSET)
 UINT32 P = *VM; *VM = *VM ^ VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_ATOMIC_XOR_X2</h4>
 <p>Opcode: 91 (0x5b) for GCN 1.1; 106 (0x6a) for GCN 1.2<br />
+<p>Opcode: 91 (0x5b) for GCN 1.1; 106 (0x6a) for GCN 1.2/1.4<br />
 Syntax: FLAT_ATOMIC_XOR_X2 VDST(2), VADDR(2), VDATA(2)<br />
 Description: Do 64-bit bitwise XOR on VDATA and value of VADDR address,
+Description: Do 64-bit bitwise XOR on VDATA and value of memory address,
 and store result to this address. If GLC flag is set then return previous value
 from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
 Operation:<br />
 <code>UINT64* VM = (UINT64*)VADDR
+<code>UINT64* VM = (UINT64*)(VADDR + INST_OFFSET)
 UINT64 P = *VM; *VM = *VM ^ VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
 <h4>FLAT_LOAD_DWORD</h4>
 <p>Opcode: 12 (0xc) for GCN 1.1; 20 (0x14) for GCN 1.2<br />
+<p>Opcode: 12 (0xc) for GCN 1.1; 20 (0x14) for GCN 1.2/1.4<br />
 Syntax: FLAT_LOAD_DWORD VDST, VADDR(2)<br />
 Description Load dword to VDST from VADDR address.<br />
 Operation:<br />
 <code>VDST = *(UINT32*)VADDR</code></p>
+Description: Load dword to VDST from memory address.<br />
+Operation:<br />
+<code>VDST = *(UINT32*)(VADDR + INST_OFFSET)</code></p>
 <h4>FLAT_LOAD_DWORDX2</h4>
 <p>Opcode: 13 (0xd) for GCN 1.1; 21 (0x15) for GCN 1.2<br />
+<p>Opcode: 13 (0xd) for GCN 1.1; 21 (0x15) for GCN 1.2/1.4<br />
 Syntax: FLAT_LOAD_DWORDX2 VDST(2), VADDR(2)<br />
 Description: Load two dwords to VDST from VADDR address.<br />
 Operation:<br />
 <code>VDST = *(UINT64*)VADDR</code></p>
+Description: Load two dwords to VDST from memory address.<br />
+Operation:<br />
+<code>VDST = *(UINT64*)(VADDR + INST_OFFSET)</code></p>
 <h4>FLAT_LOAD_DWORDX3</h4>
 <p>Opcode: 15 (0xf) for GCN 1.1; 22 (0x16) for GCN 1.2<br />
+<p>Opcode: 15 (0xf) for GCN 1.1; 22 (0x16) for GCN 1.2/1.4<br />
 Syntax: FLAT_LOAD_DWORDX3 VDST(3), VADDR(2)<br />
+Description: Load three dwords to VDST from VADDR address.<br />
+Operation:<br />
+<code>VDST[0] = *(UINT32*)VADDR
+VDST[1] = *(UINT32*)(VADDR+4)
+VDST[2] = *(UINT32*)(VADDR+8)</code></p>
+Description: Load three dwords to VDST from memory address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+VDST[0] = *(UINT32*)VM
+VDST[1] = *(UINT32*)(VM+4)
+VDST[2] = *(UINT32*)(VM+8)</code></p>
 <h4>FLAT_LOAD_DWORDX4</h4>
 <p>Opcode: 14 (0xe) for GCN 1.1; 23 (0x17) for GCN 1.2<br />
+<p>Opcode: 14 (0xe) for GCN 1.1; 23 (0x17) for GCN 1.2/1.4<br />
 Syntax: FLAT_LOAD_DWORDX4 VDST(4), VADDR(2)<br />
+Description: Load four dwords to VDST from VADDR address.<br />
+Operation:<br />
+<code>VDST[0] = *(UINT32*)VADDR
+VDST[1] = *(UINT32*)(VADDR+4)
+VDST[2] = *(UINT32*)(VADDR+8)
+VDST[3] = *(UINT32*)(VADDR+12)</code></p>
+Description: Load four dwords to VDST from memory address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+VDST[0] = *(UINT32*)VM
+VDST[1] = *(UINT32*)(VM+4)
+VDST[2] = *(UINT32*)(VM+8)
+VDST[3] = *(UINT32*)(VM+12)</code></p>
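<p>The multi-dword loads above all follow the same pattern: consecutive dwords are read from VM, VM+4, VM+8 and so on into consecutive VDST registers. A minimal host-side sketch of that pattern follows; <code>load_dwords</code> is a hypothetical helper, and <code>vm</code> is assumed to be the already computed address VADDR + INST_OFFSET.</p>

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of FLAT_LOAD_DWORDX2/X3/X4 for one lane.
 * vm    - byte pointer to VADDR + INST_OFFSET (assumed already computed)
 * vdst  - array standing in for the consecutive destination registers
 * count - number of dwords to load (2, 3 or 4 for the X2/X3/X4 forms) */
void load_dwords(const uint8_t *vm, uint32_t *vdst, unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        /* memcpy avoids unaligned-access problems on the host. */
        memcpy(&vdst[i], vm + 4u * i, sizeof(uint32_t));
    }
}
```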
 <h4>FLAT_LOAD_SBYTE</h4>
 <p>Opcode: 9 (0x9) for GCN 1.1; 17 (0x11) for GCN 1.2<br />
+<p>Opcode: 9 (0x9) for GCN 1.1; 17 (0x11) for GCN 1.2/1.4<br />
 Syntax: FLAT_LOAD_SBYTE VDST, VADDR(2)<br />
+Description: Load byte to VDST from VADDR address with sign extending.<br />
+Operation:<br />
+<code>VDST = *(INT8*)VADDR</code></p>
+Description: Load byte to VDST from memory address with sign extending.<br />
+Operation:<br />
+<code>VDST = *(INT8*)(VADDR + INST_OFFSET)</code></p>
+<h4>FLAT_LOAD_SBYTE_D16</h4>
+<p>Opcode: 34 (0x22) for GCN 1.4<br />
+Syntax: FLAT_LOAD_SBYTE_D16 VDST, VADDR(2)<br />
+Description: Load byte to lower 16-bit part of VDST from
+memory address with sign extending.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+VDST = ((UINT16)*(INT8*)VM) | (VDST&amp;0xffff0000)</code></p>
+<h4>FLAT_LOAD_SBYTE_D16_HI</h4>
+<p>Opcode: 35 (0x23) for GCN 1.4<br />
+Syntax: FLAT_LOAD_SBYTE_D16_HI VDST, VADDR(2)<br />
+Description: Load byte to higher 16-bit part of VDST from
+memory address with sign extending.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+VDST = (((UINT32)*(INT8*)VM)&lt;&lt;16) | (VDST&amp;0xffff)</code></p>
+<h4>FLAT_LOAD_SHORT_D16</h4>
+<p>Opcode: 36 (0x24) for GCN 1.4<br />
+Syntax: FLAT_LOAD_SHORT_D16 VDST, VADDR(2)<br />
+Description: Load 16-bit word to lower 16-bit part of VDST from memory address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+VDST = *(UINT16*)VM | (VDST &amp; 0xffff0000)</code></p>
+<h4>FLAT_LOAD_SHORT_D16_HI</h4>
+<p>Opcode: 37 (0x25) for GCN 1.4<br />
+Syntax: FLAT_LOAD_SHORT_D16_HI VDST, VADDR(2)<br />
+Description: Load 16-bit word to higher 16-bit part of VDST from memory address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+VDST = (((UINT32)*(UINT16*)VM)&lt;&lt;16) | (VDST &amp; 0xffff)</code></p>
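<p>The D16 loads above replace only one 16-bit half of VDST and keep the other half. The following C sketch mirrors the Operation pseudocode for one lane; all function names are hypothetical, and <code>vm</code> is assumed to point at the already computed memory address.</p>

```c
#include <stdint.h>
#include <string.h>

/* Sign-extended byte goes to bits 0-15; bits 16-31 of vdst are kept. */
uint32_t load_sbyte_d16(const uint8_t *vm, uint32_t vdst)
{
    int8_t b = (int8_t)*vm;
    return (uint32_t)(uint16_t)(int16_t)b | (vdst & 0xffff0000u);
}

/* Sign-extended byte goes to bits 16-31; bits 0-15 of vdst are kept. */
uint32_t load_sbyte_d16_hi(const uint8_t *vm, uint32_t vdst)
{
    int8_t b = (int8_t)*vm;
    return ((uint32_t)(uint16_t)(int16_t)b << 16) | (vdst & 0x0000ffffu);
}

/* 16-bit word goes to bits 0-15; bits 16-31 of vdst are kept. */
uint32_t load_short_d16(const uint8_t *vm, uint32_t vdst)
{
    uint16_t h;
    memcpy(&h, vm, sizeof h);
    return (uint32_t)h | (vdst & 0xffff0000u);
}

/* 16-bit word goes to bits 16-31; bits 0-15 of vdst are kept. */
uint32_t load_short_d16_hi(const uint8_t *vm, uint32_t vdst)
{
    uint16_t h;
    memcpy(&h, vm, sizeof h);
    return ((uint32_t)h << 16) | (vdst & 0x0000ffffu);
}
```

<p>For example, loading the byte 0x80 with the signed low-half variant into a register holding 0x12345678 gives 0x1234ff80.</p>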
 <h4>FLAT_LOAD_SSHORT</h4>
 <p>Opcode: 11 (0xb) for GCN 1.1; 19 (0x13) for GCN 1.2<br />
+<p>Opcode: 11 (0xb) for GCN 1.1; 19 (0x13) for GCN 1.2/1.4<br />
 Syntax: FLAT_LOAD_SSHORT VDST, VADDR(2)<br />
 Description: Load 16-bit word to VDST from VADDR address with sign extending.<br />
 Operation:<br />
 <code>VDST = *(INT16*)VADDR</code></p>
+Description: Load 16-bit word to VDST from memory address with sign extending.<br />
+Operation:<br />
+<code>VDST = *(INT16*)(VADDR + INST_OFFSET)</code></p>
 <h4>FLAT_LOAD_UBYTE</h4>
 <p>Opcode: 8 (0x8) for GCN 1.1; 16 (0x10) for GCN 1.2<br />
+<p>Opcode: 8 (0x8) for GCN 1.1; 16 (0x10) for GCN 1.2/1.4<br />
 Syntax: FLAT_LOAD_UBYTE VDST, VADDR(2)<br />
+Description: Load byte to VDST from VADDR address with zero extending.<br />
+Operation:<br />
+<code>VDST = *(UINT8*)VADDR</code></p>
+Description: Load byte to VDST from memory address with zero extending.<br />
+Operation:<br />
+<code>VDST = *(UINT8*)(VADDR + INST_OFFSET)</code></p>
+<h4>FLAT_LOAD_UBYTE_D16</h4>
+<p>Opcode: 32 (0x20) for GCN 1.4<br />
+Syntax: FLAT_LOAD_UBYTE_D16 VDST, VADDR(2)<br />
+Description: Load byte to lower 16-bit part of VDST from
+memory address with zero extending.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+VDST = ((UINT16)*(UINT8*)VM) | (VDST&amp;0xffff0000)</code></p>
+<h4>FLAT_LOAD_UBYTE_D16_HI</h4>
+<p>Opcode: 33 (0x21) for GCN 1.4<br />
+Syntax: FLAT_LOAD_UBYTE_D16_HI VDST, VADDR(2)<br />
+Description: Load byte to higher 16-bit part of VDST from
+memory address with zero extending.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+VDST = (((UINT32)*(UINT8*)VM)&lt;&lt;16) | (VDST&amp;0xffff)</code></p>
 <h4>FLAT_LOAD_USHORT</h4>
 <p>Opcode: 10 (0xa) for GCN 1.1; 18 (0x12) for GCN 1.2<br />
+<p>Opcode: 10 (0xa) for GCN 1.1; 18 (0x12) for GCN 1.2/1.4<br />
 Syntax: FLAT_LOAD_USHORT VDST, VADDR(2)<br />
 Description: Load 16-bit word to VDST from VADDR address with zero extending.<br />
 Operation:<br />
 <code>VDST = *(UINT16*)VADDR</code></p>
+Description: Load 16-bit word to VDST from memory address with zero extending.<br />
+Operation:<br />
+<code>VDST = *(UINT16*)(VADDR + INST_OFFSET)</code></p>
 <h4>FLAT_STORE_BYTE</h4>
 <p>Opcode: 24 (0x18)<br />
 Syntax: FLAT_STORE_BYTE VADDR(2), VDATA<br />
+Description: Store byte from VDATA to VADDR address.<br />
+Operation:<br />
+<code>*(UINT8*)VADDR = VDATA&amp;0xff</code></p>
+Description: Store byte from VDATA to memory address.<br />
+Operation:<br />
+<code>*(UINT8*)(VADDR + INST_OFFSET) = VDATA&amp;0xff</code></p>
+<h4>FLAT_STORE_BYTE_D16_HI</h4>
+<p>Opcode: 25 (0x19) for GCN 1.4<br />
+Syntax: FLAT_STORE_BYTE_D16_HI VADDR(2), VDATA<br />
+Description: Store byte from 16-23 bits of VDATA to memory address.<br />
+Operation:<br />
+<code>*(UINT8*)(VADDR + INST_OFFSET) = (VDATA&gt;&gt;16)&amp;0xff</code></p>
 <h4>FLAT_STORE_DWORD</h4>
 <p>Opcode: 28 (0x1c)<br />
 Syntax: FLAT_STORE_DWORD VADDR(2), VDATA<br />
 Description: Store dword from VDATA to VADDR address.<br />
 Operation:<br />
 <code>*(UINT32*)VADDR = VDATA</code></p>
+Description: Store dword from VDATA to memory address.<br />
+Operation:<br />
+<code>*(UINT32*)(VADDR + INST_OFFSET) = VDATA</code></p>
 <h4>FLAT_STORE_DWORDX2</h4>
 <p>Opcode: 29 (0x1d)<br />
 Syntax: FLAT_STORE_DWORDX2 VADDR(2), VDATA(2)<br />
 Description: Store two dwords from VDATA to VADDR address.<br />
 Operation:<br />
 <code>*(UINT64*)VADDR = VDATA</code></p>
+Description: Store two dwords from VDATA to memory address.<br />
+Operation:<br />
+<code>*(UINT64*)(VADDR + INST_OFFSET) = VDATA</code></p>
 <h4>FLAT_STORE_DWORDX3</h4>
 <p>Opcode: 31 (0x1f) for GCN 1.1; 30 (0x1e) for GCN 1.2<br />
+<p>Opcode: 31 (0x1f) for GCN 1.1; 30 (0x1e) for GCN 1.2/1.4<br />
 Syntax: FLAT_STORE_DWORDX3 VADDR(2), VDATA(3)<br />
+Description: Store three dwords from VDATA to VADDR address.<br />
+Operation:<br />
+<code>*(UINT32*)(VADDR) = VDATA[0]
+*(UINT32*)(VADDR+4) = VDATA[1]
+*(UINT32*)(VADDR+8) = VDATA[2]</code></p>
+Description: Store three dwords from VDATA to memory address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+*(UINT32*)(VM) = VDATA[0]
+*(UINT32*)(VM+4) = VDATA[1]
+*(UINT32*)(VM+8) = VDATA[2]</code></p>
 <h4>FLAT_STORE_DWORDX4</h4>
 <p>Opcode: 30 (0x1e) for GCN 1.1; 31 (0x1f) for GCN 1.2<br />
+<p>Opcode: 30 (0x1e) for GCN 1.1; 31 (0x1f) for GCN 1.2/1.4<br />
 Syntax: FLAT_STORE_DWORDX4 VADDR(2), VDATA(4)<br />
 Description: Store four dwords from VDATA to VADDR address.<br />
 Operation:<br />
 <code>*(UINT32*)(VADDR) = VDATA[0]
 *(UINT32*)(VADDR+4) = VDATA[1]
 *(UINT32*)(VADDR+8) = VDATA[2]
 *(UINT32*)(VADDR+12) = VDATA[3]</code></p>
+Description: Store four dwords from VDATA to memory address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + INST_OFFSET)
+*(UINT32*)(VM) = VDATA[0]
+*(UINT32*)(VM+4) = VDATA[1]
+*(UINT32*)(VM+8) = VDATA[2]
+*(UINT32*)(VM+12) = VDATA[3]</code></p>
 <h4>FLAT_STORE_SHORT</h4>
 <p>Opcode: 26 (0x1a)<br />
 Syntax: FLAT_STORE_SHORT VADDR(2), VDATA<br />
+Description: Store 16-bit word from VDATA to VADDR address.<br />
+Operation:<br />
+<code>*(UINT16*)VADDR = VDATA&amp;0xffff</code></p>
+Description: Store 16-bit word from VDATA to memory address.<br />
+Operation:<br />
+<code>*(UINT16*)(VADDR + INST_OFFSET) = VDATA&amp;0xffff</code></p>
+<h4>FLAT_STORE_SHORT_D16_HI</h4>
+<p>Opcode: 27 (0x1b) for GCN 1.4<br />
+Syntax: FLAT_STORE_SHORT_D16_HI VADDR(2), VDATA<br />
+Description: Store 16-bit word from higher 16-bit part of VDATA to memory address.<br />
+Operation:<br />
+<code>*(UINT16*)(VADDR + INST_OFFSET) = VDATA&gt;&gt;16</code></p>
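<p>The two D16_HI stores take their data from the upper half of VDATA instead of the lower half. A short C sketch of the bit selection follows; the function names are hypothetical, and <code>vm</code> is assumed to point at the already computed memory address.</p>

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of FLAT_STORE_BYTE_D16_HI: store bits 16-23 of vdata. */
void store_byte_d16_hi(uint8_t *vm, uint32_t vdata)
{
    *vm = (uint8_t)((vdata >> 16) & 0xffu);
}

/* Hypothetical model of FLAT_STORE_SHORT_D16_HI: store bits 16-31 of vdata. */
void store_short_d16_hi(uint8_t *vm, uint32_t vdata)
{
    uint16_t h = (uint16_t)(vdata >> 16);
    memcpy(vm, &h, sizeof h);
}
```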
+<h4>GLOBAL_ATOMIC_ADD</h4>
+<p>Opcode: 66 (0x42) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_ADD VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Add VDATA to value of global address, and store result to this address.
+If GLC flag is set then return previous value from this address to VDST,
+otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = *VM + VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
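<p>All GLOBAL instructions form their address as VADDR + SADDR + INST_OFFSET, where SADDR may be given as OFF. The sketch below models that sum under the assumption, taken from the encoding table, that the SADDR field value 0x7f disables the scalar part. The function name and its parameters are hypothetical, and decoding of the OFFSET field itself is not modelled: the offset is passed in already decoded.</p>

```c
#include <stdint.h>

/* SADDR field value that disables the scalar offset (OFF in the syntax). */
#define SADDR_OFF 0x7fu

/* Hypothetical model of the GLOBAL address computation for one lane.
 * vaddr       - 64-bit value held in the VADDR register pair
 * saddr_field - raw SADDR field from the instruction encoding
 * saddr_value - 64-bit value held in the SGPR pair named by saddr_field
 * inst_offset - INST_OFFSET, assumed to be decoded by the caller */
uint64_t global_address(uint64_t vaddr, uint32_t saddr_field,
                        uint64_t saddr_value, uint64_t inst_offset)
{
    uint64_t scalar = (saddr_field == SADDR_OFF) ? 0u : saddr_value;
    return vaddr + scalar + inst_offset;
}
```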
+<h4>GLOBAL_ATOMIC_ADD_X2</h4>
+<p>Opcode: 98 (0x62) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_ADD_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Add 64-bit VDATA to 64-bit value of global address, and store result
+to this address. If GLC flag is set then return previous value from address to VDST,
+otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = *VM + VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_AND</h4>
+<p>Opcode: 72 (0x48) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_AND VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Do bitwise AND on VDATA and value of global address,
+and store result to this address. If GLC flag is set then return previous value
+from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = *VM &amp; VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_AND_X2</h4>
+<p>Opcode: 104 (0x68) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_AND_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Do 64-bit bitwise AND on VDATA and value of global address,
+and store result to this address. If GLC flag is set then return previous value
+from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = *VM &amp; VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_CMPSWAP</h4>
+<p>Opcode: 65 (0x41) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_CMPSWAP VDST, VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Store lower VDATA dword into global address if previous value
+from that address is equal to VDATA&gt;&gt;32, otherwise keep old value from address.
+If GLC flag is set then return previous value from address to VDST,
+otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = *VM==(VDATA&gt;&gt;32) ? VDATA&amp;0xffffffff : *VM // part of atomic
+VDST = (GLC) ? P : VDST // last part of atomic</code></p>
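<p>The compare-and-swap operand packing (compare value in the upper dword of VDATA, new value in the lower dword) can be sketched with C11 atomics. This is an illustrative one-lane model only; <code>global_atomic_cmpswap</code> is a hypothetical name, and <code>vm</code> is assumed to point at the already computed global address.</p>

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-lane model of GLOBAL_ATOMIC_CMPSWAP.
 * vdata packs the compare value in bits 32-63 and the new value in bits 0-31. */
uint32_t global_atomic_cmpswap(_Atomic uint32_t *vm, uint64_t vdata,
                               bool glc, uint32_t vdst)
{
    uint32_t expected = (uint32_t)(vdata >> 32);
    uint32_t desired = (uint32_t)(vdata & 0xffffffffu);
    /* On success 'expected' already equals the previous value; on failure
     * the call overwrites it with the current value. Either way it ends up
     * holding the value that was in memory before the operation. */
    atomic_compare_exchange_strong(vm, &expected, desired);
    return glc ? expected : vdst;
}
```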
+<h4>GLOBAL_ATOMIC_CMPSWAP_X2</h4>
+<p>Opcode: 97 (0x61) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_CMPSWAP_X2 VDST(2), VADDR(2), VDATA(4), SADDR(2)|OFF<br />
+Description: Store lower VDATA 64-bit word into global address if previous value
+from address is equal to VDATA&gt;&gt;64, otherwise keep old value from address.
+If GLC flag is set then return previous value from address to VDST,
+otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = *VM==(VDATA[2:3]) ? VDATA[0:1] : *VM // part of atomic
+VDST = (GLC) ? P : VDST // last part of atomic</code></p>
+<h4>GLOBAL_ATOMIC_DEC</h4>
+<p>Opcode: 76 (0x4c) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_DEC VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Compare value from global address and if less than or equal to VDATA
+and this value is not zero, then decrement value from global address,
+otherwise store VDATA to this address. If GLC flag is set then return previous value
+from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = (*VM &lt;= VDATA &amp;&amp; *VM!=0) ? *VM-1 : VDATA // atomic
+VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_DEC_X2</h4>
+<p>Opcode: 108 (0x6c) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_DEC_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Compare 64-bit value from global address and if less than or equal to VDATA
+and this value is not zero, then decrement value from global address,
+otherwise store VDATA to this address. If GLC flag is set then return previous value
+from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = (*VM &lt;= VDATA &amp;&amp; *VM!=0) ? *VM-1 : VDATA // atomic
+VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_INC</h4>
+<p>Opcode: 75 (0x4b) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_INC VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Compare value from global address and if less than VDATA,
+then increment value from address, otherwise store zero to address.
+If GLC flag is set then return previous value from this address to VDST,
+otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = (*VM &lt; VDATA) ? *VM+1 : 0; VDST = (GLC) ? P : VDST // atomic</code></p>
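<p>The INC and DEC atomics are wrapping counters bounded by VDATA, which has no direct C11 equivalent. One way to model them on the host is a compare-exchange loop, as in the sketch below. The function names are hypothetical, both return the previous memory value (what VDST would receive with GLC set), and <code>vm</code> is assumed to point at the already computed global address.</p>

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of GLOBAL_ATOMIC_INC: count up, wrap to 0 at vdata. */
uint32_t atomic_inc_wrap(_Atomic uint32_t *vm, uint32_t vdata)
{
    uint32_t old = atomic_load(vm);
    uint32_t next;
    do {
        next = (old < vdata) ? old + 1u : 0u;
        /* On failure 'old' is refreshed with the current memory value. */
    } while (!atomic_compare_exchange_weak(vm, &old, next));
    return old;
}

/* Hypothetical model of GLOBAL_ATOMIC_DEC: count down, reload vdata at 0
 * or when the stored value is above vdata. */
uint32_t atomic_dec_wrap(_Atomic uint32_t *vm, uint32_t vdata)
{
    uint32_t old = atomic_load(vm);
    uint32_t next;
    do {
        next = (old <= vdata && old != 0u) ? old - 1u : vdata;
    } while (!atomic_compare_exchange_weak(vm, &old, next));
    return old;
}
```

<p>With VDATA = 2, repeated increments cycle the memory value through 0, 1, 2, 0 and so on.</p>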
+<h4>GLOBAL_ATOMIC_INC_X2</h4>
+<p>Opcode: 107 (0x6b) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_INC_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Compare 64-bit value from global address and if less than VDATA,
+then increment value from address, otherwise store zero to address.
+If GLC flag is set then return previous value from this address to VDST,
+otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = (*VM &lt; VDATA) ? *VM+1 : 0; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_OR</h4>
+<p>Opcode: 73 (0x49) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_OR VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Do bitwise OR on VDATA and value of global address,
+and store result to this address. If GLC flag is set then return previous value
+from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = *VM | VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_OR_X2</h4>
+<p>Opcode: 105 (0x69) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_OR_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Do 64-bit bitwise OR on VDATA and value of global address,
+and store result to this address. If GLC flag is set then return previous value
+from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = *VM | VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_SMAX</h4>
+<p>Opcode: 70 (0x46) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_SMAX VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Choose greatest signed 32-bit value from VDATA and from global address,
+and store result to this address.
+If GLC flag is set then return previous value from this address to VDST, otherwise keep
+VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>INT32* VM = (INT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = MAX(*VM, (INT32)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_SMAX_X2</h4>
+<p>Opcode: 102 (0x66) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_SMAX_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Choose greatest signed 64-bit value from VDATA and from global address,
+and store result to this address.
+If GLC flag is set then return previous value from this address to VDST, otherwise keep
+VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>INT64* VM = (INT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = MAX(*VM, (INT64)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_SMIN</h4>
+<p>Opcode: 68 (0x44) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_SMIN VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Choose smallest signed 32-bit value from VDATA and from global address,
+and store result to this address.
+If GLC flag is set then return previous value from this address to VDST, otherwise keep
+VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>INT32* VM = (INT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = MIN(*VM, (INT32)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_SMIN_X2</h4>
+<p>Opcode: 100 (0x64) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_SMIN_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Choose smallest signed 64-bit value from VDATA and from global address,
+and store result to this address.
+If GLC flag is set then return previous value from this address to VDST, otherwise keep
+VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>INT64* VM = (INT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = MIN(*VM, (INT64)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_SUB</h4>
+<p>Opcode: 67 (0x43) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_SUB VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Subtract VDATA from value of global address, and store result to this address.
+If GLC flag is set then return previous value from this address to VDST,
+otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = *VM - VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_SUB_X2</h4>
+<p>Opcode: 99 (0x63) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_SUB_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Subtract 64-bit VDATA from 64-bit value of global address, and store result
+to this address. If GLC flag is set then return previous value from address to VDST,
+otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = *VM - VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_SWAP</h4>
+<p>Opcode: 64 (0x40) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_SWAP VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Store VDATA dword into global address. If GLC flag is set then
+return previous value from global address to VDST, otherwise keep old value from VDST.
+Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_SWAP_X2</h4>
+<p>Opcode: 96 (0x60) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_SWAP_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Store VDATA 64-bit word into global address. If GLC flag is set then
+return previous value from global address to VDST, otherwise keep old value from VDST.
+Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_UMAX</h4>
+<p>Opcode: 71 (0x47) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_UMAX VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Choose greatest unsigned 32-bit value from VDATA and from global address,
+and store result to this address.
+If GLC flag is set then return previous value from this address to VDST, otherwise keep
+VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = MAX(*VM, (UINT32)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_UMAX_X2</h4>
+<p>Opcode: 103 (0x67) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_UMAX_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Choose greatest unsigned 64-bit value from VDATA and from global address,
+and store result to this address.
+If GLC flag is set then return previous value from this address to VDST, otherwise keep
+VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = MAX(*VM, (UINT64)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_UMIN</h4>
+<p>Opcode: 69 (0x45) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_UMIN VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Choose smallest unsigned 32-bit value from VDATA and from global address,
+and store result to this address.
+If GLC flag is set then return previous value from this address to VDST, otherwise keep
+VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = MIN(*VM, (UINT32)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
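<p>The signed and unsigned MIN/MAX atomics differ only in how the same 32 bits are compared. The sketch below contrasts SMIN and UMIN using a compare-exchange loop, since C11 provides no atomic minimum. The function names are hypothetical, and the signed cast assumes a two's complement host.</p>

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of GLOBAL_ATOMIC_SMIN: compare as signed 32-bit. */
uint32_t atomic_smin(_Atomic uint32_t *vm, uint32_t vdata)
{
    uint32_t old = atomic_load(vm);
    uint32_t next;
    do {
        next = ((int32_t)old < (int32_t)vdata) ? old : vdata;
    } while (!atomic_compare_exchange_weak(vm, &old, next));
    return old; /* previous value, as returned to VDST when GLC is set */
}

/* Hypothetical model of GLOBAL_ATOMIC_UMIN: compare as unsigned 32-bit. */
uint32_t atomic_umin(_Atomic uint32_t *vm, uint32_t vdata)
{
    uint32_t old = atomic_load(vm);
    uint32_t next;
    do {
        next = (old < vdata) ? old : vdata;
    } while (!atomic_compare_exchange_weak(vm, &old, next));
    return old;
}
```

<p>For a memory word of 0xffffffff and VDATA = 1, the signed variant keeps 0xffffffff (it reads as -1), while the unsigned variant stores 1.</p>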
+<h4>GLOBAL_ATOMIC_UMIN_X2</h4>
+<p>Opcode: 101 (0x65) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_UMIN_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Choose smallest unsigned 64-bit value from VDATA and from global address,
+and store result to this address.
+If GLC flag is set then return previous value from this address to VDST, otherwise keep
+VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = MIN(*VM, (UINT64)VDATA); VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_XOR</h4>
+<p>Opcode: 74 (0x4a) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_XOR VDST, VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Do bitwise XOR on VDATA and value of global address,
+and store result to this address. If GLC flag is set then return previous value
+from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT32* VM = (UINT32*)(VADDR + SADDR + INST_OFFSET)
+UINT32 P = *VM; *VM = *VM ^ VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_ATOMIC_XOR_X2</h4>
+<p>Opcode: 106 (0x6a) for GCN 1.4<br />
+Syntax: GLOBAL_ATOMIC_XOR_X2 VDST(2), VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Do 64-bit bitwise XOR on VDATA and value of global address,
+and store result to this address. If GLC flag is set then return previous value
+from this address to VDST, otherwise keep VDST value. Operation is atomic.<br />
+Operation:<br />
+<code>UINT64* VM = (UINT64*)(VADDR + SADDR + INST_OFFSET)
+UINT64 P = *VM; *VM = *VM ^ VDATA; VDST = (GLC) ? P : VDST // atomic</code></p>
+<h4>GLOBAL_LOAD_DWORD</h4>
+<p>Opcode: 20 (0x14) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_DWORD VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load dword to VDST from global address.<br />
+Operation:<br />
+<code>VDST = *(UINT32*)(VADDR + SADDR + INST_OFFSET)</code></p>
+<h4>GLOBAL_LOAD_DWORDX2</h4>
+<p>Opcode: 21 (0x15) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_DWORDX2 VDST(2), VADDR(2), SADDR(2)|OFF<br />
+Description: Load two dwords to VDST from global address.<br />
+Operation:<br />
+<code>VDST = *(UINT64*)(VADDR + SADDR + INST_OFFSET)</code></p>
+<h4>GLOBAL_LOAD_DWORDX3</h4>
+<p>Opcode: 22 (0x16) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_DWORDX3 VDST(3), VADDR(2), SADDR(2)|OFF<br />
+Description: Load three dwords to VDST from global address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+VDST[0] = *(UINT32*)VM
+VDST[1] = *(UINT32*)(VM+4)
+VDST[2] = *(UINT32*)(VM+8)</code></p>
+<h4>GLOBAL_LOAD_DWORDX4</h4>
+<p>Opcode: 23 (0x17) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_DWORDX4 VDST(4), VADDR(2), SADDR(2)|OFF<br />
+Description: Load four dwords to VDST from global address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+VDST[0] = *(UINT32*)VM
+VDST[1] = *(UINT32*)(VM+4)
+VDST[2] = *(UINT32*)(VM+8)
+VDST[3] = *(UINT32*)(VM+12)</code></p>
+<h4>GLOBAL_LOAD_SBYTE</h4>
+<p>Opcode: 17 (0x11) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_SBYTE VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load byte to VDST from global address with sign extending.<br />
+Operation:<br />
+<code>VDST = *(INT8*)(VADDR + SADDR + INST_OFFSET)</code></p>
+<h4>GLOBAL_LOAD_SBYTE_D16</h4>
+<p>Opcode: 34 (0x22) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_SBYTE_D16 VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load byte to lower 16-bit part of VDST from
+global address with sign extending.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+VDST = ((UINT16)*(INT8*)VM) | (VDST&amp;0xffff0000)</code></p>
+<h4>GLOBAL_LOAD_SBYTE_D16_HI</h4>
+<p>Opcode: 35 (0x23) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_SBYTE_D16_HI VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load byte to higher 16-bit part of VDST from
+global address with sign extending.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+VDST = (((UINT32)*(INT8*)VM)&lt;&lt;16) | (VDST&amp;0xffff)</code></p>
+<h4>GLOBAL_LOAD_SHORT_D16</h4>
+<p>Opcode: 36 (0x24) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_SHORT_D16 VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load 16-bit word to lower 16-bit part of VDST from global address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+VDST = *(UINT16*)VM | (VDST &amp; 0xffff0000)</code></p>
+<h4>GLOBAL_LOAD_SHORT_D16_HI</h4>
+<p>Opcode: 37 (0x25) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_SHORT_D16_HI VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load 16-bit word to higher 16-bit part of VDST from global address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+VDST = (((UINT32)*(UINT16*)VM)&lt;&lt;16) | (VDST &amp; 0xffff)</code></p>
+<h4>GLOBAL_LOAD_SSHORT</h4>
+<p>Opcode: 19 (0x13) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_SSHORT VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load 16-bit word to VDST from global address with sign extending.<br />
+Operation:<br />
+<code>VDST = *(INT16*)(VADDR + SADDR + INST_OFFSET)</code></p>
+<h4>GLOBAL_LOAD_UBYTE</h4>
+<p>Opcode: 16 (0x10) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_UBYTE VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load byte to VDST from global address with zero extending.<br />
+Operation:<br />
+<code>VDST = *(UINT8*)(VADDR + SADDR + INST_OFFSET)</code></p>
+<h4>GLOBAL_LOAD_UBYTE_D16</h4>
+<p>Opcode: 32 (0x20) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_UBYTE_D16 VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load byte to lower 16-bit part of VDST from
+global address with zero extending.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+VDST = ((UINT16)*(UINT8*)VM) | (VDST&amp;0xffff0000)</code></p>
+<h4>GLOBAL_LOAD_UBYTE_D16_HI</h4>
+<p>Opcode: 33 (0x21) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_UBYTE_D16_HI VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load byte to higher 16-bit part of VDST from
+global address with zero extending.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+VDST = (((UINT32)*(UINT8*)VM)&lt;&lt;16) | (VDST&amp;0xffff)</code></p>
+<h4>GLOBAL_LOAD_USHORT</h4>
+<p>Opcode: 18 (0x12) for GCN 1.4<br />
+Syntax: GLOBAL_LOAD_USHORT VDST, VADDR(2), SADDR(2)|OFF<br />
+Description: Load 16-bit word to VDST from global address with zero extending.<br />
+Operation:<br />
+<code>VDST = *(UINT16*)(VADDR + SADDR + INST_OFFSET)</code></p>
+<h4>GLOBAL_STORE_BYTE</h4>
+<p>Opcode: 24 (0x18) for GCN 1.4<br />
+Syntax: GLOBAL_STORE_BYTE VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Store byte from VDATA to global address.<br />
+Operation:<br />
+<code>*(UINT8*)(VADDR + SADDR + INST_OFFSET) = VDATA&amp;0xff</code></p>
+<h4>GLOBAL_STORE_BYTE_D16_HI</h4>
+<p>Opcode: 25 (0x19) for GCN 1.4<br />
+Syntax: GLOBAL_STORE_BYTE_D16_HI VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Store byte from 16-23 bits of VDATA to global address.<br />
+Operation:<br />
+<code>*(UINT8*)(VADDR + SADDR + INST_OFFSET) = (VDATA&gt;&gt;16)&amp;0xff</code></p>
+<h4>GLOBAL_STORE_DWORD</h4>
+<p>Opcode: 28 (0x1c) for GCN 1.4<br />
+Syntax: GLOBAL_STORE_DWORD VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Store dword from VDATA to global address.<br />
+Operation:<br />
+<code>*(UINT32*)(VADDR + SADDR + INST_OFFSET) = VDATA</code></p>
+<h4>GLOBAL_STORE_DWORDX2</h4>
+<p>Opcode: 29 (0x1d) for GCN 1.4<br />
+Syntax: GLOBAL_STORE_DWORDX2 VADDR(2), VDATA(2), SADDR(2)|OFF<br />
+Description: Store two dwords from VDATA to global address.<br />
+Operation:<br />
+<code>*(UINT64*)(VADDR + SADDR + INST_OFFSET) = VDATA</code></p>
+<h4>GLOBAL_STORE_DWORDX3</h4>
+<p>Opcode: 30 (0x1e) for GCN 1.4<br />
+Syntax: GLOBAL_STORE_DWORDX3 VADDR(2), VDATA(3), SADDR(2)|OFF<br />
+Description: Store three dwords from VDATA to global address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+*(UINT32*)(VM) = VDATA[0]
+*(UINT32*)(VM+4) = VDATA[1]
+*(UINT32*)(VM+8) = VDATA[2]</code></p>
+<h4>GLOBAL_STORE_DWORDX4</h4>
+<p>Opcode: 31 (0x1f) for GCN 1.4<br />
+Syntax: GLOBAL_STORE_DWORDX4 VADDR(2), VDATA(4), SADDR(2)|OFF<br />
+Description: Store four dwords from VDATA to global address.<br />
+Operation:<br />
+<code>BYTE* VM = (VADDR + SADDR + INST_OFFSET)
+*(UINT32*)(VM) = VDATA[0]
+*(UINT32*)(VM+4) = VDATA[1]
+*(UINT32*)(VM+8) = VDATA[2]
+*(UINT32*)(VM+12) = VDATA[3]</code></p>
+<h4>GLOBAL_STORE_SHORT</h4>
+<p>Opcode: 26 (0x1a) for GCN 1.4<br />
+Syntax: GLOBAL_STORE_SHORT VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Store 16-bit word from VDATA to global address.<br />
+Operation:<br />
+<code>*(UINT16*)(VADDR + SADDR + INST_OFFSET) = VDATA&amp;0xffff</code></p>
+<h4>GLOBAL_STORE_SHORT_D16_HI</h4>
+<p>Opcode: 27 (0x1b) for GCN 1.4<br />
+Syntax: GLOBAL_STORE_SHORT_D16_HI VADDR(2), VDATA, SADDR(2)|OFF<br />
+Description: Store 16-bit word from higher 16-bit part of VDATA to global address.<br />
+Operation:<br />
+<code>*(UINT16*)(VADDR + SADDR + INST_OFFSET) = VDATA&gt;&gt;16</code></p>
 }}}