Changes between Version 8 and Version 9 of GcnTimings

Timestamp:: 05/26/16 18:00:32 (8 years ago)
Author:: trac
Comment:: --

GcnTimings

-                      v8
+                      v9
 <h3>Preliminary explanations</h3>
 <p>Almost all instructions (scalar and vector) are executed within 4 cycles. Hence, to
-achieve maximum performance, 4 wavefront per compute units must be ran. </p>
+achieve maximum performance, 4 wavefronts per compute unit must be run.</p>
 <p>NOTE: A simple single-dword (4-byte) instruction is executed in 4 cycles (thanks to fast
 dispatching from cache). However, a 2-dword instruction can require 4 extra cycles
 to execute due to its bigger size in memory and the limits of instruction dispatching.
 To achieve the best performance, we recommend using single-dword instructions.</p>
+<p>The 'Delay' column gives each instruction's delay (how many cycles are needed to execute
+the instruction). The 'Throughput' column gives each instruction's throughput (the maximum
+number of instructions per cycle).</p>
+<h3>Occupancy table</h3>
+<table>
+<thead>
+<tr>
+<th>Waves</th>
+<th>SGPRs</th>
+<th>VGPRs</th>
+<th>LdsW/I</th>
+<th>Issue</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>1</td>
+<td>128</td>
+<td>256</td>
+<td>64</td>
+<td>1</td>
+</tr>
+<tr>
+<td>2</td>
+<td>128</td>
+<td>128</td>
+<td>32</td>
+<td>2</td>
+</tr>
+<tr>
+<td>3</td>
+<td>128</td>
+<td>84</td>
+<td>21</td>
+<td>3</td>
+</tr>
+<tr>
+<td>4</td>
+<td>128</td>
+<td>64</td>
+<td>16</td>
+<td>4</td>
+</tr>
+<tr>
+<td>5</td>
+<td>96</td>
+<td>48</td>
+<td>12</td>
+<td>5</td>
+</tr>
+<tr>
+<td>6</td>
+<td>80</td>
+<td>40</td>
+<td>10</td>
+<td>5</td>
+</tr>
+<tr>
+<td>7</td>
+<td>72</td>
+<td>36</td>
+<td>9</td>
+<td>5</td>
+</tr>
+<tr>
+<td>8</td>
+<td>64</td>
+<td>32</td>
+<td>8</td>
+<td>5</td>
+</tr>
+<tr>
+<td>9</td>
+<td>56</td>
+<td>28</td>
+<td>7</td>
+<td>5</td>
+</tr>
+<tr>
+<td>10</td>
+<td>48</td>
+<td>24</td>
+<td>6</td>
+<td>5</td>
+</tr>
+</tbody>
+</table>
+<p>Waves - number of concurrent waves that can be computed by a single SIMD unit<br />
+SGPRs - maximum number of SGPRs that can be allocated at that occupancy<br />
+VGPRs - maximum number of VGPRs that can be allocated at that occupancy<br />
+LdsW/I - maximum amount of LDS space per vector lane per wavefront, in dwords<br />
+Issue - maximum number of instructions per clock that can be processed</p>
+<p>Each compute unit is partitioned into four SIMD units. So, the maximum number of waves per
+compute unit is 40.</p>
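<p>As an illustration of how the occupancy table can be read, the sketch below looks up the
highest wave count whose SGPR and VGPR limits still fit a kernel. The function names are
hypothetical, and the lookup deliberately ignores LDS usage and any register allocation
granularity, so treat it as a rough estimate rather than an exact occupancy calculation.</p>

```python
# Occupancy table copied from the text above: waves -> (max SGPRs, max VGPRs).
OCCUPANCY = {
    1: (128, 256), 2: (128, 128), 3: (128, 84), 4: (128, 64), 5: (96, 48),
    6: (80, 40), 7: (72, 36), 8: (64, 32), 9: (56, 28), 10: (48, 24),
}
SIMDS_PER_CU = 4  # each compute unit is partitioned into four SIMD units


def max_waves_per_simd(sgprs, vgprs):
    """Return the highest wave count whose register limits fit, or 0 if none fit."""
    fitting = [waves for waves, (max_sgprs, max_vgprs) in OCCUPANCY.items()
               if sgprs <= max_sgprs and vgprs <= max_vgprs]
    return max(fitting, default=0)


def max_waves_per_cu(sgprs, vgprs):
    """Scale the per-SIMD result to a whole compute unit."""
    return SIMDS_PER_CU * max_waves_per_simd(sgprs, vgprs)
```

<p>For example, a kernel using 60 SGPRs and 30 VGPRs fits the 8-wave row, while a kernel
within the 10-wave row's limits reaches the 40 waves per compute unit mentioned above.</p>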
 <h3>Instruction alignment</h3>
 <p>Alignment rules for 2-dword instructions (GCN 1.0/1.1):</p>
 …
 <p>IMPORTANT: If occupancy is greater than 1 wave per compute unit, then penalties for
 instruction fetching, branches, and scalar instructions will be masked while executing
-more waves than 4<em>CUs. For best results is recommended to execute many waves
-(multiple of 4</em>CUs) with occupancy greater than 1.</p>
+more waves than 4*CUs. For best results, it is recommended to execute many waves
+(a multiple of 4*CUs) with occupancy greater than 1.</p>
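<p>The "multiple of 4*CUs" recommendation can be applied when choosing how many waves to
launch. The helper below is a minimal sketch with a hypothetical name; it only rounds a
requested wave count up to the next multiple of 4*CUs and assumes the number of compute
units is known to the caller.</p>

```python
def round_up_waves(waves, compute_units):
    """Round a wave count up to the next multiple of 4 * compute_units."""
    if waves < 1 or compute_units < 1:
        raise ValueError("waves and compute_units must be positive")
    step = 4 * compute_units  # four SIMD units per compute unit
    return ((waves + step - 1) // step) * step
```

<p>With 8 compute units the step is 32 waves, so a request for 100 waves would be rounded
up to 128.</p>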
 <h3>Instruction scheduling</h3>
 <ul>