Changes between Version 8 and Version 9 of GcnTimings


Timestamp:
05/26/16 18:00:32 (8 years ago)
Author:
trac
Comment:

--

  • GcnTimings

    v8 v9  
    55<h3>Preliminary explanations</h3>
    66<p>Almost all instructions (scalar and vector) are executed within 4 cycles. Hence, to
    7 achieve maximum performance, 4 wavefronts per compute unit must be run. </p>
     7achieve maximum performance, 4 wavefronts per compute unit must be run.</p>
    88<p>NOTE: A simple single-dword (4-byte) instruction is executed in 4 cycles (thanks to fast
    99dispatching from cache). However, a 2-dword instruction can require 4 extra cycles
    1010to execute due to its larger size in memory and the limits of instruction dispatching.
    1111To achieve the best performance, we recommend using single-dword instructions.</p>
    12 <p>The 'Delay' column contains each instruction's delay (how many cycles are needed to execute
    13 the instruction). The 'Throughput' column contains each instruction's throughput (maximum number of
    14 instructions per cycle).</p>
     12<h3>Occupancy table</h3>
     13<table>
     14<thead>
     15<tr>
     16<th>Waves</th>
     17<th>SGPRs</th>
     18<th>VGPRs</th>
     19<th>LdsW/I</th>
     20<th>Issue</th>
     21</tr>
     22</thead>
     23<tbody>
     24<tr>
     25<td>1</td>
     26<td>128</td>
     27<td>256</td>
     28<td>64</td>
     29<td>1</td>
     30</tr>
     31<tr>
     32<td>2</td>
     33<td>128</td>
     34<td>128</td>
     35<td>32</td>
     36<td>2</td>
     37</tr>
     38<tr>
     39<td>3</td>
     40<td>128</td>
     41<td>84</td>
     42<td>21</td>
     43<td>3</td>
     44</tr>
     45<tr>
     46<td>4</td>
     47<td>128</td>
     48<td>64</td>
     49<td>16</td>
     50<td>4</td>
     51</tr>
     52<tr>
     53<td>5</td>
     54<td>96</td>
     55<td>48</td>
     56<td>12</td>
     57<td>5</td>
     58</tr>
     59<tr>
     60<td>6</td>
     61<td>80</td>
     62<td>40</td>
     63<td>10</td>
     64<td>5</td>
     65</tr>
     66<tr>
     67<td>7</td>
     68<td>72</td>
     69<td>36</td>
     70<td>9</td>
     71<td>5</td>
     72</tr>
     73<tr>
     74<td>8</td>
     75<td>64</td>
     76<td>32</td>
     77<td>8</td>
     78<td>5</td>
     79</tr>
     80<tr>
     81<td>9</td>
     82<td>56</td>
     83<td>28</td>
     84<td>7</td>
     85<td>5</td>
     86</tr>
     87<tr>
     88<td>10</td>
     89<td>48</td>
     90<td>24</td>
     91<td>6</td>
     92<td>5</td>
     93</tr>
     94</tbody>
     95</table>
     96<p>Waves - number of concurrent waves that can be computed by a single SIMD unit<br />
     97SGPRs - maximum number of SGPRs that can be allocated at that occupancy<br />
     98VGPRs - maximum number of VGPRs that can be allocated at that occupancy<br />
     99LdsW/I - maximum amount of LDS space per vector lane per wavefront, in dwords<br />
     100Issue - maximum number of instructions per clock that can be processed  </p>
     101<p>Each compute unit is partitioned into four SIMD units, so the maximum number of waves per
     102compute unit is 40 (4 SIMD units × 10 waves each).</p>
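<p>The occupancy table above can be read as a lookup: given a kernel's register and LDS usage, pick the highest row whose limits still cover it. The following is a minimal sketch of that lookup, not part of the original page. The names <code>OCCUPANCY</code>, <code>max_waves_per_simd</code> and <code>max_waves_per_cu</code> are hypothetical, and the sketch assumes the table limits apply directly, ignoring any register allocation granularity the hardware may impose.</p>

```python
# Rows copied from the occupancy table:
# (waves per SIMD, max SGPRs, max VGPRs, max LDS dwords per lane per wavefront)
OCCUPANCY = [
    (1, 128, 256, 64),
    (2, 128, 128, 32),
    (3, 128, 84, 21),
    (4, 128, 64, 16),
    (5, 96, 48, 12),
    (6, 80, 40, 10),
    (7, 72, 36, 9),
    (8, 64, 32, 8),
    (9, 56, 28, 7),
    (10, 48, 24, 6),
]

# The text states that each compute unit has four SIMD units.
SIMDS_PER_CU = 4


def max_waves_per_simd(sgprs, vgprs, lds_dwords_per_lane=0):
    """Return the highest occupancy whose limits cover the usage (0 if none)."""
    best = 0
    for waves, max_sgprs, max_vgprs, max_lds in OCCUPANCY:
        if (sgprs <= max_sgprs and vgprs <= max_vgprs
                and lds_dwords_per_lane <= max_lds):
            best = max(best, waves)
    return best


def max_waves_per_cu(sgprs, vgprs, lds_dwords_per_lane=0):
    """Scale the per-SIMD occupancy to a whole compute unit."""
    return SIMDS_PER_CU * max_waves_per_simd(sgprs, vgprs, lds_dwords_per_lane)
```

<p>For example, a kernel using 64 SGPRs and 40 VGPRs is limited by its VGPR count and fits the 6-wave row, while a kernel within 48 SGPRs, 24 VGPRs and 6 LDS dwords per lane reaches the full 10 waves per SIMD unit.</p>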
    15103<h3>Instruction alignment</h3>
    16104<p>Alignment rules for 2-dword instructions (GCN 1.0/1.1):</p>
     
    32120<p>IMPORTANT: If occupancy is greater than 1 wave per compute unit, then penalties for
    33121instruction fetching, branches, and scalar instructions will be masked while executing
    34 more waves than 4<em>CUs. For best results, it is recommended to execute many waves
    35 (a multiple of 4</em>CUs) with occupancy greater than 1.</p>
     122more waves than 4*CUs. For best results, it is recommended to execute many waves
     123(a multiple of 4*CUs) with occupancy greater than 1.</p>
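<p>As an illustration of the "multiple of 4*CUs" recommendation, the sketch below rounds a requested wave count up to the next such multiple. The function name <code>round_up_waves</code> and its parameters are hypothetical; the factor 4 is taken from the text above, and the number of compute units must be supplied by the caller.</p>

```python
def round_up_waves(requested_waves, num_cus, waves_factor=4):
    """Round a wave count up to the next multiple of waves_factor * num_cus."""
    step = waves_factor * num_cus
    if requested_waves <= 0:
        # Launch at least one full step of waves.
        return step
    return ((requested_waves + step - 1) // step) * step
```

<p>With 8 compute units the step is 32 waves, so a request for 100 waves would be rounded up to 128.</p>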
    36124<h3>Instruction scheduling</h3>
    37125<ul>