[wiki:ClrxToc Back to Table of contents]
{{{
#!html
<h2>AMD Catalyst OpenCL 2.0 ABI description</h2>
<p>This chapter describes how a kernel receives its arguments and how it accesses constant data.
Because the kernel setup is an AMD HSA configuration, we recommend referring to the ROCm-ABI
documentation for information about kernel setup and kernel argument passing. The assembler now
provides all the AMD HSA configuration pseudo-ops needed for this.</p>
<p>In this chapter, sizes are given in dwords. A dword is a 4-byte value.</p>
<h3>Passing options</h3>
<p>The CLRX assembler lets you choose, in the kernel configuration, which features the kernel uses.
The following features can be enabled:</p>
<ul>
<li>usesetup - use sizes information. Adds the kernel setup and sizes buffer
to the user data registers.</li>
<li>useargs - the kernel uses arguments. Adds the kernel arguments to the user data registers.</li>
<li>useenqueue - enable enqueue mechanism support.</li>
<li>usegeneric - enable generic pointer support.</li>
</ul>
<p>The number of user data registers depends on the set of enabled features. The following rules
apply:</p>
<ul>
<li>If no feature is enabled, only 4 user data registers are used.</li>
<li>If useargs is enabled, 6 user data registers are used. User data registers 4-5 hold
the argument pointer.</li>
<li>If usesetup is enabled, 8 user data registers are used. User data registers 4-5 hold
the kernel setup pointer and registers 6-7 hold the argument pointer.</li>
<li>If useenqueue is enabled, 10 user data registers are used. User data registers 4-5 hold
the kernel setup pointer and registers 6-7 hold the argument pointer.</li>
<li>If usegeneric is enabled, 12 user data registers are used. User data registers 4-5 hold
the kernel setup pointer and registers 8-9 hold the argument pointer.</li>
<li>For the VEGA (GFX9) architecture, 10 user data registers are used. User data registers 4-5 hold
the kernel setup pointer and registers 6-7 hold the argument pointer.</li>
</ul>
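<p>The rules above can be sketched as a small lookup. This is only an illustration: the function
names are hypothetical, and the text does not say how the rules combine when several features are
enabled or how the GFX9 rule interacts with them. The sketch assumes that the feature with the
largest register count wins and that GFX9 always uses 10 registers.</p>

```python
# Register counts come from the rules above.
# Assumption: when several features are enabled, the largest count wins.
FEATURE_REG_COUNTS = [
    ("usegeneric", 12),
    ("useenqueue", 10),
    ("usesetup", 8),
    ("useargs", 6),
]


def user_data_reg_count(features, gfx9=False):
    """Return the assumed number of user data registers for a feature set."""
    if gfx9:
        # Assumption: the GFX9 rule overrides the feature-based rules.
        return 10
    for name, count in FEATURE_REG_COUNTS:
        if name in features:
            return count
    return 4


def arg_pointer_regs(features, gfx9=False):
    """Return the user data register pair holding the argument pointer, or None."""
    if not gfx9 and "usegeneric" in features:
        return (8, 9)
    if gfx9 or "usesetup" in features or "useenqueue" in features:
        return (6, 7)
    if "useargs" in features:
        return (4, 5)
    return None
```

<p>For example, under these assumptions <code>user_data_reg_count({"useargs", "usesetup"})</code>
yields 8 and <code>arg_pointer_regs({"useargs"})</code> yields the pair (4, 5).</p>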
<h3>Argument passing and kernel setup</h3>
<p>The first pointer present in the user data registers is the kernel setup pointer.
It points to the setup buffer, which holds the kernel execution setup. The buffer contains
the following dwords:</p>
<ul>
<li>dword 0 - general setup. Bits 16-31 - number of dimensions</li>
<li>dword 1 - enqueued local size ??. Bits 0-15 - local size X, bits 16-31 - local size Y</li>
<li>dword 2 - enqueued local size. Bits 0-15 - local size Z</li>
<li>dwords 3-5 - global size for each dimension</li>
</ul>
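<p>As an illustration of the layout above, the following sketch decodes the first six dwords of
a setup buffer that has been copied to the host as little-endian bytes. The function name, the
result keys and the example values are hypothetical; only the bit positions come from the list
above.</p>

```python
import struct


def decode_kernel_setup(buf):
    """Decode dwords 0-5 of a kernel setup buffer given as little-endian bytes."""
    d = struct.unpack_from("<6I", buf, 0)
    return {
        # dword 0, bits 16-31: number of dimensions
        "dims": d[0] >> 16,
        # dword 1: X in bits 0-15, Y in bits 16-31; dword 2: Z in bits 0-15
        "local_size": (d[1] & 0xFFFF, d[1] >> 16, d[2] & 0xFFFF),
        # dwords 3-5: global size for each dimension
        "global_size": (d[3], d[4], d[5]),
    }


# Made-up example: 2 dimensions, local size 16x8x1, global size 1024x512x1.
example = struct.pack("<6I", 2 << 16, (8 << 16) | 16, 1, 1024, 512, 1)
setup = decode_kernel_setup(example)
```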
<p>The second pointer is the argument pointer. It points to the argument buffer.
The first arguments in that buffer are the setup arguments:</p>
<ul>
<li>size_t global_offset_0 - 32-bit or 64-bit global offset for X</li>
<li>size_t global_offset_1 - 32-bit or 64-bit global offset for Y</li>
<li>size_t global_offset_2 - 32-bit or 64-bit global offset for Z</li>
<li>void* printf_buffer - 32-bit or 64-bit printf buffer</li>
<li>void* vqueue_pointer - 32-bit or 64-bit</li>
<li>void* aqlwrap_pointer - 32-bit or 64-bit</li>
</ul>
<p>The remaining arguments in that buffer are the user arguments defined for the kernel. Any pointer,
command queue, image, sampler, or structure takes 8 bytes (64-bit pointer), or
4 bytes (32-bit pointer) in 32-bit AMD OpenCL 2.0.
A 3-component vector takes as many bytes as a 4-component vector.
Smaller types (such as char and short) take 1-3 bytes. Alignment depends on the type itself
or, for vectors, on the element type.</p>
<p>For 64-bit AMD OpenCL 2.0 all setup arguments and pointers are 64-bit.
For 32-bit AMD OpenCL 2.0 all setup arguments and pointers are 32-bit.</p>
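<p>A minimal sketch of where the six setup arguments land in the argument buffer follows. It
assumes the setup arguments are stored back to back in the order listed, with no padding between
them; the helper names are hypothetical.</p>

```python
SETUP_ARGS = [
    "global_offset_0",
    "global_offset_1",
    "global_offset_2",
    "printf_buffer",
    "vqueue_pointer",
    "aqlwrap_pointer",
]


def setup_arg_offsets(is_64bit=True):
    """Map each setup argument to its assumed byte offset in the argument buffer."""
    size = 8 if is_64bit else 4
    return {name: index * size for index, name in enumerate(SETUP_ARGS)}


def setup_args_size(is_64bit=True):
    """Total bytes taken by the setup arguments, before any user argument."""
    return len(SETUP_ARGS) * (8 if is_64bit else 4)
```

<p>Under that assumption the user arguments start no earlier than byte 48 in 64-bit mode and
byte 24 in 32-bit mode; the alignment of the first user argument may push it further.</p>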
<h3>Image arguments</h3>
<p>Images are passed via pointers in the argument buffer. An image pointer points to
the image resource and image information. The image resource takes 8 dwords. Dword 8 holds
information about the channel data type. The following table lists the channel data type values
and their OpenCL counterparts:</p>
<table>
<thead>
<tr>
<th>Value</th>
<th>OpenCL value</th>
<th>Value</th>
<th>OpenCL value</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>CL_SNORM_INT8</td>
<td>8</td>
<td>CL_SIGNED_INT8</td>
</tr>
<tr>
<td>1</td>
<td>CL_SNORM_INT16</td>
<td>9</td>
<td>CL_SIGNED_INT16</td>
</tr>
<tr>
<td>2</td>
<td>CL_UNORM_INT8</td>
<td>10</td>
<td>CL_SIGNED_INT32</td>
</tr>
<tr>
<td>3</td>
<td>CL_UNORM_INT16</td>
<td>11</td>
<td>CL_UNSIGNED_INT8</td>
</tr>
<tr>
<td>4</td>
<td>CL_UNORM_INT24</td>
<td>12</td>
<td>CL_UNSIGNED_INT16</td>
</tr>
<tr>
<td>5</td>
<td>CL_UNORM_SHORT_555</td>
<td>13</td>
<td>CL_UNSIGNED_INT32</td>
</tr>
<tr>
<td>6</td>
<td>CL_UNORM_SHORT_565</td>
<td>14</td>
<td>CL_HALF_FLOAT</td>
</tr>
<tr>
<td>7</td>
<td>CL_UNORM_INT_101010</td>
<td>15</td>
<td>CL_FLOAT</td>
</tr>
</tbody>
</table>
<p>Before looking up the table, mask the value: (value&amp;0xf).</p>
<p>Likewise, dword 9 holds the channel order information. The following table lists the values and
their OpenCL counterparts:</p>
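<p>The masked lookup for the channel data type can be sketched as follows. The list simply
restates the table above by index; the function name is hypothetical.</p>

```python
# Index in this list is the masked value from the channel data type table.
CHANNEL_DATA_TYPES = [
    "CL_SNORM_INT8", "CL_SNORM_INT16", "CL_UNORM_INT8", "CL_UNORM_INT16",
    "CL_UNORM_INT24", "CL_UNORM_SHORT_555", "CL_UNORM_SHORT_565",
    "CL_UNORM_INT_101010", "CL_SIGNED_INT8", "CL_SIGNED_INT16",
    "CL_SIGNED_INT32", "CL_UNSIGNED_INT8", "CL_UNSIGNED_INT16",
    "CL_UNSIGNED_INT32", "CL_HALF_FLOAT", "CL_FLOAT",
]


def channel_data_type(dword8):
    """Return the OpenCL channel data type name for the raw dword value."""
    # Only the low 4 bits select the table entry.
    return CHANNEL_DATA_TYPES[dword8 & 0xF]
```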
<table>
<thead>
<tr>
<th>Value</th>
<th>OpenCL value</th>
<th>Value</th>
<th>OpenCL value</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>CL_A</td>
<td>10</td>
<td>CL_ARGB</td>
</tr>
<tr>
<td>1</td>
<td>CL_R</td>
<td>11</td>
<td>CL_ABGR</td>
</tr>
<tr>
<td>2</td>
<td>CL_Rx</td>
<td>12</td>
<td>CL_sRGB</td>
</tr>
<tr>
<td>3</td>
<td>CL_RG</td>
<td>13</td>
<td>CL_sRGBx</td>
</tr>
<tr>
<td>4</td>
<td>CL_RGx</td>
<td>14</td>
<td>CL_sRGBA</td>
</tr>
<tr>
<td>5</td>
<td>CL_RA</td>
<td>15</td>
<td>CL_sBGRA</td>
</tr>
<tr>
<td>6</td>
<td>CL_RGB</td>
<td>16</td>
<td>CL_INTENSITY</td>
</tr>
<tr>
<td>7</td>
<td>CL_RGBx</td>
<td>17</td>
<td>CL_LUMINANCE</td>
</tr>
<tr>
<td>8</td>
<td>CL_RGBA</td>
<td>18</td>
<td>CL_DEPTH</td>
</tr>
<tr>
<td>9</td>
<td>CL_BGRA</td>
<td>19</td>
<td>CL_DEPTH_STENCIL</td>
</tr>
</tbody>
</table>
<p>Before looking up the table, mask the value: (value&amp;0x1f).</p>
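<p>A similar sketch for the channel order follows. The mask leaves 32 possible values while the
table above lists only 20, so this hypothetical helper returns <code>None</code> for values 20-31,
whose meaning is not documented here.</p>

```python
# Index in this list is the masked value from the channel order table.
CHANNEL_ORDERS = [
    "CL_A", "CL_R", "CL_Rx", "CL_RG", "CL_RGx", "CL_RA", "CL_RGB",
    "CL_RGBx", "CL_RGBA", "CL_BGRA", "CL_ARGB", "CL_ABGR", "CL_sRGB",
    "CL_sRGBx", "CL_sRGBA", "CL_sBGRA", "CL_INTENSITY", "CL_LUMINANCE",
    "CL_DEPTH", "CL_DEPTH_STENCIL",
]


def channel_order(dword9):
    """Return the OpenCL channel order name, or None if the value is not listed."""
    index = dword9 & 0x1F  # only the low 5 bits select the table entry
    if index >= len(CHANNEL_ORDERS):
        return None
    return CHANNEL_ORDERS[index]
```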
<h3>Sampler arguments</h3>
<p>Samplers are passed via pointers. A sampler pointer points to the sampler resource.</p>
<h3>Scratch buffer access</h3>
<p>The first four scalar registers hold the scratch buffer descriptor. Refer to
<a href="GcnState">GCN Machine State</a> to learn about the initial vector and scalar registers.</p>
<h3>Flat access</h3>
<p>By default, FLAT instructions read or write values in main memory.
Generic addressing (usegeneric) allows FLAT instructions to access LDS and the scratch buffer
as well. The following rules describe how to set up that mechanism correctly.
Registers S[6-7] point to a special buffer that holds the LDS and scratch buffer base addresses
for FLAT instructions.
Dword 16 of that buffer holds bits 32-63 of the LDS base address for FLAT instructions.
Dword 17 of that buffer holds bits 32-63 of the scratch buffer base address for
FLAT instructions.
Register S10 holds the base scratch buffer offset for FLAT_SCRATCH. Register S11 holds
the scratch size per thread (for FLAT_SCRATCH).</p>
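<p>The following sketch shows how the two base addresses could be rebuilt on the host from a copy
of that special buffer, given as a list of dwords. The text only states which dwords hold bits
32-63, so the sketch assumes that bits 0-31 of both base addresses are zero; the function name
and the example values are hypothetical.</p>

```python
def flat_aperture_bases(buffer_dwords):
    """Return (lds_base, scratch_base) built from dwords 16 and 17."""
    lds_hi = buffer_dwords[16] & 0xFFFFFFFF
    scratch_hi = buffer_dwords[17] & 0xFFFFFFFF
    # Assumption: bits 0-31 of both base addresses are zero.
    return lds_hi << 32, scratch_hi << 32


# Made-up buffer contents, for illustration only.
example = [0] * 18
example[16] = 0x00001000
example[17] = 0x00002000
lds_base, scratch_base = flat_aperture_bases(example)
```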
}}}