Commands such as N-D range kernel execution or DMA transfers are not directly accessible through driver API calls such as the ones described in the ‘Driver Functions’ section. Instead, they can be encoded into an array called a command buffer and passed to the refsiExecuteCommandBuffer driver function for execution. The commands will be passed to the RefSi command processor (CMP) and executed in-order.
General Command Format
Commands in a command buffer are serialized using a binary format based on the PM4 packet format used on several Radeon and Adreno GPUs.
These packets are made up of one or more 64-bit chunks of data. The first chunk is always a header and it is followed by zero or more 64-bit payload chunks. Little-endian ordering is used to encode the chunks.
The header contains the following fields:
Bits 7-0: Reserved. Set to zero.
Bits 15-8: Opcode. Determines the type of RefSi command to execute.
Bits 29-16: Count. Equal to the number of payload chunks, times two.
Bits 31-30: Packet identifier. Always set to 3.
Bits 63-32: Inline chunk. Command-dependent meaning.
The format of payload chunks differs based on the actual command to be executed and will be described in the following section. Unless otherwise specified, fields are unsigned integers. Zero-extension must be used when storing an unsigned field in a chunk that is larger than the field.
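The header layout above can be sketched in C as follows. This is a minimal illustration rather than driver code: the helper names cmp_encode_header and cmp_store_chunk_le are hypothetical and do not come from the RefSi driver API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack a 64-bit command header from its fields. */
static uint64_t cmp_encode_header(uint8_t opcode, uint32_t num_payload_chunks,
                                  uint32_t inline_chunk) {
  /* The Count field holds the number of payload chunks, times two. */
  uint32_t count = num_payload_chunks * 2;
  assert(count <= 0x3FFFu);               /* Count must fit in 14 bits. */
  return ((uint64_t)inline_chunk << 32)   /* Bits 63-32: inline chunk. */
         | (UINT64_C(3) << 30)            /* Bits 31-30: packet identifier. */
         | ((uint64_t)count << 16)        /* Bits 29-16: count. */
         | ((uint64_t)opcode << 8);       /* Bits 15-8: opcode; 7-0 stay zero. */
}

/* Hypothetical helper: store one chunk in little-endian byte order,
 * independently of the host's own endianness. */
static void cmp_store_chunk_le(uint8_t out[8], uint64_t chunk) {
  for (int i = 0; i < 8; i++) {
    out[i] = (uint8_t)(chunk >> (8 * i));
  }
}
```

For example, a header with opcode 2, one payload chunk and an inline value of 5 encodes to 0x00000005C0020200.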
Opcodes are described in tables in which Chunk can be the header’s Inline chunk or denotes one or multiple 64-bit data payload chunks. Field is the opcode-specific name of the encoded data.
FINISH command (Opcode: 1)
This command marks the end of the command buffer. A ‘command buffer complete’ signal will be sent on the bus and the next command buffer in the queue will be executed.
Chunk | Field | Description |
Inline | | This field is not used. |
WRITE_REG64 command (Opcode: 2)
This command writes a 64-bit value to a CMP register.
Chunk | Field | Description |
Inline | | CMP register index to write the 64-bit value to. |
1 | | 64-bit immediate value to write to the register. |
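As a rough sketch of how a small command buffer could be assembled, the following C code appends a WRITE_REG64 command followed by FINISH to an array of 64-bit chunks. The function and enum names are hypothetical. The sketch also assumes a little-endian host, so that storing a uint64_t natively yields the required byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values are taken from the command headings; the names are
 * hypothetical. */
enum { CMP_OP_FINISH = 1, CMP_OP_WRITE_REG64 = 2 };

/* Hypothetical helper: pack a command header. */
static uint64_t cmp_header(uint8_t opcode, uint32_t num_payload_chunks,
                           uint32_t inline_chunk) {
  return ((uint64_t)inline_chunk << 32) | (UINT64_C(3) << 30) |
         ((uint64_t)(num_payload_chunks * 2) << 16) | ((uint64_t)opcode << 8);
}

/* Append WRITE_REG64: the register index goes in the inline chunk and the
 * immediate value in payload chunk 1. Returns the new length in chunks. */
static size_t cmp_emit_write_reg64(uint64_t *buf, size_t len, size_t cap,
                                   uint32_t reg, uint64_t value) {
  assert(len + 2 <= cap);
  buf[len++] = cmp_header(CMP_OP_WRITE_REG64, 1, reg);
  buf[len++] = value;
  return len;
}

/* Append FINISH: header only, the inline chunk is unused. */
static size_t cmp_emit_finish(uint64_t *buf, size_t len, size_t cap) {
  assert(len + 1 <= cap);
  buf[len++] = cmp_header(CMP_OP_FINISH, 0, 0);
  return len;
}
```

Writing 0x1234 to the CMP_SCRATCH register (ID 0) and then finishing produces three chunks. The resulting array would then be handed to refsiExecuteCommandBuffer, whose exact signature is not shown in this section.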
LOAD_REG64 command (Opcode: 3)
This command reads a 64-bit value from the given memory address and writes it to a CMP register. The memory address must be accessible by the CMP.
Chunk | Field | Description |
Inline | | CMP register index to write the 64-bit value to. |
1 | | Memory address to read the 64-bit value from. |
STORE_REG64 command (Opcode: 4)
This command reads a 64-bit value from a CMP register and writes it to the given memory address. The memory address must be accessible by the CMP.
Chunk | Field | Description |
Inline | | CMP register index to read the 64-bit value from. |
1 | | Memory address to write the 64-bit value to. |
STORE_IMM64 command (Opcode: 5)
This command writes an immediate 64-bit value to the given memory address. The memory address must be accessible by the CMP and only 32-bit destination addresses are supported with this command. It can be used for example to access the DMA controller via the same register interface that is exposed to the RISC-V accelerator cores.
Chunk | Field | Description |
Inline | | Lower 32 bits of the address to write to. |
1 | | 64-bit immediate value to write to memory. |
COPY_MEM64 command (Opcode: 6)
This command copies a number of 64-bit elements from one area of memory to another. Elements are copied one by one, to enable this function to be used with memory-mapped registers such as performance counters. The UNIT field can be used to read data from a particular hart in the RefSi accelerator.
Chunk | Field | Description |
Inline | | Number of 64-bit elements to copy. |
1 | | 64-bit address to read data from. |
2 | | 64-bit address to write data to. |
3 | | Unit ID to use for reading data from memory. |
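A COPY_MEM64 packet has three payload chunks, so its Count field is 6. The sketch below encodes one such packet. The function name is hypothetical, and the unit ID is passed through as a raw value because its encoding is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder for COPY_MEM64 (opcode 6). Writes four chunks
 * (header plus three payload chunks) and returns the new buffer length. */
static size_t cmp_emit_copy_mem64(uint64_t *buf, size_t len, size_t cap,
                                  uint32_t num_elements, uint64_t src_addr,
                                  uint64_t dst_addr, uint64_t unit_id) {
  const uint64_t opcode = 6;
  const uint64_t num_payload_chunks = 3;
  assert(len + 1 + num_payload_chunks <= cap);
  buf[len++] = ((uint64_t)num_elements << 32)     /* Inline: element count. */
               | (UINT64_C(3) << 30)              /* Packet identifier. */
               | ((num_payload_chunks * 2) << 16) /* Count = chunks * 2. */
               | (opcode << 8);
  buf[len++] = src_addr; /* Chunk 1: address to read from. */
  buf[len++] = dst_addr; /* Chunk 2: address to write to. */
  buf[len++] = unit_id;  /* Chunk 3: unit ID used for reading. */
  return len;
}
```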
RUN_KERNEL_SLICE command (Opcode: 7)
This command runs a kernel slice, executing the entry point function NUM_INSTANCES times with the same kernel arguments and slice ID. An instance ID ranging from zero to NUM_INSTANCES - 1 is passed to the entry point function in order to identify the instance being currently executed.
Before the entry point function is executed on the accelerator, this command can optionally copy a memory range located in device memory (e.g. TCIM) to the per-thread Kernel Thread Block located in TCDM. This can be used for example to populate the per-thread scheduling info structure based on a single scheduling info template located in device memory.
The entry point function is executed across RISC-V accelerator cores and hardware threads, with the CMP being responsible for distributing the instances between available threads. Kernel execution is synchronous: subsequent commands are not executed until this command has finished executing.
The following registers are implicitly read by this command and therefore must be set to valid values before the command is executed.
Changes to device memory made by the cores when running the kernel will be fully observable before the next command is executed, i.e. caches are implicitly flushed.
Chunk | Field | Description |
Inline | [7:0] MAX_HARTS | Maximum number of harts to run the kernel on. |
1 | | Number of times to call the kernel entry point function. |
2 | | Identifier to pass to the kernel entry point. |
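The packet layout in the table above can be sketched as a small encoder. The function name is hypothetical. The registers that this command reads implicitly are assumed to have been set by earlier commands in the same buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder for RUN_KERNEL_SLICE (opcode 7): one header chunk
 * followed by two payload chunks. */
static void cmp_encode_run_kernel_slice(uint64_t out[3], uint8_t max_harts,
                                        uint64_t num_instances,
                                        uint64_t slice_id) {
  const uint64_t opcode = 7;
  const uint64_t num_payload_chunks = 2;
  assert(out != NULL);
  out[0] = ((uint64_t)max_harts << 32) /* Inline [7:0]: MAX_HARTS. */
           | (UINT64_C(3) << 30) | ((num_payload_chunks * 2) << 16) |
           (opcode << 8);
  out[1] = num_instances; /* Chunk 1: number of entry point calls. */
  out[2] = slice_id;      /* Chunk 2: identifier passed to the kernel. */
}
```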
RUN_INSTANCES command (Opcode: 8)
This command executes a kernel function NUM_INSTANCES times with a variable number of arguments. An instance ID ranging from zero to NUM_INSTANCES - 1 is passed to the entry point function as the first argument in order to identify the instance being currently executed. Up to 7 extra arguments can be passed to the entry point function in the order they are defined in the EXTRA_ARG_x chunks.
It is similar to RUN_KERNEL_SLICE but does not mandate a fixed kernel entry point signature, other than the instance ID being the first parameter. Unlike with RUN_KERNEL_SLICE, Kernel Uniform Block and Kernel Thread Block data is not copied to TCDM. This must be done explicitly prior to executing this command.
The entry point function is executed across RISC-V accelerator cores and hardware threads, with the CMP being responsible for distributing the instances between available threads. Kernel execution is synchronous: subsequent commands are not executed until this command has finished executing.
The following registers are implicitly read by this command and therefore must be set to valid values before the command is executed.
Changes to device memory made by the cores when running the kernel will not always be fully observable before the next command is executed, i.e. caches need to be explicitly flushed using the SYNC_CACHE command.
Chunk | Field | Description |
Inline | [7:0] MAX_HARTS | Maximum number of harts to run the kernel on. |
Inline | [10:8] NUM_ARGS | Number of extra function arguments (max 7). |
1 | | Number of times to call the kernel entry point function. |
2 | | Value to pass to the entry point function as argument. |
… | … | … |
n | | Value to pass to the entry point function as argument. |
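The following sketch encodes a RUN_INSTANCES packet with a variable number of extra arguments. It assumes the payload consists of the instance count followed by exactly NUM_ARGS argument chunks, which is one reading of the table above. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder for RUN_INSTANCES (opcode 8). Returns the number of
 * chunks written: header, instance count, then one chunk per extra argument. */
static size_t cmp_encode_run_instances(uint64_t *out, size_t cap,
                                       uint8_t max_harts,
                                       uint64_t num_instances,
                                       const uint64_t *extra_args,
                                       unsigned num_args) {
  assert(num_args <= 7); /* NUM_ARGS is a 3-bit field. */
  size_t num_payload_chunks = 1 + (size_t)num_args;
  assert(cap >= 1 + num_payload_chunks);
  /* Inline chunk: MAX_HARTS in bits [7:0], NUM_ARGS in bits [10:8]. */
  uint32_t inline_chunk = (uint32_t)max_harts | ((uint32_t)num_args << 8);
  out[0] = ((uint64_t)inline_chunk << 32) | (UINT64_C(3) << 30) |
           ((uint64_t)(num_payload_chunks * 2) << 16) | (UINT64_C(8) << 8);
  out[1] = num_instances;
  for (unsigned i = 0; i < num_args; i++) {
    out[2 + i] = extra_args[i]; /* Extra arguments, in definition order. */
  }
  return 1 + num_payload_chunks;
}
```

Because caches are not flushed implicitly by this command, a buffer that reads back kernel results would typically place a SYNC_CACHE command after this packet.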
SYNC_CACHE command (Opcode: 9)
This command synchronizes one or more caches in the RefSi device with the host. After this command has executed, all data written to RISC-V harts’ caches is flushed to device memory. Any data cached for reading is also invalidated.
Chunk | Field | Description |
Inline | | Flags describing which caches to synchronize. |
Possible flags:
: Synchronize the data cache.
CMP_CACHE_SYNC_ACC_ICACHE: Synchronize the instruction cache.
CMP_SCRATCH Register (ID: 0)
The CMP_SCRATCH register has no special purpose and can be used freely in a command buffer. For example, it could be used to temporarily store the most recent transfer ID after starting a new DMA transfer. By writing this stored transfer ID to the appropriate DMA register using the STORE_REG64 command, the CMP will wait until this transfer is complete before executing the next command.
Bits | Field | Description |
63-0 | | No meaning. |
The CMP_ENTRY_PT_FN register holds the address of the kernel entry point in device memory and must be set before executing any kernel command.
When using the RUN_INSTANCES command (the preferred way to execute kernels), the entry point must have the following signature:
void kernel_entry(size_t instance_id, void *arg1, ..., void *arg7);
The instance_id parameter is set by the CMP to identify the kernel instance being executed by the hardware thread.
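To illustrate how instance_id is typically used, here is a sketch of a kernel that takes three extra arguments, together with a host-side loop that stands in for the CMP. The loop is only a sequential simulation for illustration; on RefSi the CMP distributes instances across hardware threads. All names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Example kernel with three extra arguments: each instance adds one pair
 * of elements, selected by instance_id. */
static void vec_add_entry(size_t instance_id, void *arg1, void *arg2,
                          void *arg3) {
  const int32_t *a = arg1;
  const int32_t *b = arg2;
  int32_t *out = arg3;
  out[instance_id] = a[instance_id] + b[instance_id];
}

/* Sequential stand-in for the CMP: calls the entry point once per instance
 * with IDs from zero to num_instances - 1. */
static void simulate_run_instances(void (*entry)(size_t, void *, void *,
                                                 void *),
                                   size_t num_instances, void *arg1,
                                   void *arg2, void *arg3) {
  assert(entry != NULL);
  for (size_t id = 0; id < num_instances; id++) {
    entry(id, arg1, arg2, arg3);
  }
}
```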
When using the RUN_KERNEL_SLICE command the entry point must have the following signature instead:
void kernel_entry(size_t instance_id, size_t slice_id,
const void *packed_args, void *ktb);
The instance_id parameter is set by the CMP to identify the kernel instance being executed by the hardware thread, while the slice_id parameter is set within the command buffer when executing kernel commands such as RUN_KERNEL_SLICE.
The packed_args argument is populated from the memory range described in the CMP_KARGS_INFO and CMP_KUB_DESC registers while the ktb argument is set to an area of memory allocated by the CMP in TCDM based on the CMP_TSD_INFO and CMP_KUB_DESC registers.
Bits | Field | Description |
31-0 | | Address of the kernel entry point in memory. |
63-32 | Reserved | Reserved. |
The CMP_KUB_DESC register describes where the Kernel Uniform Block is located in device memory. The KUB contains data which is uniform to all kernel instances, such as kernel arguments or the scheduling info template. It can be read by accelerator cores but not written to.
When KUB_SIZE is set to zero, the KUB is not used when executing kernels using the RUN_KERNEL_SLICE command. As a result, the packed_args and ktb parameters are set to an invalid address when the entry point function is invoked. This register is cleared upon reset of the RefSi platform, with all fields set to zero.
Note: this is only used by the RUN_KERNEL_SLICE command.
Bits | Field | Description |
47-0 | | Address of the Kernel Uniform Block. |
63-48 | | Size of the Kernel Uniform Block in 256-byte increments. |
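A value for this register could be packed as sketched below. The function name is hypothetical, and rounding a byte size up to the next 256-byte increment is an assumption of this sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack a CMP_KUB_DESC value from a KUB address and a
 * size in bytes. The size is rounded up to 256-byte increments (assumed). */
static uint64_t cmp_pack_kub_desc(uint64_t kub_addr, size_t kub_size_bytes) {
  uint64_t increments = ((uint64_t)kub_size_bytes + 255) / 256;
  assert(kub_addr < (UINT64_C(1) << 48)); /* Address field is 48 bits. */
  assert(increments <= 0xFFFF);           /* Size field is 16 bits. */
  return (increments << 48) | kub_addr;
}
```

The packed value would then be written to the register with a WRITE_REG64 command before RUN_KERNEL_SLICE is issued.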
The CMP_KARGS_INFO register describes where packed kernel arguments are located in device memory and must be set before executing any RUN_KERNEL_SLICE command that makes use of kernel arguments.
When KARGS_SIZE is set to zero, the entry point function is invoked with an invalid address for the packed_args parameter. This register is cleared upon reset of the RefSi platform, with all fields set to zero.
Note: this is only used by the RUN_KERNEL_SLICE command.
Bits | Field | Description |
15-0 | | Reserved. |
39-16 | | Offset into the KUB where kernel arguments are stored. |
63-40 | | Size of the packed kernel argument list. |
The CMP_TSD_INFO register describes where thread-specific data is located in device memory. When a kernel is executed, the CMP copies thread-specific data to TCDM in several locations. Each hardware thread executing the kernel has a separate copy of the data (called the Kernel Thread Block or KTB), which can be freely modified by the thread.
When TSD_SIZE is set to zero, the entry point function is invoked with an invalid address for the ktb parameter and the KTB feature is not used. This register is cleared upon reset of the RefSi platform, with all fields set to zero.
One use-case for TSD is to let the CMP populate per-thread scheduling info in the KTB (located in TCDM) from a single scheduling info template stored in the KUB. Upon executing the kernel entry point, the instance_id and slice_id parameters can be used to update scheduling info fields such as group_id and local_id.
Bits | Field | Description |
15-0 | | Reserved. |
39-16 | | Offset into the KUB where thread-specific data is stored. |
63-40 | | Size of thread-specific data for the kernel. |
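CMP_KARGS_INFO and CMP_TSD_INFO share the same bit layout: a 24-bit offset into the KUB and a 24-bit size. A single hypothetical helper can therefore pack either register, as sketched below. The unit of the size field is not stated in the tables, so the raw field value is passed through unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack an offset/size pair in the layout shared by
 * CMP_KARGS_INFO and CMP_TSD_INFO. Bits 15-0 are reserved and left zero. */
static uint64_t cmp_pack_offset_size(uint32_t kub_offset, uint32_t size) {
  assert(kub_offset < (UINT32_C(1) << 24)); /* Offset: bits 39-16. */
  assert(size < (UINT32_C(1) << 24));       /* Size: bits 63-40. */
  return ((uint64_t)size << 40) | ((uint64_t)kub_offset << 16);
}
```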
CMP_STACK_TOP Register (ID: 5)
The CMP_STACK_TOP register can be used to specify the address harts will use as the top of the stack when executing the kernel entry point function. Since this address will be the same for all harts, it is recommended to set up a memory window to enable each hart to have its own stack area.
Bits | Field | Description |
63-0 | | Address to use as the top of the stack. |
CMP_RETURN_ADDR Register (ID: 6)
The CMP_RETURN_ADDR register is used to specify the address to jump to when returning from a kernel entry point function. It can be used to execute any clean up code and notify the host when the execution of the kernel has completed.
Bits | Field | Description |
63-0 | | Address to jump to when returning from a kernel. |
CMP_REG_WINDOW_BASEn (IDs: 8 through 15)
The CMP_REG_WINDOW_BASE register array describes the base address of a memory window, which is the starting address of a virtual memory area. Any accesses to the virtual memory area described by the window actually occur in the target memory area of the window.
Bits | Field | Description |
63-0 | | Base address for the memory window. |
CMP_REG_WINDOW_TARGETn (IDs: 16 through 23)
The CMP_REG_WINDOW_TARGET register array describes the target address of a memory window, which is the starting address of a physical memory area. Any accesses to the virtual memory area described by the window actually occur in the target memory area of the window.
Bits | Field | Description |
63-0 | | Target address for the memory window. |
CMP_REG_WINDOW_MODEn (IDs: 24 through 31)
The CMP_REG_WINDOW_MODE register array contains most of the configuration options for a given memory window. The most important field in this register is the ACTIVE field, which controls whether or not the virtual memory area described by the window is accessible. Any accesses to the virtual memory area described by the window actually occur in the target memory area of the window.
The calculation of the effective address depends on the windowing mode:
SHARED: An access to BASE_ADDR + offset is mapped to TARGET + offset. All harts and all cores see the same memory contents when accessing memory through the window.
PER_HART: An access to BASE_ADDR + offset is mapped to TARGET + (SCALE * hart_id) + offset. Different harts see different memory contents when accessing memory through the window.
PER_CORE: An access to BASE_ADDR + offset is mapped to TARGET + (SCALE * core_id) + offset. Different cores see different memory contents when accessing memory through the window.
In the above calculations, SCALE is substituted as SCALE_A * SCALE_B. SCALE_A and SCALE_B are configured through the CMP_REG_WINDOW_SCALE register array.
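The address calculation for the three windowing modes can be sketched as a single function. The type and function names are hypothetical. The scale_a and scale_b parameters are the already-decoded factors, not the raw register fields.

```c
#include <assert.h>
#include <stdint.h>

/* Mode values match the mode field encoding: 0 SHARED, 1 PER_HART,
 * 2 PER_CORE. */
typedef enum { WIN_SHARED = 0, WIN_PER_HART = 1, WIN_PER_CORE = 2 } win_mode_t;

/* Hypothetical model of the window translation: maps an offset within the
 * window to the address actually accessed in the target memory area. */
static uint64_t window_effective_addr(win_mode_t mode, uint64_t target,
                                      uint64_t scale_a, uint64_t scale_b,
                                      uint64_t hart_id, uint64_t core_id,
                                      uint64_t offset) {
  uint64_t scale = scale_a * scale_b; /* SCALE = SCALE_A * SCALE_B. */
  switch (mode) {
    case WIN_SHARED:
      return target + offset;
    case WIN_PER_HART:
      return target + (scale * hart_id) + offset;
    case WIN_PER_CORE:
      return target + (scale * core_id) + offset;
  }
  assert(0 && "reserved windowing mode");
  return 0;
}
```

With a per-hart window and a scale of 4096 bytes, for instance, hart 3 accessing offset 8 lands at TARGET + 0x3008. This is how one stack address in CMP_STACK_TOP can resolve to a separate stack area for each hart.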
Bits | Field | Description |
0 | | Whether or not the memory window can be accessed. |
2-1 | | 0: SHARED, 1: PER_HART, 2: PER_CORE, 3: reserved |
3 | | Whether or not the target memory is interleaved. |
6-4 | | Memory permissions. [0]: READ, [1]: WRITE, [2]: EXECUTE |
12-8 | | Granularity of the target area when it is interleaved. |
63-32 | | Size of the memory window, between 1 and 2^32 bytes. |
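A mode register value could be assembled from the bit positions in the table as sketched below. All names are hypothetical. The granularity and size parameters are raw field values, since the tables do not spell out how those two fields are encoded.

```c
#include <assert.h>
#include <stdint.h>

/* Permission bits within the 3-bit permissions field (names hypothetical). */
enum { WIN_PERM_READ = 1, WIN_PERM_WRITE = 2, WIN_PERM_EXECUTE = 4 };

/* Hypothetical helper: pack a CMP_REG_WINDOW_MODE value from its fields. */
static uint64_t cmp_pack_window_mode(int active, unsigned mode,
                                     int interleaved, unsigned perms,
                                     unsigned granularity_field,
                                     uint32_t size_field) {
  assert(mode <= 2);               /* Mode 3 is reserved. */
  assert(perms <= 7);              /* Three permission bits. */
  assert(granularity_field <= 31); /* Five-bit field. */
  return ((uint64_t)size_field << 32)             /* Bits 63-32. */
         | ((uint64_t)granularity_field << 8)     /* Bits 12-8. */
         | ((uint64_t)perms << 4)                 /* Bits 6-4. */
         | ((uint64_t)(interleaved ? 1 : 0) << 3) /* Bit 3. */
         | ((uint64_t)mode << 1)                  /* Bits 2-1. */
         | (uint64_t)(active ? 1 : 0);            /* Bit 0. */
}

/* Register ID of the mode register for window n (IDs 24 through 31). */
static uint32_t cmp_window_mode_reg_id(unsigned n) {
  assert(n < 8);
  return 24 + n;
}
```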
CMP_REG_WINDOW_SCALEn (IDs: 32 through 39)
The CMP_REG_WINDOW_SCALE register array describes the scaling factor to use for calculating the effective address when accessing memory through a window which is in per-hart or per-core mode. The scaling factor is the product of two parts, SCALE_A and SCALE_B.
Bits | Field | Description |
4-0 | | First part of the scaling factor (0, 2^1, 2^2, …, 2^31). |
63-32 | | Second part of the scaling factor (1 .. 2^32). |