Published: Mon 12 September 2022
By Icecream95
In RE.
Those of us who have been trying to make use of the "v10" Mali GPUs from Arm, such as the Mali-G610 in Rockchip's RK3588, will surely have noticed by now the requirement for firmware: kbase requires mali_csffw.bin to be present in one of the firmware directories (such as /lib/firmware) for the GPU to be usable.
Part 1: What is this firmware and how does it work?
The firmware runs on a microcontroller inside the GPU and handles many tasks required to work the GPU that were previously handled inside the kernel. The MCU ("Microcontroller unit") is, as Alyssa Rosenzweig correctly guessed, a Cortex-M7 (r1p2, with no FPU, ECC cache, nor TCM), which I have found runs off the GPU clock, allowing for an impressive 990 MHz on RK3588. (That speed was measured on my board; maximum clock speeds on the RK3588 vary a bit from chip to chip.)
While the MCU is 32-bit, the GPU supports eight 48-bit virtual address spaces. So how can it access all of this memory? Arm decided on the obvious method here: simply let the MCU control its own page tables!
Calm down, it isn't quite that bad.
Memory mappings
To access memory, there are three layers that must be navigated:
The MCU's integrated MPU (memory protection unit)
The MCU memory mappings
The GPU MMU
Only the first two of these are controlled by the MCU, so exploits cannot access memory that is not mapped for access by the GPU. An attacker would have better luck using the BASE_MEM_MMU_DUMP_HANDLE cookie from userspace to dump the physical addresses of the page tables, so that a kernel exploit can change page tables from the CPU.
To see how these layers fit together, let's try to access the memory at 0x5fffe05080 inside Address Space 3 from the MCU:
uint64_t va = 0x5fffe05080;
unsigned as = 3;
/* The memory mapped GPU registers starting at this offset control the
 * MCU memory mapping for the 128 MiB region starting at 0x08000000 */
uintptr_t map_reg_base = 0x40022100;
volatile uint32_t *map_reg_as = (volatile uint32_t *)(map_reg_base + 4);
volatile uint64_t *map_reg_addr = (volatile uint64_t *)(map_reg_base + 8);
/* Ensure that any writes to the memory that used to be mapped to the
 * region have finished. This function is provided by CMSIS. */
__DSB();
/* Program the registers. Note that the lower 26 bits of map_reg_addr
 * are ignored, allowing for 64 MiB granularity */
*map_reg_addr = va;
*map_reg_as = as;
/* Set up an MPU region for the mapping. Register definitions and
 * macros are provided by CMSIS. */
/* Memory attributes: outer and inner Non-cacheable, shared */
unsigned attr = ARM_MPU_ACCESS_(1, true, false, false);
/* Allow reads and writes from Privileged mode */
unsigned access = ARM_MPU_AP_PRIV;
unsigned size = ARM_MPU_REGION_SIZE_128MB;
/* Use region 15. The MCU supports 16 memory regions */
MPU->RNR = 15;
/* Set the base address register */
MPU->RBAR = 0x08000000;
/* Set up permissions, attributes, and the region size */
MPU->RASR = ARM_MPU_RASR_EX(true, access, attr, 0, size);
/* Ensure that the register changes have taken effect. */
__DSB();
__ISB();
/* Get a pointer to the data inside the mapping. The explicit casts
 * avoid an implicit integer-to-pointer conversion. */
uint32_t *data = (uint32_t *)(uintptr_t)(0x08000000 + (va & 0x3ffffff));
/* Access the memory */
printf("*VA(0x%llx, 3) = %lu\n", (unsigned long long)va, (unsigned long)*data);
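The address arithmetic in that example is easy to get wrong by a digit, so here is a small host-side sketch of it in Python. It only models the numbers given above (a window at 0x08000000, with the low 26 bits of the programmed address ignored); the function and constant names are made up for illustration and are not part of any firmware or kbase API.

```python
WINDOW_MCU_BASE = 0x08000000            # MCU-side start of the mapping window
GRANULE_BITS = 26                       # low bits ignored by the mapping register
GRANULE_MASK = (1 << GRANULE_BITS) - 1  # 0x3ffffff, i.e. 64 MiB - 1


def map_through_window(va):
    """Return (GPU base the window effectively starts at, MCU pointer for va)."""
    gpu_base = va & ~GRANULE_MASK
    mcu_ptr = WINDOW_MCU_BASE + (va & GRANULE_MASK)
    return gpu_base, mcu_ptr


gpu_base, mcu_ptr = map_through_window(0x5fffe05080)
print(hex(gpu_base), hex(mcu_ptr))  # 0x5ffc000000 0xbe05080
```

Because the window is 128 MiB but the granularity is 64 MiB, at least 64 MiB past the requested address is always reachable through the same mapping.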
The different layers have these properties:
MPU:
Configures permissions and L1 cache behaviour
16 regions of almost arbitrary size (a multiple of 32 bytes, and at most eight times a power of two)
Alignment is the region size rounded up to a power of two
Faults are handled by MCU exceptions
Controlled by the MCU
MCU mappings:
Mapping only; does not appear to support permissions
Eight 128 MiB regions
Regions have 64 MiB alignment
No faults
Controlled by the MCU
GPU MMU:
Configures permissions, mapping, L2 cache behaviour and coherency
Supports as many 4 KB pages as will fit into the eight 48-bit address spaces
Pages have 4 KB alignment
Faults are handled by the kernel. A fault in the MCU address space causes a reset
Controlled by the kernel
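For the MPU layer, the size constant used earlier (ARM_MPU_REGION_SIZE_128MB) is just an encoding of the Armv7-M RASR SIZE field, where a region covers 2**(SIZE + 1) bytes. The following Python sketch computes that field and checks base alignment. It deliberately leaves out the sub-region disable bits, so it only handles plain power-of-two regions; the helper names are illustrative.

```python
def mpu_size_field(region_bytes):
    """Armv7-M MPU RASR.SIZE value for a power-of-two region size."""
    if region_bytes < 32 or region_bytes & (region_bytes - 1):
        raise ValueError("region size must be a power of two of at least 32 bytes")
    # The region covers 2**(SIZE + 1) bytes, so SIZE = log2(bytes) - 1
    return region_bytes.bit_length() - 2


def is_valid_base(base, region_bytes):
    """The base address must be aligned to the (power-of-two) region size."""
    return base % region_bytes == 0


size = 128 * 1024 * 1024
print(mpu_size_field(size), is_valid_base(0x08000000, size))  # 26 True
```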
Apart from accessing GPU memory, in what other ways can the MCU communicate with the rest of the system?
Interrupts
An important part of the system is interrupts:
The kernel can ring a doorbell to interrupt the MCU
The MCU can use hardware registers to send an interrupt to the CPU
Userspace can ring mapped doorbell pages to set a register and interrupt the MCU
GPU components can signal an interrupt if they need the MCU to take action
The MCU can interrupt itself, for example using the SysTick timer
(What are doorbells? These are register pages that, when written to, set an MCU-accessible register and signal an interrupt. So the CPU performs a write transfer on the APB ("Advanced Peripheral Bus", part of AMBA), then the GPU decodes the address, figures out that it is for a doorbell page, updates registers, and signals the interrupt to the MCU.)
These interrupts are used for a number of different tasks:
Kernel to MCU interrupts are used to wake up the MCU for performing firmware initialisation; configuring command streams and command stream groups; power management tasks; other odd jobs, such as entering "protected mode". I have also patched my kernel to send an interrupt on tracebuffer writes—normally, from the CPU's point of view tracebuffers are read-only. This allows two-way communication for gdb.
MCU to kernel interrupts are often used once kernel requests to the MCU are handled, but they are also used for "event store" or "event add" instructions in the command stream, so that the kernel may wake up userspace threads which are waiting for the GPU to finish work by calling poll on the kbase file descriptor. Errors are also signalled in this way.
Userspace to MCU interrupts are used for one purpose: telling the GPU to start processing a set of command stream instructions. There is also an ioctl for this, which must be used in some circumstances.
GPU to MCU interrupts might happen when command stream processing is finished, so that the MCU can then tell the kernel what's happening or power off unused GPU components. They are also used for command stream instructions which require emulation, such as the aforementioned "event add".
MCU timer interrupts can be useful for running code after a set amount of time; this might be to soft-stop a fragment job to let another context have a go at rendering.
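One practical limit on those timer interrupts: SysTick is a 24-bit down-counter, so at a fast clock a single period is short. The sketch below computes the reload value for a requested interval, assuming (and this is an assumption, not something measured) that SysTick is clocked directly from the 990 MHz MCU clock mentioned earlier. The function name is hypothetical.

```python
SYSTICK_MAX_RELOAD = (1 << 24) - 1  # SysTick reload register is 24 bits wide


def systick_reload(clock_hz, interval_us):
    """Reload value so that SysTick fires every interval_us microseconds."""
    ticks = clock_hz * interval_us // 1_000_000
    reload = ticks - 1  # a reload value of N gives a period of N + 1 cycles
    if not 1 <= reload <= SYSTICK_MAX_RELOAD:
        raise ValueError("interval does not fit in the 24-bit SysTick counter")
    return reload


print(systick_reload(990_000_000, 1000))  # 989999, i.e. a 1 ms tick
```

Under that assumption the longest single period is a little under 17 ms, so a longer soft-stop timeout would need to count several ticks in software.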
MMIO
There are two important addresses in the memory map, which look similar enough to be easily mistaken for each other:
At 0x04000000 (in the 48-bit MCU address space; it should be possible to use MCU mappings to page this out) is a 64 MiB region of memory which is used for buffers shared between the kernel and the MCU. This is used to tell the MCU to handle state changes, for example to bring a new command stream group online, or to suspend the registers of an old group to memory at a certain location. There are also tracebuffers for debug output and a few other buffer types.
From 0x40000000 is a 256 KiB region of memory mapped IO (not all of it backed by real storage) which controls the hardware of the GPU. Accessing registers here can signal interrupts, configure MCU memory mappings, query the status of the GPU, launch command stream processing on an iterator, and access command stream registers. Do not take that as a complete list.
(A correction from my last blog post: the GPU does appear to support multiple iterators, just there is only a single command stream processor for chewing through the command stream. I'm still somewhat confused about the relationship between endpoints, iterators, processors, and all the rest.)
(Update: There appears to be yet another relevant term to add to this mess: There are four "Resources", which are "compute", "fragment", "idvs", and "tiler". Userspace configures the resources required by a command stream, then only these resources may have work submitted to them. The command stream must wait for them to be powered up before using them, I think.)
The region at 0x04000000 is just another memory region, but the one at 0x40000000 is special Device memory. They aren't really as related as my list makes out, except in importance.
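Since 0x04000000 and 0x40000000 differ only by where the zero goes, a tiny lookup helper is handy when staring at fault addresses. This Python sketch uses only the ranges given in this post (the shared region, the MMIO block, and the single mapping window used in the earlier example); the table and function names are invented for illustration, and other regions certainly exist.

```python
MIB = 1 << 20
KIB = 1 << 10

# (name, start, size) for the ranges described in this post
REGIONS = [
    ("shared", 0x04000000, 64 * MIB),   # kernel <-> MCU shared buffers
    ("window", 0x08000000, 128 * MIB),  # MCU mapping window from the example
    ("mmio", 0x40000000, 256 * KIB),    # GPU control registers (Device memory)
]


def classify(addr):
    """Return the name of the known region containing addr, or None."""
    for name, start, size in REGIONS:
        if start <= addr < start + size:
            return name
    return None


for addr in (0x04000000, 0x40000000, 0x40022100, 0x0BE05080):
    print(hex(addr), classify(addr))
```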
Firmware images
How does the firmware actually get loaded into the system?
A single file mali_csffw.bin includes all of the memory required to be mapped for the firmware to work. (Well, it's designed to be all, but as you'll see later that may not exactly be the case for my firmware.)
Let's dump one of my own firmware images:
ixn@rock-5b:/tmp/w$ ./csffw dump mali_csffw.bin
Firmware header version 0.1
Firmware version 0x1010001
mem flags read,write,cached,zero va 0x0-0x1000 data 0x248-0x2c8 name 'This firmware image is GPL-2'
mem flags read,write,cached,coherent,shared va 0x4000000-0x4094000 data 0x2c8-0x2c8 name '.shared'
mem flags read,write,exec,cached,zero va 0x1000000-0x1039000 data 0x2c8-0x3904c name '.text'
mem flags read,cached va 0xff0000-0x1000000 data 0x3904c-0x3904c name '.code_redzone'
mem flags read,cached va 0x1800000-0x180c000 data 0x3904c-0x44b50 name '.rodata .ARM.exidx .init_array .fini_array'
mem flags read,write,cached,zero va 0x1c00000-0x1c40000 data 0x44b50-0x44b50 name '.brk'
mem flags read,write,cached,zero va 0x2000000-0x2040000 data 0x44b50-0x44b50 name '.stack'
mem flags read,write,cached,zero va 0x27f0000-0x2800000 data 0x44b50-0x44b50 name '.bss_redzone'
mem flags read,write,cached,zero va 0x2800000-0x2831000 data 0x44b50-0x44b50 name '.bss .uninitialized_bss'
mem flags read,cached va 0x2c00000-0x2c21000 data 0x44b50-0x65520 name '.data'
mem flags read,write,cached,zero va 0x3000000-0x3021000 data 0x65520-0x65520 name '.data'
trace 'fwin' type 0, size @ 0x1800000, insert @ 0x1800004, extract @ 0x1800008, data @ 0x180000c, enable @ 0x1800010 (1 bits)
trace 'fwout' type 0, size @ 0x1800014, insert @ 0x1800018, extract @ 0x180001c, data @ 0x1800020, enable @ 0x1800024 (1 bits)
trace 'fwlog' type 0, size @ 0x1800028, insert @ 0x180002c, extract @ 0x1800030, data @ 0x1800034, enable @ 0x1800038 (1 bits)
So the image starts with a 20-byte header with the version information and the size of the following firmware entries, then a number of entries.
There are:
"Interface" entries, which confusingly set up a memory section, here shown with "mem". The permissions are for the GPU page tables, and the MPU can restrict them further if it wants.
"Tracebuffer" entries, which set up a buffer by which the MCU can write trace data for the CPU to read; or with my patched kernel, for the CPU to write data for the MCU to read. Here you can see the "fwlog" tracebuffer for writing log messages, and my non-standard "fwin" and "fwout" tracebuffers for gdb.
Note the label on the first interface, "This firmware image is GPL-2". Useful for people who want to be absolutely sure that they aren't loading non-free firmware by accident. Eventually this might move to a "Configuration" entry, which is described soon.
There are also a couple of "redzone" interfaces which are set to be inaccessible by the MPU. With proper design, these shouldn't be necessary, but they do appear to reduce the frequency of cases where GPU page faults happen and the MCU requires a reset.
Not shown here are:
"Configuration" entries, which allow userspace to change firmware configuration by writing to files in /sys/devices/platform/*.gpu/firmware_config
"Timeline metadata" entries, which provide information to userspace about supported trace events. My panloader fork includes a tool pantrace in the trace subdirectory which uses this metadata to dump events.
"Build info metadata" entries, which store a git SHA for the firmware revision. If someone tried to use SHA-256 for their firmware git repository, kbase would truncate the hash.
"Firmware unit-test" entries, which have an unknown use, not being used by publicly available kernel code
"Type number five" entries, which have unknown effects
It's actually not a terrible format to deal with, especially compared to the MIPE packets used for timeline metadata, and converting from ELF is simple enough, if you don't mind the terrible crimes I committed with improperly formatted .note sections.
I did not implement converting to ELF, because after doing that I do not think that I could resist the temptation to load Arm's firmware image into Ghidra, which is a big no-no for GPU reverse engineering.
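To make the "mem" lines of the dump above a bit more concrete, here is a rough Python sketch that parses one such line and compares the size of the virtual range with the size of the data stored in the image. It works on the text output shown earlier, not on the binary format, and the reading that the remainder of a section is zero-filled is my interpretation of the "zero" flag rather than something confirmed here. The regular expression and function name are illustrative only.

```python
import re

MEM_LINE = re.compile(
    r"mem flags (?P<flags>\S+) "
    r"va (?P<va0>0x[0-9a-f]+)-(?P<va1>0x[0-9a-f]+) "
    r"data (?P<d0>0x[0-9a-f]+)-(?P<d1>0x[0-9a-f]+) "
    r"name '(?P<name>[^']*)'"
)


def parse_mem_line(line):
    """Parse one 'mem' line of the dump; return None for other lines."""
    m = MEM_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "name": m["name"],
        "flags": set(m["flags"].split(",")),
        "va_size": int(m["va1"], 16) - int(m["va0"], 16),
        "data_size": int(m["d1"], 16) - int(m["d0"], 16),
    }


text = parse_mem_line(
    "mem flags read,write,exec,cached,zero "
    "va 0x1000000-0x1039000 data 0x2c8-0x3904c name '.text'"
)
print(hex(text["va_size"]), hex(text["data_size"]))  # 0x39000 0x38d84
```

For '.text' that leaves 0x27c bytes of the mapping with no data in the file, and sections such as '.bss' carry no data at all.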
Part 2: What fun things can you run on the microcontroller?
Rust
Hah, likely story.
Rust as a language seems good enough, but the tooling is… not great.
let svd = include_str!("csf.svd");
let gen = generate(svd, &svd_cfg).expect("Generate from SVD");
let gen_dev = gen.device_specific.unwrap();
let gen_lib = gen.lib_rs.split_once("# ! [no_std]").unwrap().1;
let gen_lib = gen_lib.replace("crate ::", "crate :: csf_pac ::");
let gen_lib = gen_lib.replace("# [cfg (feature = \" rt \" )]", "");
After writing that in build.rs, I gave up on the idea of pure-Rust firmware. Maybe I'm holding it wrong, but even the terrible hacks I wrote in Meson weren't as bad as this. And at least I got the latter working!
Maybe once it works better with the Meson build system I'll come back.
(Update: I changed my mind and decided that Rust is usable, but I'm shelling out to Cargo from Meson, building a static library with functions to be called from C.)
gdb
There are many guides to using gdb with a microcontroller, but they all seem to require external hardware. Sadly, that isn't exactly possible for an MCU integrated deep within a SoC.
Luckily, some "Embedded Systems/Robotics Hobbyist" type called Adam Green wrote a gdbserver using the self hosted debug features for Armv7-M, called MRI. It works very well, apart from a couple of bugs around OS ABI handling:
root@rock-5b:~# gdb -iex 'set osabi none' -ex 'target remote /sys/kernel/debug/mali0/fw_io' /tmp/w/fw
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /tmp/w/fw...
get_tty_state failed: Inappropriate ioctl for device
set_tty_state failed: Inappropriate ioctl for device
Remote debugging using /sys/kernel/debug/mali0/fw_io
warning: while parsing target memory map: no element found
0x0100091c in irq_handler_reset () at ../../home/ixn/src/panfwost/src/fw.c:579
579 __WFI();
(gdb) bt
#0 0x0100091c in irq_handler_reset () at ../../home/ixn/src/panfwost/src/fw.c:579
#1 0xfffffffe in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) break mp_hal_stdout_tx_str
Breakpoint 1 at 0x1008492: file ../../home/ixn/src/panfwost/src/csf_py.c, line 147.
(gdb) cont
Continuing.
Breakpoint 1, mp_hal_stdout_tx_str (str=0x1806f3c "\r\n") at ../../home/ixn/src/panfwost/src/csf_py.c:147
147 tb_write(py_output, str, strlen(str));
(gdb) bt
#0 mp_hal_stdout_tx_str (str=0x1806f3c "\r\n") at ../../home/ixn/src/panfwost/src/csf_py.c:147
#1 0x0101cdd2 in pyexec_friendly_repl_process_char (c=2) at ../../home/ixn/src/panfwost/py/micropython/shared/runtime/pyexec.c:403
#2 0x0101cf90 in pyexec_event_repl_process_char (c=2) at ../../home/ixn/src/panfwost/py/micropython/shared/runtime/pyexec.c:485
#3 0x01008368 in process_char (ch=2) at ../../home/ixn/src/panfwost/src/csf_py.c:72
#4 0x0100839a in csf_py_process_input () at ../../home/ixn/src/panfwost/src/csf_py.c:88
#5 0x01001282 in irq_handler_pendsv_c () at ../../home/ixn/src/panfwost/src/fw.c:858
#6 0x01001192 in irq_handler_pendsv () at ../../home/ixn/src/panfwost/src/fw.c:827
#7 <signal handler called>
#8 0x0100091c in irq_handler_reset () at ../../home/ixn/src/panfwost/src/fw.c:579
#9 0xfffffffe in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
It really does work as nicely as normal gdb, except that NULL pointers now point to the interrupt vectors.
Note that a patched kernel is currently required to get the fw_io file in debugfs, which uses the fwin and fwout tracebuffers.
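The dump earlier lists "size", "insert", "extract" and "data" fields for each tracebuffer, which suggests an ordinary ring buffer. The Python sketch below models one under that assumption: the writer advances insert, the reader advances extract, and both wrap at size. The real buffer layout and the exact full/empty convention are not described in this post, so treat the class and its details (including keeping one byte free) as hypothetical.

```python
class TraceBuffer:
    """Toy ring buffer with separate insert (writer) and extract (reader) offsets."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray(size)
        self.insert = 0   # next offset the writer will fill
        self.extract = 0  # next offset the reader will consume

    def free_space(self):
        # One byte stays unused so that insert == extract always means "empty"
        return (self.extract - self.insert - 1) % self.size

    def write(self, payload):
        """Copy as much of payload as fits; return the number of bytes written."""
        count = min(len(payload), self.free_space())
        for byte in payload[:count]:
            self.data[self.insert] = byte
            self.insert = (self.insert + 1) % self.size
        return count

    def read(self):
        """Drain everything between extract and insert."""
        out = bytearray()
        while self.extract != self.insert:
            out.append(self.data[self.extract])
            self.extract = (self.extract + 1) % self.size
        return bytes(out)


tb = TraceBuffer(8)
tb.write(b"hello")
print(tb.read())  # b'hello'
```

With two of these pointing in opposite directions (as with fwin and fwout), each side only ever writes its own offset, which is what makes the scheme workable without locks.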
watchpoints?
Another no here…
Part of the Cortex-M debug features is a set of hardware watchpoints, for example to watch for reads from a set memory region. This even supports data watchpoints, raising an exception on any use of a set magic word, such as 0xdeadbeef.
Unfortunately, the exceptions are asynchronous, which means that they can happen several instructions after the fact and are therefore just about useless.
Replacements for watchpoints include:
MPU regions disallowing all accesses
Single-stepping and checking every register value in the debug handler
Valgrind. (I'd love to see it ported to Armv7-M…)
critical sections?
Look at the gdb backtrace again. What's the point of two handlers, when the MCU already saves enough registers that plain C functions work just fine?
#5 0x01001282 in irq_handler_pendsv_c () at ../../home/ixn/src/panfwost/src/fw.c:858
#6 0x01001192 in irq_handler_pendsv () at ../../home/ixn/src/panfwost/src/fw.c:827
#7 <signal handler called>
Well, the first handler is an assembly thunk which links a context into the on-stack context list:
struct exception_frame {
        uint32_t r0;
        uint32_t r1;
        uint32_t r2;
        uint32_t r3;
        uint32_t r12;
        uint32_t lr;
        uint32_t return_address;
        uint32_t xpsr;
};
struct context_list {
        struct context_list *prev;
        struct exception_frame *frame;
        uint32_t r7;
        uint32_t lr;
} *ctx_list;
The point of this is that rather than using atomics in tight loops, it can be the responsibility of the ISR to make sure that lower priority contexts get consistent results. An ISR can simply walk the list of contexts to see if a given context is inside a critical section, and if so take appropriate action.
Think of this like the restartable sequences supported by Linux that accelerate sched_getcpu so much. To be honest it would have been better to copy the RSEQ design completely and forget about nesting.
I don't know if this idea will stay or not. Atomics should be pretty cheap, considering that no cross-core synchronisation is needed.
(It was originally going to be used to fix a WFI race condition, but then I discovered that WFE existed.)
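As a rough illustration of the context-list idea, here is a Python model of an ISR walking the list and checking each interrupted context's return address against a table of critical sections. The "appropriate action" taken here, rewinding the return address to a restart point in the style of restartable sequences, is an assumption for the sketch, as are the section table, the class names, and the addresses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExceptionFrame:
    return_address: int  # where the interrupted context will resume


@dataclass
class Context:
    frame: ExceptionFrame
    prev: Optional["Context"] = None  # next older (lower priority) context


@dataclass
class CriticalSection:
    start: int    # first address inside the section
    end: int      # first address past the section
    restart: int  # where to resume if the section was interrupted


def restart_interrupted_sections(ctx_list, sections):
    """Walk the context list; rewind any context caught inside a section."""
    restarted = 0
    ctx = ctx_list
    while ctx is not None:
        for cs in sections:
            if cs.start <= ctx.frame.return_address < cs.end:
                ctx.frame.return_address = cs.restart
                restarted += 1
                break
        ctx = ctx.prev
    return restarted


# Hypothetical addresses: an idle loop interrupted by a handler that was
# itself interrupted in the middle of a critical section.
idle = Context(ExceptionFrame(0x0100091C))
handler = Context(ExceptionFrame(0x01002010), prev=idle)
sections = [CriticalSection(0x01002000, 0x01002020, 0x01002000)]
print(restart_interrupted_sections(handler, sections))  # 1
```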
micropython
$ ./repl
Use CTRL-D to quit
MicroPython v1.19.1-298-gc616721b1 on 2022-09-08; Mali CSF with Arm Cortex-M7
Type "help()" for more information.
>>> def fib(i):
...     if i <= 2:
...         return 1
...     else:
...         return fib(i - 1) + fib(i - 2)
...
>>> fib(20)
6765
It looks like Python. It runs on an MCU. It starts up much faster than real Python. It uses tracebuffers in GPU context memory rather than something only accessible by root and with a patched kernel. What more can I say?
The idea is to plan to throw one away, by implementing the firmware initially in Python, before doing a RiiR. It would also be really fun for users to be able to break things without needing to install a toolchain for rebuilding the firmware image.
If Asahi can write drivers in Python, why can't I?
The infrastructure isn't quite set up for this to work, but it should be soon.
fin.
So, what do we have?
Some understanding of how the firmware works, and enough handling of interrupts to keep kbase somewhat happy. gdb integration, and MicroPython support.
All of that lives in the panFWost git repository, though panloader also has some code related to timelines that may be useful for firmware reverse engineering.
A big thank you to Alyssa Rosenzweig for writing tools that were very helpful for getting to this point with the firmware. Unfortunately those tools are nonfree, so you can't use them yourself (if your name is not Alyssa).
There's also my mesa repository. At least somewhat usable OpenGL support for CSF GPUs should come Real Soon™, I promise*!
* not a promise.
About the author
Although not taking advantage of being old enough to drink, the jury may still be out on whether Icecream95 is too young to develop GPU drivers. Icecream95 ignores that, and spends his time working on a fork of the Panfrost driver in Mesa for Arm Midgard, Bifrost, and Valhall GPUs, though right now Valhall is the main focus. Ever since he learnt z80 assembly and spent hours reading through the ZX Spectrum manual chapters on the system memory layout, Icecream95 has liked the idea of having direct access to memory, with no pesky MMU in the way. Many years later, Icecream95 is finally able to do that with non-emulated hardware, at least if the GPU MMU is discounted, being external to the MCU.
Lightning McQueen also has 95 on his side, and certainly runs a lot of firmware on the many embedded MCUs required to get to a leading position in a race. This isn't the reason for Icecream95's username, though. The real reason is because "AAAARGH!" appears only on page 95 of the ZX Spectrum BASIC programming manual.