Mali G610 Reverse Engineering, Part 1

As of a few days ago, blob drivers for the Mali G610 Valhall GPU have been available, along with the firmware.

Yesterday I decided to download the blob, to see if I could get it working with Panloader before real hardware arrives, as I previously managed for v9 hardware (i.e. the G57 and G78 GPUs).

(The word "blob" comes from "Binary Large OBject", and refers to a blob of data or executable code that is opaque and not easily accessible, unlike Free Software, where the source code is available from your choice of 73 mirrors. In this blog post, "blob" usually refers to the user-mode "DDK" drivers that Arm makes for its GPUs, the source of which only licensees, and no-one else, are allowed to see and modify.)

I grabbed all of the G610 drivers, but started with libmali-valhall-g610-g6p0-dummy-gbm.so because it was the first in the file listing. I don't know how important the differences between them are.

glibc

Those who have tried running user-space blob drivers of any sort will know that library compatibility can be a problem, in particular with glibc, the GNU implementation of the C standard library. It is the base library that just about every program links to, for tasks from interfacing with the kernel to memory allocation and string processing.

glibc is problematic because it is one of the few libraries that care about symbol versioning: rather than just linking to a function such as fmemopen, a version is specified, such as fmemopen@GLIBC_2.22. The upside of this is that old binaries work on new glibc versions with all of the old quirks intact (such as "binary mode" in the fmemopen function). The downside is that you can't run new binaries on old glibc versions.
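You can poke at the versioning machinery yourself with the GNU dlvsym extension, which is dlsym with an explicit version argument. A minimal sketch; note that the version strings here are the AArch64 ones, and other architectures use a different base version:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
        /* Ask the already-loaded glibc for two specific versions of
         * fmemopen; both can coexist in the same library. */
        void *old_fn = dlvsym(RTLD_DEFAULT, "fmemopen", "GLIBC_2.17");
        void *new_fn = dlvsym(RTLD_DEFAULT, "fmemopen", "GLIBC_2.22");

        printf("fmemopen@GLIBC_2.17: %p\n", old_fn);
        printf("fmemopen@GLIBC_2.22: %p\n", new_fn);
        return 0;
}

A binary that linked against the old version keeps the old quirks forever; one linked against the 2.22 version gets the newer behaviour, and simply refuses to start on a glibc that lacks that version.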

For the Chrome OS blobs available for v9 Valhall, this caused big problems, because they require a glibc with patches not available in any distro. As a result, I ended up copying a dozen libraries including glibc from the Chrome OS partition on my Chromebook.

Thankfully, for v10 we can get a driver from Rockchip, which has an interest in binary compatibility with existing GNU/Linux distributions.

Unfortunately, this does not mean all Linux distributions, only those quick enough to have already upgraded to glibc 2.33. Slackware, the distro I run on my ASUS C201 Chromebook, has been on glibc 2.33 for a while, so all would be good there. But the RK3288 in that Chromebook is 32-bit, and the drivers are only available for AArch64.

So I had to compile glibc 2.33 for my other Chromebook, which runs Void and so was still on glibc 2.32. This does mean that I have to set LD_LIBRARY_PATH and explicitly specify the dynamic linker, so that the new glibc can still find the libraries belonging to the rest of the system:

LD_LIBRARY_PATH=/opt/glibc-2.33/lib:/lib /opt/glibc-2.33/lib/ld-linux-aarch64.so.1 $PROGRAM

Time to try running dEQP with the blob?

gpu id

ERROR: The DDK (built for 0xa0020000 r0p0 status range [0..15]) is not compatible with this Mali GPU device, /dev/mali0 detected as 0x9091 r0p0 status 0.

Clearly the GPU ID is wrong here, but how should it be fixed?

We don't actually know what the ID should be, so we'll have to search for one that works.

To speed up the brute-force search through all possible GPU IDs, I came up with the idea of forking just before the GPU properties query, at the last point in execution before having to decide on the ID.

I decided to let it loose with 4096 child processes before actually testing the idea. Trying to SSH back in to kill everything took quite a while.
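For the record, the shape of the idea, as a sketch. try_gpu_id is a hypothetical stand-in for re-running the blob's GPU properties query against a faked ID, and unlike my first attempt, this version politely waits for each child:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: fake the given ID in the GPU properties query and
 * return true if the blob accepts it. */
bool try_gpu_id(uint32_t gpu_id);

/* Call this at the last point before the blob decides on the ID, so
 * each fork() only pays for the query, not for all of the setup. */
uint32_t brute_force_gpu_id(void)
{
        for (uint32_t id = 0; id < 0x10000; id++) {
                pid_t pid = fork();
                if (pid == 0) /* child: one attempt, exit code is the verdict */
                        _exit(try_gpu_id(id << 16) ? 0 : 1);

                int status;
                waitpid(pid, &status, 0);
                if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
                        printf("found GPU ID 0x%x\n", id);
                        return id << 16;
                }
        }
        return 0;
}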

But sometimes the solution is right in front of you. Guess what built for 0xa0020000 means? It means: "The GPU ID should be 0xa002." Not that hard, is it?

Unknown ioctls

CSF is a pretty big change in v10 Valhall, so I expected a few new ioctls (the mechanism by which the blob communicates with mali_kbase), and was not disappointed.

Among them, these are the ones that actually got called when executing the dEQP test:

  • CS_GET_GLB_IFACE: Reads features supported by the CSF.

  • CS_TILER_HEAP_INIT: Initialises the tiler heap, which contains per-tile primitive draw commands.

  • CS_QUEUE_GROUP_CREATE_1_6: Create a queue "group", which seems to tie several command-stream queues together.

  • CS_QUEUE_REGISTER, CS_QUEUE_BIND: Create and register a command-stream queue, where the different commands for the CSF (command-stream firmware) to read are.

  • CS_QUEUE_KICK: Not referring to the sort of thing that happens when I use IRC, this tells the kernel that there are jobs in the CS queue which it should schedule and let the firmware start executing (see the sketch after this list).

  • KCPU_QUEUE_CREATE: The "kcpu" command queue is interpreted by the kernel and handles things like memory allocations which the CSF cannot do by itself.
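As promised, a sketch of what a queue kick looks like from user space. The struct layout is paraphrased from memory of the kbase CSF UAPI, and the command number is made up; the authoritative definitions live in mali_kbase_csf_ioctls.h in the kernel driver:

#include <stdint.h>
#include <sys/ioctl.h>

/* Assumed layout; check mali_kbase_csf_ioctls.h for the real one. */
struct kbase_ioctl_cs_queue_kick {
        uint64_t buffer_gpu_addr; /* GPU VA of the queue's ring buffer */
};

/* 0x80 is the kbase ioctl type; the command number 37 is illustrative. */
#define KBASE_IOCTL_CS_QUEUE_KICK \
        _IOW(0x80, 37, struct kbase_ioctl_cs_queue_kick)

/* Tell the kernel there are jobs in the queue at cs_va, so it can
 * schedule them and let the firmware start executing. */
static int cs_queue_kick(int mali_fd, uint64_t cs_va)
{
        struct kbase_ioctl_cs_queue_kick kick = { .buffer_gpu_addr = cs_va };
        return ioctl(mali_fd, KBASE_IOCTL_CS_QUEUE_KICK, &kick);
}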

CONFIG_MALI_NO_MALI

Once I had labelled all of the new ioctls, I noticed that the test crashed pretty quickly:

Thread 1 "ld-linux-aarch6" received signal SIGSEGV, Segmentation fault.
__memcpy_generic () at ../sysdeps/aarch64/multiarch/../memcpy.S:92
92              stp     A_l, A_h, [dstin]

Evidently I was handling a new ioctl wrongly, but without the expected behaviour to compare against, finding out which one it was, and in what way it was wrong, would be just about impossible.

But then I discovered that the kernel driver mali_kbase (which is Free Software; Arm wouldn't dare violate the terms of the GPL here) had a feature called MALI_NO_MALI, which allows using the kernel driver without actual hardware. This sounded like exactly what I wanted.

Actually getting it compiled and running wasn't too hard a challenge: I only wasted an hour and caused a kernel "oops" twice in the process.

The hardest problem was actually getting it to compile at all, and eventually I settled on:

make -j10 M=drivers/gpu/arm/ CONFIG_MALI_MIDGARD=m CONFIG_MALI_PLATFORM_NAME=fake

(Other settings were set in .config, but these two specific ones seemed to be problematic.)
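For reference, the remaining Mali options lived in .config; I believe the relevant ones look something like this, with names taken from the kbase Kconfig (treat the exact set as an assumption, since it varies between driver versions):

CONFIG_MALI_EXPERT=y
CONFIG_MALI_CSF_SUPPORT=y
CONFIG_MALI_NO_MALI=y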

I did have a few other problems, caused by the custom platform I created to avoid having to muck about with device trees.

Incidentally, I found that mali_kbase_model_dummy.c lists all of the GPU models, along with that elusive GPU ID I had so much trouble finding…

[   62.611285] mali_kbase: loading out-of-tree module taints kernel.
[   62.627947] kbasep_config_parse_io_resources: couldn't find proper resources
[   62.628635] mali mali.0: Kernel DDK version r35p0-01eac0
[   62.628654] mali mali.0: Using Dummy Model
[   62.628716] mali mali.0: GPU identified as 0x2 arch 10.8.0 r0p0 status 0
[   62.629027] mali mali.0: No OPPs found in device tree! Scaling timeouts using 100000 kHz
[   62.629687] mali mali.0: Probed as mali0

sleep ~0

After a while of waiting for dEQP, I got bored and decided to see where it had hung:

#0  __futex_abstimed_wait_common64 (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0xffffffffe088) at ../sysdeps/nptl/futex-internal.c:74
#1  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0xffffffffe088, expected=expected@entry=0, clockid=clockid@entry=0, 
    abstime=abstime@entry=0x0, private=private@entry=0) at ../sysdeps/nptl/futex-internal.c:123
#2  0x0000fffff782f148 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xffffffffe090, cond=0xffffffffe060) at pthread_cond_wait.c:504
#3  __pthread_cond_wait (cond=0xffffffffe060, mutex=0xffffffffe090) at pthread_cond_wait.c:619
#4  0x0000fffff52f289c in osup_sync_object_wait () from /tmp/libmali/libEGL.so.1
#5  0x0000fffff4f2d9b4 in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /tmp/libmali/libEGL.so.1
#6  0x0000fffff4e9fff0 in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /tmp/libmali/libEGL.so.1
#7  0x0000fffff4e9d634 in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /tmp/libmali/libEGL.so.1
#8  0x0000fffff4e9f20c in gles_vertexp_bb_neon_transform_and_produce_clip_bits () from /tmp/libmali/libEGL.so.1
#9  0x0000fffff4e53bb0 in ?? () from /tmp/libmali/libEGL.so.1
#10 0x0000fffff4e3b95c in ?? () from /tmp/libmali/libEGL.so.1
#11 0x0000fffff4e20690 in ?? () from /tmp/libmali/libEGL.so.1
#12 0x0000fffff4e20c6c in ?? () from /tmp/libmali/libEGL.so.1
#13 0x0000fffff4e1a114 in ?? () from /tmp/libmali/libEGL.so.1
#14 0x0000fffff7c5bf6c in ?? ()
#15 0x0000fffff8675ba0 in ?? ()

First of all, it seems that the AArch64 driver is compiled with frame pointers, which is why a full-ish backtrace could be printed. This is a frequent problem when debugging anything without debug symbols on other architectures: without frame pointers, the backtrace gets cut short. On AArch32, it would be unlikely to show anything below osup_sync_object_wait.
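Frame pointers make the walk trivial: on AArch64, register x29 points at a frame record of {caller's x29, saved x30}, so a backtrace is just a linked-list traversal. A simplified sketch (a real unwinder would validate each pointer before chasing it):

#include <stdio.h>

/* AArch64 frame record: x29 points at {previous x29, saved x30}. */
struct frame_record {
        struct frame_record *fp; /* caller's frame record */
        void *lr;                /* return address */
};

void backtrace_fp(void)
{
        struct frame_record *fr = __builtin_frame_address(0);
        while (fr) {
                printf("  %p\n", fr->lr);
                fr = fr->fp; /* NULL at the outermost frame */
        }
}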

…About that osup_sync_object_wait, what does it do?

I can say that it calls __pthread_cond_wait. But because I am avoiding actually reverse-engineering the blob, I do not and cannot know anything more about that function.

But clearly it's the cause of the hang. One thing you can do with a function without knowing anything about its internals is to patch it out:

/* Needs _GNU_SOURCE, <dlfcn.h>, <stdint.h>, <stdio.h> and <sys/mman.h>. */
const char *fns[] = {
        "osup_sync_object_wait",
        "osup_sync_object_timedwait",
};
for (unsigned i = 0; i < (sizeof(fns) / sizeof(*fns)); ++i) {
        uint32_t *function = (uint32_t *)dlsym(RTLD_DEFAULT, fns[i]);
        if (!function)
                continue; /* symbol not found */
        /* Make the page containing the entry point writable... */
        if (mprotect((void *)((uintptr_t)function & ~4095UL), 4096,
                     PROT_READ | PROT_WRITE | PROT_EXEC))
                perror("mprotect");
        /* ...then overwrite the first instruction with "ret". */
        function[0] = 0xd65f03c0;
        /* Flush so the patched instruction is visible to the I-cache. */
        __builtin___clear_cache((char *)function, (char *)(function + 1));
}

Unfortunately, that just made it hang in ppoll.

While for v9 and earlier it's possible to patch the ppoll to fake an event, this doesn't work for v10, because the command stream firmware uses a completely different mechanism for signalling completed GPU jobs.
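For reference, the v9 trick is an LD_PRELOAD-style interposer on ppoll, along these lines. mali_fd_is_kbase_event is a hypothetical helper for recognising the fd the blob polls for job completion; the event-faking details depend on what the blob is actually waiting on:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <poll.h>
#include <signal.h>
#include <time.h>

/* Hypothetical: is this the fd the blob polls on for kbase events? */
int mali_fd_is_kbase_event(int fd);

typedef int (*ppoll_fn)(struct pollfd *, nfds_t,
                        const struct timespec *, const sigset_t *);

int ppoll(struct pollfd *fds, nfds_t nfds,
          const struct timespec *tmo_p, const sigset_t *sigmask)
{
        static ppoll_fn real_ppoll;
        if (!real_ppoll)
                real_ppoll = (ppoll_fn)dlsym(RTLD_NEXT, "ppoll");

        /* Claim the job-completion fd is readable, so that the blob
         * thinks a job has finished. */
        for (nfds_t i = 0; i < nfds; i++) {
                if (mali_fd_is_kbase_event(fds[i].fd)) {
                        fds[i].revents = POLLIN;
                        return 1;
                }
        }
        return real_ppoll(fds, nfds, tmo_p, sigmask);
}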

Do you really need to execute more than one job?

At this point, I gave up.

I didn't need to run large games such as SuperTuxKart with the driver, just small dEQP tests that try out a single piece of functionality and usually don't need many GPU jobs.

Because the actual format of the CSF command-stream is still unknown, trying to get anything out of it would take too long for a quick demo.

But because the areas of memory containing shaders are clearly marked as such, it was easy to tell the memory dump code to also disassemble shaders while creating an ELF core dump. So, here is what is possibly a vertex shader, for dEQP-GLES2.functional.draw.random.16.

This proves that v10 uses a very similar ISA to v9, and also shows just how complete Alyssa's excellent reverse engineering of the v9 ISA already is. But I have already found at least one instruction that the disassembler cannot handle yet, so there are probably at least some differences between the architectures.

82 81 00 28 f4 82 6a 00    LD_BUFFER.i64.unsigned.slot0 @r2:r3, u2, u1
80 81 00 68 f4 80 6a 00    LD_BUFFER.i64.unsigned.slot1 @r0:r1, u0, u1
83 81 00 a8 f4 be 6a 00    LD_BUFFER.i64.unsigned.slot2 @r62:r63, u3, u1
c0 00 00 00 00 fc 10 01    IADD_IMM.i32 r60, 0x0, #0x0
3c 00 00 00 00 fd 91 08    MOV.i32.wait0 r61, r60
02 00 00 30 e6 84 60 08    LOAD.i96.unsigned.slot0.wait0 @r4:r5:r6, r2, offset:0
05 00 00 00 00 c7 91 10    MOV.i32.wait1 r7, r5
00 04 00 18 02 46 61 00    STORE.i32.slot0 @r6, r0, offset:4
04 00 00 00 00 c6 91 00    MOV.i32 r6, r4
00 08 00 38 08 44 61 08    STORE.i128.slot0.wait0 @r4:r5:r6:r7, r0, offset:8
42 00 00 30 e6 82 60 08    LOAD.i96.unsigned.slot0.wait0 @r2:r3:r4, `r2, offset:0
02 44 00 00 00 c4 a0 00    IADD.u32 r4, r2, `r4
42 04 c0 80 01 c2 f0 00    ICMP.u32.gt.m1 r2, `r2, r4, 0x0
43 42 01 00 00 c5 a0 00    ISUB.u32 r5, `r3, `r2
40 18 00 28 04 44 61 00    STORE.i64.slot0 @r4:r5, `r0, offset:24
84 81 00 68 f4 80 6a 28    LD_BUFFER.i64.unsigned.slot1.wait02 @r0:r1, u4, u1
7e 00 00 28 04 7c 61 18    STORE.i64.slot0.wait01 @r60:r61, `r62, offset:0
00 00 00 28 f4 82 60 08    LOAD.i64.unsigned.slot0.wait0 @r2:r3, r0, offset:0
02 c0 03 11 01 c4 b4 00    LSHIFT_OR.i32 r4, r2, 0x0.b00, r3
44 0b 00 00 10 c0 1f 50    BRANCHZ.eq.reconverge `r4, offset:11
02 00 00 18 82 84 60 08    LOAD.i32.unsigned.slot0.wait0 @r4, r2, offset:0
c4 c0 44 10 71 c4 b4 00    LSHIFT_AND.i32 r4, 0x1000000.b3, 0x0.b00, `r4
c4 c0 04 12 71 c5 b4 00    LSHIFT_XOR.i32 r5, 0x1000000.b3, 0x0.b00, r4
42 00 00 98 02 45 61 00    STORE.i32.slot2 @r5, `r2, offset:0
44 06 00 00 10 c0 1f 50    BRANCHZ.eq.reconverge `r4, offset:6
00 00 00 00 00 c0 00 20    NOP.wait2 
00 08 00 28 f4 be 60 00    LOAD.i64.unsigned.slot0 @r62:r63, r0, offset:8
40 10 00 58 82 82 60 08    LOAD.i32.unsigned.slot1.wait0 @r2, `r0, offset:16
3e 00 00 18 82 83 60 18    LOAD.i32.unsigned.slot0.wait01 @r3, r62, offset:0
43 42 00 00 00 c2 a0 00    IADD.u32 r2, `r3, `r2
7e 00 00 18 02 42 61 50    STORE.i32.slot0.reconverge @r2, `r62, offset:0
00 00 00 00 00 c0 00 78    NOP.return 

But what was the ELF core dump for?

…No reason, it's not as if I prefer using my own, um, disassembler or anything:

[Screenshot: the shader open in Ghidra]

fin.

A patched version of kbase that makes it easy to compile a "NO_MALI" driver is here. I have not yet pushed my changes to panloader, but you aren't missing out on much.

If you have any questions, you can contact me at @ixn@mastodon.xyz, or on OFTC IRC as icecream95. For questions about my fork of Panfrost, join the #panfork channel on OFTC (webchat), or via Matrix at #_oftc_#panfork:matrix.org.

About the author

Apparently too young both to drink and to develop GPU drivers, Icecream95 ignores only the second of these, and spends his time working on a fork of the Panfrost driver in Mesa for Arm Midgard, Bifrost, and Valhall GPUs. Since landing his first reverse-engineered feature in Mesa back when he could still write his age with a single hex digit, Icecream95 has wasted a lot of time on reverse engineering things that will never be of much use upstream. Now his age in hex is a palindrome, and so he will waste even more time trying to work out the significance of that.

Lightning McQueen also has 95 on his side, and possibly has done some reverse engineering work, but Icecream95 does not remember the movies very well. This isn't the reason for Icecream95's username, though. Icecream98 was already taken on Scratch, and he thought that Microsoft would complain too much if he instead decided to reverse engineer Windows XP.
