An important part of any modern graphics driver is shader
compilation. Applications hand the GPU driver shaders written in an
API-specific language, which the driver then has to compile on the fly
into programs that can be executed by the GPU.
The GPU cores used in modern graphics processors use an instruction
set that is in some ways like a RISC CPU, but with a few
distinguishing features. Instructions are often quite long, because
there are more registers to access: usually more than 100, while
AArch64 has only 31 and 32-bit ARM has 14. Instructions often have
extra bits for encoding things which CPUs can't do, such as applying
an abs operation to any instruction source, or waiting on results
from message-passing instructions before execution. There are a bunch
of other things unique to GPU cores, such as texture sampling and
varying interpolation instructions, but they aren't relevant to this
blog post, so I won't describe them further.
The shaders can be used for many different tasks, from transforming
the vertices making up a 3D mesh, to shading a pixel by sampling
from a texture and doing lighting calculations, as well as
post-process effects and tasks done by compute shaders, which can
include physics simulations or bitcoin mining.
If a game wants to load a new shader in the middle of play, the driver
can only take a couple of milliseconds to compile the shader, or the
user will start to notice stuttering. For slower paced applications,
or while a loading screen is shown, taking time is more forgivable,
and a patient user will not mind having to wait a few seconds.
The shader I am looking at here takes three minutes to compile,
using the Panfrost driver with an Arm Mali Bifrost GPU.
Well, it took that long. It's a little faster now.
Initial testing
When driver developers are bored, they try their drivers with a bunch
of programs and make pretty graphs by comparing the results with older
versions of their driver, or with competitors.
Opening up 50 different applications all the time can get boring after
a while, so there are automated systems to make the process a little
easier. One of these is shader-db, which has collected shaders
from many sources: benchmarks, games, desktop environments, rendering
frameworks...
When doing a shader-db run (which compiles all of the shaders in the
database), I noticed that one shader kept compiling for a long time
after all of the other threads had finished.
Let's run this one by itself:
$ time ./run shaders/skia/781.shader_test
ATTENTION: default value of option allow_glsl_extension_directive_midshader overridden by environment.
ATTENTION: default value of option allow_glsl_120_subset_in_110 overridden by environment.
ATTENTION: default value of option allow_glsl_builtin_variable_redeclaration overridden by environment.
ATTENTION: default value of option allow_glsl_compat_shaders overridden by environment.
shaders/skia/781.shader_test - MESA_SHADER_VERTEX shader: 10 inst, 10 tuples, 3 clauses, 2.000000 cycles, 0.333333 arith, 0.000000 texture, 0.000000 vary, 2.000000 ldst, 10 quadwords, 1 threads, 0 loops, 0:0 spills:fills
shaders/skia/781.shader_test - MESA_SHADER_VERTEX shader: 33 inst, 31 tuples, 15 clauses, 15.000000 cycles, 0.708333 arith, 0.000000 texture, 0.000000 vary, 15.000000 ldst, 35 quadwords, 1 threads, 0 loops, 0:0 spills:fills
shaders/skia/781.shader_test - MESA_SHADER_FRAGMENT shader: 13932 inst, 10255 tuples, 2316 clauses, 705.000000 cycles, 406.500000 arith, 705.000000 texture, 1.750000 vary, 109.000000 ldst, 9489 quadwords, 1 threads, 64 loops, 14:92 spills:fills
Thread 0 took 208.77 seconds and compiled 2 shaders (not including SIMD16) with 1 GL context switches
real 3m23.467s
user 3m18.707s
sys 0m4.028s
(Why are there two MESA_SHADER_VERTEX
entries? Because one shader
calculates the vertex position, and the other calculates any varyings,
making use of the IDVS feature of Bifrost GPUs.)
I doubt that any user would be patient enough to ever wait this long
for a shader to compile.
This is an optimised build of the driver; with a debug build it takes
a bit longer:
Thread 0 took 620.58 seconds and compiled 2 shaders (not including SIMD16) with 1 GL context switches
real 10m14.773s
user 10m8.474s
sys 0m4.468s
(I don't know what's wrong with the time printed by shader-db. For
both release and debug builds, it is at least five seconds longer than
what bash's time
shows.)
Finding out what's wrong
We can't just go off and optimise random things when we don't know if
they are taking up any execution time or not!
So the first step is to use a profiler to see which functions take up
the most CPU time.
My usual tool for this is perf(1):
$ time perf record ./run shaders/skia/781.shader_test
shaders/skia/781.shader_test - MESA_SHADER_VERTEX shader: 10 inst, 10 tuples, 3 clauses, 2.000000 cycles, 0.333333 arith, 0.000000 texture, 0.000000 vary, 2.000000 ldst, 10 quadwords, 1 threads, 0 loops, 0:0 spills:fills
shaders/skia/781.shader_test - MESA_SHADER_VERTEX shader: 33 inst, 31 tuples, 15 clauses, 15.000000 cycles, 0.708333 arith, 0.000000 texture, 0.000000 vary, 15.000000 ldst, 35 quadwords, 1 threads, 0 loops, 0:0 spills:fills
shaders/skia/781.shader_test - MESA_SHADER_FRAGMENT shader: 13932 inst, 10255 tuples, 2316 clauses, 705.000000 cycles, 406.500000 arith, 705.000000 texture, 1.750000 vary, 109.000000 ldst, 9489 quadwords, 1 threads, 64 loops, 14:92 spills:fills
Thread 0 took 206.98 seconds and compiled 2 shaders (not including SIMD16) with 1 GL context switches
[ perf record: Woken up 125 times to write data ]
[ perf record: Captured and wrote 31.196 MB perf.data (817608 samples) ]
real 3m26.045s
user 3m19.573s
sys 0m4.545s
$ perf report --stdio | grep -v '^#' | head
78.29% run libgallium_dri.so [.] lcra_solve
9.95% run libgallium_dri.so [.] bi_register_allocate
9.46% run libgallium_dri.so [.] bi_compute_interference
0.78% run libgallium_dri.so [.] bi_compute_liveness
0.14% run [unknown] [k] 0xffffd42bafb4566c
0.05% run libc-2.32.so [.] __GI___memset_generic
0.05% run libc-2.32.so [.] __memcpy_generic
0.05% run libgallium_dri.so [.] dce_cf_list.isra.0
0.04% run libgallium_dri.so [.] hash_table_search
0.04% run libgallium_dri.so [.] bi_opt_dead_code_eliminate
So the function lcra_solve, which is the core of the register
allocator, is using the most CPU time. This function iterates through
every instruction in the shader, and attempts to assign a register for
the instruction destination.
We can also use taskset to see what it's like when running on the in-order A53 cores on my computer:
$ time taskset -c 0-3 perf record --output perf.little.data ./run -d 1 shaders/skia/781.shader_test
...
[ perf record: Woken up 191 times to write data ]
[ perf record: Captured and wrote 47.637 MB perf.little.data (1248562 samples) ]
real 5m15.041s
user 5m4.207s
sys 0m6.736s
$ perf report --stdio --input perf.little.data | grep -v '^#' | head
70.67% run libgallium_dri.so [.] lcra_solve
18.05% run libgallium_dri.so [.] bi_register_allocate
8.51% run libgallium_dri.so [.] bi_compute_interference
0.97% run libgallium_dri.so [.] bi_compute_liveness
0.09% run [unknown] [k] 0xffffd42bafb4566c
0.07% run libc-2.32.so [.] __memcpy_generic
0.07% run libgallium_dri.so [.] dce_cf_list.isra.0
0.06% run libgallium_dri.so [.] match_expression
0.06% run libgallium_dri.so [.] ir_expression::accept
0.05% run libgallium_dri.so [.] bi_writemask
What do all these functions even do?
So we know that some functions called lcra_solve,
bi_register_allocate, and bi_compute_interference use a lot of CPU
time, but what are these functions even for?
First I'll write about register allocation in general. An important
part of compiling for any processor (Forth programmers, shush) is
register allocation. CPUs and GPUs have a set of registers, which can
store variables for fast access, but there are only a small number of
them, so some work is needed to make sure that they are used as
efficiently as possible. With complex control flow such as loops, this
becomes a difficult task, because new variables can appear seemingly
out of nowhere halfway through iterating over the
instructions. bi_register_allocate
is the top-level function which
calls other functions to do this allocation work. It doesn't do much
work by itself, but because of inlining it appears to do a lot more in
the assembly output from the compiler.
Next: What is LCRA? This is a register allocation algorithm that
is supposedly "fast", using an "efficient" greedy solver. I do not
doubt Alyssa Rosenzweig's claims here, but
looking at the "reference implementation" of LCRA will not convince
you of them!
The algorithm works by first calculating linear constraints between
nodes (hence the name: Linearly Constrained Register Allocation). That
is, it looks at every pair of instructions where the value written by
one instruction is live at the same time as the value written by the
other instruction. Because these values need to be stored at the same
time, they cannot be assigned the same storage location
(register). But nodes can occupy a variable number of adjacent
registers, so rather than storing a single bit per pair, we store a
seven-bit value, which is the linear constraint.
(The majority of nodes are SSA, meaning that there is only one
instruction which writes to the node. But other nodes are written many
times, for example if there are if statements to change a variable's
value.)
Once that process is done, the actual allocation can happen. For each
node, we pick a register number, then iterate over the linear
constraint array, linear. For each node that has already been assigned
a register, we take the difference of the two register numbers and
look up the corresponding bit in the constraint byte. If that bit is
set, there is a collision, and we try the algorithm again with a new
candidate register.
For example, if we have a four-register node i and a two-register
node j, we would get a constraint of 0x7c as a "forwards"
constraint. Or as individual bits (note that the bit order here is
opposite to the LCRA reference implementation):
 3 |  2 |  1 |  0 | -1 | -2 | -3
 1 |  1 |  1 |  1 |  1 |  0 |  0
This means that when finding a register for node j, it must be at
least four registers above the base register chosen for i, or at least
two registers below. For example, if we first chose r5 for i and then
r4 for j, then the second register component of j would overwrite the
first register component of i, and so this must not happen. To check
this, the register allocator subtracts these two values, getting a
difference of -1. It looks this bit up in the linear constraint value,
and seeing that it is set decides that another register must be
chosen.
The chosen register for each node is stored in the solutions array.
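To make the example above concrete, here is a minimal sketch (not the
driver's actual code) of how such a "forwards" constraint byte could
be built from the sizes of the two nodes; the constraint_byte function
is made up for illustration:

#include <stdint.h>

/* Sketch: build the "forwards" constraint byte for two nodes that
 * occupy size_i and size_j consecutive registers. Bit (d + 3) is set
 * when placing j at offset d from i would make the ranges overlap. */
static uint8_t
constraint_byte(unsigned size_i, unsigned size_j)
{
        uint8_t c = 0;

        for (signed d = -3; d <= 3; ++d) {
                /* j covers [d, d + size_j), i covers [0, size_i) */
                if (d + (signed)size_j > 0 && d < (signed)size_i)
                        c |= 1 << (d + 3);
        }

        return c;
}

/* constraint_byte(4, 2) == 0x7c, matching the table above. */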
Coming back to bi_register_allocate: As we saw from perf, the loop in
the function where most of the time is spent is counting the number of
bits in an array of bytes. These bytes are linear constraints, and the
loop is used to find which nodes have the most constraints, because
they restrict allocation for the highest number of other nodes.
bi_compute_interference calculates the linear constraints used by
LCRA, but will not be covered in more depth until part two or three of
this series.
Solving the mystery of solving
We want to look at lcra_solve first, because it is using by far the
most CPU time.
Looking at the source code, it seems that lcra_solve doesn't do much
work itself, but leaves it to lcra_test_linear, which evidently got
inlined. What lcra_test_linear does is check node i against all of the
other nodes to make sure that there is no interference:
static bool
lcra_test_linear(struct lcra_state *l, unsigned *solutions, unsigned i)
{
        uint8_t *row = &l->linear[i * l->node_count];
        signed constant = solutions[i];

        for (unsigned j = 0; j < l->node_count; ++j) {
                if (solutions[j] == ~0) continue;

                signed lhs = solutions[j] - constant;

                if (lhs < -3 || lhs > 3)
                        continue;

                if (row[j] & (1 << (lhs + 3)))
                        return false;
        }

        return true;
}
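For context, the caller works roughly like the sketch below, which is
based only on the description above: pick a candidate register for
each node and retry until the linear test passes. The real lcra_solve
also handles register classes, alignment, and spilling, so treat this
as a simplification rather than the driver's code.

/* Simplified sketch of the solve loop (not the real lcra_solve). */
static void
lcra_solve_sketch(struct lcra_state *l, unsigned *solutions)
{
        for (unsigned i = 0; i < l->node_count; ++i) {
                /* Try candidate registers until one doesn't collide.
                 * (Running out of candidates would mean spilling,
                 * which is ignored here.) */
                for (unsigned reg = 0; reg < 64; ++reg) {
                        solutions[i] = reg;
                        if (lcra_test_linear(l, solutions, i))
                                break;
                }
        }
}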
NEON
My first action for optimisation work is often to try and make the
existing code run faster, without changing how it works or how often
it is run. Here, I started by rewriting lcra_test_linear
to use NEON
intrinsics. This is a Single Instruction Multiple Data (SIMD)
instruction set for ARM that allows processing up to 16 elements at
once.
While it may seem inefficient to work in this way, as later changes at
a higher level can require discarding the initial optimisations, I
think that it is sometimes still helpful to start at a low level, so
that you can become more familiar with what the code actually does,
which can make it easier to think of ideas for higher level changes.
(Even if it's not actually you that thinks of the higher-level
change. If I remember correctly, Panfrost's index min/max cache
was inspired by my optimisation to allow NEON to be used for min/max
calculations.)
So, what exactly are "NEON intrinsics"?
Usually CPU instructions operate on a single item of data at a time,
doing a single addition, or single multiplication, and so on. But just
about every applications-focused CPU has for a long time supported
some variant of SIMD instructions, which operate on multiple pieces of
data in a single instruction. x86 and AMD64 have MMX, SSE, and AVX,
Power has AltiVec, and ARM has NEON, SVE and, for DSP-type
embedded applications, Helium. ARM CPUs in currently available
consumer devices mostly have just NEON.
Compilers can sometimes make use of these instructions through an
"auto-vectorizer", but they are unable to optimise as much as can be
done with hand-written assembly. Intrinsics are a path in the middle,
using C function calls that get compiled down to the vector
instructions. This gives almost the same performance as hand-written
assembly, but also has advantages like allowing the same code to
mostly work for both 32-bit and 64-bit builds, and preventing common
assembly mistakes such as forgetting to put a return instruction at
the end of a function.
With SIMD, rather than doing a single addition, you might now do
sixteen additions in a single instruction. This might make the
operation 16x faster, but in reality programs often see lower gains
than this, because memory accesses take almost as long as with scalar
code, or even longer if unnecessary components of a vector are loaded
from another cache line.
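As a small, self-contained illustration (nothing to do with the
driver), here is what those sixteen additions look like when written
with NEON intrinsics; the function name and buffers are invented for
the example:

#include <arm_neon.h>
#include <stdint.h>

/* Add sixteen pairs of bytes at once. */
static void
add_16_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
        uint8x16_t va = vld1q_u8(a);     /* load 16 bytes from a */
        uint8x16_t vb = vld1q_u8(b);     /* load 16 bytes from b */
        vst1q_u8(dst, vaddq_u8(va, vb)); /* one ADD covers all 16 lanes */
}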
Preparation
Because of this, my first step before writing a single NEON intrinsic
function call was to try and reduce the memory bandwidth required. In
the lcra_test_linear
function, the solutions
array has 32-bit
elements, but Bifrost only has 64 different registers so this can be
reduced to eight bits. This change did not make shader compilation
faster at all, indicating that this function was CPU-bound, and was
never waiting on memory reads. It still meant that rewriting the
function to use NEON would be easier.
The second preparatory step was to do some size alignment for the
linear array. One major downside of SIMD (though fixed in the newest
instruction sets like AVX-512 and SVE) is that the execution width is
fixed: if a routine accesses sixteen elements, it will always access
that many elements, even if the size of the array that you are
iterating over is not a multiple of sixteen. This means that the final
iteration of the loop will access beyond the end of the array. So some
elements are added to the end of each row to round the number of
elements up to a multiple of sixteen, and these get set to zero so
that they do not change the results of the function. This change
actually hurt performance, making the shader take four seconds longer
to compile. Possibly it increased the number of cache conflict misses
(when inserting an entry into the CPU cache requires another active
cache line to be evicted) while building up the linear array.
(When doing this, I forgot to also align the solutions array. This
isn't usually a problem, because more memory is allocated after
solutions, so reading slightly beyond the end of the array should not
end up reading unmapped memory, and the actual contents don't matter
if the corresponding element of linear is zero.)
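As a rough sketch of the idea (the real allocation code in the driver
differs), the padded row length and zeroed padding could be set up
like this, reusing the linear_stride name that appears in the code
below:

/* Round each row of `linear` up to a multiple of 16 bytes so the NEON
 * loop can always read whole vectors; calloc() zeroes the padding, so
 * the extra lanes never report a collision. */
l->linear_stride = (l->node_count + 15) & ~15u;
l->linear = calloc(l->node_count, l->linear_stride);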
Now that we have done the preparatory work, it is time to write the
function. Here is the original inner loop again:
if (solutions[j] == LCRA_NOT_SOLVED) continue;

signed lhs = solutions[j] - constant;

if (lhs < -3 || lhs > 3)
        continue;

if (row[j] & (1 << (lhs + 3)))
        return false;
For SIMD instruction sets, we can't do branching like this code does,
but rather we have to execute every instruction for every piece of
data.
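As a tiny illustration of what that means in practice (again, not
code from the driver), a scalar if/else typically becomes a comparison
that produces a per-lane mask, followed by a select that runs for
every lane:

#include <arm_neon.h>

/* Branchless per-lane maximum: equivalent to
 * `out[i] = (a[i] > b[i]) ? a[i] : b[i]` for all sixteen lanes. */
static int8x16_t
lane_max(int8x16_t a, int8x16_t b)
{
        uint8x16_t mask = vcgtq_s8(a, b); /* 0xFF where a > b, else 0 */
        return vbslq_s8(mask, a, b);      /* pick from a or b per lane */
}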
When writing a function like this, it can be useful to start with the
most specialised instruction and build the rest around it. In this
case, that is the left shift by lhs + 3. The "ARM ARM" (Architecture
Reference Manual) lists many shift instructions, but the ones we are
interested in come under the "SIMD arithmetic" section.
Because we are doing bitwise operations, not arithmetic, we certainly
do not want saturating instructions! Then, a bit that got shifted out
would cause all of the other bits to be set to 1, ruining the
result. We are not intending to do any right shifting, so the choice
of rounding vs non-rounding shifts is unimportant. So we could use
either a signed or an unsigned left shift, SSHL or USHL.
We want an unsigned shift, right?
No.
In the scalar code, we do the shift if lhs
is in the range [-3,
3]. This means that the eventual shift amount is in the range [0, 6],
covering seven bits. Because it is harder to conditionally execute
code with SIMD instruction sets, we will likely end up doing the shift
even when lhs
is outside its range. For an unsigned shift, we will
sometimes shift by seven bits, leaving a single bit in the output that
we have to deal with. A signed shift will remove this for us! (Signed
shifts never change the top bit of the element, and right shifts copy
in the top bit rather than a constant 0.)
Using signed shifts only ends up saving a single instruction per
execution of lcra_test_linear, but it's a fun hack nonetheless.
(This trick to avoid clamping the shift amount cannot be used in
regular C code. Shifting by a greater amount than the width of the
value or by a value less than zero is undefined behaviour. It wouldn't
work anyway, because even for ARM the LSL
instruction only uses the
lower bits for the shift amount, as the pseudocode shows: result =
ShiftReg(n, shift_type, UInt(operand2) MOD datasize);)
Writing the function
We now have one line of the function:
int8x16_t shifted = vshlq_s8(vdupq_n_s8(1), lhs);
But!.. This doesn't look like the SSHL instruction which I talked
about earlier. What's going on?..
When ARM first introduced NEON for 32-bit ARMv7 CPUs, it had a
different syntax. All instructions were prefixed with v, so the shift
instruction might look like this: vshl.u8 q8, q8, q9. But the
actual vector width used in the instruction (NEON supports both 64
bits and 128 bits) is only encoded in the register names, so for the
intrinsic functions a q
suffix needs to be added.
But AArch64 has a cleaner (and arguably nicer) syntax, where
instructions have the same name between the base and NEON instruction
sets. To maintain compatibility, this new syntax is only used when
programming in assembly, and the intrinsic functions stay the same.
With that out of the way, we can now add the instructions to load the
data above it:
int8_t *row = (int8_t*)&l->linear[i * l->linear_stride];

/* Create a sixteen-element vector with `3 - solutions[i]` in
 * each element */
int8x16_t constant = vdupq_n_s8(3 - solutions[i]);

for (unsigned j = 0; j < l->linear_stride; j += 16) {
        /* Load 16 bytes from `solutions` and add 3 to each byte */
        int8x16_t lhs = vaddq_s8(vld1q_s8(solutions), constant);

        /* Load 16 bytes from `row`. */
        int8x16_t rhs = vld1q_s8(row);

        solutions += 16;
        row += 16;

        /* This will return zero for small or large lhs, no need to
         * clamp */
        int8x16_t shifted = vshlq_s8(vdupq_n_s8(1), lhs);
Notice that we have now gone from calculating solutions[j] - constant
and then adding three, to adding the three into the constant, which
saves an instruction. Clang can do this optimisation by itself, but
GCC didn't in this case.
..But where has the check of solutions[j] against LCRA_NOT_SOLVED
gone? The answer is that LCRA_NOT_SOLVED is defined as -15, so lhs is
in the range (-76,-12]. So SSHL always does a right shift, making
shifted end up as zero, which is what we want.
Once we have defined an int8x16_t res = vdupq_n_s8(0); to hold the
result, we can write the rest of the function:
        int8x16_t collide = vandq_s8(shifted, rhs);
        res = vorrq_s8(res, collide);
}

return vmaxvq_s8(res) == 0;
This is a common technique when using SIMD. Rather than trying to
emulate if (cond) return false; per iteration, when the function
usually doesn't exit that way until it has iterated through a
considerable proportion of the total, it can be simpler to set a flag
bit when the condition fails and exit at the end.
In this case, after finding the bits representing the collisions, we
use an ORR instruction (ARM's mnemonic for inclusive-or) to set the
bits in the result vector, and then at the end use MAXV, which returns
the largest element of the vector, and compare the result to zero.
MAXV will only return zero if there are no set bits in the result
vector (assuming the sign bit is never set), which would mean that
there are no collisions.
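Putting the pieces together, the whole vectorised function looks
roughly like this. It is a sketch under the assumptions made earlier
(solutions shrunk to int8_t elements, rows of linear padded out to
linear_stride bytes), so the code in the driver may differ in detail:

#include <arm_neon.h>
#include <stdbool.h>
#include <stdint.h>

static bool
lcra_test_linear(struct lcra_state *l, int8_t *solutions, unsigned i)
{
        int8_t *row = (int8_t*)&l->linear[i * l->linear_stride];

        /* `3 - solutions[i]` in every lane */
        int8x16_t constant = vdupq_n_s8(3 - solutions[i]);
        int8x16_t res = vdupq_n_s8(0);

        for (unsigned j = 0; j < l->linear_stride; j += 16) {
                int8x16_t lhs = vaddq_s8(vld1q_s8(solutions), constant);
                int8x16_t rhs = vld1q_s8(row);
                solutions += 16;
                row += 16;

                /* Returns zero for small or large lhs (including
                 * LCRA_NOT_SOLVED), so no clamping is needed */
                int8x16_t shifted = vshlq_s8(vdupq_n_s8(1), lhs);
                int8x16_t collide = vandq_s8(shifted, rhs);
                res = vorrq_s8(res, collide);
        }

        /* Any set bit in res means a collision somewhere in the row */
        return vmaxvq_s8(res) == 0;
}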
real 1m41.791s
user 1m37.260s
sys 0m3.984s
A nice improvement, huh?
Compilation of our shader is now twice as fast as it was before.
Running perf
again and doing some calculations, we can see that
lcra_solve
went from taking 78.3% of 203 seconds, or 159 seconds, to
29.9% of 65 seconds, or 19 seconds. That is over eight times faster,
an impressive improvement from vectorization!
When I first saw this, I thought that it was as good as the
improvement was going to get, and went off to do other things for a
while. But you got promised a 1000x improvement in speed, so you can
see that my thought was clearly wrong!
This is only the first step in improving speed. I've already written
another three or four blog posts worth of content about the other
optimisations, so stay tuned (and subscribe to the blog's feed)
for when they come up.
math.log(2, 1000) ≈ 0.1, so measured logarithmically (the natural way
to combine multiplicative speedups) we are only 10% of the way to the
1000x speedup. But some of the other optimisations will be more
impactful, contributing as much as 25% towards the 1000x goal, so
there won't be ten more blog posts!
A fork in the driver
Most of the speed-up patches are available today in Panfork, my
fork of the Panfrost driver for Arm Mali GPUs. The patches in the
release branch there give a 125x speedup for the shader this blog post
focuses on, with the incomplete patches for the last 10x in the
ra-hacks
branch.
It appears that at the moment Valhall support is a higher priority
for upstream, so there is no indication that they have noticed the
patches described in this blog post series yet.
(Anyone from upstream reading this: please can we find a way to work
together more closely on Panfrost again? There's no need to be scared,
I don't bite, at least not intentionally.)
fin.
If you have any questions, you can contact me at
@ixn@mastodon.xyz, or on OFTC IRC as icecream95. For questions about
the fork, join the #panfork channel on OFTC (webchat), or via Matrix
at #_oftc_#panfork:matrix.org.
Finally, thanks to notklaatu of GNU World Order fame for
giving some feedback and suggestions for this blog post.
About the author
Apparently too young both to drink and to develop GPU drivers,
Icecream95 ignores only the second of these, and spends his time
working on a fork of the
Panfrost driver in Mesa for Arm Midgard, Bifrost, and Valhall
GPUs. From his first contribution to Mesa back when he could still
write his age with a single hex digit, Icecream95 has focused on
reducing CPU overhead to make games run faster. Now his age in hex is
a palindrome, and so.. um.. I'm sure it has some significance.
Lightning McQueen also has 95 on his side, and he goes fast, but this
isn't the reason for Icecream95's username. Icecream98 was already
taken on Scratch, and he wasn't going to go in the other direction and
choose IcecreamXP.