A Take on How to Perform Matrix Multiplication with Intrinsics

Introduction

When working on a performance-critical application, you want to do as much work as possible in as few CPU clock cycles as possible. Intrinsics are platform-specific functions that the compiler can, most of the time, replace with a single assembly instruction.
In this blog, I'm going to walk through a couple of ways to perform matrix multiplication with intrinsics (specifically, Intel's SIMD SSE intrinsics).
Let's work with the following 4×4 matrix. Each of its rows (or columns, if we are implementing a column-major matrix) fits snugly in a single XMM register. The traditional approach to matrix multiplication is to multiply Row 'i' with Column 'j' to get element 'Eij'.
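As a baseline, the traditional definition can be sketched in plain scalar C++ (no intrinsics; the `mat4` alias and the function name `mul_scalar` are mine, just for illustration):

```cpp
#include <cstddef>

// Scalar reference: element (i, j) of the result is the dot product of
// row i of A with column j of B. 'mat4' is a plain 4x4 float array.
using mat4 = float[4][4];

void mul_scalar(const mat4& a, const mat4& b, mat4& out)
{
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
        {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k)
                sum += a[i][k] * b[k][j]; // Row i of A dot Column j of B
            out[i][j] = sum;
        }
}
```

The intrinsics versions below are just different ways of scheduling these same multiply-adds onto SIMD lanes.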

Matrix Multiplication

This is basically a dot product of the ith row and the jth column. With intrinsics, this approach is implemented by first writing a dot product and then calling it for each element of the result. Here's how this can be implemented.
In the following implementations, "pack" is just a variable of type "__m128" packed within a union.
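For reference, here is a plausible sketch of what such a union might look like; the original definition isn't shown in this post, so take the member names as an assumption:

```cpp
#include <xmmintrin.h> // SSE intrinsics, __m128

// A sketch of the vector type the snippets assume: four named floats
// overlaid on a single __m128, so 'pack' aliases (x, y, z, w).
// Anonymous structs in unions are a common compiler extension
// (accepted by GCC, Clang, and MSVC), not strict standard C++.
union QVec4
{
    struct { float x, y, z, w; };
    __m128 pack;
};
static_assert(sizeof(QVec4) == 16, "QVec4 must fit one XMM register");
```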

float dot_vf(const QVec4& a, const QVec4& b)
    {
        __m128 tmp = _mm_mul_ps(a.pack, b.pack); //(x1*x2, y1*y2, z1*z2, w1*w2)
        __m128 shuff = _mm_shuffle_ps(tmp, tmp, _MM_SHUFFLE(2, 3, 0, 1)); //(y1*y2, x1*x2, w1*w2, z1*z2)

        tmp = _mm_add_ps(tmp, shuff); //(x1*x2 + y1*y2, x1*x2 + y1*y2, z1*z2 + w1*w2, z1*z2 + w1*w2)
        shuff = _mm_shuffle_ps(tmp, tmp, _MM_SHUFFLE(0, 1, 2, 3)); //(z1*z2 + w1*w2, z1*z2 + w1*w2, x1*x2 + y1*y2, x1*x2 + y1*y2)
        return _mm_cvtss_f32(_mm_add_ps(shuff, tmp)); //lane 0 = x1*x2 + y1*y2 + z1*z2 + w1*w2
    }

QMat4x4 mul1(const QMat4x4& a, const QMat4x4& b)
    {
        QMat4x4 transposed(b);
        transposed.transpose();
        return QMat4x4(
            QVec4( dot_vf(a.col0, transposed.col0), dot_vf(a.col0, transposed.col1), 
            dot_vf(a.col0, transposed.col2), dot_vf(a.col0, transposed.col3)),

            QVec4( dot_vf(a.col1, transposed.col0), dot_vf(a.col1, transposed.col1), 
            dot_vf(a.col1, transposed.col2),dot_vf(a.col1, transposed.col3)),

            QVec4( dot_vf(a.col2, transposed.col0), dot_vf(a.col2, transposed.col1), 
            dot_vf(a.col2, transposed.col2),dot_vf(a.col2, transposed.col3)),

            QVec4( dot_vf(a.col3, transposed.col0), dot_vf(a.col3, transposed.col1), 
            dot_vf(a.col3, transposed.col2),dot_vf(a.col3, transposed.col3))
        );
    }
Multiplication with a Row View

The above implementation works, but the dot product relies on a horizontal add: the chained shuffles and `_mm_add_ps` calls operate within a single register, so we are not fully utilizing the parallel hardware.
An alternate approach to multiplying matrices comes from taking a closer look at how the elements of the operands are distributed in the resultant matrix. Here's a great resource that can help you solidify the intuition behind matrix multiplication.
The idea behind the row view of matrix multiplication is that each row of the resultant matrix is a linear combination of the rows of matrix B, with weights taken from the corresponding row of matrix A.

Let’s try to compute Row-1 as an example:
1. All elements of Row-1 of Matrix B must be multiplied by a11.
2. All elements of Row-2 of Matrix B must be multiplied by a12.
3. All elements of Row-3 of Matrix B must be multiplied by a13.
4. All elements of Row-4 of Matrix B must be multiplied by a14.
The results of 1, 2, 3, 4 are added to get Row-1 of resultant matrix.
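The four steps above can be sketched in scalar form first (again using a plain `mat4` array alias and a function name of my own choosing):

```cpp
#include <cstddef>

// Row view in scalar form: row i of the result is a weighted sum of the
// rows of B, with the weights taken from row i of A.
using mat4 = float[4][4];

void mul_row_view(const mat4& a, const mat4& b, mat4& out)
{
    for (std::size_t i = 0; i < 4; ++i)
    {
        for (std::size_t j = 0; j < 4; ++j)
            out[i][j] = 0.0f;
        for (std::size_t k = 0; k < 4; ++k)     // step k: add a[i][k] * Row-k of B
            for (std::size_t j = 0; j < 4; ++j)
                out[i][j] += a[i][k] * b[k][j];
    }
}
```

Note that the inner loop over `j` touches a whole row of B at once, which is exactly the shape SIMD wants: one `_mm_mul_ps` plus one `_mm_add_ps` per step, with no horizontal add.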

Matrix Multiplication: Row View

Here’s how this can be implemented using intrinsics.

QMat4x4 out;
        
//Row 0
__m128 T0 = _mm_mul_ps(b.pack[0], _mm_set_ps1(a.row0.x));
__m128 T1 = _mm_mul_ps(b.pack[1], _mm_set_ps1(a.row0.y));
out.pack[0] = _mm_add_ps(T0, T1);
T0 = _mm_mul_ps(b.pack[2], _mm_set_ps1(a.row0.z));
out.pack[0] = _mm_add_ps(out.pack[0], T0);
T0 = _mm_mul_ps(b.pack[3], _mm_set_ps1(a.row0.w));
out.pack[0] = _mm_add_ps(out.pack[0], T0);
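For completeness, here's a sketch of the whole row-view multiply as a standalone function on raw `__m128` rows (the name `mul_rows` and the array-based signature are mine, not from the original code; each scalar weight is broadcast with a shuffle rather than `_mm_set_ps1` so the function needs no named members):

```cpp
#include <xmmintrin.h>

// Row-view 4x4 multiply on raw __m128 rows. For each output row i:
// out[i] = a[i].x*b[0] + a[i].y*b[1] + a[i].z*b[2] + a[i].w*b[3]
void mul_rows(const __m128 a[4], const __m128 b[4], __m128 out[4])
{
    for (int i = 0; i < 4; ++i)
    {
        // Broadcast each lane of a[i] across a whole register.
        __m128 x = _mm_shuffle_ps(a[i], a[i], _MM_SHUFFLE(0, 0, 0, 0));
        __m128 y = _mm_shuffle_ps(a[i], a[i], _MM_SHUFFLE(1, 1, 1, 1));
        __m128 z = _mm_shuffle_ps(a[i], a[i], _MM_SHUFFLE(2, 2, 2, 2));
        __m128 w = _mm_shuffle_ps(a[i], a[i], _MM_SHUFFLE(3, 3, 3, 3));
        out[i] = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, b[0]), _mm_mul_ps(y, b[1])),
            _mm_add_ps(_mm_mul_ps(z, b[2]), _mm_mul_ps(w, b[3])));
    }
}
```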
Multiplication with a Column View

What if you have a column-major matrix? Would it be possible to implement similar logic?
Of course! You just have to approach the problem from the opposite direction.
In this case, each column of the resultant matrix is a linear combination of the columns of matrix A, with weights taken from the corresponding column of matrix B.

Let’s try to compute Column-1 as an example:
1. All elements of Column-1 of Matrix A must be multiplied by b11.
2. All elements of Column-2 of Matrix A must be multiplied by b21.
3. All elements of Column-3 of Matrix A must be multiplied by b31.
4. All elements of Column-4 of Matrix A must be multiplied by b41.
The results of 1, 2, 3, 4 are added to get Column-1 of resultant matrix.
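Mirroring the earlier scalar sketch, the column view looks like this with column-major storage (`m[col][row]`; the alias and function name are mine):

```cpp
#include <cstddef>

// Column view in scalar form, matrices stored column-major as m[col][row]:
// column j of the result is a weighted sum of the columns of A,
// with the weights taken from column j of B.
using mat4cm = float[4][4];

void mul_col_view(const mat4cm& a, const mat4cm& b, mat4cm& out)
{
    for (std::size_t j = 0; j < 4; ++j)
    {
        for (std::size_t r = 0; r < 4; ++r)
            out[j][r] = 0.0f;
        for (std::size_t k = 0; k < 4; ++k)     // step k: add b[j][k] * Column-k of A
            for (std::size_t r = 0; r < 4; ++r)
                out[j][r] += b[j][k] * a[k][r];
    }
}
```

As in the row view, the inner loop over `r` sweeps one contiguous column, so each step maps to a single broadcast-multiply-add on an XMM register.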

Matrix Multiplication: Column View

Here’s how this can be implemented using intrinsics.

QMat4x4 out;

//Col 0
__m128 T0 = _mm_mul_ps(a.pack[0], _mm_set_ps1(b.col0.x));
__m128 T1 = _mm_mul_ps(a.pack[1], _mm_set_ps1(b.col0.y));
out.pack[0] = _mm_add_ps(T0, T1);
T0 = _mm_mul_ps(a.pack[2], _mm_set_ps1(b.col0.z));
out.pack[0] = _mm_add_ps(out.pack[0], T0);
T0 = _mm_mul_ps(a.pack[3], _mm_set_ps1(b.col0.w));
out.pack[0] = _mm_add_ps(out.pack[0], T0);

//Similar logic for the rest of the columns


That is all for this post.
Hope this taught you folks something 🙂
