Matrix multiplication template for architectures with SSE2 or higher and compilers that support C++ intrinsics for access to SSE instructions.
static void exec (real const *const *const A, real const *const *const B, real *const C, int const i=1, int const offset_A=0, int const offset_B=0, int const offset_C=0)
    Executes the matrix-matrix multiply C += A B with the three matrices A, B, and C stored according to the static members and typedefs of this class.
template<int T_offset_A, int T_offset_B, int T_offset_C>
static void exec (real const *const *const A, real const *const *const B, real *const C, int const i=1)
template<typename T_real, typename T_reg, int T_M, int T_N, int T_K>
class MM_kernel_inner_sse2_A< T_real, T_reg, T_M, T_N, T_K >
Matrix multiplication template for architectures with SSE2 or higher and compilers that support C++ intrinsics for access to SSE instructions.
Choice of template parameters:
- T_M and T_N should be chosen so that the T_M x T_N matrix C fits in registers. For example, T_M == T_N == 4.
- T_K should be chosen so that the generated code fits in the L1 instruction cache. For example, T_K == 128.
- T_real and T_reg must go together. Examples:
  - <T_real, T_reg> == <double, __m128d>
  - <T_real, T_reg> == <float, __m128>
The public typedefs and static members specify how the matrices must be stored.
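The register constraint on T_M and T_N can be checked with simple arithmetic. The sketch below is illustrative only and is not part of MM_kernel_inner_sse2_A: the helper names reals_per_register and registers_for_C are hypothetical, and the only facts it relies on are that an SSE register is 128 bits wide and that x86-64 provides 16 such registers (32-bit x86 provides 8). It assumes the usual sizes of 8 bytes for double and 4 bytes for float.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; not members of MM_kernel_inner_sse2_A.

// One SSE (XMM) register holds 128 bits = 16 bytes.
constexpr std::size_t sse_register_bytes = 16;

// Number of T_real values packed into one register
// (2 for double / __m128d, 4 for float / __m128).
template<typename T_real>
constexpr std::size_t reals_per_register() {
  return sse_register_bytes / sizeof(T_real);
}

// Number of registers needed to hold a T_M x T_N block of C, rounded up.
template<typename T_real, int T_M, int T_N>
constexpr std::size_t registers_for_C() {
  return (static_cast<std::size_t>(T_M) * T_N + reals_per_register<T_real>() - 1)
         / reals_per_register<T_real>();
}

// A 4 x 4 block of doubles occupies 8 registers, which leaves room for
// operands of A and B among the 16 XMM registers available on x86-64.
static_assert(registers_for_C<double, 4, 4>() == 8, "4x4 double block");
// The same block in single precision occupies only 4 registers.
static_assert(registers_for_C<float, 4, 4>() == 4, "4x4 float block");
```

Under these assumptions, T_M == T_N == 4 in double precision uses half of the x86-64 register file for C, which is consistent with the example value given above.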
template<typename real, typename T_reg, int T_M, int T_N, int T_K>
void MM_kernel_inner_sse2_A< real, T_reg, T_M, T_N, T_K >::exec (
    real const *const *const A,
    real const *const *const B,
    real *const C,
    int const i = 1,
    int const offset_A = 0,
    int const offset_B = 0,
    int const offset_C = 0
) [static]
Executes the matrix-matrix multiply C += A B with the three matrices A, B, and C stored according to the static members and typedefs of this class.
References A, B, and STATIC_ASSERT_DEBUG.
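For checking the result of a kernel like this one, a scalar reference for the same operation, C += A B, can be useful. The sketch below is not the implementation of exec and does not use its storage scheme, which is defined by the class's typedefs and static members. It assumes plain dense row-major arrays, and the function name reference_multiply_add and its parameters are hypothetical.

```cpp
// Hypothetical scalar reference for C += A * B; not part of the class.
// ASSUMPTION: all three matrices are dense and row-major:
//   A is M x K, B is K x N, C is M x N.
template<typename T_real>
void reference_multiply_add(int M, int N, int K,
                            T_real const* A, T_real const* B, T_real* C) {
  for (int m = 0; m < M; ++m) {
    for (int k = 0; k < K; ++k) {
      T_real const a = A[m * K + k];      // element A(m, k), reused across the row
      for (int n = 0; n < N; ++n) {
        C[m * N + n] += a * B[k * N + n]; // accumulate into C(m, n), do not overwrite
      }
    }
  }
}
```

Note that the operation accumulates into C, so the caller is responsible for initializing C (for example, to zero) before the first call.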
template<typename real , typename T_reg , int T_M, int T_N, int T_K>
template<int T_offset_A, int T_offset_B, int T_offset_C>