Re: Wrapping of Julians code more or less completed.
Peter,
>I have finished wrapping Julian's Athlon kernel into a .c file using gcc
>inline assembly. It provides all four precisions and does N cleanup. N is
>always read at runtime.
>
>I have not looked at the prefetching, so that stuff is still only
>optimized for 30x30 dgemm, but hopefully it does not make too much of a
>difference.
>
>Please test it thoroughly for speed, since I have a hard time testing it
>properly over my 56k modem.
Just got some initial results. I have not yet built it into the full gemm,
but the kernel timing looks very good:
dmm : 995
zmm : 915
smm : 1039
cmm : 1025
So it looks like zmm has taken the biggest hit, which doesn't make a lot of
sense to me. Julian's nasm kernel is getting 960 for dmm, so I'm thinking I
must have an old .o or something . . .
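As a rough guide to reading these figures: the message does not state the units, so the sketch below assumes they are MFLOPS computed in the usual way, counting 2*M*N*K flops for a real gemm and 8*M*N*K for a complex one. The helper name gemm_mflops and its parameters are hypothetical and are not taken from the actual timer.

```c
/* Convert a timing into MFLOPS under the conventional gemm flop counts.
 * seconds is the total wall time for reps calls of an M x N x K gemm. */
static double gemm_mflops(int M, int N, int K, int is_complex,
                          double seconds, long reps)
{
    /* Real: one multiply and one add per inner-product term.
     * Complex: a complex multiply-add costs 8 real flops. */
    double flops_per_call = (is_complex ? 8.0 : 2.0)
                            * (double)M * (double)N * (double)K;
    return flops_per_call * (double)reps / seconds / 1.0e6;
}
```

Under that assumption, 1000 calls of a 30x30x30 dgemm taking one second in total would come out at 54 MFLOPS, and the same count of complex calls at 216 MFLOPS, which is why real and complex rates are comparable on one scale.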
Anyway, if these numbers hold up for full gemm, this looks plenty good for the
stable release to me . . .
Thanks,
Clint