Functions with an 8 or 4 suffix are branchless and faster (see swar8 vs swar). Functions that accept longer input must branch once per word. An SSE implementation can follow the same ideas as here for ...
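To illustrate the difference between a fixed-width, branchless routine and its longer-input counterpart, here is a minimal sketch in C. It is not the code behind swar8 or swar, whose bodies are not shown here. The names count_bytes8 and count_bytes and the byte-counting task itself are assumptions chosen only for illustration. The 8-byte version has no branches at all, while the general version needs one loop branch per 8-byte word.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SWAR_ONES  0x0101010101010101ULL  /* 0x01 in every byte lane */
#define SWAR_HIGHS 0x8080808080808080ULL  /* 0x80 in every byte lane */

/* Hypothetical helper: count how many of the 8 bytes at p equal c.
   Branchless: the whole word is processed with plain arithmetic. */
static size_t count_bytes8(const void *p, uint8_t c) {
    uint64_t w;
    memcpy(&w, p, sizeof w);                 /* safe unaligned load */

    uint64_t x = w ^ (SWAR_ONES * c);        /* lanes equal to c become 0x00 */

    /* Exact zero-lane test: the add cannot carry across lanes because
       each lane is at most 0x7f + 0x7f = 0xfe. */
    uint64_t t = (x & ~SWAR_HIGHS) + ~SWAR_HIGHS;
    uint64_t mask = ~(t | x | ~SWAR_HIGHS);  /* 0x80 in each matching lane */

    /* Sum the per-lane flags into the top byte (at most 8, so no overflow). */
    return (size_t)(((mask >> 7) * SWAR_ONES) >> 56);
}

/* Hypothetical longer-input version: one loop branch per 8-byte word,
   followed by a byte-wise tail for the remaining 0..7 bytes. */
static size_t count_bytes(const void *p, size_t n, uint8_t c) {
    const unsigned char *s = (const unsigned char *)p;
    size_t total = 0;

    while (n >= 8) {                         /* the per-word branch */
        total += count_bytes8(s, c);
        s += 8;
        n -= 8;
    }
    while (n > 0) {                          /* tail bytes */
        total += (size_t)(*s == c);
        s++;
        n--;
    }
    return total;
}
```

Because the result is a count rather than a position, this sketch does not depend on byte order. A routine that returns the index of a match would also need to account for endianness.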