In practice, SSE 4.2 is not faster than regular SSE2 code for strcmp and strlen. More complex processing may benefit from SSE 4.2, but the fastest JSON parser I have seen doesn't use it: it focuses on micro-parallelized branchless processing with regular SIMD instructions. See https://github.com/simdjson/simdjson
Memory access is the bottleneck. This is what Agner measured.
For any asm work, it is mandatory to refer to https://agner.org/optimize
It provides reference code and reference documentation about how modern asm should be written.
The PosEx_Sha_Pas_2 version is one of the fastest, and probably faster than your version, even though it is written in pure Pascal. For instance, reading a register then shr + cmp is not the fastest pattern today.
The Pascal version will also work on Mac and Linux, whereas your asm version would need additional code to support the POSIX ABI.
We included it (with minimal tweaks like using NativeInt instead of integer, and using an inline stub for register usage) in https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.pas#L7974
The first thing to do is to benchmark your code and compare it with the alternatives, using proper timing and regular tests.
Try it within some complex process, not in naive loops, which tend to be biased: in naive tests the data remains in the CPU L1 cache, so the numbers are not realistic.
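To illustrate the cache bias, here is a minimal sketch of a timing harness, written in C only to keep it compiler-neutral. Everything in it is an assumption made for illustration: the names make_corpus, run, search_ref and report are hypothetical, strstr stands in for whatever Pos-like routine you want to measure, and the sizes (255-byte strings, a working set of about 64 MB) are arbitrary. It runs the same search either on one string that stays in L1, or on many strings visited with a large stride so that most calls miss the cache.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature of the routine under test (hypothetical; adapt to your own). */
typedef const char *(*search_fn)(const char *hay, const char *needle);

/* Stand-in for the function being measured: plain libc strstr. */
const char *search_ref(const char *hay, const char *needle) {
    return strstr(hay, needle);
}

/* Build `count` zero-terminated strings of `len` chars in one block.
   Filler uses only 'a'..'w'; the needle is copied at the very end, so a
   needle made of other letters (e.g. "xyz") can only match there. */
char *make_corpus(size_t count, size_t len, const char *needle) {
    size_t nlen = strlen(needle);
    assert(len >= nlen);
    char *buf = malloc(count * (len + 1));
    if (!buf) return NULL;
    for (size_t i = 0; i < count; i++) {
        char *s = buf + i * (len + 1);
        for (size_t j = 0; j < len; j++)
            s[j] = (char)('a' + (i * 31 + j * 7) % 23);
        memcpy(s + len - nlen, needle, nlen);
        s[len] = '\0';
    }
    return buf;
}

/* Run `iters` searches and return the elapsed CPU time in seconds.
   Strings are visited with a prime stride, so with a large corpus
   consecutive calls touch memory that is far apart. */
double run(search_fn fn, const char *corpus, size_t count, size_t len,
           const char *needle, size_t iters, size_t *hits) {
    search_fn volatile call = fn;  /* keeps the compiler from hoisting the call */
    size_t found = 0, idx = 0;
    clock_t t0 = clock();
    for (size_t i = 0; i < iters; i++) {
        if (call(corpus + idx * (len + 1), needle)) found++;
        idx = (idx + 7919) % count;
    }
    clock_t t1 = clock();
    *hits = found;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Example driver: one hot string versus a working set of about 64 MB. */
void report(void) {
    const char *needle = "xyz";
    size_t len = 255, iters = 2000000, big = (size_t)1 << 18, hits = 0;
    char *hot = make_corpus(1, len, needle);
    char *cold = make_corpus(big, len, needle);
    if (hot && cold) {
        printf("hot  (1 string):  %.3f s\n",
               run(search_ref, hot, 1, len, needle, iters, &hits));
        printf("cold (%zu strings): %.3f s\n", big,
               run(search_ref, cold, big, len, needle, iters, &hits));
    }
    free(hot);
    free(cold);
}
```

Call report() from your main. The two timings come from the same function and the same number of calls, so any gap between them comes from memory access rather than from the instructions used. Even this is still a synthetic loop: measuring inside a real workload remains the better test.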