On Windows, we use the http.sys kernel-mode HTTP server, which scales better than anything else on this platform. It is faster than a user-mode IOCP socket server, since request parsing and queuing run in the kernel.
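For context, here is a minimal sketch of the kernel-mode HTTP Server API (httpapi.h) that http.sys exposes to user mode - not our actual server code, just an illustration of how a process dequeues requests that the kernel has already accepted and parsed; the URL and buffer size are arbitrary:

```c
// Minimal http.sys sketch (compile with MSVC, link httpapi.lib); error
// handling omitted. Only illustrates the kernel-mode request queue.
#include <windows.h>
#include <http.h>
#include <stdlib.h>

int main(void) {
    HTTPAPI_VERSION ver = HTTPAPI_VERSION_1;
    HANDLE queue = NULL;
    HttpInitialize(ver, HTTP_INITIALIZE_SERVER, NULL);
    HttpCreateHttpHandle(&queue, 0);              // create a request queue
    HttpAddUrl(queue, L"http://+:8080/", NULL);   // register the URL with the kernel

    ULONG bytes = 0;
    PHTTP_REQUEST req = (PHTTP_REQUEST)malloc(4096);
    // http.sys parses and queues requests in kernel mode; we just dequeue them
    if (HttpReceiveHttpRequest(queue, HTTP_NULL_ID, 0, req, 4096,
                               &bytes, NULL) == NO_ERROR) {
        HTTP_RESPONSE resp;
        HTTP_DATA_CHUNK chunk;
        ZeroMemory(&resp, sizeof(resp));
        resp.StatusCode = 200;
        resp.pReason = "OK";
        resp.ReasonLength = 2;
        chunk.DataChunkType = HttpDataChunkFromMemory;
        chunk.FromMemory.pBuffer = "hello";
        chunk.FromMemory.BufferLength = 5;
        resp.EntityChunkCount = 1;
        resp.pEntityChunks = &chunk;
        HttpSendHttpResponse(queue, req->RequestId, 0, &resp, NULL,
                             &bytes, NULL, 0, NULL, NULL);
    }
    free(req);
    HttpRemoveUrl(queue, L"http://+:8080/");
    HttpTerminate(HTTP_INITIALIZE_SERVER, NULL);
    return 0;
}
```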
On Linux, we use our own thread-pooled socket server, with an nginx frontend acting as a reverse proxy over a local Unix domain socket, handling HTTPS and HTTP/2. This is very safe and scalable.
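For illustration, a frontend of that kind boils down to a few lines of nginx configuration - the socket path, server name and certificate locations below are placeholders, not our actual deployment:

```nginx
# Hypothetical example - all paths and names are placeholders.
server {
    listen 443 ssl http2;                       # nginx terminates HTTPS and HTTP/2
    server_name api.example.com;

    ssl_certificate     /etc/nginx/tls/app.crt;
    ssl_certificate_key /etc/nginx/tls/app.key;

    location / {
        # plain HTTP/1.1 forwarded to the local Unix domain socket of the app server
        proxy_pass http://unix:/run/app.sock:;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```

This keeps TLS, certificates and HTTP/2 negotiation entirely in nginx, while the application server only has to speak plain HTTP over a socket that never leaves the machine.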
And don't trust micro-benchmarks. Even worse, don't write your own benchmark: it will never be as good as measuring a real application.
As I wrote, Intel TBB is a no-go for real server work due to its huge memory consumption. If you have to run specific API calls to release the memory, this is a big design flaw - it may even be considered a bug (we don't want the application to stall as it would with a GC) - and we would never do it.
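For context, the kind of explicit release call I mean is presumably something like tbbmalloc's scalable_allocation_command, which asks the allocator to hand its cached buffers back to the OS - the point is that a server should not need to call this at all:

```c
// The kind of explicit cleanup a server would have to sprinkle around to keep
// tbbmalloc's footprint down - exactly what we refuse to do in our design.
#include <tbb/scalable_allocator.h>

void release_cached_memory(void) {
    /* ask tbbmalloc to return its cached per-thread and global buffers */
    scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, NULL);
}
```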
To be more precise, we use long-living threads from thread pools. So in practice, the threads are never released, and memory allocation and memory release happen in different threads: one thread pool handles the socket communication, then another thread pool consumes the data and releases the memory. This scenario is typical of most event-driven servers running on multi-core CPUs, with a proven ring-oriented architecture. Perhaps Intel TBB is not very good at releasing memory with such a pattern - whereas our SynFPCx64MM is very efficient in this case. And we almost never realloc - just alloc/free, using the stack as a working buffer when necessary.
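As a sketch of that allocation pattern (our actual code is Pascal on top of SynFPCx64MM; the C names below are made up for illustration): one pool of threads allocates the incoming buffers, a different pool frees them after processing, so an allocator that caches freed blocks per-thread never gets to reuse its cache and just keeps growing:

```c
// Cross-thread alloc/free pattern, reduced to one producer and one consumer
// with a single-slot queue; a real server uses a ring buffer and thread pools.
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static void *slot = NULL;                   // one-slot "ring" for brevity
static int   done = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cnd = PTHREAD_COND_INITIALIZER;

static void *socket_thread(void *arg) {     // allocates the request buffers
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        void *buf = malloc(4096);           // allocation happens in this thread
        memset(buf, 0, 4096);
        pthread_mutex_lock(&mtx);
        while (slot != NULL) pthread_cond_wait(&cnd, &mtx);
        slot = buf;
        pthread_cond_broadcast(&cnd);
        pthread_mutex_unlock(&mtx);
    }
    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_cond_broadcast(&cnd);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

static void *worker_thread(void *arg) {     // consumes and frees the buffers
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (slot == NULL && !done) pthread_cond_wait(&cnd, &mtx);
        if (slot == NULL && done) { pthread_mutex_unlock(&mtx); break; }
        void *buf = slot;
        slot = NULL;
        pthread_cond_broadcast(&cnd);
        pthread_mutex_unlock(&mtx);
        free(buf);                          // release happens in another thread
    }
    return NULL;
}

int main(void) {
    pthread_t prod, cons;
    pthread_create(&prod, NULL, socket_thread, NULL);
    pthread_create(&cons, NULL, worker_thread, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return 0;
}
```

A good allocator for this pattern has to return blocks freed by the worker threads to the pool the socket threads allocate from, without keeping large per-thread caches alive forever.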