See the paper:
And related SIMD galloping:

Quoting the first paper:

Our algorithm is suitable to replace existing standard library
functions, such as std::set_intersection in C++, thus accelerating
many applications, because the algorithm is simple and requires no
preprocessing to generate additional data structures. We implemented
our algorithm on Xeon and POWER7+. The experimental results show our
algorithm outperforms the std::set_intersection implementation
delivered with gcc by up to 5.2x using SIMD instructions and by up
to 2.1x even without using SIMD instructions for 32-bit and 64-bit
integer datasets.
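
For context, here is a minimal C++ sketch of the baseline the quote compares against, next to a scalar merge loop whose cursor updates do not branch on the comparison result. The second function only illustrates the general idea of trading hard-to-predict branches for arithmetic; it is an assumption on my part and not the algorithm from the paper. The names intersect_std and intersect_branch_reduced are made up for this example, and both functions assume sorted input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

using Vec = std::vector<std::uint32_t>;

// Baseline: the standard library's merge-style intersection of two sorted ranges.
Vec intersect_std(const Vec& a, const Vec& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    Vec out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Hypothetical scalar variant (not the paper's algorithm): the only branch
// left in the loop is the loop condition itself. The comparisons are turned
// into 0/1 increments, so the cursors advance without an if/else chain.
Vec intersect_branch_reduced(const Vec& a, const Vec& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    // The result can never be longer than the shorter input.
    Vec out(std::min(a.size(), b.size()));
    std::size_t i = 0, j = 0, k = 0;
    while (i < a.size() && j < b.size()) {
        const std::uint32_t x = a[i];
        const std::uint32_t y = b[j];
        out[k] = x;      // speculative store; kept only if x == y
        k += (x == y);   // commit the store on a match
        i += (x <= y);   // advance a when its element is not larger
        j += (y <= x);   // advance b when its element is not larger
    }
    out.resize(k);
    return out;
}
```

Both functions should return the same result for the same sorted inputs, so the second one can be benchmarked against the first as a drop-in. Whether it is actually faster depends on the compiler, the target, and how the data is distributed, so it needs measuring before drawing conclusions.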

Worth looking into.

Apr 04 2016