{"id":13877,"date":"2025-08-19T00:23:54","date_gmt":"2025-08-18T22:23:54","guid":{"rendered":"https:\/\/monodes.com\/predaelli\/?p=13877"},"modified":"2025-08-19T00:23:58","modified_gmt":"2025-08-18T22:23:58","slug":"going-faster-than-memcpy","status":"publish","type":"post","link":"https:\/\/monodes.com\/predaelli\/2025\/08\/19\/going-faster-than-memcpy\/","title":{"rendered":"Going faster than memcpy"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy\">Going faster than memcpy<\/a><\/h2>\n\n\n\n<!--nextpage-->\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<h1 class=\"wp-block-heading\">Going faster than memcpy<\/h1>\n\n\n\n<p>While profiling <a href=\"https:\/\/github.com\/squadrick\/shadesmar\">Shadesmar<\/a> a couple of weeks ago, I noticed that for large binary unserialized messages (&gt;512kB) most of the execution time is spent doing copying the message (using <code class=\"\" data-line=\"\">memcpy<\/code>) between process memory to shared memory and back.<\/p>\n\n\n\n<p>I had a few hours to kill last weekend, and I tried to implement a faster way to do memory copies.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"autopsy-of-memcpy\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#autopsy-of-memcpy\"><\/a>Autopsy of memcpy<\/h3>\n\n\n\n<p>Here\u2019s the dumb of <a href=\"https:\/\/perf.wiki.kernel.org\/index.php\/Main_Page\"><code class=\"\" data-line=\"\">perf<\/code><\/a> when running pub-sub for messages of sizes between 512kB and 2MB.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\"> Children      Self  Shared Object      Symbol\n+  99.86%     0.00%  libc-2.27.so       &#091;.] __libc_start_main\n+  99.86%     0.00%  &#091;unknown]          &#091;k] 0x4426258d4c544155\n+  99.84%     0.02%  raw_benchmark      &#091;.] 
main\n+  98.13%    97.12%  libc-2.27.so       &#091;.] __memmove_avx_unaligned_erms\n+  51.99%     0.00%  raw_benchmark      &#091;.] shm::PublisherBin&lt;16u&gt;::publish\n+  51.98%     0.01%  raw_benchmark      &#091;.] shm::Topic&lt;16u&gt;::write\n+  47.64%     0.01%  raw_benchmark      &#091;.] shm::Topic&lt;16u&gt;::read\n<\/code><\/pre>\n\n\n\n<p><code class=\"\" data-line=\"\">__memmove_avx_unaligned_erms<\/code> is an implementation of <code class=\"\" data-line=\"\">memcpy<\/code> for unaligned memory blocks that uses AVX to copy over 32 bytes at a time. Digging into the <code class=\"\" data-line=\"\">glibc<\/code> source code, I found this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">#if IS_IN (libc)\n# define VEC_SIZE                32\n# define VEC(i)                  ymm##i\n# define VMOVNT                  vmovntdq\n# define VMOVU                   vmovdqu\n# define VMOVA                   vmovdqa\n# define SECTION(p)              p##.avx\n# define MEMMOVE_SYMBOL(p,s)     p##_avx_##s\n\n# include &quot;memmove-vec-unaligned-erms.S&quot;\n#endif\n<\/code><\/pre>\n\n\n\n<p>Breaking down this function:<\/p>\n\n\n\n<p><code class=\"\" data-line=\"\">memmove<\/code>: <code class=\"\" data-line=\"\">glibc<\/code> implements <code class=\"\" data-line=\"\">memcpy<\/code> as a <code class=\"\" data-line=\"\">memmove<\/code> instead, here\u2019s the relevant source code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\"># define SYMBOL_NAME memcpy\n# include &quot;ifunc-memmove.h&quot;\n\nlibc_ifunc_redirected (__redirect_memcpy, __new_memcpy,\n\t\t       IFUNC_SELECTOR ());\n<\/code><\/pre>\n\n\n\n<p>Here\u2019s the difference between the two: With <code class=\"\" data-line=\"\">memcpy<\/code>, the destination cannot overlap the source at all. With <code class=\"\" data-line=\"\">memmove<\/code> it can. 
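<\/p>\n\n\n\n<p>A quick standalone illustration of that difference (shifting a buffer one byte to the right, so source and destination overlap; not part of the library):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">#include &lt;cstring&gt;\n\n\/\/ shift the first len bytes of buf one byte to the right;\n\/\/ src and dst overlap, so memcpy here would be undefined behaviour\nvoid shift_right_one(char *buf, size_t len) {\n  std::memmove(buf + 1, buf, len);\n}\n<\/code><\/pre>\n\n\n\n<p>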
Initially, I wasn\u2019t sure why it was implemented as <code class=\"\" data-line=\"\">memmove<\/code>. The reason for this will become clearer as the post proceeds.<\/p>\n\n\n\n<p><code class=\"\" data-line=\"\">erms<\/code>: <em>E<\/em>nhanced <em>R<\/em>ep <em>M<\/em>ov<em>s<\/em> is a hardware optimization for a loop that does a simple copy. In simple pseudo-code, this is what the loop implementation looks like for copying a single byte at a time (<code class=\"\" data-line=\"\">REP MOVSB<\/code>).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">void *rep_movsb(void *dest, const void *src, size_t len) {\n  const uint8_t* s = (const uint8_t*)src;\n  uint8_t* d = (uint8_t*)dest;\n\n  while (len--)\n    *d++ = *s++;\n\n  return dest;\n}\n<\/code><\/pre>\n\n\n\n<p>Since the loop copies the data byte by byte, it can handle the case of overlapping data.<\/p>\n\n\n\n<p><code class=\"\" data-line=\"\">vec<\/code>: For the above loop, rather than copying single bytes, it uses x86 vectorized instructions to copy multiple bytes in a single loop iteration (technically a single instruction). <code class=\"\" data-line=\"\">vmov*<\/code> are assembly instructions for AVX, which is the latest instruction set that the CPU on my laptop supports. With <code class=\"\" data-line=\"\">VEC_SIZE = 32<\/code>, it copies 32 bytes at a time.<\/p>\n\n\n\n<p><code class=\"\" data-line=\"\">unaligned<\/code>: This is a generic version of <code class=\"\" data-line=\"\">memmove<\/code> that can copy between any pointer locations irrespective of their alignment. Unaligned pointers increase complexity for the copy loop when using vectorized instructions. The unaligned preceding and trailing memory locations must be copied separately before hitting the optimized loop.<\/p>\n\n\n\n<p><code class=\"\" data-line=\"\">memmove-vec-unaligned-erms.S<\/code> holds the actual implementation in assembly. 
A few things that the implementation does:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>It uses <code class=\"\" data-line=\"\">REP MOVS<\/code> only if the data is greater than 4kB. For smaller values it uses the SSE2 optimization.<\/li>\n\n\n\n<li>For handling <code class=\"\" data-line=\"\">unaligned<\/code> pointers, it uses the following blocks:\n<ul class=\"wp-block-list\">\n<li>16 to 31: <code class=\"\" data-line=\"\">vmovdqu<\/code><\/li>\n\n\n\n<li>15 to 8: <code class=\"\" data-line=\"\">movq<\/code><\/li>\n\n\n\n<li>7 to 4: <code class=\"\" data-line=\"\">movl<\/code><\/li>\n\n\n\n<li>3 to 2: <code class=\"\" data-line=\"\">movzwl<\/code> and <code class=\"\" data-line=\"\">movw<\/code><\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><code class=\"\" data-line=\"\">VMOVNT<\/code> defined above is for doing non-temporal(NT) moves. NT instructions are used when there is an overlap between destination and source since destination may be in cache when source is loaded. Uses <code class=\"\" data-line=\"\">prefetcht0<\/code> to load data into cache (all levels: t0). In the current iteration, we prefetch the data for 2 iterations later. The data is copied (via cache) into registers. 
The data (via NT) is copied from registers into destination.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">L(loop_large_forward):\n\t; Copy 4 * VEC a time forward with non-temporal stores.\n\tPREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2)\n\tPREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 3)\n  ; PREFETCH 256b from rsi+256 to rsi+511\n\n\tVMOVU\t(%rsi), %VEC(0)\n\tVMOVU\tVEC_SIZE(%rsi), %VEC(1)\n\tVMOVU\t(VEC_SIZE * 2)(%rsi), %VEC(2)\n\tVMOVU\t(VEC_SIZE * 3)(%rsi), %VEC(3)\n  ; mov 128b from rsi to rsi+127 -&gt; 4 ymm registers (cache)\n  ; 2 loops later, we hit the prefetched values\n\n\taddq\t$PREFETCHED_LOAD_SIZE, %rsi  ; advance to rsi+128 in next loop\n\tsubq\t$PREFETCHED_LOAD_SIZE, %rdx\n\n\tVMOVNT\t%VEC(0), (%rdi)\n\tVMOVNT\t%VEC(1), VEC_SIZE(%rdi)\n\tVMOVNT\t%VEC(2), (VEC_SIZE * 2)(%rdi)\n\tVMOVNT\t%VEC(3), (VEC_SIZE * 3)(%rdi)\n  ; mov 128b from 4 ymm register -&gt; rdi to rdi+127 (no cache)\n\n\taddq\t$PREFETCHED_LOAD_SIZE, %rdi  ; advance to rdi+128 in next loop\n\tcmpq\t$PREFETCHED_LOAD_SIZE, %rdx\n\tja\tL(loop_large_forward)\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"method-1-basic-rep-movsb\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#method-1-basic-rep-movsb\"><\/a>Method 1: Basic REP MOVSB<\/h3>\n\n\n\n<p>Before getting into more exotic implementations, I wanted to first implement a super simple version of REP MOVSB to see how well it would perform. 
I used inline assembly to write out the loop.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">void _rep_movsb(void *d, const void *s, size_t n) {\n  asm volatile(&quot;rep movsb&quot;\n               : &quot;=D&quot;(d), &quot;=S&quot;(s), &quot;=c&quot;(n)\n               : &quot;0&quot;(d), &quot;1&quot;(s), &quot;2&quot;(n)\n               : &quot;memory&quot;);\n}\n<\/code><\/pre>\n\n\n\n<p>This does the same as the pseudo-code attached above, but I wrote it in assembly to prevent any compiler optimization and rely only on the hardware ERMS optimization.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"alternate-2-aligned-avx\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#alternate-2-aligned-avx\"><\/a>Method 2: Aligned AVX<\/h3>\n\n\n\n<p>One of the complexities in <code class=\"\" data-line=\"\">glibc<\/code>\u2019s implementation is getting it to work for unaligned pointers. Since I control the memory allocation, I figured I could recreate the implementation focused solely on aligned pointers and sizes. 
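<\/p>\n\n\n\n<p>Getting buffers with that alignment is straightforward since the allocation is under my control; a minimal sketch using C++17 <code class=\"\" data-line=\"\">std::aligned_alloc<\/code> (not necessarily the allocator Shadesmar actually uses):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">#include &lt;cstdlib&gt;\n\n\/\/ allocate n bytes aligned to 32 bytes for AVX loads\/stores;\n\/\/ aligned_alloc requires the size to be a multiple of the\n\/\/ alignment, so round n up first\nvoid *alloc_avx(size_t n) {\n  const size_t alignment = 32;\n  const size_t rounded = (n + alignment - 1) &amp; ~(alignment - 1);\n  return std::aligned_alloc(alignment, rounded);\n}\n<\/code><\/pre>\n\n\n\n<p>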
I\u2019m using AVX intrinsics for 32-byte vectors:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">void _avx_cpy(void *d, const void *s, size_t n) {\n  \/\/ d, s -&gt; 32 byte aligned\n  \/\/ n -&gt; multiple of 32\n  auto *dVec = reinterpret_cast&lt;__m256i *&gt;(d);\n  const auto *sVec = reinterpret_cast&lt;const __m256i *&gt;(s);\n  size_t nVec = n \/ sizeof(__m256i);\n  for (; nVec &gt; 0; nVec--, sVec++, dVec++) {\n    const __m256i temp = _mm256_load_si256(sVec);\n    _mm256_store_si256(dVec, temp);\n  }\n}\n<\/code><\/pre>\n\n\n\n<p>The logic is identical to the previous <code class=\"\" data-line=\"\">REP MOVSB<\/code> loop, but operating on 32 bytes at a time.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"method-3-stream-aligned-avx\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#method-3-stream-aligned-avx\"><\/a>Method 3: Stream aligned AVX<\/h3>\n\n\n\n<p><code class=\"\" data-line=\"\">_mm256_load_si256<\/code> and <code class=\"\" data-line=\"\">_mm256_store_si256<\/code> go through the cache, which incurs additional overhead. The AVX instruction set has <code class=\"\" data-line=\"\">_stream_<\/code> load and store instructions that skip the cache. 
The performance of this copy is dependent on:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Quantity of data to copy<\/li>\n\n\n\n<li>Cache size<\/li>\n<\/ol>\n\n\n\n<p>Non-temporal moves may bog down the performance for smaller copies (that can fit into L2 cache) compared to regular moves.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">void _avx_async_cpy(void *d, const void *s, size_t n) {\n  \/\/ d, s -&gt; 32 byte aligned\n  \/\/ n -&gt; multiple of 32\n  auto *dVec = reinterpret_cast&lt;__m256i *&gt;(d);\n  const auto *sVec = reinterpret_cast&lt;const __m256i *&gt;(s);\n  size_t nVec = n \/ sizeof(__m256i);\n  for (; nVec &gt; 0; nVec--, sVec++, dVec++) {\n    const __m256i temp = _mm256_stream_load_si256(sVec);\n    _mm256_stream_si256(dVec, temp);\n  }\n  _mm_sfence();\n}\n<\/code><\/pre>\n\n\n\n<p>The exact same code as before, but using non-temporal moves instead. There\u2019s an extra <code class=\"\" data-line=\"\">_mm_sfence<\/code>, which guarantees that all stores in the preceding loop are visible globally.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"method-4-stream-aligned-avx-with-prefetch\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#method-4-stream-aligned-avx-with-prefetch\"><\/a>Method 4: Stream aligned AVX with prefetch<\/h3>\n\n\n\n<p>In the previous method, we skipped the cache entirely. We can squeeze a bit more performance by prefetching, in the current iteration, the source data needed by the next iteration. 
Since all prefetches work on cache-lines (64 bytes), each loop iteration copies 64 bytes from source to destination.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">void _avx_async_pf_cpy(void *d, const void *s, size_t n) {\n  \/\/ d, s -&gt; 64 byte aligned\n  \/\/ n -&gt; multiple of 64\n\n  auto *dVec = reinterpret_cast&lt;__m256i *&gt;(d);\n  const auto *sVec = reinterpret_cast&lt;const __m256i *&gt;(s);\n  size_t nVec = n \/ sizeof(__m256i);\n  for (; nVec &gt; 2; nVec -= 2, sVec += 2, dVec += 2) {\n    \/\/ prefetch the next iteration&#039;s data\n    \/\/ by default _mm_prefetch fetches an entire cache-line (64 bytes)\n    _mm_prefetch(sVec + 2, _MM_HINT_T0);\n\n    _mm256_stream_si256(dVec, _mm256_load_si256(sVec));\n    _mm256_stream_si256(dVec + 1, _mm256_load_si256(sVec + 1));\n  }\n  _mm256_stream_si256(dVec, _mm256_load_si256(sVec));\n  _mm256_stream_si256(dVec + 1, _mm256_load_si256(sVec + 1));\n  _mm_sfence();\n}\n<\/code><\/pre>\n\n\n\n<p>The load from source pointer to register should <strong>not<\/strong> skip the cache since that data is explicitly prefetched into the cache; the non-stream <code class=\"\" data-line=\"\">_mm256_load_si256<\/code> must be used instead.<\/p>\n\n\n\n<p>This also unrolls the loop for 2 copies at a time instead of a single copy. This is to guarantee that each loop iteration\u2019s prefetch coincides with the copy. Prefetch the next 64 bytes and copy the current 64 bytes.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"alternate-avenues\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#alternate-avenues\"><\/a>Alternate avenues<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"unrolling\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#unrolling\"><\/a>Unrolling<\/h3>\n\n\n\n<p>In the previous section, most of the changes were in the actual underlying load\/store instructions used. 
Another avenue of exploration is to unroll the loop for a certain number of iterations. This reduces the number of branch statements by the factor of unrolling.<\/p>\n\n\n\n<p>In the <code class=\"\" data-line=\"\">glibc<\/code> implementation the unrolling factor is 4 which is what I\u2019ll use as well. A very simple way to implement this is to increase the alignment required by 4x and treat each loop as 4 instructions that copy 4x data.<\/p>\n\n\n\n<p>A more complicated version would be trying to implement an unrolled loop without increasing alignment size. We\u2019ll need to copy using a regular fully rolled loop till we hit a pointer location that is aligned to the size expected by our unrolled loop.<\/p>\n\n\n\n<p>Unrolling the aligned AVX copy:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">void _avx_cpy_unroll(void *d, const void *s, size_t n) {\n  \/\/ d, s -&gt; 128 byte aligned\n  \/\/ n -&gt; multiple of 128\n\n  auto *dVec = reinterpret_cast&lt;__m256i *&gt;(d);\n  const auto *sVec = reinterpret_cast&lt;const __m256i *&gt;(s);\n  size_t nVec = n \/ sizeof(__m256i);\n  for (; nVec &gt; 0; nVec -= 4, sVec += 4, dVec += 4) {\n    _mm256_store_si256(dVec, _mm256_load_si256(sVec));\n    _mm256_store_si256(dVec + 1, _mm256_load_si256(sVec + 1));\n    _mm256_store_si256(dVec + 2, _mm256_load_si256(sVec + 2));\n    _mm256_store_si256(dVec + 3, _mm256_load_si256(sVec + 3));\n  }\n}\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"multithreading\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#multithreading\"><\/a>Multithreading<\/h3>\n\n\n\n<p>The operation of copying data is super easy to parallelize across multiple threads. The total data to be transferred can be segmented into (almost) equal chunks, and then copied over using one of the above methods. 
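<\/p>\n\n\n\n<p>A bare-bones sketch of that idea, splitting the buffer across <code class=\"\" data-line=\"\">std::thread<\/code> workers with plain <code class=\"\" data-line=\"\">std::memcpy<\/code> per chunk (illustrative only):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">#include &lt;cstdint&gt;\n#include &lt;cstring&gt;\n#include &lt;thread&gt;\n#include &lt;vector&gt;\n\nvoid mt_copy(void *d, const void *s, size_t n, unsigned nthreads) {\n  std::vector&lt;std::thread&gt; workers;\n  const size_t chunk = n \/ nthreads;\n  for (unsigned i = 0; i &lt; nthreads; ++i) {\n    const size_t off = i * chunk;\n    \/\/ the last worker also picks up the remainder\n    const size_t len = (i == nthreads - 1) ? n - off : chunk;\n    workers.emplace_back([=] {\n      std::memcpy(static_cast&lt;uint8_t *&gt;(d) + off,\n                  static_cast&lt;const uint8_t *&gt;(s) + off, len);\n    });\n  }\n  for (auto &amp;t : workers) t.join();\n}\n<\/code><\/pre>\n\n\n\n<p>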
This will make the copy super fast, especially if the CPU has a large core count.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"shadesmar-api\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#shadesmar-api\"><\/a>Shadesmar API<\/h2>\n\n\n\n<p>To make it easy to integrate custom memory copying logic into the library, I introduced the concept of <code class=\"\" data-line=\"\">Copier<\/code> in <a href=\"https:\/\/github.com\/Squadrick\/shadesmar\/commit\/22dc762ca658d1396f3c00366e80e4f695189df9\">this commit<\/a>. For a new copying algorithm, the abstract class <code class=\"\" data-line=\"\">Copier<\/code> must be implemented.<\/p>\n\n\n\n<p>Here\u2019s the definition of <code class=\"\" data-line=\"\">Copier<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">class Copier {\n public:\n  virtual void *alloc(size_t) = 0;\n  virtual void dealloc(void *) = 0;\n  virtual void shm_to_user(void *, void *, size_t) = 0;\n  virtual void user_to_shm(void *, void *, size_t) = 0;\n};\n<\/code><\/pre>\n\n\n\n<p>The original reason for introducing this construct was to allow cross-device usage, where a custom copier would be implemented to transfer between CPU and GPU, e.g. using <code class=\"\" data-line=\"\">cudaMemcpy<\/code> for Nvidia GPUs.<\/p>\n\n\n\n<p>For a single-device use case, the implementations of <code class=\"\" data-line=\"\">shm_to_user<\/code> and <code class=\"\" data-line=\"\">user_to_shm<\/code> are identical. 
The implementation of a copier that uses <code class=\"\" data-line=\"\">std::memcpy<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">class DefaultCopier : public Copier {\n public:\n  void *alloc(size_t size) override { return malloc(size); }\n\n  void dealloc(void *ptr) override { free(ptr); }\n\n  void shm_to_user(void *dst, void *src, size_t size) override {\n    std::memcpy(dst, src, size);\n  }\n\n  void user_to_shm(void *dst, void *src, size_t size) override {\n    std::memcpy(dst, src, size);\n  }\n};\n<\/code><\/pre>\n\n\n\n<p>I also created an adapter <code class=\"\" data-line=\"\">MTCopier<\/code> that adds multithreading support to other copiers:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">template &lt;class BaseCopierT&gt; \nclass MTCopier : public Copier {\npublic:\n  explicit MTCopier(uint32_t threads = std::thread::hardware_concurrency())\n      : nthreads(threads) {}\n\n  void *alloc(size_t size) override { return base_copier.alloc(size); }\n\n  void dealloc(void *ptr) override { base_copier.dealloc(ptr); }\n\n  void _copy(void *d, void *s, size_t n, bool shm_to_user) {\n    std::vector&lt;std::thread&gt; threads;\n    threads.reserve(nthreads);\n\n    ldiv_t per_worker = ldiv((long)n, nthreads);\n\n    size_t next_start = 0;\n    for (uint32_t thread_idx = 0; thread_idx &lt; nthreads; ++thread_idx) {\n      const size_t curr_start = next_start;\n      next_start += per_worker.quot;\n      if (thread_idx &lt; per_worker.rem) {\n        ++next_start;\n      }\n      uint8_t *d_thread = reinterpret_cast&lt;uint8_t *&gt;(d) + curr_start;\n      uint8_t *s_thread = reinterpret_cast&lt;uint8_t *&gt;(s) + curr_start;\n\n      if (shm_to_user) {\n        threads.emplace_back(&amp;Copier::shm_to_user, &amp;base_copier, d_thread,\n                             s_thread, next_start - curr_start);\n      } else {\n        threads.emplace_back(&amp;Copier::user_to_shm, &amp;base_copier, d_thread,\n                             s_thread, next_start - curr_start);\n      }\n    }\n    for (auto &amp;thread : threads) {\n      thread.join();\n    }\n    threads.clear();\n  }\n\n  void shm_to_user(void *dst, void *src, size_t size) override {\n    _copy(dst, src, size, true);\n  }\n\n  void user_to_shm(void *dst, void *src, size_t size) override {\n    _copy(dst, src, size, false);\n  }\n\nprivate:\n  BaseCopierT base_copier;\n  uint32_t nthreads;\n};\n<\/code><\/pre>\n\n\n\n<p>Currently this only works for <code class=\"\" data-line=\"\">memcpy<\/code> and <code class=\"\" data-line=\"\">_rep_movsb<\/code> since the implementation expects the memory copy to work with unaligned memory.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"benchmark\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#benchmark\"><\/a>Benchmark<\/h2>\n\n\n\n<p>I used Google\u2019s <a href=\"https:\/\/github.com\/google\/benchmark\">Benchmark<\/a> for timing the performance of copying data ranging in size from 32kB to 64MB. All the benchmarks were run on my PC with the following specifications:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>AMD Ryzen 7 3700X<\/li>\n\n\n\n<li>2x8GB DDR4 RAM @ 3600MHz<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"conclusion\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#conclusion\"><\/a>Conclusion<\/h3>\n\n\n\n<p>Stick to <code class=\"\" data-line=\"\">std::memcpy<\/code>. It delivers great performance while also adapting to the hardware architecture, and makes no assumptions about the memory alignment.<\/p>\n\n\n\n<p>If performance truly matters, then you might want to consider using a more specific non-generic implementation with alignment requirements. 
The streaming prefetch copy works the best for larger copies (&gt;1MB), though <code class=\"\" data-line=\"\">memcpy<\/code> matches its performance there; for small sizes its performance is abysmal. For small to medium sizes, unrolled AVX absolutely dominates, but for larger messages it is slower than the streaming alternatives. The regular <code class=\"\" data-line=\"\">RepMovsb<\/code> is by far the worst overall performer, as expected.<\/p>\n\n\n\n<p>Unrolling definitely improves performance in most cases by about 5-10%. The only case where the unrolled version is slower than the rolled version is <code class=\"\" data-line=\"\">AvxCopier<\/code> with a data size of 32B, where the unrolled version is 25% slower. The rolled version will do a single AVX-256 load\/store and a conditional check. The unrolled version will do 4 AVX-256 load\/stores and a conditional check.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"code\"><a href=\"https:\/\/squadrick.dev\/journal\/going-faster-than-memcpy#code\"><\/a>Code<\/h3>\n\n\n\n<p>Code for all the methods is included in the library conforming to the above mentioned API. 
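<\/p>\n\n\n\n<p>As an illustration of conforming to that API, here\u2019s what a custom copier with aligned buffers might look like (a sketch: the interface is repeated for completeness, and <code class=\"\" data-line=\"\">std::memcpy<\/code> stands in for one of the aligned AVX loops above):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\" data-line=\"\">#include &lt;cstdlib&gt;\n#include &lt;cstring&gt;\n\nclass Copier {\n public:\n  virtual void *alloc(size_t) = 0;\n  virtual void dealloc(void *) = 0;\n  virtual void shm_to_user(void *, void *, size_t) = 0;\n  virtual void user_to_shm(void *, void *, size_t) = 0;\n};\n\n\/\/ illustrative only; the real versions live in dragons.h\nclass AlignedExampleCopier : public Copier {\n public:\n  static constexpr size_t alignment = 32;\n\n  void *alloc(size_t size) override {\n    \/\/ round up: aligned_alloc needs a multiple of the alignment\n    return std::aligned_alloc(alignment,\n                              (size + alignment - 1) &amp; ~(alignment - 1));\n  }\n\n  void dealloc(void *ptr) override { std::free(ptr); }\n\n  void shm_to_user(void *dst, void *src, size_t size) override {\n    std::memcpy(dst, src, size);  \/\/ swap in an aligned AVX loop here\n  }\n\n  void user_to_shm(void *dst, void *src, size_t size) override {\n    std::memcpy(dst, src, size);\n  }\n};\n<\/code><\/pre>\n\n\n\n<p>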
To actively warn about the danger of using these custom copiers I have named this file <a href=\"https:\/\/github.com\/Squadrick\/shadesmar\/blob\/master\/include\/shadesmar\/memory\/dragons.h\"><code class=\"\" data-line=\"\">dragons.h<\/code><\/a>, with an apt message: <em>Here be dragons<\/em>.<\/p>\n<\/blockquote>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p class=\"excerpt\">Going faster than memcpy<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"https:\/\/monodes.com\/predaelli\/2025\/08\/19\/going-faster-than-memcpy\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"link","meta":{"inline_featured_image":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"federated","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[489,488],"class_list":["post-13877","post","type-post","status-publish","format-link","hentry","category-senza-categoria","tag-endif","tag-if","post_format-post-format-link"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p6daft-3BP","jetpack-related-posts":[{"id":6836,"url":"https:\/\/monodes.com\/predaelli\/2020\/03\/06\/what-is-faster-in-c-a-struct-or-a-class-c-architects-medium\/","url_meta":{"origin
":13877,"position":0},"title":"What Is Faster In C#: A Struct Or A Class? &#8211; C# Architects &#8211; Medium","author":"Paolo Redaelli","date":"2020-03-06","format":false,"excerpt":"What do you think is faster: filling an array with one million structs, or filling an array with one million classes? Mark Farragher ask himself What Is Faster In C#: A Struct Or A Class? - C# Architects - on Medium. But of course the third version is faster because\u2026","rel":"","context":"In &quot;Documentations&quot;","block_context":{"text":"Documentations","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/"},"img":{"alt_text":"Mark Farragher","src":"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/03\/1IyCy3Cj8Kuv759UIaFq3aw.jpeg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":8979,"url":"https:\/\/monodes.com\/predaelli\/2021\/12\/11\/new-aluminum-ion-battery-charges-up-to-60-times-faster-than-lithium-ion\/","url_meta":{"origin":13877,"position":1},"title":"New aluminum-ion battery charges up to 60 times faster than lithium-ion","author":"Paolo Redaelli","date":"2021-12-11","format":"link","excerpt":"GMG and UQ develop faster-charging and more sustainable batteries with a life up to three times greater than lithium-ion. Source: New aluminum-ion battery charges up to 60 times faster than lithium-ion","rel":"","context":"In &quot;Senza categoria&quot;","block_context":{"text":"Senza categoria","link":"https:\/\/monodes.com\/predaelli\/category\/senza-categoria\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3955,"url":"https:\/\/monodes.com\/predaelli\/2018\/04\/01\/mega65-8-bit-computer\/","url_meta":{"origin":13877,"position":2},"title":"MEGA65 8-bit computer","author":"Paolo Redaelli","date":"2018-04-01","format":false,"excerpt":"Modern retrocomputing made the way I like: MEGA65: the\u00a0 8-bit computer: an open-source new and open C65-like computer. 
The 21st century realization of the C65 heritage: A complete 8-bit computer running around 50x faster than a C64 while being highly compatible. C65 design, mechanical keyboard, HD output, SD card support,\u2026","rel":"","context":"In &quot;Fun&quot;","block_context":{"text":"Fun","link":"https:\/\/monodes.com\/predaelli\/category\/fun\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11473,"url":"https:\/\/monodes.com\/predaelli\/2024\/03\/13\/multi-threading-is-always-the-wrong-design\/","url_meta":{"origin":13877,"position":3},"title":"Multi-threading is always the wrong design","author":"Paolo Redaelli","date":"2024-03-13","format":false,"excerpt":"\u201cWe\u2019ll just do that on a background thread\u201d Source: Multi-threading is always the wrong design Well, really? Multi-threading is always the wrong design \u201cWe\u2019ll just do that on a background thread\u201d uNetworking AB Say what you want about Node.js. It sucks, a lot. But it was made with one very\u2026","rel":"","context":"In &quot;Tricks&quot;","block_context":{"text":"Tricks","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/tricks\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":8787,"url":"https:\/\/monodes.com\/predaelli\/2021\/10\/12\/write-better-and-faster-python-using-einstein-notation-by-bilal-himite-aug-2021-towards-data-science\/","url_meta":{"origin":13877,"position":4},"title":"Write Better And Faster Python Using Einstein Notation | by Bilal Himite | Aug, 2021 | Towards Data Science","author":"Paolo Redaelli","date":"2021-10-12","format":false,"excerpt":"How to make your code more readable, concise, and efficient using Einstein notation Source: Write Better And Faster Python Using Einstein Notation | by Bilal Himite | Aug, 2021 | Towards Data Science","rel":"","context":"In 
&quot;Python&quot;","block_context":{"text":"Python","link":"https:\/\/monodes.com\/predaelli\/category\/python\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1963,"url":"https:\/\/monodes.com\/predaelli\/2016\/12\/04\/deliver-support-for-new-languages-in-eclipse-ide-faster-with-generic-editor-and-language-servers-red-hat-developer-blog\/","url_meta":{"origin":13877,"position":5},"title":"Deliver support for new languages in Eclipse IDE faster with Generic Editor and Language Servers \u2013 Red Hat Developer Blog","author":"Paolo Redaelli","date":"2016-12-04","format":"link","excerpt":"http:\/\/developerblog.redhat.com\/2016\/11\/24\/deliver-support-for-new-languages-in-eclipse-ide-faster-with-generic-editor-and-language-servers\/ May help to add support for LibertyEiffel\u00a0","rel":"","context":"In &quot;Senza categoria&quot;","block_context":{"text":"Senza categoria","link":"https:\/\/monodes.com\/predaelli\/category\/senza-categoria\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/13877","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/comments?post=13877"}],"version-history":[{"count":0,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/13877\/revisions"}],"wp:attachment":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/media?parent=13877"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/categories?post=13877"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monod
es.com\/predaelli\/wp-json\/wp\/v2\/tags?post=13877"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}