Jacek Laskowski
Posted November 6, 2018 (edited)

Based on this thread: https://stackoverflow.com/questions/29958168/are-integer-reads-atomic-in-delphi it can be assumed that writing or reading Integer values (4 bytes) is not guaranteed to be atomic. It depends on the alignment.

Questions:
Is writing and reading 1 byte (type Byte) always an atomic (and safe) operation?
Is writing and reading 2 bytes (type Word) always an atomic (and safe) operation?

Edited November 6, 2018 by Jacek Laskowski
Zacherl
Posted November 7, 2018 (edited)

Reading unaligned values from memory is slow but should still be atomic (on a Pentium6 and newer). This behavior is described in the Intel SDM:

Quote
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

The above is only valid for single-core CPUs. For multi-core CPUs you will need to utilize the "bus control signals" (click here for explanation):

Quote
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

Delphi implements the System.SyncObjs.TInterlocked class, which provides some functions to help with atomic access (e.g. `TInterlocked.Exchange()` for an atomic exchange operation). You should always use these functions to make sure your application does not run into race conditions on multi-core systems.

BTW: I released a unit with a few atomic type wrappers some time ago (limited functionality compared to the `std::atomic<T>` C++ types, as Delphi does not allow overloading of the assignment operator): https://github.com/flobernd/delphi-utils/blob/master/Utils.AtomicTypes.pas

Edited November 7, 2018 by Zacherl
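For readers who have not used `TInterlocked` before, here is a minimal sketch of the kind of calls the post above refers to. It assumes a Delphi version that ships `System.SyncObjs.TInterlocked`; the variable names `Counter` and `Flag` are hypothetical and only serve the illustration.

```pascal
program InterlockedSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.SyncObjs;

var
  Counter: Integer = 0;   // hypothetical shared counter
  Flag: Integer = 0;      // hypothetical shared state word
  Previous: Integer;
begin
  // Atomic read-modify-write; the function returns the incremented value.
  Writeln('After Increment: ', TInterlocked.Increment(Counter));

  // Atomic add of an arbitrary amount.
  TInterlocked.Add(Counter, 10);
  Writeln('After Add:       ', Counter);

  // Atomic exchange: stores the new value and hands back the old one.
  Previous := TInterlocked.Exchange(Flag, 1);
  Writeln('Exchange saw:    ', Previous);

  // Compare-and-swap: writes 2 only if Flag still holds 1,
  // and returns the value that was found in Flag.
  Previous := TInterlocked.CompareExchange(Flag, 2, 1);
  Writeln('CAS saw:         ', Previous, ', Flag is now ', Flag);
end.
```

In a real program the variables would be shared between threads; the single-threaded run here only shows the call shapes and return values.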
David Heffernan
Posted November 8, 2018

8 hours ago, Zacherl said:
Reading unaligned values from memory is slow but should still be atomic (on a Pentium6 and newer).

Actually both of these statements are wrong. Reading unaligned memory is not atomic for reads that straddle cache lines. And unaligned memory access is not slow on modern processors.
David Heffernan
Posted November 8, 2018

On 11/6/2018 at 2:30 PM, Jacek Laskowski said:
Is writing and reading 1 byte (type Byte) always an atomic (and safe) operation? Is writing and reading 2 bytes (type Word) always an atomic (and safe) operation?

Single bytes are always aligned, because they can't straddle cache lines. Byte reads are therefore atomic. Two-byte reads can straddle cache lines, and such unaligned reads are not atomic.
Zacherl
Posted November 8, 2018 (edited)

4 hours ago, David Heffernan said:
Actually both of these statements are wrong. Reading unaligned memory is not atomic for reads that straddle cache lines.

Sorry, but did you actually read my post? That's exactly what I quoted from the latest Intel SDM:

13 hours ago, Zacherl said:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

4 hours ago, David Heffernan said:
And unaligned memory access is not slow on modern processors.

Can you please give me some references that prove your statement? The Intel SDM says (you might be correct for normal data access, but using the `LOCK` prefix is something else):

13 hours ago, Zacherl said:
nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

Edited November 8, 2018 by Zacherl
Primož Gabrijelčič
Posted November 8, 2018 (edited)

32 minutes ago, Zacherl said:
4 hours ago, David Heffernan said:
And unaligned memory access is not slow on modern processors.
Can you please give me some references that prove your statement?

I can confirm (from experience) that this is indeed true. I cannot find any definitive document about that, but it looks like since 2011/12 unaligned access doesn't hurt very much (at least on Intel platforms).

Some 3rd party posts that confirm my finding:
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
https://www.reddit.com/r/programming/comments/2la6qc/data_alignment_for_speed_myth_or_reality/

Edited November 8, 2018 by Primož Gabrijelčič
Tommi Prami
Posted November 8, 2018

Check out this fabulous book about memory alignment and performance.
Zacherl
Posted November 8, 2018

51 minutes ago, Primož Gabrijelčič said:
I can confirm (from experience) that this is indeed true. I cannot find any definitive document about that, but it looks like since 2011/12 unaligned access doesn't hurt very much (at least on Intel platforms).

Well, okay, it seems like the Intel SDM needs to be updated. Did anybody test the same scenario with multi-threaded atomic write operations (`lock xchg`, `lock add`, and so on)? I could imagine different results in terms of performance here.

Anyway, I guess the original question is answered. To summarize:
• 1 byte reads are always atomic
• 2/4/8 byte reads are atomic, if executed on a P6+ and fitting in a single cache line (any sane compiler should use correct alignments by itself)

For multi-threaded read-modify-write access, read this (TLDR: you will need to use `TInterlocked.XXX()`): https://stackoverflow.com/a/5421844/9241044
Primož Gabrijelčič
Posted November 8, 2018 (edited)

I put together a simple test measuring aligned vs. unaligned access speed. You can find it here: https://github.com/gabr42/GpDelphiCode/tree/master/MemSpeed

The code simply runs in a tight loop and does the following 1 million times:

    pData^ := {$IFDEF X64}$F0F0F0F0F0F0F0F0{$ELSE}$F0F0F0F0{$ENDIF};
    pData^ := {$IFDEF X64}$0F0F0F0F0F0F0F0F{$ELSE}$0F0F0F0F{$ENDIF};

All this is repeated for each offset in a 1024-byte buffer, and all of the above is repeated ten times. Memory is allocated with VirtualAlloc, which gives back nicely aligned blocks. At the end, the shortest time for each offset is logged into a file of your choosing. I was only interested in relative differences, so I did nothing to convert the data to a "real" time unit.

Warning: The code needs more than a minute to run on my slow Xeon.

If I graph the result in Excel, I get this:

There's basically no difference (besides the noise - all the junk I have installed on Windows was running alongside the test program). There's no difference over the whole 1024-byte range. QWORD access in 64-bit is slightly faster than DWORD access in 32-bit and that's that.

Any contribution to the code will be welcome.

Edited November 8, 2018 by Primož Gabrijelčič
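The measurement described above can be sketched roughly as follows. This is not the MemSpeed code from the linked repository, only an illustration of the described procedure under a few assumptions: it uses the standard `CPUX64` compiler define instead of the post's `X64`, times with `TStopwatch`, prints to the console instead of a file, and allocates one extra page so that writes at the last offsets stay inside the buffer. As in the post, expect a full run to take a minute or more.

```pascal
program MemSpeedSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.Diagnostics;

const
  BufferSize = 1024;      // number of offsets tested, as in the post
  Iterations = 1000000;   // write pairs per measurement, as in the post
  Repeats    = 10;        // the shortest of ten runs is kept per offset

var
  Buffer: PByte;
  pData: PNativeUInt;
  Best: array[0..BufferSize - 1] of Int64;
  Offset, Run, I: Integer;
  Timer: TStopwatch;
begin
  // VirtualAlloc returns page-aligned memory, so offset 0 is well aligned.
  Buffer := VirtualAlloc(nil, BufferSize + 4096, MEM_COMMIT or MEM_RESERVE,
    PAGE_READWRITE);
  if Buffer = nil then
    RaiseLastOSError;
  try
    for Offset := 0 to BufferSize - 1 do
      Best[Offset] := High(Int64);

    for Run := 1 to Repeats do
      for Offset := 0 to BufferSize - 1 do
      begin
        pData := PNativeUInt(Buffer + Offset);
        Timer := TStopwatch.StartNew;
        for I := 1 to Iterations do
        begin
          pData^ := {$IFDEF CPUX64}$F0F0F0F0F0F0F0F0{$ELSE}$F0F0F0F0{$ENDIF};
          pData^ := {$IFDEF CPUX64}$0F0F0F0F0F0F0F0F{$ELSE}$0F0F0F0F{$ENDIF};
        end;
        Timer.Stop;
        // Keep the shortest time seen for this offset to filter out noise.
        if Timer.ElapsedTicks < Best[Offset] then
          Best[Offset] := Timer.ElapsedTicks;
      end;

    // Raw ticks only; relative differences are what matters here.
    for Offset := 0 to BufferSize - 1 do
      Writeln(Offset, ';', Best[Offset]);
  finally
    VirtualFree(Buffer, 0, MEM_RELEASE);
  end;
end.
```

Redirecting the output to a file gives a two-column list that can be graphed per offset.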
Primož Gabrijelčič
Posted November 8, 2018

Measurements from my i7:

Basically the same as on the Xeon. A bit larger speed difference between 32-bit DWORD and 64-bit QWORD. Remember - smaller is faster.
Primož Gabrijelčič
Posted November 8, 2018

4 hours ago, Zacherl said:
2/4/8 byte reads are atomic, if executed on a P6+ and fitting in a single cache line (any sane compiler should use correct alignments by itself)

Can you please specify what exactly you mean by this statement? If nobody is writing to the memory then the statement is obviously true. It doesn't matter whether reads are atomic or not - the data will always be correct. If there is a writer - does it have to be writing with the 'lock' prefix or not?
Primož Gabrijelčič
Posted November 8, 2018

49 minutes ago, Primož Gabrijelčič said:
If there is a writer - does it have to be writing with 'lock' prefix or no?

To answer myself - the writer doesn't need to use 'lock'. The proof is here: https://github.com/gabr42/GpDelphiCode/tree/master/MemAtomic

The code runs tests with 1/2/4/8 byte data on offsets from 0 to 127 (relative to a well-aligned memory block). It writes out all offsets where reads/writes were not atomic.

32-bit
1:
2: 63 127
4: 61 62 63 125 126 127
8: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

2/4 bytes: access is not atomic when straddling a cache line
8 bytes: access is never atomic

64-bit
1:
2: 63 127
4: 61 62 63 125 126 127
8: 57 58 59 60 61 62 63 121 122 123 124 125 126 127

2/4/8 bytes: access is not atomic when straddling a cache line
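A test of this kind can be sketched as one thread writing two known patterns with plain stores while another thread reads and counts any value that is neither pattern. The following is not the MemAtomic code from the repository, only a minimal illustration for a single case (4 bytes at offset 62); the patterns, the read count and all names are assumptions made for the example.

```pascal
program TornReadSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.Classes;

const
  PatternA = Cardinal($00000000);
  PatternB = Cardinal($FFFFFFFF);
  Reads    = 50000000;   // arbitrary number of reads for the illustration

var
  Buffer: PByte;
  Target: PCardinal;
  Stop: Boolean = False;
  Writer: TThread;
  Value: Cardinal;
  Torn, I: Integer;
begin
  Buffer := VirtualAlloc(nil, 4096, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE);
  if Buffer = nil then
    RaiseLastOSError;
  try
    // Offset 62: the value occupies bytes 62..65, so it crosses the
    // boundary between the first and second 64-byte block.
    Target := PCardinal(Buffer + 62);
    Target^ := PatternA;

    Writer := TThread.CreateAnonymousThread(
      procedure
      begin
        while not Stop do
        begin
          Target^ := PatternA;   // plain store, no LOCK prefix
          Target^ := PatternB;
        end;
      end);
    Writer.FreeOnTerminate := False;
    Writer.Start;

    Torn := 0;
    for I := 1 to Reads do
    begin
      Value := Target^;
      // Anything other than the two patterns mixes bytes of two writes.
      if (Value <> PatternA) and (Value <> PatternB) then
        Inc(Torn);
    end;

    Stop := True;
    Writer.WaitFor;
    Writer.Free;
    Writeln('Torn reads at offset 62: ', Torn);
  finally
    VirtualFree(Buffer, 0, MEM_RELEASE);
  end;
end.
```

A count of zero does not prove atomicity; it only means no torn value was observed during this run. Changing the offset to a value such as 60 places the whole value inside one 64-byte block for comparison.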
Zacherl
Posted November 8, 2018

1 hour ago, Primož Gabrijelčič said:
If there is a writer - does it have to be writing with 'lock' prefix or no?

You will need a lock prefix only if you have concurrent threads reading, modifying (e.g. incrementing by one) and writing values. This is to make sure the value does not get changed by another thread in a way like this:

T1: read 0
T2: read 0
T2: inc
T2: write 1
T1: inc
T1: write 1

By locking that operation, read + increment + write is always performed atomically:

T1: read 0
T1: inc
T1: write 1
T2: read 1
T2: inc
T2: write 2

For further explanation read this:

6 hours ago, Zacherl said:
For multi threaded read-modify-write access, read this (TLDR: you will need to use `TInterlocked.XXX()`): https://stackoverflow.com/a/5421844/9241044

If you only want to write values without reading and modifying the previous value, no `LOCK` prefix is needed.
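The interleaving shown above can be provoked with two threads incrementing the same variable. The sketch below is an illustration only; the counter names and the iteration count are hypothetical. It increments one counter with a plain `Inc` and another with `TInterlocked.Increment` so the two totals can be compared.

```pascal
program LostUpdateSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs;

const
  PerThread = 1000000;   // increments per thread, arbitrary

var
  PlainCounter: Integer = 0;   // incremented without LOCK
  SafeCounter: Integer = 0;    // incremented through TInterlocked

function StartWorker: TThread;
begin
  Result := TThread.CreateAnonymousThread(
    procedure
    var
      I: Integer;
    begin
      for I := 1 to PerThread do
      begin
        Inc(PlainCounter);                     // read + add + write, can interleave
        TInterlocked.Increment(SafeCounter);   // one atomic read-modify-write
      end;
    end);
  Result.FreeOnTerminate := False;
  Result.Start;
end;

var
  T1, T2: TThread;
begin
  T1 := StartWorker;
  T2 := StartWorker;
  T1.WaitFor;
  T2.WaitFor;
  T1.Free;
  T2.Free;
  // On a multi-core machine the plain counter may end up below 2 * PerThread.
  Writeln('Plain Inc:              ', PlainCounter);
  Writeln('TInterlocked.Increment: ', SafeCounter);
end.
```

The interlocked counter always ends at `2 * PerThread`; how many updates the plain counter loses, if any, varies from run to run and from machine to machine.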
Primož Gabrijelčič
Posted November 8, 2018

2 minutes ago, Zacherl said:
You will need a lock prefix only, if you have concurrent threads reading, modifying (e.g. incrementing by one) and writing values.

Yes, of course. My question was related to one thread reading and one thread writing.
David Heffernan
Posted November 8, 2018

7 hours ago, Zacherl said:
Sorry, but did you actually read my post? That's exactly what I quoted from the latest Intel SDM: Can you please give me some references that prove your statement? Intel SDM says (you might be correct for normal data access, but using the `LOCK` prefix is something else):

Yes, I read what you wrote. What you wrote was wrong. You said nothing about cache lines.
Zacherl
Posted November 8, 2018

2 minutes ago, David Heffernan said:
You said nothing about cache lines.

20 hours ago, Zacherl said:
Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

Nothing more to say ...
David Heffernan
Posted November 8, 2018

22 hours ago, Zacherl said:
Reading unaligned values from memory is slow but should still be atomic (on a Pentium6 and newer).

That's what you said.
Zacherl
Posted November 8, 2018

25 minutes ago, David Heffernan said:
That's what you said.

It's okay, man. I can't help it if you don't want to read the complete post (e.g. the text I quoted from the Intel SDM). I'm out.
David Heffernan
Posted November 8, 2018

59 minutes ago, Zacherl said:
It's okay, man. I can't help it if you don't want to read the complete post (e.g. the text I quoted from the Intel SDM). I'm out.

You agree then that unaligned access of data greater than a single byte is not atomic?
Stefan Glienke
Posted November 8, 2018

40 minutes ago, David Heffernan said:
unaligned access of data greater than a single byte is not atomic

    isAtomic := not IsCrossingCacheLine(unalignedData);
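The one-liner above can be turned into a small runnable check. The sketch below is one possible reading of it, not code from the thread: `IsCrossingCacheLine` here takes an address and a size (a hypothetical signature), and the 64-byte line size is an assumption that holds for the Intel CPUs discussed so far. It prints, for each access size, the offsets in 0..127 at which an access would cross a line boundary.

```pascal
program CacheLineSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  CacheLineSize = 64;   // assumption; differs between CPU families
  Sizes: array[0..2] of Integer = (2, 4, 8);

// True when the first and last byte of the access fall into different lines.
function IsCrossingCacheLine(Address: Pointer; Size: NativeUInt): Boolean;
var
  FirstLine, LastLine: NativeUInt;
begin
  FirstLine := NativeUInt(Address) div CacheLineSize;
  LastLine := (NativeUInt(Address) + Size - 1) div CacheLineSize;
  Result := FirstLine <> LastLine;
end;

var
  S, Offset: Integer;
begin
  for S in Sizes do
  begin
    Write(S, ':');
    for Offset := 0 to 127 do
      // The offset is used directly as an address relative to an aligned block.
      if IsCrossingCacheLine(Pointer(NativeUInt(Offset)), NativeUInt(S)) then
        Write(' ', Offset);
    Writeln;
  end;
end.
```

With a 64-byte line this lists 63 and 127 for 2 bytes, 61 to 63 and 125 to 127 for 4 bytes, and 57 to 63 and 121 to 127 for 8 bytes, which matches the 64-bit MemAtomic output posted earlier in the thread.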
David Heffernan
Posted November 9, 2018

@Stefan Glienke And the existence of two-byte data that cross cache lines shows that unaligned access is not atomic.

When arguing about non-deterministic properties like atomic access or thread safety, the language used is a little different. When we say code is not threadsafe, we don't mean that it will always fail. We mean that we can't guarantee that it will always succeed.

When we say memory reads are not atomic, we don't mean that they will always be read with multiple bus reads. We mean that we can't guarantee that they will always be read with a single bus read.
Stefan Glienke
Posted November 9, 2018

Well, it seems you have a different definition than Intel does in paragraph 8.1.1. Unaligned data can just as well be part of a struct/record which fits into a cache line, and is thus guaranteed to be atomic.
Primož Gabrijelčič
Posted November 9, 2018

More data, some old, some new. Firstly, two very old CPUs (the oldest I could find in the company):

This pattern repeats very consistently every 64 bytes (the size of a cache line):

A very interesting pattern, but the worst thing is the terrible slowdown when memory access crosses the cache line. Similar data can be seen in a Xeon of a similar age:

For a moment I thought I had used the wrong data files - that's how similar both results are!

And now a surprise! A very modern & fast AMD Ryzen Threadripper:

Wow! The cache line is only 32 bytes and memory access across that line is still slow! Interestingly, accessing 4-aligned 8-byte data in 64-bit works great even when straddling a cache line.

MemAtomic proves that the cache line is only 32 bytes:

1:
2: 15 31 47 63 79 95 111 127
4: 13 14 15 29 30 31 45 46 47 61 62 63 77 78 79 93 94 95 109 110 111 125 126 127
8: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

No wonder Intel is still king for non-optimized software!
David Heffernan
Posted November 13, 2018

On 11/9/2018 at 8:52 AM, Stefan Glienke said:
Well, it seems you have a different definition than Intel does in paragraph 8.1.1. Unaligned data can just as well be part of a struct/record which fits into a cache line, and is thus guaranteed to be atomic.

Fair enough.
Tommi Prami
Posted November 13, 2018 (edited)

The MemSpeed app by Primož, built with 10.2.3 in Release mode, 32-bit and 64-bit versions, if anyone dares to download...

MemSpeed.7z

Edited November 13, 2018 by Tommi Prami