Aligned and atomic read/write

Jacek Laskowski · November 6, 2018

Based on this thread:

https://stackoverflow.com/questions/29958168/are-integer-reads-atomic-in-delphi

it can be assumed that reading or writing Integer values (4 bytes) is not necessarily atomic. It depends on the alignment.
Questions:
Is writing and reading 1 byte (type Byte) always an atomic (and safe) operation?

Is writing and reading 2 bytes (type Word) always an atomic (and safe) operation?

Edited November 6, 2018 by Jacek Laskowski

Zacherl · November 7, 2018

Reading unaligned values from memory is slow but should still be atomic (on a Pentium6 and newer). This behavior is described in the Intel SDM:

Quote

8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations
will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation
will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

The above is only valid for single-core CPUs. For multi-core CPUs you will need to utilize the "bus control signals":

Quote

Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

Delphi implements the System.SyncObjs.TInterlocked class which provides some functions to help with atomic access (e.g. `TInterlocked.Exchange()` for an atomic exchange operation). You should always use these functions to make sure your application does not run into race conditions on multi-core systems.
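A minimal sketch of the `TInterlocked` approach for a 64-bit value, which the 32-bit compiler normally copies with two separate 32-bit moves. `Shared`, `AtomicWrite` and `AtomicRead` are hypothetical names introduced for illustration; `TInterlocked.Exchange` and `TInterlocked.Read` are the `System.SyncObjs` calls.

```delphi
program Int64AtomicAccess;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.SyncObjs;

var
  Shared: Int64 = 0; // hypothetical value shared between threads

// Stores Value so that a concurrent AtomicRead never sees a half-written result.
procedure AtomicWrite(var Target: Int64; Value: Int64);
begin
  TInterlocked.Exchange(Target, Value);
end;

// Returns all 8 bytes of Target as one atomic read.
function AtomicRead(var Target: Int64): Int64;
begin
  Result := TInterlocked.Read(Target);
end;

begin
  AtomicWrite(Shared, $0F0F0F0F0F0F0F0F);
  Writeln(IntToHex(AtomicRead(Shared), 16)); // prints 0F0F0F0F0F0F0F0F
end.
```

Both helpers work the same way in 64-bit builds, so the calling code does not need to know the target platform.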

BTW: I released a unit with a few atomic type wrappers some time ago (limited functionality compared to the `std::atomic<T>` C++ types as Delphi does not allow overloading of the assignment operator):

https://github.com/flobernd/delphi-utils/blob/master/Utils.AtomicTypes.pas

Edited November 7, 2018 by Zacherl

David Heffernan · November 8, 2018

8 hours ago, Zacherl said:

Reading unaligned values from memory is slow but should still be atomic (on a Pentium6 and newer).

Actually both of these statements are wrong. Reading unaligned memory is not atomic for reads that straddle cache lines. And unaligned memory access is not slow on modern processors.

David Heffernan · November 8, 2018

On 11/6/2018 at 2:30 PM, Jacek Laskowski said:

Is writing and reading 1 byte (type Byte) always an atomic (and safe) operation?

Is writing and reading 2 bytes (type Word) always an atomic (and safe) operation?

Single bytes are aligned, because they can't straddle cache lines. Reads are therefore atomic, because they are aligned.

Two byte reads can straddle cache lines and unaligned reads are not atomic.

Zacherl · November 8, 2018

4 hours ago, David Heffernan said:

Actually both of these statements are wrong. Reading unaligned memory is not atomic for reads that straddle cache lines.

Sorry, but did you actually read my post?

That's exactly what I quoted from the latest Intel SDM:

13 hours ago, Zacherl said:

• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

4 hours ago, David Heffernan said:

And unaligned memory access is not slow on modern processors.

Can you please give me some references that prove your statement? Intel SDM says (you might be correct for normal data access, but using the `LOCK` prefix is something else):

13 hours ago, Zacherl said:

nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

Edited November 8, 2018 by Zacherl

Primož Gabrijelčič · November 8, 2018

32 minutes ago, Zacherl said:

4 hours ago, David Heffernan said:

And unaligned memory access is not slow on modern processors.

Can you please give me some references that prove your statement

I can confirm (from experience) that this is indeed true. I cannot find any definitive document about that, but it looks like since 2011/12 unaligned access doesn't hurt very much (at least on Intel platform).

Some 3rd party posts that confirm my finding:

https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

https://www.reddit.com/r/programming/comments/2la6qc/data_alignment_for_speed_myth_or_reality/

Edited November 8, 2018 by Primož Gabrijelčič

Tommi Prami · November 8, 2018

Check out this fabulous book about memory alignment and performance.

Zacherl · November 8, 2018

51 minutes ago, Primož Gabrijelčič said:

I can confirm (from experience) that this is indeed true. I cannot find any definitive document about that, but it looks like since 2011/12 unaligned access doesn't hurt very much (at least on Intel platform).

Well, okay, it seems like the Intel SDM needs to be updated. Did anybody test the same scenario with multi-threaded atomic write operations (`lock xchg`, `lock add`, and so on)? I could imagine different results in terms of performance here.

Anyways .. I guess the original question is answered. To summarize this:

1 byte reads are always atomic
2/4/8 byte reads are atomic, if executed on a P6+ and fitting in a single cache line (any sane compiler should use correct alignments by itself)

For multi threaded read-modify-write access, read this (TLDR: you will need to use `TInterlocked.XXX()`):

https://stackoverflow.com/a/5421844/9241044

Primož Gabrijelčič · November 8, 2018

I put together a simple test measuring aligned vs. unaligned access speed. You can find it here: https://github.com/gabr42/GpDelphiCode/tree/master/MemSpeed

The code simply runs in a tight loop and does the following 1 million times:

pData^ := {$IFDEF X64}$F0F0F0F0F0F0F0F0{$ELSE}$F0F0F0F0{$ENDIF};
pData^ := {$IFDEF X64}$0F0F0F0F0F0F0F0F{$ELSE}$0F0F0F0F{$ENDIF};

All this is repeated for each offset in a 1024-byte buffer.

All above is repeated ten times.

Memory is allocated with VirtualAlloc which gives back nicely aligned blocks.

At the end, the shortest time for each offset is logged into a file of your choosing. I was only interested in relative differences so I did nothing to convert data to "real" time unit.

Warning: The code needs more than a minute to run on my slow Xeon.
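A rough sketch of the kind of measurement loop described above, not the actual MemSpeed code. The names `TimeWrites` and `BestTime` are hypothetical, the counts are reduced so it finishes quickly, the repeats are done per offset rather than over the whole buffer, and it uses the predefined `CPUX64` symbol instead of the `X64` symbol from the snippet above.

```delphi
program MemSpeedSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.Diagnostics;

const
  MaxOffset  = 127;     // the test described above covers a 1024-byte buffer
  Iterations = 100000;  // reduced from 1 million to keep the sketch quick
  Repeats    = 3;       // reduced from 10

// Times Iterations pairs of native-word writes to P, in stopwatch ticks.
function TimeWrites(P: PNativeUInt): Int64;
var
  I: Integer;
  Watch: TStopwatch;
begin
  Watch := TStopwatch.StartNew;
  for I := 1 to Iterations do
  begin
    P^ := {$IFDEF CPUX64}$F0F0F0F0F0F0F0F0{$ELSE}$F0F0F0F0{$ENDIF};
    P^ := {$IFDEF CPUX64}$0F0F0F0F0F0F0F0F{$ELSE}$0F0F0F0F{$ENDIF};
  end;
  Result := Watch.ElapsedTicks;
end;

// Shortest of Repeats timings, to filter out noise from other processes.
function BestTime(P: PNativeUInt): Int64;
var
  R: Integer;
  T: Int64;
begin
  Result := High(Int64);
  for R := 1 to Repeats do
  begin
    T := TimeWrites(P);
    if T < Result then
      Result := T;
  end;
end;

var
  Buffer: PByte;
  Offset: Integer;
begin
  // VirtualAlloc returns page-aligned memory, so offset 0 is well aligned.
  Buffer := VirtualAlloc(nil, MaxOffset + 1 + SizeOf(NativeUInt),
    MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE);
  if Buffer = nil then
    RaiseLastOSError;
  try
    // One line per offset: offset, tab, shortest time in ticks.
    for Offset := 0 to MaxOffset do
      Writeln(Offset, #9, BestTime(PNativeUInt(Buffer + Offset)));
  finally
    VirtualFree(Buffer, 0, MEM_RELEASE);
  end;
end.
```

The ticks are only meaningful relative to each other, which matches the goal of comparing offsets rather than measuring absolute time.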

If I graph the result in Excel, I get this:

[Image: chart of the measured times per offset]

There's basically no difference (besides the noise - all the junk I got installed on Windows was running along the test program).

There's no difference over the whole 1024-byte range. QWORD access in 64-bit is slightly faster than DWORD access in 32-bit and that's that.

[Image: chart of the measurements]

Any contribution to the code will be welcome.

Edited November 8, 2018 by Primož Gabrijelčič

Primož Gabrijelčič · November 8, 2018

Measurements from my i7:

[Image: chart of the i7 measurements]

[Image: chart of the i7 measurements]

Basically the same as on the Xeon. A bit larger speed difference between 32-bit DWORD and 64-bit QWORD. Remember - smaller is faster.

Primož Gabrijelčič · November 8, 2018

4 hours ago, Zacherl said:

2/4/8 byte reads are atomic, if executed on a P6+ and fitting in a single cache line (any sane compiler should use correct alignments by itself)

Can you please specify what exactly you mean by this statement?

If nobody is writing to the memory then the statement is obviously true. It doesn't matter whether reads are atomic or not - the data will always be correct.

If there is a writer - does it have to be writing with 'lock' prefix or no?

Primož Gabrijelčič · November 8, 2018

49 minutes ago, Primož Gabrijelčič said:

If there is a writer - does it have to be writing with 'lock' prefix or no?

To answer myself - writer doesn't need to use 'lock'. The proof is here: https://github.com/gabr42/GpDelphiCode/tree/master/MemAtomic

The code runs tests with 1/2/4/8 byte data on offsets from 0 to 127 (relative to a well-aligned memory block). It writes out all offsets where reads/writes were not atomic.

32-bit

1:

2: 63 127
4: 61 62 63 125 126 127
8: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

2/4 bytes: access is not atomic when straddling a cache line

8 bytes: access is never atomic

64-bit

1:

2: 63 127
4: 61 62 63 125 126 127
8: 57 58 59 60 61 62 63 121 122 123 124 125 126 127

2/4/8 bytes: access is not atomic when straddling a cache line
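A simplified sketch of how such a tearing test could look for the 2-byte case only; it is not the MemAtomic code. `SeesTornWord` and `Reads` are hypothetical names. The test is probabilistic: an offset that is not reported only means no torn value was observed in this run, and on a single-core machine nothing is expected to show up.

```delphi
program TearingSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.Classes;

const
  Reads = 20000000; // reads per offset; more reads make a miss less likely

// True if the reader saw a 2-byte value that the writer never stored.
function SeesTornWord(P: PWord): Boolean;
var
  Writer: TThread;
  Stop: Boolean;
  I: Integer;
  V: Word;
begin
  Result := False;
  Stop := False;
  P^ := $0000;
  Writer := TThread.CreateAnonymousThread(
    procedure
    begin
      // Plain stores, no lock prefix.
      while not Stop do
      begin
        P^ := $0000;
        P^ := $FFFF;
      end;
    end);
  Writer.FreeOnTerminate := False; // we wait for and free the thread ourselves
  Writer.Start;
  try
    for I := 1 to Reads do
    begin
      V := P^;
      // $00FF or $FF00 would mix bytes from two different stores.
      if (V <> $0000) and (V <> $FFFF) then
      begin
        Result := True;
        Break;
      end;
    end;
  finally
    Stop := True;
    Writer.WaitFor;
    Writer.Free;
  end;
end;

var
  Block: PByte;
  Offset: Integer;
begin
  // Page-aligned block, so the offsets are relative to a cache line start.
  Block := VirtualAlloc(nil, 4096, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE);
  if Block = nil then
    RaiseLastOSError;
  try
    Write('2: ');
    for Offset := 0 to 127 do
      if SeesTornWord(PWord(Block + Offset)) then
        Write(Offset, ' ');
    Writeln;
  finally
    VirtualFree(Block, 0, MEM_RELEASE);
  end;
end.
```

The same idea extends to 4 and 8 bytes by alternating two patterns whose halves differ and checking that every value read equals one of them.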

Zacherl · November 8, 2018

1 hour ago, Primož Gabrijelčič said:

If there is a writer - does it have to be writing with 'lock' prefix or no?

You will need a lock prefix only, if you have concurrent threads reading, modifying (e.g. incrementing by one) and writing values. This is to make sure the value does not get changed by another thread in a way like this:

T1: read 0

T2: read 0

T2: inc

T2: write 1

T1: inc

T1: write 1

By locking that operation, read + increment + write is always performed atomically:

T1: read 0

T1: inc

T1: write 1

T2: read 1

T2: inc

T2: write 2

For further explanation read this:

6 hours ago, Zacherl said:

For multi threaded read-modify-write access, read this (TLDR: you will need to use `TInterlocked.XXX()`):

https://stackoverflow.com/a/5421844/9241044

If you only want to write values without reading and modifying of the previous value, no `LOCK` prefix is needed.
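The interleaving above can be provoked with a small sketch like the following, assuming two threads that increment the same counters. `PlainCounter`, `AtomicCounter` and `Work` are hypothetical names; `TInterlocked.Increment` is the `System.SyncObjs` call.

```delphi
program LostUpdateDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs;

const
  Iterations = 1000000;

var
  PlainCounter: Integer = 0;  // incremented with a plain Inc
  AtomicCounter: Integer = 0; // incremented with TInterlocked.Increment

procedure Work;
var
  I: Integer;
begin
  for I := 1 to Iterations do
  begin
    Inc(PlainCounter);                     // unlocked read-modify-write
    TInterlocked.Increment(AtomicCounter); // atomic read-modify-write
  end;
end;

var
  T1, T2: TThread;
begin
  T1 := TThread.CreateAnonymousThread(procedure begin Work; end);
  T2 := TThread.CreateAnonymousThread(procedure begin Work; end);
  // Keep the thread objects alive so that WaitFor is safe.
  T1.FreeOnTerminate := False;
  T2.FreeOnTerminate := False;
  T1.Start;
  T2.Start;
  T1.WaitFor;
  T2.WaitFor;
  T1.Free;
  T2.Free;
  // AtomicCounter is always 2 * Iterations; PlainCounter may come out lower
  // on a multi-core machine because some increments overwrite each other.
  Writeln('Plain:  ', PlainCounter);
  Writeln('Atomic: ', AtomicCounter);
end.
```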

Primož Gabrijelčič · November 8, 2018

2 minutes ago, Zacherl said:

You will need a lock prefix only, if you have concurrent threads reading, modifying (e.g. incrementing by one) and writing values.

Yes, of course. My question was related to one thread reading and one thread writing.

David Heffernan · November 8, 2018

7 hours ago, Zacherl said:

Sorry, but did you actually read my post?

That's exactly what I quoted from the latest Intel SDM:

Can you please give me some references that prove your statement? Intel SDM says (you might be correct for normal data access, but using the `LOCK` prefix is something else):

Yes, I read what you wrote. What you wrote was wrong. You said nothing about cache lines.

Zacherl · November 8, 2018

2 minutes ago, David Heffernan said:

You said nothing about cache lines.

20 hours ago, Zacherl said:

Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

Nothing more to say ...

David Heffernan · November 8, 2018

22 hours ago, Zacherl said:

Reading unaligned values from memory is slow but should still be atomic (on a Pentium6 and newer).

That's what you said.

Zacherl · November 8, 2018

25 minutes ago, David Heffernan said:

That's what you said.

It's okay man. Can't help it if you don't want to read the complete post (e.g. the text I quoted from the Intel SDM). I'm out.

David Heffernan · November 8, 2018

59 minutes ago, Zacherl said:

It's okay man. Can't help it if you don't want to read the complete post (e.g. the text I quoted from the Intel SDM). I'm out.

You agree then that unaligned access of data greater than a single byte is not atomic?

Stefan Glienke · November 8, 2018

40 minutes ago, David Heffernan said:

unaligned access of data greater than a single byte is not atomic

isAtomic := not IsCrossingCacheLine(unalignedData);
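One possible way to fill in the `IsCrossingCacheLine` check from the line above, as a sketch that takes an address and a size and assumes a 64-byte cache line (the size the Intel measurements in this thread point to). The signature and the `CacheLineSize` constant are assumptions made for illustration.

```delphi
program CacheLineCheck;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  CacheLineSize = 64; // assumption: 64-byte cache lines

// True when the Size bytes starting at P do not all lie in one cache line.
function IsCrossingCacheLine(P: Pointer; Size: NativeUInt): Boolean;
var
  OffsetInLine: NativeUInt;
begin
  OffsetInLine := NativeUInt(P) and (CacheLineSize - 1);
  Result := OffsetInLine + Size > CacheLineSize;
end;

var
  Offset: Integer;
begin
  // Treat the offsets as addresses relative to a line-aligned block.
  Write('4: ');
  for Offset := 0 to 127 do
    if IsCrossingCacheLine(Pointer(NativeUInt(Offset)), 4) then
      Write(Offset, ' ');
  Writeln; // prints 4: 61 62 63 125 126 127
end.
```

For 4-byte values this reproduces the offsets that the MemAtomic results list as non-atomic on the Intel machines.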

David Heffernan · November 9, 2018

@Stefan Glienke And the existence of two byte data that cross cache lines shows that unaligned access is not atomic.

When arguing about non-deterministic properties like atomic access or thread safety, the language used is a little different. When we say code is not threadsafe, we don't mean that it will always fail. We mean that we can't guarantee that it will always succeed. When we say memory reads are not atomic, we don't mean that they will always be read with multiple bus reads. We mean that we can't guarantee that they will always be read with a single bus read.

Stefan Glienke · November 9, 2018

Well, you have a different definition than Intel in paragraph 8.1.1, it seems.

So unaligned data can be as well part of a struct/record which fits into a cache line and thus is guaranteed to be atomic.

Primož Gabrijelčič · November 9, 2018

More data, some old, some new.

Firstly, two very old CPUs (the oldest I could find in the company):

This pattern repeats very consistently every 64 bytes (size of cache line):

Very interesting pattern but the worst thing is the terrible slowdown when memory access crosses the cache line.

Similar data can be seen in a Xeon of a similar age:

For a moment I thought I used the wrong data files - that's how similar both results are!

And now a surprise! A very modern & fast AMD Ryzen Threadripper:

Wow! The cache line is only 32 bytes and memory access across that line is still slow! Interestingly, accessing 4-aligned 8-byte data in 64-bits works great even when straddling cache line.

MemAtomic proves that cache line is only 32-byte:

1:
2: 15 31 47 63 79 95 111 127
4: 13 14 15 29 30 31 45 46 47 61 62 63 77 78 79 93 94 95 109 110 111 125 126 127
8: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

No wonder Intel is still a king for non-optimized software!

David Heffernan · November 13, 2018

On 11/9/2018 at 8:52 AM, Stefan Glienke said:

Well, you have a different definition than Intel in paragraph 8.1.1, it seems.

So unaligned data can be as well part of a struct/record which fits into a cache line and thus is guaranteed to be atomic.

Fair enough.

Tommi Prami · November 13, 2018

MemSpeed app by Primož, built with 10.2.3 in Release mode, 32- and 64-bit versions, if anyone dares to download...

MemSpeed.7z

Edited November 13, 2018 by Tommi Prami
