Jacek Laskowski

Aligned and atomic read/write


Based on this thread:

 

https://stackoverflow.com/questions/29958168/are-integer-reads-atomic-in-delphi

 

it seems that reading or writing an Integer value (4 bytes) is not guaranteed to be atomic. It depends on the alignment.
Questions:
Is writing and reading 1 byte (type Byte) always an atomic (and safe) operation?

Is writing and reading 2 bytes (type Word) always an atomic (and safe) operation?

Edited by Jacek Laskowski


Reading unaligned values from memory is slow but should still be atomic (on a Pentium6 and newer). This behavior is described in the Intel SDM:

Quote

8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will
always be carried out atomically:
 • Reading or writing a byte
 • Reading or writing a word aligned on a 16-bit boundary
 • Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations
will always be carried out atomically:
 • Reading or writing a quadword aligned on a 64-bit boundary
 • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation
will always be carried out atomically:
 • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

The above is only valid for single-core CPUs. For multi-core CPUs you will need to utilize the "bus control signals", which the SDM explains as follows:

Quote

Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

 

Delphi implements the System.SyncObjs.TInterlocked class which provides some functions to help with atomic access (e.g. `TInterlocked.Exchange()` for an atomic exchange operation). You should always use these functions to make sure your application does not run into race conditions on multi-core systems.
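As a minimal sketch of what these calls look like in practice (the program name and the `State` variable are made up for illustration, and the demo is single-threaded just to show the return values):

```delphi
program ExchangeDemo;

{$APPTYPE CONSOLE}

uses
  System.SyncObjs;

var
  State: Integer = 0;   // illustrative shared variable
  Previous: Integer;

begin
  // Atomically store 1 and get back whatever was there before.
  Previous := TInterlocked.Exchange(State, 1);
  Writeln('Exchange: previous = ', Previous, ', now = ', State);

  // Store 2 only if the current value is still 1; returns the value it saw.
  Previous := TInterlocked.CompareExchange(State, 2, 1);
  Writeln('CompareExchange: saw = ', Previous, ', now = ', State);
end.
```

In real code `State` would be shared between threads; the point is only that each call is one indivisible operation and reports the value it replaced.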

 

BTW: I released a unit with a few atomic type wrappers some time ago (limited functionality compared to the `std::atomic<T>` C++ types, as Delphi does not allow overloading of the assignment operator):

https://github.com/flobernd/delphi-utils/blob/master/Utils.AtomicTypes.pas

Edited by Zacherl

8 hours ago, Zacherl said:

Reading unaligned values from memory is slow but should still be atomic (on a Pentium6 and newer). 

Actually both of these statements are wrong. Reading unaligned memory is not atomic for reads that straddle cache lines. And unaligned memory access is not slow on modern processors. 

On 11/6/2018 at 2:30 PM, Jacek Laskowski said:

Is writing and reading 1 byte (type Byte) always an atomic (and safe) operation?

Is writing and reading 2 bytes (type Word) always an atomic (and safe) operation?

Single bytes are aligned, because they can't straddle cache lines. Reads are therefore atomic, because they are aligned. 

 

Two byte reads can straddle cache lines and unaligned reads are not atomic. 

4 hours ago, David Heffernan said:

Actually both of these statements are wrong. Reading unaligned memory is not atomic for reads that straddle cache lines.

Sorry, but did you actually read my post? 

 

That's exactly what I quoted from the latest Intel SDM:

13 hours ago, Zacherl said:

 • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

 

4 hours ago, David Heffernan said:

And unaligned memory access is not slow on modern processors. 

Can you please give me some references that prove your statement? The Intel SDM says (you might be correct for normal data access, but using the `LOCK` prefix is something else):

13 hours ago, Zacherl said:

nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

 

Edited by Zacherl

32 minutes ago, Zacherl said:
  4 hours ago, David Heffernan said:

And unaligned memory access is not slow on modern processors. 

Can you please give me some references that prove your statement

 

I can confirm (from experience) that this is indeed true. I cannot find any definitive document about it, but it looks like unaligned access hasn't hurt very much since 2011/12 (at least on the Intel platform).

 

Some 3rd party posts that confirm my finding: 

https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

https://www.reddit.com/r/programming/comments/2la6qc/data_alignment_for_speed_myth_or_reality/ 

 

Edited by Primož Gabrijelčič
51 minutes ago, Primož Gabrijelčič said:

I can confirm (from experience) that this is indeed true. I cannot find any definitive document about it, but it looks like unaligned access hasn't hurt very much since 2011/12 (at least on the Intel platform).

Well, okay, it seems the Intel SDM needs to be updated. Did anybody test the same scenario with multi-threaded atomic write operations (`lock xchg`, `lock add`, and so on)? I could imagine different performance results there.

 

Anyway, I guess the original question is answered. To summarize:

  • 1 byte reads are always atomic
  • 2/4/8 byte reads are atomic, if executed on a P6+ and fitting in a single cache line (any sane compiler should use correct alignments by itself)

For multi-threaded read-modify-write access, read this (TLDR: you will need to use `TInterlocked.XXX()`):

https://stackoverflow.com/a/5421844/9241044


I put together a simple test measuring aligned vs. unaligned access speed. You can find it here: https://github.com/gabr42/GpDelphiCode/tree/master/MemSpeed

 

The code simply runs in a tight loop and does the following 1 million times:

 

            pData^ := {$IFDEF X64}$F0F0F0F0F0F0F0F0{$ELSE}$F0F0F0F0{$ENDIF};
            pData^ := {$IFDEF X64}$0F0F0F0F0F0F0F0F{$ELSE}$0F0F0F0F{$ENDIF};
 

All this is repeated for each offset in a 1024-byte buffer. 

 

All of the above is repeated ten times.

 

Memory is allocated with VirtualAlloc which gives back nicely aligned blocks.

 

At the end, the shortest time for each offset is logged into a file of your choosing. I was only interested in relative differences, so I did nothing to convert the data to "real" time units.

 

Warning: The code needs more than a minute to run on my slow Xeon.

 

If I graph the result in Excel, I get this:

 

image.png.fca17575461c5f9807110b38f77cb6dc.png

 

There's basically no difference (besides the noise: all the junk I have installed on Windows was running alongside the test program).

 

There's no difference over the whole 1024-byte range. QWORD access in 64-bit is slightly faster than DWORD access in 32-bit and that's that.

 

image.png.64239446a5f8950315bf760d46d6f044.png

 

Any contribution to the code will be welcome.

Edited by Primož Gabrijelčič

4 hours ago, Zacherl said:
  • 2/4/8 byte reads are atomic, if executed on a P6+ and fitting in a single cache line (any sane compiler should use correct alignments by itself)

Can you please specify what exactly you mean by this statement?

 

If nobody is writing to the memory then the statement is obviously true. It doesn't matter whether reads are atomic or not - the data will always be correct.

 

If there is a writer - does it have to be writing with 'lock' prefix or no? 

49 minutes ago, Primož Gabrijelčič said:

If there is a writer - does it have to be writing with 'lock' prefix or no? 

 

To answer myself: the writer doesn't need to use 'lock'. The proof is here: https://github.com/gabr42/GpDelphiCode/tree/master/MemAtomic

 

The code runs tests with 1/2/4/8 byte data on offsets from 0 to 127 (relative to a well-aligned memory block). It writes out all offsets where reads/writes were not atomic.

 

32-bit

 

1: 

2: 63 127 
4: 61 62 63 125 126 127 
8: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 

 

2/4 bytes: access is not atomic when straddling a cache line

8 bytes: access is never atomic

 

64-bit

 

1: 

2: 63 127 
4: 61 62 63 125 126 127 
8: 57 58 59 60 61 62 63 121 122 123 124 125 126 127 

 

2/4/8 bytes: access is not atomic when straddling a cache line
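The 64-bit rows match simple arithmetic on a 64-byte line. Here is a minimal sketch that lists every offset in 0..127 where an access of the given size would straddle a line boundary. The 64-byte line size is an assumption, and `Straddles` is a made-up helper, not part of MemAtomic:

```delphi
program StraddleOffsets;

{$APPTYPE CONSOLE}

const
  LineSize = 64;  // assumption: 64-byte cache lines
  Sizes: array[0..3] of Integer = (1, 2, 4, 8);

// Hypothetical helper: True if Size bytes starting at Offset span two lines.
function Straddles(Offset, Size: Integer): Boolean;
begin
  Result := (Offset mod LineSize) + Size > LineSize;
end;

var
  I, Offset: Integer;

begin
  for I := Low(Sizes) to High(Sizes) do
  begin
    Write(Sizes[I], ': ');
    for Offset := 0 to 127 do
      if Straddles(Offset, Sizes[I]) then
        Write(Offset, ' ');
    Writeln;
  end;
end.
```

With `LineSize = 64` this should list the same offsets as the 64-bit run above. The 32-bit "8" row is a different case, presumably because a 32-bit build writes an 8-byte value as two separate 4-byte stores, so it can tear at any offset.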

 

1 hour ago, Primož Gabrijelčič said:

If there is a writer - does it have to be writing with 'lock' prefix or no? 

You need a lock prefix only if you have concurrent threads reading, modifying (e.g. incrementing by one) and writing values. This makes sure the value does not get changed by another thread in a way like this:

T1: read 0
T2: read 0
T2: inc
T2: write 1
T1: inc
T1: write 1

With the lock, read + increment + write is always performed atomically:

T1: read 0
T1: inc
T1: write 1
T2: read 1
T2: inc
T2: write 2

For further explanation read this:

6 hours ago, Zacherl said:

For multi-threaded read-modify-write access, read this (TLDR: you will need to use `TInterlocked.XXX()`):

https://stackoverflow.com/a/5421844/9241044

 

If you only want to write values without reading and modifying of the previous value, no `LOCK` prefix is needed.
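To see the difference between a plain read-modify-write and a locked one, here is a rough sketch with two threads incrementing two counters. All names are illustrative, and how many plain increments get lost depends on the machine:

```delphi
program LostUpdateDemo;

{$APPTYPE CONSOLE}

uses
  System.Classes, System.SyncObjs;

const
  Iterations = 1000000;

var
  PlainCounter: Integer = 0;   // incremented without a lock prefix
  LockedCounter: Integer = 0;  // incremented through TInterlocked

procedure Work;
var
  I: Integer;
begin
  for I := 1 to Iterations do
  begin
    Inc(PlainCounter);                      // plain read-modify-write, updates can be lost
    TInterlocked.Increment(LockedCounter);  // atomic read-modify-write
  end;
end;

function StartWorker: TThread;
begin
  Result := TThread.CreateAnonymousThread(
    procedure
    begin
      Work;
    end);
  Result.FreeOnTerminate := False;  // keep the object alive so WaitFor is safe
  Result.Start;
end;

var
  T1, T2: TThread;

begin
  T1 := StartWorker;
  T2 := StartWorker;
  T1.WaitFor;
  T2.WaitFor;
  T1.Free;
  T2.Free;
  Writeln('Plain Inc:              ', PlainCounter);
  Writeln('TInterlocked.Increment: ', LockedCounter);
end.
```

The locked counter should end at 2 * Iterations. On a multi-core machine the plain counter will usually end up lower, although nothing guarantees that a given run loses any updates.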

2 minutes ago, Zacherl said:

You need a lock prefix only if you have concurrent threads reading, modifying (e.g. incrementing by one) and writing values.

 

Yes, of course. My question was related to one thread reading and one thread writing.

7 hours ago, Zacherl said:

Sorry, but did you actually read my post? 

 

That's exactly what I quoted from the latest Intel SDM:

Can you please give me some references that prove your statement? The Intel SDM says (you might be correct for normal data access, but using the `LOCK` prefix is something else):

 

Yes, I read what you wrote. What you wrote was wrong. You said nothing about cache lines. 

2 minutes ago, David Heffernan said:

You said nothing about cache lines. 

20 hours ago, Zacherl said:

Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

Nothing more to say ...

22 hours ago, Zacherl said:

Reading unaligned values from memory is slow but should still be atomic (on a Pentium6 and newer). 

That's what you said.

25 minutes ago, David Heffernan said:

That's what you said.

It's okay, man. I can't help it if you don't want to read the complete post (e.g. the text I quoted from the Intel SDM). I'm out.

59 minutes ago, Zacherl said:

It's okay, man. I can't help it if you don't want to read the complete post (e.g. the text I quoted from the Intel SDM). I'm out.

You agree then that unaligned access of data greater than a single byte is not atomic? 

40 minutes ago, David Heffernan said:

unaligned access of data greater than a single byte is not atomic

isAtomic := not IsCrossingCacheLine(unalignedData);
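`IsCrossingCacheLine` is not an RTL function. One way it could be sketched, assuming a 64-byte cache line and a signature that takes an address and a size (both are assumptions made for this example):

```delphi
program CacheLineCheck;

{$APPTYPE CONSOLE}

const
  CacheLineSize = 64;  // assumption: 64-byte lines; other CPUs may differ

// Hypothetical helper: True if the Size bytes starting at Address
// touch two different cache lines.
function IsCrossingCacheLine(Address, Size: NativeUInt): Boolean;
begin
  Result := (Address mod CacheLineSize) + Size > CacheLineSize;
end;

var
  Value: Integer;

begin
  // Fixed addresses, so the outcome does not depend on where Value lands.
  Writeln(IsCrossingCacheLine(62, 4));  // bytes 62..65 cross the boundary at 64
  Writeln(IsCrossingCacheLine(60, 4));  // bytes 60..63 stay inside one line

  // Checking a real variable: cast its address to NativeUInt.
  Writeln(IsCrossingCacheLine(NativeUInt(@Value), SizeOf(Value)));
end.
```

A naturally aligned Integer can never cross a line, so the last check only becomes interesting for packed records or hand-computed pointers.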

 


@Stefan Glienke And the existence of two byte data that cross cache lines shows that unaligned access is not atomic. 

 

When arguing about non-deterministic properties like atomic access or thread safety, the language used is a little different. When we say code is not threadsafe we don't mean that it will always fail. We mean that we can't guarantee that it will always succeed. When we say memory reads are not atomic we don't mean that they will always be read with multiple bus reads. We mean that we can't guarantee that they will always be read with a single bus read.


Well, it seems you have a different definition than Intel does in paragraph 8.1.1.

Unaligned data can just as well be part of a struct/record that fits into a cache line, and is thus guaranteed to be atomic.


More data, some old, some new.

 

Firstly, two very old CPUs (the oldest I could find in the company):

image.thumb.png.9c47b549712a93508361d9ea0ada60aa.png

 

This pattern repeats very consistently every 64 bytes (size of cache line):

image.thumb.png.8e4d8e9a8073018b953c52c16098ff7d.png

 

Very interesting pattern but the worst thing is the terrible slowdown when memory access crosses the cache line.

 

Similar data can be seen in a Xeon of a similar age:

 

image.thumb.png.0429eaea4e24dda4275738737285d1b7.png

 

image.thumb.png.6986d05859a149b15572f0e51d996972.png

 

For a moment I thought I used the wrong data files - that's how similar both results are! 

 

And now a surprise! A very modern & fast AMD Ryzen Threadripper:

 

image.thumb.png.a872492a37bc3c9c4ad92a3cc72f2558.png

 

image.thumb.png.d1220ac6b9455da4217ef37605a7e3be.png

 

Wow! The cache line is only 32 bytes and memory access across that line is still slow! Interestingly, accessing 4-aligned 8-byte data in 64-bit works great even when straddling a cache line.

 

MemAtomic proves that the cache line is only 32 bytes:

 

1: 
2: 15 31 47 63 79 95 111 127 
4: 13 14 15 29 30 31 45 46 47 61 62 63 77 78 79 93 94 95 109 110 111 125 126 127 
8: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 
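Read literally, the 2-byte and 4-byte rows put the failing offsets 16 bytes apart rather than 32: a 2-byte access at offset 15 touches bytes 15 and 16. A small sketch that computes the spacing from the 2-byte row (the values are copied from the output above; the program and array names are illustrative):

```delphi
program BoundarySpacing;

{$APPTYPE CONSOLE}

const
  // Failing offsets from the 2-byte row of the Threadripper run.
  Failing: array[0..7] of Integer = (15, 31, 47, 63, 79, 95, 111, 127);

var
  I: Integer;

begin
  // Distance between consecutive failing offsets.
  for I := 1 to High(Failing) do
    Write(Failing[I] - Failing[I - 1], ' ');
  Writeln;
end.
```

Every gap comes out as 16. If that reading is right, the boundary that matters for atomicity in this test would be 16 bytes, which is a separate question from the line size suggested by the timing graphs.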

 

No wonder Intel is still the king for non-optimized software!

 

 

 

On 11/9/2018 at 8:52 AM, Stefan Glienke said:

Well, it seems you have a different definition than Intel does in paragraph 8.1.1.

Unaligned data can just as well be part of a struct/record that fits into a cache line, and is thus guaranteed to be atomic.

Fair enough.


MemSpeed app by Primož, built with 10.2.3 in Release mode, 32- and 64-bit versions, if anyone dares to download...

MemSpeed.7z

Edited by Tommi Prami
