Dalija Prasnikar, posted December 1, 2021

1 hour ago, Anders Melander said:

> Okay. Here goes:
>
> You can but don't need to do bit fiddling. We just need to store 3 different states in the pointer:
>
> - Unassigned
> - Assigned
> - Locked - being assigned
>
> It's easy to identify the first two states: if the pointer value is nil then the pointer is unassigned, if it's non-nil then it's assigned. But what about the locked state? Well, for a pointer to an object we can take advantage of the fact that memory allocations are aligned, so the lower n bits will always be zero and can be (mis)used to store a lock flag. However, since it's a pointer we can just as well use a pointer value that is known never to be returned by the memory manager. For example ordinal 1.
>
>     var
>       FSingleton: TSingleton = nil;
>
>     function Singleton: pointer;
>     const
>       MagicValue = pointer(1);
>     begin
>       if (TInterlocked.CompareExchange(FSingleton, MagicValue, nil) = nil) then
>         FSingleton := TSingleton.Create;
>
>       // Wait for value to become valid
>       while (FSingleton = MagicValue) do
>         YieldProcessor;
>
>       Result := FSingleton;
>     end;

That is a better solution (it would require an additional if to prevent calling CompareExchange on every access, similar to the original example (full code)), but it does not work with interfaces.

1 hour ago, Anders Melander said:

> Now you're mixing two different arguments. I'm not opposing full pointer exchange. I'm opposing the optimistic object instantiation (which presumably will cause a lock in the memory manager but regardless will be costlier than doing an explicit lock).

Your example used an additional lock, so it wasn't really clear what you were objecting to. Also, with lazy initialization, the code path where a possibly unnecessary object construction happens is an extremely rare situation, and even if it does happen, a lock in the memory manager is not something that would bother me.
Because it happens rarely, even the cost of an explicit lock is negligible, so I never considered that "potential" performance issue worth measuring to prove which solution is faster. The only argument worth considering is how costly the construction of the object that might get discarded is, but pointing to the lock in the memory manager is a poor way to prove that point.
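For readers who have not seen the earlier part of the thread, the "optimistic object instantiation" being debated can be sketched roughly as follows. This is only an illustration under stated assumptions, not the original example referred to as "full code": the empty TSingleton class and the program wrapper are hypothetical stand-ins, and the pointer casts select the Pointer overload of TInterlocked.CompareExchange.

```delphi
program OptimisticSingletonSketch;

{$APPTYPE CONSOLE}

uses
  System.SyncObjs;

type
  // Hypothetical stand-in for the TSingleton class discussed in the thread
  TSingleton = class
  end;

var
  FSingleton: TSingleton = nil;

function Singleton: TSingleton;
var
  Candidate: TSingleton;
begin
  Result := FSingleton;
  if Result = nil then
  begin
    // Optimistic step: construct before knowing whether this thread wins
    Candidate := TSingleton.Create;
    // Publish only if FSingleton is still nil; the call returns the old value
    if TInterlocked.CompareExchange(Pointer(FSingleton), Pointer(Candidate), nil) <> nil then
      Candidate.Free; // another thread published first, discard our instance
    Result := FSingleton;
  end;
end;

begin
  // Both calls return the same published instance
  Writeln(Singleton = Singleton);
  FSingleton.Free;
end.
```

The discarded Candidate on the losing path is the cost Anders objects to; the magic-value variant avoids it by making the losing threads wait instead.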
Anders Melander, posted December 1, 2021

23 minutes ago, Dalija Prasnikar said:

> That is a better solution (it would require an additional if to prevent calling CompareExchange on every access, similar to the original example (full code)), but it does not work with interfaces.

Yes, I originally had a test for nil before the CompareExchange but removed it "for clarity".

Why do you say it doesn't work with interfaces? After all, an interface reference is just a pointer, and the object construction and assignment should take care of the initial reference count.

    var
      FSingleton: ISingleton = nil;

    function Singleton: ISingleton;
    const
      MagicValue = pointer(1);
    begin
      if (FSingleton = nil) then // Happy now? :-)
        if (TInterlocked.CompareExchange(pointer(FSingleton), MagicValue, nil) = nil) then
          FSingleton := TSingleton.Create;

      // Wait for value to become valid
      while (pointer(FSingleton) = MagicValue) do
        YieldProcessor;

      Result := FSingleton;
    end;

What am I missing?
Dalija Prasnikar, posted December 1, 2021

8 minutes ago, Anders Melander said:

> Yes, I originally had a test for nil before the CompareExchange but removed it "for clarity".
>
> Why do you say it doesn't work with interfaces? After all, an interface reference is just a pointer, and the object construction and assignment should take care of the initial reference count.
>
>     var
>       FSingleton: ISingleton = nil;
>
>     function Singleton: ISingleton;
>     const
>       MagicValue = pointer(1);
>     begin
>       if (FSingleton = nil) then // Happy now? :-)
>         if (TInterlocked.CompareExchange(pointer(FSingleton), MagicValue, nil) = nil) then
>           FSingleton := TSingleton.Create;
>
>       // Wait for value to become valid
>       while (pointer(FSingleton) = MagicValue) do
>         YieldProcessor;
>
>       Result := FSingleton;
>     end;
>
> What am I missing?

It will crash at FSingleton := TSingleton.Create during the assignment. At that point the value of FSingleton is pointer(1). FSingleton is an interface, and assigning a new value will try to call _Release on any non-nil old value to maintain the appropriate reference count, and that old value is an invalid address.
Dalija Prasnikar, posted December 1, 2021

A possible solution to the interface problem would be to use a fake interface instance as the magic value, since reference counting on it would not cause any problems (Note: I haven't tested such a solution).

    function NopAddRef(Inst: Pointer): Integer; stdcall;
    begin
      Result := -1;
    end;

    function NopRelease(Inst: Pointer): Integer; stdcall;
    begin
      Result := -1;
    end;

    function NopQueryInterface(Inst: Pointer; const IID: TGUID; out Obj): HResult; stdcall;
    begin
      Result := E_NOINTERFACE;
    end;

    const
      FakeInterfaceVTable: array [0 .. 2] of Pointer = (@NopQueryInterface, @NopAddRef, @NopRelease);
      FakeInterfaceInstance: Pointer = @FakeInterfaceVTable;

At this point, the solution with a lock seems like the simpler approach if it makes sense to avoid unnecessary instance construction. On the other hand, if you are writing a universal lazy-initialized container that has to work with different types, then this solution, even though more complicated, would pay off.
Anders Melander, posted December 1, 2021

27 minutes ago, Dalija Prasnikar said:

> It will crash at FSingleton := TSingleton.Create during assignment.

Ah, yes. I missed that. Easily solved though (I think):

    var
      FSingleton: ISingleton = nil;

    function Singleton: ISingleton;
    const
      MagicValue = pointer(1);
    begin
      if (FSingleton = nil) then
        if (TInterlocked.CompareExchange(pointer(FSingleton), MagicValue, nil) = nil) then
        begin
          // Create instance and reference interface (RefCount=1)
          var Instance: ISingleton := TSingleton.Create;
          // Copy interface pointer value (RefCount=1)
          pointer(FSingleton) := pointer(Instance);
          // Clear interface reference without ref counting (RefCount=1)
          pointer(Instance) := nil;
        end;

      // Wait for value to become valid
      while (pointer(FSingleton) = MagicValue) do
        YieldProcessor;

      Result := FSingleton;
    end;
Dalija Prasnikar, posted December 1, 2021

7 minutes ago, Anders Melander said:

> Ah, yes. I missed that. Easily solved though (I think):

Yes.... forget the fake interface approach... brain fart...
David Schwartz, posted December 2, 2021 (edited)

I'm reading this debate and it triggers a lot of memories from a project I led in the first job I had out of college. I also searched for that TInterlocked.CompareExchange function and found quite an interesting discussion on SO that reminded me that we're really lucky there was a bug in Intel's 80286 chip that killed their protected-mode OS efforts and actually derailed their whole segmented memory architecture.

I hope some of you enjoy this. For the record, this all happened in the 1979-86 time-frame. There's also something useful here related to this thread.

================================================================

In my first job out of college, I was hired to work at an Intel facility that created and built their line of Single Board Computers (SBCs) that used their MultiBus backplane. (It was an industrial design; most people are familiar with the analogous but much simpler S-100 bus that IBM introduced in their PCs.)

I was assigned a simple project one day: to write a driver for their 8-bit real-time embedded OS (RMX-80) that ran on an SBC with an 8-bit CPU (8080 or 8085) and talked with another SBC that simply had four UARTs on it (UARTs are 8-bit serial communication devices, aka "comm ports") used for RS-232 connections. The code itself had to fiddle with things via the IN and OUT instructions because Intel's chips infamously did not do memory-mapped IO like Moto's 6800, or the 6500 used by Woz in the first Apple computer.

I talked with a hardware guy and was assured this was safe to do because the MultiBus ensured one clock cycle atomicity. So we could have an 8080 or 8085 board with any number of these 4-port serial controllers plugged in, and run IN and OUT (single-byte I/O) instructions to send and fetch data to/from each UART on each serial card using different 'ports', and there would be no need to deal with locks. YAY!
One weekend, some guy in Marketing started noodling around and made a grid with all 27 of the boards we made and that were in development listed along the X and Y axes, and in the intersecting cells he made an 'X' if it made sense for the two boards to communicate. Then he decided it might be a Good Idea to make this a more generic solution and took it to a Product Planning Committee. A couple of weeks later, I was informed that my "simple" project had just been considerably expanded. It was renamed "Multibus Message Exchange" or "MMX" for short. And yes, if "MMX" seems familiar, it was recycled years later for something completely different. The original thing got absorbed into their OS designs and eventually became obsolete. Here's where it gets interesting (and quite boring if you're not familiar with computer hardware)... We had SBCs that had a variety of CPUs on them: 8080, 8085, 8086, 8088, 80286 (in development) and 80386 (still being defined). Each CPU has two busses: an Address bus and a Data bus. The 8080 and 8085 had 8-bit busses for both. The 8086 and 8088 had a 16-bit Addr bus; the 80286 had a 20-bit Addr bus; and the 80386 was planned to have a 32-bit Addr bus. Meanwhile, the 8080, 8085, and 8088 had an 8-bit Data bus; the 80286 had a 16-bit Data bus; and the 80386 had a 32-bit Data bus. The Multibus had both 16-bit Data and Addr busses. With the pending design of the 80386, they widened the Multibus to support 32-bit Data and Addr busses. Some of the boards also had 8048 MPUs on them (the chips that were in early PC keyboards) that were also 8-bit devices, eventually replaced with 8051 chips. Also, some had "shared memory" on them, which was a big block of memory where the address on the outside looking in (from another board with a larger address space) was almost always different than what the code running on that board's CPU saw on its end. If you're keeping score of combinations, we're at a pretty big number for such a simple idea. 
This thing that started out as a simple little project now had to support the ability to send messages from one OS on one CPU to a possibly different OS on a different CPU; sometimes using shared memory that had a different address range on either side; where they both could have different sized data and address busses; and the MultiBus backplane itself could not be locked for more than one clock cycle.

They were working on a "smart 4-port UART" board that had an 8085 controlling the four UARTs and 16 KB of shared memory. The 8085 had a 16-bit (64KB) address space, but the OS ran in a ROM that was hardwired to the first 16KB because when you grounded the RESET pin momentarily, it reset the CPU and set the IP to start executing at address 0.

Needless to say, when you had a 16-bit or 32-bit data bus on one side and the other only had an 8-bit data bus, this caused a problem because the data could only be seen one byte at a time on one side, but it was 2 or 4 bytes wide on the other. It was also impossible to pass pointers between boards with CPUs on both sides. This made memory-mapped I/O highly problematic, which I think is one reason Intel went with the IN and OUT instructions instead.

This mess was due in part to the fact that the MultiBus only had one clock cycle atomicity. We could not write 16-bit or 32-bit values to a shared memory address and ensure they would be received without getting corrupted, because the narrower receiving side had to do multiple reads, which toggled the MultiBus' clock with each read. (Until this point, the engineers had no reason to design an 8-bit board to read 16- and 32-bit wide data and multiplex it into a series of 8-bit data bytes.) <sigh>

If anybody remembers the old "small", "medium", "large", and "compact" memory models in old Intel compilers, this is part of what those designations were intended to address. It was a frigging nightmare.
Those model designations first showed up while I was working on this project, and we got an updated compiler with a note about this apologizing that it was the only way to deal with the segmented memory model in the 80286 other than forcing everybody to use the same model, which was totally unworkable. We BEGGED the engineers in charge of the Multibus to add an explicit LOCK signal to it so boards with mis-matched data and/or address busses could have something to ensure that an INTEGER of ANY WIDTH could be transferred atomically across that damned bus that they loved to extoll was "the most advanced bus in the industry!" We succeeded in getting the Product Planning Committee to take it up for discussion, and they agreed, but they chose not to go back and fix any existing boards -- it would only be implemented going forward. Meaning we were stuck having to deal with that mess on all of the existing board combinations they decided needed to be supported. Ugh. --------------------------------------- I did a ton of research and stumbled upon something that someone at IBM had just published in one of their journals that seemed to anticipate our exact situation. It was a short article that simply explained a reliable way of building a lock or semaphore without any hardware support. All it required was an atomic TestAndSet that worked on both sides with at least one bit. We had atomic byte-sized TestAndSet operations that worked everywhere, so that's what we used. TestAndSet reads the value of a memory location and returns its value, and writes a 1 into that location, in a single atomic uninterruptible action. Most semaphores issue a lock, try to grab a flag, then either proceed or release the lock and try again. In this case, a lock isn't needed. It's done through a protocol that everybody follows. You do a TestAndSet, and if the value returned is 1 then you sleep for a bit and retry until you get it. If you get a 0, then you do it again on a second location. 
If you get back a zero the second time, then you have a green light to proceed. If not, you clear the first flag, sleep a bit, then try again with the first flag. When you're done, you clear the second flag, then clear the first flag.

For us, this was like cutting the Gordian Knot! (The paper went into a bunch of details about why this works reliably. I don't recall them.)

=========================================================================

PS: if you've read this far, I'll let you in on a little secret related to this that very few people know about the history of the 80286 and the bug that kept it from working in protected mode...

The 80286 architecture was created by some guys from Honeywell who were trying to build a simplified version of their Multics secure computing system. Part of the magic was that the OS had some support in the hardware. Our OS guys worked closely with the chip designers to ensure that the 80286 chip supported all of the magic stuff that the OS needed to be secure. They had a big meeting and everybody signed off on the chip design, and the chip guys went off to build it while the OS guys started working on the fancy new OS. I was hired a few months after that meeting to work on that very OS.

Our hardware guys had been working in parallel to design a MultiBus SBC that they could drop the 80286 chip into and it should just work. For new chip designs, Intel would first create an In-Circuit Emulator (ICE) that they'd use for hardware testing. We got some of them and they worked properly as expected. But when we finally got the first prototype chips, they failed some tests. We sent the results back to the chip guys and were told "we're working on it" for months. We got back a couple more iterations of 286 chips and they fixed some problems, but one seemed to persist.
Apparently, after the final meeting and agreement on the chip design, the chip guys built the ICE units based on that design, but then decided to optimize something in the microcode and made some changes in the on-chip wiring that simplified a bunch of other logic, and they never ran it by the OS team. They also didn't update the ICE unit's design to reflect these changes, believing they were simple "refactorings" that would have no side-effects. They kept dodging the OS team's attempts to meet and help with this. When the OS team finally got their hands on the T-Spec, they went ballistic, because the chip guys basically broke the security model with their "refactoring". Seems that what they did caused the CPU's Stack PUSH instruction to be non-atomic when the chip was running in "protected mode". Meaning that if an interrupt came in while an address was being pushed onto the stack, it could be interrupted after the segment register (16 bit base index) was pushed onto the stack, but not the 16 bits (offset part) of the IP -- because they came from different registers. The result was that when the interrupt popped the return address off the stack, it got the segment register value, but the next word was NOT the IP address. Obviously, it went off into the weeds and typically generated an address fault. Unfortunately, they decided they couldn't fix it without delaying the chip's release by 6 months or so. That was getting too close to the release of the next chip, the 80386, so they decided to accelerate the release of the 80386 instead and wave-off use of "protected mode" in ALL 80286 designs. (Customers were not happy!) In fact, IBM and some other companies (besides us) had been working very diligently to build a secure OS that was supposed to run on the 80286 in "protected mode". But the 286 never worked in "protected mode" because they never fixed this problem. 
When we released the 80286 chip, IBM and others sold an upgraded OS that ran on it, but not what they had planned. The 80386 was released earlier, but nobody ran it in "protected mode" either because that required the segmented memory model which everybody HATED.

The 80386 was supposed to have a bunch of new features, and only ended up with a few in order to get it out the door faster. But the one that captured the industry's attention the most was its support for a "flat memory model". That was the point where Intel threw in the towel and shifted entirely to a "flat memory model", doing away with all of those crazy "small", "medium", "large", and "compact" compiler models and all of that nonsense that came with a segmented addressing scheme. (Never mind that it was actually modeled after IBM's mainframes, which is one reason IBM loved its design.)

Ok, now get back to work!

Edited December 2, 2021 by David Schwartz
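The two-flag TestAndSet protocol David recalls from the IBM article can be sketched in Delphi roughly as follows. This is a minimal sketch of the steps as described in the post, not the published algorithm (whose details the post says are not remembered). The names Flag1, Flag2, TestAndSet, ClearFlag, Acquire and Release are hypothetical, Integer flags stand in for the byte-sized flags used on the MultiBus, and TInterlocked.Exchange is assumed here as the atomic test-and-set primitive.

```delphi
program TwoFlagLockSketch;

{$APPTYPE CONSOLE}

uses
  System.Classes, System.SyncObjs;

var
  Flag1: Integer = 0;
  Flag2: Integer = 0;

// Atomically write 1 into the flag and return the value it held before
function TestAndSet(var Flag: Integer): Integer;
begin
  Result := TInterlocked.Exchange(Flag, 1);
end;

procedure ClearFlag(var Flag: Integer);
begin
  TInterlocked.Exchange(Flag, 0);
end;

procedure Acquire;
begin
  while True do
  begin
    // Step 1: retry until the first flag comes back as 0
    while TestAndSet(Flag1) <> 0 do
      TThread.Sleep(1);
    // Step 2: a 0 from the second flag is the green light
    if TestAndSet(Flag2) = 0 then
      Exit;
    // Otherwise give up the first flag, sleep a bit, and start over
    ClearFlag(Flag1);
    TThread.Sleep(1);
  end;
end;

procedure Release;
begin
  // Clear the second flag first, then the first, as described in the post
  ClearFlag(Flag2);
  ClearFlag(Flag1);
end;

begin
  Acquire;
  Writeln('inside the protected region');
  Release;
end.
```

Within a single Delphi process a critical section would normally be the simpler choice; the sketch is only meant to make the order of the steps concrete.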
Alexander Elagin, posted December 3, 2021

On 12/2/2021 at 10:52 AM, David Schwartz said:

> Each CPU has two busses: an Address bus and a Data bus. The 8080 and 8085 had 8-bit busses for both. The 8086 and 8088 had a 16-bit Addr bus; the 80286 had a 20-bit Addr bus; and the 80386 was planned to have a 32-bit Addr bus.

A small correction: the 8080 has a 16-bit address bus and an 8-bit data bus (I am currently building a retro computer based on 8080A + 8224 + 8257 + 8275 + 8255A). The 8085 also had a 16-bit address bus, but it was multiplexed with the 8-bit data bus and required an additional latch for full address decode.

Otherwise, your post was outstanding and very informative, thank you!
Pat Foley, posted December 3, 2021

1 hour ago, Alexander Elagin said:

> bit address bus but it was multiplexed with the 8-bit data bus and required an additional latch for full address decode. Otherwise, your post was outstanding and very informative, thank you!

Does that mean a push push and remember to pop pop when using Turbo C? There was a way to connect the bus to an I/O breakout board using the parallel printer port after cutting a few diodes off...

To bring it up to 486 spec here's this: https://chapmanworld.com/2018/02/09/lockless-multi-threading-in-delphi/
Anders Melander, posted December 3, 2021

27 minutes ago, Pat Foley said:

> To bring it up to 486 spec here's this https://chapmanworld.com/2018/02/09/lockless-multi-threading-in-delphi/

I like this quote from Craig (in the comments):

> Lock-less threading is the most dangerous tool in your workshop, and should only really be attempted in cases where performance is critical.

Wise words.
Guest, posted December 3, 2021

1 hour ago, Anders Melander said:

> I like this quote from Craig (in the comments):
>
> Lock-less threading is the most dangerous tool in your workshop, and should only really be attempted in cases where performance is critical.

#meeto
David Schwartz, posted December 3, 2021

10 hours ago, Alexander Elagin said:

> A small correction: 8080 has a 16-bit address bus and a 8-bit data bus (I am currently building a retro computer based on 8080A + 8224 + 8257 + 8275 + 8255A). 8085 also had a 16-bit address bus but it was multiplexed with the 8-bit data bus and required an additional latch for full address decode. Otherwise, your post was outstanding and very informative, thank you!

Thanks for the correction. I guess it was the 8008 with the multiplexed 8-bit Addr bus. It has been a long time since I looked at any of that stuff.

The 8085 was a bit odd. It had a bunch of additional instructions added to it that were more-or-less duplicated on the Z80, but at the last minute they decided not to publish them. They were implemented in the first couple of production runs, but then they were removed. So only a couple of new instructions showed up on it that were mainly for controlling something.
David Schwartz, posted December 3, 2021 (edited)

8 hours ago, Pat Foley said:

> To bring it up to 486 spec here's this https://chapmanworld.com/2018/02/09/lockless-multi-threading-in-delphi/

I'm not sure what this has to do with the 486, but I would not use that solution as it can be interrupted and its state changed.

The whole purpose of critical sections is that they extend a simple atomic operation to span several instructions by using a protocol that can be safely interrupted. The only other option is to disable interrupts, but even then the CPU usually has a "high-priority interrupt" that can still interrupt the execution flow. This is exactly how some of the hacks worked that used the look-ahead cache on Intel's CPUs a few years back.

We were creating something that had no OS support. Turbo Pascal (and later Delphi) are designed to work inside of DOS and/or Windows. Both of these OSes support critical sections, but they're only valid for stuff running entirely within the OS. When you're reaching out across a bus (or, today, the internet) to an asynchronous process in an unknown environment, all bets are off.

Anybody who remembers the TSRs that ran in DOS knows they did not run within the OS context (even if you consider DOS an actual OS). Nor did most of the device drivers for DOS, or even Unix for that matter.

We were working with a fully interrupt-driven OS where everything was managed within the OS. What this project did was give us a way to extend a reliable communication channel across a data bus, sometimes through shared memory, with a peer on the other side, without having to know anything about what was running on the other side. We actually based it on the first 3 layers of the ISO networking model.

By the time the 80486 family hit the scene, the boards using smaller, slower CPUs were retired, and the hardware was updated to make this easier. I'd left the company by that point.
(laid-off)

Edited December 3, 2021 by David Schwartz
Gustav Schubert, posted December 18, 2021

The Help text is now available in markdown and with a screenshot, if you want a preview of what this application looks like.