Add atomic memcpy RFC. #3301

Open · wants to merge 5 commits into base: master

Conversation

m-ou-se
Member

@m-ou-se m-ou-se commented Aug 14, 2022

@m-ou-se m-ou-se added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Aug 14, 2022
@bjorn3
Member

bjorn3 commented Aug 14, 2022

cc @ojeda

@ibraheemdev
Member

ibraheemdev commented Aug 14, 2022

This could mention the atomic-maybe-uninit crate in the alternatives section (cc @taiki-e).

@5225225

5225225 commented Aug 14, 2022

Given some way for the language to express "this type is valid for any bit pattern", which project safe transmute will presumably provide (and which already exists in the ecosystem as bytemuck, zerocopy, and probably others), I'm wondering if it would be better to return an AtomicPerByteRead<T>(MaybeUninit<T>), for which we/the ecosystem could provide a safe into_inner (returning a T) when T is valid for any bit pattern.

This would also require removing the safe uninit method. But presumably you could always use an AtomicPerByte<MaybeUninit<T>>, passing MaybeUninit::uninit() to new at no runtime cost.

That's extra complexity, but it means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.
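
A rough sketch of the shape being suggested (all names here are hypothetical, and whether into_inner can really be safe is exactly what the following comments discuss):

use core::mem::MaybeUninit;

// Hypothetical marker: every initialized byte pattern is a valid value of Self
// (the property bytemuck/zerocopy-style traits express).
pub unsafe trait AnyBitPattern: Copy {}

pub struct AtomicPerByteRead<T>(MaybeUninit<T>);

impl<T> AtomicPerByteRead<T> {
    // Safe unpacking for types valid for any bit pattern; relies on the load
    // never producing uninitialized bytes (hence dropping the safe uninit()).
    pub fn into_inner(self) -> T
    where
        T: AnyBitPattern,
    {
        unsafe { self.0.assume_init() }
    }

    // Escape hatch for all other types.
    pub unsafe fn assume_valid(self) -> T {
        unsafe { self.0.assume_init() }
    }
}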

@Lokathor
Contributor

The "uninit" part of MaybeUninit is essentially not a bit pattern though. That's the problem. Even if a value is valid "for all bit patterns", you can't unwrap uninit memory into that type.

not without the fabled and legendary Freeze Intrinsic anyway.

@T-Dark0

T-Dark0 commented Aug 14, 2022

On the other hand, AnyBitPatternOrPointerFragment isn't a type we have, nor really a type we strictly need for this. Assuming tearing can't deinitialize initialized memory, then MaybeUninit would suffice I think?

@programmerjake
Member

Note that LLVM already implements this operation: the llvm.memcpy.element.unordered.atomic intrinsic, with an additional fence operation for acquire/release.

@comex

comex commented Aug 15, 2022

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

Comment on lines +205 to +209
- Require `T: Copy` for `AtomicPerByte<T>`, such that we don't need to worry about
duplicating non-`Copy` data.

There might be valid use cases with non-`Copy` data.
Also, not all "memcpy'able" data is always marked as `Copy` (e.g. to prevent implicit copies).
Member

crossbeam-deque is one of the crates that needs support for non-Copy data.

pub const fn uninit() -> Self;

pub fn load(&self, ordering: Ordering) -> MaybeUninit<T>;
pub fn store(&self, value: T, ordering: Ordering);
Member

FYI, in crossbeam-deque, there is a case where it is preferable that the value to be stored be MaybeUninit.

Member Author

That's good to know, thanks.

For this specific API: instead of store, store_from can be used to store something from a &MaybeUninit<T>. I figured the majority of the cases would use a T, so that's what I picked for the store method, to keep it as simple as possible. (There doesn't seem to be much use to consuming a MaybeUninit<T> since it doesn't have a drop implementation anyway.)
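
To make the distinction concrete, this is roughly how I read the two entry points from this thread. AtomicPerByte is the RFC's proposed type and store_from's exact signature is my assumption, so this is illustrative usage rather than code that compiles against today's std:

use core::mem::MaybeUninit;
use core::sync::atomic::Ordering;

fn publish_plain(slot: &AtomicPerByte<u64>, value: u64) {
    // Common case: we already have a T by value.
    slot.store(value, Ordering::Release);
}

fn publish_from_buffer(slot: &AtomicPerByte<u64>, src: &MaybeUninit<u64>) {
    // crossbeam-deque-style case: the source itself is a MaybeUninit<T>.
    slot.store_from(src, Ordering::Release);
}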

Comment on lines +180 to +181
- In order for this to be efficient, we need an additional intrinsic hooking into
special support in LLVM. (Which LLVM needs to have anyway for C++.)
Member

How do you plan to implement this until LLVM implements this?

I don't think it is necessary to explain the implementation details in the RFC, but if we provide an unsound implementation until the as yet unmerged C++ proposal is implemented in LLVM in the future, that seems to be a problem.

(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)

Member Author

I haven't looked into the details yet of what's possible today with LLVM. There's a few possible outcomes:

  • We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
  • Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like ptr::{read_volatile, write_volatile} combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
  • We use the existing llvm.memcpy.element.unordered.atomic, after figuring out the consequences of the unordered property.
  • Until LLVM support appears, we implement it in the library using a loop of AtomicUsize::load()/store()s and a fence, possibly using an efficient inline assembly alternative for some popular architectures. (A sketch of this option follows below.)

I'm not fully sure yet which of these are feasible.
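
For illustration, a minimal sketch of the library-fallback option from the last bullet, done per byte for simplicity (the bullet suggests AtomicUsize-sized chunks for efficiency); this only shows the shape, and is not a claim about what std would actually ship:

use core::mem::MaybeUninit;
use core::sync::atomic::{fence, AtomicU8, Ordering};

// Load `len` bytes one at a time with relaxed atomic loads, then publish the
// requested ordering with a single fence (Acquire or Relaxed expected here).
unsafe fn per_byte_load_fallback(
    src: *const AtomicU8,
    dst: *mut MaybeUninit<u8>,
    len: usize,
    ordering: Ordering,
) {
    for i in 0..len {
        let byte = (*src.add(i)).load(Ordering::Relaxed);
        (*dst.add(i)).write(byte);
    }
    if ordering != Ordering::Relaxed {
        fence(ordering);
    }
}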

text/3301-atomic-memcpy.md (outdated review thread, resolved)
@m-ou-se
Member Author

m-ou-se commented Aug 15, 2022

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about llvm's unordered ordering. Could you give an example of unexpected results we might get if we were to implement AtomicPerByte<T>::{read, write} using llvm's unordered primitive and a fence? Thanks!

(It seems monotonic behaves identically to unordered for loads and stores?)

text/3301-atomic-memcpy.md (outdated review thread, resolved)

# Unresolved questions

- Should we require `T: Copy`?

Given this is a quite advanced tool, I would lean towards not requiring it.

Some documentation in AtomicPerByte could ameliorate the issue.

Member

Yes, IMO this should not require Copy, since there's no way to shake off such a requirement. I think it's reasonable to use this for a T which happens to hold something else in an UnsafeCell, which would make it non-Copy.

but it's easy to accidentally cause undefined behavior by using `load`
to make an extra copy of data that shouldn't be copied.

- Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?

Given these options and considering what the C++ paper chose, AtomicPerByte sounds OK and has the advantage of having Atomic as a prefix.

Member

AtomicPerByteMaybeUninit or AtomicPerByteManuallyDrop to also resolve the other concern around dropping? Those are terrible names though...

@ojeda

ojeda commented Aug 15, 2022

cc @ojeda

Thanks! Cc'ing @wedsonaf since he will like it :)

@thomcc
Member

thomcc commented Aug 15, 2022

Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed (it behaves a lot more like a non-atomic variable in this sense)

In practical terms, in single-thread scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without having that fence have a cost on weakly ordered machines (e.g. without implementing it as an overly-strong acquire/release fence).

That said, I think we could add an intrinsic to LLVM that does what we want here. I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)

@thomcc
Member

thomcc commented Aug 15, 2022

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

I think we can easily implement this with relaxed in compiler-builtins though, but it should get a new intrinsic, since many platforms can implement it more efficiently.

@bjorn3
Member

bjorn3 commented Aug 15, 2022

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

@thomcc
Member

thomcc commented Aug 15, 2022

I'm not sure we'd want unordered, as mentioned above...

@thomcc
Member

thomcc commented Aug 16, 2022

To clarify on the difference between relaxed and unordered (in terms of loads and stores), if you have

static ATOM: AtomicU8 = AtomicU8::new(0);
const O: Ordering = ???;

fn thread1() {
    ATOM.store(1, O);
    ATOM.store(2, O);
}

fn thread2() {
    let a = ATOM.load(O);
    let b = ATOM.load(O);
    assert!(a <= b);
}

thread2's assertion will never fail if O is Relaxed, but it could fail if O were (the hypothetical) Unordered.

In other words, for unordered, it would be legal for 2 to be stored before 1, or for b to be loaded before a. In terms of fences, there's no fence that "upgrades" unordered to relaxed, although I believe (but am not certain) that stronger fences do apply to it.

@programmerjake
Member

something that could work but not be technically correct is:
compiler acquire fence
unordered atomic memcpy
compiler release fence

those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics -- assuming you're on any modern CPU (except Alpha, iirc) it will behave like relaxed atomics because that's what standard load/store instructions do.

@thomcc
Member

thomcc commented Aug 16, 2022

Those fences aren't always no-ops at runtime; they actually emit code on several platforms (rust-lang/rust#62256). It's also unclear what can and can't be reordered across compiler fences (rust-lang/unsafe-code-guidelines#347); certainly plain stores can in some cases (this is easy to show happening in godbolt).

Either way, my point has not been that we can't implement this. We absolutely can and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that.

@tschuett

I like MaybeAtomic, but following C++ with AtomicPerByte sounds reasonable.
The LLVM guys started something similar in 2016:
https://reviews.llvm.org/D27133

Comment on lines +65 to +68
The somewhat popular `seqlock` crate and similar implementations found in the ecosystem
all use a regular non-atomic write (preceded by an atomic fence) for writing,
and `ptr::read_volatile` (followed by an atomic fence) for reading.
This "works" "fine", but is technically undefined behavior.
Contributor

@cbeuw cbeuw Aug 19, 2022

It may be worth noting that, besides the volatile-volatile data race (which can be somewhat justified by always detecting and discarding the value of a racing read), the volatile-fence pattern doesn't actually provide synchronisation under the C++ memory model.

Fences only "interact" with atomic accesses. Volatiles are not atomic accesses. So unless SeqCst is involved, a release fence followed by a volatile write on one side, and a volatile read followed by an acquire fence on the other, do nothing synchronisation-wise. This can cause the reader to miss a sequence update and use the volatile read value instead of discarding it, even though the writer did a racing write in the meantime.

Miri can already reproduce this: crossbeam-rs/crossbeam#859. Though most (all?) LLVM compilation schemes emit the same code for volatile and relaxed: https://godbolt.org/z/MsGnq1fGh, so it probably doesn't blow up in real life, until some optimisations come along.

Byte-wise atomics are actually required to resolve this without using potentially more expensive RMWs or SeqCst operations.

loop {
let s1 = self.seq.load(Acquire);
let data = read_data(&self.data, Acquire);
let s2 = self.seq.load(Relaxed);
Member

@RalfJung RalfJung Aug 20, 2022

There's something very subtle here that I had not appreciated until a few weeks ago: we have to ensure that the load here cannot return an outdated value that would prevent us from noticing a seqnum bump.

The reason this is the case: if there is a concurrent write, and if any part of data reads from that write, then we have a release-acquire pair, so we are guaranteed to see at least the first fetch_add from write, and thus we will definitely see a version conflict. OTOH, if s1 reads from some second fetch_add in write, then that forms a release-acquire pair, and we will definitely see the full data.

So, all the release/acquire are necessary here. (I know this is not a seqlock tutorial, and @m-ou-se is certainly aware of this, but it still seemed worth pointing out -- many people reading this will not be aware of this.)

(This is related to this comment by @cbeuw.)

Member Author

Yeah exactly. This is why people are sometimes asking for a "release-load" operation. This second load operation needs to happen "after" the read_data() part, but the usual (incorrect) read_data implementation doesn't involve atomic operations or a memory ordering, so they attempt to solve this issue with a memory ordering on that final load, which isn't possible. The right solution is a memory ordering on the read_data() operation.

Member

Under a reordering based atomic model (as CPUs use), a release load makes sense and works. Release loads don't really work unless they are also RMWs (fetch_add(0)) under the C11 model.

Member

Yeah, the famous seqlock paper discusses "read dont-modify write" operations.

while the second one is basically a memory fence followed by series of `AtomicU8::store`s.
Except the implementation can be much more efficient.
The implementation is allowed to load/store the bytes in any order,
and doesn't have to operate on individual bytes.
Member

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved. That would still allow merging adjacent writes (I think), but it would not allow reordering bytes. I wonder if we could get away with that, or if implementations actually need the ability to reorder.

Contributor

For a memcpy (meaning the two regions are exclusive) you generally want to copy using increasing address order ("forward") on all hardware I've ever heard of. Even if a forward copy isn't faster (which it often is), it's still the same speed as a reverse copy.

I suspect the "any order is allowed" is just left in as wiggle room for potentially strange situations where somehow a reverse order copy would improve performance.

Member Author

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved.

In the C++ paper they are basically specified as:

for (size_t i = 0; i < count; ++i) {
  reinterpret_cast<char*>(dest)[i] =
      atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order::relaxed);
}
atomic_thread_fence(order);

and

atomic_thread_fence(order);
for (size_t i = 0; i < count; ++i) {
  atomic_ref<char>(reinterpret_cast<char*>(dest)[i]).store(
      reinterpret_cast<char*>(source)[i], memory_order::relaxed);
}

Member

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

Yes, relaxed loads/stores to different locations can be reordered, so specifying their order is moot under the as-if rule.

In the C++ paper they are basically specified as:

Hm... but usually fences and accesses are far from equivalent. If we specify them like this, calling code can rely on the presence of these fences. For example changing a 4-byte atomic acquire memcpy to an AtomicU32 acquire load would not be correct (even if we know everything is initialized and aligned etc).

Fences make all preceding/following relaxed accesses potentially induce synchronization, whereas release/acquire accesses only do that for that particular access.

@RalfJung
Member

RalfJung commented Aug 20, 2022

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

Yeah, I don't think we should expose Unordered to users in any way until we are ready and willing to have our own concurrency memory model separate from that of C++ (or until C++ has something like unordered, and it's been shown to also make sense formally). There are some formal memory models with "plain" memory accesses, which are similar to unordered (no total modification order, but race conditions allowed), but I have no idea if those are an accurate model of LLVM's unordered accesses. Both serve the same goal though, so there's a high chance they are at least related: both aim to model Java's regular memory accesses.

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

Well I sure hope we're not using them in any way that actually becomes observable in program behavior, as that would be unsound.

@HadrienG2

HadrienG2 commented Sep 13, 2022

Most of the discussion so far is about the general concept and implementability of an "atomic per byte memcpy", but I'd also like some feedback on the API design itself. E.g. does the usage of MaybeUninit make sense, or should we have a separate MaybeTeared or something like that? Is it ergonomic enough, especially for slices? Etc.

It does seem kind of odd to me that unsafe code is needed to unpack the result of a byte-wise load even for types where that is actually always safe because any correctly sized stream of bytes is a valid representation.

For a caricatural example, I do not see why I should need unsafe code to load the contents of an AtomicPerByte<[u8; LENGTH]> into a &mut [u8; LENGTH], given that this operation is UB-free by definition.

In the actual use case that I have in mind (well, more exactly a client of that lib), it would be convenient to load an &[f32] byte-wise into an &mut [f32], which should be safe as well since any bag of 4 bytes is a valid f32 (though possibly a surprising one).

Basically, I second comment #3301 (comment) regarding the idea that any type that allows safe-transmute from a suitably sized stream of bytes should be safe to load from AtomicPerByte storage. MaybeUninit does not model that guarantee well, as it's designed to also handle uninitialized memory, which this abstraction cannot emit.
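
For the byte-array case specifically, the safe wrapper being asked for would look something like this (a sketch against the RFC's proposed AtomicPerByte API, not an existing function; the safety argument is the one debated in the following comments):

use core::mem::MaybeUninit;
use core::sync::atomic::Ordering;

fn load_byte_array<const N: usize>(src: &AtomicPerByte<[u8; N]>, ordering: Ordering) -> [u8; N] {
    let torn: MaybeUninit<[u8; N]> = src.load(ordering);
    // SAFETY: the stored value was a fully initialized [u8; N], and a torn
    // copy only mixes bytes of initialized values, so every byte is still an
    // initialized u8. This does not hold if the slot was created with uninit().
    unsafe { torn.assume_init() }
}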

@RalfJung
Member

RalfJung commented Sep 18, 2022

It does seem kind of odd to me that unsafe code is needed to unpack the result of a byte-wise load even for types where that is actually always safe because any correctly sized stream of bytes is a valid representation.

We have to be careful here, because not every stream of bytes is valid for [u8; N] -- if any of the bytes is uninitialized, that would be invalid. (So e.g. transmuting (u8, u16) to u32 or to [u8; 4] is not sound.)

However, [u8; N] has the property that mixing the bytes of any two or more valid instances, yields a valid instance. [bool; N] also has that property. But [NonZeroU32; N] does not. For the types that have that property we could offer an API that skips the MaybeUninit.
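
That property could be captured by a marker trait along these lines (purely a sketch, not something the RFC proposes):

// Safety contract: any value whose bytes are taken, byte by byte, from valid
// values of Self must itself be a valid Self.
unsafe trait MixBytesSafe {}

unsafe impl<const N: usize> MixBytesSafe for [u8; N] {}
unsafe impl<const N: usize> MixBytesSafe for [bool; N] {}
// No impl for [NonZeroU32; N]: mixing the bytes of two nonzero values can
// produce zero, which is not a valid NonZeroU32.

A load method bounded on such a trait could then return T directly instead of MaybeUninit<T>.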

@Lokathor
Contributor

Lokathor commented Sep 18, 2022

The way I read the API as proposed in the RFC text, it always goes between two instances of the same type. So the transmutation questions shouldn't be an issue. If there's a (u8, u16) giving you an uninit byte in there somewhere, then both the source and dest locations would have the same uninit byte position, and it remains sound.

EDIT: In other words, AtomicPerByte<(u8, u16)> can only store (u8, u16) values and load (u8, u16) values; you can't store as (u8, u16) and then load as [u8; 4], at least not as far as I can see in the API given here.

@WaffleLapkin
Member

You could have an enum with one variant being A((u8, u16)) and the other B([u8; 4]); a write can change the tag before setting the value, etc.

@Lokathor
Contributor

Yes, but that enum wouldn't count as "any correctly sized stream of bytes is a valid representation" to begin with.

@bjorn3
Member

bjorn3 commented Sep 18, 2022

The only types for which "any correctly sized stream of bytes is a valid representation" are MaybeUninit<T> and ZSTs. Other types don't allow uninit bytes.

@RalfJung
Member

RalfJung commented Sep 18, 2022

As I said, the property "any correctly sized stream of bytes is a valid representation" is not necessary for making this API return a T. (It is sufficient. But it is unnecessarily restrictive.) We should stop talking about it, and talk about the actually relevant property instead: mixing the bytes of multiple valid representations yields a valid representation.

Indeed (u8, u16) also has that property. But Either<(u8, u16), u32> does not.

@m-ou-se
Member Author

m-ou-se commented Sep 18, 2022

However, [u8; N] has the property that mixing the bytes of any two or more valid instances, yields a valid instance. [bool; N] also has that property. But [NonZeroU32; N] does not. For the types that have that property we could offer an API that skips the MaybeUninit.

That could mean we should use a new MaybeTeared<T> here, instead of MaybeUninit<T>. I suppose that's slightly more accurate, but I'm not sure if it's worth adding such a type.

@Lokathor
Contributor

If any of the bytes of a value depend on other bytes, and the value gets torn at one or more random positions within the byte sequence, then you really can't do much with that torn thing. You mostly just have to reset the whole blob of bytes to some known good value.

Comment on lines +197 to +203
- Instead of a type, this could all be just two functions on raw pointers,
such as something like `std::ptr::copy_nonoverlaping_load_atomic_per_byte`.

This means having to use `UnsafeCell` and more unsafe code wherever this functionality is used.

It'd be inconsistent with the other atomic operations.
We don't have e.g. `std::ptr::load_atomic` that operates on pointers either.

It would still be convenient to have both the AtomicPerByte type and the two functions on raw pointers. This would allow for additional use cases, such as:

  • This lets programmers choose if they want to use just a Seqlock or combine a Seqlock with an RwLock to form a hybrid lock (for example).
  • The Seqlock version counter and the protected data are able to live on separate cache lines, which reduces false sharing while a reader is repeatedly trying to read the version counter in a loop (in cases where the version counter is checked for writers before doing a potentially racy read).
  • The protected data memory layout is unconstrained by the lock, e.g. elements in a contiguous array can each be associated with individual locks stored in a separate array, without having to account for this in the data array memory layout, which could otherwise lead to additional unnecessary implicit padding if the elements and the version counter have a different size.


Thinking more about this, I believe I got two things mixed up! The AtomicPerByte type already allows for all these use cases, although there is another use case for which it is not clear whether it is covered by AtomicPerByte:

Does AtomicPerByte make any guarantees about whether its load_to_slice/store_from_slice methods make element-wise atomic copies of T, or may they do an atomic copy of the entire slice at once? The latter may allow a more efficient atomic memcpy implementation if the slice happens to be word-aligned even though T is not (and the slice length is also a multiple of the word size), but then it would be unsound to call methods like <[AtomicPerByte<T>]>::split_at that can change the alignment of the slice if T is not word-aligned. If AtomicPerByte promises to only make element-wise atomic copies in this case, then this yields a (very advanced) use case for the two separate pointer functions, which could only be soundly called if the programmer can uphold that there will be no multiple racing atomic memcpys that:

  1. operate on an overlapping memory region, and
  2. for which one of the copies uses a non-word-aligned offset or a length which is not a multiple of the word size.

otherwise this could lead to mixing non-perfectly-overlapping atomic accesses in the memcpy implementation which might be UB (assuming that atomic memcpy uses the machine word size as the largest granularity for a single instruction, otherwise replace "word size" in this comment with the maximum granularity with which memcpy can theoretically operate - EDIT: it seems like the idea for atomic SIMD operations has been floated before and there are 128-bit atomic instructions like core::arch::x86_64::cmpxchg16b, suggesting that an atomic memcpy could choose to use larger than machine word copy granularity).

Member

@RalfJung RalfJung Dec 21, 2022

I would expect AtomicPerByte to make per-byte atomic copies, as its name indicates and as the RFC specifies. That means for elements that are larger than 1 byte, this is not even atomic for individual elements, let alone the entire slice.


With "element-wise atomic copy" I mean whether the internal implementation of AtomicPerByte will perform a separate atomic memcpy method call per element, as opposed to performing a single atomic memcpy method call on the entire slice just once (which may still issue one or more atomic instructions due to making per-byte atomic copies). The difference may matter for the alignment of the pointer that is passed to the atomic memcpy method, which may cause the internal implementation of the atomic memcpy method to choose a different atomic word size and/or alignment depending on that pointer's alignment, leading to potentially mixing non-perfectly-overlapping atomic accesses in the face of safe methods like <[AtomicPerByte<T>]>::split_at, for which the second half of the split slice may have a different alignment in memory and size than the whole slice.

To complete the example of how this could be unsound when AtomicPerByte does just a single atomic memcpy method call internally for the entire slice, consider calling AtomicPerByte::store_from_slice(&whole_slice, ..) and AtomicPerByte::load_to_slice(whole_slice.split_at(1).1, ..)) simultaneously where whole_slice is a slice of bytes aligned to the word size or largest SIMD size.

If "perform a single atomic memcpy method call for this slice of T" is not a supported use case of AtomicPerByte, then maybe that could be an argument to also include the atomic_load_per_byte_memcpy and atomic_store_per_byte_memcpy methods in the public API proposed by the RFC? (currently it is phrased to exclude exposing those methods)

Member

@RalfJung RalfJung Dec 21, 2022

With "element-wise atomic copy" I mean whether the internal implementation of AtomicPerByte will perform a separate atomic memcpy method call per element, as opposed to performing a single atomic memcpy method call on the entire slice just once (which may still issue one or more atomic instructions due to making per-byte atomic copies).

I don't think any of that is even defined in the spec; these are all unspecified implementation details.

The only alignment that load_to_slice and store_from_slice can assume is the one given by the type: this is aligned according to [Self], which has the same alignment as Self. Also clearly these methods may not perform any (Abstract Machine visible) memory access outside of the slice they are handed.

I feel like I don't understand the question, since this seems obvious.

Or are you repeating #3301 (comment) here? I am assuming this thread is disjoint from the one where we are discussing the x86 "concurrent atomic accesses to the not-perfectly-overlapping locations" problem.


I think the question here is different from the linked comment, though it touches on similar areas. By my understanding it is:

Can AtomicPerByte::store_from_slice use larger atomic stores to write multiple elements at once, or not? (The same argument applies to loads, but for brevity I'm limiting my explanation to stores.)

Consider the following situation, explained at some length, but I'd rather overexplain than introduce more confusion by underexplaining:

static ARRAY: [AtomicPerByte<u16>; LEN] = [AtomicPerByte::new(0); LEN];

fn thread_1() {
    let new = [MaybeUninit::new(1u16); LEN];
    AtomicPerByte::store_from_slice(&ARRAY, &new, Ordering::Release);
}

If store_from_slice can make larger stores, then it could be implemented as an optimized memcpy, which may do individual stores larger than the size of the individual elements. For instance:

  • write 0-7 bytes at the start and end of the slice using one-byte atomic stores
  • write the middle (now 8-byte aligned) part of the slice with 8-byte atomic stores

However, this is unsound, because another thread could store to a non-exactly overlapping slice:

fn thread_2() {
    let new = [MaybeUninit::new(2u16); 3];
    let lower = &ARRAY[0..3];
    AtomicPerByte::store_from_slice(&lower, &new, Ordering::Release);
}

In the case that ARRAY happens to be 8-byte aligned (and LEN >= 4), thread 1 will write against the beginning of the array with an 8-byte store, and thread 2 will write against the beginning of the array with a 1-byte store, leading to undefined behavior on x86_64.

store_from_slice (and load_to_slice), on platforms like x86_64 that disallow non-exactly-overlapping atomic accesses, cannot be implemented with atomic stores that overlap multiple elements, and thus are limited in store sizes by the size of the element.

Note that on x86_64 I believe this problem is only true for Release stores (and Acquire loads) because Relaxed operations compile to non-atomic stores which are allowed to be non-exactly-overlapping (I think?? I'm not confident in this part.)

Phrased differently:

store_from_slice cannot write to (Abstract Machine visible) memory outside the slice (already expected). But furthermore, where atomic operations cannot non-exactly overlap, it must issue the same-size atomic operations for any slice, and so must do its writes as if it performed individual AtomicPerByte::stores for each element. And furthermore it must actually do this in reality (paying a performance cost versus larger writes when it's not actually required), at least on x86_64 with Acquire or Release.

Thus,

If "perform a single atomic memcpy method call for this slice of T" is not a supported use case of AtomicPerByte, then maybe that could be an argument to also include the atomic_load_per_byte_memcpy and atomic_store_per_byte_memcpy methods in the public API proposed by the RFC? (currently it is phrased to exclude exposing those methods)

is an argument to add additional functions

unsafe fn atomic_load_per_byte_memcpy<T>(src: &[AtomicPerByte<T>], dest: &mut [T], ordering: Ordering);
unsafe fn atomic_store_per_byte_memcpy<T>(src: &[T], dest: &[AtomicPerByte<T>], ordering: Ordering);

that have the additional requirement that no non-exactly-overlapping slice of the provided slice of AtomicPerBytes is accessed concurrently, thereby enabling the optimization to use larger atomic writes.
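
For concreteness, the element-wise behaviour described above would look roughly like this (an assumption about implementation strategy, using the RFC's proposed AtomicPerByte type and the store_from method mentioned earlier in this thread):

use core::mem::MaybeUninit;
use core::sync::atomic::Ordering;

fn store_from_slice_elementwise<T>(
    dst: &[AtomicPerByte<T>],
    src: &[MaybeUninit<T>],
    ordering: Ordering,
) {
    assert_eq!(dst.len(), src.len());
    for (d, s) in dst.iter().zip(src) {
        // One store per element: no atomic access spans two elements, so
        // concurrent accesses to overlapping sub-slices always overlap exactly.
        d.store_from(s, ordering);
    }
}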

Contributor

leading to undefined behavior on x86_64

Despite what the manual says, Intel has a "verbal" guarantee that non-exactly-overlapping atomics are fine: rust-lang/unsafe-code-guidelines#345 (comment). It may tear across cache lines (which is fine) but will not put the CPU into an undefined state. This situation is unlikely to change, given the need to keep C++'s byte-wise memcpy working too.


Despite what the manual says, Intel has a "verbal" guarantee that non-exactly-overlapping atomics are fine

While this is true, this has been discussed in that linked issue and elsewhere in this PR, and my reading of those discussions is that there is a disposition to respect the manual and treat it as undefined, even if it works in practice. If that changes, then that does solve this issue, but otherwise this remains a problem (though it could be left unsolved, or a solution could be postponed).

Contributor

this has been discussed

That mailing list post was sent on 6 April; all prior discussions were had without this piece of information. The only comment on this RFC about non-exactly-overlapping accesses since then was yours, so I don't think we can say there's a disposition yet, especially in light of the new information.

In general I don't think "stick to the documentation" is a good approach when dealing with proprietary platforms; otherwise we can't ship anything for macOS at all. If the vendor is receptive, we should nag them to change the documentation, as @m-ou-se has done before with Microsoft (MicrosoftDocs/sdk-api#447), but even if they don't change it, we shouldn't have our hands tied by it.

In this case, the C++ proposal is committed to allowing multi-byte access optimisations, and they are privy to more information from vendors, so I think sticking to their approach should be safe.


Yes, it's a totally reasonable (and probably good) resolution to this and other issues. I'm not trying to argue against (or necessarily for) that, but rather to clearly explain the issue that was raised in this particular sub thread, over which there seemed to be confusion that was not resolved.

@Pointerbender

ibraheemdev: For specific types the load granularity can be larger than a byte and we will likely want to take advantage of that, so the name AtomicPerByte might not fit well.

RalfJung: that is covered by the as-if rule. It is impossible for the program to tell if the load is truly per-byte, of if adjacent reads are merged into 2-byte reads (or larger).

thomcc: From the x86 architecture manual 3A/I 8.1.2.2:

Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access.

RalfJung: I agree for now we should keep ruling them out, because I assume that's what C++ does -- and that should probably be explicitly documented for atomic memcpy, because that operation is much more likely to be used in unsafe type-punning code than the regular Atomic types.

P1478: Note that on standard hardware, it should be OK to actually perform the copy at larger than byte granularity. Copying multiple bytes as part of one operation is indistinguishable from running them so quickly that the intermediate state is not observed. In fact, we expect that existing assembly memcpy implementations will suffice when suffixed with the required fence.

I'm trying to wrap my head around the implications of these statements. Does this mean that:

  1. it's never allowed for any atomic byte copy implementation to copy more than 1 byte in a single instruction, or
  2. is (only) the internal (llvm?) implementation of atomic_{load|store}_per_byte_memcpy allowed to copy larger-than-1-byte values on targets where this is legal + it is never allowed to mix any of the Rust atomic types (not even AtomicU8) with atomic_{load|store}_per_byte_memcpy? (because from the programmer's perspective there is no way to tell which atomic granularity the internal implementation will choose).

If the second situation is the case, does this also imply that it could be UB if atomic_{load|store}_per_byte_memcpy is called from multiple threads simultaneously on overlapping memory regions with a different byte count or a different offset?

@RalfJung
Member

it's never allowed for any atomic byte copy implementation to copy more than 1 byte in a single instruction, or

That is definitely not the intent.

is (only) the internal (llvm?) implementation of atomic_{load|store}_per_byte_memcpy allowed to copy larger-than-1-byte values on targets where this is legal + it is never allowed to mix any of the Rust atomic types (not even AtomicU8) with atomic_{load|store}_per_byte_memcpy? (because from the programmer's perspective there is no way to tell which atomic granularity the internal implementation will choose).

Ah, good question. If we have two overlapping non-equal atomic-per-byte accesses we could run afoul of the x86 requirement to not mix differently-sized atomic accesses. Does C++ do anything here to prevent that?

@tschuett

A conforming implementation could use u32 for load and u64 for store.

@RalfJung
Member

That wouldn't be conforming if it causes UB on the x86 assembly level due to mixed-size concurrent accesses. (See above.)

@tschuett

I didn't say x86, but in general this should be valid.

@RalfJung
Member

RalfJung commented Dec 21, 2022

Well we are talking about x86 here currently, so if you mean something else you should clarify that you are not replying to the previous posts just above yours. By default people will naturally assume that you are continuing the active discussion, not throwing in an unrelated remark.

@golddranks

golddranks commented Feb 16, 2023

I know that the motivation for this feature is implementing seqlocks, but let me state another motivation: safe mmap.

Currently at least Linux allows modifying mapped files and reflects the modifications to program memory, even if the map is created with MAP_PRIVATE. This means that exposing the mapped memory as &[u8] is clearly unsound. (Though it's fine if modifications don't happen; but even if that's the intention, that's not something that can be guaranteed at the level of a programming language.) Accessing the memory with ptr::read_volatile or via UnsafeCell seems plausible, but while unlikely to cause problems in practice, the docs of both state that concurrent modification is UB.

I think that with the current memory model, only atomics can guarantee freedom from UB in the face of spurious modifications. (Another possibility seems to be allowing concurrent modifications with ptr::read_volatile and UnsafeCell, relaxing the memory model for accesses that are not attempting to communicate via the memory or trying to synchronize with any other accesses, but I'm not sure if that's possible or plausible.)

This API looks relevant for this use in the sense that the motivation for this mmap use case is just avoiding UB in the face of spurious modifications; and having performant accesses is also important, whereas caring about tearing is not relevant.

I recently expressed my dissatisfaction with the current situation of mmap on Reddit, and had a bunch of discussion: https://www.reddit.com/r/rust/comments/10u4anm/how_to_use_mmap_safely_in_rust/ I'm aware that many people who use mmap are either unaware of its soundness issues, or are aware of them but forced to ignore them. I don't think using AtomicPerByte to express the mmapped memory safely would necessarily fix all of those issues, but at least it would be one step in a better direction.

@Lokathor
Contributor

I would be in favor of adding atomic volatile primitives so that volatile stuff can be done concurrently. LLVM supports this; we just don't support it in Rust (yet?)

@programmerjake
Member

afaict representing a mmap via [AtomicU8] is valid
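
For illustration, a sketch of that approach (the mapping itself would come from mmap or a crate such as memmap2; the pointer and length here are placeholders the caller must vouch for):

use core::sync::atomic::{AtomicU8, Ordering};

// SAFETY contract: ptr..ptr+len must be a live, readable mapping for 'a.
// AtomicU8 has the same size and alignment as u8, and atomic loads are never
// data races even if the underlying file changes concurrently.
unsafe fn mapping_as_atomic_bytes<'a>(ptr: *const u8, len: usize) -> &'a [AtomicU8] {
    core::slice::from_raw_parts(ptr as *const AtomicU8, len)
}

fn read_mapped_byte(bytes: &[AtomicU8], i: usize) -> u8 {
    bytes[i].load(Ordering::Relaxed)
}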

@golddranks

@programmerjake Ah, I should have mentioned that. Yes, I think all of the current numeric atomics are valid for that use case. This would just let the user pick a custom type, with the AtomicPerByte wrapper providing the (lack of) guarantees.

@m-ou-se
Member Author

m-ou-se commented Aug 3, 2023

However, [u8; N] has the property that mixing the bytes of any two or more valid instances, yields a valid instance. [bool; N] also has that property. But [NonZeroU32; N] does not. For the types that have that property we could offer an API that skips the MaybeUninit.

That could mean we should use a new MaybeTeared<T> here, instead of MaybeUninit<T>. I suppose that's slightly more accurate, but I'm not sure if it's worth adding such a type.

After thinking about it a bit more, I'm going to try to see what this proposal looks like if we add a new MaybeTeared<T> and use that instead of using MaybeUninit<T>. I'm starting to think that might actually be better, making it clearer what the unsafety is about.

@RalfJung
Member

RalfJung commented Aug 3, 2023 via email

@m-ou-se
Member Author

m-ou-se commented Aug 3, 2023

MaybeTeared<T> would be the return type of AtomicPerByte<T>::load(), representing a value of type T that might have been torn. It'd have something like an unsafe .assume_not_teared() to convert it to a T.

For certain types such as [u8; N], it's always safe to call that function (although it might not result in the expected value of course). There could be a trait for that category of types, but that's tricky to define usefully. Would it include types like u32 or f32, for which any teared value still results in a valid value, even though it's very unlikely it'd be a meaningful value? This gets us into 'safe transmute' territory, which I think falls outside the scope of this RFC for now.
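
A sketch of what such a MaybeTeared<T> could look like (hypothetical; the name and method spelling are just taken from this comment, not part of the RFC text):

use core::mem::MaybeUninit;

#[repr(transparent)]
pub struct MaybeTeared<T>(MaybeUninit<T>);

impl<T> MaybeTeared<T> {
    // SAFETY: the caller must know either that no concurrent write overlapped
    // the load that produced this value (e.g. a seqlock sequence check passed),
    // or that any mix of bytes from valid values of T is still a valid T.
    pub unsafe fn assume_not_teared(self) -> T {
        self.0.assume_init()
    }
}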

@programmerjake
Member

note that Teared is generally considered incorrect when referring to the past sense of tear as in to rip up.
https://en.m.wiktionary.org/wiki/teared#Verb_2
you probably want tore or torn or tearing

@programmerjake
Member

I just realized that atomic memcpy would probably require x86 to define what happens with mixed-size atomics: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/mixed-size.20atomics.20and.20atomic.20memcpy/near/403111841

@RalfJung
Member

This question came up before but indeed has not been answered yet, I think.

@HadrienG2

HadrienG2 commented Nov 20, 2023

I have vague memories of someone once mentioning in one of the numerous threads about this topic that they were approached by Intel engineers who claimed that this is fine on every current Intel hardware, but spec people are nervous about committing to a public statement that it will stay fine on all future hardware, as they fear this could prevent future hardware optimizations.

So basically x86-TSO memory model history repeating itself: spec specifies a certain memory model, hardware actually follows a stronger memory model, software people would like to leverage useful properties of the stronger hardware memory model in their program, but hardware people first need to decide if they are ready to permanently close the door to the envisioned future hardware optimizations that led them to specify a weaker-than-necessary memory model in the first place.

Anyway, instead of trusting my fallible memory, maybe let me ping @joshtriplett in case they have more reliable memories or can in turn summon some former Intel colleague(s) to provide more authoritative material.


If the eventual conclusion is that mixed-size atomics are not fine, then it would indeed heavily constrain the implementation of the proposed atomic memcpy functionality.

On one hand, we would like implementations to use the most efficient hardware load and store operations: not just byte loads and stores, but at least native word loads and stores, and ideally even SIMD loads and stores and more exotic load/stores like rep movs too, as long as the target range of memory addresses is large enough. So restricting ourselves to the moral equivalent of this...

struct AtomicMemcpyZone(Box<[AtomicU8]>);

...would be undesirable, and not an obvious improvement over the status quo.

On the other hand, getting to the desired goal without hitting mixed-size atomics UB would mean something along the lines of the following...

  • Specifying, for each target configuration, a largest memory access granularity that atomic memcpy will use (e.g. max-width SIMD load and store).
  • Conceptually slicing memory into aligned blocks of this granularity.
  • Stating that if two memory operations (not just memcpy, any atomic and non-atomic memory op counts) hit the same aligned block without hitting the same range of bytes within the block, and at least one of them is a memcpy, then it's instant UB for the program.

...and that would definitely be a very nasty memory model to commit to :(

@cbeuw
Contributor

cbeuw commented Nov 20, 2023

Intel engineers who claimed that this is fine on every current Intel hardware

See: rust-lang/unsafe-code-guidelines#345 (comment)
