Add atomic memcpy RFC. #3301
Conversation
cc @ojeda

This could mention the …

With some way for the language to be able to express "this type is valid for any bit pattern", which project safe transmute presumably will provide (and that exists in the ecosystem as …). This would also require removing the safe … That's extra complexity, but means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.

The "uninit" part of … not without the fabled and legendary Freeze Intrinsic anyway.

On the other hand, …

note that LLVM already implements this operation:

The trouble with that intrinsic is that …
> - Require `T: Copy` for `AtomicPerByte<T>`, such that we don't need to worry about duplicating non-`Copy` data.

There might be valid use cases with non-`Copy` data. Also, not all "memcpy'able" data is always marked as `Copy` (e.g. to prevent implicit copies).
crossbeam-deque is one of the crates that needs support for non-`Copy` data.
```rust
pub const fn uninit() -> Self;

pub fn load(&self, ordering: Ordering) -> MaybeUninit<T>;
pub fn store(&self, value: T, ordering: Ordering);
```
FYI, in crossbeam-deque, there is a case where it is preferable that the value to be stored be `MaybeUninit`.
That's good to know, thanks.
For this specific API: instead of `store`, a `store_from` method can be used to store something from a `&MaybeUninit<T>`. I figured the majority of the cases would use a `T`, so that's what I picked for the `store` method, to keep it as simple as possible. (There doesn't seem to be much use in consuming a `MaybeUninit<T>` since it doesn't have a drop implementation anyway.)
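To make the two store flavors under discussion concrete, here is a sketch of their shapes. The bodies are plain non-atomic stand-ins purely to illustrate the signatures; `store_from` is a name floated in this thread, not a settled part of the RFC, and a real implementation would of course perform the writes atomically per byte.

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::Ordering;

// Sketch only: stand-in for the proposed type, with non-atomic bodies.
struct AtomicPerByte<T>(UnsafeCell<MaybeUninit<T>>);

impl<T> AtomicPerByte<T> {
    // Common case: store a value you own.
    fn store(&self, value: T, _ordering: Ordering) {
        // A real implementation would do a per-byte atomic memcpy here.
        unsafe { *self.0.get() = MaybeUninit::new(value) };
    }

    // For callers (like crossbeam-deque) whose source is possibly
    // uninitialized: copy the bytes out of a `&MaybeUninit<T>`.
    fn store_from(&self, value: &MaybeUninit<T>, _ordering: Ordering)
    where
        T: Copy, // copying out of a shared reference requires `Copy` here
    {
        unsafe { *self.0.get() = *value };
    }
}
```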
> - In order for this to be efficient, we need an additional intrinsic hooking into special support in LLVM. (Which LLVM needs to have anyway for C++.)
How do you plan to implement this until LLVM implements this?
I don't think it is necessary to explain the implementation details in the RFC, but if we provide an unsound implementation until the as yet unmerged C++ proposal is implemented in LLVM in the future, that seems to be a problem.
(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)
I haven't looked into the details yet of what's possible today with LLVM. There are a few possible outcomes:

- We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
- Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like `ptr::{read_volatile, write_volatile}` combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
- We use the existing `llvm.memcpy.element.unordered.atomic`, after figuring out the consequences of the `unordered` property.
- Until LLVM support appears, we implement it in the library using a loop of `AtomicUsize::load()`/`store()`s and a fence, possibly using an efficient inline assembly alternative for some popular architectures.

I'm not fully sure yet which of these are feasible.
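The library-fallback option (a loop of narrow atomic accesses plus a fence) can be sketched as follows. This is a simplified illustration using per-byte `AtomicU8` loads rather than `AtomicUsize`, and the type layout and names are assumptions for this sketch, not an actual implementation:

```rust
use std::cell::UnsafeCell;
use std::mem::{size_of, MaybeUninit};
use std::sync::atomic::{fence, AtomicU8, Ordering};

// Sketch of a library-only fallback: an acquire load implemented as relaxed
// per-byte atomic loads followed by an acquire fence. A real implementation
// would use word-sized accesses (or an intrinsic) where possible.
struct AtomicPerByte<T>(UnsafeCell<MaybeUninit<T>>);

// Shared access is the whole point of the type; the per-byte atomics make
// the concurrent reads/writes free of data races (though values may tear).
unsafe impl<T: Send> Sync for AtomicPerByte<T> {}

impl<T> AtomicPerByte<T> {
    fn new(value: T) -> Self {
        Self(UnsafeCell::new(MaybeUninit::new(value)))
    }

    fn load(&self, _ordering: Ordering) -> MaybeUninit<T> {
        let mut out = MaybeUninit::<T>::uninit();
        let src = self.0.get() as *const AtomicU8;
        let dst = out.as_mut_ptr() as *mut u8;
        for i in 0..size_of::<T>() {
            // Relaxed per-byte atomic loads: no data race, but the result
            // may be a torn mix of several concurrent stores.
            unsafe { *dst.add(i) = (*src.add(i)).load(Ordering::Relaxed) };
        }
        // Upgrade the relaxed loads to acquire semantics.
        fence(Ordering::Acquire);
        out
    }
}
```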
I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about LLVM's. (It seems …
> # Unresolved questions
>
> - Should we require `T: Copy`?
Given this is a quite advanced tool, I would lean towards not requiring it. Some documentation in `AtomicPerByte` could ameliorate the issue.
Yes, IMO this should not require `Copy`, since there's no way to shake off this requirement. I think it's reasonable to use this for a `T` which happens to hold something else in an `UnsafeCell` in it, which would make it non-`Copy`.
> but it's easy to accidentally cause undefined behavior by using `load` to make an extra copy of data that shouldn't be copied.
>
> - Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?
Given these options and considering what the C++ paper chose, `AtomicPerByte` sounds OK and has the advantage of having `Atomic` as a prefix.
`AtomicPerByteMaybeUninit` or `AtomicPerByteManuallyDrop` to also resolve the other concern around dropping? Those are terrible names though...
Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed (it behaves a lot more like a non-atomic variable in this sense). In practical terms, in single-thread scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without having that fence have a cost on weakly ordered machines (e.g. without implementing it as an overly-strong acquire/release fence). That said, I think we could add an intrinsic to LLVM that does what we want here. I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)

CC @RalfJung, who has stronger opinions on Unordered (and is the one who provided that link in the past). I think we can easily implement this with relaxed in compiler-builtins though, but it should get a new intrinsic, since many platforms can implement it more efficiently.

We already have unordered atomic memcpy intrinsics in compiler-builtins, for 1-, 2-, 4- and 8-byte access sizes.

I'm not sure we'd want unordered, as mentioned above...
To clarify the difference between relaxed and unordered (in terms of loads and stores), consider:

```rust
static ATOM: AtomicU8 = AtomicU8::new(0);
const O: Ordering = ???;

fn thread1() {
    ATOM.store(1, O);
    ATOM.store(2, O);
}

fn thread2() {
    let a = ATOM.load(O);
    let b = ATOM.load(O);
    assert!(a <= b);
}
```

In other words, for unordered, it would be legal for 2 to be stored before 1, or for …
Something that could work but not be technically correct is: … Those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics. Assuming you're on any modern CPU (except Alpha, IIRC) it will behave like relaxed atomics, because that's what standard load/store instructions do.
Those fences aren't always no-ops at runtime; they actually emit code on several platforms (rust-lang/rust#62256). It's also unclear what can and can't be reordered across compiler fences (rust-lang/unsafe-code-guidelines#347); certainly plain stores can in some cases (this is easy to show happening in godbolt). Either way, my point has not been that we can't implement this. We absolutely can, and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that.

I like …
> The somewhat popular `seqlock` crate and similar implementations found in the ecosystem all use a regular non-atomic write (preceded by an atomic fence) for writing, and `ptr::read_volatile` (followed by an atomic fence) for reading. This "works" "fine", but is technically undefined behavior.
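Reduced to its core, the ecosystem pattern described here looks like the following sketch (the struct name is made up for illustration; the sequence-counter handling around these accesses is elided). Under concurrent access the volatile read races with the plain write, which is the technically-undefined part; exercised single-threaded, as below, it is well defined:

```rust
use std::cell::UnsafeCell;
use std::ptr;
use std::sync::atomic::{fence, Ordering};

// Illustration of the seqlock-style pattern the RFC text describes.
struct SeqlockData(UnsafeCell<u64>);

impl SeqlockData {
    fn write(&self, value: u64) {
        // Writers: an atomic fence, then a regular non-atomic write.
        fence(Ordering::Release);
        unsafe { ptr::write(self.0.get(), value) };
    }

    fn read(&self) -> u64 {
        // Readers: a volatile read, then an atomic fence. The surrounding
        // sequence-number validation (which discards torn reads) is elided.
        let value = unsafe { ptr::read_volatile(self.0.get()) };
        fence(Ordering::Acquire);
        value
    }
}
```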
It may be worth noting that, aside from the volatile-volatile data race (which can be somewhat justified by always detecting and discarding the value of a racing read), the volatile-fence pattern doesn't actually provide synchronisation under the C++ memory model.

Fences only "interact" with atomic accesses. Volatiles are not atomic accesses. So unless SeqCst is involved, release fence + volatile write and volatile read + acquire fence do nothing synchronisation-wise. This can cause the reader to miss a sequence update and use the volatile read value instead of discarding it, even though the writer did a racing write in the meantime.

Miri can already reproduce this: crossbeam-rs/crossbeam#859. Though most (all?) LLVM compilation schemes emit the same code for volatile and relaxed: https://godbolt.org/z/MsGnq1fGh, so it probably doesn't blow up in real life, until some optimisations come along.

Byte-wise atomics are actually required to resolve this without using potentially more expensive RMWs and SCs.
```rust
loop {
    let s1 = self.seq.load(Acquire);
    let data = read_data(&self.data, Acquire);
    let s2 = self.seq.load(Relaxed);
```
There's something very subtle here that I had not appreciated until a few weeks ago: we have to ensure that the `load` here cannot return an outdated value that would prevent us from noticing a seqnum bump.

The reason this is the case is that if there is a concurrent `write`, and if *any* part of `data` reads from that write, then we have a release-acquire pair, so then we are guaranteed to see at least the first `fetch_add` from `write`, and thus we will definitely see a version conflict. OTOH if the `s1` reads-from some second `fetch_add` in `write`, then that forms a release-acquire pair, and we will definitely see the full data.

So, all the release/acquire are necessary here. (I know this is not a seqlock tutorial, and @m-ou-se is certainly aware of this, but it still seemed worth pointing out -- many people reading this will not be aware of this.)

(This is related to this comment by @cbeuw.)
Yeah exactly. This is why people are sometimes asking for a "release-load" operation. This second load operation needs to happen "after" the `read_data()` part, but the usual (incorrect) `read_data` implementation doesn't involve atomic operations or a memory ordering, so they attempt to solve this issue with a memory ordering on that final load, which isn't possible. The right solution is a memory ordering on the `read_data()` operation.
Under a reordering-based atomic model (as CPUs use), a release load makes sense and works. Release loads don't really work unless they are also RMWs (`fetch_add(0)`) under the C11 model.
Yeah, the famous seqlock paper discusses "read dont-modify write" operations.
> while the second one is basically a memory fence followed by a series of `AtomicU8::store`s. Except the implementation can be much more efficient. The implementation is allowed to load/store the bytes in any order, and doesn't have to operate on individual bytes.
The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

I was originally thinking this would be specified as a series of `AtomicU8` load/store with the respective order, no fence involved. That would still allow merging adjacent writes (I think), but it would not allow reordering bytes. I wonder if we could get away with that, or if implementations actually need the ability to reorder.
For a memcpy (meaning the two regions are exclusive) you generally want to copy using increasing address order ("forward") on all hardware I've ever heard of. Even if a forward copy isn't faster (which it often is), it's still the same speed as a reverse copy.
I suspect the "any order is allowed" is just left in as wiggle room for potentially strange situations where somehow a reverse order copy would improve performance.
> The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

> I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved.

In the C++ paper they are specified basically as:

```cpp
for (size_t i = 0; i < count; ++i) {
    reinterpret_cast<char*>(dest)[i] =
        atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order::relaxed);
}
atomic_thread_fence(order);
```

and

```cpp
atomic_thread_fence(order);
for (size_t i = 0; i < count; ++i) {
    atomic_ref<char>(reinterpret_cast<char*>(dest)[i]).store(
        reinterpret_cast<char*>(source)[i], memory_order::relaxed);
}
```
> A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

Yes, relaxed loads/stores to different locations can be reordered, so specifying their order is moot under the as-if rule.

> In the C++ paper they are basically as:

Hm... but usually fences and accesses are far from equivalent. If we specify them like this, calling code can rely on the presence of these fences. For example, changing a 4-byte atomic acquire memcpy to an `AtomicU32` acquire load would not be correct (even if we know everything is initialized and aligned etc.).

Fences make all preceding/following relaxed accesses potentially induce synchronization, whereas release/acquire accesses only do that for that particular access.
Yeah, I don't think we should expose Unordered to users in any way until we are ready and willing to have our own concurrency memory model separate from that of C++ (or until C++ has something like unordered, and it's been shown to also make sense formally). There are some formal memory models with "plain" memory accesses, which are similar to unordered (no total mo order but race conditions allowed), but I have no idea if those are an accurate model of LLVM's unordered accesses. Both serve the same goal though, so there's a high chance they are at least related: both aim to model Java's regular memory accesses.
Well, I sure hope we're not using them in any way that actually becomes observable in program behavior, as that would be unsound.
It does seem kind of odd to me that unsafe code is needed to unpack the result of a byte-wise load even for types where that is actually always safe, because any correctly sized stream of bytes is a valid representation. For a caricatural example, I do not see why I should need unsafe code to load the contents of an …

In the actual use case that I have in mind (well, more exactly a client of that lib), it would be convenient to load an …

Basically, I second comment #3301 (comment) regarding the idea that any type that allows safe-transmute from a suitably sized stream of bytes should be safe to load from AtomicPerByte storage. MaybeUninit does not model that guarantee well, as it's designed to also handle uninitialized memory, which this abstraction cannot emit.
We have to be careful here, because not every stream of bytes is valid for … However, …
The way I read the API as proposed in the RFC text, it always … EDIT: In other words, …
You could have an enum with one variant being …

Yes, but that enum wouldn't count as "any correctly sized stream of bytes is a valid representation" to begin with.

The only types for which "any correctly sized stream of bytes is a valid representation" are …

As I said, the property "any correctly sized stream of bytes is a valid representation" is not necessary for making this API return a … Indeed …

That could mean we should use a new …

If any of the bytes of a value depend on other bytes, and the value gets torn at one or more random positions within the byte sequence, then you really can't do much with that torn thing. You mostly just have to reset the whole blob of bytes to some known good value.
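To illustrate the distinction being drawn in this sub-thread: for a type where every byte sequence is a valid value (here `[u8; 4]`), unpacking a possibly-torn per-byte load is sound, even though today's proposed API would still route it through `MaybeUninit` and `unsafe`. The function name is made up for this sketch:

```rust
use std::mem::MaybeUninit;

// Unpack a possibly-torn per-byte load of a [u8; 4].
fn unpack(torn: MaybeUninit<[u8; 4]>) -> [u8; 4] {
    // Sound only because (a) the storage is known to be initialized (the
    // abstraction cannot emit uninitialized bytes) and (b) every bit pattern
    // of [u8; 4] is a valid value, torn or not. This is exactly the invariant
    // a "valid for any bit pattern" trait could let safe code rely on.
    unsafe { torn.assume_init() }
}
```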
> - Instead of a type, this could all be just two functions on raw pointers, such as something like `std::ptr::copy_nonoverlaping_load_atomic_per_byte`.
>
>   This means having to use `UnsafeCell` and more unsafe code wherever this functionality is used.
>
>   It'd be inconsistent with the other atomic operations. We don't have e.g. `std::ptr::load_atomic` that operates on pointers either.
It would still be convenient to have both the `AtomicPerByte` type and the two functions on raw pointers. This would allow for additional use cases, such as:

- This lets programmers choose if they want to use just a `Seqlock` or combine a `Seqlock` with an `RwLock` to form a hybrid lock (for example).
- The `Seqlock` version counter and the protected data are able to live on separate cache lines, which reduces false sharing while a reader is repeatedly trying to read the version counter in a loop (in cases where the version counter is checked for writers before doing a potentially racy read).
- The protected data memory layout is unconstrained by the lock, e.g. elements in a contiguous array can each be associated with individual locks stored in a separate array, without having to account for this in the data array memory layout, which could otherwise lead to additional unnecessary implicit padding if the elements and the version counter have a different size.
Thinking more about this, I believe I got two things mixed up! The `AtomicPerByte` type already allows for all these use cases, although there is another use case for which it is not clear whether it is covered by `AtomicPerByte`:

Does `AtomicPerByte` make any guarantees about whether its `load_to_slice`/`store_from_slice` methods make element-wise atomic copies of `T`, or does it do an atomic copy for the entire slice at once? The latter may allow for a more efficient atomic `memcpy` implementation if the slice happens to be word-aligned even though `T` is not (and the slice length is also a multiple of the word alignment), but then it would be unsound to call methods like `<[AtomicPerByte<T>]>::split_at` that can affect the alignment of the slice if `T` is not word-aligned. If `AtomicPerByte` would promise to only make element-wise atomic copies in this case, then this yields a (very advanced) use case for the two separate pointer methods, which could only be soundly called if the programmer can uphold that there will be no multiple racing atomic `memcpy`s that:

- operate on an overlapping memory region, and
- for which one of the copies uses a non-word-aligned offset or a length which is not a multiple of the word size.

Otherwise this could lead to mixing non-perfectly-overlapping atomic accesses in the `memcpy` implementation, which might be UB (assuming that atomic `memcpy` uses the machine word size as the largest granularity for a single instruction; otherwise replace "word size" in this comment with the maximum granularity with which `memcpy` can theoretically operate. EDIT: it seems like the idea of atomic SIMD operations has been floated before, and there are 128-bit atomic instructions like `core::arch::x86_64::cmpxchg16b`, suggesting that an atomic `memcpy` could choose to use larger than machine word copy granularity).
I would expect `AtomicPerByte` to make per-byte atomic copies, as its name indicates and as the RFC specifies. That means for elements that are larger than 1 byte, this is not even atomic for individual elements, let alone the entire slice.
With "element-wise atomic copy" I mean whether the internal implementation of `AtomicPerByte` will perform a separate atomic memcpy method call per element, as opposed to performing a single atomic memcpy method call on the entire slice just once (which may still issue one or more atomic instructions due to making per-byte atomic copies). The difference may matter for the alignment of the pointer that is passed to the atomic `memcpy` method, which may cause the internal implementation of the atomic `memcpy` method to choose a different atomic word size and/or alignment depending on that pointer's alignment, leading to potentially mixing non-perfectly-overlapping atomic accesses in the face of safe methods like `<[AtomicPerByte<T>]>::split_at`, for which the second half of the split slice may have a different alignment in memory and size than the whole slice.

To complete the example of how this could be unsound when `AtomicPerByte` does just a single atomic memcpy method call internally for the entire slice, consider calling `AtomicPerByte::store_from_slice(&whole_slice, ..)` and `AtomicPerByte::load_to_slice(whole_slice.split_at(1).1, ..)` simultaneously, where `whole_slice` is a slice of bytes aligned to the word size or largest SIMD size.

If "perform a single atomic `memcpy` method call for this slice of `T`" is not a supported use case of `AtomicPerByte`, then maybe that could be an argument to also include the `atomic_load_per_byte_memcpy` and `atomic_store_per_byte_memcpy` methods in the public API proposed by the RFC? (Currently it is phrased to exclude exposing those methods.)
> With "element-wise atomic copy" I mean whether the internal implementation of AtomicPerByte will perform a separate atomic memcpy method call per element, as opposed to performing a single atomic memcpy method call on the entire slice just once (which may still issue one or more atomic instructions due to making per-byte atomic copies).

I don't think any of that is even defined in the spec; these are all unspecified implementation details.

The only alignment that `load_to_slice` and `store_from_slice` can assume is the one given by the type: `this` is aligned according to `[Self]`, which has the same alignment as `Self`. Also, clearly these methods may not perform any (Abstract Machine visible) memory access outside of the slice they are handed.

I feel like I don't understand the question, since this seems obvious. Or are you repeating #3301 (comment) here? I am assuming this thread is disjoint from the one where we are discussing the x86 "concurrent atomic accesses to not-perfectly-overlapping locations" problem.
I think the question here is different from the linked comment, though it touches on similar areas. By my understanding it is:

Can `AtomicPerByte::store_from_slice` use larger atomic stores to write multiple elements at once or not? (The same argument applies to loads, but for brevity I'm limiting my explanation to stores.)

Consider the following situation, explained rather much, but I'd rather overexplain than introduce more confusion by underexplaining:

```rust
static ARRAY: [AtomicPerByte<u16>; LEN] = [AtomicPerByte::new(0); LEN];

fn thread_1() {
    let new = [MaybeUninit::new(1u16); LEN];
    AtomicPerByte::store_from_slice(&ARRAY, &new, Ordering::Release);
}
```

If `store_from_slice` can make larger stores, then it could be implemented as an optimized memcpy, which may do individual stores larger than the size of the individual elements. For instance:

- write 0-7 bytes at the start and end of the slice using one-byte atomic stores
- write the middle (now 8-byte aligned) part of the slice with 64-byte atomic stores

However, this is unsound, because another thread could store to a non-exactly-overlapping slice:

```rust
fn thread_2() {
    let new = [MaybeUninit::new(2u16); 3];
    let lower = &ARRAY[0..3];
    AtomicPerByte::store_from_slice(&lower, &new, Ordering::Release);
}
```

In the case that `ARRAY` happened to be 8-byte aligned (and `LEN >= 4`), then thread 1 will write against the beginning of the array with an 8-byte store, and thread 2 will write against the beginning of the array with a 1-byte store, leading to undefined behavior on `x86_64`.

`store_from_slice` (and `load_to_slice`), on platforms like `x86_64` that disallow non-exactly-overlapping atomic accesses, cannot be implemented with atomic stores that overlap multiple elements, and thus are limited in store sizes by the size of the element.

Note that on `x86_64` I believe this problem is only true for `Release` stores (and `Acquire` loads) because `Relaxed` operations compile to non-atomic stores which are allowed to be non-exactly-overlapping (I think?? I'm not confident in this part.)

Phrased differently: `store_from_slice` cannot write to (Abstract Machine visible) memory outside the slice (already expected). But furthermore, where atomic operations cannot non-exactly overlap, it must make the same-size atomic operations for any slice, and so must do its writes as if it does individual `AtomicPerByte::store`s for each element. And furthermore, it must actually do this in reality (paying a performance cost versus larger writes when it's not actually required) for at least `x86_64` with `Acquire` or `Release`.

Thus,

> If "perform a single atomic memcpy method call for this slice of T" is not a supported use case of AtomicPerByte, then maybe that could be an argument to also include the atomic_load_per_byte_memcpy and atomic_store_per_byte_memcpy methods in the public API proposed by the RFC? (currently it is phrased to exclude exposing those methods)

is an argument to add additional functions

```rust
unsafe fn atomic_load_per_byte_memcpy<T>(src: &[AtomicPerByte<T>], dest: &mut [T], ordering: Ordering);
unsafe fn atomic_store_per_byte_memcpy<T>(src: &[T], dest: &[AtomicPerByte<T>], ordering: Ordering);
```

that have the additional requirement that no non-exactly-overlapping slice of the provided slice of `AtomicPerByte`s is accessed concurrently, thereby enabling the optimization to use larger atomic writes.
> leading to undefined behavior on x86_64

Despite what the manual says, Intel has a "verbal" guarantee that non-exactly-overlapping atomics are fine: rust-lang/unsafe-code-guidelines#345 (comment). It may tear across cache lines (which is fine) but will not put the CPU into an undefined state. This situation is unlikely to change given the need to keep C++'s byte-wise memcpy working too.
> Despite what the manual says, Intel has a "verbal" guarantee that non-exactly-overlapping atomics is fine

While this is true, this has been discussed in that linked issue and elsewhere in this PR, and my reading of these discussions is a disposition to respect the manual and treat it as undefined, even if it works in practice. If that changes, then that does solve this issue, but otherwise this remains a problem (though it could be not solved at all, or a solution could be postponed).
> this has been discussed

That mailing list post was sent on 6 April; all prior discussions were done without this piece of information. The only comment on this RFC on non-exactly-overlapping access since then was yours, so I don't think we can say there's a disposition yet, especially in light of the new information.

In general I don't think "stick to the documentation" is a good approach when dealing with proprietary platforms; otherwise we can't ship anything for macOS at all. If the vendor is receptive, we should nag them to change the documentation like @m-ou-se has done before with MicrosoftDocs/sdk-api#447, but even if they don't change it, we shouldn't have our hands tied by it.

In this case, the C++ proposal is committed to allowing multi-byte access optimisations, and they are privy to more information from vendors, so I think sticking to their approach should be safe.
Yes, it's a totally reasonable (and probably good) resolution to this and other issues. I'm not trying to argue against (or necessarily for) that, but rather to clearly explain the issue that was raised in this particular sub-thread, over which there seemed to be confusion that was not resolved.
I'm trying to wrap my head around the implications of these statements. Does this mean that: …

If the second situation is the case, does this also imply that it could be UB if …
That is definitely not the intent.
Ah, good question. If we have two overlapping non-equal atomic-per-byte accesses, we could run afoul of the x86 requirement to not mix differently-sized atomic accesses. Does C++ do anything here to prevent that?

A conforming implementation could use `u32` for load and `u64` for store.

That wouldn't be conforming if it causes UB on the x86 assembly level due to mixed-size concurrent accesses. (See above.)

I didn't say x86, but in general this should be valid.

Well, we are talking about x86 here currently, so if you mean something else you should clarify that you are not replying to the previous posts just above yours. By default people will naturally assume that you are continuing the active discussion, not throwing in an unrelated remark.
I know that the motivation for this feature is implementing seqlocks, but let me state another motivation: safe …

Currently at least Linux allows modifying mapped files and reflects the modifications to program memory, even if the map is created with …

I think that with the current memory model, only atomics can guarantee freedom from UB in the face of spurious modifications. (Another possibility seems to be allowing concurrent modifications with …)

This API looks relevant for this use in the sense that the motivation for this …

I recently expressed my dissatisfaction with the current situation of …
I would be in favor of adding …

AFAICT, representing a mmap via …

@programmerjake Ah, I should have mentioned that. Yes, I think all of the current numeric atomics are valid for that use case. This would just let the user pick a custom type, while letting …
After thinking about it a bit more, I'm going to try to see what this proposal looks like if we add a new …

Since it is a property of a type, wouldn't this be an (unsafe) trait rather than a wrapper type?
For certain types such as …

note that …

I just realized that atomic …

This question came up before but indeed has not been answered yet, I think.
I have vague memories of someone once mentioning in one of the numerous threads about this topic that they were approached by Intel engineers who claimed that this is fine on every current Intel hardware, but spec people are feeling nervous about committing to the public statement that it will stay fine on all future hardware, as they fear this could prevent future hardware optimizations.

So basically x86-TSO memory model history repeating itself: the spec specifies a certain memory model, hardware actually follows a stronger memory model, software people would like to leverage useful properties of the stronger hardware memory model in their programs, but hardware people first need to decide if they are ready to permanently close the door on the envisioned future hardware optimizations that led them to specify a weaker-than-necessary memory model in the first place.

Anyway, instead of trusting my fallible memory, maybe let me ping @joshtriplett in case they have more reliable memories or can in turn summon some former Intel colleague(s) to provide more authoritative material.

If the eventual conclusion is that mixed-size atomics are not fine, then it would indeed heavily constrain the implementation of the proposed atomic memcpy functionality. On one hand, we would like implementations to use the most efficient hardware load and store operations, not just byte loads and stores but at least native word loads and stores, and ideally even SIMD loads and stores and more exotic load/stores like … `struct AtomicMemcpyZone(Box<[AtomicU8]>);` … would be undesirable, and not an obvious improvement over the status quo. On the other hand, getting to the desired goal without hitting mixed-size atomics UB would mean something along the lines of the following…

...and that would definitely be a very nasty memory model to commit to :(
Rendered