Age | Commit message (Collapse) | Author | Lines |
|
this global lock allows certain unlock-type primitives to exclude
mmap/munmap operations which could change the identity of virtual
addresses while references to them still exist.
the original design mistakenly assumed mmap/munmap would conversely
need to exclude the same operations which exclude mmap/munmap, so the
vmlock was implemented as a sort of 'symmetric recursive rwlock'. this
turned out to be unnecessary.
commit 25d12fc0fc51f1fae0f85b4649a6463eb805aa8f already shortened the
interval during which mmap/munmap held their side of the lock, but
left the inappropriate lock design and some inefficiency.
the new design uses a separate function, __vm_wait, which does not
hold any lock itself and only waits for lock users which were already
present when it was called to release the lock. this is sufficient
because of the way operations that need to be excluded are sequenced:
the "unlock-type" operations using the vmlock need only block
mmap/munmap operations that are precipitated by (and thus sequenced
after) the atomic-unlock they perform while holding the vmlock.
this allows for a spectacular lack of synchronization in the __vm_wait
function itself.
|
|
these are mandatory cancellation points per POSIX, so their omission
was a conformance bug.
|
|
The intent of this is to avoid name space pollution of the C threads
implementation.
This has two sides to it. First we have to provide symbols that wouldn't
pollute the name space for the C threads implementation. Second we have
to clean up some internal uses of POSIX functions such that they don't
implicitly drag in such symbols.
|
|
the whole point of this locking is to prevent munmap, or mmap with
MAP_FIXED, from deallocating virtual addresses, or changing the
backing a given virtual address refers to, during certain race windows
involving self-synchronized unmapping or destruction of pthread
synchronization objects. there is no need for exclusion in the other
direction, so it suffices to take the lock momentarily and release it
before making the syscall, rather than holding it across the syscall.
|
|
|
|
|
|
PAGE_SIZE was hardcoded to 4096, which is historically what most
systems use, but on several archs it is a kernel config parameter,
user space can only know it at execution time from the aux vector.
PAGE_SIZE and PAGESIZE are not defined on archs where page size is
a runtime parameter, applications should use sysconf(_SC_PAGE_SIZE)
to query it. Internally libc code defines PAGE_SIZE to libc.page_size,
which is set to aux[AT_PAGESZ] in __init_libc and early in __dynlink
as well. (Note that libc.page_size can be accessed without GOT, ie.
before relocations are done)
Some fpathconf settings are hardcoded to 4096, these should be actually
queried from the filesystem using statfs.
|
|
|
|
internally, other parts of the library assume sizes don't overflow
ssize_t and/or ptrdiff_t, and the way this assumption is made valid is
by preventing creating of such large objects. malloc already does so,
but the check was missing from mmap.
this is also a quality of implementation issue: even if the
implementation internally could handle such objects, applications
could inadvertently invoke undefined behavior by subtracting pointers
within an object. it is very difficult to guard against this in
applications, so a good implementation should simply ensure that it
does not happen.
|
|
the previous logic was assuming the kernel would give EINVAL when
passed an invalid address, but instead with MAP_FIXED it was giving
EPERM, as it considered this an attempt to map over kernel memory.
instead of trying to get the kernel to do the rigth thing, the new
code just handles the error in userspace.
I have also cleaned up the code to use a single mask to check for
invalid low bits and unsupported high bits, so it's simpler and more
clearly correct. the old code was actually wrong for sizeof(long)
smaller than sizeof(off_t) but not equal to 4; now it should be
correct for all possibilities.
for 64-bit systems, the low-bits test is new and extraneous (the
kernel should catch the error anyway when the mmap2 syscall is not
used), but it's cheap anyway. if this is an issue, the OFF_MASK
definition could be tweaked to omit the low bits when SYS_mmap2 is not
defined.
|
|
this function was overly complicated and not even obviously correct.
avoid using openat/linkat just like in shm_open, and instead expand
pathname using code shared with shm_open. remove bogus (and dangerous,
with priorities) use of spinlocks.
this commit also heavily streamlines the code and ensures there are no
failure cases that can happen after a new semaphore has been created
in the filesystem, since that case is unreportable.
|
|
1. don't make non-cloexec file descriptors
2. cancellation safety (cleanup handlers were missing, now unneeded)
3. share name validation/mapping code between open/unlink functions
4. avoid wasteful/slow syscalls
|
|
|
|
this implementation is rather heavy-weight, but it's the first
solution i've found that's actually correct. all waiters actually wait
twice at the barrier so that they can synchronize exit, and they hold
a "vm lock" that prevents changes to virtual memory mappings (and
blocks pthread_barrier_destroy) until all waiters are finished
inspecting the barrier.
thus, it is safe for any thread to destroy and/or unmap the barrier's
memory as soon as pthread_barrier_wait returns, without further
synchronization.
|
|
per POSIX: The mprotect() function shall change the access protections
to be that specified by prot for those whole pages containing any part
of the address space of the process starting at address addr and
continuing for len bytes.
on the other hand, linux mprotect fails with EINVAL if the base
address and/or length is not page-aligned, so we have to align them
before making the syscall.
|
|
|
|
the check against MADV_DONTNEED to because linux MADV_DONTNEED
semantics conflict dangerously with the POSIX semantics
|
|
|
|
|
|
|
|
- hide all the legacy xxxxxx32 name cruft in syscall.h so the actual
source files can be clean and uniform across all archs.
- cleanup llseek/lseek and mmap2/mmap handling for 32/64 bit systems
- alternate implementation for nice if the target lacks nice syscall
|
|
|