Age | Commit message (Collapse) | Author | Lines |
|
during calls to free, any free chunks adjacent to the chunk being
freed are momentarily held in allocated state for the purpose of
merging, possibly leaving little or no available free memory for other
threads to allocate. under this condition, other threads will attempt
to expand the heap rather than waiting to use memory that will soon be
available. the race window where this happens is normally very small,
but became huge when free chooses to use madvise to release unused
physical memory, causing unbounded heap size growth.
this patch drastically shrinks the race window for unwanted heap
expansion by performing madvise with the bin lock held and marking the
bin non-empty in the binmask before making the expensive madvise
syscall. testing by Timo Teräs has shown this approach to be a
suitable mitigation.
more invasive changes to the synchronization between malloc and free
would be needed to completely eliminate the problem. it's not clear
whether such changes would improve or worsen typical-case performance,
or whether this would be a worthwhile direction to take malloc
development.
|
|
previously, calloc's implementation encoded assumptions about the
implementation of malloc, accessing a size_t word just prior to the
allocated memory to determine if it was obtained by mmap to optimize
out the zero-filling. when __simple_malloc is used (static linking a
program with no realloc/free), it doesn't matter if the result of this
check is wrong, since all allocations are zero-initialized anyway. but
the access could be invalid if it crosses a page boundary or if the
pointer is not sufficiently aligned, which can happen for very small
allocations.
this patch fixes the issue by moving the zero-fill logic into malloc.c
with the full malloc, as a new function named __malloc0, which is
provided by a weak alias to __simple_malloc (which always gives
zero-filled memory) when the full malloc is not in use.
|
|
this extends the brk/stack collision protection added to full malloc
in commit 276904c2f6bde3a31a24ebfa201482601d18b4f9 to also protect the
__simple_malloc function used in static-linked programs that don't
reference the free function.
it also extends support for using mmap when brk fails, which full
malloc got in commit 5446303328adf4b4e36d9fba21848e6feb55fab4, to
__simple_malloc.
since __simple_malloc may expand the heap by arbitrarily large
increments, the stack collision detection is enhanced to detect
interval overlap rather than just proximity of a single address to the
stack. code size is increased a bit, but this is partly offset by the
sharing of code between the two malloc implementations, which due to
linking semantics, both get linked in a program that needs the full
malloc with realloc/free support.
|
|
the linux/nommu fdpic ELF loader sets up the brk range to overlap
entirely with the main thread's stack (but growing from opposite
ends), so that the resulting failure mode for malloc is not to return
a null pointer but to start returning pointers to memory that overlaps
with the caller's stack. needless to say this extremely dangerous and
makes brk unusable.
since it's non-trivial to detect execution environments that might be
affected by this kernel bug, and since the severity of the bug makes
any sort of detection that might yield false-negatives unsafe, we
instead check the proximity of the brk to the stack pointer each time
the brk is to be expanded. both the main thread's stack (where the
real known risk lies) and the calling thread's stack are checked. an
arbitrary gap distance of 8 MB is imposed, chosen to be larger than
linux default main-thread stack reservation sizes and larger than any
reasonable stack configuration on nommu.
the effeciveness of this patch relies on an assumption that the amount
by which the brk is being grown is smaller than the gap limit, which
is always true for malloc's use of brk. reliance on this assumption is
why the check is being done in malloc-specific code and not in __brk.
|
|
this re-check idiom seems to have been copied from the alloc_fwd and
alloc_rev functions, which guess a bin based on non-synchronized
memory access to adjacent chunk headers then need to confirm, after
locking the bin, that the chunk is actually in the bin they locked.
the check being removed, however, was being performed on a chunk
obtained from the already-locked bin. there is no race to account for
here; the check could only fail in the event of corrupt free lists,
and even then it would not catch them but simply continue running.
since the bin_index function is mildly expensive, it seems preferable
to remove the check rather than trying to convert it into a useful
consistency check. casual testing shows a 1-5% reduction in run time.
|
|
the malloc init code provided its own version of pthread_once type
logic, including the exact same bug that was fixed in pthread_once in
commit 0d0c2f40344640a2a6942dda156509593f51db5d.
since this code is called adjacent to expand_heap, which takes a lock,
there is no reason to have pthread_once-type initialization. simply
moving the init code into the interval where expand_heap already holds
its lock on the brk achieves the same result with much less
synchronization logic, and allows the buggy code to be eliminated
rather than just fixed.
|
|
the memory model we use internally for atomics permits plain loads of
values which may be subject to concurrent modification without
requiring that a special load function be used. since a compiler is
free to make transformations that alter the number of loads or the way
in which loads are performed, the compiler is theoretically free to
break this usage. the most obvious concern is with atomic cas
constructs: something of the form tmp=*p;a_cas(p,tmp,f(tmp)); could be
transformed to a_cas(p,*p,f(*p)); where the latter is intended to show
multiple loads of *p whose resulting values might fail to be equal;
this would break the atomicity of the whole operation. but even more
fundamental breakage is possible.
with the changes being made now, objects that may be modified by
atomics are modeled as volatile, and the atomic operations performed
on them by other threads are modeled as asynchronous stores by
hardware which happens to be acting on the request of another thread.
such modeling of course does not itself address memory synchronization
between cores/cpus, but that aspect was already handled. this all
seems less than ideal, but it's the best we can do without mandating a
C11 compiler and using the C11 model for atomics.
in the case of pthread_once_t, the ABI type of the underlying object
is not volatile-qualified. so we are assuming that accessing the
object through a volatile-qualified lvalue via casts yields volatile
access semantics. the language of the C standard is somewhat unclear
on this matter, but this is an assumption the linux kernel also makes,
and seems to be the correct interpretation of the standard.
|
|
this issue mainly affects PIE binaries and execution of programs via
direct invocation of the dynamic linker binary: depending on kernel
behavior, in these cases the initial brk may be placed at at location
where it cannot be extended, due to conflicting adjacent maps.
when brk fails, mmap is used instead to expand the heap. in order to
avoid expensive bookkeeping for managing fragmentation by merging
these new heap regions, the minimum size for new heap regions
increases exponentially in the number of regions. this limits the
number of regions, and thereby the number of fixed fragmentation
points, to a quantity which is logarithmic with respect to the size of
virtual address space and thus negligible. the exponential growth is
tuned so as to avoid expanding the heap by more than approximately 50%
of its current total size.
|
|
I wrongly assumed the brk syscall would set errno, but on failure it
returns the old value of the brk rather than an error code.
|
|
if a multithreaded program became non-multithreaded (i.e. all other
threads exited) while one thread held an internal lock, the remaining
thread would fail to release the lock. the the program then became
multithreaded again at a later time, any further attempts to obtain
the lock would deadlock permanently.
the underlying cause is that the value of libc.threads_minus_1 at
unlock time might not match the value at lock time. one solution would
be returning a flag to the caller indicating whether the lock was
taken and needs to be unlocked, but there is a simpler solution: using
the lock itself as such a flag.
note that this flag is not needed anyway for correctness; if the lock
is not held, the unlock code is harmless. however, the memory
synchronization properties associated with a_store are costly on some
archs, so it's best to avoid executing the unlock code when it is
unnecessary.
|
|
the sizes in the header and footer for a chunk should always match. if
they don't, the program has definitely invoked undefined behavior, and
the most likely cause is a simple overflow, either of a buffer in the
block being freed or the one just below it.
crashing here should not only improve security of buggy programs, but
also aid in debugging, since the crash happens in a context where you
have a pointer to the likely-overflowed buffer.
|
|
this change fixes an obscure issue with some nonstandard kernels,
where the initial brk syscall returns a pointer just past the end of
bss rather than the beginning of a new page. in that case, the dynamic
linker has already reclaimed the space between the end of bss and the
page end for use by malloc, and memory corruption (allocating the same
memory twice) will occur when malloc again claims it on the first call
to brk.
|
|
with this patch, the malloc in libc.so built with -Os is nearly the
same speed as the one built with -O3. thus it solves the performance
regression that resulted from removing the forced -O3 when building
libc.so; now libc.so can be both small and fast.
|
|
CHUNK_SIZE macro was defined incorrectly and shaving off at least one
significant bit in the size of mmapped chunks, resulting in the test
for oldlen==newlen always failing and incurring a syscall. fortunately
i don't think this issue caused any other observable behavior; the
definition worked correctly for all non-mmapped chunks where its
correctness matters more, since their lengths are always multiples of
the alignment.
|
|
gcc generates extremely bad code (7 byte immediate mov) for the old
null pointer write approach. it should be generating something like
"xor %eax,%eax ; mov %al,(%eax)". in any case, using a dedicated
crashing opcode accomplishes the same thing in one byte.
|
|
a valid mmapped block will have an even (actually aligned) "extra"
field, whereas a freed chunk on the heap will always have an in-use
neighbor.
this fixes a potential bug if mmap ever allocated memory below the
main program/brk (in which case it would be wrongly-detected as a
double-free by the old code) and allows the double-free check to work
for donated memory outside of the brk area (or, in the future,
secondary heap zones if support for their creation is added).
|
|
|
|
even if size_t was 32-bit already, the fact that the value was
unsigned and that gcc is too stupid to figure out it would be positive
as a signed quantity (due to the immediately-prior arithmetic and
conditionals) results in gcc compiling the integer-to-float conversion
as zero extension to 64 bits followed by an "fildll" (64 bit)
instruction rather than a simple "fildl" (32 bit) instruction on x86.
reportedly fildll is very slow on certain p4-class machines; even if
not, the new code is slightly smaller.
|
|
|
|
|
|
the bug appeared only with requests roughly 2*sizeof(size_t) to
4*sizeof(size_t) bytes smaller than a multiple of the page size, and
only for requests large enough to be serviced by mmap instead of the
normal heap. it was only ever observed on 64-bit machines but
presumably could also affect 32-bit (albeit with a smaller window of
opportunity).
|
|
if init_malloc returns positive (successful first init), malloc will
retry getting a chunk from the free bins rather than expanding the
heap again. also pass init_malloc a hint for the size of the initial
allocation.
|
|
|
|
this change is made with some reluctance, but i think it's for the
best. correct programs must handle either behavior, so there is little
advantage to having malloc(0) return NULL. and i managed to actually
make the malloc code slightly smaller with this change.
|
|
|