path: root/kernel
Age  Commit message  Author  Lines
2015-03-06  ebpf: bpf_map_*: fix linker error on avr32 and openrisc arch  Daniel Borkmann  -0/+5
Fengguang reported that on openrisc and avr32 architectures, we get the following linker errors on *_defconfig builds that have no bpf syscall support:

  net/built-in.o:(.rodata+0x1cd0): undefined reference to `bpf_map_lookup_elem_proto'
  net/built-in.o:(.rodata+0x1cd4): undefined reference to `bpf_map_update_elem_proto'
  net/built-in.o:(.rodata+0x1cd8): undefined reference to `bpf_map_delete_elem_proto'

Fix it up by providing built-in weak definitions of the symbols, so they can be overridden when the syscall is enabled. I think the issue might be that gcc is not able to optimize all that away. This patch fixes the linker errors for me, tested with Fengguang's make.cross [1] script.

[1]
Reported-by: Fengguang Wu <>
Fixes: d4052c4aea0c ("ebpf: remove CONFIG_BPF_SYSCALL ifdefs in socket filter code")
Signed-off-by: Daniel Borkmann <>
Acked-by: Alexei Starovoitov <>
Signed-off-by: David S. Miller <>
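The weak-definition approach described above can be illustrated with a small userspace sketch. This is not the kernel patch itself: `struct func_proto`, `map_lookup_proto`, and `lookup_helper_available()` are hypothetical names, and the GCC/Clang `weak` attribute is used here on the assumption that it behaves like the kernel's weak annotation.

```c
#include <stddef.h>

/* Hypothetical stand-in for a helper descriptor; not the kernel's struct. */
struct func_proto {
	int (*func)(int key);
};

/*
 * Weak fallback definition with an empty function pointer. If another
 * object file provides a regular (strong) definition of the same symbol,
 * the linker uses that one; otherwise this definition satisfies the
 * reference and the link no longer fails.
 */
const struct func_proto map_lookup_proto __attribute__((weak)) = { NULL };

/* Returns 1 when a real helper implementation was linked in, 0 otherwise. */
int lookup_helper_available(void)
{
	return map_lookup_proto.func != NULL;
}
```

When built on its own, only the weak fallback exists, so `lookup_helper_available()` reports that no helper is present.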
2015-03-03Merge git:// S. Miller-163/+247
Conflicts: drivers/net/ethernet/rocker/rocker.c The rocker commit was two overlapping changes, one to rename the ->vport member to ->pport, and another making the bitmask expression use '1ULL' instead of plain '1'. Signed-off-by: David S. Miller <>
2015-03-01  Merge branch 'locking-urgent-for-linus' of ↵  Linus Torvalds  -0/+1
git:// Pull locking fix from Ingo Molnar: "An rtmutex deadlock path fixlet" * 'locking-urgent-for-linus' of git:// locking/rtmutex: Set state back to running on error
2015-03-01  cls_bpf: add initial eBPF support for programmable classifiers  Daniel Borkmann  -0/+2
This work extends the "classic" BPF programmable tc classifier by extending its scope also to native eBPF code!

This allows for user space to implement own custom, 'safe' C like classifiers (or whatever other frontend language LLVM et al may provide in future), that can then be compiled with the LLVM eBPF backend to an eBPF elf file. The result of this can be loaded into the kernel via iproute2's tc. In the kernel, they can be JITed on major archs and thus run in native performance.

Simple, minimal toy example to demonstrate the workflow:

  #include <linux/ip.h>
  #include <linux/if_ether.h>
  #include <linux/bpf.h>

  #include "tc_bpf_api.h"

  __section("classify")
  int cls_main(struct sk_buff *skb)
  {
    return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
  }

  char __license[] __section("license") = "GPL";

The classifier can then be compiled into eBPF opcodes and loaded via tc, for example:

  clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
  tc filter add dev em1 parent 1: bpf cls.o [...]

As it has been demonstrated, the scope can even reach up to a fully fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c).

For tc, maps are allowed to be used, but from kernel context only, in other words, eBPF code can keep state across filter invocations. In future, we perhaps may reattach from a different application to those maps e.g., to read out collected statistics/state.

Similarly as in socket filters, we may extend functionality for eBPF classifiers over time depending on the use cases. For that purpose, cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so we can allow additional functions/accessors (e.g. an ABI compatible offset translation to skb fields/metadata). For an initial cls_bpf support, we allow the same set of helper functions as eBPF socket filters, but we could diverge at some point in time w/o problem.
I was wondering whether cls_bpf and act_bpf could share C programs, I can imagine that at some point, we introduce i) further common handlers for both (or even beyond their scope), and/or if truly needed ii) some restricted function space for each of them. Both can be abstracted easily through struct bpf_verifier_ops in future. The context of cls_bpf versus act_bpf is slightly different though: a cls_bpf program will return a specific classid whereas act_bpf a drop/non-drop return code, latter may also in future mangle skbs. That said, we can surely have a "classify" and "action" section in a single object file, or considered mentioned constraint add a possibility of a shared section. The workflow for getting native eBPF running from tc [1] is as follows: for f_bpf, I've added a slightly modified ELF parser code from Alexei's kernel sample, which reads out the LLVM compiled object, sets up maps (and dynamically fixes up map fds) if any, and loads the eBPF instructions all centrally through the bpf syscall. The resulting fd from the loaded program itself is being passed down to cls_bpf, which looks up struct bpf_prog from the fd store, and holds reference, so that it stays available also after tc program lifetime. On tc filter destruction, it will then drop its reference. Moreover, I've also added the optional possibility to annotate an eBPF filter with a name (e.g. path to object file, or something else if preferred) so that when tc dumps currently installed filters, some more context can be given to an admin for a given instance (as opposed to just the file descriptor number). Last but not least, bpf_prog_get() and bpf_prog_put() needed to be exported, so that eBPF can be used from cls_bpf built as a module. Thanks to 60a3b2253c41 ("net: bpf: make eBPF interpreter images read-only") I think this is of no concern since anything wanting to alter eBPF opcode after verification stage would crash the kernel. 
[1] Signed-off-by: Daniel Borkmann <> Cc: Jamal Hadi Salim <> Cc: Jiri Pirko <> Acked-by: Alexei Starovoitov <> Signed-off-by: David S. Miller <>
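The toy classifier in the cls_bpf commit message returns `(0x800 << 16) | tos`. A minimal sketch of that packing follows, assuming the usual tc convention that the upper 16 bits of a class ID hold the major handle and the lower 16 bits the minor one. The function names are hypothetical and are not part of the kernel or iproute2.

```c
#include <stdint.h>

/* Pack a class ID: major handle in the upper 16 bits, minor in the lower. */
uint32_t make_classid(uint16_t major, uint16_t minor)
{
	return ((uint32_t)major << 16) | minor;
}

/* Extract the major handle from a packed class ID. */
uint16_t classid_major(uint32_t classid)
{
	return (uint16_t)(classid >> 16);
}

/* Extract the minor handle from a packed class ID. */
uint16_t classid_minor(uint32_t classid)
{
	return (uint16_t)(classid & 0xffff);
}
```

With this reading, a packet whose IP ToS byte is 0x10 would be classified as 800:10 (hexadecimal major:minor) by the toy example.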
2015-03-01  ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux  Daniel Borkmann  -6/+5
is_gpl_compatible and prog_type should be moved directly into bpf_prog as they stay immutable during bpf_prog's lifetime, are core attributes and they can be locked as read-only later on via bpf_prog_select_runtime(). With a bit of rearranging, this also allows us to shrink bpf_prog_aux to exactly 1 cacheline. Signed-off-by: Daniel Borkmann <> Acked-by: Alexei Starovoitov <> Signed-off-by: David S. Miller <>
2015-03-01  ebpf: add sched_cls_type and map it to sk_filter's verifier ops  Daniel Borkmann  -2/+13
As discussed recently and at netconf/netdev01, we want to prevent making bpf_verifier_ops registration available for modules, but have them at a controlled place inside the kernel instead. The reason for this is that out-of-tree modules can go crazy and define and register any verifier ops they want, doing all sorts of crap, even bypassing available GPLed eBPF helper functions. We don't want to offer such a shiny playground, of course, but keep strict control to ourselves inside the core kernel. This also encourages us to design eBPF user helpers carefully and generically, so they can be shared among various subsystems using eBPF. For the eBPF traffic classifier (cls_bpf), it's a good start to share the same helper facilities as we currently do in eBPF for socket filters. That way, we have BPF_PROG_TYPE_SCHED_CLS look like its own type, thus one day if there's a good reason to diverge the set of helper functions from the set available to socket filters, we keep ABI compatibility. In future, we could place all bpf_prog_type_list at a central place, perhaps. Signed-off-by: Daniel Borkmann <> Signed-off-by: Alexei Starovoitov <> Signed-off-by: David S. Miller <>
2015-03-01  ebpf: constify various function pointer structs  Daniel Borkmann  -9/+9
We can move bpf_map_ops and bpf_verifier_ops and other structs into ro section, bpf_map_type_list and bpf_prog_type_list into read mostly. Signed-off-by: Daniel Borkmann <> Acked-by: Alexei Starovoitov <> Signed-off-by: David S. Miller <>
2015-03-01  ebpf: remove kernel test stubs  Daniel Borkmann  -81/+0
Now that we have BPF_PROG_TYPE_SOCKET_FILTER up and running, we can remove the test stubs which were added to get the verifier suite up. We can just let the test cases probe under socket filter type instead. In the fill/spill test case, we cannot (yet) access fields from the context (skb), but we may adapt that test case in future. Signed-off-by: Daniel Borkmann <> Acked-by: Alexei Starovoitov <> Signed-off-by: David S. Miller <>
2015-03-01  locking/rtmutex: Set state back to running on error  Sebastian Andrzej Siewior  -0/+1
The "usual" path is:

 - rt_mutex_slowlock()
 - set_current_state()
 - task_blocks_on_rt_mutex() (ret 0)
 - __rt_mutex_slowlock()
 - sleep or not but do return with __set_current_state(TASK_RUNNING)
 - back to caller.

In the early error case where task_blocks_on_rt_mutex() returns -EDEADLK we never change the task's state back to RUNNING. I assume this is intended. Without this change after ww_mutex using rt_mutex the selftest passes but later I get plenty of:

| bad: scheduling from the idle thread!

backtraces.

Signed-off-by: Sebastian Andrzej Siewior <>
Acked-by: Mike Galbraith <>
Cc: Linus Torvalds <>
Cc: Maarten Lankhorst <>
Cc: Peter Zijlstra <>
Cc: Thomas Gleixner <>
Fixes: afffc6c1805d ("locking/rtmutex: Optimize setting task running after being blocked")
Link:
Signed-off-by: Ingo Molnar <>
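The error path described in this commit can be modelled in a few lines of userspace C. This is only a sketch of the control flow, not the rtmutex code: `struct task`, `blocks_on_mutex()`, and `slowlock_model()` are hypothetical stand-ins, and the sleeping step is reduced to a state assignment.

```c
#include <errno.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

struct task {
	enum task_state state;
};

/* Hypothetical stand-in for task_blocks_on_rt_mutex(). */
static int blocks_on_mutex(int would_deadlock)
{
	return would_deadlock ? -EDEADLK : 0;
}

int slowlock_model(struct task *t, int would_deadlock)
{
	int ret;

	t->state = TASK_INTERRUPTIBLE;	/* models set_current_state() */
	ret = blocks_on_mutex(would_deadlock);

	if (ret == 0) {
		/* Usual path: the wait loop returns with the task running. */
		t->state = TASK_RUNNING;
		return 0;
	}

	/*
	 * Early error path: without this reset the task would return to
	 * its caller while still marked as sleeping.
	 */
	t->state = TASK_RUNNING;
	return ret;
}
```

In the model, both the success and the -EDEADLK path leave the task in `TASK_RUNNING`, which is the invariant the one-line fix restores.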
2015-02-28  kernel/sys.c: fix UNAME26 for 4.0  Jon DeVree  -1/+2
There's a uname workaround for broken userspace which can't handle kernel versions of 3.x. Update it for 4.x. Signed-off-by: Jon DeVree <> Cc: Andi Kleen <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
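As a rough illustration of what such a workaround computes, the sketch below maps a kernel version onto the x of a fake "2.6.x" release string. The offsets 40 (for 3.x) and 60 (for 4.x) are assumptions of this sketch and are not stated in the commit message; `uname26_sublevel()` is a hypothetical helper, not a kernel function.

```c
/*
 * Map a kernel version to the x in the "2.6.x" string reported to
 * processes that request the UNAME26 personality.
 * The offsets are assumptions for illustration only.
 */
int uname26_sublevel(unsigned int major, unsigned int minor)
{
	switch (major) {
	case 3:
		return 40 + (int)minor;	/* assumed: 3.y -> 2.6.(40 + y) */
	case 4:
		return 60 + (int)minor;	/* assumed: 4.y -> 2.6.(60 + y) */
	default:
		return -1;		/* not covered by this sketch */
	}
}
```

Under these assumptions the 4.x range starts right after the last 3.x value (3.19 maps to 59, 4.0 to 60), so the reported version keeps increasing.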
2015-02-24  Merge branch 'for-linus' of ↵  Linus Torvalds  -5/+5
git:// Pull livepatching fixes from Jiri Kosina: "Two tiny fixes for livepatching infrastructure: - extending RCU critical section to cover all accesses to RCU-protected variable, by Petr Mladek - proper format string passing to kobject_init_and_add(), by Jiri Kosina" * 'for-linus' of git:// livepatch: RCU protect struct klp_func all the time when used in klp_ftrace_handler() livepatch: fix format string in kobject_init_and_add()
2015-02-22  livepatch: RCU protect struct klp_func all the time when used in ↵  Petr Mladek  -3/+3
klp_ftrace_handler() func->new_func has been accessed after rcu_read_unlock() in klp_ftrace_handler() and therefore the access was not protected. Signed-off-by: Petr Mladek <> Acked-by: Josh Poimboeuf <> Signed-off-by: Jiri Kosina <>
2015-02-21Merge branch 'upstream' of git:// Torvalds-0/+12
Pull MIPS updates from Ralf Baechle: "This is the main pull request for MIPS: - a number of fixes that didn't make the 3.19 release. - a number of cleanups. - preliminary support for Cavium's Octeon 3 SOCs which feature up to 48 MIPS64 R3 cores with FPU and hardware virtualization. - support for MIPS R6 processors. Revision 6 of the MIPS architecture is a major revision of the MIPS architecture which does away with many of the original sins of the architecture such as branch delay slots. This and other changes in R6 require major changes throughout the entire MIPS core architecture code and make up for the lion's share of this pull request. - finally some preparatory work for eXtended Physical Address support, which allows support of up to 40 bit of physical address space on 32 bit processors" [ Ahh, MIPS can't leave the PAE brain damage alone. It's like every CPU architect has to make that mistake, but pee in the snow by changing the TLA. But whether it's called PAE, LPAE or XPA, it's horrid crud - Linus ] * 'upstream' of git:// (114 commits) MIPS: sead3: Corrected get_c0_perfcount_int MIPS: mm: Remove dead macro definitions MIPS: OCTEON: irq: add CIB and other fixes MIPS: OCTEON: Don't do acknowledge operations for level triggered irqs. MIPS: OCTEON: More OCTEONIII support MIPS: OCTEON: Remove setting of processor specific CVMCTL icache bits. MIPS: OCTEON: Core-15169 Workaround and general CVMSEG cleanup. MIPS: OCTEON: Update octeon-model.h code for new SoCs. MIPS: OCTEON: Implement DCache errata workaround for all CN6XXX MIPS: OCTEON: Add little-endian support to asm/octeon/octeon.h MIPS: OCTEON: Implement the core-16057 workaround MIPS: OCTEON: Delete unused COP2 saving code MIPS: OCTEON: Use correct instruction to read 64-bit COP0 register MIPS: OCTEON: Save and restore CP2 SHA3 state MIPS: OCTEON: Fix FP context save. 
MIPS: OCTEON: Save/Restore wider multiply registers in OCTEON III CPUs MIPS: boot: Provide more uImage options MIPS: Remove unneeded #ifdef __KERNEL__ from asm/processor.h MIPS: ip22-gio: Remove legacy suspend/resume support mips: pci: Add ifdef around pci_proc_domain ...
2015-02-21  Merge branch 'timers-urgent-for-linus' of ↵  Linus Torvalds  -3/+7
git:// Pull ntp fix from Ingo Molnar: "An adjtimex interface regression fix for 32-bit systems" [ A check that was added in a previous commit is really only a concern for 64bit systems, but was applied to both 32 and 64bit systems, which results in breaking 32bit systems. Thus the fix here is to make the check only apply to 64bit systems ] * 'timers-urgent-for-linus' of git:// ntp: Fixup adjtimex freq validation on 32-bit systems
2015-02-21  Merge branch 'locking-urgent-for-linus' of ↵  Linus Torvalds  -1/+2
git:// Pull locking fixes from Ingo Molnar: "Two fixes: the paravirt spin_unlock() corruption/crash fix, and an rtmutex NULL dereference crash fix" * 'locking-urgent-for-linus' of git:// x86/spinlocks/paravirt: Fix memory corruption on unlock locking/rtmutex: Avoid a NULL pointer dereference on deadlock
2015-02-21  Merge branch 'sched-urgent-for-linus' of ↵  Linus Torvalds  -99/+148
git:// Pull scheduler fixes from Ingo Molnar: "This contains misc fixes: preempt_schedule_common() and io_schedule() recursion fixes, sched/dl fixes, a completion_done() revert, two sched/rt fixes and a comment update patch" * 'sched-urgent-for-linus' of git:// sched/rt: Avoid obvious configuration fail sched/autogroup: Fix failure to set cpu.rt_runtime_us sched/dl: Do update_rq_clock() in yield_task_dl() sched: Prevent recursion in io_schedule() sched/completion: Serialize completion_done() with complete() sched: Fix preempt_schedule_common() triggering tracing recursion sched/dl: Prevent enqueue of a sleeping task in dl_task_timer() sched: Make dl_task_time() use task_rq_lock() sched: Clarify ordering between task_rq_lock() and move_queued_task()
2015-02-21  Merge branches 'core-urgent-for-linus' and 'irq-urgent-for-linus' of ↵  Linus Torvalds  -0/+1
git:// Pull rcu fix and x86 irq fix from Ingo Molnar: - Fix a bug that caused an RCU warning splat. - Two x86 irq related fixes: a hotplug crash fix and an ACPI IRQ registry fix. * 'core-urgent-for-linus' of git:// rcu: Clear need_qs flag to prevent splat * 'irq-urgent-for-linus' of git:// x86/irq: Check for valid irq descriptor in check_irq_vectors_for_cpu_disable() x86/irq: Fix regression caused by commit b568b8601f05
2015-02-20  Merge tag 'for_linux-3.20-rc1' of ↵  Linus Torvalds  -23/+64
git:// Pull kgdb/kdb updates from Jason Wessel: "KGDB/KDB New: - KDB: improved searching - No longer enter debug core on panic if panic timeout is set KGDB/KDB regressions / cleanups - fix pdf doc build errors - prevent junk characters on kdb console from printk levels" * tag 'for_linux-3.20-rc1' of git:// kgdb, docs: Fix <para> pdfdocs build errors debug: prevent entering debug mode on panic/exception. kdb: Const qualifier for kdb_getstr's prompt argument kdb: Provide forward search at more prompt kdb: Fix a prompt management bug when using | grep kdb: Remove stack dump when entering kgdb due to NMI kdb: Avoid printing KERN_ levels to consoles kdb: Fix off by one error in kdb_cpu() kdb: fix incorrect counts in KDB summary command output
2015-02-19  debug: prevent entering debug mode on panic/exception.  Colin Cross  -0/+17
On non-developer devices, kgdb prevents the device from rebooting after a panic. In case of panics and exceptions, to allow the device to reboot, prevent entering debug mode to avoid getting stuck waiting for the user to interact with the debugger. To avoid entering the debugger on panic/exception without any extra configuration, panic_timeout is being used, which can be set via /proc/sys/kernel/panic at run time, and CONFIG_PANIC_TIMEOUT sets the default value. Setting panic_timeout indicates that the user requested the machine to perform an unattended reboot after panic. We don't want to get stuck waiting for user input in case of panic. Cc: Andrew Morton <> Cc: Cc: Cc: Android Kernel Team <> Cc: John Stultz <> Cc: Sumit Semwal <> Signed-off-by: Colin Cross <> [Kiran: Added context to commit message. panic_timeout is used instead of break_on_panic and break_on_exception to honor CONFIG_PANIC_TIMEOUT Modified the commit as per community feedback] Signed-off-by: Kiran Raparthy <> Signed-off-by: Daniel Thompson <> Signed-off-by: Jason Wessel <>
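The decision this commit describes can be reduced to a single predicate. The sketch below is an illustration under the assumption that any non-zero panic_timeout counts as "the user asked for an unattended reboot"; `should_enter_debugger_on_panic()` is a hypothetical name, not a kgdb function.

```c
#include <stdbool.h>

/*
 * Decide whether a panic should drop into the debugger.
 * A non-zero panic_timeout is taken to mean that the machine is expected
 * to reboot unattended, so the debugger must not block waiting for input.
 */
bool should_enter_debugger_on_panic(int panic_timeout)
{
	return panic_timeout == 0;
}
```

A device configured with a timeout of, say, 5 seconds would therefore skip the debugger and reboot, while a developer machine with the timeout left at 0 still stops in kgdb.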
2015-02-19  kdb: Const qualifier for kdb_getstr's prompt argument  Daniel Thompson  -2/+2
All current callers of kdb_getstr() can pass constant pointers via the prompt argument. This patch adds a const qualification to make explicit the fact that this is safe. Signed-off-by: Daniel Thompson <> Signed-off-by: Jason Wessel <>
2015-02-19  kdb: Provide forward search at more prompt  Daniel Thompson  -5/+26
Currently kdb allows the output of commands to be filtered using the | grep feature. This is useful but does not permit the output emitted shortly after a string match to be examined without wading through the entire unfiltered output of the command. Such a feature is particularly useful to navigate function traces because these traces often have a useful trigger string *before* the point of interest. This patch reuses the existing filtering logic to introduce a simple forward search to kdb that can be triggered from the more prompt. Signed-off-by: Daniel Thompson <> Signed-off-by: Jason Wessel <>
2015-02-19  kdb: Fix a prompt management bug when using | grep  Daniel Thompson  -2/+2
Currently when the "| grep" feature is used to filter the output of a command then the prompt is not displayed for the subsequent command. Likewise any characters typed by the user are also not echoed to the display. This rather disconcerting problem eventually corrects itself when the user presses Enter and the kdb_grepping_flag is cleared as kdb_parse() tries to make sense of whatever they typed. This patch resolves the problem by moving the clearing of this flag from the middle of command processing to the beginning. Signed-off-by: Daniel Thompson <> Signed-off-by: Jason Wessel <>
2015-02-19  kdb: Remove stack dump when entering kgdb due to NMI  Daniel Thompson  -1/+0
Issuing a stack dump feels ergonomically wrong when entering due to NMI. Entering due to NMI is normally a reaction to a user request, either the NMI button on a server or a "magic knock" on a UART. Therefore the backtrace behaviour on entry due to NMI should be like SysRq-g (no stack dump) rather than like oops. Note also that the stack dump does not offer any information that cannot be trivially retrieved using the 'bt' command. Signed-off-by: Daniel Thompson <> Signed-off-by: Jason Wessel <>
2015-02-19  kdb: Avoid printing KERN_ levels to consoles  Daniel Thompson  -10/+14
Currently when kdb traps printk messages then the raw log level prefix (consisting of '\001' followed by a numeral) does not get stripped off before the message is issued to the various I/O handlers supported by kdb. This causes annoying visual noise as well as causing problems grepping for ^. It is also a change of behaviour compared to normal usage of printk(). For example <SysRq>-h ends up with different output to that of kdb's "sr h". This patch addresses the problem by stripping log levels from messages before they are issued to the I/O handlers. printk() which can also act as an i/o handler in some cases is special cased; if the caller provided a log level then the prefix will be preserved when sent to printk(). The addition of non-printable characters to the output of kdb commands is a regression, albeit an extremely elderly one, introduced by commit 04d2c8c83d0e ("printk: convert the format for KERN_<LEVEL> to a 2 byte pattern"). Note also that this patch does *not* restore the original behaviour from v3.5. Instead it makes printk() from within a kdb command display the message without any prefix (i.e. like printk() normally does). Signed-off-by: Daniel Thompson <> Cc: Joe Perches <> Cc: Signed-off-by: Jason Wessel <>
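The prefix stripping described here can be sketched as a small helper. This is a simplified illustration based only on the description above (a '\001' byte followed by a numeral); `strip_log_level()` is a hypothetical name, and any other prefix forms the kernel may use are deliberately ignored.

```c
#include <ctype.h>

#define LOG_PREFIX_BYTE '\001'	/* start-of-header byte marking a log level */

/*
 * Return a pointer past a two-byte log level prefix, if one is present.
 * Messages without a prefix are returned unchanged.
 */
const char *strip_log_level(const char *msg)
{
	if (msg[0] == LOG_PREFIX_BYTE && isdigit((unsigned char)msg[1]))
		return msg + 2;
	return msg;
}
```

Passing the result to a console handler instead of the raw message avoids emitting the non-printable byte, while a caller that wants to keep the level can still forward the original pointer.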
2015-02-19  kdb: Fix off by one error in kdb_cpu()  Jason Wessel  -2/+2
There was a follow on replacement patch against the prior "kgdb: Timeout if secondary CPUs ignore the roundup". See: This patch is the delta vs the patch that was committed upstream: * Fix an off-by-one error in kdb_cpu(). * Replace NR_CPUS with CONFIG_NR_CPUS to tell checkpatch that we really want a static limit. * Removed the "KGDB: " prefix from the pr_crit() in debug_core.c (kgdb-next contains a patch which introduced pr_fmt() to this file so the tag will now be applied automatically). Cc: Daniel Thompson <> Cc: <> Signed-off-by: Jason Wessel <>
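The commit message does not show the faulty comparison, so the following is only a generic sketch of the kind of off-by-one a CPU-number check can contain. `MODEL_NR_CPUS` and `cpu_index_valid()` are hypothetical and the limit of 8 is arbitrary.

```c
#include <stdbool.h>

#define MODEL_NR_CPUS 8	/* hypothetical static limit */

/*
 * Valid indices into an array of MODEL_NR_CPUS entries are
 * 0 .. MODEL_NR_CPUS - 1. Rejecting only values greater than the limit
 * would let an index equal to the limit through, one element past the
 * end of the array.
 */
bool cpu_index_valid(unsigned long cpunum)
{
	return cpunum < MODEL_NR_CPUS;
}
```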
2015-02-19  kdb: fix incorrect counts in KDB summary command output  Jay Lan  -1/+1
The output of the KDB 'summary' command should report MemTotal, MemFree and Buffers in kB. The current code reports them in units of pages. A K(x) macro is defined in the code, but not used. This patch applies the macro to convert the values to kB. Please include me on Cc on replies. I do not subscribe to linux-kernel. Signed-off-by: Jay Lan <> Cc: <> Signed-off-by: Jason Wessel <>
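A pages-to-kB conversion of the kind this commit refers to can be sketched as below. The macro body and the 4 KiB page size (`MODEL_PAGE_SHIFT` of 12) are assumptions for illustration; the commit message only states that a K(x) macro exists.

```c
#define MODEL_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Convert a page count to kB: one page is 2^(PAGE_SHIFT - 10) kB. */
#define K(x) ((x) << (MODEL_PAGE_SHIFT - 10))

unsigned long pages_to_kb(unsigned long pages)
{
	return K(pages);
}
```

With 4 KiB pages, a value of 256 pages is reported as 1024 kB rather than as the raw page count.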
2015-02-19  Merge branch 'kbuild' of ↵  Linus Torvalds  -31/+5
git:// Pull kbuild updates from Michal Marek: - several cleanups in kbuild - serialize multiple *config targets so that 'make defconfig kvmconfig' works - The cc-ifversion macro got support for an else-branch * 'kbuild' of git:// kbuild,gcov: simplify kernel/gcov/Makefile more kbuild: allow cc-ifversion to have the argument for false condition kbuild,gcov: simplify kernel/gcov/Makefile kbuild,gcov: remove unnecessary workaround kbuild: do not add $(call ...) to invoke cc-version or cc-fullversion kbuild: fix cc-ifversion macro kbuild: drop $(version_h) from MRPROPER_FILES kbuild: use mixed-targets when two or more config targets are given kbuild: remove redundant line from bounds.h/asm-offsets.h kbuild: merge bounds.h and asm-offsets.h rules kbuild: Drop support for clean-rule
2015-02-18  Merge branch 'rcu/next' of ↵  Ingo Molnar  -0/+1
git:// into core/urgent Pull RCU fix from Paul E. McKenney. Signed-off-by: Ingo Molnar <>
2015-02-18  sched/rt: Avoid obvious configuration fail  Peter Zijlstra  -3/+11
Setting the root group's cpu.rt_runtime_us to 0 is a bad thing; it would disallow the kernel creating RT tasks. One can of course still set it to 1, which will (likely) still wreck your kernel, but at least make it clear that setting it to 0 is not good. Collect both sanity checks into the one place while we're there. Suggested-by: Zefan Li <> Signed-off-by: Peter Zijlstra (Intel) <> Cc: Linus Torvalds <> Link: Signed-off-by: Ingo Molnar <>
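The sanity check described here amounts to rejecting one specific value for one specific group. A minimal sketch, with a hypothetical function name and with -EINVAL chosen as an assumed error code:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Validate a new cpu.rt_runtime_us value.
 * A runtime of 0 for the root group would leave no room for any
 * real-time task, including the kernel's own, so it is refused.
 */
int check_rt_runtime(bool is_root_group, int64_t rt_runtime_us)
{
	if (is_root_group && rt_runtime_us == 0)
		return -EINVAL;
	return 0;
}
```

As the commit message notes, a value of 1 still passes such a check even though it is unlikely to be usable; the check only catches the obvious misconfiguration.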
2015-02-18  sched/autogroup: Fix failure to set cpu.rt_runtime_us  Peter Zijlstra  -5/+7
Because task_group() uses a cache of autogroup_task_group(), whose output depends on sched_class, switching classes can generate problems. In particular, when started as fair, the cache points to the autogroup, so when switching to RT the tg_rt_schedulable() test fails for every cpu.rt_{runtime,period}_us change because now the autogroup has tasks and no runtime. Furthermore, going back to the previous semantics of varying task_group() with sched_class has the down-side that the sched_debug output varies as well, even though the task really is in the autogroup. Therefore add an autogroup exception to tg_has_rt_tasks() -- such that both (all) task_group() usages in sched/core now have one. And remove all the remnants of the variable task_group() output. Reported-by: Zefan Li <> Signed-off-by: Peter Zijlstra (Intel) <> Cc: Linus Torvalds <> Cc: Mike Galbraith <> Cc: Stefan Bader <> Fixes: 8323f26ce342 ("sched: Fix race in task_group()") Link: Signed-off-by: Ingo Molnar <>
2015-02-18  sched/dl: Do update_rq_clock() in yield_task_dl()  Kirill Tkhai  -0/+1
update_curr_dl() needs actual rq clock. Signed-off-by: Kirill Tkhai <> Signed-off-by: Peter Zijlstra (Intel) <> Cc: Linus Torvalds <> Link: Signed-off-by: Ingo Molnar <>
2015-02-18  ntp: Fixup adjtimex freq validation on 32-bit systems  John Stultz  -3/+7
Additional validation of adjtimex freq values to avoid potential multiplication overflows was added in commit 5e5aeb4367b (time: adjtimex: Validate the ADJ_FREQUENCY values). Unfortunately the patch used LONG_MAX/MIN instead of LLONG_MAX/MIN, which was fine on 64-bit systems, but being much smaller on 32-bit systems caused false positives, so most direct frequency adjustments failed w/ EINVAL. ntpd only does direct frequency adjustments at startup, so the issue was not as easily observed there, but other time sync applications like ptpd and chrony were more affected by the bug. See bugs: This patch changes the checks to use LLONG_MAX for clarity, and additionally the checks are disabled on 32-bit systems since LLONG_MAX/PPM_SCALE is always larger than the 32-bit long freq value, so multiplication overflows aren't possible there. Reported-by: Josh Boyer <> Reported-by: George Joseph <> Tested-by: George Joseph <> Signed-off-by: John Stultz <> Signed-off-by: Peter Zijlstra (Intel) <> Cc: <> # v3.19+ Cc: Linus Torvalds <> Cc: Sasha Levin <> Link: [ Prettified the changelog and the comments a bit. ] Signed-off-by: Ingo Molnar <>
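The corrected check can be sketched in userspace as follows. The value of `PPM_SCALE` used here (65536000) is an assumption of this sketch, `freq_is_valid()` is a hypothetical name, and the comparison of `sizeof(long)` against `sizeof(long long)` stands in for the kernel's notion of a 64-bit system.

```c
#include <limits.h>

#define PPM_SCALE 65536000LL	/* assumed scale factor, for illustration */

/*
 * Return 1 if freq can be multiplied by PPM_SCALE without overflowing a
 * 64-bit signed value, 0 otherwise. On systems where long is narrower
 * than long long the product always fits, so no check is needed.
 */
int freq_is_valid(long freq)
{
	if (sizeof(long) == sizeof(long long)) {
		if (LLONG_MIN / PPM_SCALE > freq)
			return 0;
		if (LLONG_MAX / PPM_SCALE < freq)
			return 0;
	}
	return 1;
}
```

Comparing against `LONG_MAX / PPM_SCALE` instead, as the earlier commit did, gives the same limit when long is 64 bits wide but a far smaller one when long is 32 bits wide, which is how ordinary frequency values came to be rejected.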
2015-02-18  sched: Prevent recursion in io_schedule()  NeilBrown  -19/+12
io_schedule() calls blk_flush_plug() which, depending on the contents of current->plug, can initiate arbitrary blk-io requests. Note that this contrasts with blk_schedule_flush_plug() which requires all non-trivial work to be handed off to a separate thread.

This makes it possible for io_schedule() to recurse, and initiating block requests could possibly call mempool_alloc() which, in times of memory pressure, uses io_schedule().

Apart from any stack usage issues, io_schedule() will not behave correctly when called recursively as delayacct_blkio_start() does not allow for repeated calls.

So:
 - use ->in_iowait to detect recursion. Set it earlier, and restore it to the old value.
 - move the call to "raw_rq" after the call to blk_flush_plug(). As this is some sort of per-cpu thing, we want some chance that we are on the right CPU
 - When io_schedule() is called recursively, use blk_schedule_flush_plug() which cannot further recurse.
 - as this makes io_schedule() a lot more complex and as io_schedule() must match io_schedule_timeout(), put all the changes in io_schedule_timeout() and make io_schedule a simple wrapper for that.

Signed-off-by: NeilBrown <>
Signed-off-by: Peter Zijlstra (Intel) <>
[ Moved the now rudimentary io_schedule() into sched.h. ]
Cc: Jens Axboe <>
Cc: Linus Torvalds <>
Cc: Tony Battersby <>
Link:
Signed-off-by: Ingo Molnar <>
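The recursion guard listed in this commit can be modelled with a flag that is saved, set, and restored around the flush. This is a toy model, not the scheduler code: `struct task_model` and `io_schedule_model()` are hypothetical, the two counters merely record which flush variant was chosen, and the `depth` argument simulates a flush that re-enters the function.

```c
struct task_model {
	int in_iowait;		/* set while the task is inside the I/O wait path */
	int direct_flushes;	/* flushes that may issue block requests directly */
	int deferred_flushes;	/* flushes handed off, which cannot recurse */
};

void io_schedule_model(struct task_model *t, int depth)
{
	int old_iowait = t->in_iowait;

	t->in_iowait = 1;
	if (old_iowait) {
		/* Nested call: hand the work off instead of recursing further. */
		t->deferred_flushes++;
	} else {
		t->direct_flushes++;
		/* Simulate a flush that ends up calling back into us. */
		if (depth > 0)
			io_schedule_model(t, depth - 1);
	}
	t->in_iowait = old_iowait;	/* restore, so nesting unwinds cleanly */
}
```

However deep the simulated re-entry is asked to go, the model performs one direct flush and at most one deferred flush, and the flag is back at its original value on return.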
2015-02-18  sched/completion: Serialize completion_done() with complete()  Oleg Nesterov  -2/+17
Commit de30ec47302c "Remove unnecessary ->wait.lock serialization when reading completion state" was not correct, without lock/unlock the code like stop_machine_from_inactive_cpu()

  while (!completion_done())
    cpu_relax();

can return before complete() finishes its spin_unlock() which writes to this memory. And spin_unlock_wait().

While at it, change try_wait_for_completion() to use READ_ONCE().

Reported-by: Paul E. McKenney <>
Reported-by: Davidlohr Bueso <>
Tested-by: Paul E. McKenney <>
Signed-off-by: Oleg Nesterov <>
Signed-off-by: Peter Zijlstra (Intel) <>
[ Added a comment with the barrier. ]
Cc: Linus Torvalds <>
Cc: Nicholas Mc Guire <>
Cc:
Cc:
Fixes: de30ec47302c ("sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state")
Link:
Signed-off-by: Ingo Molnar <>
2015-02-18  sched: Fix preempt_schedule_common() triggering tracing recursion  Frederic Weisbecker  -1/+1
Since the function graph tracer needs to disable preemption, it might call preempt_schedule() after reenabling it if something triggered the need for rescheduling in between. Therefore we can't trace preempt_schedule() itself because we would face a function tracing recursion otherwise as the tracer is always called before PREEMPT_ACTIVE gets set to prevent that recursion. This is why preempt_schedule() is tagged as "notrace". But the same issue applies to every function called by preempt_schedule() before PREEMPT_ACTIVE is actually set. And preempt_schedule_common() is one such example. Unfortunately we forgot to tag it as notrace as well and as a result we are encountering tracing recursion since it got introduced by: a18b5d0181923 ("sched: Fix missing preemption opportunity") Let's fix that by applying the appropriate function tag to preempt_schedule_common(). Reported-by: Huang Ying <> Signed-off-by: Frederic Weisbecker <> Signed-off-by: Peter Zijlstra (Intel) <> Acked-by: Steven Rostedt <> Cc: Linus Torvalds <> Link: Signed-off-by: Ingo Molnar <>
2015-02-18  sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()  Kirill Tkhai  -0/+20
A deadline task may be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep:

  current->state = TASK_INTERRUPTIBLE;
  schedule()
    deactivate_task()
      dequeue_task_dl()
        update_curr_dl()
          start_dl_timer()
        __dequeue_task_dl()
    prev->on_rq = 0;

Later the timer fires, but the task is still dequeued:

  dl_task_timer()
    enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */

Someone wakes it up:

  try_to_wake_up()
    enqueue_dl_entity()
      BUG_ON(on_dl_rq())

Patch fixes this problem, it prevents queueing !on_rq tasks on dl_rq.

Reported-by: Fengguang Wu <>
Signed-off-by: Kirill Tkhai <>
Signed-off-by: Peter Zijlstra (Intel) <>
[ Wrote comment. ]
Cc: Juri Lelli <>
Fixes: 1019a359d3dc ("sched/deadline: Fix stale yield state")
Link:
Signed-off-by: Ingo Molnar <>
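The sequence in this commit message (timer fires on a sleeping task, then a wakeup finds it already queued) can be replayed with a two-flag toy model. Everything here is hypothetical: `struct dl_task_model`, `dl_timer_fire()`, and `wake_up_model()` only mirror the roles named in the message and say nothing about how the actual patch is written.

```c
#include <stdbool.h>

struct dl_task_model {
	bool on_rq;	/* task is runnable from the core scheduler's view */
	bool on_dl_rq;	/* task is queued on the deadline runqueue */
};

/*
 * Timer handler with the guard: a task that is not on_rq is asleep, so
 * it is left alone and the later wakeup does the enqueueing.
 * Returns true if the task was queued by the timer.
 */
bool dl_timer_fire(struct dl_task_model *p)
{
	if (!p->on_rq)
		return false;
	p->on_dl_rq = true;
	return true;
}

/*
 * Wakeup path: returns false if the task was unexpectedly already on
 * the deadline runqueue (the condition the BUG_ON above would catch).
 */
bool wake_up_model(struct dl_task_model *p)
{
	if (p->on_dl_rq)
		return false;
	p->on_rq = true;
	p->on_dl_rq = true;
	return true;
}
```

Replaying the scenario with a sleeping task, the timer leaves it untouched and the subsequent wakeup succeeds; without the guard in `dl_timer_fire()`, the wakeup would find the task already queued.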
2015-02-18  sched: Make dl_task_time() use task_rq_lock()  Peter Zijlstra  -85/+79
Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep:

  current->state = TASK_INTERRUPTIBLE;
  schedule()
    deactivate_task()
      dequeue_task_dl()
        update_curr_dl()
          start_dl_timer()
        __dequeue_task_dl()
    prev->on_rq = 0;

This invalidates the assumption from commit 0f397f2c90ce ("sched/dl: Fix race in dl_task_timer()"):

  "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races".

And therefore we have to use the full task_rq_lock() here.

This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state").

Reported-by: Kirill Tkhai <>
Signed-off-by: Peter Zijlstra (Intel) <>
Cc: Juri Lelli <>
Link:
Signed-off-by: Ingo Molnar <>
2015-02-18  sched: Clarify ordering between task_rq_lock() and move_queued_task()  Peter Zijlstra  -0/+16
There was a wee bit of confusion around the exact ordering here; clarify things. Reported-by: Kirill Tkhai <> Signed-off-by: Peter Zijlstra (Intel) <> Cc: Linus Torvalds <> Cc: Oleg Nesterov <> Cc: Paul E. McKenney <> Link: Signed-off-by: Ingo Molnar <>
2015-02-18  locking/rtmutex: Avoid a NULL pointer dereference on deadlock  Sebastian Andrzej Siewior  -1/+2
With task_blocks_on_rt_mutex() returning early -EDEADLK we never add the waiter to the waitqueue. Later, we try to remove it via remove_waiter() and go boom in rt_mutex_top_waiter() because rb_entry() gives a NULL pointer. ( Tested on v3.18-RT where rtmutex is used for regular mutex and I tried to get one twice in a row. ) Not sure when this started but I guess 397335f004f4 ("rtmutex: Fix deadlock detector for real") or commit 3d5c9340d194 ("rtmutex: Handle deadlock detection smarter"). Signed-off-by: Sebastian Andrzej Siewior <> Acked-by: Peter Zijlstra <> Cc: Thomas Gleixner <> Cc: <> # for v3.16 and later kernels Link: Signed-off-by: Ingo Molnar <>
2015-02-17  Merge branch 'getname2' of ↵  Linus Torvalds  -151/+37
git:// Pull getname/putname updates from Al Viro: "Rework of getname/getname_kernel/etc., mostly from Paul Moore. Gets rid of quite a pile of kludges between namei and audit..." * 'getname2' of git:// audit: replace getname()/putname() hacks with reference counters audit: fix filename matching in __audit_inode() and __audit_inode_child() audit: enable filename recording via getname_kernel() simpler calling conventions for filename_mountpoint() fs: create proper filename objects using getname_kernel() fs: rework getname_kernel to handle up to PATH_MAX sized filenames cut down the number of do_path_lookup() callers
2015-02-17Merge branch 'for-linus' of ↵Linus Torvalds-52/+47
git:// Pull misc VFS updates from Al Viro: "This cycle a lot of stuff sits on topical branches, so I'll be sending more or less one pull request per branch. This is the first pile; more to follow in a few. In this one are several misc commits from early in the cycle (before I went for separate branches), plus the rework of mntput/dput ordering on umount, switching to use of fs_pin instead of convoluted games in namespace_unlock()" * 'for-linus' of git:// switch the IO-triggering parts of umount to fs_pin new fs_pin killing logics allow attaching fs_pin to a group not associated with some superblock get rid of the second argument of acct_kill() take count and rcu_head out of fs_pin dcache: let the dentry count go down to zero without taking d_lock pull bumping refcount into ->kill() kill pin_put() mode_t whack-a-mole: chelsio file->f_path.dentry is pinned down for as long as the file is open... get rid of lustre_dump_dentry() gut proc_register() a bit kill d_validate() ncpfs: get rid of d_validate() nonsense selinuxfs: don't open-code d_genocide()
2015-02-17Merge branch 'akpm' (patches from Andrew)Linus Torvalds-18/+23
Merge yet more updates from Andrew Morton: - a pile of minor fs fixes and cleanups - kexec updates - random misc fixes in various places: vmcore, rbtree, eventfd, ipc, seccomp. - a series of python-based kgdb helper scripts * emailed patches from Andrew Morton <>: (58 commits) seccomp: cap SECCOMP_RET_ERRNO data to MAX_ERRNO samples/seccomp: improve label helper ipc,sem: use current->state helpers scripts/gdb: disable pagination while printing from breakpoint handler scripts/gdb: define maintainer scripts/gdb: convert CpuList to generator function scripts/gdb: convert ModuleList to generator function scripts/gdb: use a generator instead of iterator for task list scripts/gdb: ignore byte-compiled python files scripts/gdb: port to python3 / gdb7.7 scripts/gdb: add basic documentation scripts/gdb: add lx-lsmod command scripts/gdb: add class to iterate over CPU masks scripts/gdb: add lx_current convenience function scripts/gdb: add internal helper and convenience function for per-cpu lookup scripts/gdb: add get_gdbserver_type helper scripts/gdb: add internal helper and convenience function to retrieve thread_info scripts/gdb: add is_target_arch helper scripts/gdb: add helper and convenience function to look up tasks scripts/gdb: add task iteration class ...
2015-02-17seccomp: cap SECCOMP_RET_ERRNO data to MAX_ERRNOKees Cook-1/+3
The value resulting from the SECCOMP_RET_DATA mask could exceed MAX_ERRNO when setting errno during a SECCOMP_RET_ERRNO filter action. This makes sure we have a reliable value being set, so that an invalid errno will not be ignored by userspace. Signed-off-by: Kees Cook <> Reported-by: Dmitry V. Levin <> Cc: Andy Lutomirski <> Cc: Will Drewry <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
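A minimal userspace sketch of the capping described above. The helper name `sketch_seccomp_errno()` is hypothetical; the two constants are local copies whose values mirror the kernel's SECCOMP_RET_DATA (0x0000ffff) and MAX_ERRNO (4095), redefined here so the snippet builds without kernel headers.

```c
#include <stdint.h>

/* Local copies so the sketch is standalone. */
#define SKETCH_SECCOMP_RET_DATA 0x0000ffffU /* low 16 bits of the filter result */
#define SKETCH_MAX_ERRNO        4095U       /* largest errno value that can be returned */

/* Hypothetical helper: derive the errno to report from a filter return value. */
uint32_t sketch_seccomp_errno(uint32_t filter_ret)
{
    uint32_t data = filter_ret & SKETCH_SECCOMP_RET_DATA;

    /* The mask allows up to 65535; cap it so userspace sees a valid errno. */
    if (data > SKETCH_MAX_ERRNO)
        data = SKETCH_MAX_ERRNO;
    return data;
}
```

For example, a filter result of `0x0005ffff` (SECCOMP_RET_ERRNO with all data bits set) yields 4095 in this sketch instead of 65535.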
2015-02-17kernel/module.c: do not inline do_init_module()Jan Kiszka-2/+7
This provides a reliable breakpoint target, required for automatic symbol loading via the gdb helper command 'lx-symbols'. Signed-off-by: Jan Kiszka <> Acked-by: Rusty Russell <> Cc: Thomas Gleixner <> Cc: Jason Wessel <> Cc: Andi Kleen <> Cc: Ben Widawsky <> Cc: Borislav Petkov <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
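Why inlining matters for a breakpoint target can be shown with a toy example. The function names below are hypothetical stand-ins; the attribute is the GCC/Clang spelling, which the kernel wraps in its `noinline` macro.

```c
/* GCC/Clang attribute; the kernel spells this as the `noinline` macro. */
#define SKETCH_NOINLINE __attribute__((noinline))

/*
 * Hypothetical helper with a single caller. Without the attribute, an
 * optimizing compiler may fold it into the caller, leaving no separate
 * function entry for a debugger to break on.
 */
static SKETCH_NOINLINE int sketch_do_init(int arg)
{
    return arg + 1;
}

/* Hypothetical caller, standing in for the module-loading path. */
int sketch_load(int arg)
{
    return sketch_do_init(arg);
}
```

With the attribute in place, a debugger breakpoint set on `sketch_do_init` should resolve to a real function entry even in an optimized build.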
2015-02-17kexec: simplify conditionalGeoff Levand-7/+10
Simplify the code around one of the conditionals in the kexec_load syscall routine. The original code was confusing with a redundant check on KEXEC_ON_CRASH and comments outside of the conditional block. This change switches the order of the conditional check, and cleans up the comments for the conditional. There is no functional change to the code. Signed-off-by: Geoff Levand <> Acked-by: Vivek Goyal <> Cc: Arnd Bergmann <> Cc: Benjamin Herrenschmidt <> Cc: H. Peter Anvin <> Cc: Maximilian Attems <> Cc: Michal Marek <> Cc: Paul Bolle <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2015-02-17kexec: fix a typo in commentAlexander Kuleshov-1/+1
Signed-off-by: Alexander Kuleshov <> Acked-by: "Eric W. Biederman" <> Acked-by: Vivek Goyal <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2015-02-17kexec: remove never used member destination in kimageBaoquan He-4/+0
struct kimage has a member destination which is used to store the real destination address of each page when loading a segment from a user space buffer into the kernel. But we never retrieve the value stored in kimage->destination, so this member variable in kimage and its assignment operation are redundant code. I guess for_each_kimage_entry just does the work that kimage->destination is expected to do. So this patch just makes a cleanup to remove it. Signed-off-by: Baoquan He <> Cc: "Eric W. Biederman" <> Cc: Vivek Goyal <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2015-02-17signal: use current->state helpersDavidlohr Bueso-2/+2
Call __set_current_state() instead of assigning the new state directly. These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments, keeping track of who changed the state. Signed-off-by: Davidlohr Bueso <> Acked-by: Oleg Nesterov <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
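The benefit of routing state changes through a helper, rather than assigning the field directly, can be sketched as follows. This is an illustration only: the struct, field names, and macro are hypothetical, and the sketch records the source file and line as a stand-in for whatever the kernel's debug build tracks about the caller.

```c
/* Hypothetical task structure with debug bookkeeping for state changes. */
struct sketch_task {
    long state;
    const char *change_file;   /* where the state was last changed */
    int change_line;
};

/*
 * Hypothetical helper macro: every state change goes through one place,
 * so the location of the last change can be recorded for debugging.
 */
#define sketch_set_state(t, value)            \
    do {                                      \
        (t)->change_file = __FILE__;          \
        (t)->change_line = __LINE__;          \
        (t)->state = (value);                 \
    } while (0)
```

A direct assignment such as `t.state = 1;` would update the state but leave `change_file` and `change_line` stale, which is the kind of information the helper-based interface preserves.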
2015-02-17ptrace: remove linux/compat.h inclusion under CONFIG_COMPATFabian Frederick-1/+0
Commit 84c751bd4aeb ("ptrace: add ability to retrieve signals without removing from a queue (v4)") includes <linux/compat.h> globally in ptrace.c. This patch removes the inclusion under "if defined CONFIG_COMPAT". Signed-off-by: Fabian Frederick <> Acked-by: Oleg Nesterov <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2015-02-17Merge tag 'suspend-to-idle-3.20-rc1' of ↵Linus Torvalds-17/+142
git:// Pull suspend-to-idle updates from Rafael Wysocki: "Suspend-to-idle timer quiescing support for v3.20-rc1 Until now suspend-to-idle has not been able to save much more energy than runtime PM because of timer interrupts that periodically bring CPUs out of idle while they are waiting for a wakeup interrupt. Of course, the timer interrupts are not wakeup ones, so the handling of them can be deferred until a real wakeup interrupt happens, but at the same time we don't want to mass-expire timers at that point. The solution is to suspend the entire timekeeping when the last CPU is entering an idle state and resume it when the first CPU goes out of idle. That has to be done with care, though, so as to avoid accessing suspended clocksources etc., and we need extra support from idle drivers for that. This series of commits adds support for quiescing timers during suspend-to-idle and adds the requisite callbacks to intel_idle and the ACPI cpuidle driver" * tag 'suspend-to-idle-3.20-rc1' of git:// ACPI / idle: Implement ->enter_freeze callback routine intel_idle: Add ->enter_freeze callbacks PM / sleep: Make it possible to quiesce timers during suspend-to-idle timekeeping: Make it safe to use the fast timekeeper while suspended timekeeping: Pass readout base to update_fast_timekeeper() PM / sleep: Re-implement suspend-to-idle handling
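The "last CPU in suspends timekeeping, first CPU out resumes it" rule from the pull request text can be modeled with a simple counter. This is a single-threaded toy under stated assumptions: all names are hypothetical, and the locking, clocksource handling, and idle-driver callbacks that the real series needs are left out.

```c
#include <stdbool.h>

/* Hypothetical model of suspend-to-idle bookkeeping. */
struct sketch_s2idle {
    int nr_cpus;                 /* CPUs taking part in suspend-to-idle */
    int nr_frozen;               /* CPUs currently in the frozen idle state */
    bool timekeeping_suspended;
};

/* Called when a CPU enters the frozen idle state. */
void sketch_enter_freeze(struct sketch_s2idle *s)
{
    if (++s->nr_frozen == s->nr_cpus)
        s->timekeeping_suspended = true;   /* last CPU in: stop timekeeping */
}

/* Called when a CPU leaves the frozen idle state on a wakeup interrupt. */
void sketch_exit_freeze(struct sketch_s2idle *s)
{
    if (s->nr_frozen-- == s->nr_cpus)
        s->timekeeping_suspended = false;  /* first CPU out: resume it */
}
```

With two CPUs, timekeeping in this model stays active after the first `sketch_enter_freeze()` call, is suspended after the second, and is resumed as soon as either CPU calls `sketch_exit_freeze()`.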