path: root/Documentation/s390
diff options
Diffstat (limited to 'Documentation/s390')
12 files changed, 4907 insertions, 0 deletions
diff --git a/Documentation/s390/3270.ChangeLog b/Documentation/s390/3270.ChangeLog
new file mode 100644
index 000000000000..031c36081946
--- /dev/null
+++ b/Documentation/s390/3270.ChangeLog
@@ -0,0 +1,44 @@
+ChangeLog for the UTS Global 3270-support patch
+Sep 2002: Get bootup colors right on 3270 console
+ * In tubttybld.c, substantially revise ESC processing so that
+ ESC sequences (especially coloring ones) and the strings
+ they affect work as right as 3270 can get them. Also, set
+ screen height to omit the two rows used for input area, in
+ tty3270_open() in tubtty.c.
+Sep 2002: Dynamically get 3270 input buffer
+ * Oversize 3270 screen widths may exceed GEOM_MAXINPLEN columns,
+ so get input-area buffer dynamically when sizing the device in
+ tubmakemin() in tuball.c (if it's the console) or tty3270_open()
+ in tubtty.c (if needed). Change tubp->tty_input to be a
+ pointer rather than an array, in tubio.h.
+Sep 2002: Fix tubfs kmalloc()s
+ * Do read and write lengths correctly in fs3270_read()
+ and fs3270_write(), whilst never asking kmalloc()
+ for more than 0x800 bytes. Affects tubfs.c and tubio.h.
+Sep 2002: Recognize 3270 control unit type 3174
+ * Recognize control-unit type 0x3174 as well as 0x327?.
+ The IBM 2047 device emulates a 3174 control unit.
+ Modularize control-unit recognition in tuball.c by
+ adding and invoking new tub3270_is_ours().
+Apr 2002: Fix 3270 console reboot loop
+ * (Belated log entry) Fixed reboot loop if 3270 console,
+ in tubtty.c:ttu3270_bh().
+Feb 6, 2001:
+ * This changelog is new
+ * tub3270 now supports 3270 console:
+ Specify y for CONFIG_3270 and y for CONFIG_3270_CONSOLE.
+ Support for 3215 will not appear if 3270 console support
+ is chosen.
+ NOTE: The default is 3270 console support, NOT 3215.
+ * the components are remodularized: added source modules are
+ tubttybld.c and tubttyscl.c, for screen-building code and
+ scroll-timeout code.
+ * tub3270 source for this (2.4.0) version is #ifdeffed to
+ build with both 2.4.0 and
+ * color support and minimal other ESC-sequence support is added.
diff --git a/Documentation/s390/3270.txt b/Documentation/s390/3270.txt
new file mode 100644
index 000000000000..0a044e647d2d
--- /dev/null
+++ b/Documentation/s390/3270.txt
@@ -0,0 +1,274 @@
+IBM 3270 Display System support
+This file describes the driver that supports local channel attachment
+of IBM 3270 devices. It consists of three sections:
+ * Introduction
+ * Installation
+ * Operation
+This paper describes installing and operating 3270 devices under
+Linux/390. A 3270 device is a block-mode rows-and-columns terminal of
+which I'm sure hundreds of millions were sold by IBM and clonemakers
+twenty and thirty years ago.
+You may have 3270s in-house and not know it. If you're using the
+VM-ESA operating system, define a 3270 to your virtual machine by using
+the command "DEF GRAF <hex-address>" This paper presumes you will be
+defining four 3270s with the CP/CMS commands
+ DEF GRAF 620
+ DEF GRAF 621
+ DEF GRAF 622
+ DEF GRAF 623
+Your network connection from VM-ESA allows you to use x3270, tn3270, or
+another 3270 emulator, started from an xterm window on your PC or
+workstation. With the DEF GRAF command, an application such as xterm,
+and this Linux-390 3270 driver, you have another way of talking to your
+Linux box.
+This paper covers installation of the driver and operation of a
+dialed-in x3270.
+You install the driver by installing a patch, doing a kernel build, and
+running the configuration script (, in this directory).
+WARNING: If you are using 3270 console support, you must rerun the
+configuration script every time you change the console's address (perhaps
+by using the condev= parameter in silo's /boot/parmfile). More precisely,
+you should rerun the configuration script every time your set of 3270s,
+including the console 3270, changes subchannel identifier relative to
+one another. ReIPL as soon as possible after running the configuration
+script and the resulting /tmp/mkdev3270.
+If you have chosen to make tub3270 a module, you add a line to
+/etc/modprobe.conf. If you are working on a VM virtual machine, you
+can use DEF GRAF to define virtual 3270 devices.
+You may generate both 3270 and 3215 console support, or one or the
+other, or neither. If you generate both, the console type under VM is
+not changed. Use #CP Q TERM to see what the current console type is.
+Use #CP TERM CONMODE 3270 to change it to 3270. If you generate only
+3270 console support, then the driver automatically converts your console
+at boot time to a 3270 if it is a 3215.
+In brief, these are the steps:
+ 1. Install the tub3270 patch
+ 2. (If a module) add a line to /etc/modprobe.conf
+ 3. (If VM) define devices with DEF GRAF
+ 4. Reboot
+ 5. Configure
+To test that everything works, assuming VM and x3270,
+ 1. Bring up an x3270 window.
+ 2. Use the DIAL command in that window.
+ 3. You should immediately see a Linux login screen.
+Here are the installation steps in detail:
+ 1. The 3270 driver is a part of the official Linux kernel
+ source. Build a tree with the kernel source and any necessary
+ patches. Then do
+ make oldconfig
+ (If you wish to disable 3215 console support, edit
+ .config; change CONFIG_TN3215's value to "n";
+ and rerun "make oldconfig".)
+ make image
+ make modules
+ make modules_install
+ 2. (Perform this step only if you have configured tub3270 as a
+ module.) Add a line to /etc/modprobe.conf to automatically
+ load the driver when it's needed. With this line added,
+ you will see login prompts appear on your 3270s as soon as
+ boot is complete (or with emulated 3270s, as soon as you dial
+ into your vm guest using the command "DIAL <vmguestname>").
+ Since the line-mode major number is 227, the line to add to
+ /etc/modprobe.conf should be:
+ alias char-major-227 tub3270
+ 3. Define graphic devices to your vm guest machine, if you
+ haven't already. Define them before you reboot (reipl):
+ 4. Reboot. The reboot process scans hardware devices, including
+ 3270s, and this enables the tub3270 driver once loaded to respond
+ correctly to the configuration requests of the next step. If
+ you have chosen 3270 console support, your console now behaves
+ as a 3270, not a 3215.
+ 5. Run the 3270 configuration script config3270. It is
+ distributed in this same directory, Documentation/s390, as
+ Inspect the output script it produces,
+ /tmp/mkdev3270, and then run that script. This will create the
+ necessary character special device files and make the necessary
+ changes to /etc/inittab. If you have selected DEVFS, the driver
+ itself creates the device files, and /tmp/mkdev3270 only changes
+ /etc/inittab.
+ Then notify /sbin/init that /etc/inittab has changed, by issuing
+ the telinit command with the q operand:
+ cd Documentation/s390
+ sh
+ sh /tmp/mkdev3270
+ telinit q
+ This should be sufficient for your first time. If your 3270
+ configuration has changed and you're reusing config3270, you
+ should follow these steps:
+ Change 3270 configuration
+ Reboot
+ Run config3270 and /tmp/mkdev3270
+ Reboot
+Here are the testing steps in detail:
+ 1. Bring up an x3270 window, or use an actual hardware 3278 or
+ 3279, or use the 3270 emulator of your choice. You would be
+ running the emulator on your PC or workstation. You would use
+ the command, for example,
+ x3270 vm-esa-domain-name &
+ if you wanted a 3278 Model 4 with 43 rows of 80 columns, the
+ default model number. The driver does not take advantage of
+ extended attributes.
+ The screen you should now see contains a VM logo with input
+ lines near the bottom. Use TAB to move to the bottom line,
+ probably labeled "COMMAND ===>".
+ 2. Use the DIAL command instead of the LOGIN command to connect
+ to one of the virtual 3270s you defined with the DEF GRAF
+ commands:
+ dial my-vm-guest-name
+ 3. You should immediately see a login prompt from your
+ Linux-390 operating system. If that does not happen, you would
+ see instead the line "DIALED TO my-vm-guest-name 0620".
+ To troubleshoot: do these things.
+ A. Is the driver loaded? Use the lsmod command (no operands)
+ to find out. Probably it isn't. Try loading it manually, with
+ the command "insmod tub3270". Does that command give error
+ messages? Ha! There's your problem.
+ B. Is the /etc/inittab file modified as in installation step 3
+ above? Use the grep command to find out; for instance, issue
+ "grep 3270 /etc/inittab". Nothing found? There's your
+ problem!
+ C. Are the device special files created, as in installation
+ step 2 above? Use the ls -l command to find out; for instance,
+ issue "ls -l /dev/3270/tty620". The output should start with the
+ letter "c" meaning character device and should contain "227, 1"
+ just to the left of the device name. No such file? no "c"?
+ Wrong major number? Wrong minor number? There's your
+ problem!
+ D. Do you get the message
+ "HCPDIA047E my-vm-guest-name 0620 does not exist"?
+ If so, you must issue the command "DEF GRAF 620" from your VM
+ 3215 console and then reboot the system.
+The driver defines three areas on the 3270 screen: the log area, the
+input area, and the status area.
+The log area takes up all but the bottom two lines of the screen. The
+driver writes terminal output to it, starting at the top line and going
+down. When it fills, the status area changes from "Linux Running" to
+"Linux More...". After a scrolling timeout of (default) 5 sec, the
+screen clears and more output is written, from the top down.
+The input area extends from the beginning of the second-to-last screen
+line to the start of the status area. You type commands in this area
+and hit ENTER to execute them.
+The status area initializes to "Linux Running" to give you a warm
+fuzzy feeling. When the log area fills up and output awaits, it
+changes to "Linux More...". At this time you can do several things or
+nothing. If you do nothing, the screen will clear in (default) 5 sec
+and more output will appear. You may hit ENTER with nothing typed in
+the input area to toggle between "Linux More..." and "Linux Holding",
+which indicates no scrolling will occur. (If you hit ENTER with "Linux
+Running" and nothing typed, the application receives a newline.)
+You may change the scrolling timeout value. For example, the following
+command line:
+ echo scrolltime=60 > /proc/tty/driver/tty3270
+changes the scrolling timeout value to 60 sec. Set scrolltime to 0 if
+you wish to prevent scrolling entirely.
+Other things you may do when the log area fills up are: hit PA2 to
+clear the log area and write more output to it, or hit CLEAR to clear
+the log area and the input area and write more output to the log area.
+Some of the Program Function (PF) and Program Attention (PA) keys are
+preassigned special functions. The ones that are not yield an alarm
+when pressed.
+PA1 causes a SIGINT to the currently running application. You may do
+the same thing from the input area, by typing "^C" and hitting ENTER.
+PA2 causes the log area to be cleared. If output awaits, it is then
+written to the log area.
+PF3 causes an EOF to be received as input by the application. You may
+cause an EOF also by typing "^D" and hitting ENTER.
+No PF key is preassigned to cause a job suspension, but you may cause a
+job suspension by typing "^Z" and hitting ENTER. You may wish to
+assign this function to a PF key. To make PF7 cause job suspension,
+execute the command:
+ echo pf7=^z > /proc/tty/driver/tty3270
+If the input you type does not end with the two characters "^n", the
+driver appends a newline character and sends it to the tty driver;
+otherwise the driver strips the "^n" and does not append a newline.
+The IBM 3215 driver behaves similarly.
+Pf10 causes the most recent command to be retrieved from the tube's
+command stack (default depth 20) and displayed in the input area. You
+may hit PF10 again for the next-most-recent command, and so on. A
+command is entered into the stack only when the input area is not made
+invisible (such as for password entry) and it is not identical to the
+current top entry. PF10 rotates backward through the command stack;
+PF11 rotates forward. You may assign the backward function to any PF
+key (or PA key, for that matter), say, PA3, with the command:
+ echo -e pa3=\\033k > /proc/tty/driver/tty3270
+This assigns the string ESC-k to PA3. Similarly, the string ESC-j
+performs the forward function. (Rationale: In bash with vi-mode line
+editing, ESC-k and ESC-j retrieve backward and forward history.
+Suggestions welcome.)
+Is a stack size of twenty commands not to your liking? Change it on
+the fly. To change to saving the last 100 commands, execute the
+ echo recallsize=100 > /proc/tty/driver/tty3270
+Have a command you issue frequently? Assign it to a PF or PA key! Use
+the command
+ echo pf24="mkdir foobar; cd foobar" > /proc/tty/driver/tty3270
+to execute the commands mkdir foobar and cd foobar immediately when you
+hit PF24. Want to see the command line first, before you execute it?
+Use the -n option of the echo command:
+ echo -n pf24="mkdir foo; cd foo" > /proc/tty/driver/tty3270
+Happy testing! I welcome any and all comments about this document, the
+driver, etc etc.
+Dick Hitt <>
diff --git a/Documentation/s390/CommonIO b/Documentation/s390/CommonIO
new file mode 100644
index 000000000000..a831d9ae5a5e
--- /dev/null
+++ b/Documentation/s390/CommonIO
@@ -0,0 +1,109 @@
+S/390 common I/O-Layer - command line parameters and /proc entries
+Command line parameters
+* cio_msg = yes | no
+ Determines whether information on found devices and sensed device
+ characteristics should be shown during startup, i. e. messages of the types
+ "Detected device 0.0.4711 on subchannel 0.0.0042" and "SenseID: Device
+ 0.0.4711 reports: ...".
+ Default is off.
+* cio_ignore = {all} |
+ {<device> | <range of devices>} |
+ {!<device> | !<range of devices>}
+ The given devices will be ignored by the common I/O-layer; no detection
+ and device sensing will be done on any of those devices. The subchannel to
+ which the device in question is attached will be treated as if no device was
+ attached.
+ An ignored device can be un-ignored later; see the "/proc entries"-section for
+ details.
+ The devices must be given either as bus ids (0.0.abcd) or as hexadecimal
+ device numbers (0xabcd or abcd, for 2.4 backward compatibility).
+ You can use the 'all' keyword to ignore all devices.
+ The '!' operator will cause the I/O-layer to _not_ ignore a device.
+ The order on the command line is not important.
+ For example,
+ cio_ignore=0.0.0023-0.0.0042,0.0.4711
+ will ignore all devices ranging from 0.0.0023 to 0.0.0042 and the device
+ 0.0.4711, if detected.
+ As another example,
+ cio_ignore=all,!0.0.4711,!0.0.fd00-0.0.fd02
+ will ignore all devices but 0.0.4711, 0.0.fd00, 0.0.fd01, 0.0.fd02.
+ By default, no devices are ignored.
+/proc entries
+* /proc/cio_ignore
+ Lists the ranges of devices (by bus id) which are ignored by common I/O.
+ You can un-ignore certain or all devices by piping to /proc/cio_ignore.
+ "free all" will un-ignore all ignored devices,
+ "free <device range>, <device range>, ..." will un-ignore the specified
+ devices.
+ For example, if devices 0.0.0023 to 0.0.0042 and 0.0.4711 are ignored,
+ - echo free 0.0.0030-0.0.0032 > /proc/cio_ignore
+ will un-ignore devices 0.0.0030 to 0.0.0032 and will leave devices 0.0.0023
+ to 0.0.002f, 0.0.0033 to 0.0.0042 and 0.0.4711 ignored;
+ - echo free 0.0.0041 > /proc/cio_ignore will furthermore un-ignore device
+ 0.0.0041;
+ - echo free all > /proc/cio_ignore will un-ignore all remaining ignored
+ devices.
+ When a device is un-ignored, device recognition and sensing is performed and
+ the device driver will be notified if possible, so the device will become
+ available to the system.
+ You can also add ranges of devices to be ignored by piping to
+ /proc/cio_ignore; "add <device range>, <device range>, ..." will ignore the
+ specified devices.
+ Note: Already known devices cannot be ignored.
+ For example, if device 0.0.abcd is already known and all other devices
+ 0.0.a000-0.0.afff are not known,
+ "echo add 0.0.a000-0.0.accc, 0.0.af00-0.0.afff > /proc/cio_ignore"
+ will add 0.0.a000-0.0.abcc, 0.0.abce-0.0.accc and 0.0.af00-0.0.afff to the
+ list of ignored devices and skip 0.0.abcd.
+ The devices can be specified either by bus id (0.0.abcd) or, for 2.4 backward
+ compatibilty, by the device number in hexadecimal (0xabcd or abcd).
+* /proc/s390dbf/cio_*/ (S/390 debug feature)
+ Some views generated by the debug feature to hold various debug outputs.
+ - /proc/s390dbf/cio_crw/sprintf
+ Messages from the processing of pending channel report words (machine check
+ handling), which will also show when CONFIG_DEBUG_CRW is defined.
+ - /proc/s390dbf/cio_msg/sprintf
+ Various debug messages from the common I/O-layer; generally, messages which
+ will also show when CONFIG_DEBUG_IO is defined.
+ - /proc/s390dbf/cio_trace/hex_ascii
+ Logs the calling of functions in the common I/O-layer and, if applicable,
+ which subchannel they were called for.
+ The level of logging can be changed to be more or less verbose by piping to
+ /proc/s390dbf/cio_*/level a number between 0 and 6; see the documentation on
+ the S/390 debug feature (Documentation/s390/s390dbf.txt) for details.
+* For some of the information present in the /proc filesystem in 2.4 (namely,
+ /proc/subchannels and /proc/chpids), see driver-model.txt.
+ Information formerly in /proc/irq_count is now in /proc/interrupts.
diff --git a/Documentation/s390/DASD b/Documentation/s390/DASD
new file mode 100644
index 000000000000..9963f1e9c98a
--- /dev/null
+++ b/Documentation/s390/DASD
@@ -0,0 +1,73 @@
+DASD device driver
+S/390's disk devices (DASDs) are managed by Linux via the DASD device
+driver. It is valid for all types of DASDs and represents them to
+Linux as block devices, namely "dd". Currently the DASD driver uses a
+single major number (254) and 4 minor numbers per volume (1 for the
+physical volume and 3 for partitions). With respect to partitions see
+below. Thus you may have up to 64 DASD devices in your system.
+The kernel parameter 'dasd=from-to,...' may be issued arbitrary times
+in the kernel's parameter line or not at all. The 'from' and 'to'
+parameters are to be given in hexadecimal notation without a leading
+If you supply kernel parameters the different instances are processed
+in order of appearance and a minor number is reserved for any device
+covered by the supplied range up to 64 volumes. Additional DASDs are
+ignored. If you do not supply the 'dasd=' kernel parameter at all, the
+DASD driver registers all supported DASDs of your system to a minor
+number in ascending order of the subchannel number.
+The driver currently supports ECKD-devices and there are stubs for
+support of the FBA and CKD architectures. For the FBA architecture
+only some smart data structures are missing to make the support
+We performed our testing on 3380 and 3390 type disks of different
+sizes, under VM and on the bare hardware (LPAR), using internal disks
+of the multiprise as well as a RAMAC virtual array. Disks exported by
+an Enterprise Storage Server (Seascape) should work fine as well.
+We currently implement one partition per volume, which is the whole
+volume, skipping the first blocks up to the volume label. These are
+reserved for IPL records and IBM's volume label to assure
+accessibility of the DASD from other OSs. In a later stage we will
+provide support of partitions, maybe VTOC oriented or using a kind of
+partition table in the label record.
+-Low-level format (?CKD only)
+For using an ECKD-DASD as a Linux harddisk you have to low-level
+format the tracks by issuing the BLKDASDFORMAT-ioctl on that
+device. This will erase any data on that volume including IBM volume
+labels, VTOCs etc. The ioctl may take a 'struct format_data *' or
+'NULL' as an argument.
+typedef struct {
+ int start_unit;
+ int stop_unit;
+ int blksize;
+} format_data_t;
+When a NULL argument is passed to the BLKDASDFORMAT ioctl the whole
+disk is formatted to a blocksize of 1024 bytes. Otherwise start_unit
+and stop_unit are the first and last track to be formatted. If
+stop_unit is -1 it implies that the DASD is formatted from start_unit
+up to the last track. blksize can be any power of two between 512 and
+4096. We recommend no blksize lower than 1024 because the ext2fs uses
+1kB blocks anyway and you gain approx. 50% of capacity increasing your
+blksize from 512 byte to 1kB.
+-Make a filesystem
+Then you can mk??fs the filesystem of your choice on that volume or
+partition. For reasons of sanity you should build your filesystem on
+the partition /dev/dd?1 instead of the whole volume. You only lose 3kB
+but may be sure that you can reuse your data after introduction of a
+real partition table.
+- Performance sometimes is rather low because we don't fully exploit clustering
+- Add IBM'S Disk layout to genhd
+- Enhance driver to use more than one major number
+- Enable usage as a module
+- Support Cache fast write and DASD fast write (ECKD)
diff --git a/Documentation/s390/Debugging390.txt b/Documentation/s390/Debugging390.txt
new file mode 100644
index 000000000000..adbfe620c061
--- /dev/null
+++ b/Documentation/s390/Debugging390.txt
@@ -0,0 +1,2536 @@
+ Debugging on Linux for s/390 & z/Architecture
+ by
+ Denis Joseph Barrow (,
+ Copyright (C) 2000-2001 IBM Deutschland Entwicklung GmbH, IBM Corporation
+ Best viewed with fixed width fonts
+Overview of Document:
+This document is intended to give an good overview of how to debug
+Linux for s/390 & z/Architecture it isn't intended as a complete reference & not a
+tutorial on the fundamentals of C & assembly, it dosen't go into
+390 IO in any detail. It is intended to complement the documents in the
+reference section below & any other worthwhile references you get.
+It is intended like the Enterprise Systems Architecture/390 Reference Summary
+to be printed out & used as a quick cheat sheet self help style reference when
+problems occur.
+Register Set
+Address Spaces on Intel Linux
+Address Spaces on Linux for s/390 & z/Architecture
+The Linux for s/390 & z/Architecture Kernel Task Structure
+Register Usage & Stackframes on Linux for s/390 & z/Architecture
+A sample program with comments
+Compiling programs for debugging on Linux for s/390 & z/Architecture
+Figuring out gcc compile errors
+Debugging Tools
+Performance Debugging
+Debugging under VM
+s/390 & z/Architecture IO Overview
+Debugging IO on s/390 & z/Architecture under VM
+GDB on s/390 & z/Architecture
+Stack chaining in gdb by hand
+Examining core dumps
+Debugging modules
+The proc file system
+Starting points for debugging scripting languages etc.
+Dumptool & Lcrash
+Special Thanks
+Register Set
+The current architectures have the following registers.
+16 General propose registers, 32 bit on s/390 64 bit on z/Architecture, r0-r15 or gpr0-gpr15 used for arithmetic & addressing.
+16 Control registers, 32 bit on s/390 64 bit on z/Architecture, ( cr0-cr15 kernel usage only ) used for memory management,
+interrupt control,debugging control etc.
+16 Access registers ( ar0-ar15 ) 32 bit on s/390 & z/Architecture
+not used by normal programs but potentially could
+be used as temporary storage. Their main purpose is their 1 to 1
+association with general purpose registers and are used in
+the kernel for copying data between kernel & user address spaces.
+Access register 0 ( & access register 1 on z/Architecture ( needs 64 bit
+pointer ) ) is currently used by the pthread library as a pointer to
+the current running threads private area.
+16 64 bit floating point registers (fp0-fp15 ) IEEE & HFP floating
+point format compliant on G5 upwards & a Floating point control reg (FPC)
+4 64 bit registers (fp0,fp2,fp4 & fp6) HFP only on older machines.
+Linux (currently) always uses IEEE & emulates G5 IEEE format on older machines,
+( provided the kernel is configured for this ).
+The PSW is the most important register on the machine it
+is 64 bit on s/390 & 128 bit on z/Architecture & serves the roles of
+a program counter (pc), condition code register,memory space designator.
+In IBM standard notation I am counting bit 0 as the MSB.
+It has several advantages over a normal program counter
+in that you can change address translation & program counter
+in a single instruction. To change address translation,
+e.g. switching address translation off requires that you
+have a logical=physical mapping for the address you are
+currently running at.
+ Bit Value
+s/390 z/Architecture
+0 0 Reserved ( must be 0 ) otherwise specification exception occurs.
+1 1 Program Event Recording 1 PER enabled,
+ PER is used to facilititate debugging e.g. single stepping.
+2-4 2-4 Reserved ( must be 0 ).
+5 5 Dynamic address translation 1=DAT on.
+6 6 Input/Output interrupt Mask
+7 7 External interrupt Mask used primarily for interprocessor signalling &
+ clock interrupts.
+8-11 8-11 PSW Key used for complex memory protection mechanism not used under linux
+12 12 1 on s/390 0 on z/Architecture
+13 13 Machine Check Mask 1=enable machine check interrupts
+14 14 Wait State set this to 1 to stop the processor except for interrupts & give
+ time to other LPARS used in CPU idle in the kernel to increase overall
+ usage of processor resources.
+15 15 Problem state ( if set to 1 certain instructions are disabled )
+ all linux user programs run with this bit 1
+ ( useful info for debugging under VM ).
+16-17 16-17 Address Space Control
+ 00 Primary Space Mode when DAT on
+ The linux kernel currently runs in this mode, CR1 is affiliated with
+ this mode & points to the primary segment table origin etc.
+ 01 Access register mode this mode is used in functions to
+ copy data between kernel & user space.
+ 10 Secondary space mode not used in linux however CR7 the
+ register affiliated with this mode is & this & normally
+ CR13=CR7 to allow us to copy data between kernel & user space.
+ We do this as follows:
+ We set ar2 to 0 to designate its
+ affiliated gpr ( gpr2 )to point to primary=kernel space.
+ We set ar4 to 1 to designate its
+ affiliated gpr ( gpr4 ) to point to secondary=home=user space
+ & then essentially do a memcopy(gpr2,gpr4,size) to
+ copy data between the address spaces, the reason we use home space for the
+ kernel & don't keep secondary space free is that code will not run in
+ secondary space.
+ 11 Home Space Mode all user programs run in this mode.
+ it is affiliated with CR13.
+18-19 18-19 Condition codes (CC)
+20 20 Fixed point overflow mask if 1=FPU exceptions for this event
+ occur ( normally 0 )
+21 21 Decimal overflow mask if 1=FPU exceptions for this event occur
+ ( normally 0 )
+22 22 Exponent underflow mask if 1=FPU exceptions for this event occur
+ ( normally 0 )
+23 23 Significance Mask if 1=FPU exceptions for this event occur
+ ( normally 0 )
+24-31 24-30 Reserved Must be 0.
+ 31 Extended Addressing Mode
+ 32 Basic Addressing Mode
+ Used to set addressing mode
+ PSW 31 PSW 32
+ 0 0 24 bit
+ 0 1 31 bit
+ 1 1 64 bit
+32 1=31 bit addressing mode 0=24 bit addressing mode (for backward
+ compatibility ), linux always runs with this bit set to 1
+33-64 Instruction address.
+ 33-63 Reserved must be 0
+ 64-127 Address
+ In 24 bits mode bits 64-103=0 bits 104-127 Address
+ In 31 bits mode bits 64-96=0 bits 97-127 Address
+ Note: unlike 31 bit mode on s/390 bit 96 must be zero
+ when loading the address with LPSWE otherwise a
+ specification exception occurs, LPSW is fully backward
+ compatible.
+Prefix Page(s)
+This per cpu memory area is too intimately tied to the processor not to mention.
+It exists between the real addresses 0-4096 on s/390 & 0-8192 z/Architecture & is exchanged
+with a 1 page on s/390 or 2 pages on z/Architecture in absolute storage by the set
+prefix instruction in linux'es startup.
+This page is mapped to a different prefix for each processor in an SMP configuration
+( assuming the os designer is sane of course :-) ).
+Bytes 0-512 ( 200 hex ) on s/390 & 0-512,4096-4544,4604-5119 currently on z/Architecture
+are used by the processor itself for holding such information as exception indications &
+entry points for exceptions.
+Bytes after 0xc00 hex are used by linux for per processor globals on s/390 & z/Architecture
+( there is a gap on z/Architecure too currently between 0xc00 & 1000 which linux uses ).
+The closest thing to this on traditional architectures is the interrupt
+vector table. This is a good thing & does simplify some of the kernel coding
+however it means that we now cannot catch stray NULL pointers in the
+kernel without hard coded checks.
+Address Spaces on Intel Linux
+The traditional Intel Linux is approximately mapped as follows forgive
+the ascii art.
+0xFFFFFFFF 4GB Himem *****************
+ * *
+ * Kernel Space *
+ * *
+ ***************** ****************
+User Space Himem (typically 0xC0000000 3GB )* User Stack * * *
+ ***************** * *
+ * Shared Libs * * Next Process *
+ ***************** * to *
+ * * <== * Run * <==
+ * User Program * * *
+ * Data BSS * * *
+ * Text * * *
+ * Sections * * *
+0x00000000 ***************** ****************
+Now it is easy to see that on Intel it is quite easy to recognise a kernel address
+as being one greater than user space himem ( in this case 0xC0000000).
+& addresses of less than this are the ones in the current running program on this
+processor ( if an smp box ).
+If using the virtual machine ( VM ) as a debugger it is quite difficult to
+know which user process is running as the address space you are looking at
+could be from any process in the run queue.
+The limitation of Intels addressing technique is that the linux
+kernel uses a very simple real address to virtual addressing technique
+of Real Address=Virtual Address-User Space Himem.
+This means that on Intel the kernel linux can typically only address
+Himem=0xFFFFFFFF-0xC0000000=1GB & this is all the RAM these machines
+can typically use.
+They can lower User Himem to 2GB or lower & thus be
+able to use 2GB of RAM however this shrinks the maximum size
+of User Space from 3GB to 2GB they have a no win limit of 4GB unless
+they go to 64 Bit.
+On 390 our limitations & strengths make us slightly different.
+For backward compatibility we are only allowed use 31 bits (2GB)
+of our 32 bit addresses,however, we use entirely separate address
+spaces for the user & kernel.
+This means we can support 2GB of non Extended RAM on s/390, & more
+with the Extended memory management swap device &
+currently 4TB of physical memory currently on z/Architecture.
+Address Spaces on Linux for s/390 & z/Architecture
+Our addressing scheme is as follows
+Himem 0x7fffffff 2GB on s/390 ***************** ****************
+currently 0x3ffffffffff (2^42)-1 * User Stack * * *
+on z/Architecture. ***************** * *
+ * Shared Libs * * *
+ ***************** * *
+ * * * Kernel *
+ * User Program * * *
+ * Data BSS * * *
+ * Text * * *
+ * Sections * * *
+0x00000000 ***************** ****************
+This also means that we need to look at the PSW problem state bit
+or the addressing mode to decide whether we are looking at
+user or kernel space.
+Virtual Addresses on s/390 & z/Architecture
+A virtual address on s/390 is made up of 3 parts
+The SX ( segment index, roughly corresponding to the PGD & PMD in linux terminology )
+being bits 1-11.
+The PX ( page index, corresponding to the page table entry (pte) in linux terminology )
+being bits 12-19.
+The remaining bits BX (the byte index are the offset in the page )
+i.e. bits 20 to 31.
+On z/Architecture in linux we currently make up an address from 4 parts.
+The region index bits (RX) 0-32 we currently use bits 22-32
+The segment index (SX) being bits 33-43
+The page index (PX) being bits 44-51
+The byte index (BX) being bits 52-63
+1) s/390 has no PMD so the PMD is really the PGD also.
+A lot of this stuff is defined in pgtable.h.
+2) Also seeing as s/390's page indexes are only 1k in size
+(bits 12-19 x 4 bytes per pte ) we use 1 ( page 4k )
+to make the best use of memory by updating 4 segment indices
+entries each time we mess with a PMD & use offsets
+0,1024,2048 & 3072 in this page as for our segment indexes.
+On z/Architecture our page indexes are now 2k in size
+( bits 12-19 x 8 bytes per pte ) we do a similar trick
+but only mess with 2 segment indices each time we mess with
+a PMD.
+3) As z/Architecture supports upto a massive 5-level page table lookup we
+can only use 3 currently on Linux ( as this is all the generic kernel
+currently supports ) however this may change in future
+this allows us to access ( according to my sums )
+4TB of virtual storage per process i.e.
+4096*512(PTES)*1024(PMDS)*2048(PGD) = 4398046511104 bytes,
+enough for another 2 or 3 of years I think :-).
+to do this we use a region-third-table designation type in
+our address space control registers.
+The Linux for s/390 & z/Architecture Kernel Task Structure
+Each process/thread under Linux for S390 has its own kernel task_struct
+defined in linux/include/linux/sched.h
+The S390 on initialisation & resuming of a process on a cpu sets
+the __LC_KERNEL_STACK variable in the spare prefix area for this cpu
+( which we use for per processor globals).
+The kernel stack pointer is intimately tied with the task stucture for
+each processor as follows.
+ s/390
+ ************************
+ * 1 page kernel stack *
+ * ( 4K ) *
+ ************************
+ * 1 page task_struct *
+ * ( 4K ) *
+8K aligned ************************
+ z/Architecture
+ ************************
+ * 2 page kernel stack *
+ * ( 8K ) *
+ ************************
+ * 2 page task_struct *
+ * ( 8K ) *
+16K aligned ************************
+What this means is that we don't need to dedicate any register or global variable
+to point to the current running process & can retrieve it with the following
+very simple construct for s/390 & one very similar for z/Architecture.
+static inline struct task_struct * get_current(void)
+ struct task_struct *current;
+ __asm__("lhi %0,-8192\n\t"
+ "nr %0,15"
+ : "=r" (current) );
+ return current;
+i.e. just anding the current kernel stack pointer with the mask -8192.
+Thankfully because Linux dosen't have support for nested IO interrupts
+& our devices have large buffers can survive interrupts being shut for
+short amounts of time we don't need a separate stack for interrupts.
+Register Usage & Stackframes on Linux for s/390 & z/Architecture
+This is the code that gcc produces at the top & the bottom of
+each function, it usually is fairly consistent & similar from
+function to function & if you know its layout you can probalby
+make some headway in finding the ultimate cause of a problem
+after a crash without a source level debugger.
+Note: To follow stackframes requires a knowledge of C or Pascal &
+limited knowledge of one assembly language.
+It should be noted that there are some differences between the
+s/390 & z/Architecture stack layouts as the z/Architecture stack layout didn't have
+to maintain compatibility with older linkage formats.
+This is a built in compiler function for runtime allocation
+of extra space on the callers stack which is obviously freed
+up on function exit ( e.g. the caller may choose to allocate nothing
+of a buffer of 4k if required for temporary purposes ), it generates
+very efficient code ( a few cycles ) when compared to alternatives
+like malloc.
+automatics: These are local variables on the stack,
+i.e they aren't in registers & they aren't static.
+This is a pointer to the stack pointer before entering a
+framed functions ( see frameless function ) prologue got by
+deferencing the address of the current stack pointer,
+ i.e. got by accessing the 32 bit value at the stack pointers
+current location.
+This is a pointer to the back of the literal pool which
+is an area just behind each procedure used to store constants
+in each function.
+call-clobbered: The caller probably needs to save these registers if there
+is something of value in them, on the stack or elsewhere before making a
+call to another procedure so that it can restore it later.
+The code generated by the compiler to return to the caller.
+A frameless function in Linux for s390 & z/Architecture is one which doesn't
+need more than the register save area ( 96 bytes on s/390, 160 on z/Architecture )
+given to it by the caller.
+A frameless function never:
+1) Sets up a back chain.
+2) Calls alloca.
+3) Calls other normal functions
+4) Has automatics.
+This is a pointer to the global-offset-table in ELF
+( Executable Linkable Format, Linux'es most common executable format ),
+all globals & shared library objects are found using this pointer.
+ELF shared libraries are typically only loaded when routines in the shared
+library are actually first called at runtime. This is lazy binding.
+This is a table found from the GOT which contains pointers to routines
+in other shared libraries which can't be called to by easier means.
+The code generated by the compiler to set up the stack frame.
+This is extra area allocated on the stack of the calling function if the
+parameters for the callee's cannot all be put in registers, the same
+area can be reused by each function the caller calls.
+A COFF executable format based concept of a procedure reference
+actually being 8 bytes or more as opposed to a simple pointer to the routine.
+This is typically defined as follows
+Routine Descriptor offset 0=Pointer to Function
+Routine Descriptor offset 4=Pointer to Table of Contents
+The table of contents/TOC is roughly equivalent to a GOT pointer.
+& it means that shared libraries etc. can be shared between several
+environments each with their own TOC.
+static-chain: This is used in nested functions a concept adopted from pascal
+by gcc not used in ansi C or C++ ( although quite useful ), basically it
+is a pointer used to reference local variables of enclosing functions.
+You might come across this stuff once or twice in your lifetime.
+The function below should return 11 though gcc may get upset & toss warnings
+about unused variables.
+int FunctionA(int a)
+ int b;
+ FunctionC(int c)
+ {
+ b=c+1;
+ }
+ FunctionC(10);
+ return(b);
+s/390 & z/Architecture Register usage
+r0 used by syscalls/assembly call-clobbered
+r1 used by syscalls/assembly call-clobbered
+r2 argument 0 / return value 0 call-clobbered
+r3 argument 1 / return value 1 (if long long) call-clobbered
+r4 argument 2 call-clobbered
+r5 argument 3 call-clobbered
+r6 argument 5 saved
+r7 pointer-to arguments 5 to ... saved
+r8 this & that saved
+r9 this & that saved
+r10 static-chain ( if nested function ) saved
+r11 frame-pointer ( if function used alloca ) saved
+r12 got-pointer saved
+r13 base-pointer saved
+r14 return-address saved
+r15 stack-pointer saved
+f0 argument 0 / return value ( float/double ) call-clobbered
+f2 argument 1 call-clobbered
+f4 z/Architecture argument 2 saved
+f6 z/Architecture argument 3 saved
+The remaining floating points
+f1,f3,f5 f7-f15 are call-clobbered.
+1) The only requirement is that registers which are used
+by the callee are saved, e.g. the compiler is perfectly
+capible of using r11 for purposes other than a frame a
+frame pointer if a frame pointer is not needed.
+2) In functions with variable arguments e.g. printf the calling procedure
+is identical to one without variable arguments & the same number of
+parameters. However, the prologue of this function is somewhat more
+hairy owing to it having to move these parameters to the stack to
+get va_start, va_arg & va_end to work.
+3) Access registers are currently unused by gcc but are used in
+the kernel. Possibilities exist to use them at the moment for
+temporary storage but it isn't recommended.
+4) Only 4 of the floating point registers are used for
+parameter passing as older machines such as G3 only have only 4
+& it keeps the stack frame compatible with other compilers.
+However with IEEE floating point emulation under linux on the
+older machines you are free to use the other 12.
+5) A long long or double parameter cannot be have the
+first 4 bytes in a register & the second four bytes in the
+outgoing args area. It must be purely in the outgoing args
+area if crossing this boundary.
+6) Floating point parameters are mixed with outgoing args
+on the outgoing args area in the order the are passed in as parameters.
+7) Floating point arguments 2 & 3 are saved in the outgoing args area for
+Stack Frame Layout
+s/390 z/Architecture
+0 0 back chain ( a 0 here signifies end of back chain )
+4 8 eos ( end of stack, not used on Linux for S390 used in other linkage formats )
+8 16 glue used in other s/390 linkage formats for saved routine descriptors etc.
+12 24 glue used in other s/390 linkage formats for saved routine descriptors etc.
+16 32 scratch area
+20 40 scratch area
+24 48 saved r6 of caller function
+28 56 saved r7 of caller function
+32 64 saved r8 of caller function
+36 72 saved r9 of caller function
+40 80 saved r10 of caller function
+44 88 saved r11 of caller function
+48 96 saved r12 of caller function
+52 104 saved r13 of caller function
+56 112 saved r14 of caller function
+60 120 saved r15 of caller function
+64 128 saved f4 of caller function
+72 132 saved f6 of caller function
+80 undefined
+96 160 outgoing args passed from caller to callee
+96+x 160+x possible stack alignment ( 8 bytes desirable )
+96+x+y 160+x+y alloca space of caller ( if used )
+96+x+y+z 160+x+y+z automatics of caller ( if used )
+0 back-chain
+A sample program with comments.
+Comments on the function test
+1) It didn't need to set up a pointer to the constant pool gpr13 as it isn't used
+( :-( ).
+2) This is a frameless function & no stack is bought.
+3) The compiler was clever enough to recognise that it could return the
+value in r2 as well as use it for the passed in parameter ( :-) ).
+4) The basr ( branch relative & save ) trick works as follows the instruction
+has a special case with r0,r0 with some instruction operands is understood as
+the literal value 0, some risc architectures also do this ). So now
+we are branching to the next address & the address new program counter is
+in r13,so now we subtract the size of the function prologue we have executed
++ the size of the literal pool to get to the top of the literal pool
+0040037c int test(int b)
+{ # Function prologue below
+ 40037c: 90 de f0 34 stm %r13,%r14,52(%r15) # Save registers r13 & r14
+ 400380: 0d d0 basr %r13,%r0 # Set up pointer to constant pool using
+ 400382: a7 da ff fa ahi %r13,-6 # basr trick
+ return(5+b);
+ # Huge main program
+ 400386: a7 2a 00 05 ahi %r2,5 # add 5 to r2
+ # Function epilogue below
+ 40038a: 98 de f0 34 lm %r13,%r14,52(%r15) # restore registers r13 & 14
+ 40038e: 07 fe br %r14 # return
+Comments on the function main
+1) The compiler did this function optimally ( 8-) )
+Literal pool for main.
+400390: ff ff ff ec .long 0xffffffec
+main(int argc,char *argv[])
+{ # Function prologue below
+ 400394: 90 bf f0 2c stm %r11,%r15,44(%r15) # Save necessary registers
+ 400398: 18 0f lr %r0,%r15 # copy stack pointer to r0
+ 40039a: a7 fa ff a0 ahi %r15,-96 # Make area for callee saving
+ 40039e: 0d d0 basr %r13,%r0 # Set up r13 to point to
+ 4003a0: a7 da ff f0 ahi %r13,-16 # literal pool
+ 4003a4: 50 00 f0 00 st %r0,0(%r15) # Save backchain
+ return(test(5)); # Main Program Below
+ 4003a8: 58 e0 d0 00 l %r14,0(%r13) # load relative address of test from
+ # literal pool
+ 4003ac: a7 28 00 05 lhi %r2,5 # Set first parameter to 5
+ 4003b0: 4d ee d0 00 bas %r14,0(%r14,%r13) # jump to test setting r14 as return
+ # address using branch & save instruction.
+ # Function Epilogue below
+ 4003b4: 98 bf f0 8c lm %r11,%r15,140(%r15)# Restore necessary registers.
+ 4003b8: 07 fe br %r14 # return to do program exit
+Compiler updates
+main(int argc,char *argv[])
+ 4004fc: 90 7f f0 1c stm %r7,%r15,28(%r15)
+ 400500: a7 d5 00 04 bras %r13,400508 <main+0xc>
+ 400504: 00 40 04 f4 .long 0x004004f4
+ # compiler now puts constant pool in code to so it saves an instruction
+ 400508: 18 0f lr %r0,%r15
+ 40050a: a7 fa ff a0 ahi %r15,-96
+ 40050e: 50 00 f0 00 st %r0,0(%r15)
+ return(test(5));
+ 400512: 58 10 d0 00 l %r1,0(%r13)
+ 400516: a7 28 00 05 lhi %r2,5
+ 40051a: 0d e1 basr %r14,%r1
+ # compiler adds 1 extra instruction to epilogue this is done to
+ # avoid processor pipeline stalls owing to data dependencies on g5 &
+ # above as register 14 in the old code was needed directly after being loaded
+ # by the lm %r11,%r15,140(%r15) for the br %14.
+ 40051c: 58 40 f0 98 l %r4,152(%r15)
+ 400520: 98 7f f0 7c lm %r7,%r15,124(%r15)
+ 400524: 07 f4 br %r4
+Hartmut ( our compiler developer ) also has been threatening to take out the
+stack backchain in optimised code as this also causes pipeline stalls, you
+have been warned.
+64 bit z/Architecture code disassembly
+If you understand the stuff above you'll understand the stuff
+below too so I'll avoid repeating myself & just say that
+some of the instructions have g's on the end of them to indicate
+they are 64 bit & the stack offsets are a bigger,
+the only other difference you'll find between 32 & 64 bit is that
+we now use f4 & f6 for floating point arguments on 64 bit.
+00000000800005b0 <test>:
+int test(int b)
+ return(5+b);
+ 800005b0: a7 2a 00 05 ahi %r2,5
+ 800005b4: b9 14 00 22 lgfr %r2,%r2 # downcast to integer
+ 800005b8: 07 fe br %r14
+ 800005ba: 07 07 bcr 0,%r7
+00000000800005bc <main>:
+main(int argc,char *argv[])
+ 800005bc: eb bf f0 58 00 24 stmg %r11,%r15,88(%r15)
+ 800005c2: b9 04 00 1f lgr %r1,%r15
+ 800005c6: a7 fb ff 60 aghi %r15,-160
+ 800005ca: e3 10 f0 00 00 24 stg %r1,0(%r15)
+ return(test(5));
+ 800005d0: a7 29 00 05 lghi %r2,5
+ # brasl allows jumps > 64k & is overkill here bras would do fune
+ 800005d4: c0 e5 ff ff ff ee brasl %r14,800005b0 <test>
+ 800005da: e3 40 f1 10 00 04 lg %r4,272(%r15)
+ 800005e0: eb bf f0 f8 00 04 lmg %r11,%r15,248(%r15)
+ 800005e6: 07 f4 br %r4
+Compiling programs for debugging on Linux for s/390 & z/Architecture
+-gdwarf-2 now works it should be considered the default debugging
+format for s/390 & z/Architecture as it is more reliable for debugging
+shared libraries, normal -g debugging works much better now
+Thanks to the IBM java compiler developers bug reports.
+This is typically done adding/appending the flags -g or -gdwarf-2 to the
+CFLAGS & LDFLAGS variables Makefile of the program concerned.
+If using gdb & you would like accurate displays of registers &
+ stack traces compile without optimisation i.e make sure
+that there is no -O2 or similar on the CFLAGS line of the Makefile &
+the emitted gcc commands, obviously this will produce worse code
+( not advisable for shipment ) but it is an aid to the debugging process.
+This aids debugging because the compiler will copy parameters passed in
+in registers onto the stack so backtracing & looking at passed in
+parameters will work, however some larger programs which use inline functions
+will not compile without optimisation.
+Debugging with optimisation has since much improved after fixing
+some bugs, please make sure you are using gdb-5.0 or later developed
+after Nov'2000.
+Figuring out gcc compile errors
+If you are getting a lot of syntax errors compiling a program & the problem
+isn't blatantly obvious from the source.
+It often helps to just preprocess the file, this is done with the -E
+option in gcc.
+What this does is that it runs through the very first phase of compilation
+( compilation in gcc is done in several stages & gcc calls many programs to
+achieve its end result ) with the -E option gcc just calls the gcc preprocessor (cpp).
+The c preprocessor does the following, it joins all the files #included together
+recursively ( #include files can #include other files ) & also the c file you wish to compile.
+It puts a fully qualified path of the #included files in a comment & it
+does macro expansion.
+This is useful for debugging because
+1) You can double check whether the files you expect to be included are the ones
+that are being included ( e.g. double check that you aren't going to the i386 asm directory ).
+2) Check that macro definitions aren't clashing with typedefs,
+3) Check that definitons aren't being used before they are being included.
+4) Helps put the line emitting the error under the microscope if it contains macros.
+For convenience the Linux kernel's makefile will do preprocessing automatically for you
+by suffixing the file you want built with .i ( instead of .o )
+from the linux directory type
+make arch/s390/kernel/signal.i
+this will build
+s390-gcc -D__KERNEL__ -I/home1/barrow/linux/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer
+-fno-strict-aliasing -D__SMP__ -pipe -fno-strength-reduce -E arch/s390/kernel/signal.c
+> arch/s390/kernel/signal.i
+Now look at signal.i you should see something like.
+# 1 "/home1/barrow/linux/include/asm/types.h" 1
+typedef unsigned short umode_t;
+typedef __signed__ char __s8;
+typedef unsigned char __u8;
+typedef __signed__ short __s16;
+typedef unsigned short __u16;
+If instead you are getting errors further down e.g.
+unknown instruction:2515 "move.l" or better still unknown instruction:2515
+"Fixme not implemented yet, call Martin" you are probably are attempting to compile some code
+meant for another architecture or code that is simply not implemented, with a fixme statement
+stuck into the inline assembly code so that the author of the file now knows he has work to do.
+To look at the assembly emitted by gcc just before it is about to call gas ( the gnu assembler )
+use the -S option.
+Again for your convenience the Linux kernel's Makefile will hold your hand &
+do all this donkey work for you also by building the file with the .s suffix.
+from the Linux directory type
+make arch/s390/kernel/signal.s
+s390-gcc -D__KERNEL__ -I/home1/barrow/linux/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer
+-fno-strict-aliasing -D__SMP__ -pipe -fno-strength-reduce -S arch/s390/kernel/signal.c
+-o arch/s390/kernel/signal.s
+This will output something like, ( please note the constant pool & the useful comments
+in the prologue to give you a hand at interpreting it ).
+ .string "misaligned (__u16 *) in __xchg\n"
+ .string "misaligned (__u32 *) in __xchg\n"
+.L$PG1: # Pool sys_sigsuspend
+ .long -262401
+ .long -1
+ .long schedule-.L$PG1
+ .long do_signal-.L$PG1
+ .align 4
+.globl sys_sigsuspend
+ .type sys_sigsuspend,@function
+# leaf function 0
+# automatics 16
+# outgoing args 0
+# need frame pointer 0
+# call alloca 0
+# has varargs 0
+# incoming args (stack) 0
+# function length 168
+ STM 8,15,32(15)
+ LR 0,15
+ AHI 15,-112
+ BASR 13,0
+.L$CO1: AHI 13,.L$PG1-.L$CO1
+ ST 0,0(15)
+ LR 8,2
+ N 5,.LC192-.L$PG1(13)
+Adding -g to the above output makes the output even more useful
+e.g. typing
+make CC:="s390-gcc -g" kernel/sched.s
+which compiles.
+s390-gcc -g -D__KERNEL__ -I/home/barrow/linux-2.3/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe -fno-strength-reduce -S kernel/sched.c -o kernel/sched.s
+also outputs stabs ( debugger ) info, from this info you can find out the
+offsets & sizes of various elements in structures.
+e.g. the stab for the structure
+struct rlimit {
+ unsigned long rlim_cur;
+ unsigned long rlim_max;
+.stabs "rlimit:T(151,2)=s8rlim_cur:(0,5),0,32;rlim_max:(0,5),32,32;;",128,0,0,0
+from this stab you can see that
+rlimit_cur starts at bit offset 0 & is 32 bits in size
+rlimit_max starts at bit offset 32 & is 32 bits in size.
+Debugging Tools:
+This is a tool with many options the most useful being ( if compiled with -g).
+objdump --source <victim program or object file> > <victims debug listing >
+The whole kernel can be compiled like this ( Doing this will make a 17MB kernel
+& a 200 MB listing ) however you have to strip it before building the image
+using the strip command to make it a more reasonable size to boot it.
+A source/assembly mixed dump of the kernel can be done with the line
+objdump --source vmlinux > vmlinux.lst
+Also if the file isn't compiled -g this will output as much debugging information
+as it can ( e.g. function names ), however, this is very slow as it spends lots
+of time searching for debugging info, the following self explanitory line should be used
+instead if the code isn't compiled -g.
+objdump --disassemble-all --syms vmlinux > vmlinux.lst
+as it is much faster
+As hard drive space is valuble most of us use the following approach.
+1) Look at the emitted psw on the console to find the crash address in the kernel.
+2) Look at the file ( in the linux directory ) produced when building
+the kernel to find the closest address less than the current PSW to find the
+offending function.
+3) use grep or similar to search the source tree looking for the source file
+ with this function if you don't know where it is.
+4) rebuild this object file with -g on, as an example suppose the file was
+( /arch/s390/kernel/signal.o )
+5) Assuming the file with the erroneous function is signal.c Move to the base of the
+Linux source tree.
+6) rm /arch/s390/kernel/signal.o
+7) make /arch/s390/kernel/signal.o
+8) watch the gcc command line emitted
+9) type it in again or alernatively cut & paste it on the console adding the -g option.
+10) objdump --source arch/s390/kernel/signal.o > signal.lst
+This will output the source & the assembly intermixed, as the snippet below shows
+This will unfortunately output addresses which aren't the same
+as the kernel ones you should be able to get around the mental arithmetic
+by playing with the --adjust-vma parameter to objdump.
+extern inline void spin_lock(spinlock_t *lp)
+ a0: 18 34 lr %r3,%r4
+ a2: a7 3a 03 bc ahi %r3,956
+ __asm__ __volatile(" lhi 1,-1\n"
+ a6: a7 18 ff ff lhi %r1,-1
+ aa: 1f 00 slr %r0,%r0
+ ac: ba 01 30 00 cs %r0,%r1,0(%r3)
+ b0: a7 44 ff fd jm aa <sys_sigsuspend+0x2e>
+ saveset = current->blocked;
+ b4: d2 07 f0 68 mvc 104(8,%r15),972(%r4)
+ b8: 43 cc
+ return (set->sig[0] & mask) != 0;
+6) If debugging under VM go down to that section in the document for more info.
+I now have a tool which takes the pain out of --adjust-vma
+& you are able to do something like
+make /arch/s390/kernel/traps.lst
+& it automatically generates the correctly relocated entries for
+the text segment in traps.lst.
+This tool is now standard in linux distro's in scripts/makelst
+Q. What is it ?
+A. It is a tool for intercepting calls to the kernel & logging them
+to a file & on the screen.
+Q. What use is it ?
+A. You can used it to find out what files a particular program opens.
+Example 1
+If you wanted to know does ping work but didn't have the source
+strace ping -c 1
+& then look at the man pages for each of the syscalls below,
+( In fact this is sometimes easier than looking at some spagetti
+source which conditionally compiles for several architectures )
+Not everything that it throws out needs to make sense immeadiately
+Just looking quickly you can see that it is making up a RAW socket
+for the ICMP protocol.
+Doing an alarm(10) for a 10 second timeout
+& doing a gettimeofday call before & after each read to see
+how long the replies took, & writing some text to stdout so the user
+has an idea what is going on.
+getuid() = 0
+setuid(0) = 0
+stat("/usr/share/locale/C/", 0xbffff134) = -1 ENOENT (No such file or directory)
+stat("/usr/share/locale/libc/C", 0xbffff134) = -1 ENOENT (No such file or directory)
+stat("/usr/local/share/locale/C/", 0xbffff134) = -1 ENOENT (No such file or directory)
+getpid() = 353
+setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
+setsockopt(3, SOL_SOCKET, SO_RCVBUF, [49152], 4) = 0
+fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(3, 1), ...}) = 0
+mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40008000
+ioctl(1, TCGETS, {B9600 opost isig icanon echo ...}) = 0
+write(1, "PING ( 56 d"..., 42PING ( 56 data bytes
+) = 42
+sigaction(SIGINT, {0x8049ba0, [], SA_RESTART}, {SIG_DFL}) = 0
+sigaction(SIGALRM, {0x8049600, [], SA_RESTART}, {SIG_DFL}) = 0
+gettimeofday({948904719, 138951}, NULL) = 0
+sendto(3, "\10\0D\201a\1\0\0\17#\2178\307\36"..., 64, 0, {sin_family=AF_INET,
+sin_port=htons(0), sin_addr=inet_addr("")}, 16) = 64
+sigaction(SIGALRM, {0x8049600, [], SA_RESTART}, {0x8049600, [], SA_RESTART}) = 0
+sigaction(SIGALRM, {0x8049ba0, [], SA_RESTART}, {0x8049600, [], SA_RESTART}) = 0
+alarm(10) = 0
+recvfrom(3, "E\0\0T\0005\0\0@\1|r\177\0\0\1\177"..., 192, 0,
+{sin_family=AF_INET, sin_port=htons(50882), sin_addr=inet_addr("")}, [16]) = 84
+gettimeofday({948904719, 160224}, NULL) = 0
+recvfrom(3, "E\0\0T\0006\0\0\377\1\275p\177\0"..., 192, 0,
+{sin_family=AF_INET, sin_port=htons(50882), sin_addr=inet_addr("")}, [16]) = 84
+gettimeofday({948904719, 166952}, NULL) = 0
+write(1, "64 bytes from icmp_se"...,
+5764 bytes from icmp_seq=0 ttl=255 time=28.0 ms
+Example 2
+strace passwd 2>&1 | grep open
+produces the following output
+open("/etc/", O_RDONLY) = 3
+open("/opt/kde/lib/", O_RDONLY) = -1 ENOENT (No such file or directory)
+open("/lib/", O_RDONLY) = 3
+open("/dev", O_RDONLY) = 3
+open("/var/run/utmp", O_RDONLY) = 3
+open("/etc/passwd", O_RDONLY) = 3
+open("/etc/shadow", O_RDONLY) = 3
+open("/etc/login.defs", O_RDONLY) = 4
+open("/dev/tty", O_RDONLY) = 4
+The 2>&1 is done to redirect stderr to stdout & grep is then filtering this input
+through the pipe for each line containing the string open.
+Example 3
+Getting sophistocated
+telnetd crashes on & I don't know why
+1) Replace the following line in /etc/inetd.conf
+telnet stream tcp nowait root /usr/sbin/in.telnetd -h
+telnet stream tcp nowait root /blah
+2) Create the file /blah with the following contents to start tracing telnetd
+/usr/bin/strace -o/t1 -f /usr/sbin/in.telnetd -h
+3) chmod 700 /blah to make it executable only to root
+killall -HUP inetd
+or ps aux | grep inetd
+get inetd's process id
+& kill -HUP inetd to restart it.
+Important options
+-o is used to tell strace to output to a file in our case t1 in the root directory
+-f is to follow children i.e.
+e.g in our case above telnetd will start the login process & subsequently a shell like bash.
+You will be able to tell which is which from the process ID's listed on the left hand side
+of the strace output.
+-p<pid> will tell strace to attach to a running process, yup this can be done provided
+ it isn't being traced or debugged already & you have enough privileges,
+the reason 2 processes cannot trace or debug the same program is that strace
+becomes the parent process of the one being debugged & processes ( unlike people )
+can have only one parent.
+However the file /t1 will get big quite quickly
+to test it telnet
+now look at what files in.telnetd execve'd
+413 execve("/usr/sbin/in.telnetd", ["/usr/sbin/in.telnetd", "-h"], [/* 17 vars */]) = 0
+414 execve("/bin/login", ["/bin/login", "-h", "localhost", "-p"], [/* 2 vars */]) = 0
+Whey it worked!.
+Other hints:
+If the program is not very interactive ( i.e. not much keyboard input )
+& is crashing in one architecture but not in another you can do
+an strace of both programs under as identical a scenario as you can
+on both architectures outputting to a file then.
+do a diff of the two traces using the diff program
+diff output1 output2
+& maybe you'll be able to see where the call paths differed, this
+is possibly near the cause of the crash.
+More info
+Look at man pages for strace & the various syscalls
+e.g. man strace, man alarm, man socket.
+Performance Debugging
+gcc is capible of compiling in profiling code just add the -p option
+to the CFLAGS, this obviously affects program size & performance.
+This can be used by the gprof gnu profiling tool or the
+gcov the gnu code coverage tool ( code coverage is a means of testing
+code quality by checking if all the code in an executable in exercised by
+a tester ).
+Using top to find out where processes are sleeping in the kernel
+To do this copy the from the root directory where
+the linux kernel was built to the /boot directory on your
+linux machine.
+Start top
+Now type fU<return>
+You should see a new field called WCHAN which
+tells you where each process is sleeping here is a typical output.
+ 6:59pm up 41 min, 1 user, load average: 0.00, 0.00, 0.00
+28 processes: 27 sleeping, 1 running, 0 zombie, 0 stopped
+CPU states: 0.0% user, 0.1% system, 0.0% nice, 99.8% idle
+Mem: 254900K av, 45976K used, 208924K free, 0K shrd, 28636K buff
+Swap: 0K av, 0K used, 0K free 8620K cached
+ 750 root 12 0 848 848 700 do_select S 0 0.1 0.3 0:00 in.telnetd
+ 767 root 16 0 1140 1140 964 R 0 0.1 0.4 0:00 top
+ 1 root 8 0 212 212 180 do_select S 0 0.0 0.0 0:00 init
+ 2 root 9 0 0 0 0 down_inte SW 0 0.0 0.0 0:00 kmcheck
+The time command
+Another related command is the time command which gives you an indication
+of where a process is spending the majority of its time.
+time ping -c 5 nc
+real 0m4.054s
+user 0m0.010s
+sys 0m0.010s
+Debugging under VM
+Addresses & values in the VM debugger are always hex never decimal
+Address ranges are of the format <HexValue1>-<HexValue2> or <HexValue1>.<HexValue2>
+e.g. The address range 0x2000 to 0x3000 can be described described as
+2000-3000 or 2000.1000
+The VM Debugger is case insensitive.
+VM's strengths are usually other debuggers weaknesses you can get at any resource
+no matter how sensitive e.g. memory management resources,change address translation
+in the PSW. For kernel hacking you will reap dividends if you get good at it.
+The VM Debugger displays operators but not operands, probably because some
+of it was written when memory was expensive & the programmer was probably proud that
+it fitted into 2k of memory & the programmers & didn't want to shock hardcore VM'ers by
+changing the interface :-), also the debugger displays useful information on the same line &
+the author of the code probably felt that it was a good idea not to go over
+the 80 columns on the screen.
+As some of you are probably in a panic now this isn't as unintuitive as it may seem
+as the 390 instructions are easy to decode mentally & you can make a good guess at a lot
+of them as all the operands are nibble ( half byte aligned ) & if you have an objdump listing
+also it is quite easy to follow, if you don't have an objdump listing keep a copy of
+the s/390 Reference Summary & look at between pages 2 & 7 or alternatively the
+s/390 principles of operation.
+e.g. even I can guess that
+0001AFF8' LR 180F CC 0
+is a ( load register ) lr r0,r15
+Also it is very easy to tell the length of a 390 instruction from the 2 most significant
+bits in the instruction ( not that this info is really useful except if you are trying to
+make sense of a hexdump of code ).
+Here is a table
+Bits Instruction Length
+00 2 Bytes
+01 4 Bytes
+10 4 Bytes
+11 6 Bytes
+The debugger also displays other useful info on the same line such as the
+addresses being operated on destination addresses of branches & condition codes.
+00019736' AHI A7DAFF0E CC 1
+000198BA' BRC A7840004 -> 000198C2' CC 0
+000198CE' STM 900EF068 >> 0FA95E78 CC 2
+Useful VM debugger commands
+I suppose I'd better mention this before I start
+to list the current active traces do
+there can be a maximum of 255 of these per set
+( more about trace sets later ).
+To stop traces issue a
+To delete a particular breakpoint issue
+TR DEL <breakpoint number>
+The PA1 key drops to CP mode so you can issue debugger commands,
+Doing alt c (on my 3270 console at least ) clears the screen.
+hitting b <enter> comes back to the running operating system
+from cp mode ( in our case linux ).
+It is typically useful to add shortcuts to your profile.exec file
+if you have one ( this is roughly equivalent to autoexec.bat in DOS ).
+file here are a few from mine.
+/* this gives me command history on issuing f12 */
+set pf12 retrieve
+/* this continues */
+set pf8 imm b
+/* goes to trace set a */
+set pf1 imm tr goto a
+/* goes to trace set b */
+set pf2 imm tr goto b
+/* goes to trace set c */
+set pf3 imm tr goto c
+Instruction Tracing
+Setting a simple breakpoint
+TR I PSWA <address>
+To debug a particular function try
+TR I R <function address range>
+TR I on its own will single step.
+TR I DATA <MNEMONIC> <OPTIONAL RANGE> will trace for particular mnemonics
+TR I DATA 4D R 0197BC.4000
+will trace for BAS'es ( opcode 4D ) in the range 0197BC.4000
+if you were inclined you could add traces for all branch instructions &
+suffix them with the run prefix so you would have a backtrace on screen
+when a program crashes.
+TR BR <INTO OR FROM> will trace branches into or out of an address.
+TR BR INTO 0 is often quite useful if a program is getting awkward & deciding
+to branch to 0 & crashing as this will stop at the address before in jumps to 0.
+TR I R <address range> RUN cmd d g
+single steps a range of addresses but stays running &
+displays the gprs on each step.
+Displaying & modifying Registers
+D G will display all the gprs
+Adding a extra G to all the commands is necessary to access the full 64 bit
+content in VM on z/Architecture obviously this isn't required for access registers
+as these are still 32 bit.
+e.g. DGG instead of DG
+D X will display all the control registers
+D AR will display all the access registers
+D AR4-7 will display access registers 4 to 7
+CPU ALL D G will display the GRPS of all CPUS in the configuration
+D PSW will display the current PSW
+st PSW 2000 will put the value 2000 into the PSW &
+cause crash your machine.
+D PREFIX displays the prefix offset
+Displaying Memory
+To display memory mapped using the current PSW's mapping try
+D <range>
+To make VM display a message each time it hits a particular address & continue try
+D I<range> will disassemble/display a range of instructions.
+ST addr 32 bit word will store a 32 bit aligned address
+D T<range> will display the EBCDIC in an address ( if you are that way inclined )
+D R<range> will display real addresses ( without DAT ) but with prefixing.
+There are other complex options to display if you need to get at say home space
+but are in primary space the easiest thing to do is to temporarily
+modify the PSW to the other addressing mode, display the stuff & then
+restore it.
+If you want to issue a debugger command without halting your virtual machine with the
+PA1 key try prefixing the command with #CP e.g.
+#cp tr i pswa 2000
+also suffixing most debugger commands with RUN will cause them not
+to stop just display the mnemonic at the current instruction on the console.
+If you have several breakpoints you want to put into your program &
+you get fed up of cross referencing with
+you can do the following trick for several symbols.
+grep do_signal
+which emits the following among other things
+0001f4e0 T do_signal
+now you can do
+TR I PSWA 0001f4e0 cmd msg * do_signal
+This sends a message to your own console each time do_signal is entered.
+( As an aside I wrote a perl script once which automatically generated a REXX
+script with breakpoints on every kernel procedure, this isn't a good idea
+because there are thousands of these routines & VM can only set 255 breakpoints
+at a time so you nearly had to spend as long pruning the file down as you would
+entering the msg's by hand ),however, the trick might be useful for a single object file.
+On linux'es 3270 emulator x3270 there is a very useful option under the file ment
+Save Screens In File this is very good of keeping a copy of traces.
+From CMS help <command name> will give you online help on a particular command.
+Also CP has a file called profile.exec which automatically gets called
+on startup of CMS ( like autoexec.bat ), keeping on a DOS analogy session
+CP has a feature similar to doskey, it may be useful for you to
+use profile.exec to define some keystrokes.
+This does a single step in VM on pressing F8.
+SET PF10 ^
+This sets up the ^ key.
+which can be used for ^c (ctrl-c),^z (ctrl-z) which can't be typed directly into some 3270 consoles.
+SET PF11 ^-
+This types the starting keystrokes for a sysrq see SysRq below.
+This retrieves command history on pressing F12.
+Sometimes in VM the display is set up to scroll automatically this
+can be very annoying if there are messages you wish to look at
+to stop this do
+TERM MORE 255 255
+This will nearly stop automatic screen updates, however it will
+cause a denial of service if lots of messages go to the 3270 console,
+so it would be foolish to use this as the default on a production machine.
+Tracing particular processes
+The kernel's text segment is intentionally at an address in memory that it will
+very seldom collide with text segments of user programs ( thanks Martin ),
+this simplifies debugging the kernel.
+However it is quite common for user processes to have addresses which collide
+this can make debugging a particular process under VM painful under normal
+circumstances as the process may change when doing a
+TR I R <address range>.
+Thankfully after reading VM's online help I figured out how to debug
+I particular process.
+Your first problem is to find the STD ( segment table designation )
+of the program you wish to debug.
+There are several ways you can do this here are a few
+1) objdump --syms <program to be debugged> | grep main
+To get the address of main in the program.
+tr i pswa <address of main>
+Start the program, if VM drops to CP on what looks like the entry
+point of the main function this is most likely the process you wish to debug.
+Now do a D X13 or D XG13 on z/Architecture.
+On 31 bit the STD is bits 1-19 ( the STO segment table origin )
+& 25-31 ( the STL segment table length ) of CR13.
+now type
+TR I R STD <CR13's value> 0.7fffffff
+TR I R STD 8F32E1FF 0.7fffffff
+Another very useful variation is
+TR STORE INTO STD <CR13's value> <address range>
+for finding out when a particular variable changes.
+An alternative way of finding the STD of a currently running process
+is to do the following, ( this method is more complex but
+could be quite convient if you aren't updating the kernel much &
+so your kernel structures will stay constant for a reasonable period of
+time ).
+grep task /proc/<pid>/status
+from this you should see something like
+task: 0f160000 ksp: 0f161de8 pt_regs: 0f161f68
+This now gives you a pointer to the task structure.
+Now make CC:="s390-gcc -g" kernel/sched.s
+To get the task_struct stabinfo.
+( task_struct is defined in include/linux/sched.h ).
+Now we want to look at
+on my machine the active_mm in the task structure stab is
+its offset is 672/8=84=0x54
+the pgd member in the mm_struct stab is
+so its offset is 96/8=12=0xc
+so we'll
+hexdump -s 0xf160054 /dev/mem | more
+i.e. task_struct+active_mm offset
+to look at the active_mm member
+f160054 0fee cc60 0019 e334 0000 0000 0000 0011
+hexdump -s 0x0feecc6c /dev/mem | more
+i.e. active_mm+pgd offset
+feecc6c 0f2c 0000 0000 0001 0000 0001 0000 0010
+we get something like
+now do
+TR I R STD <pgd|0x7f> 0.7fffffff
+i.e. the 0x7f is added because the pgd only
+gives the page table origin & we need to set the low bits
+to the maximum possible segment table length.
+TR I R STD 0f2c007f 0.7fffffff
+on z/Architecture you'll probably need to do
+TR I R STD <pgd|0x7> 0.ffffffffffffffff
+to set the TableType to 0x1 & the Table length to 3.
+Tracing Program Exceptions
+If you get a crash which says something like
+illegal operation or specification exception followed by a register dump
+You can restart linux & trace these using the tr prog <range or value> trace option.
+The most common ones you will normally be tracing for is
+1=operation exception
+2=privileged operation exception
+4=protection exception
+5=addressing exception
+6=specification exception
+10=segment translation exception
+11=page translation exception
+The full list of these is on page 22 of the current s/390 Reference Summary.
+tr prog 10 will trace segment translation exceptions.
+tr prog on its own will trace all program interruption codes.
+Trace Sets
+On starting VM you are initially in the INITIAL trace set.
+You can do a Q TR to verify this.
+If you have a complex tracing situation where you wish to wait for instance
+till a driver is open before you start tracing IO, but know in your
+heart that you are going to have to make several runs through the code till you
+have a clue whats going on.
+What you can do is
+TR I PSWA <Driver open address>
+hit b to continue till breakpoint
+reach the breakpoint
+now do your
+TR IO 7c08-7c09 inst int run
+or whatever the IO channels you wish to trace are & hit b
+To got back to the initial trace set do
+& the TR I PSWA <Driver open address> will be the only active breakpoint again.
+Tracing linux syscalls under VM
+Syscalls are implemented on Linux for S390 by the Supervisor call instruction (SVC) there 256
+possibilities of these as the instruction is made up of a 0xA opcode & the second byte being
+the syscall number. They are traced using the simple command.
+TR SVC <Optional value or range>
+the syscalls are defined in linux/include/asm-s390/unistd.h
+e.g. to trace all file opens just do
+TR SVC 5 ( as this is the syscall number of open )
+SMP Specific commands
+To find out how many cpus you have
+Q CPUS displays all the CPU's available to your virtual machine
+To find the cpu that the current cpu VM debugger commands are being directed at do
+Q CPU to change the current cpu cpu VM debugger commands are being directed at do
+CPU <desired cpu no>
+On a SMP guest issue a command to all CPUs try prefixing the command with cpu all.
+To issue a command to a particular cpu try cpu <cpu number> e.g.
+CPU 01 TR I R 2000.3000
+If you are running on a guest with several cpus & you have a IO related problem
+& cannot follow the flow of code but you know it isnt smp related.
+from the bash prompt issue
+shutdown -h now or halt.
+do a Q CPUS to find out how many cpus you have
+detach each one of them from cp except cpu 0
+by issuing a
+DETACH CPU 01-(number of cpus in configuration)
+& boot linux again.
+TR SIGP will trace inter processor signal processor instructions.
+DEFINE CPU 01-(number in configuration)
+will get your guests cpus back.
+Help for displaying ascii textstrings
+On the very latest VM Nucleus'es VM can now display ascii
+( thanks Neale for the hint ) by doing
+D TX<lowaddr>.<len>
+D TX0.100
+Under older VM debuggers ( I love EBDIC too ) you can use this little program I wrote which
+will convert a command line of hex digits to ascii text which can be compiled under linux &
+you can copy the hex digits from your x3270 terminal to your xterm if you are debugging
+from a linuxbox.
+This is quite useful when looking at a parameter passed in as a text string
+under VM ( unless you are good at decoding ASCII in your head ).
+e.g. consider tracing an open syscall
+We have stopped at a breakpoint
+000151B0' SVC 0A05 -> 0001909A' CC 0
+D 20.8 to check the SVC old psw in the prefix area & see was it from userspace
+( for the layout of the prefix area consult P18 of the s/390 390 Reference Summary
+if you have it available ).
+V00000020 070C2000 800151B2
+The problem state bit wasn't set & it's also too early in the boot sequence
+for it to be a userspace SVC if it was we would have to temporarily switch the
+psw to user space addressing so we could get at the first parameter of the open in
+Next do a
+D G2
+GPR 2 = 00014CB4
+Now display what gpr2 is pointing to
+D 00014CB4.20
+V00014CB4 2F646576 2F636F6E 736F6C65 00001BF5
+V00014CC4 FC00014C B4001001 E0001000 B8070707
+Now copy the text till the first 00 hex ( which is the end of the string
+to an xterm & do hex2ascii on it.
+hex2ascii 2F646576 2F636F6E 736F6C65 00
+Decoded Hex:=/ d e v / c o n s o l e 0x00
+We were opening the console device,
+You can compile the code below yourself for practice :-),
+ * hex2ascii.c
+ * a useful little tool for converting a hexadecimal command line to ascii
+ *
+ * Author(s): Denis Joseph Barrow (,
+ * (C) 2000 IBM Deutschland Entwicklung GmbH, IBM Corporation.
+ */
+#include <stdio.h>
+int main(int argc,char *argv[])
+ int cnt1,cnt2,len,toggle=0;
+ int startcnt=1;
+ unsigned char c,hex;
+ if(argc>1&&(strcmp(argv[1],"-a")==0))
+ startcnt=2;
+ printf("Decoded Hex:=");
+ for(cnt1=startcnt;cnt1<argc;cnt1++)
+ {
+ len=strlen(argv[cnt1]);
+ for(cnt2=0;cnt2<len;cnt2++)
+ {
+ c=argv[cnt1][cnt2];
+ if(c>='0'&&c<='9')
+ c=c-'0';
+ if(c>='A'&&c<='F')
+ c=c-'A'+10;
+ if(c>='a'&&c<='f')
+ c=c-'a'+10;
+ switch(toggle)
+ {
+ case 0:
+ hex=c<<4;
+ toggle=1;
+ break;
+ case 1:
+ hex+=c;
+ if(hex<32||hex>127)
+ {
+ if(startcnt==1)
+ printf("0x%02X ",(int)hex);
+ else
+ printf(".");
+ }
+ else
+ {
+ printf("%c",hex);
+ if(startcnt==1)
+ printf(" ");
+ }
+ toggle=0;
+ break;
+ }
+ }
+ }
+ printf("\n");
+Stack tracing under VM
+A basic backtrace
+Here are the tricks I use 9 out of 10 times it works pretty well,
+When your backchain reaches a dead end
+This can happen when an exception happens in the kernel & the kernel is entered twice
+if you reach the NULL pointer at the end of the back chain you should be
+able to sniff further back if you follow the following tricks.
+1) A kernel address should be easy to recognise since it is in
+primary space & the problem state bit isn't set & also
+The Hi bit of the address is set.
+2) Another backchain should also be easy to recognise since it is an
+address pointing to another address approximately 100 bytes or 0x70 hex
+behind the current stackpointer.
+Here is some practice.
+boot the kernel & hit PA1 at some random time
+d g to display the gprs, this should display something like
+GPR 0 = 00000001 00156018 0014359C 00000000
+GPR 4 = 00000001 001B8888 000003E0 00000000
+GPR 8 = 00100080 00100084 00000000 000FE000
+GPR 12 = 00010400 8001B2DC 8001B36A 000FFED8
+Note that GPR14 is a return address but as we are real men we are going to
+trace the stack.
+display 0x40 bytes after the stack pointer.
+V000FFED8 000FFF38 8001B838 80014C8E 000FFF38
+V000FFEE8 00000000 00000000 000003E0 00000000
+V000FFEF8 00100080 00100084 00000000 000FE000
+V000FFF08 00010400 8001B2DC 8001B36A 000FFED8
+Ah now look at whats in sp+56 (sp+0x38) this is 8001B36A our saved r14 if
+you look above at our stackframe & also agrees with GPR14.
+now backchain
+d 000FFF38.40
+we now are taking the contents of SP to get our first backchain.
+V000FFF38 000FFFA0 00000000 00014995 00147094
+V000FFF48 00147090 001470A0 000003E0 00000000
+V000FFF58 00100080 00100084 00000000 001BF1D0
+V000FFF68 00010400 800149BA 80014CA6 000FFF38
+This displays a 2nd return address of 80014CA6
+now do d 000FFFA0.40 for our 3rd backchain
+V000FFFA0 04B52002 0001107F 00000000 00000000
+V000FFFB0 00000000 00000000 FF000000 0001107F
+V000FFFC0 00000000 00000000 00000000 00000000
+V000FFFD0 00010400 80010802 8001085A 000FFFA0
+our 3rd return address is 8001085A
+as the 04B52002 looks suspiciously like rubbish it is fair to assume that the kernel entry routines
+for the sake of optimisation dont set up a backchain.
+now look at to see if the addresses make any sense.
+grep -i 0001b3
+outputs among other things
+0001b304 T cpu_idle
+so 8001B36A
+is cpu_idle+0x66 ( quiet the cpu is asleep, don't wake it )
+grep -i 00014
+produces among other things
+00014a78 T start_kernel
+so 0014CA6 is start_kernel+some hex number I can't add in my head.
+grep -i 00108
+this produces
+00010800 T _stext
+so 8001085A is _stext+0x5a
+Congrats you've done your first backchain.
+s/390 & z/Architecture IO Overview
+I am not going to give a course in 390 IO architecture as this would take me quite a
+while & I'm no expert. Instead I'll give a 390 IO architecture summary for Dummies if you have
+the s/390 principles of operation available read this instead. If nothing else you may find a few
+useful keywords in here & be able to use them on a web search engine like altavista to find
+more useful information.
+Unlike other bus architectures modern 390 systems do their IO using mostly
+fibre optics & devices such as tapes & disks can be shared between several mainframes,
+also S390 can support upto 65536 devices while a high end PC based system might be choking
+with around 64. Here is some of the common IO terminology
+This is the logical number most IO commands use to talk to an IO device there can be upto
+0x10000 (65536) of these in a configuration typically there is a few hundred. Under VM
+for simplicity they are allocated contiguously, however on the native hardware they are not
+they typically stay consistent between boots provided no new hardware is inserted or removed.
+Under Linux for 390 we use these as IRQ's & also when issuing an IO command (CLEAR SUBCHANNEL,
+TEST SUBCHANNEL ) we use this as the ID of the device we wish to talk to, the most
+important of these instructions are START SUBCHANNEL ( to start IO ), TEST SUBCHANNEL ( to check
+whether the IO completed successfully ), & HALT SUBCHANNEL ( to kill IO ), a subchannel
+can have up to 8 channel paths to a device this offers redunancy if one is not available.
+Device Number:
+This number remains static & Is closely tied to the hardware, there are 65536 of these
+also they are made up of a CHPID ( Channel Path ID, the most significant 8 bits )
+& another lsb 8 bits. These remain static even if more devices are inserted or removed
+from the hardware, there is a 1 to 1 mapping between Subchannels & Device Numbers provided
+devices arent inserted or removed.
+Channel Control Words:
+CCWS are linked lists of instructions initially pointed to by an operation request block (ORB),
+which is initially given to Start Subchannel (SSCH) command along with the subchannel number
+for the IO subsystem to process while the CPU continues executing normal code.
+These come in two flavours, Format 0 ( 24 bit for backward )
+compatibility & Format 1 ( 31 bit ). These are typically used to issue read & write
+( & many other instructions ) they consist of a length field & an absolute address field.
+For each IO typically get 1 or 2 interrupts one for channel end ( primary status ) when the
+channel is idle & the second for device end ( secondary status ) sometimes you get both
+concurrently, you check how the IO went on by issuing a TEST SUBCHANNEL at each interrupt,
+from which you receive an Interruption response block (IRB). If you get channel & device end
+status in the IRB without channel checks etc. your IO probably went okay. If you didn't you
+probably need a doctorto examine the IRB & extended status word etc.
+If an error occurs more sophistocated control units have a facitity known as
+concurrent sense this means that if an error occurs Extended sense information will
+be presented in the Extended status word in the IRB if not you have to issue a
+subsequent SENSE CCW command after the test subchannel.
+TPI( Test pending interrupt) can also be used for polled IO but in multitasking multiprocessor
+systems it isn't recommended except for checking special cases ( i.e. non looping checks for
+pending IO etc. ).
+Store Subchannel & Modify Subchannel can be used to examine & modify operating characteristics
+of a subchannel ( e.g. channel paths ).
+Other IO related Terms:
+Sysplex: S390's Clustering Technology
+QDIO: S390's new high speed IO architecture to support devices such as gigabit ethernet,
+this architecture is also designed to be forward compatible with up & coming 64 bit machines.
+General Concepts
+Input Output Processors (IOP's) are responsible for communicating between
+the mainframe CPU's & the channel & relieve the mainframe CPU's from the
+burden of communicating with IO devices directly, this allows the CPU's to
+concentrate on data processing.
+IOP's can use one or more links ( known as channel paths ) to talk to each
+IO device. It first checks for path availability & chooses an available one,
+then starts ( & sometimes terminates IO ).
+There are two types of channel path ESCON & the Paralell IO interface.
+IO devices are attached to control units, control units provide the
+logic to interface the channel paths & channel path IO protocols to
+the IO devices, they can be integrated with the devices or housed separately
+& often talk to several similar devices ( typical examples would be raid
+controllers or a control unit which connects to 1000 3270 terminals ).
+ +---------------------------------------------------------------+
+ | +-----+ +-----+ +-----+ +-----+ +----------+ +----------+ |
+ | | CPU | | CPU | | CPU | | CPU | | Main | | Expanded | |
+ | | | | | | | | | | Memory | | Storage | |
+ | +-----+ +-----+ +-----+ +-----+ +----------+ +----------+ |
+ |---------------------------------------------------------------+
+ | IOP | IOP | IOP |
+ |---------------------------------------------------------------
+ | C | C | C | C | C | C | C | C | C | C | C | C | C | C | C | C |
+ ----------------------------------------------------------------
+ || ||
+ || Bus & Tag Channel Path || ESCON
+ || ====================== || Channel
+ || || || || Path
+ +----------+ +----------+ +----------+
+ | | | | | |
+ | CU | | CU | | CU |
+ | | | | | |
+ +----------+ +----------+ +----------+
+ | | | | |
++----------+ +----------+ +----------+ +----------+ +----------+
+|I/O Device| |I/O Device| |I/O Device| |I/O Device| |I/O Device|
++----------+ +----------+ +----------+ +----------+ +----------+
+ CPU = Central Processing Unit
+ C = Channel
+ IOP = IP Processor
+ CU = Control Unit
+The 390 IO systems come in 2 flavours the current 390 machines support both
+The Older 360 & 370 Interface,sometimes called the paralell I/O interface,
+sometimes called Bus-and Tag & sometimes Original Equipment Manufacturers
+Interface (OEMI).
+This byte wide paralell channel path/bus has parity & data on the "Bus" cable
+& control lines on the "Tag" cable. These can operate in byte multiplex mode for
+sharing between several slow devices or burst mode & monopolize the channel for the
+whole burst. Upto 256 devices can be addressed on one of these cables. These cables are
+about one inch in diameter. The maximum unextended length supported by these cables is
+125 Meters but this can be extended up to 2km with a fibre optic channel extended
+such as a 3044. The maximum burst speed supported is 4.5 megabytes per second however
+some really old processors support only transfer rates of 3.0, 2.0 & 1.0 MB/sec.
+One of these paths can be daisy chained to up to 8 control units.
+ESCON if fibre optic it is also called FICON
+Was introduced by IBM in 1990. Has 2 fibre optic cables & uses either leds or lasers
+for communication at a signaling rate of upto 200 megabits/sec. As 10bits are transferred
+for every 8 bits info this drops to 160 megabits/sec & to 18.6 Megabytes/sec once
+control info & CRC are added. ESCON only operates in burst mode.
+ESCONs typical max cable length is 3km for the led version & 20km for the laser version
+known as XDF ( extended distance facility ). This can be further extended by using an
+ESCON director which triples the above mentioned ranges. Unlike Bus & Tag as ESCON is
+serial it uses a packet switching architecture the standard Bus & Tag control protocol
+is however present within the packets. Upto 256 devices can be attached to each control
+unit that uses one of these interfaces.
+Common 390 Devices include:
+Network adapters typically OSA2,3172's,2116's & OSA-E gigabit ethernet adapters,
+Consoles 3270 & 3215 ( a teletype emulated under linux for a line mode console ).
+DASD's direct access storage devices ( otherwise known as hard disks ).
+Tape Drives.
+CTC ( Channel to Channel Adapters ),
+ESCON or Paralell Cables used as a very high speed serial link
+between 2 machines. We use 2 cables under linux to do a bi-directional serial link.
+Debugging IO on s/390 & z/Architecture under VM
+Now we are ready to go on with IO tracing commands under VM
+A few self explanatory queries:
+Q DISK ( This command is CMS specific )
+Q OSA on my machine returns
+If you have a guest with certain priviliges you may be able to see devices
+which don't belong to you to avoid this do add the option V.
+Now using the device numbers returned by this command we will
+Trace the io starting up on the first device 7c08 & 7c09
+In our simplest case we can trace the
+start subchannels
+like TR SSCH 7C08-7C09
+or the halt subchannels
+or TR HSCH 7C08-7C09
+MSCH's ,STSCH's I think you can guess the rest
+Ingo's favourite trick is tracing all the IO's & CCWS & spooling them into the reader of another
+VM guest so he can ftp the logfile back to his own machine.I'll do a small bit of this & give you
+ a look at the output.
+1) Spool stdout to VM reader
+SP PRT TO (another vm guest ) or * for the local vm guest
+2) Fill the reader with the trace
+3) Start up linux
+i 00c
+4) Finish the trace
+5) close the reader
+6) list reader contents
+7) copy it to linux4's minidisk
+RECEIVE / LOG TXT A1 ( replace
+filel & press F11 to look at it
+You should see someting like.
+00020942' SSCH B2334000 0048813C CC 0 SCH 0000 DEV 7C08
+ CPA 000FFDF0 PARM 00E2C9C4 KEY 0 FPI C0 LPM 80
+ CCW 000FFDF0 E4200100 00487FE8 0000 E4240100 ........
+ IDAL 0FB76000
+00020B0A' I/O DEV 7C08 -> 000197BC' SCH 0000 PARM 00E2C9C4
+00021628' TSCH B2354000 >> 00488164 CC 0 SCH 0000 DEV 7C08
+ KEY 0 FPI C0 CC 0 CTLS 4007
+00022238' STSCH B2344000 >> 00488108 CC 0 SCH 0000 DEV 7C08
+If you don't like messing up your readed ( because you possibly booted from it )
+you can alternatively spool it to another readers guest.
+Other common VM device related commands
+These commands are listed only because they have
+been of use to me in the past & may be of use to
+you too. For more complete info on each of the commands
+use type HELP <command> from CMS.
+detaching devices
+DET <devno range>
+ATT <devno range> <guest>
+attach a device to guest * for your own guest
+READY <devno> cause VM to issue a fake interrupt.
+The VARY command is normally only available to VM administrators.
+VARY ON PATH <path> TO <devno range>
+VARY OFF PATH <PATH> FROM <devno range>
+This is used to switch on or off channel paths to devices.
+Q CHPID <channel path ID>
+This displays state of devices using this channel path
+D SCHIB <subchannel>
+This displays the subchannel information SCHIB block for the device.
+this I believe is also only available to administrators.
+DEFINE CTC <devno>
+defines a virtual CTC channel to channel connection
+2 need to be defined on each guest for the CTC driver to use.
+COUPLE devno userid remote devno
+Joins a local virtual device to a remote virtual device
+( commonly used for the CTC driver ).
+Building a VM ramdisk under CMS which linux can use
+def vfb-<blocksize> <subchannel> <number blocks>
+blocksize is commonly 4096 for linux.
+Formatting it
+format <subchannel> <driver letter e.g. x> (blksize <blocksize>
+Sharing a disk between multiple guests
+LINK userid devno1 devno2 mode password
+GDB on S390
+N.B. if compiling for debugging gdb works better without optimisation
+( see Compiling programs for debugging )
+gdb <victim program> <optional corefile>
+Online help
+help: gives help on commands
+help display
+Note gdb's online help is very good use it.
+info registers: displays registers other than floating point.
+info all-registers: displays floating points as well.
+disassemble: dissassembles
+disassemble without parameters will disassemble the current function
+disassemble $pc $pc+10
+Viewing & modifying variables
+print or p: displays variable or register
+e.g. p/x $sp will display the stack pointer
+display: prints variable or register each time program stops
+display/x $pc will display the program counter
+display argc
+undisplay : undo's display's
+info breakpoints: shows all current breakpoints
+info stack: shows stack back trace ( if this dosent work too well, I'll show you the
+stacktrace by hand below ).
+info locals: displays local variables.
+info args: display current procedure arguments.
+set args: will set argc & argv each time the victim program is invoked.
+set <variable>=value
+set argc=100
+set $pc=0
+Modifying execution
+step: steps n lines of sourcecode
+step steps 1 line.
+step 100 steps 100 lines of code.
+next: like step except this will not step into subroutines
+stepi: steps a single machine code instruction.
+e.g. stepi 100
+nexti: steps a single machine code instruction but will not step into subroutines.
+finish: will run until exit of the current routine
+run: (re)starts a program
+cont: continues a program
+quit: exits gdb.
+sets a breakpoint
+break main
+break *$pc
+break *0x400618
+heres a really useful one for large programs
+Set a breakpoint for all functions matching REGEXP
+rbr 390
+will set a breakpoint with all functions with 390 in their name.
+info breakpoints
+lists all breakpoints
+delete: delete breakpoint by number or delete them all
+delete 1 will delete the first breakpoint
+delete will delete them all
+watch: This will set a watchpoint ( usually hardware assisted ),
+This will watch a variable till it changes
+watch cnt, will watch the variable cnt till it changes.
+As an aside unfortunately gdb's, architecture independent watchpoint code
+is inconsistent & not very good, watchpoints usually work but not always.
+info watchpoints: Display currently active watchpoints
+condition: ( another useful one )
+Specify breakpoint number N to break only if COND is true.
+Usage is `condition N COND', where N is an integer and COND is an
+expression to be evaluated whenever breakpoint N is reached.
+User defined functions/macros
+define: ( Note this is very very useful,simple & powerful )
+usage define <name> <list of commands> end
+examples which you should consider putting into .gdbinit in your home directory
+define d
+disassemble $pc $pc+10
+define e
+disassemble $pc $pc+10
+Other hard to classify stuff
+signal n:
+sends the victim program a signal.
+e.g. signal 3 will send a SIGQUIT.
+info signals:
+what gdb does when the victim receives certain signals.
+list lists current function source
+list 1,10 list first 10 lines of curret file.
+list test.c:1,10
+Adds directories to be searched for source if gdb cannot find the source.
+(note it is a bit sensititive about slashes )
+e.g. To add the root of the filesystem to the searchpath do
+directory //
+call <function>
+This calls a function in the victim program, this is pretty powerful
+(gdb) call printf("hello world")
+$1 = 11
+You might now be thinking that the line above didn't work, something extra had to be done.
+(gdb) call fflush(stdout)
+hello world$2 = 0
+As an aside the debugger also calls malloc & free under the hood
+to make space for the "hello world" string.
+1) command completion works just like bash
+( if you are a bad typist like me this really helps )
+e.g. hit br <TAB> & cursor up & down :-).
+2) if you have a debugging problem that takes a few steps to recreate
+put the steps into a file called .gdbinit in your current working directory
+if you have defined a few extra useful user defined commands put these in
+your home directory & they will be read each time gdb is launched.
+A typical .gdbinit file might be.
+break main
+break runtime_exception
+stack chaining in gdb by hand
+This is done using a the same trick described for VM
+p/x (*($sp+56))&0x7fffffff get the first backchain.
+For z/Architecture
+Replace 56 with 112 & ignore the &0x7fffffff
+in the macros below & do nasty casts to longs like the following
+as gdb unfortunately deals with printed arguments as ints which
+messes up everything.
+i.e. here is a 3rd backchain dereference
+p/x *(long *)(***(long ***)$sp+112)
+this outputs
+$5 = 0x528f18
+on my machine.
+Now you can use
+info symbol (*($sp+56))&0x7fffffff
+you might see something like.
+rl_getc + 36 in section .text telling you what is located at address 0x528f18
+Now do.
+p/x (*(*$sp+56))&0x7fffffff
+This outputs
+$6 = 0x528ed0
+Now do.
+info symbol (*(*$sp+56))&0x7fffffff
+rl_read_key + 180 in section .text
+now do
+p/x (*(**$sp+56))&0x7fffffff
+& so on.
+Disassembling instructions without debug info
+gdb typically compains if there is a lack of debugging
+symbols in the disassemble command with
+"No function contains specified address." to get around
+this do
+x/<number lines to disassemble>xi <address>
+x/20xi 0x400730
+Note: Remember gdb has history just like bash you don't need to retype the
+whole line just use the up & down arrows.
+For more info
+From your linuxbox do
+man gdb or info gdb.
+core dumps
+What a core dump ?,
+A core dump is a file generated by the kernel ( if allowed ) which contains the registers,
+& all active pages of the program which has crashed.
+From this file gdb will allow you to look at the registers & stack trace & memory of the
+program as if it just crashed on your system, it is usually called core & created in the
+current working directory.
+This is very useful in that a customer can mail a core dump to a technical support department
+& the technical support department can reconstruct what happened.
+Provided the have an identical copy of this program with debugging symbols compiled in &
+the source base of this build is available.
+In short it is far more useful than something like a crash log could ever hope to be.
+In theory all that is missing to restart a core dumped program is a kernel patch which
+will do the following.
+1) Make a new kernel task structure
+2) Reload all the dumped pages back into the kernel's memory management structures.
+3) Do the required clock fixups
+4) Get all files & network connections for the process back into an identical state ( really difficult ).
+5) A few more difficult things I haven't thought of.
+Why have I never seen one ?.
+Probably because you haven't used the command
+ulimit -c unlimited in bash
+to allow core dumps, now do
+ulimit -a
+to verify that the limit was accepted.
+A sample core dump
+To create this I'm going to do
+ulimit -c unlimited
+to launch gdb (my victim app. ) now be bad & do the following from another
+telnet/xterm session to the same machine
+ps -aux | grep gdb
+kill -SIGSEGV <gdb's pid>
+or alternatively use killall -SIGSEGV gdb if you have the killall command.
+Now look at the core dump.
+./gdb ./gdb core
+Displays the following
+GNU gdb 4.18
+Copyright 1998 Free Software Foundation, Inc.
+GDB is free software, covered by the GNU General Public License, and you are
+welcome to change it and/or distribute copies of it under certain conditions.
+Type "show copying" to see the conditions.
+There is absolutely no warranty for GDB. Type "show warranty" for details.
+This GDB was configured as "s390-ibm-linux"...
+Core was generated by `./gdb'.
+Program terminated with signal 11, Segmentation fault.
+Reading symbols from /usr/lib/
+Reading symbols from /lib/
+Reading symbols from /lib/
+Reading symbols from /lib/
+#0 0x40126d1a in read () from /lib/
+Setting up the environment for debugging gdb.
+Breakpoint 1 at 0x4dc6f8: file utils.c, line 471.
+Breakpoint 2 at 0x4d87a4: file top.c, line 2609.
+(top-gdb) info stack
+#0 0x40126d1a in read () from /lib/
+#1 0x528f26 in rl_getc (stream=0x7ffffde8) at input.c:402
+#2 0x528ed0 in rl_read_key () at input.c:381
+#3 0x5167e6 in readline_internal_char () at readline.c:454
+#4 0x5168ee in readline_internal_charloop () at readline.c:507
+#5 0x51692c in readline_internal () at readline.c:521
+#6 0x5164fe in readline (prompt=0x7ffff810 "\177x\177\177x")
+ at readline.c:349
+#7 0x4d7a8a in command_line_input (prrompt=0x564420 "(gdb) ", repeat=1,
+ annotation_suffix=0x4d6b44 "prompt") at top.c:2091
+#8 0x4d6cf0 in command_loop () at top.c:1345
+#9 0x4e25bc in main (argc=1, argv=0x7ffffdf4) at main.c:635
+This is a program which lists the shared libraries which a library needs,
+Note you also get the relocations of the shared library text segments which
+help when using objdump --source.
+ ldd ./gdb
+outputs => /usr/lib/ (0x40018000) => /lib/ (0x4005e000) => /lib/ (0x40084000)
+/lib/ => /lib/ (0x40000000)
+Debugging shared libraries
+Most programs use shared libraries, however it can be very painful
+when you single step instruction into a function like printf for the
+first time & you end up in functions like _dl_runtime_resolve this is
+the doing lazy binding, lazy binding is a concept in ELF where
+shared library functions are not loaded into memory unless they are
+actually used, great for saving memory but a pain to debug.
+To get around this either relink the program -static or exit gdb type
+export LD_BIND_NOW=true this will stop lazy binding & restart the gdb'ing
+the program in question.
+Debugging modules
+As modules are dynamically loaded into the kernel their address can be
+anywhere to get around this use the -m option with insmod to emit a load
+map which can be piped into a file if required.
+The proc file system
+What is it ?.
+It is a filesystem created by the kernel with files which are created on demand
+by the kernel if read, or can be used to modify kernel parameters,
+it is a powerful concept.
+cat /proc/sys/net/ipv4/ip_forward
+On my machine outputs
+telling me ip_forwarding is not on to switch it on I can do
+echo 1 > /proc/sys/net/ipv4/ip_forward
+cat it again
+cat /proc/sys/net/ipv4/ip_forward
+On my machine now outputs
+IP forwarding is on.
+There is a lot of useful info in here best found by going in & having a look around,
+so I'll take you through some entries I consider important.
+All the processes running on the machine have there own entry defined by
+So lets have a look at the init process
+cd /proc/1
+cat cmdline
+init [2]
+cd /proc/1/fd
+This contains numerical entries of all the open files,
+some of these you can cat e.g. stdout (2)
+cat /proc/29/maps
+on my machine emits
+00400000-00478000 r-xp 00000000 5f:00 4103 /bin/bash
+00478000-0047e000 rw-p 00077000 5f:00 4103 /bin/bash
+0047e000-00492000 rwxp 00000000 00:00 0
+40000000-40015000 r-xp 00000000 5f:00 14382 /lib/
+40015000-40016000 rw-p 00014000 5f:00 14382 /lib/
+40016000-40017000 rwxp 00000000 00:00 0
+40017000-40018000 rw-p 00000000 00:00 0
+40018000-4001b000 r-xp 00000000 5f:00 14435 /lib/
+4001b000-4001c000 rw-p 00002000 5f:00 14435 /lib/
+4001c000-4010d000 r-xp 00000000 5f:00 14387 /lib/
+4010d000-40111000 rw-p 000f0000 5f:00 14387 /lib/
+40111000-40114000 rw-p 00000000 00:00 0
+40114000-4011e000 r-xp 00000000 5f:00 14408 /lib/
+4011e000-4011f000 rw-p 00009000 5f:00 14408 /lib/
+7fffd000-80000000 rwxp ffffe000 00:00 0
+Showing us the shared libraries init uses where they are in memory
+& memory access permissions for each virtual memory area.
+/proc/1/cwd is a softlink to the current working directory.
+/proc/1/root is the root of the filesystem for this process.
+/proc/1/mem is the current running processes memory which you
+can read & write to like a file.
+strace uses this sometimes as it is a bit faster than the
+rather inefficent ptrace interface for peeking at DATA.
+cat status
+Name: init
+State: S (sleeping)
+Pid: 1
+PPid: 0
+Uid: 0 0 0 0
+Gid: 0 0 0 0
+VmSize: 408 kB
+VmLck: 0 kB
+VmRSS: 208 kB
+VmData: 24 kB
+VmStk: 8 kB
+VmExe: 368 kB
+VmLib: 0 kB
+SigPnd: 0000000000000000
+SigBlk: 0000000000000000
+SigIgn: 7fffffffd7f0d8fc
+SigCgt: 00000000280b2603
+CapInh: 00000000fffffeff
+CapPrm: 00000000ffffffff
+CapEff: 00000000fffffeff
+User PSW: 070de000 80414146
+task: 004b6000 tss: 004b62d8 ksp: 004b7ca8 pt_regs: 004b7f68
+User GPRS:
+00000400 00000000 0000000b 7ffffa90
+00000000 00000000 00000000 0045d9f4
+0045cafc 7ffffa90 7fffff18 0045cb08
+00010400 804039e8 80403af8 7ffff8b0
+User ACRS:
+00000000 00000000 00000000 00000000
+00000001 00000000 00000000 00000000
+00000000 00000000 00000000 00000000
+00000000 00000000 00000000 00000000
+Kernel BackChain CallChain BackChain CallChain
+ 004b7ca8 8002bd0c 004b7d18 8002b92c
+ 004b7db8 8005cd50 004b7e38 8005d12a
+ 004b7f08 80019114
+Showing among other things memory usage & status of some signals &
+the processes'es registers from the kernel task_structure
+as well as a backchain which may be useful if a process crashes
+in the kernel for some unknown reason.
+Some driver debugging techniques
+debug feature
+Some of our drivers now support a "debug feature" in
+/proc/s390dbf see s390dbf.txt in the linux/Documentation directory
+for more info.
+to switch on the lcs "debug feature"
+echo 5 > /proc/s390dbf/lcs/level
+& then after the error occurred.
+cat /proc/s390dbf/lcs/sprintf >/logfile
+the logfile now contains some information which may help
+tech support resolve a problem in the field.
+high level debugging network drivers
+ifconfig is a quite useful command
+it gives the current state of network drivers.
+If you suspect your network device driver is dead
+one way to check is type
+ifconfig <network device>
+e.g. tr0
+You should see something like
+tr0 Link encap:16/4 Mbps Token Ring (New) HWaddr 00:04:AC:20:8E:48
+ inet addr: Bcast: Mask:
+ RX packets:246134 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:5 errors:0 dropped:0 overruns:0 carrier:0
+ collisions:0 txqueuelen:100
+if the device doesn't say up
+/etc/rc.d/init.d/network start
+( this starts the network stack & hopefully calls ifconfig tr0 up ).
+ifconfig looks at the output of /proc/net/dev & presents it in a more presentable form
+Now ping the device from a machine in the same subnet.
+if the RX packets count & TX packets counts don't increment you probably
+have problems.
+cat /proc/net/arp
+Do you see any hardware addresses in the cache if not you may have problems.
+Next try
+ping -c 5 <broadcast_addr> i.e. the Bcast field above in the output of
+ifconfig. Do you see any replies from machines other than the local machine
+if not you may have problems. also if the TX packets count in ifconfig
+hasn't incremented either you have serious problems in your driver
+(e.g. the txbusy field of the network device being stuck on )
+or you may have multiple network devices connected.
+There is a new device layer for channel devices, some
+drivers e.g. lcs are registered with this layer.
+If the device uses the channel device layer you'll be
+able to find what interrupts it uses & the current state
+of the device.
+See the manpage chandev.8 &type cat /proc/chandev for more info.
+Starting points for debugging scripting languages etc.
+bash -x <scriptname>
+e.g. bash -x /usr/bin/bashbug
+displays the following lines as it executes them.
++ MACHINE=i586
++ OS=linux-gnu
++ CC=gcc
++ CFLAGS= -DPROGRAM='bash' -DHOSTTYPE='i586' -DOSTYPE='linux-gnu' -DMACHTYPE='i586-pc-linux-gnu' -DSHELL -DHAVE_CONFIG_H -I. -I. -I./lib -O2 -pipe
++ RELEASE=2.01
++ RELSTATUS=release
++ MACHTYPE=i586-pc-linux-gnu
+perl -d <scriptname> runs the perlscript in a fully intercative debugger
+<like gdb>.
+Type 'h' in the debugger for help.
+for debugging java type
+jdb <filename> another fully interactive gdb style debugger.
+& type ? in the debugger for help.
+Dumptool & Lcrash ( lkcd )
+Michael Holzheu & others here at IBM have a fairly mature port of
+SGI's lcrash tool which allows one to look at kernel structures in a
+running kernel.
+It also complements a tool called dumptool which dumps all the kernel's
+memory pages & registers to either a tape or a disk.
+This can be used by tech support or an ambitious end user do
+post mortem debugging of a machine like gdb core dumps.
+Going into how to use this tool in detail will be explained
+in other documentation supplied by IBM with the patches & the
+lcrash homepage & the lcrash manpage.
+How they work
+Lcrash is a perfectly normal program,however, it requires 2
+additional files, Kerntypes which is built using a patch to the
+linux kernel sources in the linux root directory & the
+Kerntypes is an an objectfile whose sole purpose in life
+is to provide stabs debug info to lcrash, to do this
+Kerntypes is built from kerntypes.c which just includes the most commonly
+referenced header files used when debugging, lcrash can then read the
+.stabs section of this file.
+Debugging a live system it uses /dev/mem
+alternatively for post mortem debugging it uses the data
+collected by dumptool.
+This is now supported by linux for s/390 & z/Architecture.
+To enable it do compile the kernel with
+Kernel Hacking -> Magic SysRq Key Enabled
+echo "1" > /proc/sys/kernel/sysrq
+also type
+echo "8" >/proc/sys/kernel/printk
+To make printk output go to console.
+On 390 all commands are prefixed with
+^-t will show tasks.
+^-? or some unknown command will display help.
+The sysrq key reading is very picky ( I have to type the keys in an
+ xterm session & paste them into the x3270 console )
+& it may be wise to predefine the keys as described in the VM hints above
+This is particularly useful for syncing disks unmounting & rebooting
+if the machine gets partially hung.
+Read Documentation/sysrq.txt for more info
+Enterprise Systems Architecture Reference Summary
+Enterprise Systems Architecture Principles of Operation
+Hartmut Penners s390 stack frame sheet.
+IBM Mainframe Channel Attachment a technology brief from a CISCO webpage
+Various bits of man & info pages of Linux.
+Linux & GDB source.
+Various info & man pages.
+CMS Help on tracing commands.
+Linux for s/390 Elf Application Binary Interface
+Linux for z/Series Elf Application Binary Interface ( Both Highly Recommended )
+z/Architecture Principles of Operation SA22-7832-00
+Enterprise Systems Architecture/390 Reference Summary SA22-7209-01 & the
+Enterprise Systems Architecture/390 Principles of Operation SA22-7201-05
+Special Thanks
+Special thanks to Neale Ferguson who maintains a much
+prettier HTML version of this page at
+Bob Grainger Stefan Bader & others for reporting bugs
diff --git a/Documentation/s390/TAPE b/Documentation/s390/TAPE
new file mode 100644
index 000000000000..c639aa5603ff
--- /dev/null
+++ b/Documentation/s390/TAPE
@@ -0,0 +1,122 @@
+Channel attached Tape device driver
+This driver is considered to be EXPERIMENTAL. Do NOT use it in
+production environments. Feel free to test it and report problems back to us.
+The LINUX for zSeries tape device driver manages channel attached tape drives
+which are compatible to IBM 3480 or IBM 3490 magnetic tape subsystems. This
+includes various models of these devices (for example the 3490E).
+Tape driver features
+The device driver supports a maximum of 128 tape devices.
+No official LINUX device major number is assigned to the zSeries tape device
+driver. It allocates major numbers dynamically and reports them on system
+Typically it will get major number 254 for both the character device front-end
+and the block device front-end.
+The tape device driver needs no kernel parameters. All supported devices
+present are detected on driver initialization at system startup or module load.
+The devices detected are ordered by their subchannel numbers. The device with
+the lowest subchannel number becomes device 0, the next one will be device 1
+and so on.
+Tape character device front-end
+The usual way to read or write to the tape device is through the character
+device front-end. The zSeries tape device driver provides two character devices
+for each physical device -- the first of these will rewind automatically when
+it is closed, the second will not rewind automatically.
+The character device nodes are named /dev/rtibm0 (rewinding) and /dev/ntibm0
+(non-rewinding) for the first device, /dev/rtibm1 and /dev/ntibm1 for the
+second, and so on.
+The character device front-end can be used as any other LINUX tape device. You
+can write to it and read from it using LINUX facilities such as GNU tar. The
+tool mt can be used to perform control operations, such as rewinding the tape
+or skipping a file.
+Most LINUX tape software should work with either tape character device.
+Tape block device front-end
+The tape device may also be accessed as a block device in read-only mode.
+This could be used for software installation in the same way as it is used with
+other operation systems on the zSeries platform (and most LINUX
+distributions are shipped on compact disk using ISO9660 filesystems).
+One block device node is provided for each physical device. These are named
+/dev/btibm0 for the first device, /dev/btibm1 for the second and so on.
+You should only use the ISO9660 filesystem on LINUX for zSeries tapes because
+the physical tape devices cannot perform fast seeks and the ISO9660 system is
+optimized for this situation.
+Tape block device example
+In this example a tape with an ISO9660 filesystem is created using the first
+tape device. ISO9660 filesystem support must be built into your system kernel
+for this.
+The mt command is used to issue tape commands and the mkisofs command to
+create an ISO9660 filesystem:
+- create a LINUX directory (somedir) with the contents of the filesystem
+ mkdir somedir
+ cp contents somedir
+- insert a tape
+- ensure the tape is at the beginning
+ mt -f /dev/ntibm0 rewind
+- set the blocksize of the character driver. The blocksize 2048 bytes
+ is commonly used on ISO9660 CD-Roms
+ mt -f /dev/ntibm0 setblk 2048
+- write the filesystem to the character device driver
+ mkisofs -o /dev/ntibm0 somedir
+- rewind the tape again
+ mt -f /dev/ntibm0 rewind
+- Now you can mount your new filesystem as a block device:
+ mount -t iso9660 -o ro,block=2048 /dev/btibm0 /mnt
+TODO List
+ - Driver has to be stabilized still
+This driver is considered BETA, which means some weaknesses may still
+be in it.
+If an error occurs which cannot be handled by the code you will get a
+sense-data dump.In that case please do the following:
+1. set the tape driver debug level to maximum:
+ echo 6 >/proc/s390dbf/tape/level
+2. re-perform the actions which produced the bug. (Hopefully the bug will
+ reappear.)
+3. get a snapshot from the debug-feature:
+ cat /proc/s390dbf/tape/hex_ascii >somefile
+4. Now put the snapshot together with a detailed description of the situation
+ that led to the bug:
+ - Which tool did you use?
+ - Which hardware do you have?
+ - Was your tape unit online?
+ - Is it a shared tape unit?
+5. Send an email with your bug report to:
diff --git a/Documentation/s390/cds.txt b/Documentation/s390/cds.txt
new file mode 100644
index 000000000000..d9397170fb36
--- /dev/null
+++ b/Documentation/s390/cds.txt
@@ -0,0 +1,513 @@
+Linux for S/390 and zSeries
+Common Device Support (CDS)
+Device Driver I/O Support Routines
+Authors : Ingo Adlung
+ Cornelia Huck
+Copyright, IBM Corp. 1999-2002
+This document describes the common device support routines for Linux/390.
+Different than other hardware architectures, ESA/390 has defined a unified
+I/O access method. This gives relief to the device drivers as they don't
+have to deal with different bus types, polling versus interrupt
+processing, shared versus non-shared interrupt processing, DMA versus port
+I/O (PIO), and other hardware features more. However, this implies that
+either every single device driver needs to implement the hardware I/O
+attachment functionality itself, or the operating system provides for a
+unified method to access the hardware, providing all the functionality that
+every single device driver would have to provide itself.
+The document does not intend to explain the ESA/390 hardware architecture in
+every detail.This information can be obtained from the ESA/390 Principles of
+Operation manual (IBM Form. No. SA22-7201).
+In order to build common device support for ESA/390 I/O interfaces, a
+functional layer was introduced that provides generic I/O access methods to
+the hardware.
+The common device support layer comprises the I/O support routines defined
+below. Some of them implement common Linux device driver interfaces, while
+some of them are ESA/390 platform specific.
+In order to write a driver for S/390, you also need to look into the interface
+described in Documentation/s390/driver-model.txt.
+Note for porting drivers from 2.4:
+The major changes are:
+* The functions use a ccw_device instead of an irq (subchannel).
+* All drivers must define a ccw_driver (see driver-model.txt) and the associated
+ functions.
+* request_irq() and free_irq() are no longer done by the driver.
+* The oper_handler is (kindof) replaced by the probe() and set_online() functions
+ of the ccw_driver.
+* The not_oper_handler is (kindof) replaced by the remove() and set_offline()
+ functions of the ccw_driver.
+* The channel device layer is gone.
+* The interrupt handlers must be adapted to use a ccw_device as argument.
+ Moreover, they don't return a devstat, but an irb.
+* Before initiating an io, the options must be set via ccw_device_set_options().
+ read device characteristics
+ read configuration data.
+ get commands from extended sense data.
+ initiate an I/O request.
+ resume channel program execution.
+ terminate the current I/O request processed on the device.
+ generic interrupt routine. This function is called by the interrupt entry
+ routine whenever an I/O interrupt is presented to the system. The do_IRQ()
+ routine determines the interrupt status and calls the device specific
+ interrupt handler according to the rules (flags) defined during I/O request
+ initiation with do_IO().
+The next chapters describe the functions other than do_IRQ() in more details.
+The do_IRQ() interface is not described, as it is called from the Linux/390
+first level interrupt handler only and does not comprise a device driver
+callable interface. Instead, the functional description of do_IO() also
+describes the input to the device specific interrupt handler.
+Note: All explanations apply also to the 64 bit architecture s390x.
+Common Device Support (CDS) for Linux/390 Device Drivers
+General Information
+The following chapters describe the I/O related interface routines the
+Linux/390 common device support (CDS) provides to allow for device specific
+driver implementations on the IBM ESA/390 hardware platform. Those interfaces
+intend to provide the functionality required by every device driver
+implementaion to allow to drive a specific hardware device on the ESA/390
+platform. Some of the interface routines are specific to Linux/390 and some
+of them can be found on other Linux platforms implementations too.
+Miscellaneous function prototypes, data declarations, and macro definitions
+can be found in the architecture specific C header file
+Overview of CDS interface concepts
+Different to other hardware platforms, the ESA/390 architecture doesn't define
+interrupt lines managed by a specific interrupt controller and bus systems
+that may or may not allow for shared interrupts, DMA processing, etc.. Instead,
+the ESA/390 architecture has implemented a so called channel subsystem, that
+provides a unified view of the devices physically attached to the systems.
+Though the ESA/390 hardware platform knows about a huge variety of different
+peripheral attachments like disk devices (aka. DASDs), tapes, communication
+controllers, etc. they can all by accessed by a well defined access method and
+they are presenting I/O completion a unified way : I/O interruptions. Every
+single device is uniquely identified to the system by a so called subchannel,
+where the ESA/390 architecture allows for 64k devices be attached.
+Linux, however, was first built on the Intel PC architecture, with its two
+cascaded 8259 programmable interrupt controllers (PICs), that allow for a
+maximum of 15 different interrupt lines. All devices attached to such a system
+share those 15 interrupt levels. Devices attached to the ISA bus system must
+not share interrupt levels (aka. IRQs), as the ISA bus bases on edge triggered
+interrupts. MCA, EISA, PCI and other bus systems base on level triggered
+interrupts, and therewith allow for shared IRQs. However, if multiple devices
+present their hardware status by the same (shared) IRQ, the operating system
+has to call every single device driver registered on this IRQ in order to
+determine the device driver owning the device that raised the interrupt.
+In order not to introduce a new I/O concept to the common Linux code,
+Linux/390 preserves the IRQ concept and semantically maps the ESA/390
+subchannels to Linux as IRQs. This allows Linux/390 to support up to 64k
+different IRQs, uniquely representig a single device each.
+Up to kernel 2.4, Linux/390 used to provide interfaces via the IRQ (subchannel).
+For internal use of the common I/O layer, these are still there. However,
+device drivers should use the new calling interface via the ccw_device only.
+During its startup the Linux/390 system checks for peripheral devices. Each
+of those devices is uniquely defined by a so called subchannel by the ESA/390
+channel subsystem. While the subchannel numbers are system generated, each
+subchannel also takes a user defined attribute, the so called device number.
+Both subchannel number and device number can not exceed 65535. During driverfs
+initialisation, the information about control unit type and device types that
+imply specific I/O commands (channel command words - CCWs) in order to operate
+the device are gathered. Device drivers can retrieve this set of hardware
+information during their initialization step to recognize the devices they
+support using the information saved in the struct ccw_device given to them.
+This methods implies that Linux/390 doesn't require to probe for free (not
+armed) interrupt request lines (IRQs) to drive its devices with. Where
+applicable, the device drivers can use the read_dev_chars() to retrieve device
+characteristics. This can be done without having to request device ownership
+In order to allow for easy I/O initiation the CDS layer provides a
+ccw_device_start() interface that takes a device specific channel program (one
+or more CCWs) as input sets up the required architecture specific control blocks
+and initiates an I/O request on behalf of the device driver. The
+ccw_device_start() routine allows to specify whether it expects the CDS layer
+to notify the device driver for every interrupt it observes, or with final status
+only. See ccw_device_start() for more details. A device driver must never issue
+ESA/390 I/O commands itself, but must use the Linux/390 CDS interfaces instead.
+For long running I/O request to be canceled, the CDS layer provides the
+ccw_device_halt() function. Some devices require to initially issue a HALT
+SUBCHANNEL (HSCH) command without having pending I/O requests. This function is
+also covered by ccw_device_halt().
+read_dev_chars() - Read Device Characteristics
+This routine returns the characteristics for the device specified.
+The function is meant to be called with an irq handler in place; that is,
+at earliest during set_online() processing.
+While the request is procesed synchronously, the device interrupt
+handler is called for final ending status. In case of error situations the
+interrupt handler may recover appropriately. The device irq handler can
+recognize the corresponding interrupts by the interruption parameter be
+0x00524443.The ccw_device must not be locked prior to calling read_dev_chars().
+The function may be called enabled or disabled.
+int read_dev_chars(struct ccw_device *cdev, void **buffer, int length );
+cdev - the ccw_device the information is requested for.
+buffer - pointer to a buffer pointer. The buffer pointer itself
+ must contain a valid buffer area.
+length - length of the buffer provided.
+The read_dev_chars() function returns :
+ 0 - successful completion
+-ENODEV - cdev invalid
+-EINVAL - an invalid parameter was detected, or the function was called early.
+-EBUSY - an irrecoverable I/O error occurred or the device is not
+ operational.
+read_conf_data() - Read Configuration Data
+Retrieve the device dependent configuration data. Please have a look at your
+device dependent I/O commands for the device specific layout of the node
+descriptor elements.
+The function is meant to be called with an irq handler in place; that is,
+at earliest during set_online() processing.
+The function may be called enabled or disabled, but the device must not be
+int read_conf_data(struct ccw_device, void **buffer, int *length, __u8 lpm);
+cdev - the ccw_device the data is requested for.
+buffer - Pointer to a buffer pointer. The read_conf_data() routine
+ will allocate a buffer and initialize the buffer pointer
+ accordingly. It's the device driver's responsibility to
+ release the kernel memory if no longer needed.
+length - Length of the buffer allocated and retrieved.
+lpm - Logical path mask to be used for retrieving the data. If
+ zero the data is retrieved on the next path available.
+The read_conf_data() function returns :
+ 0 - Successful completion
+-ENODEV - cdev invalid.
+-EINVAL - An invalid parameter was detected, or the function was called early.
+-EIO - An irrecoverable I/O error occurred or the device is
+ not operational.
+-ENOMEM - The read_conf_data() routine couldn't obtain storage.
+-EOPNOTSUPP - The device doesn't support the read configuration
+ data command.
+get_ciw() - get command information word
+This call enables a device driver to get information about supported commands
+from the extended SenseID data.
+struct ciw *
+ccw_device_get_ciw(struct ccw_device *cdev, __u32 cmd);
+cdev - The ccw_device for which the command is to be retrieved.
+cmd - The command type to be retrieved.
+ccw_device_get_ciw() returns:
+NULL - No extended data available, invalid device or command not found.
+!NULL - The command requested.
+ccw_device_start() - Initiate I/O Request
+The ccw_device_start() routines is the I/O request front-end processor. All
+device driver I/O requests must be issued using this routine. A device driver
+must not issue ESA/390 I/O commands itself. Instead the ccw_device_start()
+routine provides all interfaces required to drive arbitrary devices.
+This description also covers the status information passed to the device
+driver's interrupt handler as this is related to the rules (flags) defined
+with the associated I/O request when calling ccw_device_start().
+int ccw_device_start(struct ccw_device *cdev,
+ struct ccw1 *cpa,
+ unsigned long intparm,
+ __u8 lpm,
+ unsigned long flags);
+cdev : ccw_device the I/O is destined for
+cpa : logical start address of channel program
+user_intparm : user specific interrupt information; will be presented
+ back to the device driver's interrupt handler. Allows a
+ device driver to associate the interrupt with a
+ particular I/O request.
+lpm : defines the channel path to be used for a specific I/O
+ request. A value of 0 will make cio use the opm.
+flag : defines the action to be performed for I/O processing
+Possible flag values are :
+DOIO_ALLOW_SUSPEND - channel program may become suspended
+DOIO_DENY_PREFETCH - don't allow for CCW prefetch; usually
+ this implies the channel program might
+ become modified
+DOIO_SUPPRESS_INTER - don't call the handler on intermediate status
+The cpa parameter points to the first format 1 CCW of a channel program :
+struct ccw1 {
+ __u8 cmd_code;/* command code */
+ __u8 flags; /* flags, like IDA addressing, etc. */
+ __u16 count; /* byte count */
+ __u32 cda; /* data address */
+} __attribute__ ((packed,aligned(8)));
+with the following CCW flags values defined :
+CCW_FLAG_DC - data chaining
+CCW_FLAG_CC - command chaining
+CCW_FLAG_SLI - suppress incorrct length
+CCW_FLAG_IDA - indirect addressing
+Via ccw_device_set_options(), the device driver may specify the following
+options for the device:
+DOIO_EARLY_NOTIFICATION - allow for early interrupt notification
+DOIO_REPORT_ALL - report all interrupt conditions
+The ccw_device_start() function returns :
+ 0 - successful completion or request successfully initiated
+-EBUSY - The device is currently processing a previous I/O request, or ther is
+ a status pending at the device.
+-ENODEV - cdev is invalid, the device is not operational or the ccw_device is
+ not online.
+When the I/O request completes, the CDS first level interrupt handler will
+accumalate the status in a struct irb and then call the device interrupt handler.
+The intparm field will contain the value the device driver has associated with a
+particular I/O request. If a pending device status was recognized,
+intparm will be set to 0 (zero). This may happen during I/O initiation or delayed
+by an alert status notification. In any case this status is not related to the
+current (last) I/O request. In case of a delayed status notification no special
+interrupt will be presented to indicate I/O completion as the I/O request was
+never started, even though ccw_device_start() returned with successful completion.
+If the concurrent sense flag in the extended status word in the irb is set, the
+field irb->scsw.count describes the numer of device specific sense bytes
+available in the extended control word irb->scsw.ecw[0]. No device sensing by
+the device driver itself is required.
+The device interrupt handler can use the following definitions to investigate
+the primary unit check source coded in sense byte 0 :
+Depending on the device status, multiple of those values may be set together.
+Please refer to the device specific documentation for details.
+The irb->scsw.cstat field provides the (accumulated) subchannel status :
+SCHN_STAT_PCI - program controlled interrupt
+SCHN_STAT_INCORR_LEN - incorrect length
+SCHN_STAT_PROG_CHECK - program check
+SCHN_STAT_PROT_CHECK - protection check
+SCHN_STAT_CHN_DATA_CHK - channel data check
+SCHN_STAT_CHN_CTRL_CHK - channel control check
+SCHN_STAT_INTF_CTRL_CHK - interface control check
+SCHN_STAT_CHAIN_CHECK - chaining check
+The irb->scsw.dstat field provides the (accumulated) device status :
+DEV_STAT_STAT_MOD - status modifier
+DEV_STAT_CU_END - control unit end
+DEV_STAT_CHN_END - channel end
+DEV_STAT_DEV_END - device end
+DEV_STAT_UNIT_CHECK - unit check
+DEV_STAT_UNIT_EXCEP - unit exception
+Please see the ESA/390 Principles of Operation manual for details on the
+individual flag meanings.
+Usage Notes :
+Prior to call ccw_device_start() the device driver must assure disabled state,
+i.e. the I/O mask value in the PSW must be disabled. This can be accomplished
+by calling local_save_flags( flags). The current PSW flags are preserved and
+can be restored by local_irq_restore( flags) at a later time.
+If the device driver violates this rule while running in a uni-processor
+environment an interrupt might be presented prior to the ccw_device_start()
+routine returning to the device driver main path. In this case we will end in a
+deadlock situation as the interrupt handler will try to obtain the irq
+lock the device driver still owns (see below) !
+The driver must assure to hold the device specific lock. This can be
+accomplished by
+(i) spin_lock(get_ccwdev_lock(cdev)), or
+(ii) spin_lock_irqsave(get_ccwdev_lock(cdev), flags)
+Option (i) should be used if the calling routine is running disabled for
+I/O interrupts (see above) already. Option (ii) obtains the device gate und
+puts the CPU into I/O disabled state by preserving the current PSW flags.
+The device driver is allowed to issue the next ccw_device_start() call from
+within its interrupt handler already. It is not required to schedule a
+bottom-half, unless an non deterministicly long running error recovery procedure
+or similar needs to be scheduled. During I/O processing the Linux/390 generic
+I/O device driver support has already obtained the IRQ lock, i.e. the handler
+must not try to obtain it again when calling ccw_device_start() or we end in a
+deadlock situation!
+If a device driver relies on an I/O request to be completed prior to start the
+next it can reduce I/O processing overhead by chaining a NoOp I/O command
+CCW_CMD_NOOP to the end of the submitted CCW chain. This will force Channel-End
+and Device-End status to be presented together, with a single interrupt.
+However, this should be used with care as it implies the channel will remain
+busy, not being able to process I/O requests for other devices on the same
+channel. Therefore e.g. read commands should never use this technique, as the
+result will be presented by a single interrupt anyway.
+In order to minimize I/O overhead, a device driver should use the
+DOIO_REPORT_ALL only if the device can report intermediate interrupt
+information prior to device-end the device driver urgently relies on. In this
+case all I/O interruptions are presented to the device driver until final
+status is recognized.
+If a device is able to recover from asynchronosly presented I/O errors, it can
+perform overlapping I/O using the DOIO_EARLY_NOTIFICATION flag. While some
+devices always report channel-end and device-end together, with a single
+interrupt, others present primary status (channel-end) when the channel is
+ready for the next I/O request and secondary status (device-end) when the data
+transmission has been completed at the device.
+Above flag allows to exploit this feature, e.g. for communication devices that
+can handle lost data on the network to allow for enhanced I/O processing.
+Unless the channel subsystem at any time presents a secondary status interrupt,
+exploiting this feature will cause only primary status interrupts to be
+presented to the device driver while overlapping I/O is performed. When a
+secondary status without error (alert status) is presented, this indicates
+successful completion for all overlapping ccw_device_start() requests that have
+been issued since the last secondary (final) status.
+Channel programs that intend to set the suspend flag on a channel command word
+(CCW) must start the I/O operation with the DOIO_ALLOW_SUSPEND option or the
+suspend flag will cause a channel program check. At the time the channel program
+becomes suspended an intermediate interrupt will be generated by the channel
+ccw_device_resume() - Resume Channel Program Execution
+If a device driver chooses to suspend the current channel program execution by
+setting the CCW suspend flag on a particular CCW, the channel program execution
+is suspended. In order to resume channel program execution the CIO layer
+provides the ccw_device_resume() routine.
+int ccw_device_resume(struct ccw_device *cdev);
+cdev - ccw_device the resume operation is requested for
+The resume_IO() function returns:
+ 0 - suspended channel program is resumed
+-EBUSY - status pending
+-ENODEV - cdev invalid or not-operational subchannel
+-EINVAL - resume function not applicable
+-ENOTCONN - there is no I/O request pending for completion
+Usage Notes:
+Please have a look at the ccw_device_start() usage notes for more details on
+suspended channel programs.
+ccw_device_halt() - Halt I/O Request Processing
+Sometimes a device driver might need a possibility to stop the processing of
+a long-running channel program or the device might require to initially issue
+a halt subchannel (HSCH) I/O command. For those purposes the ccw_device_halt()
+command is provided.
+int ccw_device_halt(struct ccw_device *cdev,
+ unsigned long intparm);
+cdev : ccw_device the halt operation is requested for
+intparm : interruption parameter; value is only used if no I/O
+ is outstanding, otherwise the intparm associated with
+ the I/O request is returned
+The ccw_device_halt() function returns :
+ 0 - successful completion or request successfully initiated
+-EBUSY - the device is currently busy, or status pending.
+-ENODEV - cdev invalid.
+-EINVAL - The device is not operational or the ccw device is not online.
+Usage Notes :
+A device driver may write a never-ending channel program by writing a channel
+program that at its end loops back to its beginning by means of a transfer in
+channel (TIC) command (CCW_CMD_TIC). Usually this is performed by network
+device drivers by setting the PCI CCW flag (CCW_FLAG_PCI). Once this CCW is
+executed a program controlled interrupt (PCI) is generated. The device driver
+can then perform an appropriate action. Prior to interrupt of an outstanding
+read to a network device (with or without PCI flag) a ccw_device_halt()
+is required to end the pending operation.
+Miscellaneous Support Routines
+This chapter describes various routines to be used in a Linux/390 device
+driver programming environment.
+Get the address of the device specific lock. This is then used in
+spin_lock() / spin_unlock() calls.
+__u8 ccw_device_get_path_mask(struct ccw_device *cdev);
+Get the mask of the path currently available for cdev.
diff --git a/Documentation/s390/ b/Documentation/s390/
new file mode 100644
index 000000000000..515e2f431487
--- /dev/null
+++ b/Documentation/s390/
@@ -0,0 +1,76 @@
+# config3270 -- Autoconfigure /dev/3270/* and /etc/inittab
+# Usage:
+# config3270
+# Output:
+# /tmp/mkdev3270
+# Operation:
+# 1. Run this script
+# 2. Run the script it produces: /tmp/mkdev3270
+# 3. Issue "telinit q" or reboot, as appropriate.
+ADDNOTE=\\"# Additional mingettys for the 3270/tty* driver, tub3270 ---\\"
+if ! ls $P > /dev/null 2>&1; then
+ modprobe tub3270 > /dev/null 2>&1
+ls $P > /dev/null 2>&1 || exit 1
+# Initialize two files, one for /dev/3270 commands and one
+# to replace the /etc/inittab file (old one saved in OLDinittab)
+echo "#!/bin/sh" > $SCR || exit 1
+echo " " >> $SCR
+echo "# Script built by /sbin/config3270" >> $SCR
+if [ ! -d /dev/dasd ]; then
+ echo rm -rf "$D/$SUBD/*" >> $SCR
+echo "grep -v $TTY $INITTAB > $NINITTAB" > $SCRTMP || exit 1
+echo "echo $ADDNOTE >> $NINITTAB" >> $SCRTMP
+if [ ! -d /dev/dasd ]; then
+ echo mkdir -p $D/$SUBD >> $SCR
+# Now query the tub3270 driver for 3270 device information
+# and add appropriate mknod and mingetty lines to our files
+echo what=config > $P
+while read devno maj min;do
+ if [ $min = 0 ]; then
+ fsmaj=$maj
+ if [ ! -d /dev/dasd ]; then
+ echo mknod $D/$TUB c $fsmaj 0 >> $SCR
+ echo chmod 666 $D/$TUB >> $SCR
+ fi
+ elif [ $maj = CONSOLE ]; then
+ if [ ! -d /dev/dasd ]; then
+ echo mknod $D/$TUB$devno c $fsmaj $min >> $SCR
+ fi
+ else
+ if [ ! -d /dev/dasd ]; then
+ echo mknod $D/$TTY$devno c $maj $min >>$SCR
+ echo mknod $D/$TUB$devno c $fsmaj $min >> $SCR
+ fi
+ echo "echo t$min$GETTYLINE $TTY$devno >> $NINITTAB" >> $SCRTMP
+ fi
+done < $P
+echo mv $INITTAB $OINITTAB >> $SCRTMP || exit 1
+cat $SCRTMP >> $SCR
+exit 0
diff --git a/Documentation/s390/crypto/crypto-API.txt b/Documentation/s390/crypto/crypto-API.txt
new file mode 100644
index 000000000000..78a77624a716
--- /dev/null
+++ b/Documentation/s390/crypto/crypto-API.txt
@@ -0,0 +1,83 @@
+crypto-API support for z990 Message Security Assist (MSA) instructions
+AUTHOR: Thomas Spatzier (
+1. Introduction crypto-API
+See Documentation/crypto/api-intro.txt for an introduction/description of the
+kernel crypto API.
+According to api-intro.txt support for z990 crypto instructions has been added
+in the algorithm api layer of the crypto API. Several files containing z990
+optimized implementations of crypto algorithms are placed in the
+arch/s390/crypto directory.
+2. Probing for availability of MSA
+It should be possible to use Kernels with the z990 crypto implementations both
+on machines with MSA available an on those without MSA (pre z990 or z990
+without MSA). Therefore a simple probing mechanisms has been implemented:
+In the init function of each crypto module the availability of MSA and of the
+respective crypto algorithm in particular will be tested. If the algorithm is
+available the module will load and register its algorithm with the crypto API.
+If the respective crypto algorithm is not available, the init function will
+return -ENOSYS. In that case a fallback to the standard software implementation
+of the crypto algorithm must be taken ( -> the standard crypto modules are
+also build when compiling the kernel).
+3. Ensuring z990 crypto module preference
+If z990 crypto instructions are available the optimized modules should be
+preferred instead of standard modules.
+3.1. compiled-in modules
+For compiled-in modules it has to be ensured that the z990 modules are linked
+before the standard crypto modules. Then, on system startup the init functions
+of z990 crypto modules will be called first and query for availability of z990
+crypto instructions. If instruction is available, the z990 module will register
+its crypto algorithm implementation -> the load of the standard module will fail
+since the algorithm is already registered.
+If z990 crypto instruction is not available the load of the z990 module will
+fail -> the standard module will load and register its algorithm.
+3.2. dynamic modules
+A system administrator has to take care of giving preference to z990 crypto
+modules. If MSA is available appropriate lines have to be added to
+Example: z990 crypto instruction for SHA1 algorithm is available
+ add the following line to /etc/modprobe.conf (assuming the
+ z990 crypto modules for SHA1 is called sha1_z990):
+ alias sha1 sha1_z990
+ -> when the sha1 algorithm is requested through the crypto API
+ (which has a module autoloader) the z990 module will be loaded.
+TBD: a userspace module probin mechanism
+ something like 'probe sha1 sha1_z990 sha1' in modprobe.conf
+ -> try module sha1_z990, if it fails to load load standard module sha1
+ the 'probe' statement is currently not supported in modprobe.conf
+4. Currently implemented z990 crypto algorithms
+The following crypto algorithms with z990 MSA support are currently implemented.
+The name of each algorithm under which it is registered in crypto API and the
+name of the respective module is given in square brackets.
+- SHA1 Digest Algorithm [sha1 -> sha1_z990]
+- DES Encrypt/Decrypt Algorithm (64bit key) [des -> des_z990]
+- Tripple DES Encrypt/Decrypt Algorithm (128bit key) [des3_ede128 -> des_z990]
+- Tripple DES Encrypt/Decrypt Algorithm (192bit key) [des3_ede -> des_z990]
+In order to load, for example, the sha1_z990 module when the sha1 algorithm is
+requested (see 3.2.) add 'alias sha1 sha1_z990' to /etc/modprobe.conf.
diff --git a/Documentation/s390/driver-model.txt b/Documentation/s390/driver-model.txt
new file mode 100644
index 000000000000..19461958e2bd
--- /dev/null
+++ b/Documentation/s390/driver-model.txt
@@ -0,0 +1,265 @@
+S/390 driver model interfaces
+1. CCW devices
+All devices which can be addressed by means of ccws are called 'CCW devices' -
+even if they aren't actually driven by ccws.
+All ccw devices are accessed via a subchannel, this is reflected in the
+structures under root/:
+ - sys
+ - legacy
+ - css0/
+ - 0.0.0000/0.0.0815/
+ - 0.0.0001/0.0.4711/
+ - 0.0.0002/
+ ...
+In this example, device 0815 is accessed via subchannel 0, device 4711 via
+subchannel 1, and subchannel 2 is a non-I/O subchannel.
+You should address a ccw device via its bus id (e.g. 0.0.4711); the device can
+be found under bus/ccw/devices/.
+All ccw devices export some data via sysfs.
+cutype: The control unit type / model.
+devtype: The device type / model, if applicable.
+availability: Can be 'good' or 'boxed'; 'no path' or 'no device' for
+ disconnected devices.
+online: An interface to set the device online and offline.
+ In the special case of the device being disconnected (see the
+ notify function under 1.2), piping 0 to online will focibly delete
+ the device.
+The device drivers can add entries to export per-device data and interfaces.
+There is also some data exported on a per-subchannel basis (see under
+chpids: Via which chpids the device is connected.
+pimpampom: The path installed, path available and path operational masks.
+There also might be additional data, for example for block devices.
+1.1 Bringing up a ccw device
+This is done in several steps.
+a. Each driver can provide one or more parameter interfaces where parameters can
+ be specified. These interfaces are also in the driver's responsibility.
+b. After a. has been performed, if necessary, the device is finally brought up
+ via the 'online' interface.
+1.2 Writing a driver for ccw devices
+The basic struct ccw_device and struct ccw_driver data structures can be found
+under include/asm/ccwdev.h.
+struct ccw_device {
+ spinlock_t *ccwlock;
+ struct ccw_device_private *private;
+ struct ccw_device_id id;
+ struct ccw_driver *drv;
+ struct device dev;
+ int online;
+ void (*handler) (struct ccw_device *dev, unsigned long intparm,
+ struct irb *irb);
+struct ccw_driver {
+ struct module *owner;
+ struct ccw_device_id *ids;
+ int (*probe) (struct ccw_device *);
+ int (*remove) (struct ccw_device *);
+ int (*set_online) (struct ccw_device *);
+ int (*set_offline) (struct ccw_device *);
+ int (*notify) (struct ccw_device *, int);
+ struct device_driver driver;
+ char *name;
+The 'private' field contains data needed for internal i/o operation only, and
+is not available to the device driver.
+Each driver should declare in a MODULE_DEVICE_TABLE into which CU types/models
+and/or device types/models it is interested. This information can later be found
+found in the struct ccw_device_id fields:
+struct ccw_device_id {
+ __u16 match_flags;
+ __u16 cu_type;
+ __u16 dev_type;
+ __u8 cu_model;
+ __u8 dev_model;
+ unsigned long driver_info;
+The functions in ccw_driver should be used in the following way:
+probe: This function is called by the device layer for each device the driver
+ is interested in. The driver should only allocate private structures
+ to put in dev->driver_data and create attributes (if needed). Also,
+ the interrupt handler (see below) should be set here.
+int (*probe) (struct ccw_device *cdev);
+Parameters: cdev - the device to be probed.
+remove: This function is called by the device layer upon removal of the driver,
+ the device or the module. The driver should perform cleanups here.
+int (*remove) (struct ccw_device *cdev);
+Parameters: cdev - the device to be removed.
+set_online: This function is called by the common I/O layer when the device is
+ activated via the 'online' attribute. The driver should finally
+ setup and activate the device here.
+int (*set_online) (struct ccw_device *);
+Parameters: cdev - the device to be activated. The common layer has
+ verified that the device is not already online.
+set_offline: This function is called by the common I/O layer when the device is
+ de-activated via the 'online' attribute. The driver should shut
+ down the device, but not de-allocate its private data.
+int (*set_offline) (struct ccw_device *);
+Parameters: cdev - the device to be deactivated. The common layer has
+ verified that the device is online.
+notify: This function is called by the common I/O layer for some state changes
+ of the device.
+ Signalled to the driver are:
+ * In online state, device detached (CIO_GONE) or last path gone
+ (CIO_NO_PATH). The driver must return !0 to keep the device; for
+ return code 0, the device will be deleted as usual (also when no
+ notify function is registerd). If the driver wants to keep the
+ device, it is moved into disconnected state.
+ * In disconnected state, device operational again (CIO_OPER). The
+ common I/O layer performs some sanity checks on device number and
+ Device / CU to be reasonably sure if it is still the same device.
+ If not, the old device is removed and a new one registered. By the
+ return code of the notify function the device driver signals if it
+ wants the device back: !0 for keeping, 0 to make the device being
+ removed and re-registered.
+int (*notify) (struct ccw_device *, int);
+Parameters: cdev - the device whose state changed.
+ event - the event that happened. This can be one of CIO_GONE,
+The handler field of the struct ccw_device is meant to be set to the interrupt
+handler for the device. In order to accommodate drivers which use several
+distinct handlers (e.g. multi subchannel devices), this is a member of ccw_device
+instead of ccw_driver.
+The handler is registered with the common layer during set_online() processing
+before the driver is called, and is deregistered during set_offline() after the
+driver has been called. Also, after registering / before deregistering, path
+grouping resp. disbanding of the path group (if applicable) are performed.
+void (*handler) (struct ccw_device *dev, unsigned long intparm, struct irb *irb);
+Parameters: dev - the device the handler is called for
+ intparm - the intparm which allows the device driver to identify
+ the i/o the interrupt is associated with, or to recognize
+ the interrupt as unsolicited.
+ irb - interruption response block which contains the accumulated
+ status.
+The device driver is called from the common ccw_device layer and can retrieve
+information about the interrupt from the irb parameter.
+1.3 ccwgroup devices
+The ccwgroup mechanism is designed to handle devices consisting of multiple ccw
+devices, like lcs or ctc.
+The ccw driver provides a 'group' attribute. Piping bus ids of ccw devices to
+this attributes creates a ccwgroup device consisting of these ccw devices (if
+possible). This ccwgroup device can be set online or offline just like a normal
+ccw device.
+Each ccwgroup device also provides an 'ungroup' attribute to destroy the device
+again (only when offline). This is a generic ccwgroup mechanism (the driver does
+not need to implement anything beyond normal removal routines).
+To implement a ccwgroup driver, please refer to include/asm/ccwgroup.h. Keep in
+mind that most drivers will need to implement both a ccwgroup and a ccw driver
+(unless you have a meta ccw driver, like cu3088 for lcs and ctc).
+2. Channel paths
+Channel paths show up, like subchannels, under the channel subsystem root (css0)
+and are called 'chp0.<chpid>'. They have no driver and do not belong to any bus.
+Please note, that unlike /proc/chpids in 2.4, the channel path objects reflect
+only the logical state and not the physical state, since we cannot track the
+latter consistently due to lacking machine support (we don't need to be aware
+of anyway).
+status - Can be 'online' or 'offline'.
+ Piping 'on' or 'off' sets the chpid logically online/offline.
+ Piping 'on' to an online chpid triggers path reprobing for all devices
+ the chpid connects to. This can be used to force the kernel to re-use
+ a channel path the user knows to be online, but the machine hasn't
+ created a machine check for.
+3. System devices
+Note: cpus may yet be added here.
+3.1 xpram
+xpram shows up under sys/ as 'xpram'.
+4. Other devices
+4.1 Netiucv
+The netiucv driver creates an attribute 'connection' under
+bus/iucv/drivers/netiucv. Piping to this attibute creates a new netiucv
+connection to the specified host.
+Netiucv connections show up under devices/iucv/ as "netiucv<ifnum>". The interface
+number is assigned sequentially to the connections defined via the 'connection'
+user - shows the connection partner.
+buffer - maximum buffer size.
+ Pipe to it to change buffer size.
diff --git a/Documentation/s390/monreader.txt b/Documentation/s390/monreader.txt
new file mode 100644
index 000000000000..d843bb04906e
--- /dev/null
+++ b/Documentation/s390/monreader.txt
@@ -0,0 +1,197 @@
+Date : 2004-Nov-26
+Author: Gerald Schaefer (
+ Linux API for read access to z/VM Monitor Records
+ =================================================
+This item delivers a new Linux API in the form of a misc char device that is
+useable from user space and allows read access to the z/VM Monitor Records
+collected by the *MONITOR System Service of z/VM.
+User Requirements
+The z/VM guest on which you want to access this API needs to be configured in
+order to allow IUCV connections to the *MONITOR service, i.e. it needs the
+IUCV *MONITOR statement in its user entry. If the monitor DCSS to be used is
+restricted (likely), you also need the NAMESAVE <DCSS NAME> statement.
+This item will use the IUCV device driver to access the z/VM services, so you
+need a kernel with IUCV support. You also need z/VM version 4.4 or 5.1.
+There are two options for being able to load the monitor DCSS (examples assume
+that the monitor DCSS begins at 144 MB and ends at 152 MB). You can query the
+location of the monitor DCSS with the Class E privileged CP command Q NSS MAP
+(the values BEGPAG and ENDPAG are given in units of 4K pages).
+See also "CP Command and Utility Reference" (SC24-6081-00) for more information
+on the DEF STOR and Q NSS MAP commands, as well as "Saved Segments Planning
+and Administration" (SC24-6116-00) for more information on DCSSes.
+1st option:
+You can use the CP command DEF STOR CONFIG to define a "memory hole" in your
+guest virtual storage around the address range of the DCSS.
+Example: DEF STOR CONFIG 0.140M 200M.200M
+This defines two blocks of storage, the first is 140MB in size an begins at
+address 0MB, the second is 200MB in size and begins at address 200MB,
+resulting in a total storage of 340MB. Note that the first block should
+always start at 0 and be at least 64MB in size.
+2nd option:
+Your guest virtual storage has to end below the starting address of the DCSS
+and you have to specify the "mem=" kernel parameter in your parmfile with a
+value greater than the ending address of the DCSS.
+Example: DEF STOR 140M
+This defines 140MB storage size for your guest, the parameter "mem=160M" is
+added to the parmfile.
+User Interface
+The char device is implemented as a kernel module named "monreader",
+which can be loaded via the modprobe command, or it can be compiled into the
+kernel instead. There is one optional module (or kernel) parameter, "mondcss",
+to specify the name of the monitor DCSS. If the module is compiled into the
+kernel, the kernel parameter "monreader.mondcss=<DCSS NAME>" can be specified
+in the parmfile.
+The default name for the DCSS is "MONDCSS" if none is specified. In case that
+there are other users already connected to the *MONITOR service (e.g.
+Performance Toolkit), the monitor DCSS is already defined and you have to use
+the same DCSS. The CP command Q MONITOR (Class E privileged) shows the name
+of the monitor DCSS, if already defined, and the users connected to the
+*MONITOR service.
+Refer to the "z/VM Performance" book (SC24-6109-00) on how to create a monitor
+DCSS if your z/VM doesn't have one already, you need Class E privileges to
+define and save a DCSS.
+modprobe monreader mondcss=MYDCSS
+This loads the module and sets the DCSS name to "MYDCSS".
+This API provides no interface to control the *MONITOR service, e.g. specifiy
+which data should be collected. This can be done by the CP command MONITOR
+(Class E privileged), see "CP Command and Utility Reference".
+Device nodes with udev:
+After loading the module, a char device will be created along with the device
+node /<udev directory>/monreader.
+Device nodes without udev:
+If your distribution does not support udev, a device node will not be created
+automatically and you have to create it manually after loading the module.
+Therefore you need to know the major and minor numbers of the device. These
+numbers can be found in /sys/class/misc/monreader/dev.
+Typing cat /sys/class/misc/monreader/dev will give an output of the form
+<major>:<minor>. The device node can be created via the mknod command, enter
+mknod <name> c <major> <minor>, where <name> is the name of the device node
+to be created.
+# modprobe monreader
+# cat /sys/class/misc/monreader/dev
+# mknod /dev/monreader c 10 63
+This loads the module with the default monitor DCSS (MONDCSS) and creates a
+device node.
+File operations:
+The following file operations are supported: open, release, read, poll.
+There are two alternative methods for reading: either non-blocking read in
+conjunction with polling, or blocking read without polling. IOCTLs are not
+Reading from the device provides a 12 Byte monitor control element (MCE),
+followed by a set of one or more contiguous monitor records (similar to the
+output of the CMS utility MONWRITE without the 4K control blocks). The MCE
+contains information on the type of the following record set (sample/event
+data), the monitor domains contained within it and the start and end address
+of the record set in the monitor DCSS. The start and end address can be used
+to determine the size of the record set, the end address is the address of the
+last byte of data. The start address is needed to handle "end-of-frame" records
+correctly (domain 1, record 13), i.e. it can be used to determine the record
+start offset relative to a 4K page (frame) boundary.
+See "Appendix A: *MONITOR" in the "z/VM Performance" document for a description
+of the monitor control element layout. The layout of the monitor records can
+be found here (z/VM 5.1):
+The layout of the data stream provided by the monreader device is as follows:
+<0 byte read>
+<first MCE> \
+<first set of records> |
+... |- data set
+<last MCE> |
+<last set of records> /
+<0 byte read>
+There may be more than one combination of MCE and corresponding record set
+within one data set and the end of each data set is indicated by a successful
+read with a return value of 0 (0 byte read).
+Any received data must be considered invalid until a complete set was
+read successfully, including the closing 0 byte read. Therefore you should
+always read the complete set into a buffer before processing the data.
+The maximum size of a data set can be as large as the size of the
+monitor DCSS, so design the buffer adequately or use dynamic memory allocation.
+The size of the monitor DCSS will be printed into syslog after loading the
+module. You can also use the (Class E privileged) CP command Q NSS MAP to
+list all available segments and information about them.
+As with most char devices, error conditions are indicated by returning a
+negative value for the number of bytes read. In this case, the errno variable
+indicates the error condition:
+EIO: reply failed, read data is invalid and the application
+ should discard the data read since the last successful read with 0 size.
+EFAULT: copy_to_user failed, read data is invalid and the application should
+ discard the data read since the last successful read with 0 size.
+EAGAIN: occurs on a non-blocking read if there is no data available at the
+ moment. There is no data missing or corrupted, just try again or rather
+ use polling for non-blocking reads.
+EOVERFLOW: message limit reached, the data read since the last successful
+ read with 0 size is valid but subsequent records may be missing.
+In the last case (EOVERFLOW) there may be missing data, in the first two cases
+(EIO, EFAULT) there will be missing data. It's up to the application if it will
+continue reading subsequent data or rather exit.
+Only one user is allowed to open the char device. If it is already in use, the
+open function will fail (return a negative value) and set errno to EBUSY.
+The open function may also fail if an IUCV connection to the *MONITOR service
+cannot be established. In this case errno will be set to EIO and an error
+message with an IPUSER SEVER code will be printed into syslog. The IPUSER SEVER
+codes are described in the "z/VM Performance" book, Appendix A.
+As soon as the device is opened, incoming messages will be accepted and they
+will account for the message limit, i.e. opening the device without reading
+from it will provoke the "message limit reached" error (EOVERFLOW error code)
diff --git a/Documentation/s390/s390dbf.txt b/Documentation/s390/s390dbf.txt
new file mode 100644
index 000000000000..2d1cd939b4df
--- /dev/null
+++ b/Documentation/s390/s390dbf.txt
@@ -0,0 +1,615 @@
+S390 Debug Feature
+files: arch/s390/kernel/debug.c
+ include/asm-s390/debug.h
+The goal of this feature is to provide a kernel debug logging API
+where log records can be stored efficiently in memory, where each component
+(e.g. device drivers) can have one separate debug log.
+One purpose of this is to inspect the debug logs after a production system crash
+in order to analyze the reason for the crash.
+If the system still runs but only a subcomponent which uses dbf failes,
+it is possible to look at the debug logs on a live system via the Linux proc
+The debug feature may also very useful for kernel and driver development.
+Kernel components (e.g. device drivers) can register themselves at the debug
+feature with the function call debug_register(). This function initializes a
+debug log for the caller. For each debug log exists a number of debug areas
+where exactly one is active at one time. Each debug area consists of contiguous
+pages in memory. In the debug areas there are stored debug entries (log records)
+which are written by event- and exception-calls.
+An event-call writes the specified debug entry to the active debug
+area and updates the log pointer for the active area. If the end
+of the active debug area is reached, a wrap around is done (ring buffer)
+and the next debug entry will be written at the beginning of the active
+debug area.
+An exception-call writes the specified debug entry to the log and
+switches to the next debug area. This is done in order to be sure
+that the records which describe the origin of the exception are not
+overwritten when a wrap around for the current area occurs.
+The debug areas itselve are also ordered in form of a ring buffer.
+When an exception is thrown in the last debug area, the following debug
+entries are then written again in the very first area.
+There are three versions for the event- and exception-calls: One for
+logging raw data, one for text and one for numbers.
+Each debug entry contains the following data:
+- Timestamp
+- Cpu-Number of calling task
+- Level of debug entry (0...6)
+- Return Address to caller
+- Flag, if entry is an exception or not
+The debug logs can be inspected in a live system through entries in
+the proc-filesystem. Under the path /proc/s390dbf there is
+a directory for each registered component, which is named like the
+corresponding component.
+The content of the directories are files which represent different views
+to the debug log. Each component can decide which views should be
+used through registering them with the function debug_register_view().
+Predefined views for hex/ascii, sprintf and raw binary data are provided.
+It is also possible to define other views. The content of
+a view can be inspected simply by reading the corresponding proc file.
+All debug logs have an an actual debug level (range from 0 to 6).
+The default level is 3. Event and Exception functions have a 'level'
+parameter. Only debug entries with a level that is lower or equal
+than the actual level are written to the log. This means, when
+writing events, high priority log entries should have a low level
+value whereas low priority entries should have a high one.
+The actual debug level can be changed with the help of the proc-filesystem
+through writing a number string "x" to the 'level' proc file which is
+provided for every debug log. Debugging can be switched off completely
+by using "-" on the 'level' proc file.
+> echo "-" > /proc/s390dbf/dasd/level
+It is also possible to deactivate the debug feature globally for every
+debug log. You can change the behavior using 2 sysctl parameters in
+There are currently 2 possible triggers, which stop the debug feature
+globally. The first possbility is to use the "debug_active" sysctl. If
+set to 1 the debug feature is running. If "debug_active" is set to 0 the
+debug feature is turned off.
+The second trigger which stops the debug feature is an kernel oops.
+That prevents the debug feature from overwriting debug information that
+happened before the oops. After an oops you can reactivate the debug feature
+by piping 1 to /proc/sys/s390dbf/debug_active. Nevertheless, its not
+suggested to use an oopsed kernel in an production environment.
+If you want to disallow the deactivation of the debug feature, you can use
+the "debug_stoppable" sysctl. If you set "debug_stoppable" to 0 the debug
+feature cannot be stopped. If the debug feature is already stopped, it
+will stay deactivated.
+Kernel Interfaces:
+debug_info_t *debug_register(char *name, int pages_index, int nr_areas,
+ int buf_size);
+Parameter: name: Name of debug log (e.g. used for proc entry)
+ pages_index: 2^pages_index pages will be allocated per area
+ nr_areas: number of debug areas
+ buf_size: size of data area in each debug entry
+Return Value: Handle for generated debug area
+ NULL if register failed
+Description: Allocates memory for a debug log
+ Must not be called within an interrupt handler
+void debug_unregister (debug_info_t * id);
+Parameter: id: handle for debug log
+Return Value: none
+Description: frees memory for a debug log
+ Must not be called within an interrupt handler
+void debug_set_level (debug_info_t * id, int new_level);
+Parameter: id: handle for debug log
+ new_level: new debug level
+Return Value: none
+Description: Sets new actual debug level if new_level is valid.
++void debug_stop_all(void);
+Parameter: none
+Return Value: none
+Description: stops the debug feature if stopping is allowed. Currently
+ used in case of a kernel oops.
+debug_entry_t* debug_event (debug_info_t* id, int level, void* data,
+ int length);
+Parameter: id: handle for debug log
+ level: debug level
+ data: pointer to data for debug entry
+ length: length of data in bytes
+Return Value: Address of written debug entry
+Description: writes debug entry to active debug area (if level <= actual
+ debug level)
+debug_entry_t* debug_int_event (debug_info_t * id, int level,
+ unsigned int data);
+debug_entry_t* debug_long_event(debug_info_t * id, int level,
+ unsigned long data);
+Parameter: id: handle for debug log
+ level: debug level
+ data: integer value for debug entry
+Return Value: Address of written debug entry
+Description: writes debug entry to active debug area (if level <= actual
+ debug level)
+debug_entry_t* debug_text_event (debug_info_t * id, int level,
+ const char* data);
+Parameter: id: handle for debug log
+ level: debug level
+ data: string for debug entry
+Return Value: Address of written debug entry
+Description: writes debug entry in ascii format to active debug area
+ (if level <= actual debug level)
+debug_entry_t* debug_sprintf_event (debug_info_t * id, int level,
+ char* string,...);
+Parameter: id: handle for debug log
+ level: debug level
+ string: format string for debug entry
+ ...: varargs used as in sprintf()
+Return Value: Address of written debug entry
+Description: writes debug entry with format string and varargs (longs) to
+ active debug area (if level $<=$ actual debug level).
+ floats and long long datatypes cannot be used as varargs.
+debug_entry_t* debug_exception (debug_info_t* id, int level, void* data,
+ int length);
+Parameter: id: handle for debug log
+ level: debug level
+ data: pointer to data for debug entry
+ length: length of data in bytes
+Return Value: Address of written debug entry
+Description: writes debug entry to active debug area (if level <= actual
+ debug level) and switches to next debug area
+debug_entry_t* debug_int_exception (debug_info_t * id, int level,
+ unsigned int data);
+debug_entry_t* debug_long_exception(debug_info_t * id, int level,
+ unsigned long data);
+Parameter: id: handle for debug log
+ level: debug level
+ data: integer value for debug entry
+Return Value: Address of written debug entry
+Description: writes debug entry to active debug area (if level <= actual
+ debug level) and switches to next debug area
+debug_entry_t* debug_text_exception (debug_info_t * id, int level,
+ const char* data);
+Parameter: id: handle for debug log
+ level: debug level
+ data: string for debug entry
+Return Value: Address of written debug entry
+Description: writes debug entry in ascii format to active debug area
+ (if level <= actual debug level) and switches to next debug
+ area
+debug_entry_t* debug_sprintf_exception (debug_info_t * id, int level,
+ char* string,...);
+Parameter: id: handle for debug log
+ level: debug level
+ string: format string for debug entry
+ ...: varargs used as in sprintf()
+Return Value: Address of written debug entry
+Description: writes debug entry with format string and varargs (longs) to
+ active debug area (if level $<=$ actual debug level) and
+ switches to next debug area.
+ floats and long long datatypes cannot be used as varargs.
+int debug_register_view (debug_info_t * id, struct debug_view *view);
+Parameter: id: handle for debug log
+ view: pointer to debug view struct
+Return Value: 0 : ok
+ < 0: Error
+Description: registers new debug view and creates proc dir entry
+int debug_unregister_view (debug_info_t * id, struct debug_view *view);
+Parameter: id: handle for debug log
+ view: pointer to debug view struct
+Return Value: 0 : ok
+ < 0: Error
+Description: unregisters debug view and removes proc dir entry
+Predefined views:
+extern struct debug_view debug_hex_ascii_view;
+extern struct debug_view debug_raw_view;
+extern struct debug_view debug_sprintf_view;
+ * hex_ascii- + raw-view Example
+ */
+#include <linux/init.h>
+#include <asm/debug.h>
+static debug_info_t* debug_info;
+static int init(void)
+ /* register 4 debug areas with one page each and 4 byte data field */
+ debug_info = debug_register ("test", 0, 4, 4 );
+ debug_register_view(debug_info,&debug_hex_ascii_view);
+ debug_register_view(debug_info,&debug_raw_view);
+ debug_text_event(debug_info, 4 , "one ");
+ debug_int_exception(debug_info, 4, 4711);
+ debug_event(debug_info, 3, &debug_info, 4);
+ return 0;
+static void cleanup(void)
+ debug_unregister (debug_info);
+ * sprintf-view Example
+ */
+#include <linux/init.h>
+#include <asm/debug.h>
+static debug_info_t* debug_info;
+static int init(void)
+ /* register 4 debug areas with one page each and data field for */
+ /* format string pointer + 2 varargs (= 3 * sizeof(long)) */
+ debug_info = debug_register ("test", 0, 4, sizeof(long) * 3);
+ debug_register_view(debug_info,&debug_sprintf_view);
+ debug_sprintf_event(debug_info, 2 , "first event in %s:%i\n",__FILE__,__LINE__);
+ debug_sprintf_exception(debug_info, 1, "pointer to debug info: %p\n",&debug_info);
+ return 0;
+static void cleanup(void)
+ debug_unregister (debug_info);
+ProcFS Interface
+Views to the debug logs can be investigated through reading the corresponding
+> ls /proc/s390dbf/dasd
+flush hex_ascii level raw
+> cat /proc/s390dbf/dasd/hex_ascii | sort +1
+00 00974733272:680099 2 - 02 0006ad7e 07 ea 4a 90 | ....
+00 00974733272:682210 2 - 02 0006ade6 46 52 45 45 | FREE
+00 00974733272:682213 2 - 02 0006adf6 07 ea 4a 90 | ....
+00 00974733272:682281 1 * 02 0006ab08 41 4c 4c 43 | EXCP
+01 00974733272:682284 2 - 02 0006ab16 45 43 4b 44 | ECKD
+01 00974733272:682287 2 - 02 0006ab28 00 00 00 04 | ....
+01 00974733272:682289 2 - 02 0006ab3e 00 00 00 20 | ...
+01 00974733272:682297 2 - 02 0006ad7e 07 ea 4a 90 | ....
+01 00974733272:684384 2 - 00 0006ade6 46 52 45 45 | FREE
+01 00974733272:684388 2 - 00 0006adf6 07 ea 4a 90 | ....
+See section about predefined views for explanation of the above output!
+Changing the debug level
+> cat /proc/s390dbf/dasd/level
+> echo "5" > /proc/s390dbf/dasd/level
+> cat /proc/s390dbf/dasd/level
+Flushing debug areas
+Debug areas can be flushed with piping the number of the desired
+area (0...n) to the proc file "flush". When using "-" all debug areas
+are flushed.
+1. Flush debug area 0:
+> echo "0" > /proc/s390dbf/dasd/flush
+2. Flush all debug areas:
+> echo "-" > /proc/s390dbf/dasd/flush
+Stooping the debug feature
+1. Check if stopping is allowed
+> cat /proc/sys/s390dbf/debug_stoppable
+2. Stop debug feature
+> echo 0 > /proc/sys/s390dbf/debug_active
+lcrash Interface
+It is planned that the dump analysis tool lcrash gets an additional command
+'s390dbf' to display all the debug logs. With this tool it will be possible
+to investigate the debug logs on a live system and with a memory dump after
+a system crash.
+Investigating raw memory
+One last possibility to investigate the debug logs at a live
+system and after a system crash is to look at the raw memory
+under VM or at the Service Element.
+It is possible to find the anker of the debug-logs through
+the 'debug_area_first' symbol in the System map. Then one has
+to follow the correct pointers of the data-structures defined
+in debug.h and find the debug-areas in memory.
+Normally modules which use the debug feature will also have
+a global variable with the pointer to the debug-logs. Following
+this pointer it will also be possible to find the debug logs in
+For this method it is recommended to use '16 * x + 4' byte (x = 0..n)
+for the length of the data field in debug_register() in
+order to see the debug entries well formatted.
+Predefined Views
+There are three predefined views: hex_ascii, raw and sprintf.
+The hex_ascii view shows the data field in hex and ascii representation
+(e.g. '45 43 4b 44 | ECKD').
+The raw view returns a bytestream as the debug areas are stored in memory.
+The sprintf view formats the debug entries in the same way as the sprintf
+function would do. The sprintf event/expection fuctions write to the
+debug entry a pointer to the format string (size = sizeof(long))
+and for each vararg a long value. So e.g. for a debug entry with a format
+string plus two varargs one would need to allocate a (3 * sizeof(long))
+byte data area in the debug_register() function.
+NOTE: If using the sprintf view do NOT use other event/exception functions
+than the sprintf-event and -exception functions.
+The format of the hex_ascii and sprintf view is as follows:
+- Number of area
+- Timestamp (formatted as seconds and microseconds since 00:00:00 Coordinated
+ Universal Time (UTC), January 1, 1970)
+- level of debug entry
+- Exception flag (* = Exception)
+- Cpu-Number of calling task
+- Return Address to caller
+- data field
+The format of the raw view is:
+- Header as described in debug.h
+- datafield
+A typical line of the hex_ascii view will look like the following (first line
+is only for explanation and will not be displayed when 'cating' the view):
+area time level exception cpu caller data (hex + ascii)
+00 00964419409:440690 1 - 00 88023fe
+Defining views
+Views are specified with the 'debug_view' structure. There are defined
+callback functions which are used for reading and writing the proc files:
+struct debug_view {
+ char name[DEBUG_MAX_PROCF_LEN];
+ debug_prolog_proc_t* prolog_proc;
+ debug_header_proc_t* header_proc;
+ debug_format_proc_t* format_proc;
+ debug_input_proc_t* input_proc;
+ void* private_data;
+typedef int (debug_header_proc_t) (debug_info_t* id,
+ struct debug_view* view,
+ int area,
+ debug_entry_t* entry,
+ char* out_buf);
+typedef int (debug_format_proc_t) (debug_info_t* id,
+ struct debug_view* view, char* out_buf,
+ const char* in_buf);
+typedef int (debug_prolog_proc_t) (debug_info_t* id,
+ struct debug_view* view,
+ char* out_buf);
+typedef int (debug_input_proc_t) (debug_info_t* id,
+ struct debug_view* view,
+ struct file* file, const char* user_buf,
+ size_t in_buf_size, loff_t* offset);
+The "private_data" member can be used as pointer to view specific data.
+It is not used by the debug feature itself.
+The output when reading a debug-proc file is structured like this:
+"prolog_proc output"
+"header_proc output 1" "format_proc output 1"
+"header_proc output 2" "format_proc output 2"
+"header_proc output 3" "format_proc output 3"
+When a view is read from the proc fs, the Debug Feature calls the
+'prolog_proc' once for writing the prolog.
+Then 'header_proc' and 'format_proc' are called for each
+existing debug entry.
+The input_proc can be used to implement functionality when it is written to
+the view (e.g. like with 'echo "0" > /proc/s390dbf/dasd/level).
+For header_proc there can be used the default function
+debug_dflt_header_fn() which is defined in in debug.h.
+and which produces the same header output as the predefined views.
+00 00964419409:440761 2 - 00 88023ec
+In order to see how to use the callback functions check the implementation
+of the default views!
+#include <asm/debug.h>
+#define UNKNOWNSTR "data: %08x"
+const char* messages[] =
+{"This error...........\n",
+ "That error...........\n",
+ "Problem..............\n",
+ "Something went wrong.\n",
+ "Everything ok........\n",
+static int debug_test_format_fn(
+ debug_info_t * id, struct debug_view *view,
+ char *out_buf, const char *in_buf
+ int i, rc = 0;
+ if(id->buf_size >= 4) {
+ int msg_nr = *((int*)in_buf);
+ if(msg_nr < sizeof(messages)/sizeof(char*) - 1)
+ rc += sprintf(out_buf, "%s", messages[msg_nr]);
+ else
+ rc += sprintf(out_buf, UNKNOWNSTR, msg_nr);
+ }
+ out:
+ return rc;
+struct debug_view debug_test_view = {
+ "myview", /* name of view */
+ NULL, /* no prolog */
+ &debug_dflt_header_fn, /* default header for each entry */
+ &debug_test_format_fn, /* our own format function */
+ NULL, /* no input function */
+ NULL /* no private data */
+debug_info_t *debug_info;
+debug_info = debug_register ("test", 0, 4, 4 ));
+debug_register_view(debug_info, &debug_test_view);
+for(i = 0; i < 10; i ++) debug_int_event(debug_info, 1, i);
+> cat /proc/s390dbf/test/myview
+00 00964419734:611402 1 - 00 88042ca This error...........
+00 00964419734:611405 1 - 00 88042ca That error...........
+00 00964419734:611408 1 - 00 88042ca Problem..............
+00 00964419734:611411 1 - 00 88042ca Something went wrong.
+00 00964419734:611414 1 - 00 88042ca Everything ok........
+00 00964419734:611417 1 - 00 88042ca data: 00000005
+00 00964419734:611419 1 - 00 88042ca data: 00000006
+00 00964419734:611422 1 - 00 88042ca data: 00000007
+00 00964419734:611425 1 - 00 88042ca data: 00000008
+00 00964419734:611428 1 - 00 88042ca data: 00000009