October 15, 2012

Pythian
» Adding Networks to Exadata: Fun with policy routing

I’ve noticed that Exadata servers are now configured to use Linux policy routing. Peeking at My Oracle Support, note 1306154.1 goes into a bit more detail about this configuration. It’s apparently delivered by default with factory build images 11.2.2.3.0 and later. The note goes on to explain that this configuration was implemented because of asymmetric routing problems associated with the management network:

Database servers are deployed with 3 logical network interfaces configured: management network (typically eth0), client access network (typically bond1 or bondeth0), and private network (typically bond0 or bondib0). The default route for the system uses the client access network and the gateway for that network. All outbound traffic that is not destined for an IP address on the management or private networks is sent out via the client access network. This poses a problem for some connections to the management network in some customer environments.


It goes on to mention a bug where this was reported:

@ BUG:11725389 – TRACK112230: MARTIAN SOURCE REPORTED ON DB NODES BONDETH0 INTERFACE

The bug is not public, but the title does show the type of error message that appears when a packet arrives with a source address that doesn’t belong on that interface.

This configuration is implemented using Red Hat/Oracle Linux-style /etc/sysconfig/network-scripts files, with matched rule- and route- files for each interface.

A sample configuration, where the management network is in the 10.10.10.0/24 subnet, is:

[root@exa1db01 network-scripts]# cat rule-eth0
from 10.10.10.93 table 220
to 10.10.10.93 table 220
[root@exa1db01 network-scripts]# cat route-eth0
10.10.10.0/24 dev eth0 table 220
default via 10.10.10.1 dev eth0 table 220

This configuration tells traffic originating from the 10.10.10.93 IP (which is the management interface IP on this particular machine), as well as traffic destined to that address, to bypass the regular system routing table and use a special routing table, 220, instead. route-eth0 then populates table 220 with two routes: one for the local network, and a default route via the 10.10.10.1 gateway.
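Under the hood, the ifup scripts hand these files to the iproute2 tools, so the resulting state can be inspected (or roughly reproduced by hand) with ip rule and ip route. A quick sketch, using the same addresses as the sample above:

ip rule show               # lists the policy rules, including the two pointing at table 220
ip route show table 220    # shows the contents of the special table

# roughly the commands the rule-eth0/route-eth0 files translate into:
ip rule add from 10.10.10.93 table 220
ip rule add to 10.10.10.93 table 220
ip route add 10.10.10.0/24 dev eth0 table 220
ip route add default via 10.10.10.1 dev eth0 table 220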

This contrasts with the default gateway of the machine itself:

[root@exa1db01 network-scripts]# grep GATEWAY /etc/sysconfig/network
GATEWAYDEV=bondeth0
GATEWAY=10.50.50.1

The difference between this type of policy routing and regular routing is that traffic with a _source_ address of 10.10.10.93 will automatically go through default gateway 10.10.10.1, regardless of its destination. (For those looking for more details, the bible for Linux routing configuration is the Linux Advanced Routing and Traffic Control HOWTO.)
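A handy way to watch the source-based selection in action is ip route get, which reports which route (and hence which table and gateway) the kernel would choose for a given source and destination. Here 192.0.2.10 is just a stand-in for some external destination:

ip route get 192.0.2.10                    # no source given: main table, client-access gateway
ip route get 192.0.2.10 from 10.10.10.93   # sourced from the management IP: table 220, via 10.10.10.1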

I ran into an issue with this configuration when adding a second external network on the bondeth1 interface. I set up the additional interface configuration for a network, 10.50.52.0/24:

[root@exa1db01 network-scripts]# cat ifcfg-bondeth1
DEVICE=bondeth1
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=10.50.52.104
NETMASK=255.255.255.0
NETWORK=10.50.52.0
BROADCAST=10.50.52.255
BONDING_OPTS="mode=active-backup miimon=100 downdelay=5000 updelay=5000 num_grat_arp=100"
IPV6INIT=no
GATEWAY=10.50.52.1

I also added rule and route entries:

[root@exa1db01 network-scripts]# cat rule-bondeth1
from 10.50.52.104 table 211
to 10.50.52.104 table 211
[root@exa1db01 network-scripts]# cat route-bondeth1
10.50.52.0/24 dev bondeth1 table 211
10.100.52.0/24 via 10.50.52.1 dev bondeth1 table 211
default via 10.50.52.1 dev bondeth1 table 211

This was a dedicated Data Guard network to a remote server, IP 10.100.52.10.

The problem with this configuration was: it didn’t work. Using tcpdump, I could see incoming requests arriving on the bondeth1 interface, but the replies going out the system default route on bondeth0 and never reaching their destination. After some digging, I did find the cause of the problem: to choose a source IP for an outgoing packet, the kernel first looks up the destination, and that lookup never reaches table 211, because the rules for table 211 only match traffic from or to 10.50.52.104 and at that point the packet has no source address yet. So the lookup fell through to the default route, the packet got a source address on the client-access network, and it never matched any of the routing rules for the Data Guard network.
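The same lookup can be checked directly with ip route get, which shows the route and source address the kernel would pick for locally originated traffic to the standby server; before the fix below it follows the main default route via bondeth0, and afterwards it should show bondeth1 with source 10.50.52.104:

ip route get 10.100.52.10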

The solution ended up being rather simple: taking the “table 211” qualifier off the Data Guard network route, effectively putting that route in the default routing table:

[root@exa1db01 network-scripts]# cat route-bondeth1
10.50.52.0/24 dev bondeth1 table 211
default via 10.50.52.1 dev bondeth1 table 211
10.100.52.0/24 via 10.50.52.1 dev bondeth1

And then we ran into a second issue: the main interface IP could now be reached, but not the virtual IP (VIP). This is because the rule configuration, taken from the samples, doesn’t list the VIP address at all. To avoid this issue, and to cover VIP addresses migrating over from other cluster nodes, we put the whole subnet in the rule file, making all addresses in the Data Guard network use this particular routing rule:

[root@exa1db01 network-scripts]# cat rule-bondeth1
from 10.50.52.0/24 table 211
to 10.50.52.0/24 table 211

So to sum up, when setting up interfaces in a policy-routed Exadata system, remember to:

  • Set up the interface itself and any bonds using ifcfg- files
  • Create a rule- file for the interface, encompassing every possible address the interface could have (I used the entire IP subnet), with “from” and “to” lines pointing at a unique routing table number
  • Create a route- file for the interface, listing a local network route and a default route with the default router of the subnet, all using the table number defined on the previous step
  • Add to the route- file any static routes required on this interface, but don’t add a table qualifier

The final configuration:

[root@exa1db01 network-scripts]# cat ifcfg-eth8
DEVICE=eth8
HOTPLUG=no
IPV6INIT=no
HWADDR=00:1b:21:xx:xx:xx
ONBOOT=yes
MASTER=bondeth1
SLAVE=yes
BOOTPROTO=none
[root@exa1db01 network-scripts]# cat ifcfg-eth12
DEVICE=eth12
HOTPLUG=no
IPV6INIT=no
HWADDR=00:1b:21:xx:xx:xx
ONBOOT=yes
MASTER=bondeth1
SLAVE=yes
BOOTPROTO=none
[root@exa1db01 network-scripts]# cat ifcfg-bondeth1
DEVICE=bondeth1
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=10.50.52.104
NETMASK=255.255.255.0
NETWORK=10.50.52.0
BROADCAST=10.50.52.255
BONDING_OPTS="mode=active-backup miimon=100 downdelay=5000 updelay=5000 num_grat_arp=100"
IPV6INIT=no
GATEWAY=10.50.52.1
[root@exa1db01 network-scripts]# cat rule-bondeth1
from 10.50.52.0/24 table 211
to 10.50.52.0/24 table 211
[root@exa1db01 network-scripts]# cat route-bondeth1
10.50.52.0/24 dev bondeth1 table 211
default via 10.50.52.1 dev bondeth1 table 211
10.100.52.0/24 via 10.50.52.1 dev bondeth1
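To put a configuration like this into effect and sanity-check it without a reboot, bouncing the interface and then dumping the rules and tables is usually enough. A rough sketch (ifdown briefly drops the interface, so on a live cluster node this belongs in a maintenance window):

ifdown bondeth1 && ifup bondeth1
ip rule show                     # should include the from/to 10.50.52.0/24 rules for table 211
ip route show table 211          # local subnet route plus default via 10.50.52.1
ip route show | grep 10.100.52   # the static Data Guard route, now in the main table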

September 20, 2012

Pythian
» Troubleshooting ORA-27090 async I/O errors with systemtap

Last week I ran into an issue on a high-volume Oracle database, whereby sessions were periodically failing with ORA-27090 errors. Job queue processes were also seeing this error, and showing messages like this in the database alert log:

Tue Sep 11 20:56:15 2012
Errors in file /orahome/oracle/base/diag/rdbms/dbm/dbm1/trace/dbm1_j001_126103.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 11: Resource temporarily unavailable
Additional information: 3
Additional information: 128
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 11: Resource temporarily unavailable
Additional information: 3
Additional information: 128


The tracefile just showed the same ORA-27090 messages, so nothing particularly useful there. oerr is of no help:

$ oerr ora 27090
27090, 00000, "Unable to reserve kernel resources for asynchronous disk I/O"
// *Cause: The system call to reserve kernel resources for asynchronous I/O
// has failed.
// *Action: Check errno

There’s a known bug, 7306820 “ORA-7445 [krhahw] / ORA-27090 during file header read. Instance may crash”, but this bug is fixed in 11.2.0.1, and this database is running 11.2.0.3.

And on top of that, it’s an Exadata system, so I/O to storage servers goes over the InfiniBand network rather than using async I/O (AIO) calls.

A web search turned up an entry on DBHK’s blog, pointing to the value of aio-max-nr being set too low. However, aio-max-nr was already set to the recommended value, matching the Exadata defaults as well:

# cat /proc/sys/fs/aio-max-nr
3145728

The Linux kernel documentation has a brief but meaty description of this parameter:

aio-nr & aio-max-nr:

aio-nr is the running total of the number of events specified on the io_setup system call for all currently active aio contexts. If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN. Note that raising aio-max-nr does not result in the pre-allocation or re-sizing of any kernel data structures.

Having a peek at aio-nr:

# cat /proc/sys/fs/aio-nr
3145726

We’re within 2 of the absolute limit, so it looks highly likely that this limit is indeed the problem. However, the question is: who is using these AIO events? This DB is a huge session hog (8000+ concurrent sessions), but even so, 3 million is a pretty high limit. And at this point we can’t even be sure that it’s database processes using up the AIO events.
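Reading the two values side by side makes the headroom (or lack of it) obvious; for example:

# awk 'NR==1 {nr=$1} NR==2 {printf("%d of %d AIO events in use (%.1f%%)\n", nr, $1, nr*100/$1)}' /proc/sys/fs/aio-nr /proc/sys/fs/aio-max-nr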

The only AIO-related information in /proc (or /sys for that matter) is the two files in /proc/sys/fs. To go into more detail requires some more tools.

Solaris admins will no doubt be familiar with DTrace, a kernel tracing framework that can expose all kinds of information in the OS kernel, among other things. Oracle has ported DTrace to Linux, but it requires the latest-and-greatest UEK2 kernel, which is not yet supported on Exadata.

I came across another tool that also allows kernel inspection, and _is_ available in Oracle Linux 5: systemtap. systemtap hooks into kernel function calls, allowing them to be traced, their arguments captured, and, if you’re really brave, even modified.

Including dependencies, I ended up needing to add four packages. As this machine doesn’t (yet) have a working yum repository, I used public-yum.oracle.com to obtain the following:

avahi-0.6.16-10.el5_6.x86_64.rpm
dbus-python-0.70-9.el5_4.x86_64.rpm
systemtap-1.6-7.el5_8.x86_64.rpm
systemtap-runtime-1.6-7.el5_8.x86_64.rpm

The avahi package is a tool for plug-and-play networking that I don’t exactly want running on a server, but the systemtap binary is linked to it for its remote compilation capability. Avahi configures itself to auto-start on the next boot, so I disabled that:

# chkconfig avahi-daemon off
# chkconfig avahi-dnsconfd off

The systemtap packages complained about missing kernel package dependencies, since this system is running Oracle’s UEK kernel, which names the kernel package kernel-uek instead. I ended up installing them with the --nodeps option to skip dependency checking.
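Before building anything real, a trivial probe is a quick smoke test that systemtap can compile and load a module against this kernel despite the forced install:

# stap -e 'probe begin { printf("systemtap is working\n"); exit() }'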

I couldn’t find any pre-made scripts to monitor AIO, but a 2008 presentation from Oracle Linux engineering does have a bullet point on it:

• Tracking resources tuned via aio_nr and aio_max_nr

So, based on some of the many example scripts, I set out to build a script to monitor AIO calls. Here is the end result:

stap -ve '
global allocated, allocatedctx, freed

# count io_setup calls per process and sum the number of AIO events requested
probe syscall.io_setup {
  allocatedctx[pid()] += maxevents; allocated[pid()]++;
  printf("%d AIO events requested by PID %d (%s)\n",
  	maxevents, pid(), cmdline_str());
}
# count the matching io_destroy calls
probe syscall.io_destroy {freed[pid()]++}

# forget about processes as they exit, so the summary only shows live ones
probe kprocess.exit {
  if (allocated[pid()]) {
     printf("PID %d exited\n", pid());
     delete allocated[pid()];
     delete allocatedctx[pid()];
     delete freed[pid()];
  }
}

# on interrupt, print a per-process summary of what is still allocated
probe end {
foreach (pid in allocated) {
   printf("PID %d allocated=%d allocated events=%d freed=%d\n",
      pid, allocated[pid], allocatedctx[pid], freed[pid]);
}
}
'

Sample output (using systemtap’s -v verbose option to see compilation details):

Pass 1: parsed user script and 76 library script(s) using 147908virt/22876res/2992shr kb, in 130usr/10sys/146real ms.
Pass 2: analyzed script: 4 probe(s), 10 function(s), 3 embed(s), 4 global(s) using 283072virt/49864res/4052shr kb, in 450usr/140sys/586real ms.
Pass 3: using cached /root/.systemtap/cache/11/stap_111c870f2747cede20e6a0e2f0a1b1ae_6256.c
Pass 4: using cached /root/.systemtap/cache/11/stap_111c870f2747cede20e6a0e2f0a1b1ae_6256.ko
Pass 5: starting run.
128 AIO events requested by PID 32885 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 32885 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 32885 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
4096 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69142 (oracledbm1 (LOCAL=NO))
128 AIO events requested by PID 69099 (oracledbm1 (LOCAL=NO))
...
(and when control-C is pressed):

PID 99043 allocated=6 allocatedevents=12672 freed=3
PID 37074 allocated=12 allocatedevents=25344 freed=6
PID 99039 allocated=18 allocatedevents=38016 freed=9
PID 69142 allocated=24 allocatedevents=50688 freed=12
PID 32885 allocated=36 allocatedevents=76032 freed=18
PID 69099 allocated=6 allocatedevents=12672 freed=3
Pass 5: run completed in 0usr/50sys/9139real ms.

It’s quite obvious here that the AIO allocations are all happening from oracle database processes.

From the summary output we can see that each process seems to run io_setup twice as often as io_destroy; kernel gurus may have an answer for this, but I suspect it has more to do with the data gathering than with a massive leak of AIO events.

But the more interesting result is the frequent allocation of 4096 AIO events at a time. On a database with 8000 connections, that would be over 10 times the current limit.

The only major downside of increasing this limit seems to be the risk of exhausting kernel memory. From a 2009 post to the linux-kernel mailing list:

Each time io_setup is called, a ring buffer is allocated that can hold nr_events I/O completions. That ring buffer is then mapped into the process’ address space, and the pages are pinned in memory. So, the reason for this upper limit (I believe) is to keep a malicious user from pinning all of kernel memory.

And in consultation with Oracle support, we set aio-max-nr to 50 million, enough to accommodate three databases with 16k connections all allocating 4096 AIO events. Or, in other words, way more than we ever expect to use.
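As a rough cross-check against the memory-pinning concern above: each slot in an AIO completion ring is a 32-byte struct io_event on x86_64, so even if all 50 million events were allocated at once, that works out to roughly 50,000,000 × 32 bytes ≈ 1.5 GB of pinned ring-buffer memory (ignoring ring headers and page rounding), a small fraction of the RAM on an Exadata database server.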

# sysctl -w fs.aio-max-nr=50000000
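To make the new limit survive a reboot, the same setting also goes into /etc/sysctl.conf:

fs.aio-max-nr = 50000000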

And since this change, the ORA-27090 errors have gone away.