
LRP QoS HOWTO


Jack Coates, jack@monkeynoodle.org

REVISION HISTORY


11-18-01 Added a link
9-04-01 Expanded QoS explanation, removed kernel compilation tip
4-29-01 Fixed a broken URL, modified recommendation for directions (Sec. 2.2)
3-17-01 First pass.

0. Document Status Note

This document is still unfinished -- the implementation of QoS on LRP turned out to be a bigger hassle than I expected. I have it working properly on my own router, but there are a couple of parameters I don't understand how to properly tune yet.

1. What's all this?

Quality of Service (QoS), Fair Queuing, bandwidth throttling/control/regulation, policy-based routing -- in one way or another all of these phrases have to do with the same concept: prioritizing one type of traffic over another. Imagine a shared network with two workstations and a router. If the left hand workstation downloads an ISO image of the latest Linux distribution, what happens to the right hand workstation's SSH and HTTP sessions? Without QoS, those sessions will be bandwidth starved. With QoS, the right hand station will still get acceptable performance because the left hand station's relatively low-priority traffic will be throttled back.

Beginning with kernel 2.1.105, Linux systems can implement QoS using a variety of algorithms. See references at the end of this document for details. From a 50,000 foot level, it works something like this:

Packets enter the router on one interface, are evaluated for destination, matched to the routing table, and then directed to the outbound interface. If QoS is turned on, at the outbound interface the packets are evaluated into "classes." Class in this case refers to the type and importance of the traffic -- in EigerStein, the default classes are Interactive, Critical, and Bulk. Each class has a "queue" associated with it, and the evaluated packets are placed into the proper queue. Once enqueued, the packets are moved along by the kernel traffic control, which enforces the policy for each queue and the interface.

In the simplest QoS algorithms, you can think of the queues as leaky buckets. Critical and Interactive have big holes, Bulk has a small hole. As traffic comes in and is queued up, it goes into the buckets. Each bucket is then allowed to leak in turn into the outbound interface; Critical and Interactive can therefore leak a lot through their big holes, and Bulk can leak a little through its small hole. If Critical and Interactive are empty, Bulk can use the whole circuit. The key thing to understand here is that "bandwidth" is a misnomer; the circuit isn't any wider, it's just faster. A bit going over a DSL line ties that line up just as thoroughly as a bit going over a modem line ties that line up; the difference is that the bit takes more time to clear out of the modem line. So, here's what leaky buckets do for you:

A=1500 byte FTP packets
B=25 byte SSH packets

without QoS
INTERNAL INTERFACE RECEIVES: AAABAAAABAABBBA
EXTERNAL INTERFACE SENDS:
-------------------------------------------------------------
| A | A | A | B | A | A | A | A | B | A | A | B | B | B | A |
-------------------------------------------------------------

with QoS
INTERNAL INTERFACE RECEIVES: AAABAAAABAABBBA
QoS ENQUEUES TO INTERACTIVE: BBBBB
QoS ENQUEUES TO BULK: AAAAAAAAAA
EXTERNAL INTERFACE SENDS: BBBBABAAAAAAAAA
-------------------------------------------------------------
| B | B | B | B | A | B | A | A | A | A | A | A | A | A | A |
-------------------------------------------------------------

By separating out the interactive SSH traffic before sending anything out, QoS allows you to send a few SSH packets for every FTP packet, and make sure they go first. SSH will still be slower than if you weren't doing FTP at the same time, but performance should be greatly improved over the non-QoS setup.
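To make the ordering above concrete, here is a minimal Python sketch of that diagram. It is not how the kernel's traffic control is implemented: it assumes a toy classifier (B is interactive, everything else is bulk), assumes all fifteen packets are already queued before anything is sent, and uses a hypothetical 4-to-1 weighting that I picked only because it reproduces the send order in the boxed diagram.

```python
from collections import deque


def classify(packet):
    """Toy classifier: 'B' (SSH) is interactive, anything else is bulk."""
    return "interactive" if packet == "B" else "bulk"


def schedule(arrivals, weights=None):
    """Return the send order for a string of already-received packets.

    weights maps each class to how many packets it may send per round
    (hypothetical values, not taken from any LRP script).
    """
    weights = weights or {"interactive": 4, "bulk": 1}
    queues = {name: deque() for name in weights}

    # Enqueue every packet into the queue for its class.
    for packet in arrivals:
        queues[classify(packet)].append(packet)

    # Visit the classes in turn; each may send up to its quota per round.
    sent = []
    while any(queues.values()):
        for name, quota in weights.items():
            for _ in range(quota):
                if not queues[name]:
                    break
                sent.append(queues[name].popleft())
    return "".join(sent)


print(schedule("AAABAAAABAABBBA"))  # prints BBBBABAAAAAAAAA
```

Once the interactive queue is empty, the bulk queue gets every turn, which is the "Bulk can use the whole circuit" behavior described earlier.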

2. How do I do it?

The examples in this document assume you're using EigerSteinBETA2 (http://lrp.steinkuehler.net) as your LRP platform, but any modern LRP distribution (LRP 2.9.8, Oxygen) should support doing this - simply recreate the relevant sections of Charles' scripts, or download the entire script package as a .lrp file from http://lrp.steinkuehler.net. If you happen to be using the Materhorn distribution, these files will look familiar -- that said, this is a really good time to go upgrade your router to something newer. Materhorn is susceptible to three major security bugs and is no longer supported by anyone.

2.1. files required

EigerStein comes with the scripts, but not the tools -- you'll need to install bwidth22.lrp to get the tc, shapecfg, rtmon, and rtacct programs. This package is available from http://beta.linuxrouter.org/download/Materhorn/packages/?S=D or ftp://ftp.sourceforge.net/leaf/ELD_Eiger-3.1.0a_pkg_packages.tar.gz. Please note that if you do not install this package, the configurations below will not take effect.

Additionally, you'll need kernel modules in your modules.lrp and loaded in /etc/conf.modules. Those can be downloaded from http://lrp.steinkuehler.net/files/kernels/Eiger/modules/misc/. There's a lot more information about how and why and what these do in the links at the end of this document.

sch_cbq.o       Class-Based Queuing scheduling algorithm
                        *required
sch_csz.o       Clark-Shenker-Zhang scheduling algorithm
                        *broken, not required
sch_prio.o      Prioritization algorithm
                        *useful as a leaf to sch_cbq.o, but not 
                         required
sch_red.o       Random Early Detection scheduling algorithm
                        *alternate scheduler, not required
sch_sfq.o       Stochastic Fairness Queueing algorithm
                        *required, as a leaf to sch_cbq.o
sch_teql.o      True Link Equalizer algorithm
                        *if you're using load-balancing on 
                         multiple interfaces, you'll need this as
                         a leaf to sch_cbq.o
sch_tbf.o       Token Bucket Filter algorithm
                        *optional, as a leaf to sch_cbq.o
cls_route4.o    route classifier, required
cls_fw.o        maps ipchains' fwmark to traffic class
                        *required
cls_u32.o       Packet classifier, alternate to RSVP
                        *required, preferred for load-balanced 
                         scenarios
cls_rsvp.o      RSVP resource reservation classification 
                protocol
                        *required
cls_rsvp6.o     required if you're using IPv6, otherwise no

2.2. network.conf

In lrcfg, go to Network settings and Network Configuration. You should now be editing /etc/network.conf. Scroll down to the Interfaces area and locate your external interface. LRP's QoS is implemented on a per-interface basis, and only applies to traffic EXITING the interface. So, for a typical dual-ethernet router, it looks like this:

internet ----------> eth0 -> kernel -> eth1 -> QUEUE -> LAN
internet <- QUEUE <- eth0 <- kernel <- eth1 <---------- LAN

That means if you're a home user trying to improve Internet speed, you just need to do this on your external interface (eth0 or ppp0 or isdn0). If you're an administrator trying to prioritize a service above all other traffic, then do your inside interface only, or both inside and outside interfaces. Of course, traffic with lots of bidirectional back and forth (e.g. first person shooter games) will benefit from QoS'ing both interfaces, inside and out.

On the interface where you're going to implement QoS, you want two lines like this:

eth0_FAIRQ=YES
eth0_TXQLEN=100

FAIRQ turns on a somewhat preconfigured QoS mechanism, which should improve performance enough (or not) to let you know if further tweaking is worthwhile. The default behavior in EigerStein is Stochastic Fair Queueing (SFQ), which is sort of like Weighted Fair Queuing without the weights. After traffic has been classed and queued, SFQ processes the queues on a round-robin basis. This isn't perfect, but it is very efficient -- which is important on the low-powered boxes usually used as routers. To change it, you'll need to recompile the kernel. Other options are reviewed at http://snafu.freedom.org/linux2.2/iproute-notes.html.
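The round-robin idea behind SFQ can be sketched in a few lines of Python. This is a simplification and not the kernel's sch_sfq code: flows are hashed into a fixed number of buckets and each non-empty bucket sends one packet per round. The function names, the CRC32 hash, and the bucket count are my own assumptions for illustration.

```python
import zlib
from collections import deque


def flow_bucket(flow, buckets):
    """Hash a flow identifier into a bucket (CRC32 is an arbitrary choice)."""
    return zlib.crc32(repr(flow).encode()) % buckets


def sfq_order(packets, buckets=16, bucket_of=flow_bucket):
    """Return packets in the order a round-robin over the buckets sends them.

    packets is a list of (flow, payload) pairs, all assumed to be queued
    before sending starts.
    """
    queues = [deque() for _ in range(buckets)]
    for flow, payload in packets:
        queues[bucket_of(flow, buckets)].append((flow, payload))

    sent = []
    while any(queues):
        # One packet from each non-empty bucket per round.
        for queue in queues:
            if queue:
                sent.append(queue.popleft())
    return sent
```

Two flows that hash into the same bucket share one turn per round, which is where the "stochastic" part comes from: fairness is only as good as the hash.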

Strictly speaking, TXQLEN isn't related to QoS. However, QoS is a waste of time if you haven't got the interfaces configured properly. TXQLEN is the transmit queue length. Modifying it modifies the size of the buffer in front of an interface. If the interface is slow speed or high latency, this number should be lower. The default value of 100 is counted in packets (not bytes per second) and is too high for many users -- I'd recommend a maximum buffer no larger than what your upstream link can send in one second, so that 1 second is sufficient to clear the buffer. In other words, between 26 and 46 kilobits' worth of packets for modem users, 128 kilobits' worth for ADSL users, etc.
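As a rough worked example of the one-second rule, the Python sketch below converts an upstream speed into a queue length. It assumes the queue length is counted in packets (as txqueuelen is on stock Linux), that every packet is a full 1500 bytes, and that link speed is given in plain bits per second; the function names are mine, not part of LRP.

```python
def txqlen_for_drain_time(link_bits_per_second, packet_bytes=1500, drain_seconds=1.0):
    """Whole number of full-size packets the link can send in drain_seconds.

    Never returns less than 1, since a queue of zero packets is useless.
    """
    packets = int(link_bits_per_second * drain_seconds // (packet_bytes * 8))
    return max(1, packets)


def seconds_to_drain(queue_packets, link_bits_per_second, packet_bytes=1500):
    """Time needed to empty a completely full queue of full-size packets."""
    return queue_packets * packet_bytes * 8 / link_bits_per_second


# A 128 kbit/s uplink (taken here as 128000 bits per second):
print(txqlen_for_drain_time(128_000))   # queue length for a one-second drain
print(seconds_to_drain(100, 128_000))   # drain time of a full 100-packet queue
```

Under these assumptions a full queue of 100 large packets takes several seconds to clear on a 128 kbit/s uplink, which is why a much smaller value is worth trying on slow links.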

If you're impatient, you can stop here or skip to the class definitions. Enabling FAIRQ should help by prioritizing the usual suspects (interactive shell protocols and DNS lookups). However, there are more options to look at:

eth0_BNDWIDTH sets the real outbound bandwidth of the interface. This is very important for DSL users -- if your router is trying to send traffic at 10 Mbps through your 128 Kbps uplink, there will be problems. Syntax here as follows:


[n][speed] with no space between them, case-sensitive.
values for [speed] (each formula gives the resulting rate in bytes per second):
        none : bits per second, (n*1)/8
        kbit : kilobits per second, (n*1024)/8
        mbit : megabits per second, (n*1024*1024)/8
        bps  : bytes per second, n*1
        kbps : kilobytes per second, n*1024
        k    : kilobytes per second, n*1024
        kb   : kilobytes per second, n*1024
        mbps : megabytes per second, n*1024*1024
        m    : megabytes per second, n*1024*1024
        mb   : megabytes per second, n*1024*1024

Examples:
Ethernet = 10mbit
ISDN or ADSL upstream = 128kbit
56K modem = 53kbit

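To check a value before putting it in network.conf, the unit table above can be turned into a small Python helper. This is only a sketch that mirrors the table as printed here; it is not the parser used by tc or by the LRP scripts, and the function name is made up for illustration.

```python
import re

# Multipliers copied from the table above; each converts n to bytes per second.
_BYTES_PER_SECOND = {
    "": 1 / 8,                 # no suffix: bits per second
    "kbit": 1024 / 8,
    "mbit": 1024 * 1024 / 8,
    "bps": 1,
    "kbps": 1024,
    "k": 1024,
    "kb": 1024,
    "mbps": 1024 * 1024,
    "m": 1024 * 1024,
    "mb": 1024 * 1024,
}


def rate_to_bytes_per_second(value):
    """Convert a string such as '128kbit' to bytes per second.

    Suffixes are matched case-sensitively and no space is allowed
    between the number and the suffix, as the syntax note above says.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([a-z]*)", value)
    if match is None or match.group(2) not in _BYTES_PER_SECOND:
        raise ValueError(f"unrecognized rate: {value!r}")
    return float(match.group(1)) * _BYTES_PER_SECOND[match.group(2)]


print(rate_to_bytes_per_second("128kbit"))  # prints 16384.0
```

Note how much difference the suffix makes: by this table 128kbit is 16384 bytes per second, while 128k would be eight times that.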
eth0_HNDL sets an interface identification handle -- if you enable FAIRQ on more than one interface, make sure that the handles are different.

eth0_IABURST sets the limit to which Interactive Queue traffic can burst in front of other traffic. The default should be fine, but experiment if you see timeouts or traffic is sluggish or jittery. If this is too high, high-priority traffic could prevent low-priority traffic from getting through at all.

eth0_IARATE Not sure what this does.

eth0_PXMTU sets the physical maximum transmission unit size for this interface, including the link layer header. For DSL and cable modem users, the default of 1514 is correct. For analog modem users, the default of 1500 is correct.

3. Gotchas

Perhaps the most common mistake is one of design -- assuming that QoS will fix a larger problem. Once upon a time, I had a client with eleven remote offices and one central office, running a mix of TCP/IP and DLSW (an IBM protocol used to tunnel SNA and NetBIOS over IP for dumb terminals) on a Frame Relay network. Performance was poor for the DLSW protocol, which needed low latency, and they theorized that Internet browsing on regular TCP/IP was taking up too much bandwidth. So, they implemented Cisco Weighted Fair Queuing (WFQ) on all their routers, and by doing so caused their TCP and DLSW connections to fail more often than succeed. Why? There are two reasons:

  1. QoS doesn't create bandwidth. If you don't have enough bandwidth to support your traffic, QoS will not help the situation. In fact, if the bandwidth problem is extreme, QoS will be the straw that breaks the camel's back -- beatdown will occur faster with QoS than without. Reliable transport protocols like TCP will retransmit packets if they don't make it through the first time. If the retransmits also don't make it through, they get retransmitted again. Eventually the sending computer gives up, but a circuit carrying lots of sessions can easily end up carrying so much retransmit traffic that it is causing congestion for new packets. That's beatdown.
  2. QoS is implemented in a router, not in a network. The network forwards or discards packets based on congestion, and cares not for your prioritization. (Actually, there are networks that do care, see ATM and MPLS and tag switching, but if you're not on one of those (yes DSL networks are ATM, but no it doesn't count in this context), prioritization means nothing outside of your router.) If the network is congested and throws away your traffic, performance will get worse -- up to the point at which beatdown will occur.
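The retransmit spiral in point 1 can be put in rough numbers. The Python sketch below assumes every lost packet is retransmitted until it gets through and that losses are independent of each other, which is a simplification of real TCP (it backs off and eventually gives up); the function name is mine.

```python
def offered_load_kbps(useful_kbps, loss_probability):
    """Traffic the circuit is asked to carry, retransmits included.

    With independent losses, each packet needs on average
    1 / (1 - loss_probability) transmissions to get through.
    """
    if not 0 <= loss_probability < 1:
        raise ValueError("loss_probability must be in [0, 1)")
    return useful_kbps / (1 - loss_probability)


print(offered_load_kbps(30, 0.5))  # prints 60.0
```

Under those assumptions, 30 kbps of useful traffic with half the packets being dropped asks the circuit to carry 60 kbps, and the extra load itself causes more drops.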

This customer was generating an average of 30 kbps traffic per remote site, peaking to 60 or 70 kbps during the peak periods (8 to 10 am, 1 to 2 pm, and 4 to 5 pm). The circuits were a mix of DS-0 (56 kbps) and DS-1 (T1), but to save money all circuits had a CIR of 0. In other words, the phone company made no guarantee that any of their traffic would get through the network, and sure enough, a lot of it got thrown away. During the average traffic periods, retransmits could take care of this. During peaks, beatdown would occur. When they added WFQ, they effectively divided each pipe into two smaller pipes (one for DLSW, one for other traffic). Since there was still no guarantee of delivery from the Frame Relay network, this just made beatdown occur faster.

The solution in this case was to increase CIR on their Frame Relay circuits to 56 kbps. Once that was done, both protocols performed fine during normal hours, and DLSW performed fine during peaks (regular TCP sessions were sluggish during traffic peaks). QoS alone didn't help, but QoS did let them get away with a 56K CIR instead of a 128K CIR.

4. Resources

What is it?
http://www.cisco.com/univercd/cc/td/doc/product/software/ios120/12cgcr/qos_c/qcintro.htm

Linux-specific documentation:

Last modified: Oct 24, 2008 2:28 pm.