Tough PMTUD nut to crack

Discussion in 'Cisco' started by John Caruso, May 17, 2005.

  1. John Caruso

    I've got an MTU-related problem for which I can't seem to find any good
    solutions. I've got two Cisco routers (both 3640s) at two different sites
    with a GRE tunnel between them. The GRE tunnel is being encapsulated in
    IPsec by two PIXes that sit between the two routers. The routers are
    running IOS 12.3(12a) and the PIXes are on 6.3.3, if it matters. This is
    the basic diagram:

    R1 <---> PIX1 <-------> PIX2 <---> R2

    Generally speaking, hosts behind R1 perform large transfers to hosts behind
    R2, so R2 is usually receiving traffic.

    So here's the problem: if I don't turn on "tunnel path-mtu-discovery" on
    the GRE tunnel, R2's CPU will frequently be pegged at 100% as it reassembles
    fragmented packets from the tunnel. But if I *do* turn on
    "tunnel path-mtu-discovery", scp transfers that are initiated by a Solaris
    host behind R1 to hosts behind R2 will sometimes, *but not always*, hang.
    Maybe one transfer out of several hundred will hang in this way. When the
    connections hang, they stay alive for the length of tcp_keepalive_interval,
    at which point the scp transfer finally fails (with a "Connection reset by
    peer" message).

    I've tried many, many different things to address this:

    1) Lowering the MTU on the GRE interfaces on the Cisco routers (I've tried
    many different settings, as far down as 1320 on each side)
    2) Lowering the MTU on the Solaris host
    3) Changing tcp_mss_def_ipv4 on the Solaris host
    4) "ip tcp adjust-mss" on the GRE interfaces on the Cisco routers

    Etc, etc. Nothing helps. Unfortunately, I've been unable to reproduce
    this problem in a controlled fashion (by running a test script on the
    Solaris box that just does scp transfers in an infinite loop), nor have
    I been able to determine if it's specific to Solaris or might affect
    other hosts as well.

    So at this point I'm stuck with either a) a router whose CPU is constantly
    being pegged, or b) cron jobs that spew out 5-10 error emails a day, thanks
    to scp transfers that hang between the Solaris host and other hosts.

    Any ideas?

    Just in case, this is the tunnel configuration on Router1 (currently):

    interface Tunnel0
     description GRE tunnel to Router2
     bandwidth 100000
     ip address 10.1.2.3 255.255.255.252
     ip mtu 1476
     ip ospf message-digest-key 1 md5 7 deadbeef
     ip ospf cost 20
     tunnel source Loopback1
     tunnel destination 192.168.10.3
     tunnel path-mtu-discovery

    - John
    John Caruso, May 17, 2005

  2. djd

    It'd be really good to know exactly what's causing the session hang, but that
    can be a challenge. Running "snoop" on each endpoint (or tcpdump, ethereal,
    etc.) would probably be a good start. However, if setting the MTU down on each
    endpoint to a sufficiently small number didn't fix it, I'd suspect the problem
    is not really MTU related. Here's a very good white paper at Cisco that will
    tell you perhaps more than you ever wanted to know about PMTUD, and will help
    you pick good MTU values:

    <http://www.cisco.com/en/US/tech/tk827/tk369/technologies_white_paper09186a00800d6979.shtml>
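
    To give a feel for the arithmetic behind picking MTU values, here is a
    minimal Python sketch. It is an illustration only, under assumptions the
    thread does not confirm: a 1500-byte path MTU between the PIXes, and ESP
    in tunnel mode with 3DES-CBC and HMAC-SHA1-96. The function names are
    made up for this example. Check the actual transform set on the PIXes
    before relying on any of these numbers.

```python
# Header sizes in bytes. The ESP figures ASSUME 3DES-CBC with HMAC-SHA1-96
# in tunnel mode; the thread does not say which transform set is in use.
IPV4_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4
ESP_HEADER = 8     # SPI + sequence number
ESP_IV = 8         # 3DES-CBC initialization vector
ESP_TRAILER = 2    # pad-length byte + next-header byte
ESP_ICV = 12       # HMAC-SHA1-96 authentication data
CIPHER_BLOCK = 8   # 3DES block size


def esp_tunnel_size(inner_len):
    """Size of the ESP tunnel-mode packet that carries an inner IP packet."""
    plaintext = inner_len + ESP_TRAILER
    # Round the plaintext up to a whole number of cipher blocks.
    padded = -(-plaintext // CIPHER_BLOCK) * CIPHER_BLOCK
    return IPV4_HEADER + ESP_HEADER + ESP_IV + padded + ESP_ICV


def max_tunnel_ip_mtu(path_mtu=1500):
    """Largest GRE tunnel 'ip mtu' whose GRE+ESP packet fits in path_mtu."""
    for ip_mtu in range(path_mtu, 0, -1):
        gre_packet = ip_mtu + IPV4_HEADER + GRE_HEADER
        if esp_tunnel_size(gre_packet) <= path_mtu:
            return ip_mtu
    raise ValueError("path MTU too small for GRE over ESP")


def tcp_mss(ip_mtu):
    """TCP MSS that fits in ip_mtu (no IP or TCP options assumed)."""
    return ip_mtu - IPV4_HEADER - TCP_HEADER


if __name__ == "__main__":
    configured = 1476  # 'ip mtu' from the posted Tunnel0 config
    gre_packet = configured + IPV4_HEADER + GRE_HEADER
    print("GRE packet size:", gre_packet)
    print("After ESP:", esp_tunnel_size(gre_packet))
    best = max_tunnel_ip_mtu(1500)
    print("Largest tunnel ip mtu that fits:", best)
    print("Matching TCP MSS:", tcp_mss(best))
```

    Under these assumptions, the configured "ip mtu 1476" produces a
    1500-byte GRE packet that grows to 1552 bytes after ESP, so it no longer
    fits in a 1500-byte path without fragmentation or an ICMP "fragmentation
    needed" message. The largest tunnel ip mtu that fits is 1422, which
    corresponds to a TCP MSS of 1382. A setting such as 1320 would already
    fit by this arithmetic, which is in line with the suspicion above that
    the hangs may not be purely MTU related.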

    I'd also check the various interface stats on the devices involved (both
    tunnel and physical interfaces) for errors and/or drops, since maybe
    something else is going on. Are you seeing any OSPF issues across the
    tunnel? Maybe it's a routing issue. As a test you could put some static
    routes in the routers to eliminate that possibility.

    HTH - Good luck!

    djd, Jul 3, 2005