redundant switches / redundant server NICs

Discussion in 'Cisco' started by Stuart Kendrick, Aug 9, 2004.

  1. hi folks,

    i'm analyzing the interaction between Catalyst 4000s and servers
    configured with redundant NICs (Intel's TEAMing software).

    we install two C4K in each server room, and cable the "a" NIC of each
    server to the "a" ethernet switch, and the "b" NIC of each server to
    the "b" ethernet switch.

    when we unplug one of these cables, a continuous ping to the target
    typically shows 0-1 missed pings ... and syslog on the server shows
    the kernel detecting the loss of link, disengaging the primary NIC,
    and activating the standby NIC. when we plug the cable back in, the
    kernel detects this event and reverses the procedure. and we're back
    to where we started. cool.

    when we reboot a switch, we see the same behavior ... but about 30
    seconds after the reboot, the C4K brings link up on its ports ... the
    kernel obligingly changes its view of active and standby ... and the
    server has just isolated itself, but the C4K is in no position to
    forward traffic. in fact, it won't be ready to forward traffic for
    another couple minutes. (near the end of that time, it will take link
    down for ~30 seconds, before bringing it up just prior to going fully
    functional).

    i have a TAC case open on this -- the engineer says that the C4K
    raises link in order to perform hardware level testing on the port ...
    part of the power-on diagnostics ... if the testing fails, then the
    Sup card will log an error message. this is good stuff.

    however, in the meantime, the server sees link up, thinks it can use
    that NIC ... and forwards packets into oblivion.

    i can see various ways to solve this. i could disable diagnostics ...
    but then i miss the benefit of having the C4K identify failed ports
    for me. i could configure the servers not to failback ... but then,
    at any given moment, my servers are in an indeterminate state,
    network-wise (I won't know a priori which NIC is active).
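
To make the failure window concrete, here is a minimal Python sketch of the behaviour described above. It assumes a team that follows link state alone and always prefers NIC "a" when its link is up. The timings (link up at about 30 s, link dropped again for about 30 s, forwarding from about 180 s) are rough assumptions based on this post, not measured values, and every name in the code is hypothetical rather than taken from Intel's software.

```python
from dataclasses import dataclass


@dataclass
class SwitchState:
    link_up: bool      # does the server NIC see link on this port?
    forwarding: bool   # is the switch actually passing traffic?


def switch_a_at(t: int) -> SwitchState:
    """Assumed timeline (seconds) for switch "a" after a reboot at t=0."""
    if t < 30:
        return SwitchState(link_up=False, forwarding=False)   # still booting
    if t < 150:
        return SwitchState(link_up=True, forwarding=False)    # port diagnostics
    if t < 180:
        return SwitchState(link_up=False, forwarding=False)   # link dropped again
    return SwitchState(link_up=True, forwarding=True)         # fully functional


def preferred_nic(link_a_up: bool) -> str:
    """Link-state-only failback: use "a" whenever it has link, else "b"."""
    return "a" if link_a_up else "b"


def blackhole_seconds(duration: int = 240) -> int:
    """Count seconds in which the active NIC faces a non-forwarding switch.

    Switch "b" is assumed healthy for the whole run.
    """
    lost = 0
    for t in range(duration):
        state = switch_a_at(t)
        if preferred_nic(state.link_up) == "a" and not state.forwarding:
            lost += 1
    return lost


if __name__ == "__main__":
    print("seconds isolated:", blackhole_seconds())
```

With these assumed timings the sketch reports 120 isolated seconds: the whole span during which link is up for diagnostics but the switch is not yet forwarding.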

    how have other folks handled this problem?

    --sk

    stuart kendrick
    fhcrc
     
    Stuart Kendrick, Aug 9, 2004
    #1

  2. "Stuart Kendrick" <> wrote in message
    news:...
    > hi folks,
    >
    > i'm analyzing the interaction between Catalyst 4000s and servers
    > configured with redundant NICs (Intel's TEAMing software).
    >
    > we install two C4K in each server room, and cable the "a" NIC of each
    > server to the "a" ethernet switch, and the "b" NIC of each serfver to
    > the "b" ethernet switch.
    >
    > when we unplug one of these cables, a continuous ping to the target
    > typically shows 0-1 missed pings ... and syslog on the server shows
    > the kernel detecting the loss of link, disengaging the primary NIC,
    > and activating the standby NIC. when we plug the cable back in, the
    > kernel detects this event and reverses the procedure. and we're back
    > to where we started. cool.
    >
    > when we reboot a switch, we see the same behavior ... but about 30
    > seconds after the reboot, the C4K brings link up on its ports ... the
    > kernel obligingly changes its view of active and standby ... and the
    > server has just isolated itself, but the C4K is in no position to
    > forward traffic. in fact, it won't be ready to forward traffic for
    > another couple minutes. (near the end ofthat time, it will take link
    > down for ~30 seconds, before bringing it up just prior to going fully
    > functional).


    This delay is probably spanning tree related: you could look at your
    portfast/uplinkfast configuration to bring this time down.

    > i have a TAC case open on this -- the engineer says that the C4K
    > raises link in order to perform hardware level testing on the port ...
    > part of the power-on diagnostics ... if the testing fails, then the
    > Sup card will log an error message. this is good stuff.
    >
    > however, in the meantime, the server sees link up, thinks it can use
    > that NIC ... and forwards packets into oblivion.
    >
    > i can see various ways to solve this. i could disable diagnostics ...
    > but then i miss the benefit of having the C4K identify failed ports
    > for me. i could configure the servers not to failback ... but then,
    > at any given moment, my servers are in an indeterminate state,
    > network-wise (I won't know a priori which NIC is active).
    >
    > how have other folks handled this problem?


    When I have configured Intel teaming in the past, I've used the smart-switch
    feature, which keeps the current NIC active until it fails. In other
    words, if the switch the active NIC is connected to fails, the team
    switches to the standby NIC but does not switch back once the first
    switch returns to active duty.
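
The no-failback behaviour described above can be sketched in a few lines of Python. This is only an illustration of the policy, assuming a two-NIC team that reacts to link events; the class and method names are hypothetical and do not come from Intel's teaming software.

```python
class StickyTeam:
    """Two-NIC team that changes the active NIC only when the active link fails."""

    def __init__(self, active: str = "a") -> None:
        self.active = active
        self.link = {"a": True, "b": True}  # last known link state per NIC

    def on_link_change(self, nic: str, up: bool) -> str:
        """Record a link event and return the NIC that is active afterwards."""
        self.link[nic] = up
        if not self.link[self.active]:
            other = "b" if self.active == "a" else "a"
            # Fail over only if the other NIC actually has link.
            if self.link[other]:
                self.active = other
        return self.active


if __name__ == "__main__":
    team = StickyTeam()
    print(team.on_link_change("a", False))  # switch "a" goes down
    print(team.on_link_change("a", True))   # switch "a" raises link again
```

Because the team ignores link coming back on the idle NIC, a rebooting switch that raises link early for diagnostics never pulls traffic back to itself. The cost is the one raised in the original post: the active NIC depends on failure history.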

    >
    > --sk
    >
    > stuart kendrick
    > fhcrc


    BTW, this sort of redundancy was not designed to give instant failover with
    no dropped packets, but to allow the continued operation of a service after
    a failure. Losing a few seconds of availability is better than losing it for
    hours.

    BL
    --
    As the days go by, we face the increasing inevitability that we are alone in
    a godless, uninhabited, hostile and meaningless universe. Still, you've got
    to laugh, haven't you? - Holly
     
    Buzz Lightbeer, Aug 9, 2004
    #2

  3. In article <>, Stuart Kendrick
    says...
    > hi folks,
    >
    > i'm analyzing the interaction between Catalyst 4000s and servers
    > configured with redundant NICs (Intel's TEAMing software).
    >
    > we install two C4K in each server room, and cable the "a" NIC of each
    > server to the "a" ethernet switch, and the "b" NIC of each serfver to
    > the "b" ethernet switch.
    >
    > when we unplug one of these cables, a continuous ping to the target
    > typically shows 0-1 missed pings ... and syslog on the server shows
    > the kernel detecting the loss of link, disengaging the primary NIC,
    > and activating the standby NIC. when we plug the cable back in, the
    > kernel detects this event and reverses the procedure. and we're back
    > to where we started. cool.
    >
    > when we reboot a switch, we see the same behavior ... but about 30
    > seconds after the reboot, the C4K brings link up on its ports ... the
    > kernel obligingly changes its view of active and standby ... and the
    > server has just isolated itself, but the C4K is in no position to
    > forward traffic. in fact, it won't be ready to forward traffic for
    > another couple minutes. (near the end ofthat time, it will take link
    > down for ~30 seconds, before bringing it up just prior to going fully
    > functional).
    >
    > i have a TAC case open on this -- the engineer says that the C4K
    > raises link in order to perform hardware level testing on the port ...
    > part of the power-on diagnostics ... if the testing fails, then the
    > Sup card will log an error message. this is good stuff.
    >
    > however, in the meantime, the server sees link up, thinks it can use
    > that NIC ... and forwards packets into oblivion.
    >
    > i can see various ways to solve this. i could disable diagnostics ...
    > but then i miss the benefit of having the C4K identify failed ports
    > for me. i could configure the servers not to failback ... but then,
    > at any given moment, my servers are in an indeterminate state,
    > network-wise (I won't know a priori which NIC is active).
    >
    > how have other folks handled this problem?



    Are you sure you're not overthinking this problem with TAC? I.e., doing
    a "set port host x/y" will fix the 50 sec delay you're talking about.
    And when the Cat brings up the port for diags, I'm not sure that it
    would send out the necessary link pulse to negotiate with the other
    side. I could be wrong, but I don't think it would do this.

    The delay you're talking about sounds like the result of spanning tree
    calculation, trunking protocol, and PAgP calculation, all of which can
    be turned off with "set port host".

    --

    hsb

    "Somehow I imagined this experience would be more rewarding" Calvin
    *************** USE ROT13 TO SEE MY EMAIL ADDRESS ****************
    ********************************************************************
    Due to the volume of email that I receive, I may not be able to
    reply to emails sent to my account. Please post a followup instead.
    ********************************************************************
     
    Hansang Bae, Aug 10, 2004
    #3
  4. Hansang Bae <> wrote in message

    yes, it is quite possible that i'm making this harder than it really
    is ...

    however, i think i have the "set port host x/y" thing down ... i.e.
    portfast enabled, trunking disabled, channeling disabled, and so
    forth.

    mp-a-esx> sh port cap 6/27
    Model                    WS-X4448-GB-RJ45
    Port                     6/27
    Type                     10/100/1000
    Speed                    auto,10,100,1000
    Duplex                   half,full
    Trunk encap type         802.1Q
    Trunk mode               on,off,desirable,auto,nonegotiate
    Channel                  6/1-48
    Flow control             receive-(off,on,desired),send-(off,on,desired)
    Security                 yes
    Dot1x                    yes
    Membership               static,dynamic
    Fast start               yes
    QOS scheduling           rx-(none),tx-(2q1t)
    CoS rewrite              no
    ToS rewrite              no
    Rewrite                  no
    UDLD                     yes
    Inline power             no
    AuxiliaryVlan            1..1000,1025..4094,untagged,none
    SPAN                     source,destination,reflector
    Link debounce timer      yes
    IGMPFilter               yes
    Dot1q-all-tagged         no
    Jumbo frames             no
    mp-a-esx>

    and from the config file:

    #module 6 : 48-port 10/100/1000 Ethernet
    set vlan 42 6/1-48
    set port auxiliaryvlan 6/1 642
    set port auxiliaryvlan 6/2 642
    [...]
    set port enable 6/1-48
    set port level 6/1-48 normal
    set port speed 6/1-48 auto
    set port clock 6/1-48 auto
    set port trap 6/1-48 disable
    set port name 6/1-48
    set port security 6/1-48 disable age 0 maximum 1 shutdown 0 unicast-flood enable violation shutdown
    set port dot1x 6/1-48 port-control force-authorized
    set port dot1x 6/1-48 multiple-host disable
    set port dot1x 6/1-48 shutdown-timeout disable
    set port dot1x 6/1-48 re-authentication disable
    set port membership 6/1-48 static
    set port protocol 6/1-48 ip on
    set port protocol 6/1-48 ipx auto
    set port protocol 6/1-48 group auto
    set port flowcontrol 6/18-19 send desired
    set port flowcontrol 6/1-17,6/20-48 send on
    set port flowcontrol 6/1-48 receive desired
    set cdp enable 6/1-48
    set udld disable 6/1-48
    set udld aggressive-mode disable 6/1-48
    set trunk 6/1 off dot1q 1-1005,1025-4094
    set trunk 6/2 off dot1q 1-1005,1025-4094
    [...]
    set spantree portfast 6/1-48 enable
    set spantree bpdu-filter 6/1-48 default
    set spantree bpdu-guard 6/1-48 default
    set spantree mst link-type 6/1-48 auto
    set spantree portpri 6/1-48 32 mst
    set spantree portinstancepri 6/1 0 mst
    set spantree portinstancepri 6/2 0 mst
    [...]
    set spantree guard none 6/1-48
    set port gvrp 6/1-48 disable
    set gvrp registration normal 6/1-48
    set gvrp applicant normal 6/1-48
    set port gmrp 6/1-48 enable
    set gmrp registration normal 6/1-48
    set gmrp fwdall disable 6/1-48
    set port debounce 6/1 disable
    set port debounce 6/2 disable
    [...]
    set port unicast-flood 6/1-48 enable
    set port errdisable-timeout 6/1-48 enable
    set cam notification added disable 6/1-48
    set cam notification removed disable 6/1-48
    set port channel 6/33-34 mode on
    set port channel 6/1-32,6/35-48 mode off
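
One way to double-check a dump like this is to scan it for the three settings mentioned above (portfast enabled, trunking off, channeling off). The following Python sketch assumes the simple port-list syntax seen in this config (such as 6/1-32,6/35-48); the function names and regular expressions are illustrative assumptions, not part of any Cisco tool.

```python
from __future__ import annotations

import re

# A few lines copied from the config above; the rest is omitted here.
CONFIG = """\
set trunk 6/1 off dot1q 1-1005,1025-4094
set trunk 6/2 off dot1q 1-1005,1025-4094
set spantree portfast 6/1-48 enable
set port channel 6/33-34 mode on
set port channel 6/1-32,6/35-48 mode off
"""

# Each pattern captures (port list, value) for one host-port setting.
PATTERNS = {
    "portfast": re.compile(r"^set spantree portfast (\S+) (enable|disable)"),
    "channel": re.compile(r"^set port channel (\S+) mode (\S+)"),
    "trunk": re.compile(r"^set trunk (\S+) (\S+)"),
}


def expand_ports(spec: str) -> set[tuple[int, int]]:
    """Expand a list such as '6/1-32,6/35-48' into (module, port) pairs."""
    ports = set()
    for part in spec.split(","):
        mod, rng = part.split("/")
        lo, _, hi = rng.partition("-")
        for p in range(int(lo), int(hi or lo) + 1):
            ports.add((int(mod), p))
    return ports


def host_settings(config: str, port: tuple[int, int]) -> dict[str, str]:
    """Collect the portfast, channel, and trunk values that apply to one port."""
    found = {}
    for line in config.splitlines():
        line = line.strip()
        for name, pattern in PATTERNS.items():
            m = pattern.match(line)
            if m and port in expand_ports(m.group(1)):
                found[name] = m.group(2)
    return found


def looks_like_host_port(settings: dict[str, str]) -> bool:
    """True only if all three settings are present with host-style values."""
    return (
        settings.get("portfast") == "enable"
        and settings.get("channel") == "off"
        and settings.get("trunk") == "off"
    )


if __name__ == "__main__":
    for port in [(6, 1), (6, 33)]:
        s = host_settings(CONFIG, port)
        print(port, s, looks_like_host_port(s))
```

In this excerpt port 6/1 passes all three checks, while 6/33 does not because it is in a channel with mode on. The trunk lines for most ports are elided in the post, so only the ports whose trunk lines are shown can be fully checked this way.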


    > Are you sure you're not over thinking this problem with TAC? I.e. doing
    > a "set port host x/y" will fix the 50 sec delay you're talking about.
    > And when the Cat brings up the port for diags, I'm not sure that it
    > would send out the necessary link pulse to negotiate with the other
    > side. I could be worng, but I don't think it would do this.
    >
    > The delay you're talking about sounds like the result of Spanning tree
    > calculation, trunking protocol and PaGP calculation. All of which can
    > be turned off with "set port host"
    >
    > --
    >
    > hsb
    >
     
    Stuart Kendrick, Aug 10, 2004
    #4
  5. yes, i can see myself going to this "don't switch back to active duty"
    approach, too. but before i go there, i want confidence that i
    understand what is happening, and that i'm not missing some cleaner
    solution. i guess what you're saying is that this is the cleanest
    solution you know of. thanx for the input!

    --sk

    > When I have configured Intel teaming in the past I've used the smart-switch
    > feature which makes the active nic the current one until it fails. In other
    > words, if the switch the active nic is connected to fails then the team
    > switches to use the standby nic, but does not switch back once the 1st
    > switch returns to active duty.
    >
    > >
    > > --sk
    > >
    > > stuart kendrick
    > > fhcrc

    >
    > BTW, this sort of redundancy was not designed to give instant failover with
    > no dropped packets, but to allow the continued operation of a service after
    > a failure. Losing a few seconds of availability is better than losing it for
    > hours.
    >
    > BL
     
    Stuart Kendrick, Aug 10, 2004
    #5