L3 Unicast Routing

The hardware is capable of L3 forwarding at wire speed. The ip command, part of the iproute2 tool suite, is used to manage the routing configuration in the Linux kernel. Routes configured with ip are offloaded to hardware automatically.

However, the ip command exposes many configuration options which are not handled in hardware. The specific constraints will be described in the relevant sections below.

1. Basic settings

Forwarding must be enabled in the kernel. There are separate settings for IPv4 and IPv6:

$ sysctl -w net.ipv4.ip_forward=1
$ sysctl -w net.ipv6.conf.all.forwarding=1

The hardware only supports L3 forwarding on a VLAN-aware bridge. To enable L3 forwarding, create a VLAN-aware bridge.

$ ip link add name br0 type bridge vlan_filtering 1
$ bridge vlan
port              vlan-id
br0               1 PVID Egress Untagged

By default, all bridge ports are part of VID 1. If this is undesirable, you can create a bridge without default VLANs:

$ ip link add name br0 type bridge vlan_filtering 1 vlan_default_pvid 0

2. Router interfaces

To enable routing, specific interfaces must be created. We add VLAN 9 and VLAN 10 to the bridge. For each VLAN we add upper VLAN interfaces, br0.10 and br0.9, on top of the bridge. These will become our router interfaces.

$ bridge vlan add dev br0 vid 10 self
$ bridge vlan add dev br0 vid 9 self

$ ip link add link br0 name br0.10 type vlan id 10
$ ip link add link br0 name br0.9 type vlan id 9

$ ip link set up dev br0.10
$ ip link set up dev br0.9

A router interface is associated with a specific VLAN ID. With the example above, the interface br0.10 is associated with VLAN 10. Router interfaces are created in hardware by assigning IP addresses to these interfaces.

$ ip addr add 1.0.10.1/24 dev br0.10
$ ip addr add 1.0.9.1/24 dev br0.9
$ ip route
1.0.9.0/24 dev br0.9 proto kernel scope link src 1.0.9.1 rt_offload
1.0.10.0/24 dev br0.10 proto kernel scope link src 1.0.10.1 rt_offload
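
The four steps above (add the VLAN to the bridge, create the upper interface, bring it up, assign an address) repeat for every router interface. As a minimal sketch, the hypothetical helper router_if_cmds below prints the commands for one VLAN instead of running them, so they can be reviewed first. The function name and argument order are assumptions made for this illustration.

```shell
# Print the commands that create one router interface.
# $1: bridge name, $2: VLAN ID, $3: IP address with prefix length
# Hypothetical helper, not part of iproute2.
router_if_cmds() {
    printf 'bridge vlan add dev %s vid %s self\n' "$1" "$2"
    printf 'ip link add link %s name %s.%s type vlan id %s\n' "$1" "$1" "$2" "$2"
    printf 'ip link set up dev %s.%s\n' "$1" "$2"
    printf 'ip addr add %s dev %s.%s\n' "$3" "$1" "$2"
}
```

For example, router_if_cmds br0 9 1.0.9.1/24 prints the commands used for br0.9 above. Piping the output to sh as root would apply them.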

Routes that are offloaded are marked with rt_offload. Routes that are offloaded, but whose traffic is trapped to the CPU, are additionally marked with rt_trap. As an example, traffic sent to the device itself is trapped:

$ ip route show table local
local 1.0.9.1 dev br0.9 proto kernel scope host src 1.0.9.1 rt_offload rt_trap
local 1.0.10.1 dev br0.10 proto kernel scope host src 1.0.10.1 rt_offload rt_trap
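
Since the flags appear as plain words in the ip route output, a small filter can list the routes that were not offloaded. The sketch below is one possible way to do this; routes_without_flag is a hypothetical helper name, and it assumes the output format shown above, where nexthop lines are indented.

```shell
# Read `ip route` output on stdin and print each route that does NOT carry
# the flag given as $1 (for example rt_offload or rt_trap).
# Hypothetical helper, assuming the output format shown in this document.
routes_without_flag() {
    awk -v flag="$1" '
        # Skip blank lines and indented nexthop continuation lines.
        NF == 0 || substr($0, 1, 1) == " " || substr($0, 1, 1) == "\t" { next }
        {
            found = 0
            for (i = 1; i <= NF; i++)
                if ($i == flag) found = 1
            if (!found) print
        }'
}
```

A typical use would be ip route | routes_without_flag rt_offload, which prints nothing when every route in the main table is offloaded.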

See section 12, Full examples, for complete configurations.

3. Nexthop routes and ECMP

Nexthop routes are added in the usual manner.

$ ip route replace 6.6.0.0/16 nexthop via 1.0.9.50 nexthop via 1.0.10.50
$ ip route
6.6.0.0/16 rt_offload
	nexthop via 1.0.9.50 dev br0.9 weight 1
	nexthop via 1.0.10.50 dev br0.10 weight 1

Not all possible nexthop routes are offloadable, due to hardware or driver constraints. The following requirements must be met:

Total number of nexthops is at most 16.
Only main and local routing tables are offloaded.
Nexthop objects are not offloaded.
Nexthop gateway egress device must be a router interface.
Nexthop weight must be 1.
Multicast not supported.

For the route to be offloaded, the requirements must be met for all nexthops in a route.
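
Two of these requirements, the nexthop count and the nexthop weight, can be checked directly against the ip route output. The sketch below is only an illustration: check_nexthops is a hypothetical helper, and it assumes the limit of 16 applies to the nexthops of a single route, which the list above does not state explicitly.

```shell
# Read the output of `ip route show <prefix>` on stdin and check the nexthop
# count and weight requirements. Hypothetical helper.
# Assumption: the limit of 16 nexthops is counted per route.
check_nexthops() {
    awk '
        $1 == "nexthop" {
            count++
            for (i = 1; i < NF; i++)
                if ($i == "weight" && $(i + 1) != 1) bad_weight++
        }
        END {
            if (count > 16) { print "too many nexthops: " count; exit 1 }
            if (bad_weight) { print "nexthops with weight != 1: " bad_weight; exit 1 }
            print "ok: " (count + 0) " nexthops"
        }'
}
```

For the route above, ip route show 6.6.0.0/16 | check_nexthops reports two nexthops and exits with status 0. Routes with a single gateway are printed without nexthop lines and therefore count as zero.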

3.1. ECMP hashing algorithm

The hashing algorithm used in hardware to select a nexthop is different from the kernel default algorithm. The following protocol fields contribute to the nexthop choice:

Source IP for IPv4 and IPv6 frames.
IPv6 flow label.
TCP/UDP source and destination ports for TCP/UDP frames.

Each contributing field is split into 4-bit nibbles, and all nibbles are XORed together into a single 4-bit code. As an example, suppose we receive a UDP frame with

SIP      = 0x01030507
SRC_PORT = 0x1004
DST_PORT = 0x1003

ECMP_CODE = (0x0 ^ 0x1 ^ 0x0 ^ 0x3 ^ 0x0 ^ 0x5 ^ 0x0 ^ 0x7) ^
            (0x1 ^ 0x0 ^ 0x0 ^ 0x4) ^
            (0x1 ^ 0x0 ^ 0x0 ^ 0x3)
          = 0x7

For a nexthop route with N nexthops, the final nexthop is found by taking this code modulo N.
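
The calculation above can be reproduced with shell arithmetic. This is a minimal sketch for IPv4 frames only (a 128-bit IPv6 address does not fit in shell integers); the function names are made up for this example, and how the resulting index maps to a particular nexthop in the route is not specified here.

```shell
# XOR all 4-bit nibbles of a value together.
# $1: value, decimal or 0x-prefixed hex
xor_nibbles() (
    value=$(( $1 ))
    code=0
    while [ "$value" -gt 0 ]; do
        code=$(( code ^ (value & 0xf) ))
        value=$(( value >> 4 ))
    done
    echo "$code"
)

# ECMP code for a TCP/UDP over IPv4 frame.
# $1: source IP as a 32-bit number, $2: source port, $3: destination port
ecmp_code() (
    echo $(( $(xor_nibbles "$1") ^ $(xor_nibbles "$2") ^ $(xor_nibbles "$3") ))
)

# Nexthop index for a route with $2 nexthops, given the ECMP code $1.
ecmp_index() (
    echo $(( $1 % $2 ))
)
```

With the values from the example, ecmp_code 0x01030507 0x1004 0x1003 prints 7, and for a route with two nexthops ecmp_index 7 2 prints 1.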

4. Blackhole Routes

A blackhole route is a route where incoming traffic is silently discarded. The source is not notified in any way. Blackhole routes are sometimes called null routes or discard routes.

The hardware supports blackhole routes, which means frames are dropped in hardware without loading the CPU. The ip command makes it simple to add blackhole routes.

$ ip route add blackhole 6.7.0.0/16
$ ip route
blackhole 6.7.0.0/16 rt_offload

Here, any received IP frame addressed to the subnet 6.7.0.0/16 is dropped and never routed.

Likewise, the prohibit and unreachable route types are supported. They also drop frames, but respond to the sender with an ICMP message.

$ ip route replace prohibit 6.7.0.0/16
$ ip route
prohibit 6.7.0.0/16 rt_offload rt_trap
$ ip route replace unreachable 6.7.0.0/16
$ ip route
unreachable 6.7.0.0/16 rt_offload rt_trap

As seen above, these route types are trapped and thus handled by the CPU. This is necessary to respond with the correct ICMP messages.

For IPv4, these routes respond with the following ICMPv4 messages:

prohibit: type Destination Unreachable (0x03) and code Communication administratively prohibited (0x0d).
unreachable: type Destination Unreachable (0x03) and code Destination host unreachable (0x01).

For IPv6, these routes respond with the following ICMPv6 messages:

prohibit: type Destination Unreachable (0x01) and code Communication with destination administratively prohibited (0x01).
unreachable: type Destination Unreachable (0x01) and code No route to destination (0x00).

The kernel may ratelimit responses, so not all frames matching the route will receive an ICMP response.

5. Nexthop objects not supported

Nexthop objects can be managed with the ip nexthop subcommand. The device does not support nexthop objects, and routes that refer to nexthop objects are not offloaded.

6. Neighbours

Neighbours discovered by the kernel that reside in offloaded subnets are offloaded automatically. Offloaded neighbours are marked with offload:

$ ip nei
1.0.10.2 dev br0.10 lladdr 90:e2:ba:63:78:de offload STALE
1.0.9.2 dev br0.9 lladdr 90:e2:ba:63:78:dc offload STALE

It is possible to add permanent neighbours explicitly:

$ ip nei replace to 1.0.10.2 lladdr 90:e2:ba:63:78:de dev br0.10
$ ip nei replace to 1.0.9.2 lladdr 90:e2:ba:63:78:dc dev br0.9
$ ip nei
1.0.9.2 dev br0.9 lladdr 90:e2:ba:63:78:dc offload PERMANENT
1.0.10.2 dev br0.10 lladdr 90:e2:ba:63:78:de offload PERMANENT

When a nexthop route is added/replaced, the neighbours are added and the ARP/ND process is initiated.

$ ip route replace 100.99.0.0/16 \
>         nexthop via 1.0.10.101 \
>         nexthop via 1.0.10.103 \
>         nexthop via 1.0.10.104
$ ip nei
1.0.10.101 dev br0.10 INCOMPLETE
1.0.10.104 dev br0.10 INCOMPLETE
1.0.10.103 dev br0.10 INCOMPLETE

The neighbour state is updated as ARP/ND responses are processed by the kernel.
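
To see at a glance how far neighbour resolution has progressed, the ip nei output can be summarised. The sketch below uses a hypothetical helper, neigh_summary, and assumes the output format shown above, with the neighbour state as the last word on each line.

```shell
# Read `ip nei` output on stdin and print how many entries are marked offload
# and how many are still unresolved (state INCOMPLETE or FAILED).
# Hypothetical helper, assuming the state is the last field of each line.
neigh_summary() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i == "offload") offloaded++
            if ($NF == "INCOMPLETE" || $NF == "FAILED") unresolved++
        }
        END { printf "offloaded=%d unresolved=%d\n", offloaded, unresolved }'
}
```

Running ip nei | neigh_summary right after the route above is added would be expected to report three unresolved entries until the ARP/ND replies arrive.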

7. IPv6 vs IPv4 API differences

The kernel interface for IPv6 is slightly different from IPv4. For IPv4 you use the commands

ip route add
ip route replace
ip route delete

With these commands, the entire nexthop group is always replaced or deleted. For IPv6, however, it is also possible to add or remove single nexthops from a nexthop group using append and delete.

$ ip -6 route replace 2001:db8::1/56 \
>         nexthop via 2001:db8:1:20::100 \
>         nexthop via 2001:db8:1:10::100
$ ip -6 route
2001:db8::/56 metric 1024 rt_offload pref medium
	nexthop via 2001:db8:1:20::100 dev br0.9 weight 1
	nexthop via 2001:db8:1:10::100 dev br0.10 weight 1

Add two new nexthops to the route:

$ ip -6 route append 2001:db8::1/56 \
>         nexthop via 2001:db8:1:20::101 \
>         nexthop via 2001:db8:1:10::101
$ ip -6 route
2001:db8::/56 metric 1024 rt_offload pref medium
	nexthop via 2001:db8:1:20::100 dev br0.9 weight 1
	nexthop via 2001:db8:1:10::100 dev br0.10 weight 1
	nexthop via 2001:db8:1:20::101 dev br0.9 weight 1
	nexthop via 2001:db8:1:10::101 dev br0.10 weight 1

Delete two nexthops from the route:

$ ip -6 route delete 2001:db8::1/56 \
>         nexthop via 2001:db8:1:20::100 \
>         nexthop via 2001:db8:1:10::101
$ ip -6 route
2001:db8::/56 metric 1024 rt_offload pref medium
	nexthop via 2001:db8:1:10::100 dev br0.10 weight 1
	nexthop via 2001:db8:1:20::101 dev br0.9 weight 1

Replace the entire nexthop group:

$ ip -6 route replace 2001:db8::1/56 \
>         nexthop via 2001:db8:1:20::110 \
>         nexthop via 2001:db8:1:10::110
$ ip -6 route
2001:db8::/56 metric 1024 rt_offload pref medium
	nexthop via 2001:db8:1:20::110 dev br0.9 weight 1
	nexthop via 2001:db8:1:10::110 dev br0.10 weight 1

8. Router MAC

The VLAN interfaces on top of the bridge interface all share MAC addresses with the bridge interface. The specific MAC assigned to a bridge depends on configuration.

The default Linux behaviour is to give the bridge the numerically lowest MAC address among the interfaces joined to the bridge. This means the MAC changes dynamically as ports join and leave the bridge. Alternatively, the bridge MAC can be explicitly set by the user.
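
To predict which address the bridge will take by default, the port MAC addresses can simply be sorted. The helper below is a hypothetical sketch; it assumes every address is written as six colon-separated two-digit hex bytes, so that a byte-wise sort matches numeric order.

```shell
# Print the numerically lowest MAC address among the arguments.
# Addresses are lower-cased first so that letter case does not affect the order.
# Hypothetical helper; expects the aa:bb:cc:dd:ee:ff notation.
lowest_mac() {
    printf '%s\n' "$@" | tr 'A-F' 'a-f' | LC_ALL=C sort | head -n 1
}
```

To avoid the address changing as ports come and go, the bridge MAC can be pinned with ip link set dev br0 address followed by the chosen address.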

9. Useful sysctl options

Here is a list of options which are either necessary or useful. The kernel defaults may not be aligned with the default hardware behaviour.
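
As a convenience, the settings listed in the subsections below can be collected in one place. The sketch prints them in sysctl.conf syntax using the values from this section; the function name is an assumption, and the per-interface send_redirects setting is left out because it depends on the interface names in use.

```shell
# Print the sysctl settings described in this section, one per line, in
# sysctl.conf syntax. Hypothetical helper; values are taken from this section.
l3_sysctl_settings() {
    cat <<'EOF'
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
net.ipv6.conf.default.forwarding = 1
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.all.ndisc_notify = 1
net.ipv6.conf.default.ndisc_notify = 1
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv6.route.skip_notify_on_dev_down = 1
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv6.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv6.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv6.neigh.default.gc_thresh3 = 4096
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
EOF
}
```

The output can be redirected to a file of your choice and loaded with sysctl -p followed by the file name.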

9.1. Enable Forwarding

Enable IPv4 and IPv6 forwarding.

$ sysctl -w net.ipv4.ip_forward=1
$ sysctl -w net.ipv6.conf.all.forwarding=1
$ sysctl -w net.ipv6.conf.default.forwarding=1

9.2. IPv6 Interface down

The kernel flushes IPv6 addresses when an interface is brought down, which destroys the router interfaces in hardware. This behaviour differs from IPv4, where addresses are kept when an interface is brought down. To get the same behaviour for IPv6 you can use:

$ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
$ sysctl -w net.ipv6.conf.default.keep_addr_on_down=1

9.3. Neighbour advertisement on interface up

Generate unsolicited neighbour advertisements when a device is brought up or hardware address changes.

$ sysctl -w net.ipv6.conf.all.ndisc_notify=1
$ sysctl -w net.ipv6.conf.default.ndisc_notify=1

9.4. Reverse Path Filtering

The kernel IPv4 network stack supports three RP filtering modes:

No filtering, rp_filter=0.
Strict mode (RFC 3704), rp_filter=1.
Loose mode (RFC 3704), rp_filter=2.

The hardware can support all modes, but the driver does not implement this at the moment. In practice this means the source IP does not matter for the purpose of routing decisions in hardware. Disable rp_filter in the kernel to be consistent with hardware behaviour.

$ sysctl -w net.ipv4.conf.all.rp_filter=0
$ sysctl -w net.ipv4.conf.default.rp_filter=0

9.5. IPv6 netlink events on interface down

Controls whether an RTM_DELROUTE message is generated for routes removed when a device is taken down or deleted. IPv4 does not generate this message; IPv6 does by default.

$ sysctl -w net.ipv6.route.skip_notify_on_dev_down=1

9.6. Neighbour GC thresholds

9.6.1. gc_thresh1

Minimum number of entries to keep. The garbage collector will not purge entries if there are fewer than this number. The default is 128.

$ sysctl -w net.ipv4.neigh.default.gc_thresh1=128
$ sysctl -w net.ipv6.neigh.default.gc_thresh1=128

9.6.2. gc_thresh2

Threshold at which the garbage collector becomes more aggressive about purging entries. Entries older than 5 seconds will be cleared when over this number. The default is 512.

$ sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
$ sysctl -w net.ipv6.neigh.default.gc_thresh2=2048

9.6.3. gc_thresh3

Maximum number of non-PERMANENT neighbour entries allowed. Increase this when using large numbers of interfaces and when communicating with large numbers of directly connected peers. The default is 1024.

$ sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
$ sysctl -w net.ipv6.neigh.default.gc_thresh3=4096

9.7. ICMP redirects

The kernel cannot always determine whether two hosts are in the same broadcast domain. When routing between such hosts, it may incorrectly send an ICMP redirect. To avoid this, redirects can be disabled.

9.7.1. IPv4

$ sysctl -w net.ipv4.conf.all.send_redirects=0
$ sysctl -w net.ipv4.conf.default.send_redirects=0
$ sysctl -w net.ipv4.conf.${INTERFACE}.send_redirects=0

9.7.2. IPv6

For IPv6, ICMP redirects cannot be disabled with sysctl, but they can be dropped with ip6tables:

$ ip6tables -A OUTPUT -p icmpv6 --icmpv6-type redirect -j DROP

10. Limitations to hardware routing

Routing in hardware is subject to limitations. A frame is only eligible for routing in hardware if it meets all of the following requirements:

IPv4 packets must not contain options.
IPv6 packets must not contain hop-by-hop options.
IPv4 header checksum must be correct.

In the default configuration, frames that do not meet these requirements are trapped to the CPU, leaving it up to the kernel configuration to decide how to handle them.
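
The IPv4 options requirement can be checked from the first byte of the IPv4 header: its low nibble is the header length (IHL) in 32-bit words, and a value above 5 means options are present. The helper below is a small illustrative sketch with a made-up name.

```shell
# Succeed (exit status 0) if an IPv4 header carries options.
# $1: first byte of the IPv4 header, for example 0x45.
# The low nibble is the IHL field; a header without options has IHL 5.
ipv4_has_options() {
    [ $(( $1 & 0xf )) -gt 5 ]
}
```

For example, ipv4_has_options 0x46 succeeds while ipv4_has_options 0x45 fails. The same test can be written as the capture filter ip[0] & 0xf != 5 to observe such frames with tcpdump.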

11. Debugging

Do not use the vcap tool to manipulate the LPM VCAP when routing is enabled. It is critical for proper functioning that the hardware state is in sync with the kernel state.

If debugfs is enabled, it is possible to inspect the hardware routing table.

$ cat /sys/kernel/debug/sparx5/vcaps/lpm_0

If a route has been hit in hardware, the sticky bit is set, shown as hit: 1 on the rule line:

...
rule: 6, addr: [18426,18426], X1, ctr[0]: 0, hit: 1
  chain_id: 6000000
  user: 5
  priority: 96
  state: permanent
  keysets: VCAP_KFS_SGL_IP4
  keyset_sw: 1
  keyset_sw_regs: 2
    DST_FLAG: W1: 1/1
    IP4_XIP: W32: 1.0.10.2/255.255.255.255
  actionset: VCAP_AFS_ARP_ENTRY
  actionset_sw: 1
  actionset_sw_regs: 3
    ARP_ENA: W1: 1
    MAC_LSB: W32: 0xba6378de
    MAC_MSB: W16: 0x90e2
    TYPE: W2: 2
...

The VCAP counters do not exist for the LPM VCAP.
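
Since there are no counters, the hit bit is the main indication that an entry is in use. The sketch below lists the IPv4 keys of all rules with the hit bit set; lpm_hit_rules is a hypothetical helper, and it assumes the dump format shown above (a rule: line ending in the hit value, followed by an IP4_XIP key line).

```shell
# Read an LPM VCAP dump on stdin and print the IP4_XIP key of every rule
# whose hit bit is set. Hypothetical helper, IPv4 keys only.
# Assumes the dump format shown in this section.
lpm_hit_rules() {
    awk '
        $1 == "rule:" { hit = ($NF == 1) }
        hit && $1 == "IP4_XIP:" { print $NF }'
}
```

It would be used as cat /sys/kernel/debug/sparx5/vcaps/lpm_0 | lpm_hit_rules, which for the rule above prints 1.0.10.2/255.255.255.255.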

12. Full examples

12.1. IPv4

Suppose we have two VLANs configured on the switch, each with a subnet assigned. We wish to route L3 traffic between the VLANs.

       bridge                     host0
+-------------+ pvid 9      +-----------------------+
|             | untagged   ethp0    1.0.9.2         |
|  br0.9      +------------>|                       |
|            eth0           |1.0.10.0/24 via 1.0.9.1|
| 1.0.9.1/24  |             +-----------------------+
|             |
|             |
|   br0       |
|             |                     host1
|             | pvid 10     +-----------------------+
|             | untagged   ethp1     1.0.10.2       |
|  br0.10     +------------>|                       |
|             eth1          |1.0.9.0/24 via 1.0.10.1|
| 1.0.10.1/24 |             +-----------------------+
|             |
+-------------+

12.1.1. Switch configuration

Add the bridge and enable forwarding.

$ ip link add name br0 type bridge vlan_filtering 1 vlan_default_pvid 0
$ sysctl -w net.ipv4.ip_forward=1
$ sysctl -w net.ipv4.conf.all.forwarding=1

Disable ICMP redirects.

$ sysctl -w net.ipv4.conf.all.send_redirects=0
$ sysctl -w net.ipv4.conf.br0.send_redirects=0
$ sysctl -w net.ipv4.conf.eth0.send_redirects=0
$ sysctl -w net.ipv4.conf.eth1.send_redirects=0

Join the two port interfaces to the bridge, and add VLAN configuration.

$ ip link set eth0 master br0 up
$ ip link set eth1 master br0 up
$ ip link set up dev br0
$ bridge vlan add dev br0 vid 9 self
$ bridge vlan add dev br0 vid 10 self
$ bridge vlan add dev eth0 vid 9 pvid untagged
$ bridge vlan add dev eth1 vid 10 pvid untagged

For each VLAN, we add a VLAN upper interface on top of the bridge.

$ ip link add link br0 name br0.9 type vlan id 9
$ ip link add link br0 name br0.10 type vlan id 10
$ ip link set up dev br0.9
$ ip link set up dev br0.10

Create a gateway in each VLAN by assigning an IP address, with its subnet, to each interface.

$ ip addr add 1.0.9.1/24 dev br0.9
$ ip addr add 1.0.10.1/24 dev br0.10

12.1.2. Host configuration

For each host, we configure an IP in the subnet, and set a gateway for the opposite subnet.

For host 0:

$ ip addr add 1.0.9.2/24 dev ethp0
$ ip link set up dev ethp0
$ ip route add 1.0.10.0/24 via 1.0.9.1 dev ethp0

For host 1:

$ ip addr add 1.0.10.2/24 dev ethp1
$ ip link set up dev ethp1
$ ip route add 1.0.9.0/24 via 1.0.10.1 dev ethp1

12.1.3. Ping

If we ping from 1.0.9.2 to 1.0.10.2, the switch kernel will initiate ARP for both hosts, install the neighbours in hardware, and offload the dataplane to hardware.

host0# ping -c 3 1.0.10.2
PING 1.0.10.2 (1.0.10.2) 56(84) bytes of data.
64 bytes from 1.0.10.2: icmp_seq=1 ttl=63 time=0.975 ms
64 bytes from 1.0.10.2: icmp_seq=2 ttl=63 time=0.342 ms
64 bytes from 1.0.10.2: icmp_seq=3 ttl=63 time=0.350 ms

The switch neighbour table shows the offloaded neighbours:

$ ip nei
1.0.10.2 dev br0.10 lladdr 90:e2:ba:63:78:de offload STALE
1.0.9.2 dev br0.9 lladdr 90:e2:ba:63:78:dc offload STALE