L3 Unicast Routing
The hardware is capable of L3 forwarding at wire speed. The ip command is used to manage the routing configuration in the Linux kernel and is part of the iproute2 tool suite. The ip command makes it simple to configure the Linux kernel, and offloading to hardware happens automatically.
However, the ip command exposes many configuration options which are not handled in hardware. The specific constraints are described in the relevant sections below.
1. Basic settings
Forwarding must be enabled in the kernel. There are separate settings for IPv4 and IPv6:
$ sysctl -w net.ipv4.ip_forward=1
$ sysctl -w net.ipv6.conf.all.forwarding=1
The hardware only supports L3 forwarding on a VLAN-aware bridge. To enable L3 forwarding, create a VLAN-aware bridge.
$ ip link add name br0 type bridge vlan_filtering 1
$ bridge vlan
port              vlan-id
br0               1 PVID Egress Untagged
By default, all bridge ports are part of VID 1. If this is undesirable, you can create a bridge without default VLANs:
$ ip link add name br0 type bridge vlan_filtering 1 vlan_default_pvid 0
2. Router interfaces
To enable routing, specific interfaces must be created. We add VLAN 9 and VLAN 10 to the bridge. For each VLAN we add an upper VLAN interface, br0.10 and br0.9, on top of the bridge. These will become our router interfaces.
$ bridge vlan add dev br0 vid 10 self
$ bridge vlan add dev br0 vid 9 self
$ ip link add link br0 name br0.10 type vlan id 10
$ ip link add link br0 name br0.9 type vlan id 9
$ ip link set up dev br0.10
$ ip link set up dev br0.9
A router interface is associated with a specific VLAN ID. With the example above, the interface br0.10 is associated with VLAN 10.
Router interfaces are created in hardware by assigning IP addresses to these interfaces.
$ ip addr add 1.0.10.1/24 dev br0.10
$ ip addr add 1.0.9.1/24 dev br0.9
$ ip route
1.0.9.0/24 dev br0.9 proto kernel scope link src 1.0.9.1 rt_offload
1.0.10.0/24 dev br0.10 proto kernel scope link src 1.0.10.1 rt_offload
Routes that are offloaded are marked with rt_offload.
Routes that are offloaded, but sent to the CPU, are marked with rt_trap. As an example, traffic sent to the device itself is trapped:
$ ip route show table local
local 1.0.9.1 dev br0.9 proto kernel scope host src 1.0.9.1 rt_offload rt_trap
local 1.0.10.1 dev br0.10 proto kernel scope host src 1.0.10.1 rt_offload rt_trap
See L3 Examples for full examples.
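Because offloaded routes carry the rt_offload flag, a quick way to spot routes that are handled by the CPU only is to filter the route listing for entries without it. The following is a minimal sketch, not part of the driver tooling: the function name routes_not_offloaded is made up for illustration, and it assumes the plain-text layout of ip route shown above, where the nexthop lines of a multipath route are printed below the route line.

```shell
# Hypothetical helper: print routes that lack the rt_offload flag.
# Reads `ip route` style output on stdin.
routes_not_offloaded() {
    # Skip the indented "nexthop ..." continuation lines of multipath
    # routes; the flag is only printed on the first line of a route.
    awk '!/^[ \t]*nexthop / && !/(^| )rt_offload( |$)/'
}
```

It can be used as `ip route show | routes_not_offloaded`. An empty result means every listed route is marked as offloaded.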
3. Nexthop routes and ECMP
Nexthop routes are added in the usual manner.
$ ip route replace 6.6.0.0/16 nexthop via 1.0.9.50 nexthop via 1.0.10.50
$ ip route
6.6.0.0/16 rt_offload
	nexthop via 1.0.9.50 dev br0.9 weight 1
	nexthop via 1.0.10.50 dev br0.10 weight 1
Not all possible nexthop routes are offloadable, due to hardware or driver constraints. The following requirements must be met:
- Total number of nexthops is at most 16.
- Only main and local routing tables are offloaded.
- Nexthop objects are not offloaded.
- Nexthop gateway egress device must be a router interface.
- Nexthop weight must be 1.
- Multicast is not supported.
For the route to be offloaded, the requirements must be met for all nexthops in a route.
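Two of these requirements, the nexthop count and the nexthop weight, can be checked mechanically from the route listing. The sketch below is an illustration only: the function name nexthops_offloadable is hypothetical, it expects the nexthop lines of a single route on stdin in the format printed by ip route, and it does not check the remaining requirements (routing table, egress device, nexthop objects, multicast).

```shell
# Hypothetical helper: succeed (exit status 0) if the nexthop lines on
# stdin number at most 16 and every nexthop has weight 1.
nexthops_offloadable() {
    awk '
        $1 == "nexthop" {
            count++
            # Look for the "weight <n>" pair on this nexthop line.
            for (i = 2; i < NF; i++)
                if ($i == "weight" && $(i + 1) != 1)
                    bad = 1
        }
        END { if (count > 16 || bad) exit 1 }
    '
}
```

For example, `ip route show 6.6.0.0/16 | nexthops_offloadable && echo ok` prints ok for the route above, since it has two nexthops with weight 1.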
3.1. ECMP hashing algorithm
The hashing algorithm used in hardware to select a nexthop is different from the kernel default algorithm. The following protocol fields contribute to the nexthop choice:
- Source IP for IPv4 and IPv6 frames.
- IPv6 flow label.
- TCP/UDP source and destination ports for TCP/UDP frames.
Each contributing field is split into 4-bit nibbles, and all nibbles are XOR'ed together. As an example, suppose we receive a UDP frame with
SIP = 0x01030507
SRC_PORT = 0x1004
DST_PORT = 0x1003
ECMP_CODE = (0x0 ^ 0x1 ^ 0x0 ^ 0x3 ^ 0x0 ^ 0x5 ^ 0x0 ^ 0x7)
          ^ (0x1 ^ 0x0 ^ 0x0 ^ 0x4)
          ^ (0x1 ^ 0x0 ^ 0x0 ^ 0x3)
          = 0x7
For a nexthop route with N nexthops, the final nexthop is found by taking this code modulo N.
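The calculation can be reproduced in a few lines of shell arithmetic, which is handy when predicting which nexthop a test flow will use. This is a sketch based only on the description and example above, for an IPv4 TCP/UDP frame. The function names are made up for illustration, and how the resulting index maps onto the order of nexthops in hardware is not stated here, so treat the index as an assumption to be verified on the device.

```shell
# XOR together all 4-bit nibbles of value $1, which is $2 bits wide.
nibble_xor() {
    value=$(($1))
    bits=$2
    acc=0
    while [ "$bits" -gt 0 ]; do
        acc=$(( acc ^ (value & 0xF) ))
        value=$(( value >> 4 ))
        bits=$(( bits - 4 ))
    done
    echo "$acc"
}

# ECMP code for an IPv4 TCP/UDP frame.
# Arguments: source IP (32 bit), source port, destination port.
ecmp_code_ipv4() {
    echo $(( $(nibble_xor "$1" 32) ^ $(nibble_xor "$2" 16) ^ $(nibble_xor "$3" 16) ))
}

# Nexthop index for ECMP code $1 on a route with $2 nexthops.
ecmp_nexthop_index() {
    echo $(( $1 % $2 ))
}
```

With the values from the example, `ecmp_code_ipv4 0x01030507 0x1004 0x1003` prints 7, and for a route with two nexthops `ecmp_nexthop_index 7 2` prints 1.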
4. Blackhole Routes
A blackhole route is a route where incoming traffic is silently discarded. The source is not notified in any way. Blackhole routes are sometimes called null routes or discard routes.
The hardware supports adding blackhole routes, which means frames are dropped efficiently in hardware, without putting pressure on the CPU system. The ip command makes it simple to add blackhole routes.
$ ip route add blackhole 6.7.0.0/16
$ ip route
blackhole 6.7.0.0/16 rt_offload
Here, any received IP frame addressed to the subnet 6.7.0.0/16 is dropped and never routed.
Likewise, the prohibit and unreachable route types are also supported. They also drop frames, but respond with ICMP frames.
$ ip route replace prohibit 6.7.0.0/16
$ ip route
prohibit 6.7.0.0/16 rt_offload rt_trap
$ ip route replace unreachable 6.7.0.0/16
$ ip route
unreachable 6.7.0.0/16 rt_offload rt_trap
As seen above, these types are trapped and thus handled by the CPU. This is necessary to respond with the correct ICMP frames.
For IPv4, these routes respond with the following ICMPv4 messages:
- prohibit: type Destination Unreachable (0x03) and code Communication administratively prohibited (0x0d).
- unreachable: type Destination Unreachable (0x03) and code Destination host unreachable (0x01).
For IPv6, these routes respond with the following ICMPv6 messages:
- prohibit: type Destination Unreachable (0x01) and code Communication with destination administratively prohibited (0x01).
- unreachable: type Destination Unreachable (0x01) and code No route to destination (0x00).
The kernel may ratelimit responses, so not all frames matching the route will receive an ICMP response.
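To see how aggressively the kernel limits these responses, the standard Linux ICMP rate limit settings can be inspected. The commands below only read the current values. The keys are generic kernel sysctls rather than anything specific to this hardware, and the values in effect on a given system are not stated in this document.

```shell
# Read-only query of the kernel ICMP rate limit settings.
# IPv4: minimum interval between limited ICMP messages, and the bitmask
# of ICMP types that the limit applies to.
sysctl net.ipv4.icmp_ratelimit net.ipv4.icmp_ratemask
# IPv6: minimum interval between ICMPv6 messages.
sysctl net.ipv6.icmp.ratelimit
```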
5. Nexthop objects not supported
Nexthop objects can be managed with the ip nexthop subcommand. The device does not support nexthop objects, and routes that refer to nexthop objects will not be offloaded.
6. Neighbours
Neighbours discovered by the kernel that reside in offloaded subnets are offloaded automatically.
Offloaded neighbours are marked with offload:
$ ip nei
1.0.10.2 dev br0.10 lladdr 90:e2:ba:63:78:de offload STALE
1.0.9.2 dev br0.9 lladdr 90:e2:ba:63:78:dc offload STALE
It is possible to add permanent neighbours explicitly:
$ ip nei replace to 1.0.10.2 lladdr 90:e2:ba:63:78:de dev br0.10
$ ip nei replace to 1.0.9.2 lladdr 90:e2:ba:63:78:dc dev br0.9
$ ip nei
1.0.9.2 dev br0.9 lladdr 90:e2:ba:63:78:dc offload PERMANENT
1.0.10.2 dev br0.10 lladdr 90:e2:ba:63:78:de offload PERMANENT
When a nexthop route is added/replaced, the neighbours are added and the ARP/ND process is initiated.
$ ip route replace 100.99.0.0/16 \
> nexthop via 1.0.10.101 \
> nexthop via 1.0.10.103 \
> nexthop via 1.0.10.104
$ ip nei
1.0.10.101 dev br0.10 INCOMPLETE
1.0.10.104 dev br0.10 INCOMPLETE
1.0.10.103 dev br0.10 INCOMPLETE
The neighbour state is updated as ARP/ND responses are processed by the kernel.
7. IPv6 vs IPv4 API differences
The kernel interface for IPv6 is slightly different from IPv4. For IPv4 you use the commands:
- ip route add
- ip route replace
- ip route delete
With these commands, the entire nexthop group is always replaced or deleted. For IPv6, however, it is possible to add or remove single nexthops from a nexthop group using append and delete.
$ ip -6 route replace 2001:db8::1/56 \
> nexthop via 2001:db8:1:20::100 \
> nexthop via 2001:db8:1:10::100
$ ip -6 route
2001:db8::/56 metric 1024 rt_offload pref medium
	nexthop via 2001:db8:1:20::100 dev br0.9 weight 1
	nexthop via 2001:db8:1:10::100 dev br0.10 weight 1
Add two new nexthops to the route:
$ ip -6 route append 2001:db8::1/56 \
> nexthop via 2001:db8:1:20::101 \
> nexthop via 2001:db8:1:10::101
$ ip -6 route
2001:db8::/56 metric 1024 rt_offload pref medium
	nexthop via 2001:db8:1:20::100 dev br0.9 weight 1
	nexthop via 2001:db8:1:10::100 dev br0.10 weight 1
	nexthop via 2001:db8:1:20::101 dev br0.9 weight 1
	nexthop via 2001:db8:1:10::101 dev br0.10 weight 1
Delete two nexthops from the route:
$ ip -6 route delete 2001:db8::1/56 \
> nexthop via 2001:db8:1:20::100 \
> nexthop via 2001:db8:1:10::101
$ ip -6 route
2001:db8::/56 metric 1024 rt_offload pref medium
	nexthop via 2001:db8:1:10::100 dev br0.10 weight 1
	nexthop via 2001:db8:1:20::101 dev br0.9 weight 1
Replace the entire nexthop group:
$ ip -6 route replace 2001:db8::1/56 \
> nexthop via 2001:db8:1:20::110 \
> nexthop via 2001:db8:1:10::110
$ ip -6 route
2001:db8::/56 metric 1024 rt_offload pref medium
	nexthop via 2001:db8:1:20::110 dev br0.9 weight 1
	nexthop via 2001:db8:1:10::110 dev br0.10 weight 1
8. Router MAC
The VLAN interfaces on top of the bridge interface all share their MAC address with the bridge interface. The specific MAC assigned to a bridge depends on configuration.
The default Linux behaviour is to use the lowest MAC address among the interfaces joined to the bridge. This means the MAC can change dynamically as ports join and leave the bridge. Alternatively, the bridge MAC can be set explicitly by the user.
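Setting the bridge MAC explicitly uses the standard ip link syntax. The sketch below assumes the bridge is named br0 as in the earlier sections. The address 02:00:00:00:00:01 is a hypothetical locally administered address chosen for illustration, so substitute one that is valid for your network.

```shell
# Pin the bridge MAC so that it no longer follows port membership.
# 02:00:00:00:00:01 is a placeholder address, not a recommended value.
ip link set dev br0 address 02:00:00:00:00:01

# Check the result; the VLAN upper interfaces should report the same MAC.
ip -br link show dev br0
ip -br link show dev br0.10
```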
9. Useful sysctl options
Here is a list of options which are either necessary or useful. The kernel defaults may not be aligned with the default hardware behaviour.
9.1. Enable Forwarding
Enable IPv4 and IPv6 forwarding.
$ sysctl -w net.ipv4.ip_forward=1
$ sysctl -w net.ipv6.conf.all.forwarding=1
$ sysctl -w net.ipv6.conf.default.forwarding=1
9.2. IPv6 Interface down
The kernel flushes IPv6 addresses when an interface is brought down, which destroys the router interfaces in hardware. This behaviour differs from IPv4, where addresses are kept when an interface is brought down. To get the same behaviour for IPv6 you can use:
$ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
$ sysctl -w net.ipv6.conf.default.keep_addr_on_down=1
9.3. Neighbour advertisement on interface up
Generate unsolicited neighbour advertisements when a device is brought up or hardware address changes.
$ sysctl -w net.ipv6.conf.all.ndisc_notify=1
$ sysctl -w net.ipv6.conf.default.ndisc_notify=1
9.4. Reverse Path Filtering
The kernel IPv4 network stack supports three RP filtering modes:
- No filtering, rp_filter=0.
- Strict mode (RFC 3704), rp_filter=1.
- Loose mode (RFC 3704), rp_filter=2.
The hardware can support all modes, but the driver does not currently implement this. In practice, this means the source IP does not affect routing decisions in hardware. Disable rp_filter in the kernel to be consistent with hardware behaviour.
$ sysctl -w net.ipv4.conf.all.rp_filter=0
$ sysctl -w net.ipv4.conf.default.rp_filter=0
9.5. IPv6 netlink events on interface down
Controls whether an RTM_DELROUTE message is generated for routes removed when a device is taken down or deleted. IPv4 does not generate this message; IPv6 does by default.
$ sysctl -w net.ipv6.route.skip_notify_on_dev_down=1
9.6. Neighbour GC thresholds
9.6.1. gc_thresh1
Minimum number of entries to keep. The garbage collector will not purge entries if there are fewer than this number. The default is 128.
$ sysctl -w net.ipv4.neigh.default.gc_thresh1=128
$ sysctl -w net.ipv6.neigh.default.gc_thresh1=128
9.6.2. gc_thresh2
Threshold at which the garbage collector becomes more aggressive about purging entries. Entries older than 5 seconds will be cleared when over this number. The default is 512.
$ sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
$ sysctl -w net.ipv6.neigh.default.gc_thresh2=2048
9.6.3. gc_thresh3
Maximum number of non-PERMANENT neighbour entries allowed. Increase this when using large numbers of interfaces and when communicating with large numbers of directly connected peers. The default is 1024.
$ sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
$ sysctl -w net.ipv6.neigh.default.gc_thresh3=4096
9.7. ICMP redirects
The kernel cannot always determine whether two hosts are in the same broadcast domain. When routing between such hosts, it may incorrectly send an ICMP redirect. To avoid this, redirects can be disabled.
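For IPv4, the kernel sends redirects on an interface if send_redirects is enabled in either the all entry or the entry for that interface, so both have to be cleared. The following sketch walks every entry under /proc/sys/net/ipv4/conf, including all and default, and writes 0 to each. The function name disable_send_redirects is made up for illustration, and the optional directory argument exists only so the loop can be tried against a scratch directory.

```shell
# Hypothetical helper: write 0 to every send_redirects entry found below
# the given configuration directory (default: the live IPv4 settings).
disable_send_redirects() {
    conf_dir="${1:-/proc/sys/net/ipv4/conf}"
    for f in "$conf_dir"/*/send_redirects; do
        # Skip entries that do not exist or cannot be written.
        if [ -w "$f" ]; then
            echo 0 > "$f"
        fi
    done
    return 0
}
```

Run `disable_send_redirects` as root after the router interfaces have been created. Interfaces created later inherit the value of the default entry, which the loop also clears.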
10. Limitations to hardware routing
Routing in hardware is subject to limitations. A frame is only eligible for routing in hardware if it meets the following requirements:
- IPv4 packets must not contain options.
- IPv6 packets must not contain hop-by-hop options.
- IPv4 header checksum must be correct.
The default configuration traps frames that fail these requirements to the CPU, leaving it up to the kernel configuration to decide how to handle them.
11. Debugging
Do not use the vcap tool to manipulate the LPM VCAP when routing is enabled. It is critical for proper functioning that the hardware state is in sync with the kernel state.
If debugfs is enabled, it is possible to inspect the hardware routing table.
$ cat /sys/kernel/debug/sparx5/vcaps/lpm_0
If a route has been hit in hardware, the sticky hit bit will be set:
...
rule: 6, addr: [18426,18426], X1, ctr[0]: 0, hit: 1
  chain_id: 6000000
  user: 5
  priority: 96
  state: permanent
  keysets: VCAP_KFS_SGL_IP4
  keyset_sw: 1
  keyset_sw_regs: 2
    DST_FLAG: W1: 1/1
    IP4_XIP: W32: 1.0.10.2/255.255.255.255
  actionset: VCAP_AFS_ARP_ENTRY
  actionset_sw: 1
  actionset_sw_regs: 3
    ARP_ENA: W1: 1
    MAC_LSB: W32: 0xba6378de
    MAC_MSB: W16: 0x90e2
    TYPE: W2: 2
...
The VCAP counters do not exist for the LPM VCAP.
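On a table with many entries it can help to list only the IPv4 entries that have been hit. The sketch below is an illustration, not part of the driver tooling. The function name lpm_hit_prefixes is made up, and it assumes that the dump prints each rule header ("rule: ..., hit: N") and each key field (such as IP4_XIP) on its own line, as in the excerpt above.

```shell
# Hypothetical helper: read an LPM VCAP dump on stdin and print the
# IP4_XIP key of every rule whose header line ends in "hit: 1".
lpm_hit_prefixes() {
    awk '
        # A new rule starts; remember whether its hit bit is set.
        $1 == "rule:" { hit = ($0 ~ /hit: 1[ \t]*$/) }
        # Within a hit rule, print the address/mask of the IPv4 key.
        hit && $1 == "IP4_XIP:" { print $3 }
    '
}
```

It can be used as `lpm_hit_prefixes < /sys/kernel/debug/sparx5/vcaps/lpm_0`, which for the entry shown above would print 1.0.10.2/255.255.255.255.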
12. Full examples
12.1. IPv4
Suppose we have two VLANs configured on the switch, each with a subnet assigned. We wish to route L3 traffic between the VLANs.
    bridge                            host0
+-------------+ pvid 9      +-----------------------+
|             | untagged ethp0 1.0.9.2              |
| br0.9       +------------>|                       |
|        eth0 |             |1.0.10.0/24 via 1.0.9.1|
| 1.0.9.1/24  |             +-----------------------+
|             |
|             |
| br0         |
|             |                       host1
|             | pvid 10     +-----------------------+
|             | untagged ethp1 1.0.10.2             |
| br0.10      +------------>|                       |
|        eth1 |             |1.0.9.0/24 via 1.0.10.1|
| 1.0.10.1/24 |             +-----------------------+
|             |
+-------------+
12.1.1. Switch configuration
Add the bridge and enable forwarding.
$ ip link add name br0 type bridge vlan_filtering 1 vlan_default_pvid 0
$ sysctl -w net.ipv4.ip_forward=1
$ sysctl -w net.ipv4.conf.all.forwarding=1
Disable ICMP redirects.
$ sysctl -w net.ipv4.conf.all.send_redirects=0
$ sysctl -w net.ipv4.conf.br0.send_redirects=0
$ sysctl -w net.ipv4.conf.eth0.send_redirects=0
$ sysctl -w net.ipv4.conf.eth1.send_redirects=0
Join the two port interfaces to the bridge, and add VLAN configuration.
$ ip link set eth0 master br0 up
$ ip link set eth1 master br0 up
$ ip link set up dev br0
$ bridge vlan add dev br0 vid 9 self
$ bridge vlan add dev br0 vid 10 self
$ bridge vlan add dev eth0 vid 9 pvid untagged
$ bridge vlan add dev eth1 vid 10 pvid untagged
For each VLAN, we add a VLAN upper interface to the bridge.
$ ip link add link br0 name br0.9 type vlan id 9
$ ip link add link br0 name br0.10 type vlan id 10
$ ip link set up dev br0.9
$ ip link set up dev br0.10
Create gateways in each VLAN, by assigning a subnet to each interface.
$ ip addr add 1.0.9.1/24 dev br0.9
$ ip addr add 1.0.10.1/24 dev br0.10
12.1.2. Host configuration
For each host, we configure an IP in the subnet, and set a gateway for the opposite subnet.
For host 0:
$ ip addr add 1.0.9.2/24 dev ethp0
$ ip link set up dev ethp0
$ ip route add 1.0.10.0/24 via 1.0.9.1 dev ethp0
For host 1:
$ ip addr add 1.0.10.2/24 dev ethp1
$ ip link set up dev ethp1
$ ip route add 1.0.9.0/24 via 1.0.10.1 dev ethp1
12.1.3. Ping
If we ping from 1.0.9.2 to 1.0.10.2, the switch kernel will initiate ARP for both hosts, install the neighbours in hardware and offload the dataplane to hardware.
host0# ping -c 3 1.0.10.2
PING 1.0.10.2 (1.0.10.2) 56(84) bytes of data.
64 bytes from 1.0.10.2: icmp_seq=1 ttl=63 time=0.975 ms
64 bytes from 1.0.10.2: icmp_seq=2 ttl=63 time=0.342 ms
64 bytes from 1.0.10.2: icmp_seq=3 ttl=63 time=0.350 ms
The switch neighbour table shows the offloaded neighbours:
$ ip nei
1.0.10.2 dev br0.10 lladdr 90:e2:ba:63:78:de offload STALE
1.0.9.2 dev br0.9 lladdr 90:e2:ba:63:78:dc offload STALE