Recapping
We’ll use our last depiction of the graph (pictured below) from the previous section as a starting point.
If we want symmetrical paths (or at least, “symmetrical enough” that they hit the same stateful nodes) we need a criteria for path selection that chooses the same edges regardless of which of the two nodes is executing the decision.
As discussed in the preamble, the reason that existing forwarding selection mechanisms (BGP ECMP in this case) don’t provide that symmetry is that their algorithms operate on hashes of characteristics of the source and destiation of the traffic, and those values are inverted for the two flows.
The problem, again.
The “problem” is that we want to enforce (pseudo-)symmetric routing in our network, while maintaining resiliency and load-sharing across multiple links and state-enforcing hops. To guarantee that the same path is selected in “both directions”, the path-selection mechanism has to act on information about both the destination and the source. That’s because the source and destination are inverted in the bi-directional flows between two endpoints. The proverbial kick in the head is that IP routing makes forwarding decisions based on destination addresses only (unless we utilize policy routing.) Generally speaking, we don’t want to use policy routing if we don’t need to, because it requires per-packet inspection of whatever fields we’re basing our policy on. Even if per-flow (rather than per-packet) classification is still possible, that classification requires inspecting the IP source-headers, which requires (or has required, historically) execution on sparse/costly generic CPU resources (rather than the preferred switching ASICs.)
With the relatively tightly-prescribed topology that we have defined, we know that there are three properties that we want to influence the path-selection.
- Path-affinity-group (PAG)
- Symmetric-perimeter-zone (SPZ)
- Symmetric-perimeter-node (SPN) We also know that there will be up to three equi-distant paths between endpoints with non-matching PAG and SPZ values. The question, then, is How can we differentiate and select from those multiple equidistant paths in a manner that is bi-directionally consistent when the paths are selected sequentially by each successive node?
The solution (in theory)
What we “care” about is that both members of a bi-directional flow-pair between two nodes traverse the same state-enforcing nodes. One thing that does differentiate our equi-distant paths is the PAG values (which ‘locations’ or ‘sites’) the SPN nodes are in. Our architecture dictates that all inter-PAG/inter-SPZ flows will traverse two SPN nodes (one the same SPZ value as one of the flow’s endpoint, and a second with the same SPZ value as the flow’s other endpoint). Of these three potential equi-distant paths:
- One has both SPN nodes in the same PAG as node-A
- One has two SPN nodes in the same PAG as node-B
- And one has an SPN node with each endpoint’s PAG value.
Bidirectional consistency in path identification
We can describe the three potential equidistant paths as:
- The one with more SPN nodes that share node-A’s PAG value
- The one with more SPN nodes that share node-B’s PAG value
- The one with an equal number of SPN nodes that share node-A and node-B’s PAG value
That description is bi-directionally consistent, but requires that each forwarding/transport node know (and agree on) node-A’s PAG value and node-B’s PAG value, as well as the number of SPN nodes in each path (and the PAG value of each of those nodes).
The first problem is agreeing on what “node-A” and “node-B” are. In one flow, node-A is the source, in the other flow, it’s the destination. Instead of describing the paths in terms of node-A and node-B’s PAG value, we could instead describe them in terms of “the higher or lower of node A/B’s PAG values. If node-A has a PAG value of “10” and node-B has a PAG value of “20”, we would end up evaluating “the higher of 10 and 20” in one direction, and “the higher of 20 and 10” in the other direction. (In both cases, it resolves to “20.”) That brings us to:
We’ll describe/differentiate our paths as:
- The one with more SPN nodes sharing the PAG value of the ‘higher’ of the two endpoints’ PAG values
- The one with more SPN nodes sharing the PAG value of the ‘lower’ of the two endpoints’ PAG values
- The one with an equal number of SPN nodes sharing the PAG value of each endpoint
NOTE
We could also base our evaluation on a comparison of node A/B’s symmetric-perimeter zone (SPZ) values, instead of path-affinity-group-values if we preferred.
Bidirectional consistency in path selection
Now that we can identify each of the potential equal-cost paths in a bidirectionally-consistent manner, we need to figure out how to decide which of those paths we want to use for any given bidirectional flow-pair. We want this decision process to:
- Be bi-directionally consistent
- Load-share (at the endpoint-pair granularity) across all of the equal-cost paths
- Provide resiliency in the event of failure on any of the three paths
To do this, we need an evaluation function that provides the same results when its two input parameters are reversed and provides a relatively even distribution across the three output values (per un-ordered endpoint pairs.)
Although it’s tempting, we can’t just XOR the IP-addresses (or PAG, or SPZ values) of the endpoints and perform a modulo-3 division on the result. That would work well for consistency and load-sharing, but would not work well for resiliency (we’d need to change our modulo arithmetic every time one of the equal-cost paths showed up). Also, it would require policy-based routing that acts on the packets’ source IP address, which we very-much want to avoid.
To avoid policy-based routing, we will try to use a model that achieves the desired results exlusively by modifying which routes are preferred by BGP (and eschewing source headers in forwarding decisions.)
NOTE
We use our observations about the distribution of PAG, SPZ, and SPN node values across the graph in our prescribed topology to safely let every forwarding/transport node use it’s own PAG value as a proxy for the source-node’s PAG value.
This is a viable approach only so long as the individual forwarding/transport nodes that have to select from among the equal-cost paths have the same PAG as the source/transmitting node of the packet that they are forwarding.
The path selection logic that solves our problem if we apply it on a per-node basis is as follows:
- Unless the destination has a different PAG and SPZ value from the current node and there are multiple shortest paths
- Use the path(s) from the current node to the destination node that has the fewest number of edges
- If there are multiple shortest paths to the destination and the destination has a different PAG value and a different SPZ value from the current node
- Select a single path, using the following preferences
- From among the shortest paths in which the next node has the same PAG depth as the current node:
- If the SPZ value of the destination node is higher than the SPZ value of the current node
- Prefer the path with the most SPN nodes whose PAG value matches the PAG value of the current node
- If the SPZ value of the destination node is lower than the SPZ value of the current node
- Prefer the path with the most SPN nodes whose PAG value matches the PAG value of the destination node
- Do not select a path in which the next node has a different PAG ‘depth’ than the current node
Why does this work?
As illustrated above, our logic (when evaluated on the path as a whole – against the actual endpoints’ PAG and SPZ values), selects path “A”. Node 14 (one endpoint) has a PAG value of 0.1.1 and SPZ value of 0.0.1, while node 22 (the other endpoint) has PAG 0.1.3 and SPZ 0.0.3. As such, we want to select the path with the most SPN nodes in PAG 0.1.3. (That’s path “a”, in the above figure)
But, when we use each individual node’s SPZ and PAG values as a proxy for the source-node’s SPZ/PAG values, we run into a potential problem. What happens on node “4” in the flow from node 22 to node 14 using path “a”? It is in a different SPZ than the originating node, so it has a different basis of comparison when comparing it’s SPZ to that of the destination node than the originating node did. Node 4 has an SPZ value of 0.0 (wherease the origin, node 22, has an SPZ value of 0.0.3. Finding the greater of “0.0 and 0.0.1” gives us a diferrent answer than finding the greater of “0.0.3 and 0.0.1”!
NOTE
That is why we included the provision to select only from paths in which the next-node has the same PAG depth than the current node (when multiple equal-length paths are available) in our path selection logic
The remaining sections of this paper detail how this generalized pattern can be implemented using BGP in a routed IP network.