Filename: 299-ip-failure-count.txt
Title: Preferring IPv4 or IPv6 based on IP Version Failure Count
Author: Neel Chauhan
Created: 25-Jan-2019
Status: Superseded
Superseded-by: 306
Ticket: https://trac.torproject.org/projects/tor/ticket/27491
1. Introduction
As IPv4 address space becomes scarce, ISPs and organizations will deploy
IPv6 in their networks. Right now, Tor clients connect to guards using
IPv4 connectivity by default.
When networks first transition to IPv6, both IPv4 and IPv6 will be enabled
on most networks in a so-called "dual-stack" configuration. This is to not
break existing IPv4-only applications while enabling IPv6 connectivity.
However, IPv6 connectivity may be unreliable and clients should be able
to connect to the guard using the most reliable technology, whether IPv4
or IPv6.
In ticket #27490, we introduced the option ClientAutoIPv6ORPort which adds
preliminary "happy eyeballs" support. If set, this lets a client randomly
choose between IPv4 or IPv6. However, this random decision does not take
into account unreliable connectivity or network failures of an IP family.
A successful Tor implementation of the happy eyeballs algorithm requires
that unreliable connectivity on IPv4 and IPv6 is taken into consideration.
This proposal describes an algorithm to take into account network failures
in the random decision used for choosing an IP family and the data fields
used by the algorithm.
2. Options To Enable The Failure Counter
To enable the failure counter, we will add flags to ClientAutoIPv6ORPort.
The new format for ClientAutoIPv6ORPort is:
ClientAutoIPv6ORPort 0|1 [flags]
The first argument is to enable the automatic selection between IPv4 and
IPv6 if it is 1. The second argument is a list of optional flags.
The only flag so far is "TrackFailures", which enables the tracking of
failures to make a better decision when selecting between IPv4 and IPv6.
The tracking of failures will be described in the rest of this proposal.
However, we should be open to more flags from future proposals as they
are written and implemented.
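As an illustration of the option format above, here is a minimal Python
sketch of how the arguments could be parsed. The function and class names
are hypothetical, and the sketch assumes that flags are separated by
whitespace and that unknown flags are rejected; the proposal does not
specify either detail.

```python
from dataclasses import dataclass

# Flags named in this proposal so far; future proposals may add more.
KNOWN_FLAGS = {"TrackFailures"}


@dataclass(frozen=True)
class AutoIPv6Option:
    enabled: bool         # first argument: 0 or 1
    track_failures: bool  # True when the TrackFailures flag is present


def parse_client_auto_ipv6_orport(value: str) -> AutoIPv6Option:
    """Parse the argument string that follows the option name."""
    parts = value.split()
    if not parts or parts[0] not in ("0", "1"):
        raise ValueError("first argument must be 0 or 1")
    flags = parts[1:]
    # Assumption: unknown flags are an error rather than silently ignored.
    unknown = [flag for flag in flags if flag not in KNOWN_FLAGS]
    if unknown:
        raise ValueError(f"unknown flags: {unknown}")
    return AutoIPv6Option(parts[0] == "1", "TrackFailures" in flags)
```

For example, the argument string "1 TrackFailures" enables both automatic
selection and failure tracking, while "1" alone keeps the purely random
choice.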
3. Failure Counter Design
I propose that the failure counter uses the following fields:
* IPv4 failure points
* IPv6 failure points
These entries will exist as internal counters for the current session, and
a calculated value from the previous session in the statefile.
These values will be stored as 32-bit unsigned integers for the current
session and in the statefile.
When a new session is loaded, we will load the failure count from the
statefile, and when a session is closed, the failure counts from the current
session will be stored in the statefile.
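The counter layout described in this section could be sketched as follows
in Python. This is only an illustration: the state file is stood in for by
a plain dict, the key names "IPv4FailurePoints" and "IPv6FailurePoints" are
hypothetical, and the sketch assumes that the value written at shutdown is
the stored value plus the current session's value.

```python
from dataclasses import dataclass

UINT32_MAX = 2**32 - 1  # counters are 32-bit unsigned integers


def _clamp_u32(value: int) -> int:
    """Keep a counter inside the 32-bit unsigned range."""
    return max(0, min(value, UINT32_MAX))


@dataclass
class FailureCounters:
    stored_ipv4: int = 0   # loaded from the state file
    stored_ipv6: int = 0
    session_ipv4: int = 0  # accumulated during the current session
    session_ipv6: int = 0

    @classmethod
    def load(cls, state: dict) -> "FailureCounters":
        """Start a session from the (hypothetical) state file entries."""
        return cls(
            stored_ipv4=_clamp_u32(int(state.get("IPv4FailurePoints", 0))),
            stored_ipv6=_clamp_u32(int(state.get("IPv6FailurePoints", 0))),
        )

    def save(self) -> dict:
        """Produce the entries to write when the session closes."""
        # Assumption: the stored value is the sum of old and new points.
        return {
            "IPv4FailurePoints": _clamp_u32(self.stored_ipv4 + self.session_ipv4),
            "IPv6FailurePoints": _clamp_u32(self.stored_ipv6 + self.session_ipv6),
        }
```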
4. Failure Probability Calculation
The failure count of one IP version will increase the probability of the
other IP version. For instance, a failure of IPv4 will increase the IPv6
probability, and vice versa.
When the IP version is being chosen, I propose that these values will be
included in the guard selection code:
* IPv4 failure points
* IPv6 failure points
* Total failure points
These values will be stored as 32-bit unsigned integers.
A generic failure of an IP version will add one point to the failure point
count values of the particular IP version which failed.
A failure of an IP version from a "no route" error which happens when
connections automatically fail will be counted as two failure points
for the automatically failed version.
The failure points for each of IPv4 and IPv6 are the sum of the value in the
state file and the current session's failure value. The total failure points
is the sum of the IPv4 and IPv6 failure points, and is updated whenever the
failure point count of an IP version is updated.
The probability of a particular IP version is the failure points of the
other version divided by the total number of failure points, multiplied
by 4 and stored as an integer. We will call this value the summarized
failure point value (SFPV). The reason for this summarization is to
emulate a probability in 1/4 intervals by the random number generator.
In the random number generator, we will choose a random number between 0
and 4. If the random number is less than the IPv6 SFPV, we will choose
IPv4. If it is equal or greater, we will choose IPv6.
If the probability is 0/4 with a SFPV value of 0, it will be rounded to
1/4 with a SFPV of 1. Also, if the probability is 4/4 with a SFPV of 4,
it will be rounded to 3/4 with a SFPV of 3. The reason for this is to
accommodate mobile clients which could change networks at any time (e.g.
WiFi to cellular) which may be more or less reliable in terms of a
particular IP family when compared to the previous network of the client.
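The calculation in this section can be sketched in Python as follows. This
is one possible reading, not a definitive implementation: it assumes that
"stored as an integer" means truncating division, that the random draw is
one of the four values 0 to 3, and that the value compared against the draw
is the number of quarters of probability assigned to IPv4 (that is, the
value derived from the IPv6 failure points). The function names are
hypothetical.

```python
import random
from typing import Optional


def sfpv(other_fp: int, total_fp: int) -> int:
    """Quarters of probability for one IP version, kept within 1..3.

    other_fp is the failure point count of the *other* IP version.
    """
    if total_fp <= 0:
        raise ValueError("no failure points; use the initial calculation")
    quarters = (4 * other_fp) // total_fp  # assumption: truncate
    # Never let either family reach 0/4 or 4/4.
    return max(1, min(quarters, 3))


def choose_family(ipv4_fp: int, ipv6_fp: int,
                  draw: Optional[int] = None) -> str:
    """Pick "IPv4" or "IPv6"; more failures on one side favor the other."""
    ipv4_quarters = sfpv(ipv6_fp, ipv4_fp + ipv6_fp)
    if draw is None:
        draw = random.randrange(4)  # assumption: uniform over 0, 1, 2, 3
    return "IPv4" if draw < ipv4_quarters else "IPv6"
```

With 6 IPv4 failure points and 2 IPv6 failure points, the total is 8, so
IPv4 receives 4 * 2 / 8 = 1 quarter and is chosen only when the draw is 0.
With 0 IPv4 and 5 IPv6 failure points, IPv4 would receive 4 quarters, which
is rounded down to 3 so that IPv6 is still tried occasionally.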
5. Initial Failure Point Calculation
When a client starts without failure points, or if a failure point (FP) value
drops to 0, we need an SFPV to start with. The initial SFPV will be calculated
based on whether the client is using a bridge or not and whether the relays in
the bridge configuration or consensus have IPv6.
For clients connecting directly to Tor, we will:
* During Bootstrap: use the number of IPv4 and IPv6 capable fallback
directory mirrors during bootstrap.
* After the initial consensus is received: use the number of IPv4 and
IPv6 capable guards in the consensus.
The reason why the consensus will be used to calculate the initial failure
point value is because using the number of guards would bias the SFPV value
with whatever's dominant on the network rather than what works on the
client.
For clients connecting through bridges, we will use the number of bridges
configured and the IP versions supported.
The initial value of the failure points in the scenarios described in this
section would be:
* IPv4 Failure Points: Count the number of IPv6-capable relays
* IPv6 Failure Points: Count the number of IPv4-capable relays
If the consensus or bridge configuration changes during a session, we should
not update the failure point counters to generate a SFPV.
If we are starting a new session, we should use the existing failure points
to generate a SFPV unless the counts for IPv4 or IPv6 are zero.
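The initial values described in this section could be computed as in the
following Python sketch. The Relay type is a hypothetical stand-in for
whichever list applies (fallback directory mirrors, guards in the consensus,
or configured bridges); it is not a structure from the Tor source.

```python
from typing import Iterable, NamedTuple, Tuple


class Relay(NamedTuple):
    has_ipv4: bool  # relay advertises an IPv4 ORPort
    has_ipv6: bool  # relay advertises an IPv6 ORPort


def initial_failure_points(relays: Iterable[Relay]) -> Tuple[int, int]:
    """Return (ipv4_fp, ipv6_fp) seeded from relay capabilities.

    Each family's failure points are the count of relays reachable over
    the *other* family, so the more common family starts out preferred.
    """
    relays = list(relays)
    ipv4_fp = sum(1 for relay in relays if relay.has_ipv6)
    ipv6_fp = sum(1 for relay in relays if relay.has_ipv4)
    return ipv4_fp, ipv6_fp
```

For a list of three IPv4-only relays and one dual-stack relay, this gives
1 IPv4 failure point and 4 IPv6 failure points, so IPv4 starts out as the
more likely choice.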
6. Forgetting Old Sessions
We should be able to forget old failures as clients could change networks.
For instance, a mobile phone could switch between WiFi and cellular. Keeping
an exact failure history would have privacy implications, so we should store
an approximate history.
One way we could forget old sessions is by halving all the failure point
(FP) values before adding when:
* One or more failure point values are a multiple of a random number
between 1 and 5
* One or more failure point values are greater than or equal to 100
The reason for halving the values at regular intervals is to forget old
sessions while keeping an approximate history. We halve all FP values so
that one IP version doesn't dominate the failure count if the other
is halved. This keeps an approximate scale of the failures on a client.
The reason for halving at a multiple of a random number instead of a fixed
interval is so we can halve regularly while not making it too predictable.
This prevents a situation where we would be halving too often to keep an
approximate failure history.
If we halve, we first halve all FP values and then add the failure points for
the failed IP version, so that the new failure is still counted. If halving is
not done, we will just add the failure points.
If the FP value for one IP version goes down to zero, we will re-calculate
the SFPV for that version using the methods described in Section 5.
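The halving rule in this section, combined with the one-point and two-point
failure weights from Section 4, might look like the Python sketch below. It
is a sketch under stated assumptions: halving uses integer division, the
random number is drawn once per recorded failure, and a counter that is
currently zero is not treated as a "multiple" (otherwise a zero counter
would force halving every time). The names are hypothetical.

```python
import random
from typing import Optional, Tuple

GENERIC_FAILURE = 1    # points for an ordinary failure
NO_ROUTE_FAILURE = 2   # points for a "no route" failure
HALVE_THRESHOLD = 100  # halve once any counter reaches this value


def should_halve(ipv4_fp: int, ipv6_fp: int, divisor: int) -> bool:
    """Decide whether all counters are halved before adding new points."""
    values = (ipv4_fp, ipv6_fp)
    if any(value >= HALVE_THRESHOLD for value in values):
        return True
    # Assumption: zero counters are skipped in the multiple-of check.
    return any(value != 0 and value % divisor == 0 for value in values)


def record_failure(ipv4_fp: int, ipv6_fp: int, failed: str,
                   points: int = GENERIC_FAILURE,
                   divisor: Optional[int] = None) -> Tuple[int, int]:
    """Return the updated (ipv4_fp, ipv6_fp) after one failure."""
    if divisor is None:
        divisor = random.randint(1, 5)  # random number between 1 and 5
    if should_halve(ipv4_fp, ipv6_fp, divisor):
        # Halve both so that neither family dominates after forgetting.
        ipv4_fp //= 2
        ipv6_fp //= 2
    if failed == "IPv4":
        ipv4_fp += points
    elif failed == "IPv6":
        ipv6_fp += points
    else:
        raise ValueError("failed must be 'IPv4' or 'IPv6'")
    return ipv4_fp, ipv6_fp
```

Note that when the random number is 1, every nonzero counter is a multiple
of it, so under this reading halving happens on roughly one failure in five
at a minimum.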
7. Separate Concurrent Connection Limits
Right now, there is a limit of three concurrent connections from a client
at any given time. This limit includes both IPv4 and IPv6 connections.
This is to prevent denial of service attacks. I propose that a separate
connection limit is used for IPv4 and IPv6. This means we can have three
concurrent IPv4 connections and three concurrent IPv6 connections at the
same time.
Having separate connection limits allows us to deal with networks dropping
packets for a particular IP family while still preventing potential denial
of service attacks.
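A per-family limit of this kind could be tracked as in the following Python
sketch. The class and its methods are hypothetical and only illustrate the
bookkeeping; they do not correspond to existing Tor code.

```python
MAX_CONCURRENT_PER_FAMILY = 3  # proposed limit for each IP family


class ConnectionLimiter:
    """Track concurrent connection attempts separately per IP family."""

    def __init__(self) -> None:
        self._active = {"IPv4": 0, "IPv6": 0}

    def try_open(self, family: str) -> bool:
        """Reserve a slot; return False if the family's limit is reached."""
        if self._active[family] >= MAX_CONCURRENT_PER_FAMILY:
            return False
        self._active[family] += 1
        return True

    def close(self, family: str) -> None:
        """Release a slot when a connection finishes or fails."""
        if self._active[family] > 0:
            self._active[family] -= 1
```

With this layout, three stalled IPv4 attempts do not prevent the client from
opening IPv6 connections, while the total is still bounded at six.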
8. Pathbias and Failure Probability
If ClientAutoIPv6ORPort is in use, and pathbias is triggered, we should
ignore "no route" warnings. The reason for this is that we would already be
adding two failure points for the failed IP version, as described in Section 4
of this proposal. Adding two failure points makes us more likely to prefer the
competing IP family over the failed one than adding a single failure
point for a normal failure.
9. Counting Successful Connections
If a connection to a particular IP version is successful, we should use
it. This ensures that clients have a reliable connection to Tor. Accounting
for successful connections can be done by adding one failure point to the
competing IP version of the successful connection. For instance, if we have
a successful IPv6 connection, we add one IPv4 failure point.
Why use failure points for successful connections? This reduces the need for
separate counters for successes and allows for code reuse. Why add to the
competing version's failure point? Similar to how we should prefer IPv4 if
IPv6 fails, we should also prefer IPv4 if it is successful. We should also
prefer IPv6 if it is successful.
Even on adding successes, we will still halve the failure counters as
described in Section 6.
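The success accounting in this section amounts to the small Python sketch
below. The function name is hypothetical, and the halving step is left out
here to keep the example short.

```python
from typing import Tuple


def record_success(ipv4_fp: int, ipv6_fp: int,
                   succeeded: str) -> Tuple[int, int]:
    """Return updated (ipv4_fp, ipv6_fp) after a successful connection.

    A success is recorded as one failure point for the competing family,
    which raises the probability of the family that just worked.
    """
    if succeeded == "IPv6":
        return ipv4_fp + 1, ipv6_fp
    if succeeded == "IPv4":
        return ipv4_fp, ipv6_fp + 1
    raise ValueError("succeeded must be 'IPv4' or 'IPv6'")
```

For instance, starting from 2 IPv4 and 5 IPv6 failure points, a successful
IPv6 connection yields 3 IPv4 and 5 IPv6 failure points.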
10. Acknowledgements
Thank you teor for aiding me with the implementation of Happy Eyeballs in
Tor. This would not have been possible if it weren't for you.