Filename: 329-traffic-splitting.txt
Title: Overcoming Tor's Bottlenecks with Traffic Splitting
Author: David Goulet, Mike Perry
Created: 2020-11-25
Status: Finished

0. Status

  This proposal describes the Conflux [CONFLUX] system developed by
  Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian Goldberg. It aims to
  improve Tor client network performance by dynamically splitting
  traffic between two circuits. We have made several additional improvements
  to the original Conflux design, by making use of congestion control
  information, as well as updates from the Multipath TCP literature.


1. Overview

1.1. Multipath TCP Design Space

  In order to understand our improvements to Conflux, it is important to
  properly conceptualize what is involved in the design of multipath
  algorithms in general.

  The design space is broken into two orthogonal parts: congestion control
  algorithms that apply to each path, and traffic scheduling algorithms
  that decide which packets to send on each path.

  MPTCP specifies 'coupled' congestion control (see [COUPLED]). Coupled
  congestion control updates single-path congestion control algorithms to
  account for shared bottlenecks between the paths, so that the combined
  congestion control algorithms do not overwhelm any bottlenecks that
  happen to be shared between the multiple paths. Various ways of
  accomplishing this have been proposed and implemented in the Linux
  kernel.

  Because Tor's congestion control only concerns itself with bottlenecks in
  Tor relay queues, and not with any other bottlenecks (such as
  intermediate Internet routers), we can avoid this complexity merely by
  specifying that any paths that are constructed SHOULD NOT share any
  relays (except for the exit). This assumption is valid, because non-relay
  bottlenecks are managed by the TCP of the client-to-relay and
  relay-to-relay OR connections, and not by Tor's circuit-level congestion
  control. In this way,
  we can proceed to use the exact same congestion control as specified in
  [PROP324], for each path.

  For this reason, this proposal will focus on protocol specification, and
  the traffic scheduling algorithms, rather than coupling. Note that the
  scheduling algorithms are currently in flux, and will be subject to
  change as we tune them in Shadow, on the live network, and for future
  UDP implementation (see [PROP339]). This proposal will be kept up to
  date with the current implementation.

1.2. Divergence from the initial Conflux design

  The initial [CONFLUX] paper doesn't provide any indication of how to
  handle the size of the out-of-order cell queue, which we consider a
  potentially dangerous memory DoS vector (see [MEMORY_DOS]). It also used
  RTT as the sole heuristic for selecting which circuit to send on (which
  may vary depending on the geographical locations of the participant
  relays), without considering their actual available circuit capacity
  (which will be available to us via Proposal 324). Additionally, since
  the publication of [CONFLUX], more modern packet scheduling algorithms
  have been developed, which aim to reduce out-of-order queue size.

  We propose mitigations for these issues using modern scheduling
  algorithms, as well as implementation options for avoiding the
  out-of-order queue at Exit relays. Additionally, we consider resumption,
  side channel, and traffic analysis risks and benefits in [RESUMPTION],
  [SIDE_CHANNELS] and [TRAFFIC_ANALYSIS].

1.3. Design Overview

  The following section describes the Conflux design.

  The circuit construction is as follows:

         Primary Circuit (lower RTT)
            +-------+      +--------+
            |Guard 1|----->|Middle 1|----------+
            +---^---+      +--------+          |
   +-----+      |                           +--v---+
   | OP  +------+                           | Exit |--> ...
   +-----+      |                           +--^---+
            +---v---+      +--------+          |
            |Guard 2|----->|Middle 2|----------+
            +-------+      +--------+
         Secondary Circuit (higher RTT)

  Both circuits are built using current Tor path selection; however, they
  SHOULD NOT share the same Guard or Middle relay. By avoiding
  using the same relays in these positions in the path, we ensure
  additional path capacity, and eliminate the need to use more complicated
  'coupled' congestion control algorithms from the MPTCP
  literature[COUPLED].  This both simplifies design, and improves
  performance.

  Then, the OP needs to link the two circuits together, as described in
  [CONFLUX_HANDSHAKE].

  For ease of explanation, the primary circuit is the circuit that is
  more desirable to use, as per the scheduling algorithm, and the secondary
  circuit is used after the primary is blocked by congestion control. Note
  that for some algorithms, this selection becomes fuzzy, but all of them
  favor the circuit with lower RTT, at the beginning of transmission.

  Note also that this notion of primary vs secondary is a local property
  of the current sender: each endpoint may have different notions of
  primary, secondary, and current sending circuit. They also may use
  different scheduling algorithms to determine this.

  Initial RTT is measured during circuit linking, as described in
  [CONFLUX_HANDSHAKE]. After the initial link, RTT is continually measured
  using SENDME timing, as in Proposal 324. This means that during use,
  the primary circuit and secondary circuit may switch roles, depending on
  unrelated network congestion caused by other Tor clients.

  We also support linking onion service circuits together. In this case,
  only two rendezvous circuits are linked. Each of these RP circuits will
  be constructed separately, and then linked. However, the same path
  constraints apply to each half of the circuits (no shared relays between
  the legs). If, by chance, the service and the client sides end up
  sharing some relays, this is not catastrophic. Multipath TCP researchers
  we have consulted (see [ACKNOWLEDGMENTS]), believe Tor's congestion
  control from Proposal 324 to be sufficient in this rare case.

  In the algorithms we recommend here, only two circuits will be linked
  together at a time.  However, implementations SHOULD support more than
  two paths, as this has been shown to assist in traffic analysis
  resistance[WTF_SPLIT], and will also be useful for maintaining a desired
  target RTT, for UDP VoIP applications.

  If the number of circuits exceeds the current number of guard relays,
  guard relays MAY be re-used, but implementations SHOULD use the same
  number of Guards as paths.

  Linked circuits MUST NOT be extended further once linked (ie:
  'cannibalization' is not supported).


2. Protocol Mechanics

2.1. Advertising support for conflux

2.1.1. Relay

  We propose a new protocol version in order to advertise support for
  circuit linking on the relay side:

     "Conflux=1" -- Relay supports Conflux as in linking circuits together using
                    the new LINK, LINKED and SWITCH relay command.

2.1.2. Onion Service

  We propose to add a new line in order to advertise conflux support in the
  encrypted section of the onion service descriptor:

    "conflux" SP max-num-circ SP desired-ux NL

      The "max-num-circ" value indicate the maximum number of rendezvous
      circuits that are allowed to be linked together.

  We let the service specify the conflux algorithm to use when sending data
  to the service. Some services may prefer latency, whereas others may prefer
  throughput. However, clients also have the ability to request their own UX
  for data that the service sends, in the LINK handshake below, in part
  because the high-throughput algorithms will require more out-of-order queue
  memory, which may be infeasible on mobile.
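
  For illustration, a service that allows up to two linked rendezvous
  circuits and expresses no UX preference could publish a line such as the
  following (example values only; we assume here that the desired-ux field
  carries the numeric UX values defined in [CONFLUX_HANDSHAKE] below):

    "conflux 2 0"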

  The next section describes how the circuits are linked together.

2.2. Conflux Handshake [CONFLUX_HANDSHAKE]

  To link circuits, we propose new relay commands that are sent on both
  circuits, as well as a response to confirm the join, and an ack of this
  response. These commands form a 3-way handshake, which allows each
  endpoint to measure the initial RTT of each leg upon link, without
  needing to wait for any data.

  All three stages of this handshake are sent on *each* circuit leg to be
  linked.

  When packed cells are a reality (proposal 340), these cells SHOULD be
  combined with the initial RELAY_BEGIN cell on the faster circuit leg.
  This combination also allows better enforcement against side channels.
  (See [SIDE_CHANNELS]).

  There are other ways to do this linking that we have considered, but they
  seem not to be significantly better than this method, especially since we can
  use Proposal 340 to eliminate the RTT cost of this setup before sending data.
  For those other ideas, see [ALTERNATIVE_LINKING] and [ALTERNATIVE_RTT], in
  the appendix.

  The first two parts of the handshake establish the link, and enable
  resumption:

    19 -- RELAY_CONFLUX_LINK

          Sent from the OP to the exit/service in order to link circuits
          together at the end point.

    20 -- RELAY_CONFLUX_LINKED

          Sent from the exit/service to the OP, to confirm the circuits were
          linked.

  The contents of these two cells are exactly the same. They are as
  follows:

    VERSION   [1 byte]
    PAYLOAD   [variable, up to end of relay payload]

  The VERSION tells us which circuit linking mechanism to use. At this
  point in time, only 0x01 is recognized and is the one described by the
  Conflux design.

  For version 0x01, the PAYLOAD contains:

     NONCE              [32 bytes]
     LAST_SEQNO_SENT    [8 bytes]
     LAST_SEQNO_RECV    [8 bytes]
     DESIRED_UX         [1 byte]

  The NONCE contains a random 256-bit secret, used to associate the two
  circuits together. The nonce MUST NOT be shared outside of the circuit
  transmission, or data may be injected into TCP streams. This means it
  MUST NOT be logged to disk.

  The two sequence number fields are 0 upon initial link, but non-zero in
  the case of a reattach or resumption attempt (See [CONFLUX_SET_MANAGEMENT]
  and [RESUMPTION]).

  The DESIRED_UX field allows the endpoint to request the UX properties
  it wants. The other endpoint SHOULD select the best known scheduling
  algorithm, for these properties. The endpoints do not need to agree
  on which UX style they prefer.

  The UX properties are:

    0 - NO_OPINION
    1 - MIN_LATENCY
    2 - LOW_MEM_LATENCY
    3 - HIGH_THROUGHPUT
    4 - LOW_MEM_THROUGHPUT
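
  As a non-normative illustration, the VERSION byte and version 0x01
  PAYLOAD described above could be serialized as in the following sketch
  (the function name is ours, and big-endian integer encoding is assumed):

    import os
    import struct

    def encode_conflux_link_body(last_seqno_sent, last_seqno_recv,
                                 desired_ux, nonce=None):
        # VERSION(1) | NONCE(32) | LAST_SEQNO_SENT(8) |
        # LAST_SEQNO_RECV(8) | DESIRED_UX(1), integers big-endian.
        if nonce is None:
            nonce = os.urandom(32)  # fresh random 256-bit link secret
        return struct.pack("!B32sQQB", 0x01, nonce,
                           last_seqno_sent, last_seqno_recv, desired_ux)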

  The algorithm choice is performed by the *sender* of data (ie: the
  receiver of the PAYLOAD). The receiver of data (sender of the PAYLOAD)
  does not need to be aware of the exact algorithm in use, but MAY enforce
  expected properties (particularly low queue usage, in the case of requesting
  either LOW_MEM_LATENCY or LOW_MEM_THROUGHPUT). The receiver MAY close the
  entire conflux set if these properties are violated.

  If either circuit does not receive a RELAY_CONFLUX_LINKED response, both
  circuits MUST be closed.

  The third stage of the handshake exists to help the exit/service measure
  initial RTT, for use in [SCHEDULING]:

    21 -- RELAY_CONFLUX_LINKED_ACK

          Sent from the OP to the exit/service, to provide initial RTT
          measurement for the exit/service.

  These three relay commands are sent on *each* leg, to allow each endpoint to
  measure the initial RTT of each leg.

  The client SHOULD abandon and close the circuit if the LINKED message takes too
  long to arrive. This timeout MUST be no larger than the normal SOCKS/stream
  timeout in use for RELAY_BEGIN, but MAY be the Circuit Build Timeout value,
  instead. (The C-Tor implementation currently uses Circuit Build Timeout).

  See [SIDE_CHANNELS] for rules for when to reject unexpected handshake cells.

2.3. Linking Circuits from OP to Exit [LINKING_EXIT]

  To link exit circuits, two circuits to the same exit are built, with
  additional restrictions such that they do not share Guard or Middle
  relays. This restriction applies via specific relay identity keys,
  and not IP addresses, families, or networks. (This is because the purpose
  of it is to avoid sharing a bottleneck *inside* relay circuit queues;
  bottlenecks caused by a shared network are handled by TCP's congestion
  control on the OR conns).

  Bridges are also subject to the same constraint as Guard relays;
  the C-Tor codebase emits a warning if only one bridge is configured,
  unless that bridge has transport "snowflake". Snowflake is exempt from
  this Guard restriction because it is actually backed by many bridges. In
  the bridge case, we only warn, and do not refuse to build conflux
  circuits, because sharing a Bridge is not catastrophic; it is merely
  sub-optimal for performance and congestion.

  When each circuit is opened, we ensure that congestion control
  has been negotiated. If congestion control negotiation has failed, the
  circuit MUST be closed. After this, the linking handshake begins.

  The RTT times between RELAY_CONFLUX_LINK and RELAY_CONFLUX_LINKED are
  measured by the client, to determine primary vs secondary circuit use,
  and for packet scheduling. Similarly, the exit measures the RTT times
  between RELAY_CONFLUX_LINKED and RELAY_CONFLUX_LINKED_ACK, for the same
  purpose.

  Because of the race between initial data and the RELAY_CONFLUX_LINKED_ACK
  cell, conditions can arise where an Exit needs to send data before the
  slowest circuit delivers this ACK. In these cases, it should prefer sending
  data on the circuit that has delivered the ACK (which will arrive immediately
  prior to any data from the client). This circuit will be the lower RTT
  circuit anyway, but the code needs to handle the fact that in this case,
  there won't yet be an RTT for the second circuit.
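
  A minimal sketch of this exit-side selection, assuming that legs expose
  'linked_ack_received' and 'rtt' attributes (names are illustrative, not
  from the C-Tor codebase), and that unmeasured legs count as infinite RTT
  per [CONFLUX_SET_MANAGEMENT]:

    import math

    def initial_sending_leg(legs):
        # Prefer legs whose LINKED_ACK has already arrived; among those,
        # pick the lowest measured RTT (unmeasured legs sort as infinity).
        acked = [leg for leg in legs if leg.linked_ack_received]
        candidates = acked if acked else legs

        def effective_rtt(leg):
            return leg.rtt if leg.rtt is not None else math.inf

        return min(candidates, key=effective_rtt)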

2.4. Linking circuits to an onion service [LINKING_SERVICE]

  For onion services, we will only concern ourselves with linking
  rendezvous circuits.

  To join rendezvous circuits, clients make two introduce requests to a
  service's intropoint, causing it to create two rendezvous circuits, to
  meet the client at two separate rendezvous points. These introduce
  requests MUST be sent to the same intropoint (due to potential use of
  onionbalance), and SHOULD be sent back-to-back on the same intro
  circuit. They MAY be combined with Proposal 340. (Note that if we do not
  use Prop340, we will have to raise the limit on the number of intros per
  client circuit to 2 here, at intropoints).

  When rendezvous circuits are built, they should use the same Guard,
  Bridge, and Middle restrictions as specified in 2.3, for Exits. These
  restrictions SHOULD also take into consideration all Middles in the path,
  including the rendezvous point. All relay identities should be unique
  (again, except for when the Snowflake transport is in use). The one
  special case here is if the rendezvous points chosen by a client
  are the same as the service's guards. In this case, the service SHOULD
  NOT use different guards, but instead stick with the guards it has.
  The reason for this is that we do not want to create the ability
  for a client to force a service to use different guards.

  The first rendezvous circuit to get joined SHOULD use Proposal 340 to
  append the RELAY_BEGIN command, and the service MUST answer on this
  circuit, until RTT can be measured.

  Once both circuits are linked and RTT is measured, packet scheduling
  MUST be used, as per [SCHEDULING].

2.5. Conflux Set Management [CONFLUX_SET_MANAGEMENT]

  When managing legs, it is useful to separate sets that have completed the
  link handshake from legs that are still performing the handshake. Linked
  sets MAY have additional unlinked legs on the way, but these should not
  be used for sending data until the handshake is complete.

  It is also useful to enforce various additional conditions on the handshake,
  depending on if [RESUMPTION] is supported, and if a leg has been launched
  because of an early failure, or due to a desire for replacement.

2.5.1. Pre-Building Sets

  In C-Tor, conflux is only used via circuit prebuilding. Pre-built conflux
  sets are preferred over other pre-built circuits, but if the pre-built pool
  ends up empty, normal pre-built circuits are used. If those run out, regular
  non-conflux circuits are built. In other words, in C-Tor, conflux sets are
  never built on-demand, but this is strictly an implementation decision, to
  simplify dealing with the C-Tor codebase.

  The consensus parameter 'cfx_max_prebuilt_set' specifies the number of
  sets to pre-build.

  During upgrade, the consensus parameter 'cfx_low_exit_threshold' will be
  used, so that if there is a low number of conflux-supporting exits, only
  one conflux set will be built.

2.5.2. Set construction

  When a set is launched, legs begin the handshake in the unlinked state.
  As handshakes complete, finalization is attempted, to create a linked set.
  On the client, this finalization happens upon receipt of the LINKED cell.
  On the exit/service, this finalization happens upon *sending* the LINKED_ACK.

  The initiator of this handshake considers the set fully linked once the
  RELAY_CONFLUX_LINKED_ACK is sent (roughly upon receipt of the LINKED cell).
  Because of the potential race between LINKED_ACK, and initial data sent by
  the client, the receiver of the handshake must consider a leg linked at
  the time of *sending* a LINKED_ACK cell.

  This means that exit legs may not have an RTT measurement, if data on the
  faster leg beats the LINKED_ACK on the slower leg. The implementation MUST
  account for this, by treating unmeasured legs as having infinite RTT.

  When attempting to finalize a set, this finalization should not complete
  if any unlinked legs are still pending.

2.5.3. Closing circuits

  For circuits that are unlinked, the origin SHOULD immediately relaunch a new
  leg when it is closed, subject to the limits in [SIDE_CHANNELS].

  In C-Tor, we do not support arbitrary resumption. Therefore, we perform
  some additional checks upon closing circuits, to decide if we should
  immediately tear down the entire set:
     - If the closed leg was the current sending leg, close the set
     - If the closed leg had the highest non-zero last_seq_recv/sent, close the set
     - If data was in progress on a closed leg (inflight > cc_sendme_inc), then
       all legs must be closed

2.5.4. Reattaching Legs

  While C-Tor does not support arbitrary resumption, new legs *can* be
  attached, so long as there is no risk of data loss from a closed leg.
  This enables latency probing, which will be important for UDP VoIP.

  Currently, the C-Tor codebase checks for data loss by verifying that
  the LINK/LINKED cell has a lower last_seq_sent than all current
  legs' maximum last_seq_recv, and a lower last_seq_recv than all
  current legs' last_seq_sent.

  This check is performed at finalization, not upon receipt of the first
  handshake cell. This gives the data additional time to arrive.
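
  A sketch of this check, mirroring the description above (attribute names
  are illustrative; the maximum across legs stands in for "all current
  legs" here):

    def reattach_is_safe(link_cell, legs):
        # No data loss: nothing the peer reports having sent exceeds what
        # some current leg has received, and nothing the peer reports
        # having received exceeds what some current leg has sent.
        max_recv = max(leg.last_seq_recv for leg in legs)
        max_sent = max(leg.last_seq_sent for leg in legs)
        return (link_cell.last_seq_sent <= max_recv and
                link_cell.last_seq_recv <= max_sent)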

2.6. Congestion Control Application [CONGESTION_CONTROL]

  The SENDMEs for congestion control are performed per-leg. As soon as
  data arrives, regardless of its ordering, it is counted towards SENDME
  delivery. In this way, 'cwnd - inflight' of each leg always reflects
  the available data to send on each leg. This is important for
  [SCHEDULING].

  Congestion control's stream XON/XOFF cells can be sent on either leg, and
  apply to the stream's transmission on both legs.

  In C-Tor, streams used to become blocked as soon as the OR conn
  of their circuit was blocked. Because conflux can send on the other
  circuit, which uses a different OR conn, this form of stream blocking
  has been decoupled from the OR conn status, and only happens when
  congestion control has decided that all circuits are blocked (congestion
  control becomes blocked when either 'cwnd - inflight <= 0', *or* when
  the local OR conn is blocked, so if all local OR conns of a set are
  blocked, then the stream will block that way).

  Note also that because congestion control only covers RELAY_COMMAND_DATA
  cells, for all algorithms, a special case must be made such that if no
  circuit is available to send on due to congestion control blocking,
  commands other than RELAY_COMMAND_DATA MUST be sent on the current
  circuit, even if the cell scheduler believes that no circuit is available.
  Depending on the code structure of Arti, this special case may or may
  not be necessary. It arises in C-Tor because nothing can block the
  sending of arbitrary non-DATA relay command cells.
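
  A sketch of this special case in the sender's dispatch path (the
  'conflux_set' and 'cell' fields here are assumptions, not C-Tor names):

    RELAY_COMMAND_DATA = 2  # relay command value from tor-spec

    def choose_circuit_for_cell(conflux_set, cell):
        # Only DATA cells are subject to congestion control; everything
        # else goes out on the current circuit even if the scheduler
        # reports that all legs are blocked.
        if cell.relay_command != RELAY_COMMAND_DATA:
            return conflux_set.current_circuit
        # May return None, meaning: stop reading on edge connections.
        return conflux_set.scheduler.pick_unblocked_leg()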

2.7. Sequencing [SEQUENCING]

  With multiple paths for data, the problem of data re-ordering appears.
  In other words, cells can arrive out of order across the two circuits,
  such that cell N+1 arrives before cell N.

  Handling this reordering operates after congestion control for each
  circuit leg, but before relay cell command processing or stream data
  delivery.

  For the receiver to be able to reorder the received cells, a sequencing
  scheme needs to be implemented. However, because Tor does not drop or
  reorder packets inside of a circuit, this sequence number can be very
  small. It only has to signal that a cell comes after those arriving on
  another circuit.

  To achieve this, we propose a new relay command used to indicate a switch to
  another leg:

    22 -- RELAY_CONFLUX_SWITCH

          Sent from a sending endpoint when switching legs in an
          already linked circuit construction. This message is sent on the leg
          that will be used for new traffic, and tells the receiver the size of
          the gap since the last data (if any) sent on that leg.

  The cell payload format is:

    SeqNum  [4 bytes]

  The "SeqNum" value is a relative sequence number, which is the difference
  between the last absolute sequence number sent on the new leg and the last
  absolute sequence number sent on all other legs prior to the switch. In this
  way, the endpoint knows what to increment its local absolute sequence number
  by, before cells start to arrive.

  To achieve this, the sender must maintain the last absolute sequence sent for
  each leg, and the receiver must maintain the last absolute sequence number
  received for each leg.

  As an example, let's say we send 10 cells on the first leg, so our absolute
  sequence number is 10. If we then switch to the second leg, it is trivial to
  see that we should send a SWITCH with 10 as the relative sequence number, to
  indicate that regardless of the order in which the first cells are received,
  subsequent cells on the second leg should start counting at 10.

  However, if we then send 21 cells on this leg, our local absolute sequence
  number as the sender is 31. So when we switch back to the first leg, where
  the last absolute sequence sent was 10, we must send a SWITCH cell with 21,
  so that when the first leg receives subsequent cells, it assigns those cells
  an absolute sequence number starting at 31.

  In the rare event that we send more than 2^31 cells (~1TB) on a single leg,
  the leg should be switched in order to reset that relative sequence number to
  fit within 4 bytes.
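
  The sender-side bookkeeping behind these examples can be sketched as
  follows (non-normative; field and method names are ours):

    class ConfluxSender:
        def __init__(self, legs):
            self.absolute_seq = 0    # multiplexed cells sent, all legs
            self.last_sent_on = {leg: 0 for leg in legs}
            self.current_leg = legs[0]

        def switch_to(self, leg):
            # Relative SeqNum = cells sent elsewhere since this leg last
            # carried data; this goes into the RELAY_CONFLUX_SWITCH cell.
            relative = self.absolute_seq - self.last_sent_on[leg]
            self.current_leg = leg
            return relative

        def record_data_cell(self):
            self.absolute_seq += 1
            self.last_sent_on[self.current_leg] = self.absolute_seq

  Running the example above through this sketch: after 10 cells on the
  first leg, switching to the second leg returns 10; after 21 more cells
  on the second leg, switching back to the first leg returns 21, as
  expected.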

  For a discussion of rules to rate limit the usage of SWITCH as a side
  channel, see [SIDE_CHANNELS].

2.8. Resumption [RESUMPTION]

  In the event that a circuit leg is destroyed, it MAY be resumed.
  Full resumption is not supported in C-Tor, but is possible to implement,
  at the expense of always storing roughly a congestion window of
  already-transmitted data on each endpoint, in the worst case. Simpler
  forms of resumption, where there is no data loss, are supported. This
  is important to support latency probing, for ensuring UDP VoIP minimum
  RTT requirements are met (roughly 300-500ms, depending on VoIP
  implementation).

  Resumption is achieved by re-using the NONCE to the same endpoint
  (either [LINKING_EXIT] or [LINKING_SERVICE]). The resumed path need
  not use the same middle and guard relays as the destroyed leg(s), but
  SHOULD NOT share any relays with any existing leg(s).

  If data loss has been detected upon a link handshake, resumption can be
  achieved by sending a switch cell, which is immediately followed by the
  missing data. Roughly, each endpoint must check:
    - if cell.last_seq_recv <
         min(max(legs.last_seq_sent),max(closed_legs.last_seq_sent)):
      - send a switch cell immediately with missing data:
        (last_seq_sent - cell.last_seq_recv)

  If an endpoint does not have this missing data due to memory pressure,
  that endpoint MUST destroy *both* legs, as this represents unrecoverable
  data loss.

  Re-transmitters MUST NOT re-increment their absolute sent fields
  while re-transmitting.

  It is even possible to resume conflux circuits where both legs have been
  collapsed using this scheme, if endpoints continue to buffer their
  unacked package_window data for some time after this close. However, see
  [TRAFFIC_ANALYSIS] for more details on the full scope of this issue.

  If endpoints are buffering package_window data, such data should be
  given priority to be freed in any oomkiller invocation. See [MEMORY_DOS]
  for more oomkiller information.

2.9. Data transmission

  Most cells in Tor are circuit-specific, and should only be sent on a
  circuit, even if that circuit is part of a conflux set. Cells that
  are not multiplexed do not count towards the conflux sequence numbers.

  However, in addition to the obvious RELAY_COMMAND_DATA, a subset of cells
  MUST ALSO be multiplexed, so that their ordering is preserved when they
  arrive at the other end. These cells do count towards conflux sequence
  numbers, and are handled in the out-of-order queue, to preserve ordered
  delivery:
    RELAY_COMMAND_BEGIN
    RELAY_COMMAND_DATA
    RELAY_COMMAND_END
    RELAY_COMMAND_CONNECTED
    RELAY_COMMAND_RESOLVE
    RELAY_COMMAND_RESOLVED
    RELAY_COMMAND_XOFF
    RELAY_COMMAND_XON

  Currently, this set is the same as the set of cells that have a stream ID,
  but the property that leads to this requirement is not stream usage by
  itself, it is that these cells must be ordered with respect to all data
  on the circuit. It is not impossible that future relay commands could be
  invented that don't have stream IDs, but yet must still arrive in order
  with respect to circuit data cells. Prop#253 is one possible example of
  such a thing (though we won't be implementing that proposal).


3. Traffic Scheduling [SCHEDULING]

  In order to load balance the traffic between the two circuits, the
  original conflux paper used only RTT. However, with Proposal 324, we
  will have accurate information on the instantaneous available bandwidth
  of each circuit leg, as 'cwnd - inflight' (see Section 3 of
  Proposal 324). We also have the TCP block state of the local OR
  connection.

  We specify two traffic schedulers from the multipath literature and
  adapt them to Tor: [MINRTT_TOR], and [LOWRTT_TOR]. Additionally,
  we create low-memory variants of these that aim to minimize the
  out-of-order queue size at the receiving endpoint.

  Additionally, see the [TRAFFIC_ANALYSIS] sections of this proposal for
  important details on how this selection can be changed, to reduce
  website traffic fingerprinting.

3.1. MinRTT scheduling [MINRTT_TOR]

  This scheduling algorithm is used for the MIN_LATENCY user experience.

  It works by always and only sending on the circuit with the current minimum
  RTT. With this algorithm, conflux should effectively stay on the circuit with
  the lowest initial RTT, unless that circuit's RTT rises above the RTT of the
  other circuit (due to relay load or congestion). When the circuit's congestion
  window is full (ie: cwnd - inflight <= 0), or if the local OR conn blocks,
  the conflux set stops transmitting and stops reading on edge connections,
  rather than switch.

  This should result in low out-of-order queues in most situations, unless
  the initial RTTs of the two circuits are very close (basically within the
  Vegas RTT bounds of queue variance, 'alpha' and 'beta').
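
  A minimal sketch of this scheduler (attribute names are illustrative):

    def minrtt_pick_leg(legs):
        # Send only on the lowest-RTT leg; if it cannot send, return None
        # and the caller stops reading on edge connections rather than
        # switching to the other leg.
        best = min(legs, key=lambda leg: leg.rtt)
        can_send = (best.cwnd - best.inflight > 0
                    and not best.orconn_blocked)
        return best if can_send else None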

3.2. LowRTT Scheduling [LOWRTT_TOR]

  This scheduling algorithm is based on [MPTCP]'s LowRTT scheduler. This
  algorithm is used for the UX choice of HIGH_THROUGHPUT.

  In this algorithm, endpoints send cells on the circuit with lowest RTT that
  has an unblocked local OR connection, and room in its congestion window (ie:
  cwnd - inflight > 0). We stop reading on edge connections only when both
  congestion windows become full, or when both local OR connections are blocked.

  In this way, unlike original conflux, we switch to the secondary circuit
  without causing congestion either locally, or on either circuit. This
  improves both load times, and overall throughput. Given a large enough
  transmission, both circuits are used to their full capacity,
  simultaneously.
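
  A corresponding sketch, using the same illustrative attribute names as
  the MinRTT sketch above:

    def lowrtt_pick_leg(legs):
        # Among legs with congestion window space and an unblocked local
        # OR connection, pick the lowest RTT. None means every leg is
        # blocked, and edge reading stops.
        usable = [leg for leg in legs
                  if leg.cwnd - leg.inflight > 0
                  and not leg.orconn_blocked]
        return min(usable, key=lambda leg: leg.rtt) if usable else None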

3.3. MinRTT Low-Memory Scheduling [MINRTT_LOWMEM_TOR]

  The low memory version of the MinRTT scheduler ensures that we do not
  perform a switch more often than once per congestion window worth of data.

  XXX: Other rate limiting, such as not switching unless the RTT changes by
  more than X%, may be useful here.

3.4. BLEST Scheduling [BLEST_TOR]

  XXX: Something like this might be useful for minimizing OOQ for the UX
  choice of LOW_MEM_THROUGHPUT, but we might just be able to reduce switching
  frequency instead.

  XXX: We want an algorithm that only uses cwnd instead. This algorithm
  has issues if the primary cwnd grows while the secondary does not.
  Expect this section to change.

  [BLEST] attempts to predict the availability of the primary circuit, and
  use this information to reorder transmitted data, to minimize
  head-of-line blocking in the recipient (and thus minimize out-of-order
  queues there).

  BLEST_TOR uses the primary circuit until the congestion window is full.
  Then, it uses the relative RTT times of the two circuits to calculate
  how much data can be sent on the secondary circuit faster than if we
  just waited for the primary circuit to become available.

  This is achieved by computing two variables at the sender:

    rtts = secondary.currRTT / primary.currRTT
    primary_limit = (primary.cwnd + (rtts-1)/2)*rtts

  Note: This (rtts-1)/2 factor represents anticipated congestion window
  growth over this period. It may be different for Tor, depending on the
  congestion control algorithm.

  If primary_limit < secondary.cwnd - (secondary.package_window + 1), then
  there is enough space on the secondary circuit to send data faster than
  we could by waiting for the primary circuit to become available.

  XXX: Note that BLEST uses total_send_window where we use secondary.cwnd
  in this check. total_send_window is min(recv_win, CWND). But since Tor
  does not use receive windows and instead uses stream XON/XOFF, we only
  use CWND. There is some concern this may alter BLEST's buffer
  minimization properties, but since receive windows only matter if
  the application is slower than Tor, and XON/XOFF will cover that case,
  hopefully this is fine. If we need to, we could turn [REORDER_SIGNALING]
  into a receive window indication of some kind, to indicate remaining
  buffer size.

  Otherwise, if the primary_limit condition is not hit, cease reading on
  source edge connections until SENDME acks come back.

  Here is the pseudocode for this:

    while source.has_data_to_send():
      # package_window is the count of unacked (inflight) cells, so this
      # checks for room in the primary circuit's congestion window.
      if primary.cwnd > primary.package_window:
        primary.send(source.get_packet())
        continue

      # Primary is full: estimate whether the secondary can deliver this
      # data sooner than waiting for the primary's SENDME acks.
      rtts = secondary.currRTT / primary.currRTT
      primary_limit = (primary.cwnd + (rtts-1)/2)*rtts

      if primary_limit < secondary.cwnd - (secondary.package_window+1):
        secondary.send(source.get_packet())
      else:
        break # done for now, wait for SENDME to free up CWND and restart

  Note that BLEST also has a parameter lambda that is updated whenever HoL
  blocking occurs. Because it is expensive and takes significant time to
  signal this over Tor, we omit this.


4. Security Considerations

4.1. Memory Denial of Service [MEMORY_DOS]

  Both reorder queues and retransmit buffers inherently represent a memory
  denial of service condition.

  For [RESUMPTION] retransmit buffers, endpoints that support this feature
  SHOULD free retransmit information as soon as they get close to memory
  pressure. This prevents resumption while data is in flight, but will not
  otherwise harm operation.

  In terms of adversarial issues, clients can lie about sequence numbers,
  sending cells with sequence numbers such that the next expected sequence
  number is never sent.  They can do this repeatedly on many circuits, to
  exhaust memory at exits.  Intermediate relays may also block a leg, allowing
  cells to traverse only one leg, thus still accumulating at the reorder queue.

  In C-Tor we will mitigate this in three ways: via the OOM killer, by the
  ability for exits to request that clients use the LOW_MEM_LATENCY UX
  behavior, and by rate limiting the frequency of switching under the
  LOW_MEM_LATENCY UX style.

  When a relay is under memory pressure, the circuit OOM killer SHOULD free
  and close circuits with the oldest reorder queue data, first. This heuristic
  was shown to be best during the [SNIPER] attack OOM killer iteration cycle.
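
  A sketch of this eviction order (the per-cell 'enqueued_at' timestamp is
  an assumed bookkeeping field, not an existing C-Tor structure):

    def oom_eviction_order(circuits):
        # Circuits whose reorder queues hold the oldest data are freed
        # and closed first.
        queued = [c for c in circuits if c.reorder_queue]
        return sorted(queued,
                      key=lambda c: min(cell.enqueued_at
                                        for cell in c.reorder_queue))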

  The rate limiting under LOW_MEM_LATENCY will be heuristic-driven, based
  on data from Shadow simulations and live network testing. It is possible
  that other algorithms can be similarly rate limited.

4.2. Protocol Side Channels [SIDE_CHANNELS]

  To understand the decisions we make below with respect to handling
  potential side channels, it is important to understand a bit of the history
  of the Tor threat model.

  Tor's original threat model completely disregarded all traffic analysis,
  including protocol side channels, assuming that they were all equally
  effective, and that diversity of relays was what provided protection.
  Numerous attack papers have proven this to be an over-generalization.

  Protocol side channels are most severe when a circuit is known to be silent,
  because stateful protocol behavior prevents other normal cells from ever being
  sent. In these cases, it is trivial to inject a packet count pattern that has
  zero false positives. These kinds of side channels are made use of in the
  Guard discovery literature, such as [ONION_FOUND], and [DROPMARK]. It is even
  more trivial to manipulate the AES-CTR cipherstream, as per [RACCOON23], until
  we implement [PROP308].

  However, because we do not want to make this problem worse, it is extremely
  important to be mindful of ways that an adversary can inject new cell
  commands, as well as ways that the adversary can spawn new circuits
  arbitrarily.

  It is also important, though slightly less so, to be mindful of the uniqueness
  of new handshakes, as handshakes can be used to classify usage (such as via
  Onion Service Circuit Fingerprinting). Handshake side channels are only
  weakly defended, via padding machines for onion services. These padding
  machines will need to be improved, and this is also scheduled for arti.

  Finally, usage-based traffic analysis needs to be considered. This includes
  things like website traffic fingerprinting, and is covered in
  [TRAFFIC_ANALYSIS].

4.2.1. Cell Injection Side Channel Mitigations

  To avoid [DROPMARK] attacks, several checks must be performed, depending
  on the cell type. The circuit MUST be closed if any of these checks fail.

  RELAY_CONFLUX_LINK:
    - Ensure conflux is enabled
    - Ensure the circuit is an Exit (or Service Rend) circuit
    - Ensure that no previous LINK cell has arrived on this circuit

  RELAY_CONFLUX_LINKED:
    - Ensure conflux is enabled
    - Ensure the circuit is client-side
    - Ensure this is an unlinked circuit that sent a LINK command
    - Ensure that the nonce matches the nonce used in the LINK command
    - Ensure that the cell came from the expected hop

  RELAY_CONFLUX_LINKED_ACK:
    - Ensure conflux is enabled
    - Ensure that this circuit is not client-side
    - Ensure that the circuit has successfully received its LINK cell
    - Ensure that this circuit has not received a LINKED_ACK yet

  RELAY_CONFLUX_SWITCH:
    - If Prop#340 is in use, this cell MUST be packed with a valid
      multiplexed RELAY_COMMAND cell.
    - XXX: Additional rate limiting per algorithm, after tuning.

4.2.2. Guard Discovery Side Channel Mitigations

  In order to mitigate potential guard discovery by malicious exits,
  clients MUST NOT retry failed unlinked circuit legs for a set more than
  'cfx_max_unlinked_leg_retry' times.

4.2.3. Usage-Based Side Channel Discussion

  After we have solved all of the zero false positive protocol side
  channels in Tor, our attention can turn to more subtle, usage-based
  side channels.

  Two potential usage side channels may be introduced by the use of Conflux:
     1. Delay-based side channels, by manipulating switching
     2. Location info leaks through the use of both legs' latencies

  To perform delay-based side channels, Exits can simply disregard the RTT
  or cwnd when deciding to switch legs, thus introducing a pattern of gaps that
  the Guard node can detect. Guard relays can also delay legs to introduce a
  pattern into the delivery of cells at the exit relay, by varying the latency
  of SENDME cells (every 31st cell) to change the distribution of traffic to
  send information. This attack could be performed in either direction of
  traffic, to bias traffic load off of a particular Guard. If an adversary
  controls both Guards, it could in theory send a binary signal, by
  alternating delays on each.

  However, Tor currently provides no defenses against already existing
  single-circuit delay-based (or stop-and-start) side channels. It is already
  the case that on a single circuit, either the Guard or the Exit can simply
  withhold sending traffic, as per a recognizable pattern. This class of
  attacks, and a possible defense for them, is discussed in [BACKLIT].

  However, circuit padding can also help to obscure these side channels,
  even if tuned for website fingerprinting. See [TRAFFIC_ANALYSIS] for more
  details there.

  The second class of side channel is where the Exit relay may be able to
  use the two legs to further infer more information about client
  location. See [LATENCY_LEAK] for more details. It is unclear at this
  time how much more severe this is for two paths than just one.

  We preserve the ability to disable conflux to and from Exit relays
  using consensus parameters, if these side channels prove more severe,
  or if it proves possible to mitigate single-circuit side
  channels, but not conflux side channels.

4.3. Traffic analysis [TRAFFIC_ANALYSIS]

  Even though conflux shows benefits against traffic analysis in
  [WTF_SPLIT], these gains may be moot if the adversary is able to perform
  packet counting and timing analysis at guards to guess which specific
  circuits are linked. In particular, the 3-way handshake in
  [CONFLUX_HANDSHAKE] may be quite noticeable.

  Additionally, the conflux handshake may make onion services stand out
  more, regardless of the number of stages in the handshake. For this
  reason, it may be wise to simply address these issues with circuit
  padding machines during circuit setup (see padding-spec.txt).

  Additional traffic analysis considerations arise when combining conflux
  with padding, for purposes of mitigating traffic fingerprinting. For
  this, it seems wise to treat the packet schedulers as another piece of a
  combined optimization problem in tandem with optimizing padding
  machines, perhaps introducing randomness or fudge factors into their
  scheduling, as a parameterized distribution. For details, see
  https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md

  Finally, conflux may exacerbate forms of confirmation-based traffic
  analysis that close circuits to determine concretely if they were in
  use, since closing either leg might cause resumption to fail. TCP RST
  injection can perform this attack on the side, without surveillance
  capability. [RESUMPTION] with buffering of the inflight unacked
  package_window data, for retransmit, is a partial mitigation, if
  endpoints buffer this data for retransmission for a brief time even if
  both legs close. This buffering seems more feasible for onion services,
  which are more vulnerable to this attack. However, if the adversary
  controls the client and is attacking the service in this way, they
  will notice the resumption re-link at their client, and still obtain
  confirmation that way.

  It seems the only way to fully mitigate these kinds of attacks is with
  the Snowflake pluggable transport, which provides its own resumption and
  retransmit behavior. Additionally, Snowflake's use of UDP DTLS also
  protects against TCP RST injection, which we suspect to be the main
  vector for such attacks.

  In the future, a DTLS or QUIC transport for Tor such as masque could
  provide similar RST injection resistance, and resumption at Guard/Bridge
  nodes, as well.

5. Consensus Parameters [CONSENSUS]

  - cfx_enabled
    - Values: 0=off, 1=on
    - Description: Emergency off switch, in case major issues are discovered.

  - cfx_low_exit_threshold
    - Range: 0-10000
    - Description: Fraction out of 10000 that represents the fraction of
      exits that must support protover 5. If the fraction is below this
      amount, the number of pre-built sets is restricted to 1.

  - cfx_max_linked_set
    - Range: 0-255
    - Description: The total number of linked sets that can be created. 255
      means "unlimited".

  - cfx_max_prebuilt_set
    - Range: 0-255
    - Description: The maximum number of pre-built conflux sets to make.
      This value is overridden by the 'cfx_low_exit_threshold' criteria.

  - cfx_max_unlinked_leg_retry
    - Range: 0-255
    - Description: The maximum number of times to retry an unlinked leg that
      fails during build or link, to mitigate guard discovery attacks.

  - cfx_num_legs_set
    - Range: 0-255
    - Description: The number of legs to link in a set.

  - cfx_send_pct
    - XXX: Experimental tuning parameter. Subject to change/removal.

  - cfx_drain_pct
    - XXX: Experimental tuning parameter. Subject to change/removal.


6. Tuning Experiments [EXPERIMENTS]

  - conflux_sched & conflux_exits
    - Exit reorder queue size
    - Responsiveness vs throughput tradeoff?
  - Congestion control
  - EWMA and KIST
  - num guards & conflux_circs


Appendix A [ALTERNATIVES]

A.1. Alternative Link Handshake [ALTERNATIVE_LINKING]

  The circuit linking in [CONFLUX_HANDSHAKE] could be done as encrypted
  ntor onionskin extension fields, similar to those used by v3 onions.

  This approach has at least four problems:
    i). For onion services, since onionskins traverse the intro circuit
        and return on the rend circuit, this handshake cannot measure
        RTT there.
   ii). Since these onionskins are larger, and have no PFS, an adversary
        at the middle relay knows that the onionskin is for linking, and
        can potentially try to obtain the onionskin key for attacks on
        the link.
  iii). It makes linking circuits more fragile, since they could timeout
        due to CBT, or other issues during construction.
   iv). The overhead in processing this onionskin in onionskin queues
        adds additional time for linking, even in the Exit case, making
        that RTT potentially noisy.

  Additionally, it is not clear that this approach actually saves us
  anything in terms of setup time, because we can optimize away the
  linking phase using Proposal 340, to combine initial RELAY_BEGIN cells
  with RELAY_CONFLUX_LINK.

A.2. Alternative RTT measurement [ALTERNATIVE_RTT]

  Instead of measuring RTTs during [CONFLUX_HANDSHAKE], we could create
  PING/PONG cells, whose sole purpose is to allow endpoints to measure
  RTT.

  This was rejected for several reasons. First, during circuit use, we
  already have SENDMEs to measure RTT. Every 100 cells (or
  'cc_sendme_inc' from Proposal 324), we are able to re-measure RTT based
  on the time between that Nth cell and the SENDME ack. So we only need
  PING/PONG to measure initial circuit RTT.

  If we were able to use onionskins, as per [ALTERNATIVE_LINKING] above,
  we might be able to specify a PING/PONG/PING handshake solely for
  measuring initial RTT, especially for onion service circuits.

  The reason for not making a dedicated PING/PONG for this purpose is that
  it is context-free. Even if we were able to use onionskins for linking
  and resumption, to avoid additional data in the handshake that just measures
  RTT, we would have to enforce that this PING/PONG/PING only follows the
  exact form needed by this proposal, at the expected time, and at no
  other points.

  If we do not enforce this specific use of PING/PONG/PING, it becomes
  another potential side channel, for use in attacks such as [DROPMARK].

  In general, Tor is planning to remove current forms of context-free and
  semantic-free cells from its protocol:
  https://gitlab.torproject.org/tpo/core/torspec/-/issues/39

  We should not add more.


Appendix B: Acknowledgments [ACKNOWLEDGMENTS]

  Thanks to Per Hurtig for helping us with the framing of the MPTCP
  problem space.

  Thanks to Simone Ferlin for clarifications on the [BLEST] paper, and for
  pointing us at the Linux kernel implementation.

  Extreme thanks goes again to Toke Høiland-Jørgensen, who helped
  immensely towards our understanding of how the BLEST condition relates
  to edge connection pushback, and for clearing up many other
  misconceptions we had.

  Finally, thanks to Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian
  Goldberg, for the original [CONFLUX] paper!


References:

[CONFLUX]
   https://freehaven.net/anonbib/papers/pets2013/paper_65.pdf

[BLEST]
  https://olivier.mehani.name/publications/2016ferlin_blest_blocking_estimation_mptcp_scheduler.pdf
  https://opus.lib.uts.edu.au/bitstream/10453/140571/2/08636963.pdf
  https://github.com/multipath-tcp/mptcp/blob/mptcp_v0.95/net/mptcp/mptcp_blest.c

[WTF_SPLIT]
   https://www.comsys.rwth-aachen.de/fileadmin/papers/2020/2020-delacadena-trafficsliver.pdf

[COUPLED]
   https://datatracker.ietf.org/doc/html/rfc6356
   https://www.researchgate.net/profile/Xiaoming_Fu2/publication/230888515_Delay-based_Congestion_Control_for_Multipath_TCP/links/54abb13f0cf2ce2df668ee4e.pdf?disableCoverPage=true
   http://staff.ustc.edu.cn/~kpxue/paper/ToN-wwj-2020.04.pdf
   https://www.thinkmind.org/articles/icn_2019_2_10_30024.pdf
   https://arxiv.org/pdf/1308.3119.pdf

[BACKLIT]
   https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf

[LATENCY_LEAK]
   https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf
   https://www.robgjansen.com/publications/howlow-pets2013.pdf

[SNIPER]
   https://www.freehaven.net/anonbib/cache/sniper14.pdf

[DROPMARK]
   https://www.petsymposium.org/2018/files/papers/issue2/popets-2018-0011.pdf

[RACCOON23]
   https://archives.seul.org/or/dev/Mar-2012/msg00019.html

[ONION_FOUND]
   https://www.researchgate.net/publication/356421302_From_Onion_Not_Found_to_Guard_Discovery/fulltext/619be24907be5f31b7ac194a/From-Onion-Not-Found-to-Guard-Discovery.pdf

[VANGUARDS_ADDON]
  https://github.com/mikeperry-tor/vanguards/blob/master/README_TECHNICAL.md

[PROP324]
  https://gitlab.torproject.org/tpo/core/torspec/-/blob/main/proposals/324-rtt-congestion-control.txt

[PROP339]
  https://gitlab.torproject.org/tpo/core/torspec/-/blob/main/proposals/339-udp-over-tor.md

[PROP308]
  https://gitlab.torproject.org/tpo/core/torspec/-/blob/main/proposals/308-counter-galois-onion.txt