Internet-Draft | NVO3 Encapsulation Considerations | February 2024 |
Boutros & Eastlake | Expires 22 August 2024 | [Page] |
The IETF Network Virtualization Overlays (NVO3) Working Group developed considerations for a common encapsulation that addresses various network virtualization overlay technical concerns. This document provides a record, for the benefit of the IETF community, of the considerations arrived at starting from the output of an NVO3 encapsulation design team. These considerations may be helpful with future deliberations by working groups over the choice of encapsulation formats.¶
There are implications of having different encapsulations in real environments consisting of both software and hardware implementations and within and spanning multiple data centers. For example, OAM functions such as path MTU discovery become challenging with multiple encapsulations along the data path.¶
Based on these considerations, the Working Group determined that Geneve with a few modifications as the common encapsulation. This document provides more details, particularly in Section 7.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 22 August 2024.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The NVO3 Working Group is chartered to gather requirements and develop solutions for network virtualization data planes based on encapsulation of virtual network traffic over an IP-based underlay data plane. Requirements include due consideration for OAM and security. Based on these requirements the WG was to select, extend, and/or develop one or more data plane encapsulation format(s).¶
This led to WG drafts and an RFC describing three encapsulations as follows:¶
Discussion on the list and in face-to-face meetings identified a number of technical problems with each of these encapsulations. Furthermore, there was clear consensus at the 96th IETF meeting in Berlin that, to maximize interoperability, the working group should progress only one data plane encapsulation. In order to overcome a deadlock on the encapsulation decision, the WG consensus was to form a Design Team [RFC2418] to resolve this issue and provide initial considerations.¶
The Design Team was to select one of the proposed encapsulations and enhance it to address the technical concerns. The simple evolution of deployed networks as well as applicability to all locations in the NVO3 architecture were goals. The Design Team was to specifically avoid selecting a design that is burdensome on hardware implementations but should allow future extensibility. The selected design also needed to operate well with ICMP and in Equal Cost Multi-Path (ECMP) environments. If further extensibility is required, then it should be done in such a manner that it does not require the consent of an entity outside of the IETF.¶
The output of the Design Team was then prcoessed through the working group resulting in working group consensus for this document.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The following abbreviations and acronyms are used in this document:¶
The following subsections describe issues with current encapsulations as discussed by the NVO3 WG. Numerous extensions and options have been designed for GUE and Geneve which may help resolve some of these issues but have not yet been validated by the WG.¶
Also included are diagrams and information on the candidate encapsulations. These are mostly copied from other documents. Since each protocol is assumed to be sent over UDP, an initial UDP Header is shown which would be preceded by an IPv4 or IPv6 Header.¶
The Geneve packet format, taken from [RFC8926], is shown in Figure 1 below.¶
The type of payload being carried is indicated by an Ethertype [RFC7042] in the Protocol Type field in the Geneve Header; Ethernet itself is represented by Ethertype 0x6558. See [RFC8926] for details concerning UDP header fields. The O bit indicates an OAM packet. The C bit is the "Critical" bit which means that the options must be processed or the packet discarded.¶
Issues with Geneve [RFC8926] are as follows:¶
Selection of Geneve despite these issues may be the result of the Geneve design effort assuming that the Geneve header would typically be delivered to a server and parsed in software.¶
The type of payload being carried is indicated by an IANA Internet protocol number in the Proto/ctype field. The C bit indicates a Control packet.¶
Issues with GUE [ietf_intarea_gue] are as follows:¶
The type of payload being carried is indicated by the Next Protocol field using a VXLAN-GPE-specific registry. The I bit indicates that the VNI is valid. The P bit indicates that the Next Protocol field is valid. The B bit indicates the packet is an ingress replicated Broadcast, Unknown Unicast, or Multicast packet. The O bit indicates an OAM packet.¶
Issues with VXLAN-GPE [nvo3_vxlan_gpe] are as follows:¶
Appendix A includes a detailed comparison between the three proposed encapsulations. The comparison indicates several common properties but also three major differences among the encapsulations:¶
Non-vendor specific extensions, such as TLVs, MUST follow the standardization process. The following use cases for extensions show that there is a strong requirement to support variable length extensions with possible different subtypes.¶
In several scenarios it is beneficial to make information about the path a packet took through the network or through a network device as well as associated telemetry information available to the operator.¶
This includes not only tasks like debugging, troubleshooting, and network planning and optimization but also policy or service level agreement compliance checks.¶
Packet scheduling algorithms, especially for balancing traffic across equal cost paths or links, often leverage information contained within the packet, such as protocol number, IP address, or MAC address. Probe packets would thus either need to be sent between the exact same endpoints with the exact same parameters, or probe packets would need to be artificially constructed as "fake" packets and inserted along the path. Both approaches are often not feasible from an operational perspective because access to the end-system is not feasible or the diversity of parameters and associated probe packets to be created is simply too large. An extension providing an in-band telemetry mechanism [RFC9197] is an alternative in those cases.¶
Since the currently proposed NVO3 encapsulations do not protect their headers, a single bit corruption in the VNI field could deliver a packet to the wrong tenant. Extension headers are needed to use any sophisticated security.¶
The possibility of VNI spoofing with an NVO3 protocol is exacerbated by using UDP. Systems typically have no restrictions on applications being able to send to any UDP port so an unprivileged application can trivially spoof VXLAN [RFC7348] packets for instance, including using arbitrary VNIs.¶
One can envision support of an HMAC-like Message Authentication Code (MAC) [RFC2104] in an NVO3 extension to authenticate the header and the outer IP addresses, thereby preventing attackers from injecting packets with spoofed VNIs.¶
Another aspect of security is payload security. Essentially this makes packets that look like the following:¶
IP|UDP|NVO3 Encap|DTLS/IPsec-ESP Extension|payload.¶
This is desirable since we still have the UDP header for ECMP, the NVO3 header is in plain text so it can be read by network elements, and different security or other payload transforms can be supported on a single UDP port (we don't need a separate UDP port for DTLS/IPsec [RFC9147]/[RFC6071]).¶
Another use case would be to carry the Group Based Policy (GBP) source group information within a NVO3 header extension in a similar manner as has been implemented for VXLAN [VXLANgroup]. This allows various forms of policy such as access control and QoS to be applied between abstract groups rather than coupled to specific endpoint addresses.¶
Hardware restrictions should be taken into consideration along with future hardware enhancements that may provide more flexible metadata processing. However, the set of options that need to and will be implemented in hardware will be a subset of what is implemented in software, since software NVEs are likely to grow features, and hence option support, at a more rapid rate.¶
It is hard to predict which options will be implemented in which piece of hardware and when. That depends on whether the hardware will be in the form of¶
A result of this is that it doesn't look useful to prescribe some order of the options so that the ones that are likely to be implemented in hardware come first; we can't decide such an order when we define the options, however a control plane can enforce such an order for some hardware implementation.¶
We do know that hardware needs to initially be able to efficiently skip over the NVO3 header to find the inner payload. That is needed both for NICs implementing various TCP offload mechanisms and for transit devices and NVEs applying policy or ACLs to the inner payload.¶
Extension header length has a significant impact on hardware and software implementations. A maximum total header length that is too small will unnecessarily constrain software flexibility. A maximum total header length that is too large will place a nontrivial cost on hardware implementations. Thus, the DT recommends that there be a minimum and maximum total available extension header length specified. The maximum total header length is determined by the size of the bit field allocated for the total extension header length field. The risk with this approach is that it may be difficult to extend the total header size in the future. The minimum total header length is determined by a requirement in the specifications that all implementations must meet. The risk with this approach is that all implementations will only implement support for the minimum total header length which would then become the de facto maximum total header length.¶
The recommended minimum total available header length is 64 bytes.¶
The size of an extension header should always be 4 byte aligned.¶
The maximum length of a single option should be large enough to meet the different extension use case requirements, e.g., in-band telemetry and future use.¶
To support hardware nodes at the target NVE or at a transit device that can process one or a few extension headers in TCAM, a control plane in such a deployment can signal a capability to ensure a specific extension header will always appear in a specific order, for example the first one in the packet.¶
The order of the extension headers should be hardware friendly for both the sender and the receiver and possibly some transit devices also. This may requre that the extension headers and their order be dynamically determined based on the hardware of those devices.¶
Transit devices don't participate in control plane communication between the end points and are not required to process the extension headers; however, if they do, they may need to process only a small subset of the extension headers that will be consumed by target NVEs.¶
If there is a well-known initial set of options that are likely to be implemented in software and in hardware, it can be efficient to use the bit fields approach to indicate the presence of extensions as in GUE. However, as described in section 6.3, if options are added over time and different subsets of options are likely to be implemented in different pieces of hardware, then it would be hard for the IETF to specify which options should get the early bit fields. TLVs are a lot more flexible, which avoids the need to determine the relative importance of different options. However, general TLVs of arbitrary order, size, and repetition are difficult to implement in hardware. A middle ground is to use TLVs with restrictions on their size and alignment, observing that individual TLVs can have a fixed length, and to support via the control plane a method such that an NVE will only receive options that it needs and implements. The control plane approach can potentially be used to control the order of the TLVs sent to a particular NVE. Note that transit devices are not likely to participate in the control plane; hence, to the extent that they need to participate in option processing, some other method must be used. Transit devices would have issues with future GUE bit fields being defined for future options as well.¶
A benefit of TLVs from a hardware perspective is that they are self describing, i.e., all the information is in the TLV. In a bit field approach, the hardware needs to look up the bit to determine the length of the data associated with the bit through some separate table, which would add hardware complexity.¶
There are use cases where multiple modules of software are running on an NVE. These can be modules such as a diagnostic module by one vendor that does packet sampling and another module from a different vendor that implements a firewall. Using a TLV format, it is easier to have different software modules process different TLVs, which could be standard extensions or vendor specific extensions defined by the different vendors, without conflicting with each other. This can help with hardware modularity as well. There are some implementations with options that allows different software modules, like MAC learning and security, to process different options.¶
Given that we want to allow considerable flexibility and extensibility, e.g., for software NVEs, yet be able to support important extensions in less flexible contexts such as hardware NVEs, it is useful to consider the control plane. By control plane in this section we mean both protocols, such as EVPN [RFC8365] and others, and deployment specific configuration.¶
If each NVE can express in the control plane that it only supports certain extensions (which could be a single extension, or a few), and the source NVEs only include supported extensions in the NVO3 packets, then the target NVE can both use a simpler parser (e.g., a TCAM might be usable to look for a single NVO3 extension) and the depth of the inner payload in the NVO3 packet will be minimized. Furthermore, if the target NVE cares about a few extensions and can express in the control plane the desired order of those extensions in the NVO3 packets, then the deployment can provide useful functionality with simplified hardware requirements for the target NVE.¶
Transit devices that are not aware of the NVO3 extensions somewhat benefit from such an approach, since the inner payload is less deep in the packet if no extraneous extension headers are included in the packet. In general, a transit device is not likely to participate in the NVO3 control plane. However, configuration mechanisms can take into account limitations of the transit devices used in particular deployments.¶
Note that with this approach different NVEs could desire different extensions or sets of extensions, which means that the source NVE needs to be able to place different sets of extensions in different NVO3 packets, and perhaps in different order. It also assumes that underlay multicast or replication servers are not used together with NVO3 extension headers.¶
There is a need to consider mandatory extensions versus optional extensions. Mandatory extensions require the receiver to drop the packet if the extension is unknown. A control plane mechanism can prevent the need for dropping unknown extensions, since they would not be included to target NVEs that do not support them.¶
The control planes defined today need to add the ability to describe the different encapsulations. Thus, perhaps EVPN [RFC8365] and any other control plane protocol that the IETF defines should have a way to indicate the supported NVO3 extensions and their order, for each of the encapsulations supported.¶
Developing a separate draft on guidance for option processing and control plane participation should be considered. This should provide examples/guidance on range of usage models and deployments scenarios for specific options and ordering that are relevant for that specific deployment. This includes end points and middle boxes using the options. Having the control plane negotiate the constraints is the most appropriate and flexible way to address these requirements.¶
If there is a need for hosts to send and receive options in a split NVE case [RFC8394], this is possible using any of the existing extensible encapsulations (Geneve, GUE, GPE+NSH) by defining a way to carry those over other transports. NSH can already be used over different transports.¶
If this is needed with other encapsulations it can be done by defining an Ethertype so that it can be carried over Ethernet and [IEEE802.1Q].¶
If there is a need to carry other encapsulations over MPLS, it would require an EVPN control plane to signal that other encapsulation header + options will be present in front of the L2 packet. The VNI can be ignored in the header, and the MPLS label will be the one used to identify the EVPN L2 instance.¶
Whether we should make the VNI 32-bits or larger was one of the topics considered. The benefit of a 24-bit VNI would be to avoid unnecessary changes with existing proposals and implementations that are almost all, if not all, using 24-bit VNI. If we need a larger VNI, perhaps for a telemetry case, an extension can be used to support that.¶
The Design Team (DT) reported that Geneve was most suitable as a starting point for a proposed standard for network virtualization, for the following reasons given below. This conclusion was supported by the NVO3 Working Group.¶
There seems to be interest in standardizing some well-known secure option TLVs to secure the header and payload to guarantee encapsulation header integrity and tenant data privacy. The working group should consider standardizing such option(s).¶
The following enhancements to Geneve are recommended to make it more suitable to hardware and yet provide flexibility for software:¶
The authors would like to thank Tom Herbert for providing the motivation for the Security/Integrity extension, and for his valuable comments, T. Sridhar for his valuable comments and feedback, Anoop Ghanwani for his extensive comments, and Ignas Bagdonas.¶
This document does not introduce any additional security constraints; however, Section 6.2.2 discusess security/integrity extensions and this document suggests, in Section 7, that the the nvo3 WG work on security options for Geneve.¶
This document requires no IANA actions.¶
This section presents a comparison of the three NVO3 encapsulation proposals, Geneve [RFC8926], GUE [ietf_intarea_gue], and VXLAN-GPE [nvo3_vxlan_gpe]. The three encapsulations use an outer UDP/IP transport. Geneve and VXLAN-GPE use an 8-octet header, while GUE uses a 4-octet header. In addition to the base header, optional extensions may be included in the encapsulation, as discussed in Section A.2 below.¶
The Geneve and GUE encapsulations both enable optional headers to be incorporated at the end of the base encapsulation header.¶
VXLAN-GPE does not provide native support for header extensions. However, as discussed in [nvo3_vxlan_gpe], extensibility can be attained to some extent if the Network Service Header (NSH) [RFC8300] is used immediately following the VXLAN-GPE header. NSH supports either a fixed-size extension (MD Type 1), or a variable-size TLV-based extension (MD Type 2). Note that NSH-over-VXLAN-GPE implies an additional overhead of the 8-octets NSH header, in addition to the VXLAN-GPE header.¶
The Geneve Variable Length Options are defined as Type/Length/Value (TLV) extensions. Similarly, VXLAN-GPE, when using NSH, can include NSH TLV-based extensions. In contrast, GUE defines a small set of possible extension fields (proposed in [ietf_gue_extensions]), and a set of flags in the GUE header that indicate for each extension type whether it is present or not.¶
TLV-based extensions, as defined in Geneve, provide the flexibility for a large number of possible extension types. Similar behavior can be supported in NSH-over-VXLAN-GPE when using MD Type 2. The flag-based approach taken in GUE strives to simplify implementations by defining a small number of possible extensions used in a fixed order.¶
The Geneve and GUE headers both include a length field, defining the total length of the encapsulation, including the optional extensions. This length field simplifies the parsing by transit devices that skip the encapsulation header without parsing its extensions.¶
The Geneve encapsulation header includes the 'C' field, which indicates whether the current Geneve header includes critical options, that is to say, options which must be parsed by the target NVE. If the endpoint is not able to process a critical option, the packet is discarded.¶
The maximal header length in Geneve, including options, is 260 octets. GUE defines the maximal header to be 128 octets. VXLAN-GPE uses a fixed-length header of 8 octets, unless NSH-over-VXLAN-GPE is used, yielding an encapsulation header of up to 264 octets.¶
The Geneve and VXLAN-GPE headers both include a 24-bit VNI field. GUE, on the other hand, enables the use of a 32-bit field called VNID; this field is not included in the GUE header, but was defined as an optional extension in [ietf_gue_extensions].¶
The VXLAN-GPE header includes the 'I' bit, indicating that the VNI field is valid in the current header. A similar indicator is defined as a flag in the GUE header [ietf_gue_extensions].¶
All three encapsulation headers include a field that specifies the type of the next protocol header, which resides after the NVO3 encapsulation header. The Geneve header includes a 16-bit field that uses the IEEE Ethertype convention. GUE uses an 8-bit field, which uses the IANA Internet protocol numbering. The VXLAN-GPE header incorporates an 8-bit Next Protocol field, using a VXLAN-GPE-specific registry, defined in [nvo3_vxlan_gpe].¶
The VXLAN-GPE header also includes the 'P' bit, which explicitly indicates whether the Next Protocol field is present in the current header.¶
The OAM bit, which is defined in Geneve and in VXLAN-GPE, indicates whether the current packet is an OAM packet. The GUE header includes a similar field, but uses different terminology; the GUE 'C-bit' specifies whether the current packet is a control packet. Note that the GUE control bit can potentially be used in a large set of protocols that are not OAM protocols. However, the control packet examples discussed in [ietf_intarea_gue] are OAM-related.¶
Each of the three NVO3 encapsulation headers includes a 2-bit Version field, which is currently defined to be zero.¶
The Geneve and VXLAN-GPE headers include reserved fields; 14 bits in the Geneve header, and 27 bits in the VXLAN-GPE header are reserved.¶
The following table summarizes the comparison between the three NVO3 encapsulations. In some cases a plus sign ("+") or minus sign ("-") is used to indicate that the header is stronger or weaker in an area respectively.¶
The following co-authors have contributed to this document:.¶