Crubit: C++/Rust Bidirectional Interop Tool
NOTE: Crubit currently expects deep integration with the build system, and is difficult to deploy to environments dissimilar to Google's monorepo. External contributions are accepted, but may in some cases be difficult to integrate for tooling reasons. See CONTRIBUTING. Both of these are being worked on, see https://github.com/google/crubit/blob/main/docs/overview/status.md#usage-outside-of-google
Crubit is a bidirectional bindings generator for C++ and Rust, with the goal of integrating the C++ and Rust ecosystems.
Status
See the [status](http://
Example
C++
Consider the following C++ function:
bool IsGreater(int lhs, int rhs);
This function, if present in a header file which is processed by Crubit, becomes callable from Rust as if it were defined as:
pub fn IsGreater(lhs: ffi::c_int, rhs: ffi::c_int) -> bool {...}
Note: There are some temporary restrictions on the API shape. For example,
functions that accept a type like std::map can't be called from Rust
directly via Crubit. These restrictions will be relaxed over time.
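To make the call-site shape concrete, here is a minimal sketch that uses a hand-written stand-in for the binding. In a real build the function would be generated by Crubit from the header and would forward to the C++ implementation; the comparison body below is an assumption made only so the snippet runs on its own.

```rust
use std::ffi::c_int;

// Hand-written stand-in for the binding Crubit would generate from
// `bool IsGreater(int lhs, int rhs);`. The body is an assumption: the real
// binding calls into C++.
#[allow(non_snake_case)]
pub fn IsGreater(lhs: c_int, rhs: c_int) -> bool {
    lhs > rhs
}

fn main() {
    // C++ `int` appears as `c_int` on the Rust side, not as a fixed-width type.
    let lhs: c_int = 42;
    let rhs: c_int = 7;
    println!("IsGreater({lhs}, {rhs}) = {}", IsGreater(lhs, rhs));
}
```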
Rust
Consider the following Rust function:
pub fn is_greater(lhs: i32, rhs: i32) -> bool { ... }
This function becomes callable from C++ as if it were defined as:
bool is_greater(int32_t lhs, int32_t rhs);
Note: There are some temporary restrictions on the API shape. For example, functions that accept two mutable references can't be called from C++ directly via Crubit. These restrictions will be relaxed over time.
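As a rough illustration of the restriction, the sketch below contrasts a function with the supported shape against one that takes two mutable references. The name swap_values and both function bodies are illustrative assumptions, not code from Crubit.

```rust
// A function with this shape is the kind described above as callable from
// C++ (as `bool is_greater(int32_t lhs, int32_t rhs);`).
pub fn is_greater(lhs: i32, rhs: i32) -> bool {
    lhs > rhs
}

// Hypothetical example of the restricted shape: two mutable references in one
// signature. Per the note above, a function like this cannot currently be
// called from C++ directly via Crubit.
pub fn swap_values(a: &mut i32, b: &mut i32) {
    std::mem::swap(a, b);
}

fn main() {
    let (mut x, mut y) = (1, 2);
    swap_values(&mut x, &mut y);
    println!("x = {x}, y = {y}, is_greater(x, y) = {}", is_greater(x, y));
}
```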
Getting Started
We have detailed walkthroughs on how to use C++ from Rust, or Rust from C++, using Crubit, as well as copy-pastable example code. The example code also includes snapshots of what the generated bindings look like.
- Walkthrough: Rust Bindings for C++ Libraries
  - Examples: examples/cpp/
- Walkthrough: C++ Bindings for Rust Libraries
  - Examples: examples/rust/
Building Crubit
Cargo
cc_bindings_from_rs
You can build cc_bindings_from_rs, which allows Rust code to be called from
C++, using cargo build --bin cc_bindings_from_rs.
rs_bindings_from_cc
Prerequisites:
- Requires LLVM and Clang libraries to be built and installed.
- They must be built with support for compression (zlib), which is the default build config.
- Requires Abseil libraries to be built and installed.
- Requires zlib (e.g. libz.so) to be available in the system include and lib paths.
- An up-to-date stable Rust toolchain.
Linux-specific setup:
# Choice of compiler is optional.
export CC=/path/to/clang
export CXX=/path/to/clang++
# We must use `lld` linker via clang. It must be in the PATH.
export PATH="$PATH:/dir/containing/lld"
export RUSTFLAGS="$RUSTFLAGS -Clinker=/path/to/clang"
export RUSTFLAGS="$RUSTFLAGS -Clink-arg=-fuse-ld=lld"
# If you want to use a sysroot.
# SYSROOT_FLAG=--sysroot=$SYSROOT
# export CXXFLAGS="$CXXFLAGS $SYSROOT_FLAG"
# export RUSTFLAGS="$RUSTFLAGS -Clink-arg=$SYSROOT_FLAG"
macOS-specific setup:
export CC=clang
export CXX=clang++
export RUSTFLAGS="$RUSTFLAGS -Clinker=clang"
export RUSTFLAGS="$RUSTFLAGS -Clink-arg=-fuse-ld=lld"
# Point to the Xcode sysroot.
export CXXFLAGS="$CXXFLAGS -isysroot $(xcrun --show-sdk-path)"
export RUSTFLAGS="$RUSTFLAGS -Clink-arg=-isysroot -Clink-arg=$(xcrun --show-sdk-path)"
Windows-specific setup:
- Windows is currently unsupported, and the APIs generated by Crubit may not compile and will change over time.
- All commands must be run from a development shell, where MSVC environment variables are set up.
# We use clang compiler (clang-cl); MSVC may work too but is unsupported.
export CC=clang-cl
export CXX=clang-cl
# We must use lld to link, which is spelt lld-link. So user-specified linker
# flags must be in MSVC format.
export RUSTFLAGS="$RUSTFLAGS -Clinker=/path/to/lld-link"
# LLVM was built with Zlib support. Point Crubit to the same library.
export CXXFLAGS="$CXXFLAGS /I/path/to/zlib"
export RUSTFLAGS="$RUSTFLAGS -Clink-arg=/LIBPATH:/path/to/zlib"
# Avoid deprecation warnings.
export CXXFLAGS="$CXXFLAGS /D_CRT_SECURE_NO_DEPRECATE"
# If LLVM (-DCMAKE_MSVC_RUNTIME_LIBRARY) and Abseil (-DABSL_MSVC_STATIC_RUNTIME)
# are built against static CRT, then Rust needs to match, or vice-versa.
# export RUSTFLAGS="$RUSTFLAGS -Ctarget-feature=+crt-static"
Run the build step via cargo:
# Paths for Crubit's cargo to use.
## This path contains clang/ and llvm/ dirs with their respective headers.
export CLANG_INCLUDE_PATH=/path/to/llvm/and/clang/headers
## This path contains libLLVM*.a and libclang*.a.
export CLANG_LIB_STATIC_PATH=/path/to/llvm/and/clang/libs
## This path contains absl/ dir with all the includes.
export ABSL_INCLUDE_PATH=/path/to/absl/include/dir
## This path contains libabsl_*
export ABSL_LIB_STATIC_PATH=/path/to/absl/libs
cargo build --bin rs_bindings_from_cc
Bazel
apt install clang lld bazel
git clone git@github.com:google/crubit.git
cd crubit
bazel build --linkopt=-fuse-ld=/usr/bin/ld.lld //rs_bindings_from_cc:rs_bindings_from_cc_impl
Using a prebuilt LLVM tree
git clone https://github.com/llvm/llvm-project
cd llvm-project
CC=clang CXX=clang++ cmake -S llvm -B build -DLLVM_ENABLE_PROJECTS='clang' -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install
cmake --build build -j
# wait...
cmake --install build
cd ../crubit
LLVM_INSTALL_PATH=../llvm-project/install bazel build //rs_bindings_from_cc:rs_bindings_from_cc_impl
Are we Crubit Yet?
NOTE: The bug links below, of the form b/123456, are for Google-internal
tracking purposes.
What follows is an overview of the major features Crubit does and does not support. The list is necessarily incomplete, because there exist more features and types than could be feasibly listed in anything readable, but it should give a rough idea.
This page should evolve over time:
- If the status of a given feature is not listed, and not clear based on what is here, we should add it.
- Some features may not have bug IDs attached. If a feature is actively requested, it should be listed with a given bug that updates will be posted to.
- This page may fall out of date, since the set of features supported by Crubit is documented in many places. Sorry! Please update it if you notice any problems.
Types
See
Unless otherwise specified, the types below are supported and ABI-compatible
(see
- integer types (except 128-bit integers)
- floating point types
- user-defined types
- These are either layout-compatible (usually) or ABI-compatible (rarely – if all member types are supported, and it's nonempty, and it uses no obscure attributes)
- function pointers, where the parameters and return type are in this list and are ABI-compatible
- std::string_view / absl::string_view
- Bridged: std::string
- Bridged: &str
- Bridged: Rust tuples (e.g. (i32, i64))
- Bridged: std::optional<T>
- Bridged: (allowlisted) protocol buffers
- Bridged: absl::Status
- raw pointers to any ABI-compatible or layout-compatible item in this list
We have experimental unreleased support for the following types:
- (2025H2) b/362475441: references and pointers to MaybeUninit<T>, which are treated as T.
We have planned support for the following types:
- (2025H2) b/271016831: layout-compatible *const [T], *mut [T]
- (2025H2) bridged Option<T>
- (2025) b/356638830: layout-compatible std::vector
- (2025) b/369994952: layout-compatible std::unique_ptr
The following types are not yet supported, among many others:
- b/254507801: Rust !
- b/260128806: Arrays (std::array<T, N>, [T; N])
- b/254094650: i128 and u128
- Rust String
- Result<T, E>
- b/254099023: () as anything but a return type.
- b/213960614: std::byte
C++
For C++ libraries, used from Rust, we have support for the following language features, used in public interfaces:
- rust-movable structs. (Either trivially copyable, or [[clang::trivial_abi]])
- rust-movable unions.
- enums
- type aliases
- non-overloaded functions (which are not member functions)
- inline or non-inline
- extern "C" or non-extern "C"
We have experimental unreleased support for the following language features:
- forward declarations
- non-trivial types
- b/356224404: non-overloaded member functions, (overloaded) constructors and assignment operators
- templated types, bridged to a non-generic concrete type.
  - e.g. vector<int> becomes struct __crubit_mangled_vector_i, not struct vector<T>(...)
  - specialization
- operator overloading
- nullability annotations
- lifetime annotations, mapped unsafely to references
- Some object-orientation:
- types with non-virtual base classes
- upcasting
- downcasting
- inherited methods
The following features are not supported yet, among many others:
- b/213280424: overloading
- b/313733992: Object-Oriented Programming more generally
- e.g., cannot derive from a C++ class and override its virtual methods
- safe support for references
- template-generic bridging, so that a C++ template becomes a Rust generic
- non-type using aliases
  - using enum
  - using namespace
- constants
- macros
Rust
For Rust libraries, used from C++, we have support for the following language features, used in public interfaces:
- structs
- repr(C) unions
- opaque representations of other user-defined types
- enums
- non-repr(C) unions
- aliases (via use, type)
- functions and methods
- references
- specific known traits with equivalents in C++:
  - Clone
  - Default
  - Drop
  - From
- simple const constants
- Defining a C++ enum from Rust
We have experimental unreleased support for the following language features:
- non-opaque enums
- non-opaque non-repr(C) unions
The following features are not supported yet, among others:
- traits and trait methods in general
- defining C++ abstractions from Rust
- inheriting from a C++ class
- defining a C++ base class
- statics and more complex const constants
- macros
Usage outside of Google
Crubit was initially written to take advantage of the superpowers that come with a centrally controlled monorepo using a Bazel build system. However, this presents a high barrier to entry: in order to use Crubit, you must satisfy all of the preconditions.
In 2026, we are building Crubit up to be a tool shaped like OSS users expect: an IDL-based FFI tool with Cargo integration, with options for a better experience in codebases with strong control over the build environment. (Though for calling Rust from C++, we might stop short of an IDL, and instead rely on compiler-synced binary releases, since there is only one compiler.)
In particular, this involves decomposing Crubit into a collection of parts that can be used on their own, without needing to consume the whole:
- Reusable libraries that implement C++ functionality (e.g., forward declarations, nontrivial object semantics.)
- An IDL-based core, with optional compiler integration at the front-end.
- Support for building with Cargo, stable named versions of Clang or Rust, etc.
Decoupling from the toolchain
By using an IDL as input, instead of a C++ compiler frontend, Crubit can be made compatible with arbitrary C++ compilers: a human can write the IDL in a way that is compatible with the compiler in question, even if Crubit does not integrate with that compiler yet.
For the Rust compiler, however, there is only one. The main toolchain integration hazard is that the compiler and its arguments must be exactly matched with the version and arguments used to compile the Rust crate itself. This can be resolved by using rmeta files as inputs, instead of source code.
TODO:
- rs_bindings_from_idl and idl_from_cc exist, and Crubit can be used with IDL inputs
- cc_bindings_from_rs can accept rmeta inputs
Crate Ecosystem
TODO:
- Crubit accepts pull requests and regularly reviews GitHub issues and PRs.
- A C++ stdlib crate exists in crates.io
- The Crubit ctor crate is either replaced with pin-init, the equivalent standard library module, or else has a crate in crates.io with documentation and an explanation of why to use it vs pin-init.
- For all other support libraries: they exist in crates.io and are documented.
Build System
We currently only support Bazel.
TODO:
- cc_bindings_from_rs builds using Cargo
- rs_bindings_from_cc builds using Cargo
- idl_bindings_from_cc, rs_bindings_from_idl build using Cargo
- Crubit is usable as a Bazel dependency
- Crubit builds against public Rust and Clang releases
- Crubit binary releases
- (not planned) Buck2
- (not planned) CMake
Types
Overview
In brief, Crubit supports:
- Primitive types (/types/primitive), such as float or i32.
- Pointer types (/types/pointer), such as float* or *const i32, including function pointers.
- User-defined types, with some language-specific rules and restrictions. (See /cpp and /rust).
ABI-Compatibility
Certain references to C++ or Rust types will not receive Crubit bindings. Some types may only be usable in certain locations due to current Crubit limitations, inherent properties of the type, or both. Supported types fall into one of three categories ranging from "most widely supported" to "most restricted":
- ABI-compatible: these types have a C-ABI-equivalent representation which can be used anywhere a value of this type is expected from both C++ and Rust.
- Layout-compatible: these types have equivalent in-memory representations in C++ and Rust but cannot be represented using standard C ABI. These types will only be usable as by-value function arguments if they are C++-movable. For example, Box<i32> is not C++-movable because it has no nullptr / moved-from representation.
- Bridged: these types may have different in-memory representations in C++ and Rust, and so can only be passed by-value between the two languages. Examples include Rust tuples, which are bridged by-value into C++ std::tuple.
| Level of Support | Example | Pass by-reference | Pass by-value | Return by-value | Fields | In Function Pointer Types |
|---|---|---|---|---|---|---|
| ABI Compatible | i32 | Y | Y | Y | Y | Y |
| Layout-compatible C++ type | absl::string_view | Y | if Rust-movable [1] | if Rust-movable [2] | Y | N |
| Layout-compatible Rust type | UserDefinedStruct | Y | if C++ movable [3] | Y | Y | N |
| Bridged | (i32, i32) | N | Y | Y | N | N |
See
See
See
NOTE: All primitive and pointer types are ABI-compatible. However, due to b/369895805, all non-bridged user-defined types are only layout-compatible.
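As a reading aid, the support table can be encoded as a small lookup. This is only a sketch of the table's rules: the Support type and its methods are invented for illustration and are not part of Crubit, and the "movable" flag stands in for the Rust-movable / C++-movable conditions.

```rust
/// The three support levels described above (illustrative names only).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Support {
    AbiCompatible,
    /// `movable` stands for "Rust-movable" (C++ types) or "C++-movable" (Rust types).
    LayoutCompatible { movable: bool },
    Bridged,
}

impl Support {
    /// Bridged types cannot be passed by reference.
    pub fn pass_by_reference(self) -> bool {
        !matches!(self, Support::Bridged)
    }

    /// Layout-compatible types can be passed by value only if movable.
    pub fn pass_by_value(self) -> bool {
        match self {
            Support::AbiCompatible | Support::Bridged => true,
            Support::LayoutCompatible { movable } => movable,
        }
    }

    /// Bridged types cannot be used as fields.
    pub fn as_field(self) -> bool {
        !matches!(self, Support::Bridged)
    }

    /// Only ABI-compatible types may appear in function pointer types.
    pub fn in_function_pointer(self) -> bool {
        matches!(self, Support::AbiCompatible)
    }
}

fn main() {
    let levels = [
        Support::AbiCompatible,
        Support::LayoutCompatible { movable: true },
        Support::LayoutCompatible { movable: false },
        Support::Bridged,
    ];
    for level in levels {
        println!(
            "{level:?}: by-ref={} by-value={} field={} fn-ptr={}",
            level.pass_by_reference(),
            level.pass_by_value(),
            level.as_field(),
            level.in_function_pointer()
        );
    }
}
```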
In the following examples, foo receives bindings, but bad_foo will not
receive bindings, because while the types it uses in its function signature are
supported by Crubit, they are not supported in this particular context.
C++
void foo(int32_t);
void foo(void (*)(int32_t));
void foo(Status);
struct LayoutCompatibleType {
  UnsupportedType field;
  // or [[no_unique_address]] int field; or...
};
// bad_foo cannot receive bindings, because the function pointer type
// does not work with non-ABI-compatible types
void bad_foo(void (*)(LayoutCompatibleType));
// bad_foo cannot receive bindings, because bridged types cannot be passed
// by reference
void bad_foo(const Status&);
Rust
pub fn foo(_: i32) {}
pub fn foo(_: fn(i32)) {}
pub fn foo(_: Status) {}

struct LayoutCompatibleType {
    field: UnsupportedType,
}
// bad_foo cannot receive bindings, because the function pointer type
// does not work with non-ABI-compatible types
pub fn bad_foo(_: fn(LayoutCompatibleType)) {}

// bad_foo cannot receive bindings, because bridged types cannot be passed
// by reference
fn bad_foo(_: &Status) {}
Bidirectionality
Usually, the mapping of types between languages is bidirectional. For example, a
C++ function which returns an int32_t will become a Rust function returning an
i32, and vice versa. In some sense, an i32 is an int32_t.
However, in other cases, the mapping is not reversible. C++ and Rust have types
or aliases that the other language does not. For example, isize becomes
intptr_t, but intptr_t is (on some platforms) the same type as int64_t,
and so intptr_t becomes i64.
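The non-reversibility comes from the fact that Rust keeps isize and i64 as distinct types even where they have the same size. A minimal sketch of that fact, using only the standard library (the helper same_type is a made-up name):

```rust
use std::any::TypeId;
use std::mem::size_of;

/// Returns true if `A` and `B` are the same Rust type.
pub fn same_type<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() == TypeId::of::<B>()
}

fn main() {
    // `isize` and `i64` are always distinct Rust types...
    println!("isize is i64: {}", same_type::<isize, i64>());

    // ...even on 64-bit targets, where they have the same size. A C++
    // `intptr_t` that is just an alias of `int64_t` carries no such distinction.
    if cfg!(target_pointer_width = "64") {
        assert_eq!(size_of::<isize>(), size_of::<i64>());
    }
    println!("size_of::<isize>() = {}", size_of::<isize>());
}
```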
Primitive types
Crubit maps primitive types [1] to the direct equivalent in the other
language. For example, C++ int32_t is Rust i32, C++ int is Rust
ffi::c_int, C++ double is Rust f64, and so on.
Exceptions:
- C++: There is no mapping for the currently-unsupported types nullptr_t, char8_t, wchar_t, and (u)int128_t.
- Rust: There is no mapping for the currently-unsupported char and str types, and the never (!) type, except as a return type.
For more information, see Unsupported types.
Bidirectional type mapping
The following map is bidirectional. If you call a C++ interface from Rust using
Crubit, then int32_t in C++ becomes i32 in Rust. Vice versa, if you call a
Rust interface from C++ using Crubit, i32 in Rust becomes int32_t in C++.
| C++ | Rust |
|---|---|
| void | () as a return type, ::core::ffi::c_void otherwise. |
| int8_t | i8 |
| int16_t | i16 |
| int32_t | i32 |
| int64_t | i64 |
| intptr_t | isize |
| uint8_t | u8 |
| uint16_t | u16 |
| uint32_t | u32 |
| uint64_t | u64 |
| uintptr_t | usize |
| bool | bool |
| double | f64 |
| float | f32 |
| char | ::core::ffi::c_char [2] |
| signed char | ::core::ffi::c_schar |
| unsigned char | ::core::ffi::c_uchar |
| short | ::core::ffi::c_short |
| unsigned short | ::core::ffi::c_ushort |
| int | ::core::ffi::c_int |
| unsigned int | ::core::ffi::c_uint |
| long | ::core::ffi::c_long |
| unsigned long | ::core::ffi::c_ulong |
| long long | ::core::ffi::c_longlong |
| unsigned long long | ::core::ffi::c_ulonglong |
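The practical difference between the fixed-width rows and the c_* rows is that the latter vary by target. The sketch below prints a few sizes and checks the signedness of c_char on the current target; it uses only the standard library, and the helper name c_char_is_signed is an assumption for illustration.

```rust
use std::ffi::{c_char, c_int, c_long};
use std::mem::size_of;

/// Reports whether `c_char` is signed on the current target.
pub fn c_char_is_signed() -> bool {
    // `c_char` is an alias for `i8` or `u8`, so its minimum is -128 or 0.
    c_char::MIN != 0
}

fn main() {
    // Fixed-width: the same on every target.
    println!("i32:    {} bytes", size_of::<i32>());
    // Target-dependent: these follow the C ABI of the platform.
    println!("c_int:  {} bytes", size_of::<c_int>());
    println!("c_long: {} bytes", size_of::<c_long>());
    println!("c_char is signed: {}", c_char_is_signed());
}
```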
One-way type mapping
The types below are mapped in only one direction, but do not round trip back to
the original type. For example, size_t maps to usize, but usize maps to
uintptr_t.
C++ to Rust
The following C++ types become the following Rust types, but not vice versa:
| C++ | Rust |
|---|---|
| ptrdiff_t | isize |
| size_t | usize |
| char16_t | u16 |
| char32_t | u32 [3] |
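One reason char32_t maps to u32 rather than to Rust char is that not every 32-bit value is a valid Unicode scalar value. A minimal sketch of the checked conversion a caller might do on the Rust side (the helper name char32_to_char is hypothetical):

```rust
/// Converts a value received as C++ `char32_t` (a plain `u32` in Rust)
/// into a Rust `char`, if it is a valid Unicode scalar value.
pub fn char32_to_char(code_unit: u32) -> Option<char> {
    char::from_u32(code_unit)
}

fn main() {
    println!("{:?}", char32_to_char(0x41)); // Some('A')
    println!("{:?}", char32_to_char(0xD800)); // None: a surrogate, not a scalar value
    println!("{:?}", char32_to_char(0x11_0000)); // None: beyond the Unicode range
}
```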
One-way mapping of Rust to C++ types
The following Rust types become the following C++ types, but not vice versa:
| Rust | C++ |
|---|---|
| ! (return type) | void |
Unsupported types
Bindings for the following types are not supported at this point:
C++
- nullptr_t and char8_t have not yet been implemented.
- b/283268558: wchar_t is currently unsupported, for portability reasons.
- b/254094650: int128_t is currently unsupported, because it does not yet have a decided ABI.
Rust
- char is currently unsupported, pending design review.
- b/262580415: str has not yet been implemented.
- b/254507801: ! has not yet been implemented except for return types.
[1] Rust calls these types primitive types, while C++ calls them fundamental types. Since the Rust terminology is probably well understood by everybody, we use it here.
[2] Note that Rust c_char and C++ char have different signedness in Google, or any other codebase with widespread use of unsigned char in x86.
TODO(jeanpierreda): document this in more detail.
[3] Unlike Rust char, char16_t and char32_t may contain invalid Unicode characters.
Pointer types
C++ defines two categories of pointer types, while Rust adds a third. They are:
- Pointers to some (non-function) object, without lifetime information. C++ calls these object pointers, while Rust calls them raw pointers.
- Function pointers (C++, Rust).
- Finally, Rust references: non-aliasing pointers with lifetime information.
With the exception of Rust references, which are only
permitted in limited circumstances, pointer types are fully supported as long as
the type they point to is supported. For example, const int32_t* maps
bidirectionally to *const i32, and void (*)(int32_t) maps bidirectionally to
fn(i32).
Object pointers
An "object pointer" is the C++ terminology for any pointer that is not a function pointer. Rust would call these "raw pointers". These are mapped to each other bidirectionally:
| C++ | Rust |
|---|---|
| const T* | *const T |
| T* | *mut T |
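On the Rust side, working with these raw pointers means the caller carries the validity obligations that the C++ signature leaves implicit. A minimal sketch, with hypothetical function names that are not generated by Crubit:

```rust
/// Reads through a `*const i32` (C++ `const int32_t*`); a null pointer yields `None`.
///
/// # Safety
/// `p` must be null or point to a valid, initialized `i32`.
pub unsafe fn read_or_none(p: *const i32) -> Option<i32> {
    unsafe { p.as_ref().copied() }
}

/// Writes through a `*mut i32` (C++ `int32_t*`); a null pointer is ignored.
///
/// # Safety
/// `p` must be null or point to a valid, writable `i32`.
pub unsafe fn set(p: *mut i32, value: i32) {
    if !p.is_null() {
        unsafe { *p = value };
    }
}

fn main() {
    let mut v = 5;
    // References coerce to raw pointers at the call site.
    let before = unsafe { read_or_none(&v) };
    unsafe { set(&mut v, 9) };
    println!("before = {before:?}, after = {v}");
    println!("null read = {:?}", unsafe { read_or_none(std::ptr::null()) });
}
```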
C++ pointers with lifetime
C++ allows attaching lifetime annotations to arbitrary types, including pointers. There are two competing annotations for this, neither of which is supported in Rust bindings yet:
- [[clang::lifetimebound]]
- Lifetime attributes
Function pointers
C++ function pointers map to Rust extern "C" fn(...) -> ... function pointers,
and vice versa:
| C++ | Rust |
|---|---|
| void(&)(int32_t) | extern "C" fn(i32) |
| void(*)(int32_t) | Option<extern "C" fn(i32)> |
| std::type_identity_t<void(int32_t)> | Not supported [1] |
If the corresponding C++ function definition would be unsafe in Rust (per the
rules for C++ function declarations), then so is
the function pointer – for example, a C++ reference to void(void*) becomes a
Rust unsafe extern "C" fn(_: *mut c_void).
Not all function pointers receive bindings. If the function cannot be called directly, due to a known or potential ABI mismatch between Rust and C++, then the function pointer receives no bindings. In particular, function pointers cannot take layout-compatible types by value. You can work around this by taking or returning such problematic types by pointer instead of by value.
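The Option wrapper in the table reflects that a C++ function pointer may be null while a Rust fn pointer may not. Rust represents None as the null pointer here, so the two forms have the same size. A minimal sketch of that, using hypothetical names (log_value, call_if_present):

```rust
use std::mem::size_of;

// A Rust function with the C ABI, usable where C++ expects `void(*)(int32_t)`.
extern "C" fn log_value(value: i32) {
    println!("value = {value}");
}

/// Calls `callback` if it is non-null and reports whether it was called.
/// The parameter mirrors a C++ parameter of type `void (*)(int32_t)`.
pub fn call_if_present(callback: Option<extern "C" fn(i32)>, value: i32) -> bool {
    match callback {
        Some(f) => {
            f(value);
            true
        }
        None => false,
    }
}

fn main() {
    // `None` is stored as the null pointer, so no extra space is needed.
    assert_eq!(
        size_of::<Option<extern "C" fn(i32)>>(),
        size_of::<extern "C" fn(i32)>()
    );
    call_if_present(Some(log_value), 7);
    call_if_present(None, 7);
}
```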
Lifetime
All function pointers are 'static.
There is no way to specify the lifetime of a function pointer in Rust, nor in
C++: both assume a 'static lifetime. In scenarios where the lifetime may be
shorter than 'static (e.g., JIT compilation, or dynamic loading and unloading
of shared libraries at runtime), the developer is responsible for managing the
lifetime of the function pointer.
[1] C++ has plain function types: the type pointed to by function pointers. There is no Rust equivalent. However, since C++ functions implicitly coerce to function pointers, this only comes up in template classes like std::function or absl::AnyInvocable. Or, in this case, type_identity_t.
Rust references
Rust references, unlike C++ references, cannot mutably alias. This introduces a new form of Undefined Behavior (UB) that many C++ programmers may not be accustomed to. For now, C++ pointers and references do not map to Rust references. Instead, they map to Rust raw pointers. Vice versa, Rust references are an unsupported type which do not map to any C++ type at all.
The one exception to this rule is function parameters. In some limited
circumstances, Rust functions may accept references, and the corresponding C++
interface will accept C++ references. This is documented in
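To illustrate why raw pointers are the safer default mapping, the sketch below passes the same pointer twice, as C++ code freely may. With raw pointers this is permitted; the equivalent call with two &mut references is rejected by the Rust compiler. The function name bump_both is made up for this example.

```rust
/// Increments the value behind each pointer. The two pointers are allowed to
/// point to the same `i32`, as C++ pointers and references may.
///
/// # Safety
/// Both pointers must be valid for reads and writes.
pub unsafe fn bump_both(a: *mut i32, b: *mut i32) {
    unsafe {
        *a += 1;
        *b += 1;
    }
}

fn main() {
    let mut counter = 0;
    let p: *mut i32 = &mut counter;

    // Passing the same pointer twice is fine for raw pointers.
    unsafe { bump_both(p, p) };
    println!("counter = {counter}"); // counter = 2

    // A reference-based signature, `fn bump_both_refs(a: &mut i32, b: &mut i32)`,
    // could not be called this way: `bump_both_refs(&mut counter, &mut counter)`
    // fails to compile because `counter` would be mutably borrowed twice.
}
```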
absl::Status in Rust
NOTE: The APIs here have planned future backwards-incompatible changes, and you may see LSCs as we migrate to the end state API.
In Google C++, the standard types for communicating an error are absl::Status
and absl::StatusOr<T>. These have support in Rust when they are directly
passed by value, or returned by value, and are mapped to a Rust Result. For
example:
absl::Status Foo();
This becomes:
pub fn Foo() -> Result<(), StatusError> {...}
(Specifically, it will return Status, which is an alias for Result<(), StatusError>.)
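Because the binding returns a Result, callers can use ordinary Rust error handling. The sketch below uses hand-written stand-ins: this StatusError struct, its message field, and the check_positive function are assumptions for illustration, not the real Crubit types, and a real binding would call into C++.

```rust
/// Hypothetical stand-in for the error type; the real `StatusError` is
/// provided by Crubit's support libraries and has a different definition.
#[derive(Debug, PartialEq)]
pub struct StatusError {
    pub message: String,
}

/// Mirrors the alias described above.
pub type Status = Result<(), StatusError>;

/// Stand-in for a binding to a C++ function returning `absl::Status`.
pub fn check_positive(value: i32) -> Status {
    if value > 0 {
        Ok(())
    } else {
        Err(StatusError {
            message: format!("{value} is not positive"),
        })
    }
}

/// Errors propagate with `?`, like any other `Result`.
pub fn run() -> Status {
    check_positive(1)?;
    check_positive(-1)?;
    Ok(())
}

fn main() {
    match run() {
        Ok(()) => println!("ok"),
        Err(e) => println!("error: {}", e.message),
    }
}
```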
Calling C++ APIs using Status
C++ functions returning Status/StatusOr can be defined as normal:
cs/file:examples/types/absl_status/cpp_api.h content:ReturnsStatus
...and will return a Result:
cs/file:examples/types/absl_status/user_of_cpp_api.rs content:ReturnsStatus
Calling Rust APIs using Status
Unlike when calling C++ APIs, currently you cannot directly call a Rust API
returning a Status or StatusOr. Instead, it must use a workaround type,
StatusWrapper. This is tracked by b/441266536.
cs/file:examples/types/absl_status/rust_api.rs
The StatusWrapper type automatically becomes an absl::Status in C++:
cs/file:examples/types/absl_status/user_of_rust_api.cc content:rust_api::ReturnsStatus
Future Evolution
We expect to stop using Result, and instead use the plain actual bindings for
absl::Status itself, using the Try trait to enable conversion into Result
and error handling via ?.
This would allow Status to be used not only as function parameter and return
values, but also in struct fields, arrays, or behind pointers and references.
However, this is blocked on stabilization of the Try trait.
C++/Rust Protobuf interop
WARNING: This page documents functionality that is currently internal to the Google monorepo.
Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. Once you define how you want your data to be structured, you can generate source code in a variety of languages to manipulate and serialize/deserialize your structured data. Protobuf messages are among the most common types at Google, appearing in the vast majority of APIs.
The usual way to pass data from one language to another using Protobufs is to serialize a message in one language and deserialize it in another. This serialization/deserialization has costs that make this approach unsuitable for hot code paths.
To avoid those costs, we've intentionally designed C++ and Rust Protobuf message types to have identical layouts. We avoid the need for serialization/deserialization and instead we directly use the same message object from both languages. Crubit automatically generates the zero-cost glue code for us. For example, take this piece of a C++ header:
MyProto Foo();
This becomes available to Rust as:
pub fn Foo() -> MyProto {...}
(Specifically, Crubit will detect that this is a Protobuf message, and it will convert from the C++ message type to the Rust message type.)
Calling Rust APIs using Protobuf message types
| Rust | C++ |
|---|---|
| Message | Message |
| MessageView | const Message* |
| MessageMut | Message* |
Protocol buffers are supported by value, and using the View and Mut view
types, where they are mapped to C++ pointers.
See cc_bindings_from_rs/test/bridging/protobuf/rust_lib.rs for an example definition, and cc_bindings_from_rs/test/bridging/protobuf/user_of_rust_lib.cc for how to call it from C++.
Calling C++ APIs using Protobuf message types
Calling C++ APIs which use protobuf is slightly more difficult.
First of all, add your proto_library target to the
[allowlist](http://
Passing by value
| C++ | Rust |
|---|---|
| Message | Message |
When a C++ proto message is passed or returned by value, it is mapped directly to the Rust message type, as you would expect.
C++:
cs/file:google_internal/protobuf/by_value.h content:foo::Message
Rust:
cs/file:google_internal/protobuf/by_value_test.rs content:by_value\:\:|\bmsg\b
Passing by reference
| C++ | Rust |
|---|---|
| const Message*, const Message& | *const Incomplete<symbol!("Message"), ...> |
| Message*, Message& | *mut Incomplete<symbol!("Message"), ...> |
When a C++ proto is passed by pointer or by reference, the Rust type is a pointer to a forward declaration of the C++ protocol buffer type.
In particular, C++ APIs are not exposed using the View or Mut types.
These are pointers because C++ APIs do not annotate ownership, lifetime, or
aliasing properties, and so these cannot be mapped to the distinct owned,
View, or Mut types of the Rust protobuf API. And these are forward-declared
because the C++ types do not have direct Rust bindings: the generated .proto.h
file does not get piped through Crubit.
- To convert a Rust Proto to a C++ const Proto*: use my_proto.as_view().cpp_cast()
- To convert a Rust Proto to a C++ Proto*: use my_proto.as_mut().cpp_cast()
- To convert a C++ (const) Proto* to a Rust View/Mut: use unsafe {my_ptr.unsafe_cpp_cast()}.
See support/forward_declare.rs for the definition of
Incomplete, CppCast, and UnsafeCppCast.
For copy-pastable example code, see the examples in google_internal/protobuf/
Type visibility
In Crubit's :wrapper mode, pub(crate) types can be generated, which are
restricted to a specific library. This is generally a temporary state of
affairs: as a way of enabling types to be used for a specific library, without
exposing them everywhere, if their bindings are flawed or need work.
Visibility errors
If the generated bindings for a type are pub(crate), then bindings will not be
generated when the type is used outside of that library. For example, consider
the following library, which uses :wrapper mode:
struct WEIRD_EXPERIMENTAL_ATTRIBUTE SomeType {};
void Foo(SomeType);
If SomeType is pub(crate) because of its use of
WEIRD_EXPERIMENTAL_ATTRIBUTE, then functions, class members, constants, etc.
which use that type will only receive bindings in the same crate, and those
bindings will themselves be pub(crate):
pub(crate) struct SomeType { ... }
pub(crate) fn Foo(...: SomeType) { ... }
If a different library uses the type, and defines a similar function Bar, then
it will not receive bindings at all, because the bindings for Bar are only
visible in the library where it was defined.
void Bar(SomeType); // won't receive bindings: it's in another library
This can dramatically reduce the set of bindings which are generated, and it is
for this reason that these pub(crate) type bindings are only used sparingly,
typically for early release of features that cannot yet be globally supported.
You should not rely on the pub(crate) status of a type!
Fix
To work around this, you can wrap or hide the type as it is used in the public
API. For example, if you needed to accept a pointer to X, but X is
pub(crate), you can accept a void* instead.
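Seen from the Rust side, the workaround means the restricted type no longer appears in the public signature. The following is only a sketch of that shape under stated assumptions: X, its value field, and x_value are invented stand-ins, and in practice the function would be a binding to a C++ API taking void*.

```rust
use std::ffi::c_void;

// Stand-in for a type whose bindings are only visible inside one crate.
pub(crate) struct X {
    pub(crate) value: i32,
}

/// Public entry point that takes an untyped pointer, so `X` does not appear
/// in the signature (the analogue of accepting `void*` in the C++ API).
///
/// # Safety
/// `x` must point to a valid `X`.
pub unsafe fn x_value(x: *mut c_void) -> i32 {
    unsafe { (*x.cast::<X>()).value }
}

fn main() {
    let mut x = X { value: 3 };
    let erased = &mut x as *mut X as *mut c_void;
    println!("value = {}", unsafe { x_value(erased) });
}
```

The cost of this approach is that the pointer is untyped, so the type check moves from the compiler to the caller.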
Rust bindings for C++ libraries
When a C++ library enables Crubit, that library can be used directly from Rust. This page documents roughly what that entails, and additional subpages (available in the left-hand navigation) document specific aspects of the generated bindings.
Tip: The code examples below are pulled straight from examples/cpp/function/. The other examples in examples/cpp/ are also useful. If you prefer just copy-pasting something, start there.
How to use Crubit
Crubit allows you to call some C++ interfaces from Rust. It supports functions, rust-movable classes and structs, and enums. Crubit does not support advanced features like templates or virtual inheritance.
The rest of this document goes over how to create a C++ library that can be called from Rust, and how to actually call it from Rust. The quick summary is:
- A cc_library gets (nonempty) Rust bindings if it specifies aspect_hints = ["//features:supported"].
- Any Rust build target can depend on the bindings for a cc_library, by specifying cc_deps=["//path/to:target"].
- The bindings can be previewed using the following command:
  $ bazel build --config=crubit-genfiles //path/to:target
Write a cc_library target
The first part of creating a library that can be used by Crubit is to write a
cc_library target. For example:
cs/file:examples/cpp/function/example.h
If you write a BUILD target as normal, it will not actually get Crubit bindings, but we'll start from there:
cs/file:examples/cpp/function/BUILD symbol:example_lib_broken
Look at the generated bindings
Bindings can be generated for any C++ target, anywhere in the build graph. (Crubit is an aspect [1] on all C++ targets.) However, that is not to say that the generated bindings will be useful: by default, Crubit doesn't generate any bindings. Try it!
To examine the generated C++ bindings for the target, you can run the following command:
$ bazel build --config=crubit-genfiles //examples/cpp/function:example_lib_broken
This is the best way to preview the generated bindings for a given C++ target right now. You might end up using this a lot, so keep it in your shell history.
If you run the above command, you should see some output like the following:
Aspect //rs_bindings_from_cc/bazel_support:rust_bindings_from_cc_aspect.bzl%rust_bindings_from_cc_aspect of //examples/cpp/function:example_lib_broken up-to-date:
bazel-bin/examples/cpp/function/example_lib_broken_rust_api_impl.cc
bazel-bin/examples/cpp/function/example_lib_broken_rust_api.rs
bazel-bin/examples/cpp/function/example_lib_broken_namespaces.json
These files are the generated bindings which are used under the hood when depending on a C++ target from Rust. They consist of:
- The supporting C++ code to glue Rust and C++ together. (The .cc file.)
- The public Rust interface. (The .rs file.)
- Supporting information that is used by bindings that depend on these bindings. (The .json file.)
You don't need to check them in, as they are regenerated automatically whenever you build a Rust build target which depends on C++.
The .rs file is the interesting one for end users. For a library like
:example_lib_broken, which does not enable Crubit, the .rs file will be
essentially empty, only consisting of comments describing the bindings it did
not generate:
// Generated from: examples/cpp/function/example.h;l=11
// Error while generating bindings for item 'crubit_add_two_integers':
// Can't generate bindings for crubit_add_two_integers, because of missing required features (<internal link>):
// //examples/cpp/function:example_lib_broken needs [//features:supported] for crubit_add_two_integers (return type)
// //examples/cpp/function:example_lib_broken needs [//features:supported] for crubit_add_two_integers (the type of x (parameter #0))
// //examples/cpp/function:example_lib_broken needs [//features:supported] for crubit_add_two_integers (the type of y (parameter #1))
// //examples/cpp/function:example_lib_broken needs [//features:supported] for crubit_add_two_integers (extern "C" function)
This error is saying something important. It was trying to generate bindings for
the function crubit_add_two_integers, but it couldn't, because four different
things about the function require the supported feature to be enabled on the
target. The parameter and return types require supported, as does the function
itself in the abstract.
supported indicates that a library target supports Rust callers via Crubit,
using the stable features. Other functions and classes might require
experimental, for experimental features of Crubit: for example, if we had
defined an operator+. For more on this, see
Enable Crubit on a target
To enable Crubit on a C++ target, one must pass an argument, via aspect_hints.
Specifically, as mentioned in the comments, the target must enable the
supported feature:
cs/file:examples/cpp/function/BUILD symbol:\bexample_lib\b
This tells Crubit that it can generate bindings for this target, for any part of
the library that uses features from supported. Now, if we look at a preview of
the automatically generated bindings:
$ bazel build --config=crubit-genfiles //examples/cpp/function:example_lib
We can see the fully-fledged bindings for the library:
cs/file:examples/cpp/function/example_generated.rs
Use a C++ library from Rust
To depend on a C++ library from Rust, add it to cc_deps:
cs/file:examples/cpp/function/BUILD symbol:main
At that point, the bindings are directly usable from Rust. The interface is
identical to the .rs file previewed earlier, but can be used directly:
cs/file:examples/cpp/function/main.rs
Common Errors
Unsupported features
Some features are either unsupported, or else only supported with experimental
feature flags (
For a particularly notable example, a class cannot have a std::string field,
because std::string has properties around move semantics that Crubit does not
yet support. In turn, this means the class containing the std::string has
semantics that Crubit doesn't yet support.
The way to work around this kind of problem, in all cases, is to wrap or hide the problematic interface behind an interface Crubit can handle:
- Move nontrivial types behind a unique_ptr<T>. A std::string field is not Rust-movable, but a unique_ptr<std::string> field is.
- Hide unsupported types, in general, behind a wrapper. For example, a std::vector<T> is not supported, but a struct which wraps a unique_ptr<std::vector<int32_t>> is.
- Wrap unsupported functions behind wrappers. For example, methods are not yet supported, but top-level functions are, and can invoke methods.
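The last two workarounds can be combined. The following is a minimal sketch, assuming a hypothetical `IntList` type and hypothetical `IntList*` functions (none of these names come from Crubit): the struct holds the `std::vector` through a `unique_ptr` instead of by value, and top-level functions forward to the vector's methods. Whether `unique_ptr` itself is Rust-movable depends on the standard library ABI in use.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical wrapper: the vector is held by pointer, never by value.
struct IntList {
  std::unique_ptr<std::vector<int32_t>> items;
};

// Hypothetical top-level functions that forward to std::vector methods.
inline IntList IntListCreate() {
  return IntList{std::make_unique<std::vector<int32_t>>()};
}

inline void IntListPush(IntList& list, int32_t value) {
  list.items->push_back(value);
}

inline std::size_t IntListSize(const IntList& list) {
  return list.items->size();
}

inline int32_t IntListGet(const IntList& list, std::size_t index) {
  return list.items->at(index);  // at() throws std::out_of_range on a bad index
}
```

The wrapper exposes only the operations that callers need, which keeps the deviation from typical C++ style small and close to the interop boundary.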
[1] Crubit is an aspect: an automatically generated entity that exists on every build target. It is disabled by default, so that Rust callers don't accidentally impose on C++ libraries that weren't expecting them.
Aspects allow Crubit to fully understand the dependency graph: the
bindings for X are in the Crubit aspect of X. This allows Crubit to
generate bindings which themselves rely on bindings: if a function
in target `A` returns a struct from target `B`, we know that the
bindings for `A` will depend on the bindings for `B`. Because Crubit
is an aspect, it already knows the name of the bindings for `B`:
it's simply the Crubit aspect on `B`!
Without aspects, or something like aspects, you would need to write
down, for every library, the location of its Rust bindings. There is
no need for that kind of boilerplate when aspects are involved, and
that is why most things shaped like Crubit use aspects. For example,
protocol buffers use aspects for their generated implementations in
multiple languages. (They *also* use named rules, but the rules
simply re-export the aspect, and the underlying aspect is what is
used within the rule for referring to transitive dependencies.)
Thanks to aspects, the `proto_library` doesn't need to re-specify
"ah, and the Go proto is named `'x'`".
Be not afraid! Aspects are what make transitive dependencies work
seamlessly, without boilerplate. So when you see aspect this, or
aspect that, remember: this is a Good Thing.
C++ Bindings Cookbook
This document presents a collection of techniques for creating Rust bindings for C++ libraries.
These techniques are often workarounds for gaps in what Crubit can do. Expect the recommended practices to evolve over time, as Crubit's capabilities expand!
BEST PRACTICE: The tips below describe deviations from typical C++ style. (If typical C++ style worked, you wouldn't need a cookbook.) When you deviate from typical C++ style, document why, and try to keep changes limited in scope, close to the interop boundary.
If possible, solve the same problem while staying within more typical C++ style. For example: you may be able to add ABSL_ATTRIBUTE_TRIVIAL_ABI to a type you control, instead of boxing the type in a pointer.
Making types Rust-movable
As described in
This can happen for a couple of easily fixable reasons, described in subsections:
- The type defines a destructor or copy/move constructor / assignment operator. If it is in-principle still Rust-movable, and these functions do not care about the address of the object in memory, then the type can be annotated with ABSL_ATTRIBUTE_TRIVIAL_ABI.
- The type has a field which is not Rust-movable. In that case, the field can be boxed in a pointer.
There are other reasons a type can become non-Rust-movable, which do not have these easy fixes described below. For example, virtual methods, or non-Rust-movable base classes. For those, your only option is the hard option of more radically restructuring your code to avoid those patterns.
ABSL_ATTRIBUTE_TRIVIAL_ABI
One of the ways a type can become non-Rust-movable is if it has a copy/move
constructor / assignment operator, or a destructor. In that case, Clang will
assume that it cannot be trivially relocated, unless it is annotated with
ABSL_ATTRIBUTE_TRIVIAL_ABI.
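// Before: not Rust-movable, because of the user-defined destructor.
struct LogWhenDestroyed {
  ~LogWhenDestroyed() {
    std::cerr << "I was destroyed!\n";
  }
};
// After: Rust-movable, because the destructor does not depend on the address.
struct ABSL_ATTRIBUTE_TRIVIAL_ABI LogWhenDestroyed {
  ~LogWhenDestroyed() {
    std::cerr << "I was destroyed!\n";
  }
};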
WARNING: Only use ABSL_ATTRIBUTE_TRIVIAL_ABI if changing the location of an object in memory is safe. In particular, if the object is self-referential, using ABSL_ATTRIBUTE_TRIVIAL_ABI will result in Undefined Behavior (UB).
class SelfReferential {
 public:
  SelfReferential(const SelfReferential& other) : x(other.x), x_ptr(&x) {}

 private:
  int x = 0;
  int* x_ptr = &x;
};
Types like this, if Rust-moved, will contain invalid pointers. Carefully review any code adding ABSL_ATTRIBUTE_TRIVIAL_ABI.
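To see why the pointer becomes invalid, here is a small sketch of what a bitwise relocation does to such a type. It assumes a hypothetical `SelfPointing` struct that is deliberately trivially copyable, so that copying its bytes with `memcpy` is well-defined C++; the helper names are also made up for illustration.

```cpp
#include <cstring>

// Hypothetical, trivially copyable stand-in for a self-referential type.
struct SelfPointing {
  int x;
  int* x_ptr;  // intended invariant: x_ptr == &x
};

// Returns true if the invariant holds for `s`.
inline bool PointsAtOwnField(const SelfPointing& s) { return s.x_ptr == &s.x; }

// Simulates a Rust move: copy the bytes to a new location and run no other code.
inline void BitwiseRelocate(const SelfPointing& from, SelfPointing& to) {
  std::memcpy(&to, &from, sizeof(SelfPointing));
}
```

After `BitwiseRelocate(a, b)`, `b.x_ptr` still holds the address of `a.x`. Once `a` goes away, that pointer dangles, which is the situation the warning above describes.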
Boxing in a pointer
One of the ways a type can become non-Rust-movable is if it has a field, where the type of that field is not Rust-movable. There is no way to override this: there is nothing a type can do to make itself Rust-movable if one subobject is not.
For example, consider a field like std::string name;. std::string defines a
custom destructor and copy / move constructor/assignment operator, in order to
correctly manage owned heap memory for the string. Because of this, it is
not Rust-movable either. And, at the time of writing, std::string cannot
use ABSL_ATTRIBUTE_TRIVIAL_ABI in any STL implementation. In the case of
libstdc++, for example, std::string contains a self-referential pointer: when
the string is small enough, the data() pointer refers to the inside of the
string. Rust-moving it would cause the pointer to refer back to the old
object, which would cause undefined behavior.
If a struct or class contains a std::string as a subobject by value, or any
other non-Rust-movable object, then that struct or class is itself also not
Rust-movable. (If you somehow were able to Rust-move the parent object, this
would also Rust-move the string, causing the very same issues.)
What you can do is change the type of the field so that it doesn't contain the problematic type by value: it can hold the non-Rust-movable type by pointer instead.
BEST PRACTICE: Except where necessary for better Rust interop, this is not
good C++ style. When you use this trick, document why, and try to limit it to
types close to the interop boundary. If possible, instead of boxing T, make
T itself Rust-movable. (This is not easy for standard library types, but if
the type is under your control, it may be as easy as adding
ABSL_ATTRIBUTE_TRIVIAL_ABI.)
unique_ptr
NOTE: The following is non-portable, and only works in libc++ with the unstable ABI. If you aren't sure about whether you are using the unstable ABI, it is likely that you are not, but you might want to check in with your local toolchain maintainer.
If you tightly control your dependencies, you might be using
libc++'s unstable ABI. The unstable ABI, among other things, makes
unique_ptr<T> Rust-movable. In fact, it is Rust-movable even if T itself is
not.
This means that if a particular field is making its parent type
non-Rust-movable, one fix is to wrap it in a unique_ptr:
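// Before: not Rust-movable, because std::string is held by value.
struct Person {
  std::string name;
  int age;
};
// After: the string is held by pointer.
struct Person {
  // boxed to make Person Rust-movable: <internal link>/cpp/cookbook#boxing
  std::unique_ptr<std::string> name;
  int age;
};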
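One side effect of this change is worth spelling out: `unique_ptr` is move-only, so the boxed `Person` is no longer implicitly copyable, and C++ code that copied it must now do so explicitly. A minimal sketch, assuming hypothetical helpers `MakePerson` and `ClonePerson` that are not part of the original example:

```cpp
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

struct Person {
  // Boxed so that Person holds the string by pointer, not by value.
  std::unique_ptr<std::string> name;
  int age;
};

// The implicit copy constructor is deleted because unique_ptr is move-only.
static_assert(!std::is_copy_constructible<Person>::value,
              "boxing in unique_ptr makes Person move-only");

// Hypothetical helper: builds a Person, allocating the boxed name.
inline Person MakePerson(std::string name, int age) {
  return Person{std::make_unique<std::string>(std::move(name)), age};
}

// Hypothetical helper: an explicit deep copy, replacing the implicit one.
inline Person ClonePerson(const Person& p) {
  return MakePerson(p.name ? *p.name : std::string(), p.age);
}
```

Existing C++ callers that read `person.name` also need to dereference it (`*person.name`), which is one more reason to keep this pattern close to the interop boundary.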
Raw pointers
BEST PRACTICE: This should only be used in codebases that do not use a
Rust-movable unique_ptr or unique_ptr equivalent. Consider wrapping this in
an ABSL_ATTRIBUTE_TRIVIAL_ABI type which resembles unique_ptr, instead.
When not using libc++'s unstable ABI, the most straightforward way to make a
field Rust-movable is to instead use a raw pointer, and delete it in the
destructor (as if it were held by a unique_ptr).
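// Before: not Rust-movable, because std::string is held by value.
struct Person {
  std::string name;
  int age;
};
// After: the string is held by an owned raw pointer.
// A real type would also define or delete its copy and move operations, so
// that two objects never delete the same string.
struct ABSL_ATTRIBUTE_TRIVIAL_ABI Person {
  // Owned, boxed to make Person Rust-movable: <internal link>/cpp/cookbook#boxing
  std::string* name;
  int age;
  ~Person() {
    delete name;
  }
};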
(Note the use of ABSL_ATTRIBUTE_TRIVIAL_ABI: because we added a destructor, we
also need to add ABSL_ATTRIBUTE_TRIVIAL_ABI to indicate that the destructor
does not care about the address of Person.)
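The best-practice note above suggests wrapping the raw pointer in a type which resembles unique_ptr. The following is a rough sketch of what that could look like, not an established API: `Box` and `SKETCH_TRIVIAL_ABI` are hypothetical names. The macro is a stand-in for ABSL_ATTRIBUTE_TRIVIAL_ABI so that the sketch compiles without Abseil; it uses Clang's `[[clang::trivial_abi]]` attribute where available, and expands to nothing elsewhere.

```cpp
#include <string>
#include <utility>

// Stand-in for ABSL_ATTRIBUTE_TRIVIAL_ABI (assumption: Clang attribute only).
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(clang::trivial_abi)
#define SKETCH_TRIVIAL_ABI [[clang::trivial_abi]]
#endif
#endif
#ifndef SKETCH_TRIVIAL_ABI
#define SKETCH_TRIVIAL_ABI
#endif

// Hypothetical owning pointer that resembles unique_ptr.
template <typename T>
class SKETCH_TRIVIAL_ABI Box {
 public:
  explicit Box(T value) : ptr_(new T(std::move(value))) {}
  Box(Box&& other) noexcept : ptr_(std::exchange(other.ptr_, nullptr)) {}
  Box& operator=(Box&& other) noexcept {
    if (this != &other) {
      delete ptr_;
      ptr_ = std::exchange(other.ptr_, nullptr);
    }
    return *this;
  }
  Box(const Box&) = delete;
  Box& operator=(const Box&) = delete;
  ~Box() { delete ptr_; }

  T& operator*() const { return *ptr_; }
  T* operator->() const { return ptr_; }

 private:
  T* ptr_;  // owned; only the pointer value matters, not where the Box lives
};

struct Person {
  Box<std::string> name;
  int age;
};
```

With this shape, ownership and the copy/move rules live in one reviewed type, and `Person` itself needs no destructor of its own.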
Renaming functions for Rust
Overloaded functions cannot be called from Rust (yet: b/213280424). To make them available anyway, you can define new non-overloaded functions with different names:
// Before: overloads, which cannot be called from Rust.
void Foo(int x);
void Foo(float x);
// After: the overloads, plus uniquely named wrappers.
void Foo(int x);
void Foo(float x);
// For Rust callers: <internal link>/cpp/cookbook#renaming
inline void FooInt(int x) { return Foo(x); }
// For Rust callers: <internal link>/cpp/cookbook#renaming
inline void FooFloat(float x) { return Foo(x); }
Working around blocking bugs in Crubit
Crubit is still in development, and has bugs which can completely stop your work if Crubit is in the critical path. These can take the form of parsing errors or crashes when Crubit runs, or else generated bindings which do not compile.
The following workarounds can help get you moving again:
Disable Crubit on a declaration
If a declaration causes hard failures within Crubit, that declaration alone can be disabled using the CRUBIT_DO_NOT_BIND attribute macro, defined in support/annotations.h. This must be paired with an additional entry in rs_bindings_from_cc/bazel_support/generate_bindings.bzl, recording the name of the item.
To mail the CL performing this change, add AUTO_MANAGE=testing:TGP to the CL description.
NOTE: By disabling Crubit on this declaration, items which depend on it may also, in turn, not receive bindings. For example, if it declares a type, then functions which accept or return that type will also not receive bindings.
Disable Crubit on a header
If an entire header is giving problems (e.g. is unparseable), then it can be
removed from consideration by Crubit. Once disabled, Crubit will avoid reading
the header directly, although it is still included via #include preprocessor
directives.
Add the target name and header name to public_headers_to_remove in
rs_bindings_from_cc/bazel_support/rust_bindings_from_cc_aspect.bzl.
See the example in
rs_bindings_from_cc/test/disable/disable_header/.
To mail the CL performing this change, add AUTO_MANAGE=testing:TGP to the CL description.
NOTE: By disabling Crubit on this header, items which are defined in that header will not receive bindings. For example, this means that functions which use types defined in that header will also not get bindings, even if the function was defined in a header that was not disabled.
When possible, it's preferable to use a smaller fix. For example, if the same header is owned by two targets, it's preferable to move the header into a third target, depended on by both. That way, functions which use types defined in that header will still get bindings, in both targets.
Best Practices Writing Rust Bindings for Existing C++ Libraries
Introduction
This document is an attempt at guidance for how Rust changes can be made to existing C++ libraries, including core foundational libraries.
For an introduction, see Rust Bindings for C++ Libraries.
Code Organization
For technical reasons, it is generally necessary for the C++ library and its Rust bindings to be the same Bazel target. It is not possible to define the Rust bindings for a target as a completely separate and independent target. The automatically generated bindings, and their configuration, must be on and in the C++ target itself.
The reasons why are fairly technical, and you can stop reading here if you're OK with this.
Technical Justification
Crubit generates bindings using Bazel aspects: given an arbitrary C++ Bazel target, Crubit generates, in an aspect, the Rust library which wraps it. To users it appears as if the Bazel target was both a C++ and a Rust library.
This is necessary for the same reason that it's necessary for protocol buffers.
And, just like protocol buffers, this means that we don't have a rust_library
target where we could customize its behavior using Bazel attributes.
Specifically, we cannot use a regular Bazel rule for bindings generation because the rule cannot generate bindings for transitive dependencies: if A depends on B, then bindings(A) depends on bindings(B), so that bindings(A) can wrap functions in A that return types from B, and so on. (See FAQ: Why can't we use separate rules?)
Because bindings are generated in an aspect, and not a rule, there are only two places to configure the bindings of a target A:
- In the source code of the target receiving Rust support, using configuration pragmas or attributes. (This is similar to protocol buffers.)
- In the BUILD file, on the target receiving Rust support, via aspect_hints. Aspect hints are a storage location for configuration data, readable by the aspect, placed directly on the target that the aspect runs on.
Generally speaking, it's better to modify the source code than to configure externally via aspect hints. However, some source code annotations are nonstandard and can have performance implications (see b/321933939). In addition to this, source code is not readable from the build system itself, and so where configuring a target requires customizing the build graph, that configuration must go in aspect hints.
For these reasons, currently most publicly available methods of customizing bindings occur in aspect hints.
In any case, any configuration or support for Rust is done directly to the target.
Example
To enable Crubit on a C++ target, one actually modifies the target itself,
adding aspect_hints = ["//features:supported"]. This must
be an aspect hint, not a source code annotation, for all of the above reasons:
- It makes the build faster and more resilient: when Crubit is disabled on a target, Bazel needs to know so it can completely avoid running Crubit on it.
- There is no stable, reliable, and style-approved header-wide pragma we can use for enabling/disabling Crubit, but aspect_hints does work.
FAQ: Why can't we use separate rules?
A library A, and its bindings bindings(A), must be linked together in the
build graph: if B uses a type from A, then bindings(B) uses a type from
bindings(A).
Crucially, this also goes in reverse: if a Rust library C uses a type from
bindings(A), then reverse_bindings(C) uses a type from A. This forms a
natural dependency cycle: the build graph must understand both the link from A
to bindings(A), and the link from bindings(A) to A.
Crubit resolves this by making A and bindings(A) the same target in the
build graph: bindings for a target are obtained by reading an aspect on the
target.
It is not possible to make A one build target, and bindings(A) a separate
build target, call it X:
- We cannot literally configure on A that its bindings are in a different target X, because this ends up producing a real dependency cycle, as mentioned above: if bindings(A)=X, then reverse_bindings(X)=A.
- We cannot avoid the cycle by creating the dependency "lazily", or "dynamically" based on e.g. a naming scheme during Bazel analysis. Bazel dependencies cannot be discovered dynamically; once Bazel reaches this point of evaluation, dependencies need to be fully resolved: labels in deps are no longer strings in this stage, they are edges in a dependency graph. That graph must not have cycles.
- In some limited cases, we can hardcode the relationship within Crubit: Crubit is actually two aspects, each of which handles a single direction of interop. So Crubit can hardcode inside of itself that bindings(A)=X, and in the other half, that reverse_bindings(X)=A. This requires that Crubit itself depends on A and X. Therefore, to avoid another dependency cycle, neither A nor X can depend on or use Crubit in their transitive dependencies. This is not feasible except in very isolated cases. Currently, we only do this for the Rust and C++ standard libraries.
To compare with another similar technology, PyCLIF avoids this problem because it only supports "one-directional" interop, and so it doesn't need to avoid dependency cycles. Crubit is bidirectional, and this comes with some technical restrictions.
FAQ: Why are there extra dependencies in deps(target)?
Because the Rust bindings are created using an aspect on the C++ target,
everything that the Rust bindings need to depend on will appear in a Bazel query
/ depserver query for deps(target).
For example, if you wanted to add some extra source file to the Rust bindings,
you might specify them in aspect_hints. This file will show up in
deps(target).
These Rust-only deps are not used at all in pure-C++ builds (the Bazel actions registered by them won't be executed), but they will show up in the dependency graph anyway, due to how Bazel query and depserver track dependencies.
NOTE: In particular, if your project has tests that count/limit the transitive dependencies of a C++ binary, they will overcount the dependencies, and the overcounting will get worse as Rust support is rolled out through the C++ build graph.
Wrapping and type bridging vs direct use of types
Crubit automatically generates layout-compatible Rust equivalents of C++ types.
When the C++ type is Rust-movable, the Crubit-generated Rust type is
Rust-movable too, and both can be used by value, by pointer, in struct fields,
arrays, and any other compound data type. A C++
pointer const T* can become a Rust *const T, and a C++ T field can become
a Rust T field, and so on, with few restrictions.
For example, the following C++ type:
struct Vec2d {
float x;
float y;
};
Becomes (roughly) the following Rust type:
#[repr(C)]
struct Vec2d {
    pub x: f32,
    pub y: f32,
}
These have an identical layout, and so a C++ pointer or field containing a C++
Vec2d is exactly equivalent to a Rust pointer or field containing a Rust
Vec2d.
(See Types for more information about layout-compatibility.)
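The C++ half of this layout claim can be checked at compile time. The sketch below is not something Crubit generates; it only lists properties of the C++ `Vec2d` that a `#[repr(C)]` Rust struct with two `f32` fields would have to share, and assumes a typical platform where `float` has no padding requirements beyond its own alignment.

```cpp
#include <cstddef>
#include <type_traits>

struct Vec2d {
  float x;
  float y;
};

static_assert(std::is_standard_layout<Vec2d>::value,
              "fields are laid out in declaration order");
static_assert(std::is_trivially_copyable<Vec2d>::value,
              "the bytes can be relocated freely");
static_assert(sizeof(Vec2d) == 2 * sizeof(float),
              "no padding between or after the fields");
static_assert(alignof(Vec2d) == alignof(float),
              "aligned like its fields");
static_assert(offsetof(Vec2d, x) == 0, "x comes first");
static_assert(offsetof(Vec2d, y) == sizeof(float), "y follows x directly");
```

If one of these assertions failed for a type, a pointer to the C++ object could not be reinterpreted as a pointer to a hand-written Rust struct with the same fields.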
Because of this, it is often not required to manually write any new types. The bindings generated by Crubit will produce a working type automatically.
When to wrap a type
There are, still, a handful of reasons to manually write "wrapper" types which encapsulate or replace the original C++ type (or its Crubit-generated Rust type).
- If the type is not naturally Rust-movable, but it's important for the Rust type to be Rust-movable. It may be possible to make changes to the C++ code to make the type Rust-movable using some of the strategies described in the cookbook. This allows the greatest flexibility, as the type becomes usable in almost every context. But if that is not possible, writing a new "wrapper" type can keep Rust programmers productive.
- Some Rust types have very special semantics, which are impossible to implement in the bindings for a C++ type. For example, Rust has special support for Result and Option in error handling via the ? operator, which cannot yet be implemented by Status or std::optional using stable Rust features. These privileged Rust types can be used instead of the equivalent C++ types, as a wrapper type.
In these cases, we may bridge to a wrapper type as a workaround, while we hopefully fix the underlying issues that mean we cannot directly use the underlying type. This offers us a subset of the API we want, and allows continued progress.
Why not to wrap a type
Wrapper types work best when passed by value: if you return a T in C++, the
corresponding Rust function can automatically convert it to and return a
WrappedT.
However, no conversion is possible for references or fields: they really are
the original type, with its size, alignment, and address in memory. Making
this work transparently requires an ever-expanding network of wrapper types, one
for every compound data type that might contain T:
- T must become WrappedT.
- const T&, if it is supported at all, must become something like TRef<'a>, or a dynamically sized &TView.
- std::vector<T>, if it is supported at all, must become something like TVector.
- struct MyStruct {T x;} must become a wrapped WrappedMyStruct.
- ...
The problems introduced by wrapper types can easily outweigh the benefits that they bring. Crubit aims to reduce their necessity to zero over time.
Bad reasons to wrap a type
In most other circumstances where one might want to reach for wrapper types, alternatives exist:
- If we want to use a wrapper type in order to give the type a nicer Rust API, then, as an alternative, one can customize the Rust API of the wrapped type using an aspect hint. You can define new methods and trait implementations to the side, without altering any C++ code.
- If we want to use a wrapper type in order to change the type invariants – to make them stricter or looser – this is fine, as long as it doesn't replace the not-as-nice type. For example, if a C++ API returns std::string (bytes, "probably" UTF-8), the Rust equivalent should not return a Rust String (Unicode, definitely UTF-8). Changing type invariants in-place causes some APIs to become impossible to call, and causes the Rust and C++ ecosystems to diverge and become incompatible. The bindings should be high fidelity. Wrapper types of this form should be optional, and available equally to both C++ and Rust to avoid fragmenting the ecosystem.
Fidelity
Anything possible in C++ should be possible in Rust. See
The Rust API for a given C++ API should not try to make the interface "better" at more than a superficial level, because it can compromise the ability of other teams to write new Rust code, or port existing C++ code to Rust.
Good changes:
- Changing method names, especially to names that Rust callers might expect. For example, changing Status::ok() (C++) to Status::is_ok() (Rust) – Rust callers expect many of these boolean functions to be prefixed with is_.
- Adding new APIs that Rust users expect. For example, trait implementations that allow the type to better interoperate with the Rust ecosystem, or functions which accept a Path or &str in addition to a raw C++ string_view.
- Reifying C++ comments around lifetime or safety as actual lifetime annotations or unsafe declarations.
If the Rust type is outright unnatural to use, people won't use it, and it's worse for the ecosystem to have two APIs than one API.
Bad changes:
- Removing deprecated APIs which still have C++ callers.
- Placing new requirements on Rust callers that were not placed on C++ callers, such as requiring UTF-8 when C++ does not.
Customizing bindings using annotations
The Rust bindings for a C++ declaration can be customized using an attribute
macro from <crubit/support/annotations.h>.
For instance:
- A function can be marked unsafe in Rust, even if Crubit would otherwise assume it was safe, using CRUBIT_UNSAFE.
- Missing bindings for an item can be treated as an error, instead of ignored, using CRUBIT_MUST_BIND.
- An item can be given a different name in Rust using CRUBIT_RUST_NAME("rust_name_here").
More information:
- Dependency: //support:annotations
- Include: #include <crubit/support/annotations.h>
- Full API documentation: support/annotations.h
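As a rough illustration of how the three macros listed above might appear in a header, consider the sketch below. The function names are hypothetical, and the placement of each macro before the declaration is an assumption; consult support/annotations.h for the authoritative usage. The `#ifndef` fallbacks are no-op stand-ins so that the sketch compiles without Crubit; in a real build they would be replaced by the include shown above.

```cpp
// Assumption: in a Crubit-enabled build, this block would instead be
//   #include <crubit/support/annotations.h>
#ifndef CRUBIT_UNSAFE
#define CRUBIT_UNSAFE
#endif
#ifndef CRUBIT_MUST_BIND
#define CRUBIT_MUST_BIND
#endif
#ifndef CRUBIT_RUST_NAME
#define CRUBIT_RUST_NAME(name)
#endif

// Hypothetical: the signature looks safe, but the caller must uphold a
// documented precondition, so Rust callers should need an `unsafe` block.
CRUBIT_UNSAFE inline int DivideUnchecked(int numerator, int denominator) {
  return numerator / denominator;  // precondition: denominator != 0
}

// Hypothetical: fail loudly if bindings for this function cannot be generated.
CRUBIT_MUST_BIND inline int AddTwo(int x) { return x + 2; }

// Hypothetical: expose a snake_case name to Rust callers.
CRUBIT_RUST_NAME("clamp_to_byte")
inline int ClampToByte(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }
```

None of the annotations change how the functions behave for C++ callers; they only affect the generated Rust bindings.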
Example
Given the following C++ header:
cs/file:examples/cpp/unsafe_attributes/example.h symbol:SafeSignatureButAnnotatedUnsafe
Crubit will generate the following bindings:
cs/file:examples/cpp/unsafe_attributes/example_generated.rs symbol:SafeSignatureButAnnotatedUnsafe
Rust bindings for C++ functions
Rust code can call (non-member) functions defined in C++, provided that the parameter and return types are supported by Crubit:
- If a parameter or return type is a primitive type, then the bindings for the function use the corresponding Rust type.
- Similarly, if a parameter or return type is a pointer type, then the bindings for the function use the corresponding Rust pointer type.
- If the type is a user-defined type, such as a class type or enum, then the bindings for the function use the bindings for that type.
Additionally, Rust code can call member functions defined in C++ if the parameter and return types are supported by Crubit (see above). Currently, member functions are translated as non-method associated functions.
Examples
Functions
Given the following C++ header:
cs/file:examples/cpp/function/example.h function:add_two_integers
Crubit will generate the following bindings, with a safe public function that calls into the corresponding FFI glue:
cs/file:examples/cpp/function/example_generated.rs function:add_two_integers
Methods
Given the following C++ header:
cs/file:examples/cpp/method/example.h class:Bar
Crubit will generate the following bindings:
cs/file:examples/cpp/method/example_generated.rs class:Bar
cs/file:examples/cpp/method/example_generated.rs snippet:0,6 "impl Bar"
unsafe functions
Which C++ functions are marked unsafe in Rust?
By default, the Rust binding to a C++ function is marked as safe or unsafe
based on the types of its parameters. If a C++ function accepts only simple
types like integers, the resulting Rust binding will be marked as safe.
Functions which accept a raw pointer are automatically marked as unsafe.
This behavior can be overridden using the CRUBIT_UNSAFE,
CRUBIT_UNSAFE_MARK_SAFE and CRUBIT_OVERRIDE_UNSAFE(is_unsafe) macros.
For example, given the following C++ header:
cs/file:examples/cpp/unsafe_attributes/example.h content:^([^/#\n])[^\n]*
Crubit will generate the following bindings:
cs/file:examples/cpp/unsafe_attributes/example_generated.rs content:^([^/\n])([^!\n]|$)[^\n]*
Correct usage of unsafe
Functions marked unsafe cannot be called outside of an unsafe block. In
order to avoid undefined behavior when using unsafe, callers must:
- Ensure that the pointer being passed to C++ is a valid C++ pointer. In particular, it must not be dangling (e.g. NonNull::dangling()).
- Ensure that the safety conditions documented in C++ are upheld. For example, if the C++ function accepts a reference or non-null pointer, then do not pass in 0 as *const _.
Soundness
Note that many "safe" C++ functions may still trigger undefined behavior if used
incorrectly. Regardless of whether a C++ function is marked as unsafe, calls
into C++ will only be memory-safe if the caller verifies that all function
preconditions are met.
Function Attributes
Function attributes are not currently supported. Functions marked
[[noreturn]], [[nodiscard]], etc. do not have bindings.
Rust bindings for C++ classes and structs
A C++ class or struct is mapped to a Rust struct with the same fields. If
any subobject of the class cannot be represented in Rust, the class itself will
still have bindings, but
the relevant subobject will be private.
To have bindings, the class must be "Rust-movable". For example, any trivial or "POD" class is Rust-movable.
Example
Given the following C++ header:
cs/file:examples/cpp/trivial_struct/example.h class:Position
Crubit will generate a struct with the same layout:
cs/file:examples/cpp/trivial_struct/example_generated.rs class:Position
For an example of a Rust-movable class with a destructor, see examples/cpp/trivial_abi_struct/.
Fields
The fields on the Rust struct type are the corresponding Rust types:
- If the C++ field has primitive type, then the Rust field uses the corresponding Rust type.
- Similarly, if the C++ field has pointer type, then the Rust field has the corresponding Rust pointer type.
- If the field has a user-defined type, such as a class type or enum, then the Rust field uses the bindings for that type.
Unsupported fields
Subobjects that do not receive bindings are made private, and replaced with an
opaque blob of [MaybeUninit<u8>; N], as well as a comment in the generated
source code explaining why the subobject could not receive bindings. For
example, since inheritance is not supported, the space of the object occupied by
a base class will instead be this opaque blob of bytes.
Specifically, the following subobjects are hidden and replaced with opaque blobs:
- Base class subobjects
- Non-public fields (private or protected fields)
- Fields that have nontrivial destructors
- Fields whose type does not have bindings
- Fields that have any unrecognized attribute, including no_unique_address
A Rust struct with opaque blobs is ABI-incompatible with the C++ struct or class that it corresponds to. As a consequence, if the struct is used for FFI outside of Crubit, it should not be passed by value. Within Crubit, it can't be passed by value in function pointers, but can otherwise be used as normal.
Rust-movable classes
For a type to be passed or returned by value in Rust, it must be "Rust-movable":
the class must be able to be "teleported" in memory during its lifetime, as if
by using memcpy and then discarding the old location without running any
destruction logic. This means that it can be present in Rust using normal
objects and pointers and references, without using Pin.
For example, a string_view is Rust-movable. In fact, every trivially copyable
type is Rust-movable.
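Trivial copyability is something the compiler can report, so it makes a quick first check. The sketch below uses hypothetical types `Point` and `Named` and a hypothetical helper `RelocateBytes`. Note that the check is sufficient but not necessary: a type can fail it and still be Rust-movable, as discussed for unique_ptr later in this section.

```cpp
#include <cstring>
#include <memory>
#include <string>
#include <type_traits>

struct Point { int x; int y; };
struct Named { std::string name; };

static_assert(std::is_trivially_copyable<Point>::value,
              "plain data: Rust-movable");
static_assert(!std::is_trivially_copyable<std::string>::value,
              "std::string has nontrivial copy/move/destroy operations");
static_assert(!std::is_trivially_copyable<Named>::value,
              "a std::string field makes the parent nontrivial too");
static_assert(!std::is_trivially_copyable<std::unique_ptr<int>>::value,
              "not trivially copyable, yet Rust-movable under some ABIs");

// Hypothetical helper: "teleports" a value by copying its bytes, which is
// only well-defined C++ for trivially copyable types.
template <typename T>
T RelocateBytes(const T& from) {
  static_assert(std::is_trivially_copyable<T>::value,
                "bytewise relocation requires a trivially copyable type");
  T to{};
  std::memcpy(&to, &from, sizeof(T));
  return to;
}
```

Calling `RelocateBytes` on a `Named` fails to compile, which mirrors the restriction Crubit places on passing such types by value.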
However, unlike Rust, many types in C++ are not Rust-movable. For example, a
std::string might be implemented using the "short string optimization", in a
fashion similar to this:
class String {
  union {
    size_t length;
    char inline_data[sizeof(length)];
  };
  char* data;  // either points to `inline_data`, or the heap.

 public:
  size_t size() {
    if (data == (char*)this) {
      return strlen(data);
    } else {
      return length;
    }
  }
  // ...
};
This class is self-referential: the data pointer may point to inline_data,
which is inside the object itself. If we bitwise copy the object to a new
location, as in a "Rust move" or as with memcpy, then the data pointer will
remain bitwise identical, and point into the old object. It becomes a
dangling pointer!
C++ allows self-referential types. In C++, fields can and often do point at
other fields, because assignment is overloadable: the assignment operator can be
modified to, when copying or moving the string, also "fix up" the data pointer
so that it points to the new location in the new object, instead of dangling.
Rust does not do this. In Rust, assignment is always a "trivial relocation" --
assignment runs no code when copying or moving an object, and copies the bytes
as they are. This would break on the String type defined above, or any other
self-referential type.
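The effect described above can be reproduced in Rust itself. The following is a minimal sketch, not Crubit output: SelfRef and check_pointer_across_move are hypothetical names, and the raw pointer is only compared, never dereferenced. It shows that a Rust move copies the bytes and leaves a self-pointer aimed at the old location:

```rust
/// A deliberately self-referential type (hypothetical, for illustration):
/// `data` is meant to point at `inline_data` inside the same object.
pub struct SelfRef {
    inline_data: [u8; 8],
    data: *const u8,
}

impl SelfRef {
    /// Returns true if `data` still points at this object's own buffer.
    /// The pointer is only compared, never dereferenced.
    pub fn points_into_self(&self) -> bool {
        self.data == self.inline_data.as_ptr()
    }
}

/// Builds a value whose self-pointer is correct, then moves it to the heap.
/// Returns (pointer valid before the move, pointer valid after the move).
pub fn check_pointer_across_move() -> (bool, bool) {
    let mut original = SelfRef {
        inline_data: *b"crubit\0\0",
        data: std::ptr::null(),
    };
    original.data = original.inline_data.as_ptr();
    let before = original.points_into_self();

    // A Rust move is a bitwise copy: no user code runs to fix up `data`,
    // so it keeps pointing at the old stack location.
    let moved = Box::new(original);
    let after = moved.points_into_self();
    (before, after)
}
```

A C++ move constructor could repair the pointer at this step; Rust offers no such hook, which is why types like this cannot be treated as Rust-movable.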
Unfortunately, any class with a user-defined copy/move operation or destructor
might be self-referential, and so by default they are not Rust-movable. If a
class has a user-defined destructor or copy/move constructor/assignment
operator, and "should be" Rust-movable, it must explicitly declare that it is
safe to perform a Rust move, using the attribute
ABSL_ATTRIBUTE_TRIVIAL_ABI.
This attribute allows a class to be trivially relocated, even though it defines
an operation that would ordinarily disable trivial relocation.
For example, in the unstable libc++ ABI we use within Google, a unique_ptr<T>
is Rust-movable, because it applies ABSL_ATTRIBUTE_TRIVIAL_ABI. This is safe
to do, for unique_ptr, because its exact location in memory does not matter,
and paired move/destroy operations can be replaced with Rust move operations.
Requirements
The exact requirements for a class to be Rust-movable are subject to change, because they are still being defined within Clang and within the C++ standard. But at the least:
- Any trivially copyable type is also Rust-movable.
- Any class or struct type with only Rust-movable fields and base classes is Rust-movable, unless:
  - it is not ABSL_ATTRIBUTE_TRIVIAL_ABI and defines a copy/move constructor, copy/move assignment operator, or destructor, or,
  - it is otherwise nontrivial, e.g., from defining a virtual member function.
Some examples of Rust-movable types:
- any primitive type (integers, character types, floats, etc.)
- raw pointers
- string_view
- struct tm, or any other type in the C standard library
- unique_ptr, in the Clang unstable ABI.
- absl::Status
Some examples of types that are not Rust-movable:
- (For now) std::string, std::vector, and other nontrivial standard library types.
- (For now) absl::flat_hash_map, absl::AnyInvocable, and other nontrivial types used throughout the C++ ecosystem, even outside the standard library.
- absl::Mutex, absl::Notification, and other non-movable types.
Attributes
Crubit does not support most attributes on structs and their fields. If a struct
is marked using any attribute other than alignment or
ABSL_ATTRIBUTE_TRIVIAL_ABI, it will not receive bindings. If a field is marked
using any other attribute, it will be replaced with a private opaque blob.
Rust bindings for C++ enums
A C++ enum is mapped to a Rust struct with a similar API to a Rust enum.
- The enumerated constants are present as associated constants: MyEnum::kFoo in C++ is MyEnum::kFoo in Rust.
- The enum can be converted to and from its underlying type using From and Into. For example, static_cast<int32_t>(x) is i32::from(x) in Rust, and vice versa static_cast<MyEnum>(x) is MyEnum::from(x).
However, a C++ enum is not a Rust enum. Some features of Rust enums are not supported:
- C++ enums must be converted using From and Into, not as.
- C++ enums do not have exhaustive pattern matching.
Example
Given the following C++ header:
cs/file:examples/cpp/enum/example.h class:Color
Crubit will generate the following bindings:
cs/file:examples/cpp/enum/example_generated.rs class:Color
Why isn't it an enum?
A C++ enum cannot be translated directly to a Rust enum, because C++ enums
are "representationally non-exhaustive": a C++ enum can have any value
supported by the underlying type, even one not listed in the enumerators. For
example, in the enum above, static_cast<Color>(42) is a valid instance of
Color, even though none of kRed, kBlue, or kGreen have that value.
Rust enums, in contrast, are representationally exhaustive. An enum declares a
closed set of valid discriminants, and it is undefined behavior to
attempt to create an enum with a value outside of that set, whether it's via
transmute, a raw pointer cast, or Crubit. The behavior is undefined the moment
the invalid value is created, even if it is never used.
Since a value like static_cast<Color>(42) is not in the list of enumerators, a
Rust enum cannot be used to represent an arbitrary C++ enum. Instead, the
Rust bindings are a struct. This struct is given the most natural and
enum-like API possible, though there are still gaps. (Casts using as, for
example, will not work with a C++ enum.)
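To make the shape of such a struct concrete, here is a hand-written sketch of an enum-like Rust struct. It is not the actual generated code: the underlying type i32, the enumerator values 0, 1, 2, and the describe helper are all assumptions made for illustration.

```rust
/// Hand-written stand-in for bindings of a C++ `enum Color` (hypothetical;
/// the underlying type and enumerator values are assumed).
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Color(i32);

#[allow(non_upper_case_globals)]
impl Color {
    // Enumerators become associated constants with their C++ names.
    pub const kRed: Color = Color(0);
    pub const kBlue: Color = Color(1);
    pub const kGreen: Color = Color(2);
}

// Conversions use `From`/`Into` rather than `as`.
impl From<i32> for Color {
    fn from(value: i32) -> Self {
        Color(value)
    }
}

impl From<Color> for i32 {
    fn from(color: Color) -> Self {
        color.0
    }
}

/// Matching works, but is never exhaustive: a wildcard arm is required
/// because any `i32` is a valid `Color`.
pub fn describe(color: Color) -> &'static str {
    match color {
        Color::kRed => "red",
        Color::kBlue => "blue",
        Color::kGreen => "green",
        _ => "unknown",
    }
}
```

Because every bit pattern of the underlying integer is a valid value of the struct, the equivalent of static_cast<Color>(42) is simply Color::from(42), with no undefined behavior.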
What about #[non_exhaustive]?
The #[non_exhaustive] attribute on an enum communicates to external
crates that more variants may be added in the future, and so a match requires
a wildcard branch. Within the defining crate, non_exhaustive has no effect. It
remains undefined behavior to transmute from integers not declared by the
enum.
C++ bindings for Rust libraries
Rust libraries can be used directly from C++. This page documents roughly what that entails, and additional subpages (available in the left-hand navigation) document specific aspects of the generated bindings.
Tip: The code examples below are pulled straight from examples/rust/function/. The other examples in examples/rust/ are also useful. If you prefer just copy-pasting something, start there.
How to use Crubit
Crubit allows you to call some Rust interfaces from C++. It supports
functions (including methods), structs, and even
enums as "opaque" objects. Crubit does not support advanced
features like generics or dynamic dispatch with dyn.
The rest of this document goes over how to create a Rust library that can be called from C++, and how to actually use it from C++. The quick summary is:
- All rust_library targets can receive C++ bindings.
- To use the bindings for a target //path/to:example_crate, you must create a C++ rule exporting the bindings, using cc_bindings_from_rust(name="any_name_here", crate=":example_crate").
- The header name is the Rust target's label with a .h appended: to include the header for the Rust library //path/to:example_crate, you use #include "path/to/example_crate.h".
- The namespace name is the Rust target name, e.g. example_crate. To change the namespace, use cc_bindings_from_rust_library_config, described below.
- To see the generated C++ API, right click the "path/to/example_crate.h" include in Cider, and select "Go to Definition".
  NOTE: In some cases the generated file in Cider may be out of date. If it isn't refreshing, you can manually inspect the bindings using the workaround command in b/391395849.
Write a rust_library target
The first part of creating a library that can be used by Crubit is to write a
rust_library target. For example:
cs/file:examples/rust/function/example.rs content:^[^/].*
In the BUILD file, in addition to defining the rust_library, you should also
define the cc_bindings_from_rust target to make it easier to use from C++:
cs/file:examples/rust/function/BUILD symbol:example_crate|example_crate_cc_api
Example: If your Rust library is named //path/to:example_crate, then the C++
header file is "path/to/example_crate.h", and the C++ namespace is
example_crate by default.
Use a Rust library from C++
C++ build rules do not have a rust_deps parameter, so to depend on the C++
bindings for a target, they must depend on the cc_bindings_from_rust rule.
For example:
cs/file:examples/rust/function/BUILD symbol:main
cs/file:examples/rust/function/main.cc content:^[^/\n].*
NOTE: Other than for declaring the dependency, all other information about the
generated bindings comes from the actual rust_library rule. For example, the
#include for the above is #include "examples/rust/function/example_crate.h", not
example_crate_cc_api.h.
(Optional) Customize the generated C++ API
Give it a better namespace
The crate name might make a poor namespace. In addition, typically, multiple C++
headers and build targets share the same namespace. To customize the namespace
name, use cc_bindings_from_rust_library_config:
cs/file:examples/rust/library_config/BUILD symbol:custom_namespace|example_crate
Now, instead of the crate name, the generated bindings will use the namespace name you provided:
cs/file:examples/rust/library_config/main.cc content:^[^/\n].*
Look at the generated bindings
There are two ways to look at the generated header file:
- Click through the #include in Cider. Given the following C++ code:
  #include "path/to/example_crate.h"
  If you right click the file path, and select "Go to Definition", you will be taken to a file starting with // Automatically @generated C++ bindings.
- Run bazel build //path/to:example_crate --config=crubit-genfiles, and open bazel-bin/path/to/example_crate.h in your text editor of choice.
Common Errors
Unsupported features
Some features are either unsupported, or else only supported with experimental
feature flags (
For a particularly notable example, references are only supported as function parameters, and only in a subset of cases that we can prove does not add aliasing UB to C++ callers.
The way to work around this kind of problem, in all cases, is to wrap or hide the problematic interface behind an interface Crubit can handle:
- Use raw pointers instead of references, if this use of references falls into a case Crubit does not support.
- Hide unsupported types behind a wrapper type. For example, a Vec<T> is not supported by Crubit, but pub struct MyStruct(Vec<i32>); is.
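As a sketch of the wrapper approach, the unsupported Vec<i32> can be kept as a private implementation detail behind methods that use only simple types. IntList and its methods are hypothetical names; which of these methods actually receive bindings depends on the Crubit features enabled for the target.

```rust
/// Hypothetical wrapper type: the `Vec<i32>` never appears in the public
/// API, so callers only see `IntList` and plain integer types.
pub struct IntList(Vec<i32>);

impl IntList {
    /// Creates an empty list.
    pub fn new() -> Self {
        IntList(Vec::new())
    }

    /// Appends one value.
    pub fn push(&mut self, value: i32) {
        self.0.push(value);
    }

    /// Number of stored values.
    pub fn count(&self) -> usize {
        self.0.len()
    }

    /// Sum of all stored values, widened to avoid overflow.
    pub fn sum(&self) -> i64 {
        self.0.iter().map(|&v| i64::from(v)).sum()
    }
}
```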
C++ bindings for Rust functions
C++ code can call functions defined in Rust, provided that the parameter and return types are supported by Crubit:
- If a parameter or return type is a fundamental type, then the bindings for the function use the corresponding C++ type.
- Similarly, if a parameter or return type is a pointer type, then the bindings for the function use the corresponding C++ pointer type.
- If the type is a user-defined type, such as a struct or enum, then the bindings for the function use the bindings for that type.
As a special case, functions also support reference parameters to supported types, with some restrictions to ensure safety. See References.
Example
Given the following Rust crate:
cs/file:examples/rust/function/example.rs function:add_two_integers
Crubit will generate the following function declaration, which calls into accompanying glue code:
cs/file:examples/rust/function/example_generated.h function:add_two_integers
unsafe functions
C++ does not have an unsafe marker at this time. In the future, Crubit may
introduce a way to mark unsafe functions to help increase the reliability of
C++ callers.
References
In general, Rust references are not exposed to C++. However, some Rust functions which accept reference parameters do get mapped to C++ functions accepting C++ references:
- All references must have an unbound parameter lifetime (not 'static, for example).
- Only the parameter itself can be a reference type. References to references, vectors of references, etc. are still unsupported.
- If there is a mut reference parameter, it is the only reference parameter.
This set of rules is intended to describe a safe subset of Rust functions, which do not introduce substantial aliasing risk to a mixed C++/Rust codebase.
For example, the following Rust functions will receive C++ bindings, and can be called from C++:
fn foo(&self) {}
fn foo(_: &mut i32) {}
fn foo(_: &i32, _: &i32) {}
However, none of the below will receive bindings:
fn foo(_: &'static i32) {}  // 'static lifetime is bound
fn foo(_: &&i32) {}  // Reference in non-parameter type
fn foo(_: &mut i32, _: &i32) {}  // More than one reference, one of which is mut
fn foo(_: &'a i32) {}  // 'a is not a lifetime parameter of `foo`
Returned references are still not supported, and references which are bound to
some lifetime (e.g. 'static) are also still not supported.
If you wish to accept more than one reference/pointer in C++, a raw pointer
(*const T, *mut T) can be used instead. However, all of the usual unsafe
caveats apply.
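For instance, a function that would otherwise take a mut reference together with a second reference can be written with raw pointers. This is a minimal sketch; add_into is a hypothetical name, and the safety contract is one the caller must uphold by hand.

```rust
/// Adds `*src` into `*dst` (hypothetical example function).
///
/// # Safety
/// Both pointers must be non-null, properly aligned, and valid for the
/// duration of the call; `dst` must be valid for writes.
pub unsafe fn add_into(dst: *mut i32, src: *const i32) {
    // The compiler no longer checks aliasing or validity here:
    // that responsibility moves to the caller.
    unsafe {
        *dst += *src;
    }
}
```

A Rust caller would write unsafe { add_into(&mut total, &delta) }; a C++ caller would pass ordinary pointers and carry the same obligations.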
C++ bindings for Rust structs
A Rust struct is mapped to a C++ class/struct with the same fields. If any
field cannot be represented in C++, the struct itself will still have bindings,
but the relevant field will be private.
To receive C++ bindings, the struct must be movable in C++. See
Movable Types.
Example
Given the following Rust module:
cs/file:examples/rust/struct/example.rs class:Struct
Crubit will generate the following bindings:
cs/file:examples/rust/struct/example_generated.h class:CRUBIT_INTERNAL_RUST_TYPE|Struct
Fields
The fields of the C++ class have the C++ types that correspond to the Rust field types:
- If the Rust field has primitive type, then the C++ field uses the corresponding C++ type.
- Similarly, if the Rust field has pointer type, then the C++ field has the corresponding C++ pointer type.
- If the field has a user-defined type, such as a struct or enum, then the C++ field uses the bindings for that type.
Unsupported fields
Fields that do not receive bindings are made private, and replaced with an
opaque blob of maybe-uninitialized bytes, as well as a comment in the generated
source code explaining why the field could not receive bindings. For example,
since String is not supported, the space of the object occupied by a String
field will instead be this opaque blob of bytes:
// Rust: `my_field` is some unsupported type, such as `String`
pub my_field: String,
// C++: `my_field` becomes `private`, and its type is replaced by bytes.
private: unsigned char my_field[24];
Specifically, the following subobjects are hidden and replaced with opaque blobs:
- Non-public fields (private or pub(...) fields).
- Fields that implement Drop.
- Fields whose type does not have bindings.
- Fields that have an unrecognized or unsupported attribute.
C++ movable
To receive C++ bindings, the struct must be movable in C++. See
Movable Types.
C++ bindings for Rust enums
A Rust enum is mapped to an opaque C++ type. C++ code cannot create a specific
variant, but can call functions accepting or returning an enum.
To receive C++ bindings, the enum must be movable in C++. See
Movable Types.
Example
Given the following Rust crate:
cs/file:examples/rust/enum/example.rs class:Color
Crubit will generate the following bindings:
cs/file:examples/rust/enum/example_generated.h class:CRUBIT_INTERNAL_RUST_TYPE|Color
Why isn't it a C++ enum?
A repr(i32) or fieldless repr(C) enum is very similar to a C++ enum.
However, Rust enums are exhaustive: any value not explicitly listed in the
enum declaration does not exist, and it is
undefined behavior
to attempt to create one.
C++ enums, in contrast, are "non-exhaustive": a C++ enum can have any
value supported by the underlying type, even one not listed in the enumerators.
For example, if the above example were a C++ enum, static_cast<Color>(42)
would be a valid instance of Color, even though none of Red, Blue, or
Green has that value.
In order to prevent invalid Rust values from being produced by C++, a C++ enum
cannot be used to represent a Rust enum. Instead, the C++ bindings are a
struct, even for fieldless enums.
C++ movable
To receive C++ bindings, the enum must be movable in C++. See
Movable Types.
Generating C++ enums from Rust enums
By default, a Rust enum is mapped to an opaque C++ type (see
C++ bindings for Rust enums). However, Crubit can try to map Rust
enums to C++ enums if requested using the #[cpp_enum] attribute. C++ code
can use such enums like any other C++ enum.
But #[cpp_enum] cannot be used with exhaustive Rust enums. It may only be
used on non-exhaustive enums, such as those created with #[open_enum] from the
open_enum crate. Therefore, to generate C++ enum bindings, you must annotate
your Rust enum with #[cpp_enum], #[repr(...)] (where ... is an integer
type like i32), and #[open_enum].
C++ enums are non-exhaustive by default, meaning they can hold values other than
the explicitly named enumerators. #[open_enum] generates a Rust enum that is
similarly non-exhaustive. Additionally, C++ allows multiple enumerators to have
the same value, which can be enabled in Rust by using
#[open_enum(allow_alias)].
Example
Given the following Rust crate that uses #[cpp_enum] and
#[open_enum(allow_alias)]:
cs/file:examples/rust/cpp_enum/example.rs class:Color
Crubit will generate the following bindings:
cs/file:examples/rust/cpp_enum/example_generated.h class:CRUBIT_INTERNAL_RUST_TYPE|Color
C++ bindings for Rust type aliases
A Rust type alias, such as pub type X = ...;, is mapped to the equivalent
C++ type alias, such as using X = ...;.
Limitations:
- The type must be a supported type.
- The alias must not be generic: aliases with generic parameters, such as pub type X<T> = ..., are not supported.
Example
Given the following Rust crate:
cs/file:examples/rust/type_alias/example.rs content:\bpub\ type\b
Crubit will generate the following bindings:
cs/file:examples/rust/type_alias/example_generated.h content:\busing\b
C++ bindings for Rust use declarations
Crubit supports use declarations for functions and types, mapping them to
equivalent using declarations in C++.
Limitations:
- The use declaration must refer to a function or type.
  - If it refers to a function, it must not rename the function.
- The use declaration must import exactly one entity per name. For example, pub use m::x; is supported if x refers to a function, or to a type, but not if it refers to both a function and a type.
Example
Given the following Rust crate:
cs/file:examples/rust/use_declaration/example.rs content:\bpub\ use\b
Crubit will generate the following bindings:
cs/file:examples/rust/use_declaration/example_generated.h content:\busing\b
Movable types
Crubit requires types to be "movable" to be passed by value: if a Rust type does not logically support a C++ move operation, then it can receive bindings, but it cannot be passed by value.
A Rust type can be made movable in C++ in one of three ways:
- Copyable: the Rust type implements Clone.
- Trivially move-constructible and destructible: the Rust type does not have a destructor. (It does not implement Drop, and nor do any of its fields.)
- Non-trivially move-constructible: the Rust type has a destructor, but implements Default.
The easiest way to ensure your type is useful to end users, even if it is
changed in the future, is to implement Clone and Default. This makes the
type default-constructible and copyable [1], as well as efficiently
movable.
Copyable
If the Rust type implements Clone, then the C++ type will be copyable:
- Copy construction has the same behavior as Clone::clone.
- Copy assignment has the same behavior as Clone::clone_from.
Because the type is copyable, it is also movable, at worst by a copy operation.
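The two Clone methods that these C++ operations correspond to can be seen side by side in a short Rust sketch. Buffer is a hypothetical type; clone_from is overridden here only to show that it can reuse the destination's existing allocation, which is the Rust-side analogue of copy assignment.

```rust
/// Hypothetical type with a manual `Clone` implementation.
#[derive(Debug, PartialEq)]
pub struct Buffer {
    pub bytes: Vec<u8>,
}

impl Clone for Buffer {
    /// Corresponds to C++ copy construction: builds a fresh object.
    fn clone(&self) -> Self {
        Buffer {
            bytes: self.bytes.clone(),
        }
    }

    /// Corresponds to C++ copy assignment: overwrites an existing object,
    /// and may reuse its storage instead of allocating again.
    fn clone_from(&mut self, source: &Self) {
        self.bytes.clone_from(&source.bytes);
    }
}
```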
Trivially move-constructible and destructible
If no logic occurs during destruction, because the type doesn't implement
Drop, and none of its fields do, then the C++ type will be trivially-movable
and trivially-destructible:
- Move construction and assignment copy the bytes of the object, with the same behavior as a Rust move operation.
NOTE: All Copy types are guaranteed to be trivially move-constructible and
destructible.
If the Rust type is Copy, then the moved-from object is guaranteed to hold its
old value, and be valid for all operations.
Otherwise, the moved-from object is only valid for assignment and destruction, and the behavior of performing any other operation on it is undefined.
Non-trivially move-constructible
If the Rust type is not trivially movable and destructible, but implements
Default, then the resulting C++ type will be (non-trivially) move
constructible:
- Move construction has the same behavior as std::mem::take(): it copies the bytes to the new object, as if by a Rust move, and replaces the moved-from object with Default::default().
- Move assignment copies the bytes to the new object, as if by a Rust move, and replaces the moved-from object with an unspecified but valid object.
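The move-construction behavior can be sketched directly with std::mem::take. Name and move_construct are hypothetical names; the point is only that the source is left holding Default::default() rather than being invalidated.

```rust
/// Hypothetical type that owns heap data (so it has a destructor) and
/// implements `Default`.
#[derive(Debug, Default, PartialEq)]
pub struct Name {
    pub text: String,
}

/// Mirrors what a C++ move constructor does for such a type:
/// the value moves out, and a default value is left behind.
pub fn move_construct(from: &mut Name) -> Name {
    std::mem::take(from)
}
```

After the call, the moved-from object is still a valid Name (here, one with an empty string), so its destructor and any later assignment remain well defined.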
Why is this required?
In general, Crubit needs to be able to move objects as part of the implementation of pass-by-value, even in C++17, due to platform ABI restrictions. Even without this requirement, types are not very useful in C++ if they are not movable.
Unlike Rust, C++ has no "destructive move". There is no way to change an object's location in memory, only to create a new object with the same value, and leave behind something in the old (still valid) object. Sometimes, what's left behind is an identical copy of the object state: this is a copy operation, implemented by the C++ copy constructor or copy assignment operator. But sometimes, copying is expensive, and instead what we might leave behind is some kind of junk value. It still must be a valid object (at least so that its destructor and assignment operator can be invoked), but it might represent some invalid or moved-from state.
For example, to "move" a unique_ptr (the C++ equivalent of Box) from one
variable to another, you copy the bytes, and then replace the old location with
a special null value representing an unoccupied / moved-from unique_ptr. This
is why unique_ptr must be nullable in the C++ type system: otherwise, it
could not be moved!
[1] The combination of default-constructible and copyable is so important for making types useful in C++ that it even has a name: "semiregular".
High-level design of C++/Rust interop
This document describes the high-level design choices of Crubit, a C++/Rust Bidirectional Interop Tool.
C++/Rust interop goal
The primary goal of Crubit is to enable Rust to be used side-by-side with C++ in large existing codebases.
In the short term we would like to focus on codebases that roughly follow the Google C++ style guide to improve the interop fidelity. Other, more diverse codebases are possible prospective users in the long term, and their needs will be addressed by customization and extension points.
C++/Rust interop requirements
In support of the interop goal, we identify the following requirements:
- Enable using existing C++ libraries from Rust with high fidelity
- High fidelity means that interop will make C++ APIs available in Rust, even when those API projections would not be idiomatic, ergonomic, or safe in Rust, to facilitate cheap, small step incremental migration workflow. Based on the experience of other cross-language interoperability systems and language migrations (for example, Objective-C/Swift, Java/Kotlin, JavaScript/TypeScript), we believe that working in a mixed C++/Rust codebase would be significantly harder if some C++ APIs were not available in Rust.
- Interop will bridge C++ constructs to Rust constructs only when the semantics match closely. Bridging large semantic gaps creates a risk of making C++ APIs unusable in Rust, as well as a risk of creating performance problems. For example, interop will not bridge destructive Rust moves and non-destructive C++ moves; instead it will make C++ move constructors and move assignment operators available to use in Rust code. As another example, interop will not bridge C++ templates and Rust generics by default.
- Interop should be performant, as close to having no runtime cost as possible. The performance costs of the interop should be documented, and where possible, intuitive to the user.
- Interop should be ergonomic and safe, as long as ergonomic and safety accommodations do not hurt performance or fidelity. Where a tradeoff is possible, the interop will choose performance and fidelity over ergonomics; the user will be allowed to override this choice.
- Enable owners of the C++ API to control their Rust API projection, for example, with attributes in C++ headers and by extending generated bindings with a manually implemented overlay. Such an overlay will wrap or extend generated bindings to improve ergonomics and safety.
- Enable using Rust libraries from C++
- However, using C++ libraries from Rust has a higher priority than using Rust libraries from C++.
- Put up few, if any, barriers to entry
- Ideally, no boilerplate code needs to be written in order to start using a C++ library from Rust. Adding some extra information can make the generated bindings more ergonomic to use.
- The amount of duplicated API information is minimized.
- Future evolution of C++ APIs should be minimally hindered by the presence of Rust users.
Proposal and high-level design
We propose to develop our own C++/Rust interop tooling. There are no existing tools that satisfy all of our requirements. Modifying an existing tool to fulfill these requirements would take more effort than building a new tool from scratch or might require forking its codebase given that some existing tools have goals that conflict with our goals.
See the "alternatives considered" section for a discussion of existing tools.
Source of information about C++ API
Interop tooling will read C++ headers, as they contain the information needed to generate Rust API projections and the necessary glue code. Interop tooling that is used during builds will not read C++ source files, to maintain the principle that C++ API information is only located in headers, and that a C++ library can't break the build of its dependencies by changing source files.
Some interop-adjacent tools (e.g., large-scale refactoring tools that seed the initial set of lifetime annotations) will also read C++ sources. These tools will not be used during builds.
Pros
- Minimal barrier to entry: minimal amount of manual work is required to
start using a C++ library from Rust.
- Encourages leaf projects to start incrementally adopting Rust in new code, or incrementally rewriting C++ targets in Rust.
- C++ API information is located only in headers, regardless of the language that the API consumer is written in (C++ or Rust).
- Interop tooling that generates Rust API projections from a C++ header can
get exactly the same information that the C++ compiler has when processing
a translation unit that uses one of the APIs declared within that header.
- Interop tooling can generate the most performant calls to C++ APIs, without C++-side thunks that translate the C++ ABI into a C ABI.
- Interop tooling can autodetect implementation details that are critical for interop but are not a part of the API surface (for example, the size and alignment of C++ classes that have private data members).
- In alternative solutions, users need to repeat these implementation details in sidecar files. Interop can verify that the specified information is correct through static assertions in generated C++ code, but the overall user experience is inferior.
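The kind of layout check mentioned above can be sketched on the Rust side with compile-time assertions. This is an illustration only, not Crubit's generated code: Widget is a hypothetical type, and the expected size and alignment are assumed values for a typical target where u32 is 4-byte aligned.

```rust
/// Hypothetical Rust-side mirror of a C++ class with two data members.
#[repr(C)]
pub struct Widget {
    pub id: u32,
    pub flags: u16,
}

// Evaluated at compile time: the build fails if the layout drifts away
// from what the other language expects (values assumed for illustration).
const _: () = assert!(std::mem::size_of::<Widget>() == 8);
const _: () = assert!(std::mem::align_of::<Widget>() == 4);
```

When the tool reads the header itself, it can compute these numbers rather than asking a human to write them down and then checking them.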
Cons
- Having to read C++ headers makes interop tooling more complex.
- The Rust projection of the C++ API is only visible in machine-generated
files.
- These are not trivially accessible.
- There is a limit on how readable these files can be made.
- We can mitigate these issues by building tooling that shows the Rust view of a C++ header (for example in Code Search, or in editors as an alternative go-to-definition target).
Customizability
Interop tooling will be customizable enough to accommodate existing codebases and the unique needs of the different C++ libraries within them. C++ API owners can:
- Guide how interop tooling generates Rust API projections from C++
headers. For example, headers can provide:
- Custom Rust names for C++ function overloads (instead of applying the general interop strategy for function overloads),
- Custom Rust names for overloaded C++ operators,
- Custom Rust lifetimes for pointers and references mentioned in the C++ API,
- Nullability information for pointers in the C++ API,
- Assertions (verified at compile time) and promises (not verified by tooling) that certain C++ types are Rust-movable.
- Provide custom logic to bridge types, for example, mapping C++ absl::StatusOr to Rust Result.
- Provide API overlays that improve the automatically generated Rust API.
- For example, the overlays could inject additional methods into automatically generated Rust types or hide some of the generated methods.
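As a rough illustration of what type-bridging logic might look like, the sketch below converts a status-or-value type into a Rust Result. Status and StatusOr here are hypothetical Rust stand-ins written for this example; they are not the absl types, and this is not a Crubit API.

```rust
/// Hypothetical stand-in for an error status (code 0 means OK).
#[derive(Debug, PartialEq)]
pub struct Status {
    pub code: i32,
    pub message: String,
}

/// Hypothetical stand-in for a "status or value" type.
pub struct StatusOr<T> {
    status: Status,
    value: Option<T>,
}

impl<T> StatusOr<T> {
    /// Wraps a successfully computed value.
    pub fn ok(value: T) -> Self {
        StatusOr {
            status: Status { code: 0, message: String::new() },
            value: Some(value),
        }
    }

    /// Wraps an error; no value is stored.
    pub fn error(code: i32, message: &str) -> Self {
        StatusOr {
            status: Status { code, message: message.to_string() },
            value: None,
        }
    }

    /// The bridging step: expose the idiomatic Rust type to callers.
    pub fn into_result(self) -> Result<T, Status> {
        let StatusOr { status, value } = self;
        match value {
            Some(v) if status.code == 0 => Ok(v),
            _ => Err(status),
        }
    }
}
```

With a bridge like this, Rust callers can use the ? operator and the usual Result combinators instead of inspecting a status object by hand.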
More intrusive customization techniques will be useful for template and macro-heavy libraries where the baseline import rules just won't work. We believe customizability will be an essential enabler for providing high-fidelity interop.
Source of additional information that customizes C++ API projection into Rust
Where C++ headers don't already provide all information necessary for interop tooling to generate a Rust API projection, we will add such information to C++ headers whenever possible. If it is not desirable to edit a certain C++ header, extra information can be stored in a sidecar file.
Examples of additional information that interop tooling will need:
- Nullability annotations. C++ APIs often expose pointers that are documented or assumed by convention to be never null, but can't be refactored to references due to language limitations (for example, std::vector<MyProtobuf *>). If C++ headers don't provide nullability information for pointers in a machine-readable form, interop tooling has to conservatively mark all C++ pointers as nullable in the Rust API projection. The Rust compiler will then force users to write unnecessary (and untestable) null checks.
- Lifetimes of references and pointers in C++ headers are not described in a machine-readable way (and sometimes are not even documented in prose). Lifetime information is essential to generate safe and idiomatic Rust APIs from C++ headers.
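The cost of missing nullability information shows up in the Rust signatures. The sketch below contrasts the two projections; Message, id_or_zero, and id are hypothetical names, and the exact types Crubit emits may differ.

```rust
/// Hypothetical message type.
pub struct Message {
    pub id: u32,
}

/// Projection without nullability information: the pointer must be treated
/// as nullable, so every caller path has to handle `None`.
pub fn id_or_zero(msg: Option<&Message>) -> u32 {
    msg.map_or(0, |m| m.id)
}

/// Projection when the header states the pointer is never null:
/// a plain reference, with no null check to write or test.
pub fn id(msg: &Message) -> u32 {
    msg.id
}

// `Option<&T>` has the same size as a raw pointer, so the nullable form
// costs ergonomics rather than space.
const _: () = assert!(
    std::mem::size_of::<Option<&Message>>() == std::mem::size_of::<*const Message>()
);
```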
Additional information is stored in C++ headers
Pros
- Additional information needed for C++/Rust interop will be expressed as
annotations on existing syntactic elements in C++.
- The annotations are located in the most logical place.
- The annotations are more likely to be noticed and updated by C++ API owners.
- API owners retain full control over how the API looks in Rust.
- C++ users may find lifetime and nullability annotations useful. For example, information about lifetimes is highly important to C++ and Rust users alike.
- C++ API definitions are only written once, minimizing duplication and maintenance burden.
Cons
- Annotations that benefit Rust users can bother C++ API owners who don't
care about Rust. Especially at the beginning of integrating Rust into an
existing codebase, C++ API owners can push back on adding annotations.
- To encourage adoption of annotations, we can develop tooling for C++ that uses lifetime and nullability annotations to find bugs in C++ code.
- The pushback is likely to be short-term: if Rust takes off in a C++ codebase, C++ library owners in that codebase will need to care about Rust users and how their API looks in Rust.
- There may be headers that we cannot (or would not want to) change, for
example, headers in third-party code, headers that are open-sourced, or when
first-party owners are not cooperating.
- We can apply the sidecar strategy to these headers.
Additional information is stored in sidecar files
Additional information needed for C++/Rust interop can be stored in sidecar files, similarly to Swift APINotes, CLIF etc. If sidecar files get sufficiently broad adoption (for example, if annotating third-party code turns out to be sufficiently important that optimizing C++/Rust interop ergonomics there would be worth it), it would make sense to write sidecar files in a Rust-like language, as that provides the most natural way to define Rust APIs.
Pros
- Sidecar files enable broader adoption of annotations by providing additional interop information without modifying C++ headers. Sidecar files will allow us to annotate headers in third-party code, headers that can't adopt annotations for technical reasons, or headers owned by first-party owners who are not cooperating.
Cons
- As in the "Use Rust code to customize API projection into Rust" alternative, some part of the C++ API information is duplicated, which is a burden for the C++ API owners.
- The projection of C++ APIs to Rust is defined in a new language.
- C++ API owners and Rust users will have to learn this language.
- If we expect wide adoption of sidecar files, we will need to create tooling to parse, edit, and run LSCs against this language.
- Annotations in sidecar files are more prone to become out of sync with the
C++ code. When making changes to C++ code, engineers are less likely to
notice and update the annotations in sidecar files.
- Presubmits can catch some cases of desynchronization between C++ headers and sidecar files. However, presubmit errors that remind engineers to edit more files create an inferior user experience.
- Sidecar files create extra friction to modify the code. Where previously
one had to edit only a C++ header and a C++ source file, now one also likely
needs to update a sidecar file.
- When engineers realize that they need to update a sidecar file, opening another file and finding the right place to update creates extra friction to modify code.
- Once engineers understand the extra maintenance burden associated with sidecar files that tend to go out of sync with headers, they will be less likely to adopt annotations in the first place.
Glue code generation
C++/Rust interop tooling will generate executable glue code and type definitions
in Rust and in C++ (not merely extern "C" function declarations) in order
to achieve the following goals:
- Enable instantiating C++ templates from Rust, and monomorphizing Rust
generics from C++. Enable Rust types to participate in C++ inheritance
hierarchies.
- For example, imagine Rust code using an object of type std::vector<MyProtobuf>, while C++ code in the same program never instantiates this type. The Bazel rust_library target that mentions this type must therefore be responsible for instantiating this template and linking the resulting executable code into the final program. We propose that this instantiation happens in an automatically generated "glue" C++ translation unit that is a part of that rust_library.
- Enable automatically wrapping C++ code to be more ergonomic in Rust. For
example:
- extern "C" functions in Rust are necessarily unsafe (it is a language rule). We would like the vast majority of C++ API projections into Rust to be safe. In the current Rust language, we can achieve that only by wrapping the unsafe extern "C" function in a safe function marked with #[inline(always)].
- C++ API owners can provide rules for automatic type bridging, for example, mapping C++ absl::StatusOr to Rust Result. This conversion necessitates generation of a Rust wrapper function around a C++ entry point that takes advantage of such type bridging.
- Provide stable locations (C++ modules, Rust crates) that "own" the types
from the language point of view.
- For example, when we project a C++ type into Rust, its Rust definition must be located in a Rust crate. Furthermore, all Rust users of this type must observe it as being defined in the same crate in order for every user to consider that they use the same type. Indeed, it is a rule in Rust that types defined in different crates are unrelated types.
- When we project a Rust type into C++ we could repeat its C++ definition in C++ code any number of times (for example, in every C++ user of a Rust type). This is technically fine because C++ allows the same type to be defined multiple times within a program. Nevertheless, such duplication is error-prone.
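The wrapping and type-bridging goals above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not Crubit's actual generated output: `parse_positive_thunk` and `parse_positive` are hypothetical names, and the thunk is defined in Rust here only so that the sketch is self-contained. In a real build it would be declared in an extern "C" block and implemented by generated C++ glue code.

```rust
use std::ffi::c_int;

/// Hypothetical stand-in for a C++ entry point exposed through a C ABI thunk.
/// Returns 0 and writes the result to `out` on success, or a non-zero status
/// code on failure.
///
/// # Safety
/// `out` must point to a valid, writable `c_int`.
unsafe extern "C" fn parse_positive_thunk(input: c_int, out: *mut c_int) -> c_int {
    if input > 0 {
        // SAFETY: the caller guarantees that `out` is valid for writes.
        unsafe { out.write(input) };
        0
    } else {
        1
    }
}

/// Safe wrapper of the kind interop tooling could generate: it hides the
/// unsafe call and bridges the status-code convention to a Rust `Result`.
#[inline(always)]
pub fn parse_positive(input: c_int) -> Result<c_int, c_int> {
    let mut out: c_int = 0;
    // SAFETY: `out` is a live local variable, so the pointer is valid for writes.
    let status = unsafe { parse_positive_thunk(input, &mut out) };
    if status == 0 {
        Ok(out)
    } else {
        Err(status)
    }
}

fn main() {
    println!("{:?}", parse_positive(7));
    println!("{:?}", parse_positive(-3));
}
```

Callers only see the safe `parse_positive` function; the `unsafe` block and its justification stay inside the generated wrapper.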
Glue code is generated as C++ and Rust source code
Interop tooling will generate glue code as C++ and Rust source files, which are then compiled with an unmodified compiler for that language. The alternative is to generate LLVM IR or object files with machine code directly from interop tooling.
Pros
- It is easy to inject customizations provided by API owners into generated
source code.
- The customizations will be written in the target language, making it (hopefully) intuitive to write them.
- Generated source code can be easily inspected by compiler engineers while debugging interop problems and compiler bugs.
- Generated source code can be inspected and understood by interop users,
who are not compiler experts.
- LLVM IR wouldn't be meaningful to them.
- Generated source code is processed by the regular toolchain like any other
code in the project.
- It automatically benefits from all performance optimizations and sanitizers that are newly implemented in Clang and Rust compilers.
- We avoid adding a new tool that generates unique LLVM IR patterns.
- We avoid making the job of the C++ toolchain maintainers harder.
Cons
- Interop tooling will be limited to generating LLVM IR and machine code that Clang and Rust compilers can generate.
Glue code and API projections will assume implementation details of the target execution environment
To provide the most ergonomic and performant interop, C++/Rust interop tooling will allow the target codebase to opt into assuming various implementation details of the target execution environment. For example:
- When calling C++ from Rust, interop tooling can either wrap C++ functions in thunks with a C calling convention, or call C++ entry points directly. Thunks cause code bloat and can collectively add up to become a performance problem, so it is desirable to call C++ entry points from Rust directly. Interop tooling can do that only if it may assume a specific target platform and C++ ABI.
Implementation details of the target execution environment that are considered stable enough will be reflected in API projections, for example:
- The C++ standard does not specify sizes of integer types (short, int, long etc.) To map them to Rust, interop tooling will need to assume the sizes that they have on the platforms that it targets in practice. The alternative would be to create target-agnostic integer types (for example, Int in Swift is a strong typedef for Int32 on 32-bit targets, and Int64 on 64-bit targets), but this makes it harder to provide idiomatic, transparent, high-performance interop.
- The C++ standard does not specify whether standard library types like std::vector are in any sense Rust-movable; it is an implementation detail. Universal interop tooling would have to conservatively assume non-Rust-movable types. Interop tooling specific to certain environments can rely on libc++ providing a Rust-movable std::vector and project it into Rust in a much more ergonomic way.
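The integer-size point can be observed from Rust itself. The standard library's `std::ffi::c_short`, `c_int`, and `c_long` are type aliases whose widths are chosen per compilation target, so code that uses them already assumes details of the target platform. The helper name `c_integer_sizes` below is made up for this sketch.

```rust
use std::ffi::{c_int, c_long, c_short};
use std::mem::size_of;

/// Sizes in bytes of some C integer types on the compilation target.
pub fn c_integer_sizes() -> [(&'static str, usize); 3] {
    [
        ("short", size_of::<c_short>()),
        ("int", size_of::<c_int>()),
        ("long", size_of::<c_long>()),
    ]
}

fn main() {
    // The printed values depend on the target; `long` in particular differs
    // between common 64-bit platforms.
    for (name, size) in c_integer_sizes() {
        println!("C `{}` occupies {} bytes on this target", name, size);
    }
}
```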
Pros
- Interop tooling will generate the most performant code sequences to call
foreign language functions.
- If interop tooling generates portable code, it would have some overhead. The overhead can be eliminated by C++ and Rust optimizers at least in some cases, but at the cost of increased build times. For example, eliminating thunks would require turning on LTO, which is not fast, and usually only used for release builds. It is much preferable to not generate thunks in the first place, if the target platform does not need them.
- Ergonomics of API projections will be improved.
- For example, whether a C++ type is Rust-movable or not is an implementation detail in C++, transparent to C++ users of that type, but it makes a huge ergonomic difference in the Rust API projection.
Cons
- C++ code will have additional evolution constraints.
- For example, changing a type from Rust-movable to non-Rust-movable is a non-API-breaking change for C++ users, but it would break Rust users.
- It would be more difficult to switch internal environments to a different C++ standard library.
- Code that is deployed in environments that have incompatible
implementation details won't be able to use this C++/Rust interop system.
- Alternatively, these executables would have to bring a suitable execution environment with them (e.g., a copy of libc++).
Interop tooling should be maintainable and evolvable for a long time
We should design and implement C++/Rust interop tooling in such a way that we can maintain and evolve it for more than a decade. If Rust becomes tightly integrated into an existing C++ project, specific requirements for interop and API projection rules will keep changing. The more Rust is adopted, the more library-specific and team-specific interop customizations we will have to support, and the more it will make sense for the performance team to tweak generated code to implement sweeping optimizations. These kinds of changes should be readily possible, and they should not create conflicts of interest between different users of the interop tooling.
Interop tooling should facilitate C++ to Rust migration
C++/Rust interop tooling should try to create a favorable environment for migrating C++ code to Rust. Specifically, projections of C++ APIs into Rust should be implementable in Rust. This way, a C++ library can be converted from C++ into Rust transparently for its users, as its public API won't change.
Alternatives Considered: Design decisions
Repeat C++ API completely in a separate IDL
Instead of reading C++ headers in the interop tooling, we would require the user to repeat the C++ API in some other form, for example, in a Rust-based IDL like in the cxx crate, or in sidecar files in a completely new format.
Pros
- Interop tooling can be simpler if it does not have to read C++ headers. But even under this alternative approach, tooling might want to read C++ headers, nullifying this advantage. For example, tooling might want to automatically generate an initial Rust snippet or to suggest in presubmits to adjust the Rust code that mirrors a C++ API when that C++ API changes.
- The most natural way to define Rust APIs is by using Rust code or Rust-like syntax in sidecar files.
- Available Rust APIs are defined in easily accessible checked-in files.
- API definitions written by a human might have higher quality, on average.
Cons
- A big part of the C++ API needs to be duplicated to reliably match the
Rust code with the C++ declarations. The initial code can be generated by
tooling, but it has to be kept in sync. This is a burden for the C++ API
owners, potentially a bigger one than allowing annotations in C++ headers.
- There is a risk that C++ API owners might refuse to own IDL files.
- The need to create a sidecar file creates a barrier to start using C++
libraries from Rust.
- While the duplication overhead is justifiable for widely-used libraries, it is relatively high for libraries with few users and binaries, making it less likely that leaf teams will start adopting Rust.
- When the C++ API is changed, the Rust definitions become out-of-sync with it. Tooling needs to detect this, and the Rust definitions need to be changed (either manually or tool-assisted).
- There is no effective way to verify Rust binding code at the presubmit time of a C++ library other than building downstream projects.
- Mapping Rust API definitions to the original C++ API definitions is more complicated and error-prone. For example, how would we target a specific overload of a function or constructor?
- There is a risk that individual teams will build team-specific tooling that generates IDL files from C++ headers or generates both IDL files and C++ headers from a single source. These solutions are unlikely to scale to existing large codebases and will likely only work for that specific team.
Use Rust code to customize API projection into Rust
An alternative to storing additional information in C++ headers is to put it into Rust code. For example, the cxx crate requires users to re-state the C++ API in Rust syntax, adding information about lifetimes and nullability. The pros and cons of this choice are the same as when defining a special IDL that repeats the C++ API completely (see above).
Generate glue code in binary formats
Instead of generating glue code as textual sources, interop tooling could use Clang and LLVM APIs to emit object files with C++ glue code and use Rust compiler APIs to generate rmeta and rlib files with Rust glue code.
Pros
- More flexibility in the code that can be generated. Controlling LLVM IR generation allows interop tooling to generate code that an unmodified compiler can't generate from textual source code. For example, the Rust language does not have any constructs that map to linkonce_odr functions in LLVM IR; if the interop tooling embedded the Rust compiler as a library and had more control over how it generates the IR, we could make that happen.
Cons
- Injecting customizations provided by API owners is harder.
- LLVM, Clang, and Rust compiler APIs are not stable. The format of Rust metadata files is not stable either. The larger the API subset we consume from Clang and Rust, the more difficult it becomes to maintain the tooling.
- To generate object files the interop tooling has to ensure that its
Clang/LLVM version and configuration is identical with the Clang compiler
used to build other C++ code.
- We can solve this problem, but it makes the system more fragile, compared to using existing C++ and Rust compilers to compile generated sources.
- From time to time LLVM introduces bugs that cause miscompilations. If interop tooling embeds LLVM, we would be adding another tool that toolchain engineers will need to look into when debugging a miscompilation. We would be making the job of C++ toolchain maintainers harder.
Alternatives Considered: Existing tools
bindgen
bindgen automatically generates
Rust bindings from C and C++ headers, which it consumes using libclang. The
generated bindings are pure Rust code that interfaces with C and C++ using
Rust’s built-in FFI for C
(#[repr(C)] to indicate that a struct should use C memory layout and extern "C" to indicate that a function should use a C calling convention). C++
functions are handled by generating a Rust extern "C" function that has the
same ABI as the C++ function and attaching a link_name attribute with the
mangled name.
See here for an in-depth description of the use of bindgen in Stylo, a Rust component in Firefox.
Pros
- The oldest and the most mature of the existing C++ interop tools (developed since Feb 2012).
Cons
- Deficiencies in safety and ergonomics, for example:
- References are imported as pointers. No lifetimes, no null-safety.
- Constructors and destructors are not called automatically.
- Overloads are distinguished by a numbered suffix in Rust. These numbers clutter the source code and are hard to remember, as they have no meaning. Adding overloads can change the numbering and hence break Rust callers.
- It is impossible to use C++ inline functions and templates from Rust because of bindgen’s architecture[1]. The architecture is unlikely to change, and therefore, this is a dealbreaker.
Evaluation
bindgen could be used in a project that has very limited C++ interop needs. However, creating safe and ergonomic wrappers for the generated bindings would require additional effort. Our vision and goals for C++ interop are very different from what bindgen provides.
cbindgen
cbindgen automatically generates C or C++ headers for Rust libraries which expose a public C API.
Pros
- An old and mature tool (developed since March 2017).
Cons
- Shallow understanding of Rust's modules and types.
- cbindgen's docs point out that "A major limitation of cbindgen is that it does not understand Rust's module system or namespacing. This means that if cbindgen sees that it needs the definition for MyType and there exists two things in your project with the type name MyType, it won't know what to do. Currently, cbindgen's behaviour is unspecified if this happens."
- This limitation seems mostly caused by building cbindgen on top of the syn crate. syn is able to parse Rust source code into an AST, but there is no facility at the syn level for type deduction or module traversal. Building such functionality would require replicating parts of the rustc compiler into cbindgen (or alternatively rewriting cbindgen on top of the rustc_driver crate).
- Support of only extern "C" functions.
- Supporting Rust functions that use the default calling convention would require generating not only C/C++ headers, but also generating Rust source with extern "C" thunks that trampoline into the original function (requiring that cbindgen starts generating Rust sources).
- Support of only #[repr(C)] structs.
- Default memory layout of Rust structs is unspecified and therefore cannot be determined by code examination at the syn level.
- Even if the memory layout could be determined, the layout can change in a future compiler version, or change depending on compilation command line flags. To prevent using stale layout information, the auto-generated FFI code should therefore include compile-time assertions that the layout didn't change from the FFI generation time. The assertions should be present both in the generated C/C++ headers and on the Rust side (requiring that cbindgen starts generating Rust sources). The assertions would effectively verify that the FFI generation is driven by the build system (i.e. by Bazel, or Cargo, or GN/ninja, rather than manually) and that the integration between the FFI tools and the build system doesn't have any bugs (e.g. that it faithfully replicates all relevant compilation flags).
Evaluation
cbindgen could be used in a project that can create a narrow extern "C" /
#[repr(C)] API and that is ready to manage the risk of incorrect name/module
resolution. Wrapping additional Rust APIs would require extra effort.
Take-aways for Crubit design
Notes and observations about cbindgen can guide some design aspects of
Crubit's cc_bindings_from_rs tool
(that similarly to cbindgen generates C++ bindings for Rust crates).
Using internal compiler knowledge (e.g. memory layout of structs, name and type
resolution) requires that cc_bindings_from_rs depends on
rustc_driver and other internal crates of rustc. The API of these crates is
unstable which might increase the risk and maintenance cost of Crubit.
Nevertheless, our experience with maintaining tools based on (also unstable)
Clang APIs suggests that this extra risk and cost is likely going to be
acceptable.
Build determinism requires that the Rust compiler produces the same output for
the same set of inputs (the same compiler version, the same command-line flags,
the same sources, etc.). This means that (despite
conservative reservations about layout determinism)
it should be okay to assume that cc_bindings_from_rs and rustc invocations
will observe the same memory layout of structs, but this requires that
cc_bindings_from_rs is built against exactly the same version of
rustc_driver libraries as rustc. (This should also be reinforced by
compile-time assertions in the generated FFI layer.)
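As a rough sketch of what such compile-time assertions could look like on the Rust side, consider the following. The struct `Point` and the constants `RECORDED_SIZE` and `RECORDED_ALIGN` are hypothetical and stand in for values that a bindings generator would have observed; the numbers assume a typical 32-bit or 64-bit target. The generated C++ header would carry matching `static_assert` checks. This is not the actual output of cc_bindings_from_rs.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical Rust struct with the default (unspecified) memory layout.
pub struct Point {
    pub x: i32,
    pub y: i32,
}

// Layout that the (hypothetical) bindings generator recorded when it ran.
pub const RECORDED_SIZE: usize = 8;
pub const RECORDED_ALIGN: usize = 4;

// Evaluated at compile time: the build fails if the compiler that builds this
// crate lays out `Point` differently from what the generator observed.
const _: () = assert!(size_of::<Point>() == RECORDED_SIZE);
const _: () = assert!(align_of::<Point>() == RECORDED_ALIGN);

fn main() {
    let p = Point { x: 1, y: 2 };
    println!("Point({}, {}) occupies {} bytes", p.x, p.y, size_of::<Point>());
}
```

If the two tools ever disagreed (for example, because of a version mismatch or a dropped compilation flag), the mismatch would surface as a build error rather than as silent memory corruption.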
cxx
cxx generates Rust bindings for C++ APIs and vice versa from an interface definition language (IDL) included inline in Rust source code. cxx generates Rust and C++ source code from IDL definitions. To check that the IDL definitions match the actual C++ API, cxx inserts static assertions[2] into the generated C++ code; it does not, however, read the C++ headers itself. cxx contains built-in bindings for various Rust and C++ standard library types that are not customizable.
As far as we understand, cxx has the following design constraints and goals:
- Ship a stable product for its intended audience.
- As a consequence, improvements such as integrating move semantics are not going to be accepted soon. We understand that cxx is not a vehicle for experimentation. cxx maintainers would prefer us to first show that our ideas work in a fork of cxx or in a different system, such as autocxx, and that our improvements pull their weight given the added complexity.
- Remain simple and transparent. There is a limit on the amount of
complexity that will be tolerated.
- There is a chance that improvements such as modeling C++ move semantics or various attempts at eliminating thunks will never be accepted in upstream cxx.
- Non-goal: Automatically provide high fidelity interop.
- cxx is designed for the use case of an executable where C++ and Rust parts communicate through a narrow interface.
- Non-goal: Automatically provide the most performant interop in as many
cases as possible. For example:
- cxx does not attempt to eliminate C++-side thunks. Instead, using LTO is recommended.
- cxx considers it acceptable to allocate all objects of "opaque" types on the heap. Users who find these heap allocations unacceptable for performance reasons are expected to implement a different C++ entry point that does not hit this limitation and bind it to Rust instead of the original C++ API. Heap allocation is acceptable for many C++ classes in most environments, but the exceptions are important enough for us that this is a major restriction.
Pros
- Mature and ergonomic enough today for mixing C++ and Rust in existing codebases with limited C++ interop needs.
- We avoid being on a tech island.
Cons
- cxx’s stability goal makes it hard to experiment with how the Rust API looks.
- Our goals are unlikely to align well with the goals of the intended user audience of cxx. We would be pulling cxx in directions that make it a worse product for its current users.
- Almost no customizability. Users who are not satisfied with what cxx does are expected to wrap the target C++ API in a different C++ API that is more friendly to cxx.
- cxx tries to be compatible with most standard C++ implementations found in the real world, so it cannot take advantage of unique guarantees provided by the target execution environment.
Evaluation
cxx could be used in projects with limited C++/Rust interop requirements. However, we would not be able to implement many interop features that we consider essential (for example, move semantics, templates).
autocxx
autocxx automatically generates Rust bindings from C++ headers. As the name implies, it automatically generates IDL definitions for cxx, which then produces the actual bindings. In addition, autocxx generates its own Rust and C++ code to extend the Rust API beyond what cxx itself would provide, for example to support passing POD types by value. autocxx consumes C++ headers indirectly by first running bindgen on them and then parsing the Rust code output by bindgen.
autocxx’s design goals are similar to our own in this document.
We did a case study on using an existing project's C++ API from Rust using autocxx.
Pros
- Low barrier to entry: Bindings are generated from C++ headers, no need to write duplicate API definitions.
- Ergonomic mappings for many C++ constructs.
- Open to contributions that change the generated Rust APIs or make architectural changes.
Cons
- Relatively new and immature.
- Cannot (yet) consume complex headers without errors. We’ve managed to import some actual Spanner headers, but there are still enough outstanding issues that we can’t yet do anything useful with Spanner.
- Architecture can make modifications difficult. autocxx is built on top
of two other tools, bindgen and cxx, and the interfaces between these
components can make it harder to make a modification than it would be in a
monolithic tool. Specifically:
- autocxx uses bindgen to generate a description of the C++ API that it can parse easily (as opposed to trying to parse C++ headers either directly or using Clang APIs). Since bindgen was not intended for this purpose, its output lacks some information that autocxx needs, so autocxx has forked bindgen to adapt it to its needs. The forked version emits additional information about the C++ API in the form of attributes attached to various API elements.
- bindgen in turn is built on the libclang API, which doesn’t surface all of the functionality available through Clang’s C++ API. Adding features to libclang requires additional effort and has a 6 month lead time to appear in a stable release (to become eligible to be used from bindgen).
- When errors occur, it can be hard to figure out which of the components is responsible.
- Adding features can require touching multiple components, which requires commits to multiple repositories.
Evaluation
We initially intended to use autocxx to prototype various interop ideas and potentially as a basis for a field trial. We still believe this would be feasible, but after trying to modify autocxx and its bindgen fork during an internal C++/Rust interop study, we feel that autocxx’s complex architecture is enough of an impediment that we could achieve our goals with less total effort by creating an interop tool from scratch that consists of a single codebase and uses the Clang C++ API to directly interface with Clang.
[1] Doing so would require either generating C++ source code or interfacing deeply enough with Clang to generate object code for inline functions and template instantiation.
[2] And tricks such as suitable type conversions that force the C++ compiler to perform appropriate checks at compile time.
Lifetime Annotations for C++
Summary: We propose a scheme for annotating lifetimes for references and pointers in C++.
Note: This is a living document that is intended to always reflect the most current semantics and syntax of the lifetime annotations.
Introduction
This document proposes an attribute-based annotation scheme for C++ that describes object lifetime contracts. Lifetime annotations serve the following goals:
- They allow relatively cheap, scalable, local static analysis to find many common cases of heap-use-after-free and stack-use-after-return bugs.
- They allow other static analysis algorithms to be less conservative in their modeling of the C++ object graph and potential mutations done to it.
- They serve as documentation of an API’s lifetime contract, which is often not described in the prose documentation of the API.
- They enable better C++/Rust and C++/Swift interoperability.
The annotation scheme is inspired by Rust lifetimes, but it is adapted to C++ so that it can be incrementally rolled out to existing C++ codebases. Furthermore, the annotations can be automatically added to an existing codebase by a tool that infers the annotations based on the current behavior of each function’s implementation.
While the annotation scheme can express a large subset of Rust’s lifetime
semantics, we have omitted some constructs that we do not expect to be necessary
for our purposes. For example,
lifetime bounds
(e.g. 'a: 'b or T: 'a) may be needed rarely enough that we can do without
them, and
higher-ranked trait bounds
(e.g. where for<'a> F: Fn(&'a i32)) are possible only for function types,
which is what they are usually needed for.
We are aware of two existing schemes for annotating lifetimes and flagging
lifetime violations in C++; we describe them in the sections
“Alternative considered: [[clang::lifetimebound]]”
and
“Alternative considered: P1179 / -Wdangling-gsl”
below. Both of these schemes have limitations that make them unsuitable for our
purposes. We plan to enable our lifetime analysis to understand the existing
annotations by translating them into our annotation syntax internally (where
possible).
Proposal
Examples
To give a feel for how the annotations work in practice, we will first show some examples.
Here is a simple example:
const std::string& [[lifetime(a)]] smaller(
    const std::string& [[lifetime(a)]] s1,
    const std::string& [[lifetime(a)]] s2) {
  if (s1 < s2) {
    return s1;
  } else {
    return s2;
  }
}
The annotation states that both s1 and s2 may be referred to by the return
value of the function. This implies that the lifetime of the return value is the
shorter of the lifetimes of s1 and s2. In Rust, this example would be
expressed as follows:
pub fn smaller<'a>(s1: &'a String, s2: &'a String) -> &'a String;
Note how the syntax is broadly similar. The main difference is that, unlike in Rust, our proposal does not require lifetimes to be declared.
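To make the Rust side of this comparison concrete, here is a runnable sketch that adds a body to the `smaller` signature and calls it with arguments of different lifetimes. The variable names in `main` are made up for illustration.

```rust
/// Returns a reference to whichever of the two strings compares smaller.
/// The result may refer to either argument, so it carries the lifetime `'a`
/// that both arguments share.
pub fn smaller<'a>(s1: &'a String, s2: &'a String) -> &'a String {
    if s1 < s2 {
        s1
    } else {
        s2
    }
}

fn main() {
    let outer = String::from("pear");
    {
        let inner = String::from("apple");
        // Here `'a` can be no longer than the scope of `inner`,
        // the shorter-lived of the two arguments.
        let picked = smaller(&outer, &inner);
        println!("smaller: {}", picked);
    }
    // Storing `picked` in a variable that outlives the inner block would be
    // rejected by the borrow checker, because it may refer to `inner`.
}
```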
A lifetime annotation placed after a member function refers to the lifetime of the object the member function is called on:
struct string {
  // The returned pointer should not outlive *this.
  const char *[[lifetime(a)]] data() const [[lifetime(a)]];
};
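For comparison, a Rust method expresses the same contract by naming the lifetime of `self`. The type `MyString` below is a hypothetical stand-in for the C++ `string` struct and is not part of the proposal.

```rust
/// Hypothetical stand-in for the C++ `string` struct above.
pub struct MyString {
    bytes: Vec<u8>,
}

impl MyString {
    pub fn new(text: &str) -> Self {
        MyString {
            bytes: text.as_bytes().to_vec(),
        }
    }

    /// The returned slice borrows from `self` for `'a`,
    /// so it cannot outlive the object it was obtained from.
    pub fn data<'a>(&'a self) -> &'a [u8] {
        &self.bytes
    }
}

fn main() {
    let s = MyString::new("hello");
    println!("{} bytes", s.data().len());
}
```

In idiomatic Rust the explicit `'a` would usually be elided; it is spelled out here to mirror the two `[[lifetime(a)]]` annotations in the C++ declaration.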
Similar to Rust, [[lifetime(static)]] is used to denote a static lifetime. A
common pattern is for a class to have a static function returning a reference to
some default value:
class Options final {
 public:
  // ...
  static const Options &[[lifetime(static)]] DefaultOptions();
  // ...
};
The attribute can be applied to references that appear inside a more complex type expression. For example:
const std::vector<const A *[[lifetime(static)]]> &[[lifetime(static)]]
get_static_as();
This expresses that both the reference to the vector and the pointers to the
As contained inside it have static lifetimes.
This roughly corresponds to the following in Rust (with the difference that, unlike C++ pointers, Rust references cannot be null):
fn get_static_as() -> &'static CxxVector<&'static A>;
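The nullability difference can be illustrated with a small runnable sketch. It assumes a hypothetical struct `A` and uses a plain slice instead of `CxxVector` to stay self-contained; a nullable C++ pointer is modeled as `Option<&'static A>`, which Rust guarantees to have the same size as a raw pointer.

```rust
use std::mem::size_of;

/// Hypothetical element type standing in for the C++ class `A`.
pub struct A {
    pub id: u32,
}

static FIRST: A = A { id: 1 };
static SECOND: A = A { id: 2 };

// `None` plays the role of a null pointer in the C++ vector.
static ALL: [Option<&'static A>; 3] = [Some(&FIRST), None, Some(&SECOND)];

/// Both the outer reference and the inner references have static lifetime.
pub fn get_static_as() -> &'static [Option<&'static A>] {
    &ALL
}

fn main() {
    for entry in get_static_as() {
        match entry {
            Some(a) => println!("A with id {}", a.id),
            None => println!("null entry"),
        }
    }
    // A nullable reference costs no extra space compared to a raw pointer.
    println!(
        "{} == {}",
        size_of::<Option<&'static A>>(),
        size_of::<*const A>()
    );
}
```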
Lifetimes
Lifetimes are associated with certain types that we call reference-like types. A reference-like type is one of the following:
- A pointer (except pointers to functions and pointers to members)
- A reference (except references to functions)
- A user-defined type that has been annotated as having lifetime parameters. (We will explain user-defined reference-like types in detail in a later section.)
The reason that pointers to functions and references to functions do not have
lifetimes is to be consistent with Rust, where fn types do not have lifetimes
either. In C++, the function that a pointer or reference refers to almost always
exists for the duration of the program execution. There are some exceptions,
such as functions created by a JIT compiler or functions in plugins loaded and
unloaded at runtime. Such functions may be destroyed before the program exits,
but we consider them to be unusual enough that we don't support annotating their
lifetimes.
Pointers to members don't have lifetimes because they aren't pointers in the narrower sense. A pointer to member doesn't refer to a specific object in memory; rather, it can be used to refer to a specific member of any object of a given type. In implementation terms, a pointer to member is not an address but an offset.
Lifetimes are annotated using the new attribute lifetime[1]. The attribute
takes one or several lifetime names as arguments.
Appendix A contains a formal
description of the attribute syntax.
For brevity, lifetimes may be implicitly inferred in some situations; this is referred to as lifetime elision, and we describe the specific rules for this later.
There are two lifetime names with special meaning:
- static: A lifetime that lasts for the duration of the program.
- unsafe: A lifetime that cannot otherwise be represented correctly using lifetime annotations. We will discuss the semantics of an unsafe lifetime in more detail below.
In addition, there are two types of lifetimes that cannot be named in a lifetime attribute but that are implicitly associated with reference-like types in certain situations:
- Local lifetime: The lifetime of a pointer to a variable with automatic storage duration.
- Unknown lifetime: A lifetime that has not been annotated and cannot be
implicitly inferred.
The concept of unknown lifetimes is important because it allows us to migrate a codebase to lifetime annotations incrementally. Tools that verify lifetime correctness should assume that operations involving unknown lifetimes are lifetime-correct; this avoids generating large numbers of nuisance errors for code that has not been annotated yet. Note that this makes unknown lifetimes fundamentally different from unsafe lifetimes.
We call static, unsafe, local, and unknown lifetimes constant lifetimes. We call all other lifetimes variable lifetimes; this reflects the fact that they may be substituted by other lifetimes.
The lifetime attribute can be applied to reference-like types in function
signatures, variable declarations (including member variable declarations),
alias declarations, and to user-defined reference-like types when referring to
static members of such types. The sections below give details on how the
attribute can be applied to these constructs and what the semantics are in each
case.
Note that, unlike in Rust, lifetimes are not part of the type. For the purposes of C++ semantics (e.g. function overloading), two types that differ only in their lifetime annotations are considered the same type. This is by design: We don’t want to change the semantics of existing code by adding lifetimes, and this is one of the reasons we have chosen to use C++ attributes; the C++ standard allows compilers to ignore attributes they don’t know, which implies that they have no effect on the C++ semantics.
Lifetime-correctness
The implementation of a function must be lifetime-correct. This section explains what that means.
Most expressions propagate lifetimes in ways that are straightforward. We will therefore explain lifetime-correctness rules only for those cases that are non-trivial.
Dereferencing a pointer or accessing the value referred to by a reference is lifetime-correct in exactly the following cases:
- If its lifetime is static or a variable lifetime
- If its lifetime is local and the access happens during the lifetime of the corresponding local variable.
Dereferencing a pointer with unknown lifetime or accessing the value referred to by a reference with unknown lifetime is not lifetime-correct, but tools should not emit lifetime verification errors in these cases.
operator new returns a pointer with unsafe lifetime. operator delete takes a
pointer parameter that has unsafe lifetime.
Initializing or assigning an object of reference-like type with another object is always correct if the lifetimes of the two objects are the same.
In addition, there are a number of cases where it is permissible to initialize or assign an object of reference-like type with another object that has different lifetimes. We call such an operation a lifetime conversion.
To define lifetime correctness of conversions, we first need to define what it means for one lifetime to outlive another:
- Any lifetime outlives itself.
- The static lifetime outlives any variable or local lifetime.
- Any variable lifetime outlives any local lifetime.
- A local lifetime local1 outlives another local lifetime local2 if the object associated with local1 outlives the object associated with local2 according to C++’s lifetime rules.
- The unsafe lifetime does not outlive any lifetime except itself, and no other lifetime outlives the unsafe lifetime.
- The unknown lifetime does not outlive any lifetime except itself, and no other lifetime outlives the unknown lifetime. However, tools should not emit lifetime verification errors for lifetime conversions involving unknown lifetimes.
Note that no variable lifetime a outlives any other variable lifetime b; our
annotation scheme does not permit specifying lifetime bounds between lifetimes
in the way that
Rust does.
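The outlives relation defined above can be modelled in a few lines of C++. The following is a minimal sketch and not part of any tool: the Kind enum, the Lifetime struct and its fields, and the convention that a smaller scope_rank means the associated local object is destroyed later are all assumptions made for illustration.

```cpp
#include <string>

// Hypothetical model of the lifetime categories described above.
enum class Kind { kStatic, kVariable, kLocal, kUnsafe, kUnknown };

struct Lifetime {
  Kind kind;
  std::string name;  // Distinguishes variable lifetimes, e.g. "a" vs. "b".
  int scope_rank;    // Local lifetimes only (assumption): a smaller rank means
                     // the associated object is destroyed later.
};

bool Same(const Lifetime& x, const Lifetime& y) {
  return x.kind == y.kind && x.name == y.name && x.scope_rank == y.scope_rank;
}

// Returns true if `x` outlives `y` according to the rules listed above.
bool Outlives(const Lifetime& x, const Lifetime& y) {
  if (Same(x, y)) return true;  // Any lifetime outlives itself.
  switch (x.kind) {
    case Kind::kStatic:
      return y.kind == Kind::kVariable || y.kind == Kind::kLocal;
    case Kind::kVariable:
      return y.kind == Kind::kLocal;  // Never another variable lifetime.
    case Kind::kLocal:
      return y.kind == Kind::kLocal && x.scope_rank < y.scope_rank;
    case Kind::kUnsafe:
    case Kind::kUnknown:
      return false;  // These only outlive themselves.
  }
  return false;
}
```

Note that this models only the relation itself. A verification tool would additionally suppress errors for conversions that involve the unknown lifetime, as described above.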
Here are the rules for the correctness of lifetime conversions:
- Lifetime-converting a non-const pointer of type T_from *[[lifetime(l_from)]] to type T_to *[[lifetime(l_to)]] is lifetime-correct if and only if
  - l_from outlives l_to, and
  - any lifetimes in T_from and T_to are identical.
- Lifetime-converting a const pointer of type T_from * const [[lifetime(l_from)]] to type T_to * const [[lifetime(l_to)]] is lifetime-correct if and only if
  - l_from outlives l_to, and
  - converting T_from to T_to is lifetime-correct.
- The rules for converting references are analogous to those for converting pointers.
- An object of a class T with lifetime parameters may not be converted to an object of the same class T but with different lifetime parameters; see also the sections on variance and special member functions.
We will describe the lifetime-correctness rules for certain other constructs in the specific sections that deal with those constructs below.
lifetime_cast
To permit building safe abstractions on top of APIs that use unsafe lifetimes,
we provide a way to cast unsafe lifetimes to safe lifetimes and vice versa using
a function template called lifetime_cast [2]. A lifetime_cast is similar to
C++ cast operations such as const_cast and static_cast but may only be used
to change lifetimes.
Obviously, code that uses lifetime_cast must guarantee that the operation is
actually lifetime-correct, i.e. that there is no risk of a use-after-free. Like
unsafe code in Rust, uses of lifetime_cast should therefore be carefully
reviewed and constrained to small parts of the codebase.
lifetime_cast is a function template defined suitably such that the call
lifetime_cast<T>(e) evaluates to e and does not perform any copy or move
operations. Tools will assume that the lifetimes of the result are those
specified in the template argument for T. Apart from lifetime attributes, T
must be the same as decltype(e).
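One way such a function template could be written is sketched below. This is an assumption for illustration and not necessarily the definition the authors have in mind: the second template parameter U and the static_assert are additions of this sketch. Because lifetime attributes are not part of the type, an ordinary is_same check is enough to enforce that only lifetimes may differ.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical sketch: T is spelled by the caller (with lifetime attributes in
// annotated code); U is deduced from the argument.
template <class T, class U>
constexpr U&& lifetime_cast(U&& e) noexcept {
  using Bare = typename std::remove_cv<
      typename std::remove_reference<U>::type>::type;
  static_assert(
      std::is_same<typename std::remove_cv<T>::type, Bare>::value,
      "lifetime_cast may only change lifetimes, not the type");
  // Forwarding a reference performs no copy or move of the referenced object.
  return std::forward<U>(e);
}
```

With this sketch, lifetime_cast<int*>(p) for an lvalue p yields a reference to p itself, which matches the requirement that the call evaluates to e without copying or moving it.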
A typical use case for lifetime_cast would be building a container such as
std::vector on top of raw memory allocation primitives such as operator new.
For example, one of the constructors for a vector might look like this:
template <class T>
vector<T>::vector(size_t size) [[lifetime(a)]]
    : size_(size), capacity_(size),
      data_(lifetime_cast<T *[[lifetime(a)]]>(new T[size])) {}
Concise syntax using macros
Even with lifetime elision, there is a potential concern that the annotations
will introduce excessive clutter. A lifetime in Rust typically requires only two
characters, e.g. 'a. In contrast, the attribute proposed above,
[[lifetime(a)]], requires at least 15 characters, or more if the attribute is
scoped inside a namespace.
To reduce verbosity, we suggest providing a macro with a short name that expands
to the actual lifetime attribute. The single-character macro name “$” is not
in widespread use in many codebases [3]; a codebase maintainer would obviously
want to consider carefully what to use it for, but we think lifetimes could be a
worthwhile use. In addition to a general $(lifetime)
macro, we could also define lifetime macros $a through
$z to allow an even more concise annotation. As an example, this is
what the smaller() example from the beginning would look like with
this concise syntax:
const std::string &$a smaller(
const std::string &$a s1,
const std::string &$a s2);
For a more extensive example, see
appendix B, which shows
what std::string_view would look like with these annotations.
Every codebase can of course define its own macro shortcuts that work within the
context of the codebase. A more traditional and still concise macro name would
be LT, with additional macros LT_A through LT_Z for concise single-letter
lifetimes.
For brevity, in the examples that follow, we will use the $ convention.
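As a rough sketch of how such shorthand macros might be defined, the following uses the more traditional LT spelling mentioned above. The ENABLE_LIFETIME_ATTRS switch is a hypothetical name introduced here; when it is not defined, the macros expand to nothing, so annotated code still compiles on toolchains that know nothing about lifetimes.

```cpp
#include <string>

// ENABLE_LIFETIME_ATTRS is a hypothetical opt-in switch for this sketch.
#ifdef ENABLE_LIFETIME_ATTRS
#define LT(...) [[lifetime(__VA_ARGS__)]]
#else
#define LT(...)  // Expands to nothing: annotations become documentation only.
#endif
#define LT_A LT(a)
#define LT_B LT(b)

// The smaller() example written with the LT_A shorthand.
const std::string& LT_A smaller(const std::string& LT_A s1,
                                const std::string& LT_A s2) {
  return s2 < s1 ? s2 : s1;
}
```

Keeping the fallback empty relies on the property noted earlier: lifetimes are not part of the type, so removing the attributes does not change the meaning of the program.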
Pointers and References
As already noted, pointers and references can be annotated with a lifetime, which specifies the lifetime of the object the pointer or reference refers to (the pointee). The lifetime of the pointee must outlive the lifetime of the pointer or reference itself.
For example, let’s look at the example of a double pointer int * $a * $b. The
annotation $b on the outer pointer specifies the lifetime of the inner pointer
of type int *; the annotation $a on the inner pointer specifies the lifetime
of the int. When these lifetime variables are substituted with constant
lifetimes, the lifetime substituted for $a must outlive the lifetime
substituted for $b. This ensures that the int lives for at least as long as
the int * pointer that refers to it.
Functions
Lifetime attributes may be placed in the parameter types and return type of a
function or function type. In addition, for non-static member functions, a
lifetime attribute may be placed after the function declaration to describe the
lifetime of the object the member function is called on, i.e. the lifetime of
the implicit this parameter.
If a translation unit contains multiple declarations of the same function (including its definition), the lifetime attributes in all declarations must be the same.
As in Rust, a function is considered to be parameterized by the lifetimes that
appear in its signature. To express this, a lifetime_param attribute
containing the variable lifetime parameters may be placed in front of the
function definition, like this:
[[lifetime_param(a)]]
int *$a ReturnPtr(int *$a p) {
return p;
}
However, for brevity, this lifetime_param attribute may and should be left out
in most cases. The exception to this is when the signature of the function
contains a function type that itself contains lifetimes; in this case, a
lifetime_param attribute must be added to disambiguate whether the lifetime
should be considered a parameter of the function type or the function. For
example:
// Lifetime $a is a parameter of the function type int*(int*).
void AddCallback(std::function<int *$a(int *$a) [[lifetime_param(a)]]> f);
// Lifetime $a is a parameter of the function AddCallback().
[[lifetime_param(a)]]
void AddCallback(std::function<int *$a(int *$a)> f, int *$a p);
Lifetime parameters on function types are analogous to higher-ranked trait bounds in Rust; unlike Rust, however, we only allow this concept in the context of function types, which is where it is typically required.
TODO: Show an example where we're passing a pointer to a local variable into the callback and discuss how this is allowed in the HRTB case but not the other case.
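Pending that example, here is a tentative sketch of the difference, derived from the outlives rules above rather than from any tool output. The names CallWithLocal and CallWithArg are hypothetical, and the LT and LT_PARAM macros expand to nothing so that the code compiles without lifetime-aware tooling.

```cpp
#include <functional>

// Annotation macros expand to nothing so the sketch compiles anywhere.
#define LT(...)
#define LT_PARAM(...)

// Lifetime a is a parameter of the function type: the callback has to work for
// every lifetime, so passing a pointer to a local variable is fine.
int CallWithLocal(std::function<int* LT(a)(int* LT(a)) LT_PARAM(a)> f) {
  int local = 41;
  int* result = f(&local);  // `result` may point to `local`; used only here.
  return *result + 1;
}

// Lifetime a is a parameter of CallWithArg itself: the caller chooses a, and
// no local lifetime outlives a variable lifetime, so f(&local) would not be
// lifetime-correct here. Only pointers that already have lifetime a may be
// passed to f.
LT_PARAM(a)
int CallWithArg(std::function<int* LT(a)(int* LT(a))> f, int* LT(a) p) {
  return *f(p) + 1;
}
```

In both cases the callback may be the same lambda, for example one that returns its argument; what differs is which pointers the body of the function is allowed to hand to it.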
Lifetime-converting a function pointer, which we call from, to a function
pointer of the same type but with different lifetimes, which we call to, is
lifetime-correct if from has either the same lifetimes as to or lifetimes that
are more permissive. This means that we must be able to substitute the lifetime
parameters of from with lifetime parameters of to such that:
- Every parameter of to is lifetime-convertible to the corresponding parameter of from. (Note the direction of the conversion, which is reversed from what one might initially expect. The idea is that from needs to be able to stand in for to, so we need to be able to convert the parameters of to to the parameters of from. [4])
- The return type of from is lifetime-convertible to the return type of to.
Similarly, a virtual member function Derived::f that overrides a base class
function Base::f must have either the same lifetimes or lifetimes that are
more permissive. This means that we must be able to substitute the
lifetime parameters of Derived::f with lifetime parameters of Base::f such
that:
- Every parameter of Base::f is lifetime-convertible to the corresponding parameter of Derived::f.
- The return type of Derived::f is lifetime-convertible to the return type of Base::f.
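To make the override rule concrete, here is a small sketch in which the overriding function is more permissive than the one it overrides. The class and function names are hypothetical, and the LT macro expands to nothing so the code compiles on any compiler.

```cpp
#define LT(...)

const char kDerivedName[] = "derived";  // Static storage duration.

struct Base {
  virtual ~Base() = default;
  // The result may refer to the argument, so both share lifetime a.
  virtual const char* LT(a) Name(const char* LT(a) fallback) const {
    return fallback;
  }
};

struct Derived : Base {
  // More permissive: a static result is lifetime-convertible to lifetime a for
  // any a, and the parameter has the same lifetime as in Base::Name.
  const char* LT(static) Name(const char* LT(a)) const override {
    return kDerivedName;
  }
};
```

A caller that only sees Base::Name assumes the result lives as long as the argument. Derived::Name returns something that lives longer, so that assumption still holds. The reverse arrangement, a static result in Base and lifetime a in Derived, would not satisfy the rule.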
A function call is lifetime-correct if the lifetime parameters of the callee can be substituted by lifetimes from the caller in such a way that converting all arguments to the respective parameter lifetimes is lifetime-correct. If no such substitution can be found, the function call is not lifetime-correct.
Here is an example that illustrates how this works:
void copy_ptr(int *$x from, int *$x *$y to) {
  *to = from;
}
int *$a return_ptr(int *$a p) {
  int* copy;
  copy_ptr(p, &copy);
  return copy;
}
First of all, the copy pointer is inferred to have lifetime $a because it is
used in the return statement. Let’s use the name $local1 for the lifetime of
the copy variable itself.
Now let’s look at the call to copy_ptr. If we make the substitutions $x =
$a and $y = $local1, we see that the lifetimes of the arguments are
identical to those of the parameters, so it is trivially correct to
lifetime-convert them.
Assume now that return_ptr had been declared with different lifetimes for its
parameter and return type:
int *$a return_ptr(int *$b p) {
  int* copy;
  copy_ptr(p, &copy); // Error, not lifetime-correct.
  return copy;
}
Again, the copy pointer has lifetime $a. If we choose the substitution
$x = $a, we can lifetime-convert the second argument but not the first
argument (we need an int *$a but we have an int *$b). If we choose $x =
$b, we can lifetime-convert the first argument but not the second argument (we
need an int *$b * but we have an int *$a *).
Because there is no substitution we can make for $x that allows a
lifetime-correct conversion of the arguments of copy_ptr to the respective
parameter lifetimes, the call is not lifetime-correct.
Lifetime elision
As in Rust, to avoid unnecessary annotation clutter, we allow lifetime annotations to be elided (omitted) from a function signature when they conform to certain regular patterns. Lifetime elision is merely a shorthand for these regular lifetime patterns. Elided lifetimes are treated exactly as if they had been spelled out explicitly; in particular, they are subject to lifetime verification, so they are just as safe as explicitly annotated lifetimes.
We adopt the same lifetime elision rules as Rust. We will expand on the rationale for this below, but first let us present the rules.
We call lifetimes on parameters input lifetimes and lifetimes on return values output lifetimes. There are three rules:
- Each input lifetime that is elided (i.e. not stated explicitly) becomes a distinct lifetime.
- If there is exactly one input lifetime (whether stated explicitly or elided), that lifetime is assigned to all elided output lifetimes.
- If there are multiple input lifetimes but one of them applies to the implicit this parameter, that lifetime is assigned to all elided output lifetimes.
If a function signature contains a function type (in a parameter or the return value), lifetime elision is performed separately for any lifetimes that occur in this function type, independent of the lifetimes in the surrounding function signature. Any elided lifetimes within the function type become lifetime parameters of the function type. See also the discussion of lifetime parameters on function types in this section.
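The three rules can be illustrated by writing each function once with elided lifetimes and once with the lifetimes spelled out. This is a sketch with hypothetical function names; the LT macro expands to nothing here, which is also why the elided and explicit declarations can coexist as declarations of the same function.

```cpp
#define LT(...)

// One input lifetime: rule 1 names it, rule 2 assigns it to the result.
const int* Identity(const int* p);               // Elided form.
const int* LT(a) Identity(const int* LT(a) p) {  // Equivalent explicit form.
  return p;
}

// Two inputs and no reference-like result: rule 1 gives each input its own
// lifetime, and there is no output lifetime to fill in.
bool SameTarget(const int* p, const int* q);
bool SameTarget(const int* LT(a) p, const int* LT(b) q) { return p == q; }

struct Counter {
  int value = 0;
  // Elided form. There are two inputs (`this` and `delta`), so rule 3 applies
  // and the result gets the lifetime of `this`.
  int& Add(const int& delta);
};

// Equivalent explicit form, using s for `this` and d for `delta`.
int& LT(s) Counter::Add(const int& LT(d) delta) LT(s) {
  value += delta;
  return value;
}
```

A function such as smaller(), which has two inputs, no implicit this parameter, and a reference result, is not covered by any of the rules and therefore needs explicit annotations.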
Lifetime elision rules have two requirements:
- They need to be easy for a programmer to remember and apply.
- They should be applicable to as many functions as possible, i.e. they should maximize the percentage of functions whose lifetime semantics correspond to the elided lifetimes. Put differently, they should minimize the percentage of functions which need explicit, non-elided lifetimes.
There is some alignment between these requirements, but some tension too. Working out what the best set of rules is likely requires quite a bit of testing. Instead of doing this, we have for the time being adopted the same set of rules that Rust uses, which presumably have a lot of collective experience embedded in them. The underlying assumption is that Rust and C++ functions do similar things with lifetimes in their interfaces; this assumption seems passable, though surely not perfect. An added benefit of using the Rust rules is that programmers using both languages don't need to keep two sets of rules in their head.
Once we have static analysis tooling that can run on real-world codebases, we may do some tweaking of the lifetime elision rules, but there would need to be clear benefits to justify giving up commonality with Rust.
Introducing lifetimes to a codebase will have to happen incrementally, and this requires some additional considerations. During the transition, there will be some files that have not yet been annotated, and we may indeed decide to exclude some parts of the code base from annotation permanently. Lifetime elision should not be applied to files that have not been annotated or verified for lifetime correctness; instead, the lifetimes should be assumed to be unknown, as described above.
We propose using a pragma or suitable comment string to mark source files where lifetime elision is allowed, e.g.:
#pragma clang lifetime_elision
Static member variables and non-member variables
Static member variable declarations and non-member variable declarations need not contain lifetime attributes but may do so for clarity.
In general, it may not even be possible to annotate a local variable correctly with the current lifetime annotation syntax. This happens when a local variable may refer to objects of different, unrelated lifetimes. Such a situation is entirely permissible; lifetime inference and verification tools need to deal with this by using a richer internal representation for the lifetimes of local variables.
If a variable has static storage duration, all lifetimes in its type are
implicitly assumed to be static. Any manual annotations that are present may
only specify the lifetimes static or unsafe.
Taking the address of a static member variable or non-member variable yields a
pointer with a lifetime that depends on the variable’s storage duration. If the
variable has static storage duration, the pointer has static lifetime. If the
variable has automatic storage duration, the pointer has a local lifetime.
Classes and non-static member variables
A class may be annotated with one or several lifetime parameters by placing the
new attribute lifetime_param in the class declaration, and a class annotated
in this way is considered to be a reference-like type. All declarations of a
class must be annotated with the same lifetime parameters. (See
appendix A for a formal
description of the attribute syntax.)
When lifetime parameters are substituted with constant lifetime arguments, all of these lifetime arguments must outlive the lifetime of the object they are applied to. This is analogous to the corresponding rule for pointers and references.
Lifetime parameters are necessary when an object of the class contains
references to data that has a different lifetime than the object itself; the
standard C++ types
std::string_view
and std::span
are examples of this.
The lifetime parameters may be used in the declarations of non-static member functions and non-static member variables of the class.
As an example, here is how parts of std::string_view might be annotated [5]:
class [[lifetime_param(a)]] string_view {
 public:
  string_view(const char *$a data, size_type len)
      : ptr_(data), len_(len) {}
  const char *$a data() const { return ptr_; }
  string_view $a substr(size_t pos, size_t count) const;

 private:
  const char *$a ptr_;
  size_t len_;
};
All reference-like types in the declaration of a non-static member variable must
be annotated with the lifetimes static, unsafe, or one of the lifetime
parameters of the class.
If a class contains owning pointers to manually allocated memory, these pointers
will typically be annotated with an unsafe lifetime. Collection types such as
std::vector are examples of this. Member functions that provide access to the
owned memory will typically perform a lifetime_cast to the lifetime of the
owning object. For example, std::vector::at() has the lifetime signature T& $a std::vector<T>::at(size_type) $a.
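The pattern described above might look roughly like this for a fixed-size buffer of ints. IntBuffer is a hypothetical class, the lifetime_cast shown is a simplified stand-in that just returns its argument, and the LT macro expands to nothing so the sketch compiles without lifetime-aware tooling.

```cpp
#include <cstddef>
#include <stdexcept>

#define LT(...)

// Simplified stand-in for lifetime_cast: returns its argument unchanged.
template <class T, class U>
U&& lifetime_cast(U&& e) noexcept { return static_cast<U&&>(e); }

class IntBuffer {
 public:
  explicit IntBuffer(std::size_t size) : size_(size), data_(new int[size]()) {}
  ~IntBuffer() { delete[] data_; }
  IntBuffer(const IntBuffer&) = delete;
  IntBuffer& operator=(const IntBuffer&) = delete;

  // The returned reference is valid for as long as *this is (lifetime a).
  int& LT(a) at(std::size_t i) LT(a) {
    if (i >= size_) throw std::out_of_range("IntBuffer::at");
    int* LT(a) elements = lifetime_cast<int* LT(a)>(data_);
    return elements[i];
  }

 private:
  std::size_t size_;
  int* LT(unsafe) data_;  // Owning pointer obtained from operator new.
};
```

The cast is justified by an invariant that tooling cannot see: data_ is allocated in the constructor and released only in the destructor, so the elements live exactly as long as the IntBuffer. This is the kind of reasoning that a reviewer of a lifetime_cast would need to check.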
A class is not required to use any of its lifetime parameters; it may declare lifetime parameters solely for the purpose of associating a lifetime with objects of the class.
Derived classes
Derived classes inherit the lifetime parameters of their base classes. It is not permissible to add lifetime parameters to a derived class; in other words, all lifetime parameters need to be declared on the base class. If a derived class has multiple base classes, only one of these base classes may declare lifetime parameters.
TODO: Having a derived class “silently” inherit the lifetime parameters of its base classes isn’t great because it doesn't make the lifetime parameters of the derived class visible at the place where it is defined. We should instead consider requiring the lifetime parameters to be re-declared.
The motivation for this rule is to cover the case where a call to a virtual member function in the base class may access member variables of reference-like type in a derived class. A similar situation exists when casting a pointer from the base class to the derived class. In both cases, we want all lifetimes that are relevant to the derived class to be known on the base class.
Special member functions
Special member functions can be annotated with lifetimes just like other member functions, but they deserve special attention because they can be implicitly declared and because they are central to the semantics of C++ value types.
The default constructor and destructor are trivial as they only take a single
reference-like parameter, the implicit this parameter, so we will not discuss
them further.
The lifetimes in the copy and move operations for a type A without lifetime
parameters are as follows (using $s and $o as mnemonics for “self” and
“other”):
A(const A& $o) $s;
A(A&& $o) $s;
A& $s operator=(const A& $o) $s;
A& $s operator=(A&& $o) $s;
Conveniently, these are the lifetimes that are implied by lifetime elision, so they would be omitted in practice.
The implication of these lifetimes is that it is possible to move or assign an
object of type A to another object with a different lifetime.
The situation is slightly more complicated for a type with lifetime parameters. As an example, consider the following class:
struct [[lifetime_param(p)]] B {
int* $p p;
};
(The special member functions are implicitly defaulted.)
The lifetimes of the special member functions on B are as follows:
B(const B $p & $o) $s;
B(B $p && $o) $s;
B& $s operator=(const B $p & $o) $s;
B& $s operator=(B $p && $o) $s;
Note that while the lifetimes of the “self” and “other” objects themselves are
different, their lifetime parameters are the same. This implies that the copy
and move operations cannot extend the lifetime of B::p.
The lifetimes above are not the same as those implied by lifetime elision. Classes with lifetime parameters that use the defaulted copy and move operations need to add explicitly defaulted definitions for these operations.
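A sketch of what these explicitly defaulted definitions might look like for B follows. The converting constructor is an addition of this sketch (declaring the copy and move operations means B is no longer an aggregate), and the LT and LT_PARAM macros are placeholders that expand to nothing.

```cpp
#define LT(...)
#define LT_PARAM(...)

struct LT_PARAM(p) B {
  explicit B(int* LT(p) ptr) : p(ptr) {}

  // Explicitly defaulted so that the lifetimes can be spelled out: "self" (s)
  // and "other" (o) differ, but both objects share lifetime parameter p.
  B(const B LT(p)& LT(o)) LT(s) = default;
  B(B LT(p)&& LT(o)) LT(s) = default;
  B& LT(s) operator=(const B LT(p)& LT(o)) LT(s) = default;
  B& LT(s) operator=(B LT(p)&& LT(o)) LT(s) = default;

  int* LT(p) p;
};
```

The behavior of the operations is unchanged by the annotations; they only record that a copy of a B refers to the same pointee, with the same lifetime, as the original.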
Alias declarations
Alias declarations can declare lifetime parameters in a similar way to classes.
These lifetime parameters can then be used on the right-hand side of the alias
declaration. In addition, any alias declaration, whether it has lifetime
parameters or not, can use the lifetimes static and unsafe on its right-hand
side.
If an alias declaration is contained inside a class, its right-hand side may not use any lifetime parameters of that class. This is because, in general, an instance of the alias type has no connection to an instance of the class.
Here is an example for an alias declaration with lifetime parameters, again
using std::string_view:
class [[lifetime_param(a)]] string_view {
public:
// ...
using const_iterator [[lifetime_param(i)]] = const char *$i;
const_iterator $a begin() const;
const_iterator $a end() const;
// ...
};
Note that the lifetime_param attribute comes after the type alias name,
whereas in a class declaration it comes before the class name. This may seem
inconsistent, but the placement is dictated by the C++ grammar.
So far, we have pretended that string_view is a class, but it is in fact
itself an alias declaration for basic_string_view<char>, and this alias
declaration therefore has a lifetime parameter:
template <class T> class [[lifetime_param(a)]] basic_string_view {
// ...
};
using string_view [[lifetime_param(a)]] = basic_string_view<char> $a;
The interpretation of this is that string_view is a type with a lifetime
parameter a, and that this lifetime parameter should be forwarded to the
lifetime parameter of basic_string_view<char>.
Templates
A function template or class template may be annotated with lifetime
attributes and, in the case of class templates, a lifetime_param attribute,
just like a non-template function or class.
Explicit template instantiations may not contain lifetime attributes.
Lifetime-correctness of a template may, in general, depend on the template arguments. A template is lifetime-correct if there exists at least one set of arguments for which no specialization exists, that do not result in substitution failure, and for which the specialized template is lifetime-correct.
In general, therefore, lifetimes can only be inferred and verified on a template instantiation. This implies that inference and verification may need to be done multiple times if the same template instantiation is used in multiple translation units. This is slightly unfortunate, but there does not seem to be a good way around it, and it mirrors the fact that such a template instantiation is also compiled multiple times.
To the extent that it is possible to infer and verify lifetimes on the template itself, independent of the template arguments, tooling should do this. In other words, a lifetime-correctness error should be flagged if there is no set of template arguments for which the specialized template is lifetime-correct. Lifetimes should be inferred if they are correct for any template arguments for which no specialization exists and which do not result in a substitution failure.
Partial template specializations should be treated the same way as primary template definitions: Tooling should infer and verify lifetimes on the partial specialization to the extent that this can be done independent of the template arguments.
Full template specializations should be treated the same way as non-template functions and classes: Lifetimes should be inferred and verified on the full template specialization.
When analyzing code that uses a template for which partial or full specializations exist, tooling must of course make sure to refer to the correct specialization.
Function templates
A function template’s type arguments as well as other dependent types may, in general, be reference-like types. Therefore, when a function template instantiation is used (either by calling it or by taking its address), tooling should do the following:
- Verify the lifetime-correctness of the function template instantiation.
- Infer lifetimes for all reference-like types in the signature of the function template instantiation, except for reference-like types that occur in the function template itself and are already annotated with lifetimes there.
The lifetimes inferred for the specialized function template should be used when inferring and verifying lifetimes of functions that use the specialized function template.
Class templates
The type arguments to a class template may be reference-like types. A class
template that is specialized with reference-like types in this way is itself
considered to be a reference-like type. The specialized class template has a
lifetime parameter for each reference-like type that occurs in the template
arguments; these lifetime parameters are in addition to any lifetime parameters
that are annotated on the class template itself using the lifetime_param
attribute. The lifetime parameters associated with a template argument are
implicitly propagated to all uses of that argument in the class template.
TODO: Add a discussion of template template arguments
Lifetimes are assigned to a specialized class template’s lifetime arguments as for any other reference-like type, i.e. depending on the context in which the specialized class template is used they may be explicitly annotated, implied by lifetime elision, or inferred. However, there is a syntactical difference: When lifetimes are explicitly annotated, they are placed in the template arguments instead of after the type, as they would be for other lifetime parameters. For example, here is a function that takes a vector of pointers and returns an element of the vector:
int* $a get_ith(const std::vector<int* $a>& $b v, size_t i) {
return v[i];
}
TODO: Discuss dependent types.
A member function of a specialized class template need not be lifetime-correct for all possible assignments of the lifetime parameters associated with the template arguments.
Instead, we only require that every ODR-use of a member function of a specialized class template is lifetime-correct for the lifetimes assigned to the lifetime parameters for that particular use.
A (slightly contrived) example will help to illustrate why these rules are written the way they are.
template <class From, class To>
struct Convert {
To convert(From from) { return from; }
};
void constify(int* [[lifetime(a)]] p,
const int *[[lifetime(a)]] *[[lifetime(b)]] pp) {
Convert<int*, const int*> c;
*pp = c.convert(p);
}
The specialized class template Convert<int*, const int*> has two lifetime
parameters: one lifetime parameter (which we will call x) for the int*
template argument, and one lifetime parameter (which we will call y) for the
const int* template argument.
The Convert::convert() member function is not lifetime-correct if we consider
x and y to be arbitrary variable lifetimes, as it is not lifetime-correct to
lifetime-convert an int * with lifetime x to a const int * with lifetime
y.
However, for the use of Convert in the declaration of c, we infer that both
x and y should be substituted by the lifetime a. Convert::convert() is
lifetime-correct when x and y are substituted in this way.
TODO: Do we need to make this distinction between lifetime parameters and the lifetimes they are substituted with, or can we make the substitution directly?
Variance
As in Rust, we need to establish some variance rules for type and lifetime parameters, but the specific rules differ slightly from Rust.
- Const references and pointers const T & and const T * are covariant with respect to T.
- Non-const references and pointers T & and T * are invariant with respect to T.
- Class templates are invariant with respect to their type parameters (including lifetimes contained in them).
- All lifetime-parameterized types (classes and alias declarations) are invariant with respect to their lifetime parameters.
The last two rules differ from Rust, which infers the variance of type and lifetime parameters on user-defined types. Unlike Rust generics, C++ class templates are invariant with respect to their type parameters [6], and we want to be consistent with this.
Regarding lifetime parameters on types, we restrict ourselves to invariance for simplicity. Rust infers the variance of lifetime parameters from the way they are used in the definition of the type, but in C++, this is impossible to do, at least on a single-translation-unit basis, as a lifetime-parameterized class may only be forward-declared in the current translation unit. For simplicity, and consistency with template parameters, we have therefore decided that lifetime parameters will always be invariant, as we expect this to be sufficient in practice. If this turns out to be too limiting, we may need to provide a way of annotating the variance of lifetime parameters.
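The reason non-const pointers have to be invariant in their pointee can be seen in a short sketch. The names Store, g_ptr, and ReadThroughLocalSlot are hypothetical, and the LT macro expands to nothing, so the compiler accepts this code as is; the comments describe what a lifetime verification tool would be expected to conclude from the rules above.

```cpp
#define LT(...)

int g_value = 0;
int* LT(static) g_ptr = &g_value;

// Stores `value` through `slot`. Both pointees must share lifetime a.
void Store(int* LT(a)* LT(b) slot, int* LT(a) value) { *slot = value; }

int ReadThroughLocalSlot() {
  int local = 7;
  int* slot = nullptr;
  Store(&slot, &local);  // Fine: a can be the local lifetime of `local`.
  // Store(&g_ptr, &local);
  // The call above must be rejected. Accepting it would require converting
  // `int *$static *` to a pointer whose pointee has the local lifetime of
  // `local`, i.e. treating the non-const pointer as covariant, and g_ptr
  // would dangle once this function returns.
  return *slot;
}
```

Through a const pointer the store in Store() would not be possible, which is why covariance is safe in that case.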
Alternative considered: [[clang::lifetimebound]]
Clang already provides a limited ability to annotate lifetimes with the
[[clang::lifetimebound]] attribute [7].
Quoting from the documentation:
The lifetimebound attribute on a function parameter or implicit object parameter indicates that objects that are referred to by that parameter may also be referred to by the return value of the annotated function (or, for a parameter of a constructor, by the value of the constructed object).
If the lifetime annotation is applied to aggregates (arrays and simple structs), those aggregates are considered to refer to any pointers or references transitively contained within them.
Here, again, is the smaller() example, but annotated with
[[clang::lifetimebound]]:
const std::string& smaller(
const std::string& s1 [[clang::lifetimebound]],
const std::string& s2 [[clang::lifetimebound]]);
The attribute may also be applied to a member function to indicate that the
lifetime of the return value corresponds to the lifetime of the object. Here is
an example from the [[clang::lifetimebound]] documentation:
struct string {
// The returned pointer should not outlive ``*this``.
const char *data() const [[clang::lifetimebound]];
};
This is an example of the very common case where a member function returns a pointer or reference to part of the object, or to another object owned by it.
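For code that has to build with compilers other than Clang, the attribute is commonly wrapped in a macro. The sketch below shows one way to do that; LIFETIME_BOUND, longer(), and Label are hypothetical names chosen for this example.

```cpp
#include <string>
#include <utility>

// LIFETIME_BOUND is a hypothetical macro name; it falls back to nothing on
// compilers that do not report support for the attribute.
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(clang::lifetimebound)
#define LIFETIME_BOUND [[clang::lifetimebound]]
#endif
#endif
#ifndef LIFETIME_BOUND
#define LIFETIME_BOUND
#endif

// The result may refer to either argument.
const std::string& longer(const std::string& s1 LIFETIME_BOUND,
                          const std::string& s2 LIFETIME_BOUND) {
  return s1.size() >= s2.size() ? s1 : s2;
}

class Label {
 public:
  explicit Label(std::string text) : text_(std::move(text)) {}
  // The returned pointer should not outlive *this.
  const char* c_str() const LIFETIME_BOUND { return text_.c_str(); }

 private:
  std::string text_;
};
```

With the attribute in place, Clang is expected to warn when the result of longer() is bound to a reference while one of the arguments is a temporary that dies at the end of the full-expression.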
The [[clang::lifetimebound]] attribute provides a way to express lifetimes in
many common scenarios, but it does have its limitations:
- There is no way to differentiate between different lifetimes.
- There is no way to annotate a static lifetime.
- The attribute attaches to function parameters and always implicitly refers to the outermost reference-like type [8]; it is not possible to attach it to part of a type (e.g. to the T * in a const std::vector<T *> &).
- The single lifetime is implicitly applied to the outermost reference-like type in the function’s return type (or the value of the constructed object, in the case of a constructor). Again, it is not possible to associate the lifetime with inner reference types in the return value (e.g. the T * in const std::vector<T *> &).
- The lifetime of a constructor parameter can be associated with the lifetime of the object being constructed, i.e. with the lifetime of the this pointer, but this isn’t possible in other member functions. In other words, a member function cannot associate the lifetime of a parameter with the lifetime of the object the member function is called on.
- There is no way to add a lifetime parameter to a struct.
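To make the first limitation concrete, here is a minimal sketch. The `LIFETIMEBOUND` macro and the functions `both()` and `LifetimeboundExample()` are hypothetical names introduced only for this illustration; the macro expands to nothing on compilers that do not report support for the attribute.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical portability macro: expands to the Clang attribute where the
// compiler reports support for it, and to nothing elsewhere.
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(clang::lifetimebound)
#define LIFETIMEBOUND [[clang::lifetimebound]]
#endif
#endif
#ifndef LIFETIMEBOUND
#define LIFETIMEBOUND
#endif

// Expressible: the result may refer to either argument.
inline const std::string& smaller(const std::string& s1 LIFETIMEBOUND,
                                  const std::string& s2 LIFETIMEBOUND) {
  return s2 < s1 ? s2 : s1;
}

// Not expressible: `first` refers only to `a` and `second` only to `b`, but
// the attribute can only state that the returned value as a whole may refer
// to both parameters.
inline std::pair<const int*, const int*> both(const int& a LIFETIMEBOUND,
                                              const int& b LIFETIMEBOUND) {
  return {&a, &b};
}

// Usage sketch.
inline void LifetimeboundExample() {
  std::string apple = "apple", banana = "banana";
  assert(&smaller(apple, banana) == &apple);
  int x = 1, y = 2;
  std::pair<const int*, const int*> p = both(x, y);
  assert(p.first == &x && p.second == &y);
}
```

With only one implicit lifetime available, a checker has to assume that both members of the returned pair may refer to both arguments, which is more conservative than the function's actual behavior.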
Alternative considered: P1179 / -Wdangling-gsl
The WG21 proposal P1179 describes a static analysis that aims to prevent many common types of use-after-free. It uses an attribute-based annotation scheme to describe the lifetime contracts of functions and to annotate user-defined types containing indirections.
Preliminary implementations of this scheme exist in MSVC and a
fork of Clang. In addition, Clang
trunk implements statement-local warnings inspired by the scheme, which are
enabled by the on-by-default flag -Wdangling-gsl.
The scheme has both advantages and disadvantages compared to the scheme proposed here:
- Advantages
  - Can express independent pre- and postconditions for lifetimes, e.g. to annotate std::swap(ptr1, ptr2), where the lifetimes of the pointers after the call are swapped compared to before the call.
  - Can diagnose some cases of iterator invalidation.
- Disadvantages
  - User-defined types can only be annotated as having one of a class of fairly specific lifetime semantics (“SharedOwner”, “Owner”, “Pointer”); arbitrary annotation of classes with lifetime parameters is not possible.
  - Cannot refer to lifetimes of pointers in template arguments, e.g. no way to express int *$a return_first(const vector<int *$a> &$b v);
  - Annotations can be verbose and syntactically removed from the objects they refer to.
We believe the limitations of this scheme will restrict its usefulness in the use cases we are interested in. A more in-depth comparison of P1179 with our proposed scheme can be found here.
Appendix A: Lifetime attribute specification
This appendix describes where lifetime attributes may appear and what arguments they can take.
Temporary syntax
We are currently still experimenting with the exact syntax and semantics for the
lifetime annotations. While we are doing so, we will use the general-purpose
annotate and
annotate_type
attributes as stand-ins for the new attributes proposed below.
Attribute definitions
We introduce two new attributes, lifetime and lifetime_param. In practice,
these would be scoped to a namespace (probably clang), but for ease of
exposition, we assume they are in the global namespace.
Attribute lifetime_param
This attribute may be applied to the following:
- A class definition (more formally, it may appear in the attribute-specifier-seq of a class-head)
- An alias-declaration (specifically, the attribute-specifier-seq following the identifier)
The attribute takes one or more arguments. Each of these arguments must be an identifier [9]; each argument defines a lifetime parameter for the corresponding class.
If the class definition or alias declaration is nested within a class that
itself has a lifetime_param attribute, none of the lifetime parameter names of
the outer class may be used as lifetime parameter names on the nested class
definition or alias declaration.
Attribute lifetime
This attribute may be applied to the following:
- Types and pointer operators in a function declaration, member function declaration, non-static member variable declaration, or alias declaration
  More formally, within the return type, the trailing-return-type or the parameter-declaration-clause of a function declaration or member function declaration, within a non-static member variable declaration, or within the defining-type-id of an alias declaration, the attribute may appear:
  - In the attribute-specifier-seq of a decl-specifier-seq
  - In the attribute-specifier-seq of a type-specifier-seq
  - In the attribute-specifier-seq of a ptr-operator (both within a declarator and an abstract-declarator)
- A non-static member function declaration
  More formally, within a member-declarator for a non-static member function, the attribute may appear in the attribute-specifier-seq of the parameters-and-qualifiers.
The attribute takes one or more arguments, each of which must be an identifier
or the keyword static. We call these identifiers lifetime names.
In addition, the following constraints apply:
- When the lifetime attribute is applied to a type, the type must be a class type or alias declaration whose definition contains a lifetime_param attribute. The lifetime attribute must have the same number of arguments as the lifetime_param attribute on the corresponding class or alias declaration. (These arguments define lifetime parameters for the object instance.)
- When the lifetime attribute is applied to a pointer operator, it must take exactly one argument. (This defines a lifetime for the object referenced by the pointer operator.)
- When the lifetime attribute is applied to a non-static member function declaration, it must take exactly one argument. (This defines a lifetime for the implicit object parameter.)
- Every lifetime name that appears in a function’s return value must either be static or also appear either in
  - the function’s parameter list, or
  - the lifetime attribute for the implicit object parameter (in the case of a non-static member function), or
  - the lifetime_param attribute of the class (in the case of a non-static member function).
- For every constructor of a class that has a lifetime_param attribute, every lifetime name that appears in the lifetime_param attribute must appear in the constructor’s parameter list.
- Every lifetime name that appears in a non-static member variable declaration must either be static or one of the lifetime parameters declared in a lifetime_param attribute on the class containing the member variable declaration.
- Every lifetime name that appears in the defining-type-id of an alias declaration must either be static or one of the lifetime parameters declared in a lifetime_param attribute on the alias declaration. Note that if the alias declaration is nested within a class that also has lifetime parameters, those lifetime parameters may not appear in the defining-type-id of the alias declaration.
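As a rough illustration of several of these constraints, the sketch below spells the two attributes with Clang's general-purpose `annotate` and `annotate_type` attributes, as the "Temporary syntax" section suggests. The macro names `LIFETIME` and `LIFETIME_PARAM`, the category strings passed to the attributes, and the names `IntRef` and `pick_first` are assumptions made for this example only; on compilers without `annotate_type` the macros expand to nothing.

```cpp
#include <cassert>

// Hypothetical stand-in macros. The category strings are illustrative and are
// not claimed to match what any real tool expects.
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(clang::annotate_type)
#define LIFETIME(name) [[clang::annotate_type("lifetime", #name)]]
#define LIFETIME_PARAM(name) [[clang::annotate("lifetime_param", #name)]]
#endif
#endif
#ifndef LIFETIME
#define LIFETIME(name)
#define LIFETIME_PARAM(name)
#endif

// The class declares one lifetime parameter, `s`.
class LIFETIME_PARAM(s) IntRef {
 public:
  // Every lifetime parameter of the class appears in the constructor's
  // parameter list.
  explicit IntRef(int* LIFETIME(s) target) : target_(target) {}

  // `s` in the return type is covered by the class's lifetime parameter.
  int* LIFETIME(s) get() const { return target_; }

 private:
  // Member variables may only name `static` or the class's own parameters.
  int* LIFETIME(s) target_;
};

// `a` appears in the return type and therefore also in the parameter list;
// `b` is an independent lifetime. Each pointer operator takes one argument.
inline int* LIFETIME(a) pick_first(int* LIFETIME(a) first,
                                   int* LIFETIME(b) second) {
  (void)second;
  return first;
}

// Usage sketch.
inline void LifetimeAttributeExample() {
  int x = 5, y = 6;
  assert(pick_first(&x, &y) == &x);
  IntRef ref(&x);
  assert(ref.get() == &x);
}
```

The annotations have no effect on code generation; they only attach metadata that a separate analysis could read.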
Appendix B: std::string_view with lifetime annotations
To illustrate how lifetime annotations work on a larger code sample, here is an
annotated version of interesting parts of std::string_view. To keep the code
clear, we have omitted basic_string_view and simply stamped out string_view
for the template arguments used in its definition.
// Lifetime "s" is mnemonic for "lifetime parameter of string_view"
class LIFETIME_PARAM(s) string_view {
public:
using const_pointer LIFETIME_PARAM(iter_lifetime) = const char *$(iter_lifetime);
using const_reference LIFETIME_PARAM(iter_lifetime) = const char &$(iter_lifetime);
using const_iterator LIFETIME_PARAM(iter_lifetime) = const char *$(iter_lifetime);
using iterator LIFETIME_PARAM(iter_lifetime) = const_iterator $(iter_lifetime);
using const_reverse_iterator LIFETIME_PARAM(iter_lifetime) =
std::reverse_iterator<const_iterator $(iter_lifetime)>;
using reverse_iterator LIFETIME_PARAM(iter_lifetime) =
const_reverse_iterator $(iter_lifetime);
using size_type = size_t;
static constexpr size_type npos = static_cast<size_type>(-1);
constexpr string_view() noexcept;
constexpr string_view(const string_view $s & other) noexcept = default;
constexpr string_view(const char* $s data, size_type len);
constexpr const_iterator $s begin() const noexcept;
constexpr const_iterator $s end() const noexcept;
constexpr const_reverse_iterator $s rbegin() const noexcept;
constexpr const_reverse_iterator $s rend() const noexcept;
constexpr const_reference $s front() const;
constexpr const_reference $s back() const;
constexpr const_pointer $s data() const noexcept;
constexpr const_reference $s operator[](size_type i) const;
constexpr const_reference $s at(size_type i) const;
// The annotation cannot express that the lifetime parameter of `this` and
// `other` are swapped after the call, so we have to be overly restrictive and
// require `this` and `other` to have the same lifetime parameter.
constexpr void swap(string_view $s & other) noexcept;
// Output buffer may have a different lifetime than this string view's data.
size_type copy(char* buf, size_type n, size_type pos = 0) const;
// Returned substring has the same lifetime parameter as this `string_view`.
constexpr string_view $s substr(size_type pos = 0, size_type n = npos) const;
// `string_view` to compare against does not need to share the same lifetime.
constexpr int compare(string_view x) const noexcept;
private:
const char* $s ptr_;
size_type length_;
};
Notes
1. The attribute will be scoped to some suitable namespace, but for ease of exposition we assume here that it is placed in the global namespace.
2. lifetime_cast will be placed in a suitable namespace, but for ease of exposition, we assume here that it is in the global namespace.
3. “$” is not part of the standard set of characters allowed in C++ identifiers (including macro names), but the C++ standard permits implementations to allow additional implementation-defined characters, and gcc, Clang, and MSVC allow $ as an implementation-defined character.
4. More formally, this is because function types are contravariant in their parameter types.
5. For simplicity, we are showing std::string_view as if it was a non-template type.
6. Unless converting constructors and conversion constructors are used to simulate variance.
7. This attribute is inspired by the C++ Standards Committee paper P0936R0.
8. Quoting Richard Smith: “The Clang attribute behaves as if each type has exactly one associated lifetime, and the attribute says in which cases the outermost lifetime of a parameter matches the outermost lifetime of the return value.”
9. Note that this automatically disallows the special lifetime name static, which is allowed in lifetime attributes. We make no other constraints on identifiers, but codebases that want to use the lifetime annotations for C++ / Rust interop may want to enforce a rule that prohibits invalid Rust identifiers (e.g. Rust keywords) in the lifetime_param and lifetime attributes.
Static Analysis for C++ Lifetimes
Summary: We describe a static analysis that infers lifetimes in C++ function signatures.
NOTE: This document describes the approach we are currently pursuing but it is a) incomplete, and b) out of date. It has become clear that we are still making changes to the static analysis frequently enough that it does not seem worth updating a document in parallel with those changes. Once the static analysis appears reasonably stable, we plan to update this document to describe it.
Introduction
Lifetime analysis has two goals:
- Infer lifetime annotations to put in C++ function signatures, using the attributes described in this doc.
- Verify lifetime-correctness of function bodies.
To infer and verify lifetimes, we perform a pointer analysis [1]. For each pointer or other reference-like type, a pointer analysis determines a points-to set consisting of the storage locations it may point to.
There are different approaches to pointer analysis that can be classified according to various properties. The pointer analysis we perform here has the following properties:
- Intraprocedural, context-insensitive. We analyze each function individually and do not take into account how it is called from different callsites.
- Array-insensitive. We treat all elements in an array containing a reference-like type as having the same lifetime.
- Field-insensitive. We treat member variables of reference-like type as having the same lifetime as the object they are contained in (unless they carry a lifetime annotation).
- Flow-sensitive. When analyzing a function, we take statement ordering and control flow into account. We believe flow sensitivity is important to avoid inferring overly restrictive lifetimes and emitting false positive errors.
The pointer analysis we perform is relatively coarse-grained in that we do not distinguish between different storage locations with the same lifetime; equivalently, we can say that we identify a storage location merely by its lifetime.
A points-to set is therefore just a set of lifetimes; a reference-like object is also simply identified by its lifetime. The state that is tracked during the analysis is therefore just a mapping from a lifetime (identifying the reference-like object) to a set of lifetimes (identifying the storage locations it may point to).
This coarse-grained approach simplifies the analysis and is sufficient for our purposes because we are only attempting to infer and verify statements about lifetimes.
Analysis of a translation unit
We analyze all functions in a translation unit for which we have a definition.
We attempt to analyze all of these functions in topological order so that callees are analyzed before callers. Where recursion makes this impossible, we analyze the functions that take part in the recursive cycle in arbitrary order. We accept that this may make it impossible to infer lifetimes for functions in a recursive cycle.
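The ordering described here can be sketched as a depth-first post-order walk over the call graph. This is only an illustration under simplifying assumptions: functions are identified by name, and `CallGraph`, `VisitCallees`, and `AnalysisOrder` are hypothetical names, not part of the actual analysis.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical call graph: every function defined in the translation unit
// maps to the names of the functions it calls.
using CallGraph = std::map<std::string, std::vector<std::string>>;

inline void VisitCallees(const CallGraph& graph, const std::string& function,
                         std::set<std::string>& visited,
                         std::vector<std::string>& order) {
  CallGraph::const_iterator it = graph.find(function);
  if (it == graph.end()) return;  // No definition here: nothing to analyze.
  // Skip functions that are already scheduled. This also breaks recursive
  // cycles, whose members end up in an arbitrary relative order.
  if (!visited.insert(function).second) return;
  for (const std::string& callee : it->second) {
    VisitCallees(graph, callee, visited, order);
  }
  order.push_back(function);  // Post-order: callees precede their callers.
}

inline std::vector<std::string> AnalysisOrder(const CallGraph& graph) {
  std::set<std::string> visited;
  std::vector<std::string> order;
  for (const auto& entry : graph) {
    VisitCallees(graph, entry.first, visited, order);
  }
  return order;
}

// Usage sketch: `external` has no definition and is therefore not scheduled.
inline void AnalysisOrderExample() {
  CallGraph graph;
  graph["main_fn"] = {"helper", "leaf", "external"};
  graph["helper"] = {"leaf"};
  graph["leaf"] = {};
  const std::vector<std::string> expected = {"leaf", "helper", "main_fn"};
  assert(AnalysisOrder(graph) == expected);
}
```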
Analysis of a function
As explained in the introduction, we identify an object (often called a storage location in pointer analysis) merely by its lifetime.
A points-to set is therefore simply a set of lifetimes. It represents the set of objects that a reference-like type or glvalue can be referencing at some point of execution of the program. We will sometimes refer to the objects in a points-to set as pointees.
We associate each local variable in the function with a different local lifetime. This serves two purposes: a) It reflects the fact that local variables do, in general, have different lifetimes, and this is important for lifetime verification. b) It allows us to associate a different points-to set with different local variables of reference-like type, and this is required to make the analysis precise enough.
We perform a data-flow analysis using the Clang dataflow framework (documentation) to propagate points-to sets through the function. After the analysis is complete, we produce lifetime annotations from the points-to sets; if these lifetime annotations are different from existing annotations (ignoring pure renamings), we output the new annotations as suggested edits.
The data-flow analysis tracks the following state:
- For each reference-like object (identified by its lifetime), the points-to set of that reference-like object
- For each expression of reference-like type, the points-to set of the expression
- For each glvalue expression, the points-to set representing the glvalue’s referent
- If the function’s return type is a reference or pointer type, a points-to set for the return value
The join operation on points-to sets means taking the union of the two sets.
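A minimal sketch of this state and its join, assuming lifetimes are represented by small integer ids (the aliases `Lifetime`, `PointsToSet`, `PointsToMap` and the function `Join` are hypothetical names for illustration):

```cpp
#include <cassert>
#include <map>
#include <set>

// Assumption for this sketch: a lifetime is identified by an integer id.
using Lifetime = int;
using PointsToSet = std::set<Lifetime>;
// Maps the lifetime identifying a reference-like object to the lifetimes of
// the storage locations it may point to.
using PointsToMap = std::map<Lifetime, PointsToSet>;

// Join of two states at a control-flow merge: for every object, take the
// union of its points-to sets from both incoming states.
inline PointsToMap Join(const PointsToMap& a, const PointsToMap& b) {
  PointsToMap joined = a;
  for (const auto& entry : b) {
    joined[entry.first].insert(entry.second.begin(), entry.second.end());
  }
  return joined;
}

// Usage sketch: `pp = &p1;` in one branch and `pp = &p2;` in the other.
inline void JoinExample() {
  const Lifetime kP1 = 1, kP2 = 2, kPp = 3;
  PointsToMap then_branch, else_branch;
  then_branch[kPp] = {kP1};
  else_branch[kPp] = {kP2};
  const PointsToMap joined = Join(then_branch, else_branch);
  assert(joined.at(kPp) == (PointsToSet{kP1, kP2}));
}
```

Because the join only ever grows sets, repeatedly applying it around a loop reaches a fixed point once no set changes.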
The initial state for the data flow analysis is produced as follows:
- Associate each parameter of reference-like type with a points-to set containing a new unique regular lifetime representing the pointee.
- If the pointee is itself of reference-like type, recursively associate that pointee with a points-to set containing a new regular lifetime, and so on.
During the analysis, we propagate points-to sets through expressions and update the points-to sets of reference-like objects.
After the analysis is complete, we obtain lifetime annotations by examining the points-to sets of all parameters of reference-like type and the return value (if applicable), descending into pointees that are themselves of reference-like type.
For every points-to set, we look at the set of lifetimes of its pointees. If there are multiple lifetimes, they are substituted by a single lifetime. This lifetime then becomes the lifetime of the corresponding reference or pointer type in the signature.
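One way to picture the substitution step is as a small union-find style table that records which lifetimes have been replaced by which. This is a sketch under assumptions: integer lifetime ids, and the hypothetical names `LifetimeSubstitutions`, `Resolve`, `Unify`, and `Collapse`; the text above does not say how the substitution is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>

using Lifetime = int;  // Assumption: lifetimes are integer ids.

class LifetimeSubstitutions {
 public:
  // Follows recorded substitutions until reaching a lifetime that has not
  // been replaced by another one.
  Lifetime Resolve(Lifetime lifetime) const {
    std::map<Lifetime, Lifetime>::const_iterator it =
        replaced_by_.find(lifetime);
    while (it != replaced_by_.end()) {
      lifetime = it->second;
      it = replaced_by_.find(lifetime);
    }
    return lifetime;
  }

  // Records that `a` and `b` must be the same lifetime. The smaller id is
  // kept as the representative, so chains always terminate.
  void Unify(Lifetime a, Lifetime b) {
    a = Resolve(a);
    b = Resolve(b);
    if (a != b) replaced_by_[std::max(a, b)] = std::min(a, b);
  }

 private:
  std::map<Lifetime, Lifetime> replaced_by_;
};

// Replaces all lifetimes in a non-empty points-to set by a single lifetime
// and returns it.
inline Lifetime Collapse(const std::set<Lifetime>& pointee_lifetimes,
                         LifetimeSubstitutions& substitutions) {
  assert(!pointee_lifetimes.empty());
  const Lifetime first = *pointee_lifetimes.begin();
  for (Lifetime lifetime : pointee_lifetimes) {
    substitutions.Unify(first, lifetime);
  }
  return substitutions.Resolve(first);
}
```

For example, if a return value's points-to set contains the pointees of two different parameters, collapsing it forces both parameters and the return value to share one lifetime in the inferred signature.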
Here are some examples:
void foo(int* from, int** to) {
  // from_pointee (int): '1
  // to_pointee (int *): '2
  // to_pointee_pointee (int): '3
  // from: { from_pointee }
  // to: { to_pointee }
  // to_pointee: { to_pointee_pointee }
  *to = from;
  // to_pointee: { from_pointee, to_pointee_pointee }
}
TODO: Explain. Also talk about why, after the assignment *to = from, we keep
to_pointee_pointee in the points-to set and how we can, in some cases,
eliminate it. (Distinguish between scalar and aggregate pointees -- the latter
are arrays, for example. We can only delete existing pointees if *to has a
single pointee and it's scalar.)
int* target(int* p1, int* p2) {
  // p1_pointee (int): '1
  // p1: { p1_pointee }
  // p2_pointee (int): '2
  // p2: { p2_pointee }
  int** pp;
  if (foo()) {
    pp = &p1;  // pp: { p1 }
  } else {
    pp = &p2;  // pp: { p2 }
  }
  // pp: { p1, p2 }
  int local = 42;
  *pp = &local;  // glvalue on left side is { p1, p2 }, so:
                 // p1: { p1_pointee, local }
                 // p2: { p2_pointee, local }
  return p1;     // rval: { p1_pointee, local }
}
TODO: Explain. Also mention how this is an example where we have two pointees on the left hand side, so we can't eliminate existing pointees from p1 and p2.
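The rule hinted at in the TODOs (existing pointees may only be dropped when the written glvalue denotes exactly one scalar object) can be sketched as follows. The representation with integer lifetime ids and the name `AssignThrough` are assumptions made for this illustration.

```cpp
#include <cassert>
#include <map>
#include <set>

using Lifetime = int;  // Assumption: lifetimes are integer ids.
using PointsToSet = std::set<Lifetime>;
using PointsToMap = std::map<Lifetime, PointsToSet>;

// Models `*lhs = rhs`. `targets` is the points-to set of the glvalue `*lhs`
// (the objects that may be written) and `new_pointees` is the points-to set
// of `rhs`.
inline void AssignThrough(PointsToMap& state, const PointsToSet& targets,
                          const PointsToSet& new_pointees,
                          bool targets_are_scalar) {
  // Existing pointees may only be discarded if we know exactly which single
  // scalar object is overwritten; otherwise the old pointees may survive.
  const bool strong_update = targets.size() == 1 && targets_are_scalar;
  for (Lifetime target : targets) {
    PointsToSet& pointees = state[target];
    if (strong_update) pointees.clear();
    pointees.insert(new_pointees.begin(), new_pointees.end());
  }
}

// Usage sketch, following the `target()` example.
inline void AssignThroughExample() {
  const Lifetime kP1 = 1, kP2 = 2, kP1Pointee = 3, kP2Pointee = 4, kLocal = 5;
  PointsToMap state;
  state[kP1] = {kP1Pointee};
  state[kP2] = {kP2Pointee};
  // `*pp = &local;` with pp: { p1, p2 }. Two possible targets, so both keep
  // their old pointees and gain `local`.
  AssignThrough(state, {kP1, kP2}, {kLocal}, true);
  assert(state[kP1] == (PointsToSet{kP1Pointee, kLocal}));
  assert(state[kP2] == (PointsToSet{kP2Pointee, kLocal}));
  // A write to exactly one scalar object replaces its points-to set.
  AssignThrough(state, {kP1}, {kLocal}, true);
  assert(state[kP1] == (PointsToSet{kLocal}));
}
```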
Function calls
Here is how we handle function calls:
- Step 1: Create a mapping from callee lifetimes to points-to sets. For each variable lifetime that occurs in the callee's parameter list, find the union of all points-to sets in those argument positions to yield a mapping from lifetimes to points-to sets.
- Step 2: Propagate points-to sets to output parameters. For each lifetime 'l in an invariant argument position, replace the argument's existing points-to set with the points-to set established for 'l in Step 1.
- Step 3: Determine points-to set for the return value. If the return value is of reference-like type with lifetime 'l, find the points-to set established for 'l in Step 1; this becomes the points-to set for the call expression's value.
If the 'static lifetime occurs in output parameters (i.e. in invariant
position) or in the return value, the callee may be returning references to
pointees that do not occur as inputs to the callee. Therefore, when we encounter
the 'static lifetime in these positions, we create new pointees for the
corresponding outputs.
Here is an example of how this works:
void copy_ptr(int *'x from, int *'x *'y to) {
  *to = from;
}
int * get_lesser_of(int * arg1, int * arg2) {
  // arg1_pointee (int): '1
  // arg2_pointee (int): '2
  // arg1: { arg1_pointee }
  // arg2: { arg2_pointee }
  int* result = arg2;
  // result: { arg2_pointee }
  if (*arg1 < *arg2) {
    copy_ptr(arg1, &result);
    // &result: { result }
    // 'x pointees: { arg1_pointee, arg2_pointee }
    // result: { arg1_pointee, arg2_pointee }
  }
  return result;
}
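The three steps, applied to the `copy_ptr(arg1, &result)` call, can be sketched as below. This is a simplified model under assumptions: lifetimes are integer ids in the caller and single characters in the callee's signature, the `'static` case is ignored, and `ArgPosition` and `ApplyCall` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

using Lifetime = int;  // Assumption: caller-side lifetimes are integer ids.
using PointsToSet = std::set<Lifetime>;
using PointsToMap = std::map<Lifetime, PointsToSet>;

// One reference-like position in the callee's parameter list.
struct ArgPosition {
  char callee_lifetime;  // Lifetime name in the callee signature, e.g. 'x'.
  PointsToSet pointees;  // Caller's points-to set flowing into the position.
  PointsToSet holders;   // For invariant (output) positions: the caller
                         // objects whose points-to set the callee may
                         // overwrite. Empty otherwise.
};

// Returns the points-to set of the call expression's value. Pass '\0' as
// `return_lifetime` for callees that return no reference-like value.
inline PointsToSet ApplyCall(PointsToMap& state,
                             const std::vector<ArgPosition>& positions,
                             char return_lifetime) {
  // Step 1: union the points-to sets of all positions sharing a lifetime.
  std::map<char, PointsToSet> by_lifetime;
  for (const ArgPosition& position : positions) {
    by_lifetime[position.callee_lifetime].insert(position.pointees.begin(),
                                                 position.pointees.end());
  }
  // Step 2: replace the points-to sets of objects in output positions.
  for (const ArgPosition& position : positions) {
    for (Lifetime holder : position.holders) {
      state[holder] = by_lifetime[position.callee_lifetime];
    }
  }
  // Step 3: the return value points to everything gathered for its lifetime.
  return by_lifetime[return_lifetime];
}

// Usage sketch for `copy_ptr(arg1, &result)`.
inline void ApplyCallExample() {
  const Lifetime kArg1Pointee = 1, kArg2Pointee = 2, kResult = 3;
  PointsToMap state;
  state[kResult] = {kArg2Pointee};  // After `int* result = arg2;`
  const std::vector<ArgPosition> positions = {
      {'x', {kArg1Pointee}, {}},         // from: int *'x
      {'y', {kResult}, {}},              // to: outer pointer, lifetime 'y
      {'x', {kArg2Pointee}, {kResult}},  // *to: int *'x, output position
  };
  const PointsToSet returned = ApplyCall(state, positions, '\0');
  assert(returned.empty());
  assert(state[kResult] == (PointsToSet{kArg1Pointee, kArg2Pointee}));
}
```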
TODO: Continue exposition
Virtual member functions
Inferring lifetimes for virtual member functions is complicated by two factors:
- The lifetimes of the base class member function are constrained by the lifetimes of all of its overrides.
- The definitions of the overrides and the base class function (if it is not pure virtual) are typically contained in different translation units, and we plan to analyze each translation unit individually.
For more details, see this section in the lifetime annotation specification.
We will describe an approach that can infer and update lifetimes for virtual member functions progressively, as each translation unit is processed.
If a translation unit contains definitions for multiple overrides, or if it contains the definition of the base class function and at least one override, we analyze these definitions in topological order from base class to more derived class.
If the definitions are contained in different translation units, we effectively process them in the same order because we analyze dependencies of a library before analyzing the library itself, and libraries containing derived classes generally depend on the library containing the base class.
TODO: The description above implicitly assumes we're talking about the initial change where we add lifetimes across the codebase. Discuss also how this applies when people are editing code.
When we encounter the definition of a virtual member function (whether it is the base class implementation or an override), we first perform lifetime inference on its implementation, as for any other function, and update the declaration of the member function in its containing class.
If the function is an override, call it Derived::f, we then update the
lifetimes of every base class function Base::f that it overrides. (There may
be several if there is a chain of overrides.) We do so as follows:
- If the declaration of Base::f does not yet contain any lifetime annotations, annotate it with the lifetimes of Derived::f. Because we process base class functions before derived class functions, this case can only occur if Base::f is pure virtual.
- If the existing lifetimes of Base::f are more permissive than the lifetimes inferred for Derived::f, perform lifetime substitutions on the lifetimes of Base::f until they are at most as permissive as those of Derived::f.
- If the existing lifetimes of Base::f are at most as permissive as the lifetimes inferred for Derived::f, do nothing.
TODO: Can we ever get caught in a situation where neither the second nor the
third point above applies? I think we'll always be able to restrict the
lifetimes of Base::f until they're compatible with Derived::f, but this
needs a formal argument.
TODO: Discuss how the lifetime changes affect callers – may need to process them again.
TODO: Show an example
Templates
Templates pose a specific challenge to lifetime analysis:
- Reference-like types may occur in the template itself as well as in template arguments and dependent types.
- For reference-like types that occur in the template, we wish to infer and check lifetimes on the template itself to the greatest extent possible. This reflects the fact that, even though C++ templates are not really generics, they are often used as if they were. However, the semantics of C++ templates pose two difficulties here:
  - Templates may be specialized, and we must be careful not to apply the lifetimes inferred on the primary template to the specialization.
  - The inferred lifetimes and the lifetime correctness of a template may, in general, depend on the template arguments, even if the template arguments and dependent types do not contain any reference-like types. We show an example of this below.
The lifetime annotation specification defines what the semantics of lifetimes on templates should be but does not say how they should be implemented. That is the purpose of this section.
Example scenarios
Before we discuss generally how we will analyze templates, let us look at some scenarios that may occur.
As an example of why we want to be able to analyze templates themselves, let’s
take a look at part of a simplified implementation of std::vector:
template <class T>
class vector {
public:
vector(const vector& other);
T* $a begin() $a { return data_; }
T* $a end() $a { return data_ + size_; }
private:
T* data_;
size_t size_;
};
We should be able to infer the lifetimes of begin() and end() from the
template itself. These member functions operate only on pointers to T, and the
lifetime behavior of a pointer to T is independent of the type T itself [2].
On the other hand, we cannot infer the lifetimes of the copy constructor. It
calls the copy constructor of T, and as
explained in the
lifetime annotation specification, copy and move operations can have two
different lifetime signatures.
Here is another example of how lifetimes can depend on a template argument:
template <int i>
int* return_ith(int* i0, int* i1) {
  if (i == 0) {
    return i0;
  } else {
    return i1;
  }
}
This example is contrived, but it is certainly not implausible that a trait argument could affect behavior in a similar way.
While these examples do show the limitations of lifetime analysis on templates,
we likely won’t need to do anything subtle to detect them within the analysis.
In the case of the copy constructor of vector, we will notice when calling the
copy constructor of T that we’re doing member lookup on a dependent type and
that we can’t continue the analysis. In the case of return_ith(), we will be
able to analyze the function, but we will conclude that the lifetimes of all
pointers involved are the same. This is more restrictive than the result we
would obtain if we analyzed a template instantiation, but this limitation may be
acceptable.
General approach
The constraints described above imply that lifetime analysis of templates needs to proceed in two phases:
- Analysis of the template itself. We first attempt to infer lifetimes on the template itself, as well as any partial or full specializations, to the extent that the lifetimes do not depend on template arguments. If the inferred lifetimes are different from the function’s current (possibly elided) lifetimes, we generate a corresponding annotation. If we cannot infer lifetimes for the function, we annotate all lifetimes on the function as unsafe. This is required to distinguish this case from the situation where we were able to infer lifetimes and those lifetimes are elided.
TODO: Is there any alternative to marking the lifetimes unsafe? This isn't what we usually use unsafe lifetimes for, but I also don't really want to invent yet another syntax.
Performing lifetime analysis on the template itself, rather than only on instantiations, serves two purposes: a) It documents the lifetimes in the code, and b) it saves us from having to analyze every instantiation in cases where the lifetimes don’t depend on template arguments.
- Analysis of template instantiations. In the following situations, we infer lifetimes on a function template instantiation or member function of a class template instantiation that is called in the translation unit we are analyzing:
- If the template itself contains reference-like types but does not provide lifetimes for these.
- If the template arguments contain reference-like types.
We use the inferred lifetimes when performing lifetime analysis on the callers of these functions, but we obviously cannot produce annotations for these inferred lifetimes.
As discussed in the lifetime annotation specification, any lifetimes in a template argument should be propagated to all uses of the argument. Clang does not provide a built-in mechanism for this, so this needs to be done in the lifetime analysis code.
Verifying lifetime correctness
TODO
Generating error messages
If we detect that there is a lifetime error – either because a function is returning a reference to a local or because there is a lifetime error inside the function – we want to produce an easily comprehensible error message that explains the error.
TODO: Explain how
Alternative considered
We previously considered an alternative approach that built a set of constraints between the lifetimes involved in the function. Unfortunately, this approach produced wrong results on some fairly simple examples involving variable overwrites. A coworker identified a way to extend the approach that overcame many of these limitations, but this extension introduced additional complexity. In the end, we decided that the approach based on points-to sets was the simpler alternative.
Differences from Rust
The exclusivity rule
The borrow checker in Rust, in addition to checking lifetimes, also enforces the exclusivity rule: at any given time the program may have either one mutable reference or any number of immutable references to the same storage location.
The exclusivity rule protects against certain kinds of memory safety errors. For example, if it were applied to C++, it would catch the use-after-free here:
int test() {
  std::vector<int> xs;
  xs.push_back(10);
  const int &x0 = xs[0];  // `x0` borrows `xs` here.
  xs.push_back(20);       // exclusivity error: `xs` is mutably borrowed here,
                          // overlapping with the `x0` borrow.
  return x0;              // `xs` is borrowed by `x0` at least until here
                          // because `x0` is used here.
}
Most C++ iterator invalidation bugs could be prevented by enforcing exclusivity: while there are outstanding iterators that borrow the container, the container can't be mutated.
From our experience porting woff2 from C++ to Rust, adjusting existing code to follow the exclusivity rule is one of the most difficult steps in porting. Therefore, it makes sense to separate rolling out lifetime checking from exclusivity checking. Lifetime checks without exclusivity checks don't guarantee memory safety, but they catch memory safety issues on their own, and should not require many adjustments to C++ code.
Exclusivity checking could be rolled out in an optional second step. This would not only provide additional memory safety to the C++ code but would facilitate a manual or automatic conversion of C++ code to Rust.
Spatial memory safety
Lifetime verification does not establish spatial memory safety, that is, it does not prove that all accesses are in bounds. Rust collections perform these checks at runtime.
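As a small illustration of the distinction, the hypothetical helper `CheckedGet` below returns a pointer whose lifetime is tied to the vector for any index, so a lifetime analysis has nothing to object to; whether the access is in bounds is a separate question that has to be answered by a runtime check such as the one shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The returned pointer never outlives `values` as far as lifetimes are
// concerned, regardless of `index`. The bounds check is what provides
// spatial safety; without it, `&values[index]` could be out of bounds.
inline const int* CheckedGet(const std::vector<int>& values,
                             std::size_t index) {
  return index < values.size() ? &values[index] : nullptr;
}

// Usage sketch.
inline void CheckedGetExample() {
  const std::vector<int> values = {10, 20};
  assert(*CheckedGet(values, 1) == 20);
  assert(CheckedGet(values, 2) == nullptr);
}
```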
Notes
1. Unless converting constructors and conversion constructors are used to simulate variance.
2. Note, however, that Clang is currently very conservative in assigning types to type-dependent expressions.
Struct Layout
C++ (in the Itanium ABI) extends the C layout rules, and so repr(C) isn't
enough. This page documents the tweaks to Rust structs to give them the same
layout as C++ structs.
In particular:
- C++ classes and Rust structs must have the same alignment, so that references can be exchanged without violating the alignment rules. This is usually ensured by the regular #[repr(C)] layout algorithm, but sometimes the interop tool needs to generate explicit #[repr(align(n))] annotations.
- C++ classes and Rust structs must have the same size, so that arrays of objects can be exchanged.
- Public subobjects must have the same offsets in C++ and Rust versions of the structs.
Non-field data
Rust bindings introduce a __non_field_data: [MaybeUninit<u8>; N] field to
cover data within the object that is not part of individual fields. This
includes:
- Base classes.
- VTable pointers.
- Empty struct padding.
Empty Structs
One notable special case of this is the empty struct padding. An empty struct or
class (e.g. struct Empty{};) has size 1, while in Rust, it has size 0. To
make the layout match up, bindings for empty structs will always enforce that
the struct has size of at least 1, via __non_field_data.
(In C++, different array elements are guaranteed to have different addresses,
and also, arrays are guaranteed to be contiguous. Therefore, no object in C++
can have size 0. Rust, like C++, has only contiguous arrays, but unlike C++
Rust does not guarantee that distinct elements have distinct addresses.)
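The C++ side of this argument can be checked directly. The sketch below uses hypothetical names (`Empty`, `ArrayElementsAreDistinct`) and only relies on guarantees of the C++ language, not on any particular ABI.

```cpp
#include <cassert>

struct Empty {};

// A complete object type never has size 0 in C++.
static_assert(sizeof(Empty) >= 1, "C++ objects occupy at least one byte");

// Array elements are contiguous and have distinct addresses, which is why
// the element size cannot be 0.
inline bool ArrayElementsAreDistinct() {
  Empty elements[2];
  return &elements[0] != &elements[1];
}

// Usage sketch.
inline void EmptyStructExample() {
  assert(ArrayElementsAreDistinct());
  assert(sizeof(Empty[2]) == 2 * sizeof(Empty));
}
```

A Rust binding for `Empty` therefore needs at least one byte of `__non_field_data` so that arrays of the type have the same stride on both sides.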
Potentially-overlapping objects
In C++, in some circumstances, the requirement that objects do not overlap is
relaxed: base classes and [[no_unique_address]] member variables can have
subsequent objects live inside of their tail padding. The most famous instance
of this is the
empty base class optimization (EBCO):
a base class with no data members is permitted to take up zero space inside of
derived classes.
NOTE: This has other, non-layout consequences for Rust: for example, it is not
safe to obtain two &mut references to overlapping objects, unless they are of
size 0. (To prevent this, classes that might be base classes are always
!Unpin.)
This is impossible to represent in a C-like struct. (Indeed, it's impossible to
represent even in a C++-like struct, before the introduction of
[[no_unique_address]].) Therefore, in Rust, we don't even try:
potentially-overlapping subobjects are replaced in the Rust layout by a
[MaybeUninit<u8>; N] field, where N is large enough to ensure that the next
subobject starts at the correct offset. The alignment of the struct is still
changed so that it matches the C++ alignment, but via #[repr(align(n))]
instead of by aligning the field.
Example
For example, consider these two C++ classes:
// This is a class, instead of a struct, to ensure that it is not POD for the
// purpose of layout. (The Itanium ABI disables the overlapping subobject
// optimization for POD types.)
class A {
int16_t x_;
int8_t y_;
};
struct B final : A {
int8_t z;
};
In memory, this may be laid out as follows:
| x_ | x_ | y_ | z  |
 <------------> <-->
  A subobject  | B
 <------------------>
      sizeof(A)
   (also sizeof(B))
The correct representation for B, in Rust, is something like this:
#[repr(C)]
#[repr(align(2))] // match the alignment of the int16_t variable.
struct B {
// We don't use a field of type `A`, because it would have a size of 4,
// and Rust wouldn't permit `z` to live inside of it.
// Nor do we align the array, for the same reason -- correct alignment must be
// achieved via the repr(align(2)) at the top.
__non_field_data: [MaybeUninit<u8>; 3],
pub z: i8,
}
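The layout claims in this example can be checked mechanically. The sketch below restates the struct and compares it with a naive version that embeds `A` as a field (`NaiveB` is a name invented for this comparison). It assumes a toolchain that provides `std::mem::offset_of!`.

```rust
use std::mem::{align_of, offset_of, size_of, MaybeUninit};

// The representation described above: opaque bytes instead of an `A` field.
#[repr(C)]
#[repr(align(2))]
pub struct B {
    __non_field_data: [MaybeUninit<u8>; 3],
    pub z: i8,
}

const _: () = assert!(size_of::<B>() == 4);
const _: () = assert!(align_of::<B>() == 2);
const _: () = assert!(offset_of!(B, z) == 3);

// For comparison only: embedding `A` directly pushes `z` past A's tail
// padding, so the layout no longer matches the C++ one.
#[allow(dead_code)]
#[repr(C)]
pub struct A {
    x_: i16,
    y_: i8,
}

#[allow(dead_code)]
#[repr(C)]
pub struct NaiveB {
    base: A,
    pub z: i8,
}

const _: () = assert!(size_of::<A>() == 4);
const _: () = assert!(offset_of!(NaiveB, z) == 4);
const _: () = assert!(size_of::<NaiveB>() == 6);
```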
Thunks for class template member functions
Problem definition
Given the C++ header below...
#pragma clang lifetime_elision
template <typename T>
class MyTemplate {
public:
MyTemplate(T value) : value_(value) {}
const T& GetValue() const;
private:
T value_;
};
using MyIntTemplate = MyTemplate<int>;
... Crubit will generate Rust bindings that can call into the
MyTemplate<int>::GetValue() member function. To support such calls, Crubit has
to generate a C++ thunk (to instantiate the class template and to provide a
symbol for a C-ABI-compatible function that Rust can call into):
extern "C" // <- C ABI
int const& __rust_thunk___ZNK10MyTemplateIiE8GetValueEv(
const class MyTemplate<int>* __this) {
return __this->GetValue();
}
There are other (non-template-related) scenarios that require generating
thunks (e.g. inline functions, or functions that use a custom calling
convention), but templates bring one extra requirement: a class template can
be defined in one header (say my_template.h) and used in multiple other
headers (e.g. library_foo/template_user1.h and
library_bar/template_user2.h). Because of this, the same thunk might need to
be present in multiple generated ..._rs_api_impl.cc files (e.g. in
library_foo_rs_api_impl.cc and library_bar_rs_api_impl.cc). This may lead to
duplicate symbol errors from the linker:
ld: error: duplicate symbol: __rust_thunk___ZNK10MyTemplateIiE8GetValueEv
Implemented solution: Encoding target name in the thunk name
One solution is to give each of the generated thunks a unique,
target/library-specific name, e.g.:
__rust_thunk___ZNK10MyTemplateIiE8GetValueEv__library_foo (note the
library_foo suffix).
Pros:
- Minimal extra code complexity (e.g. no need for template-specific code
in thunk-related code in src_code_gen.rs).
- Obviously correct behavior-wise (e.g. since it is just like other thunks which we assume are implemented correctly).
Cons:
- Performance guarantees are unclear. Binary size depends on link time optimization (LTO) recognizing that all the thunks are identical and deduplicating them.
  - This seems to work in practice (at least for production binaries).
  - Future work: add tests + consider asking LLVM to provide LTO guarantees.
- Requires escaping Bazel target names into valid C identifiers. See
ConvertToCcIdentifier(const BazelLabel&) in bazel_types.cc.
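To make the naming scheme concrete, here is a minimal sketch of how such a suffix could be built. Both helpers (`escape_target_name` and `thunk_name`) are hypothetical and written only for this illustration. They are not Crubit's implementation, and the escaping rule shown is a simplifying assumption.

```rust
/// Hypothetical helper: turns a target name into a C identifier fragment by
/// replacing every character that is not an ASCII letter or digit with `_`.
/// (Crubit's actual escaping is done by `ConvertToCcIdentifier`; this is only
/// an illustrative stand-in.)
pub fn escape_target_name(target: &str) -> String {
    target
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect()
}

/// Hypothetical helper: builds a target-specific thunk name from the mangled
/// name of the wrapped function and the name of the target.
pub fn thunk_name(mangled_name: &str, target: &str) -> String {
    format!(
        "__rust_thunk__{}__{}",
        mangled_name,
        escape_target_name(target)
    )
}
```

Note that this simplified escaping is not injective (for example, `a-b` and `a_b` map to the same identifier), so a real implementation would need a collision-free scheme.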
Alternative solutions
Function template
An alternative solution would be to use a function template that we immediately explicitly instantiate. These still generate the code we need, but their duplicated symbol definitions (across multiple binding crates) won't cause an ODR violation. It is expected that a single function template is instantiated multiple times in multiple translation units, therefore the linker silently merges these equivalent definitions.
Example:
// Thunk is expressed as a function template:
template <typename = void>
__attribute__((__always_inline__)) int const&
__rust_thunk___ZNK10MyTemplateIiE8GetValueEv(
const class MyTemplate<int>* __this) {
return __this->GetValue();
}
// Explicit instantiation of the function template:
// (to generate a symbol that `..._rs_api.rs` can call into)
template int const& __rust_thunk___ZNK10MyTemplateIiE8GetValueEv(
const class MyTemplate<int>* __this);
Pros:
- Naturally deduplicated (just depending on what C++ already does for function templates).
Cons:
- Assumes a particular ABI - a function template specialization uses the
calling convention prescribed by the platform C++ ABI. We know that
the Itanium ABI maps C++ signatures to the C ABI and
therefore will be compatible with the calling convention expected by the
generated ..._rs_api.rs. Further research is needed to investigate the guarantees offered by other platforms (e.g., the MSVC ABI).
- Requires extra complexity to calculate the mangled name of the function
template specialization.
  - Crubit doesn't have a clang::FunctionDecl corresponding to the function-template-based thunk, and therefore Crubit can't use clang::MangleContext::mangleName to calculate the linkable/mangled name of the thunk.
  - Reimplementing clang::MangleContext::mangleName in Crubit seems fragile. One risk is bugs in Crubit's code that would make it behave differently from Clang (e.g. code review of the initial prototype identified that mangling compression was missing). Another risk is having to implement not just ItaniumMangleContext, but also MicrosoftMangleContext.
  - One idea to avoid reimplementing mangling is to explicitly specify
  the name for the function template instantiation using
  __asm__("abc") (sadly this doesn't seem to work - it may be a Clang bug).
An abandoned prototype of this approach can be found in the Google-internal cl/450495903.
Explicit linkonce_odr attribute
Example:
extern "C"
int const& __rust_thunk___ZNK10MyTemplateIiE8GetValueEv(
const class MyTemplate<int>* __this)
__attribute__((linkonce_odr)) // <- THIS IS THE PROPOSAL
{
return __this->GetValue();
}
Pros:
- All the "pros" of the "Encoding target name in the thunk name" approach (simplicity + correctness of behavior)
- All the "pros" of the "Function template" approach (deduplication)
Cons:
- Requires changing Clang to support the new attribute (e.g. requires convincing the Clang community that this is a language extension that is worth supporting). TODO(b/234889162): Send out a short RFC to gauge interest?
Rejected solutions
- selectany doesn't work with functions, only data members. Furthermore, we need something that maps to linkonce_odr, and selectany maps only to linkonce.
- __attribute__((weak)) has the disadvantage that a weak definition can be overridden by a strong one. This rule makes weak definitions non-inlineable except in full-program LTO. C++ function templates instead follow the ODR rule that says that all definitions must be equivalent, making them inlineable.
Unpin for C++ Types
SUMMARY: A C++ type is Unpin if it is Rust-movable (e.g., a trivial type, or a
nontrivial type which is [[clang::trivial_abi]]). Any such type can be used by
value or plain reference/pointer in interop; all non-Unpin types must instead
be used behind pinned pointers and references.
A C++ type T is Unpin if it is known to be a Rust-movable type
(move+destroy is logically equivalent to memcpy+release).
Unpin C++ types can be used like any other normal Rust type: they are always
safe to access by reference or by value. Non-Unpin types, in contrast, can
only be accessed behind pins such as Pin<&mut T>, or Pin<Box<T>>, because they
may not be safe to mutate directly. These types are never used directly by value
in Rust, because value-like assignment has incorrect semantics: it fails to run
C++ special members for non-Rust-movable types.
Note that not every object with an Unpin type is actually safe to hold in a
mutable reference. Objects with live aliases still must not be used with &mut,
and "potentially overlapping objects" can produce unexpected behavior in Rust.
(See Reference Safety.)
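The difference between the two kinds of types can be sketched in plain Rust. `Movable` and `NotMovable` below are hand-written stand-ins invented for this example, not generated bindings. `PhantomPinned` is the standard-library marker that opts a type out of `Unpin`.

```rust
use std::marker::PhantomPinned;
use std::pin::Pin;

/// Stand-in for a Rust-movable C++ type: `Unpin` is implemented automatically.
#[derive(Default)]
pub struct Movable {
    pub value: i32,
}

/// Stand-in for a C++ type that is not Rust-movable. The `PhantomPinned`
/// marker makes the type `!Unpin`.
#[derive(Default)]
pub struct NotMovable {
    value: i32,
    _pin: PhantomPinned,
}

impl NotMovable {
    pub fn value(&self) -> i32 {
        self.value
    }

    /// Mutation goes through `Pin<&mut Self>` rather than a plain `&mut Self`.
    pub fn set_value(self: Pin<&mut Self>, value: i32) {
        // SAFETY: only a field is overwritten; the object itself is not moved.
        unsafe { self.get_unchecked_mut().value = value };
    }
}

/// Compiles only when `T` is `Unpin`.
pub fn assert_unpin<T: Unpin>() {}
```

A `Movable` can be held and assigned by value, while a `NotMovable` would typically be created with `Box::pin(...)` and mutated through `as_mut()`. Calling `assert_unpin::<NotMovable>()` fails to compile.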
Rust-movable types
In C++, moving a value between locations in memory involves executing code to either initialize (move-construct) or overwrite (move-assign) the new location. The old location still exists, but is in a moved-from state, and must still be destroyed to release resources.
(For example, std::string x = std::move(y); will run the move constructor, so
that x contains the same value that y used to have before the move. The
variable y will still be a valid string, but might be empty, or might contain
some garbage value. The destructors for both x and y will run when they go
out of scope.)
Rust does not have move constructors or move assignment. In fact, there is no
way to customize what happens during moving or assignment: in Rust, moving or
swapping an object means changing its location in memory, as if by memcpy
without running the destructor logic in the old location. Another way of looking
at it is that it's as if an object moved around in memory over time: it is
constructed in one place, and then further operations and eventual destruction
might happen in other places. This is a Rust move.
Despite C++ moves using explicit construction and destruction calls, many C++ types could also have used the Rust movement model. We call such types Rust-movable types.
For example, a C++ std::unique_ptr, implemented in the obvious way, is
Rust-movable: its actual location in memory does not matter. In contrast, a
self-referential type is not Rust-movable, because to move it, you must also
update the pointer it has to itself. This is done inside the move constructor in
C++, but cannot be done in the Rust model, where the move operation is not
customizable.
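The self-referential case can be sketched in Rust as follows. `SelfRef` is a hypothetical type written for this illustration. It stores its own address as an integer rather than as a pointer, so that the example never dereferences a stale pointer.

```rust
/// Hypothetical self-referential type: it records its own address.
pub struct SelfRef {
    pub value: i32,
    /// Address of this object at the time `init` ran.
    pub self_addr: usize,
}

impl SelfRef {
    /// Records the current address, like a C++ constructor storing `this`.
    pub fn init(&mut self) {
        self.self_addr = self as *const SelfRef as usize;
    }

    /// True if the recorded address still matches the object's location.
    pub fn is_consistent(&self) -> bool {
        self.self_addr == self as *const SelfRef as usize
    }
}

/// Moves `s` to the heap. This is a Rust move: the bytes are copied as-is and
/// no user code runs, so `self_addr` cannot be fixed up.
pub fn move_to_heap(s: SelfRef) -> Box<SelfRef> {
    Box::new(s)
}
```

After `move_to_heap`, the ordinary data (`value`) is intact, but the recorded address still refers to the old location. A C++ move constructor would have updated it, which is why such a type is not Rust-movable.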
Which types are Rust-movable?
For the purpose of Rust/C++ interop, we define a type to be Rust-movable if, and only if, it is "trivial for calls" in Clang. That is, either:
- It is actually trivial, or
- It uses
[[clang::trivial_abi]]to make itself trivial for calls
This definition is conservative: some types that could be considered
Rust-movable are not trivial for calls. (For example, std::unique_ptr uses
[[clang::trivial_abi]] only in the unstable libc++ ABI; the stable libc++ ABI
predates this attribute, and adding it now is ABI-breaking.)
This definition is, however, sound: all types which are trivial for calls are Rust-movable, because a type which is trivial for calls is Rust-moved when passed by value as a function argument.
Expanding Rust-movability
C++26 introduces a concept called "trivial relocation" and "trivially
relocatable types". These are types which have an alternate relocation operation
that does not throw exceptions or run the move constructor or destructor.
Ideally, a type would be Rust-movable if and only if it is trivially
relocatable, replaceable, and trivial relocation is tantamount to a memcpy.
(For example, perhaps T is Rust-movable if and only if any union containing
T is trivially relocatable.)
TODO: This is a work in progress.
Reference Safety
Not every object with an Unpin type can actually safely be pointed to by a
Rust reference.
Conventional aliasing
If a C++ reference mutably aliases another reference, it is unsafe to pass to Rust as a Rust reference. Do not under any circumstances create aliasing Rust references: the behavior of doing so is undefined.
For example:
pub fn foo(_: &mut i32, _: &mut i32) {}
It is undefined behavior to call foo(x, x) from C++.
Tail padding
In C++, tail padding is not part of the object, and the space in the tail
padding can be taken up by other unrelated objects. Avoid creating a Rust
reference to a base class, or to a [[no_unique_address]] field, as these are
"potentially overlapping". This can cause surprising behavior, or unintended
aliasing and undefined behavior.
Consider the following struct:
struct A {};
struct B {
[[no_unique_address]] A field_1_;
char field_2_;
A& field_1() { return field_1_; }
char& field_2() { return field_2_; }
};
Here, while sizeof(A) is 1, it has no data, only tail padding. A C++
assignment to field_1_ will not write anything. And so C++ can store an
unrelated object inside of the tail padding. [[no_unique_address]] marks the
tail padding as available for use. field_2_ may actually be stored inside the
tail padding of field_1_, and the sizeof(B) may also be 1.
(Base classes also allow their tail padding to be reused, and the same example
works with struct B : A.)
static_assert(sizeof(A) == sizeof(B));
static_assert(offsetof(B, field_1_) == offsetof(B, field_2_));
Rust does not work this way. In Rust, tail padding is part of the object. Rust
references refer to the full span of the pointed-to object, including that tail
padding. And so a Rust reference to field_1_ would encompass field_2_ by
accident.
This means that the following code has undefined behavior via conventional aliasing, despite looking fairly innocent:
B b = ...;
// Rust: pub fn foo(_: &mut A, _: &mut u8)
foo(b.field_1(), b.field_2()); // C++
And the following Rust code would perform unintended mutations to field_2:
let mut b1: B = ...;
let mut b2: B = ...;
// This actually swaps field_2!
std::mem::swap(b1.field_1(), b2.field_1());
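The effect can be reproduced in plain Rust with a simplified model. The `A` and `B` below are hand-written stand-ins (assumptions for this sketch, not generated bindings): `A` is one opaque byte, matching `sizeof(A) == 1`, and `B` is modeled as the single byte that holds `field_2_`, since both fields share offset 0 in C++.

```rust
use std::mem::MaybeUninit;

/// Stand-in for the empty C++ `struct A {}`: one opaque byte.
#[repr(C)]
pub struct A {
    __non_field_data: [MaybeUninit<u8>; 1],
}

/// Stand-in for `B`: the single byte that stores `field_2_`, which `field_1_`
/// overlaps in the C++ layout.
#[repr(C)]
pub struct B {
    storage: u8,
}

impl B {
    pub fn new(field_2: u8) -> Self {
        B { storage: field_2 }
    }

    /// Reference to the `A` subobject, which spans the byte holding `field_2_`.
    pub fn field_1(&mut self) -> &mut A {
        // SAFETY: `A` has size 1 and alignment 1, and accepts any byte value.
        unsafe { &mut *(&mut self.storage as *mut u8 as *mut A) }
    }

    pub fn field_2(&self) -> u8 {
        self.storage
    }
}

/// Swaps the "empty" `A` subobjects of two `B`s.
pub fn swap_field_1(b1: &mut B, b2: &mut B) {
    std::mem::swap(b1.field_1(), b2.field_1());
}
```

Swapping the two `A` references copies all `size_of::<A>()` bytes, so after `swap_field_1` the `field_2` values of the two objects have changed places even though `A` has no data of its own.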
C++20
In C++17 and earlier, there was only one way to create a potentially-overlapping
object: inheritance ("EBO").
Making inheritable types non-Unpin could have removed or mitigated the risk of
overlapping objects in C++17 and below.
However, as of C++20, any object can alias another in the tail padding.
C++20 introduced [[no_unique_address]], which makes tail padding available for
reuse for any type. Since [[no_unique_address]] may be used fairly extensively
in library code (it has no negative effects in C++), we can't assume that it
does not exist.
In modern C++, final types are not much safer than other types. One must be
careful when creating Rust references, to ensure that those Rust references
do not contain data in their tail padding, or otherwise alias, and there is no
way to guarantee this at the type level.