# User Guide

Achieving optimal performance in GPU-centric workflows frequently requires
customizing how GPU ("device") memory is allocated.

RMM is a package that enables you to allocate device memory
in a highly configurable way. For example, it enables you to
allocate and use pools of GPU memory, or to use
[managed memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/)
for allocations.

You can also easily configure other libraries like Numba and CuPy
to use RMM for allocating device memory.

## Installation

See the project [README](https://github.com/rapidsai/rmm) for how to install RMM.

## Using RMM

There are two ways to use RMM in Python code:

1. Using the `rmm.DeviceBuffer` API to explicitly create and manage
   device memory allocations
2. Transparently via external libraries such as CuPy and Numba

RMM provides a `MemoryResource` abstraction to control _how_ device
memory is allocated in both the above uses.

### `DeviceBuffer` Objects

A `DeviceBuffer` represents an **untyped, uninitialized device memory
allocation**. `DeviceBuffer`s can be created by providing the
size of the allocation in bytes:

```python
>>> import rmm
>>> buf = rmm.DeviceBuffer(size=100)
```

The size of the allocation and the memory address associated with it
can be accessed via the `.size` and `.ptr` attributes respectively:

```python
>>> buf.size
100
>>> buf.ptr
140202544726016
```

`DeviceBuffer`s can also be created by copying data from host memory:

```python
>>> import rmm
>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='float64')
>>> buf = rmm.DeviceBuffer.to_device(a.view("uint8"))  # to_device expects an unsigned 8-bit dtype
>>> buf.size
24
```

Conversely, the data underlying a `DeviceBuffer` can be copied to the host:

```python
>>> np.frombuffer(buf.tobytes())
array([1., 2., 3.])
```

#### Prefetching a `DeviceBuffer`

[CUDA Unified Memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/),
also known as managed memory, can be allocated using an
`rmm.mr.ManagedMemoryResource` explicitly, or by calling `rmm.reinitialize`
with `managed_memory=True`.

A `DeviceBuffer` backed by managed memory or other migratable memory (such as
[HMM/ATS](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/)
memory) may be prefetched to a specified device, for example to reduce or
eliminate page faults.

```python
>>> import rmm
>>> rmm.reinitialize(managed_memory=True)
>>> buf = rmm.DeviceBuffer(size=100)
>>> buf.prefetch()
```

The above example prefetches the `DeviceBuffer` memory to the current CUDA
device on the stream that the `DeviceBuffer` last used (e.g. at construction).
The destination device ID and stream are optional parameters.

```python
>>> import rmm
>>> rmm.reinitialize(managed_memory=True)
>>> from rmm.pylibrmm.stream import Stream
>>> stream = Stream()
>>> buf = rmm.DeviceBuffer(size=100, stream=stream)
>>> buf.prefetch(device=3, stream=stream)  # prefetch to device 3 on the given stream
```

`DeviceBuffer.prefetch()` is a no-op if the `DeviceBuffer` is not backed
by migratable memory.

### `MemoryResource` objects

`MemoryResource` objects are used to configure how device memory allocations
are made by RMM.

By default, if a `MemoryResource` is not set explicitly, RMM uses the
`CudaMemoryResource`, which uses `cudaMalloc` for allocating device memory.

`rmm.reinitialize()` provides an easy way to initialize RMM with specific
memory resource options across multiple devices. See `help(rmm.reinitialize)`
for full details.
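For example, the following sketch sets up a memory pool on two devices (the
device list is an assumption for a two-GPU machine; consult
`help(rmm.reinitialize)` for the exact set of accepted options):

```python
>>> import rmm
>>> rmm.reinitialize(
...     pool_allocator=True,       # draw allocations from a pre-allocated pool
...     initial_pool_size="1GiB",  # starting size of each pool
...     devices=[0, 1],            # apply to devices 0 and 1 (assumes two GPUs)
... )
```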
For lower-level control, the `rmm.mr.set_current_device_resource()` function
can be used to set a different MemoryResource for the current CUDA device. For
example, enabling the `ManagedMemoryResource` tells RMM to use
`cudaMallocManaged` instead of `cudaMalloc` for allocating memory:

```python
>>> import rmm
>>> rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
```

> :warning: The default resource must be set for any device **before**
> allocating any device memory on that device. Setting or changing the
> resource after device allocations have been made can lead to unexpected
> behaviour or crashes.

As another example, `PoolMemoryResource` allows you to allocate a large "pool"
of device memory up-front. Subsequent allocations will draw from this pool of
already allocated memory. The example below shows how to construct a
PoolMemoryResource with an initial size of 1 GiB and a maximum size of 4 GiB.
The pool uses `CudaMemoryResource` as its underlying ("upstream") memory
resource:

```python
>>> import rmm
>>> pool = rmm.mr.PoolMemoryResource(
...     rmm.mr.CudaMemoryResource(),
...     initial_pool_size="1GiB",  # equivalent to initial_pool_size=2**30
...     maximum_pool_size="4GiB"
... )
>>> rmm.mr.set_current_device_resource(pool)
```

Similarly, to use a pool of managed memory:

```python
>>> import rmm
>>> pool = rmm.mr.PoolMemoryResource(
...     rmm.mr.ManagedMemoryResource(),
...     initial_pool_size="1GiB",
...     maximum_pool_size="4GiB"
... )
>>> rmm.mr.set_current_device_resource(pool)
```

Other `MemoryResource`s include:

* `FixedSizeMemoryResource` for allocating fixed blocks of memory
* `BinningMemoryResource` for allocating blocks within specified "bin" sizes
  from different memory resources

`MemoryResource`s are highly configurable and can be composed together in
different ways. See `help(rmm.mr)` for more information.

## Using RMM with third-party libraries

A number of libraries provide hooks to control their device allocations. RMM
provides implementations of these for [CuPy](https://cupy.dev),
[numba](https://numba.readthedocs.io/en/stable/), and
[PyTorch](https://pytorch.org) in the `rmm.allocators` submodule. All these
approaches configure the library to use the _current_ RMM memory resource for
device allocations.

### Using RMM with CuPy

You can configure [CuPy](https://cupy.dev/) to use RMM for memory allocations
by setting the CuPy CUDA allocator to `rmm.allocators.cupy.rmm_cupy_allocator`:

```python
>>> from rmm.allocators.cupy import rmm_cupy_allocator
>>> import cupy
>>> cupy.cuda.set_allocator(rmm_cupy_allocator)
```

### Using RMM with Numba

You can configure [Numba](https://numba.readthedocs.io/en/stable/) to use RMM
for memory allocations using the Numba
[EMM Plugin](https://numba.readthedocs.io/en/stable/cuda/external-memory.html#setting-emm-plugin).

This can be done in two ways:

1. Setting the environment variable `NUMBA_CUDA_MEMORY_MANAGER`:

```bash
$ NUMBA_CUDA_MEMORY_MANAGER=rmm.allocators.numba python (args)
```

2. Using the `set_memory_manager()` function provided by Numba:

```python
>>> from numba import cuda
>>> from rmm.allocators.numba import RMMNumbaManager
>>> cuda.set_memory_manager(RMMNumbaManager)
```
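Once the plugin is active, Numba device arrays are served by the current RMM
memory resource. A minimal sketch of this (assuming the memory manager is set
before any CUDA context has been created):

```python
>>> import numpy as np
>>> from numba import cuda
>>> from rmm.allocators.numba import RMMNumbaManager
>>> cuda.set_memory_manager(RMMNumbaManager)  # must run before CUDA is initialized
>>> darr = cuda.to_device(np.arange(4))       # this allocation goes through RMM
>>> darr.copy_to_host()
array([0, 1, 2, 3])
```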
### Using RMM with PyTorch

You can configure [PyTorch](https://pytorch.org/docs/stable/notes/cuda.html)
to use RMM for memory allocations by changing the current allocator:

```python
>>> from rmm.allocators.torch import rmm_torch_allocator
>>> import torch
>>> torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```

## Memory statistics and profiling

RMM can profile memory usage and track memory statistics by using either of
the following:

- Use the context manager `rmm.statistics.statistics()` to enable statistics
  tracking for a specific code block.
- Call `rmm.statistics.enable_statistics()` to enable statistics tracking
  globally.

Common to both usages is that they modify the currently active RMM memory
resource. The current device resource is wrapped with a
`StatisticsResourceAdaptor`, which must remain the topmost resource throughout
the statistics tracking:

```python
>>> import rmm
>>> import rmm.statistics

>>> # We start with the default CUDA memory resource
>>> rmm.mr.get_current_device_resource()

>>> # When using statistics, we get a StatisticsResourceAdaptor with the context
>>> with rmm.statistics.statistics():
...     rmm.mr.get_current_device_resource()

>>> # We can also enable statistics globally
>>> rmm.statistics.enable_statistics()
>>> print(rmm.mr.get_current_device_resource())
```

With statistics enabled, you can query statistics of the current and peak
bytes and number of allocations performed by the current RMM memory resource:

```python
>>> buf = rmm.DeviceBuffer(size=10)
>>> rmm.statistics.get_statistics()
Statistics(current_bytes=16, current_count=1, peak_bytes=16, peak_count=1, total_bytes=16, total_count=1)
```
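Deallocations are tracked as well: freeing a buffer lowers the current
counters while the peak and total counters are retained. A small sketch of
this behaviour (assuming the allocation above is the only one outstanding):

```python
>>> del buf  # the deallocation is recorded by the StatisticsResourceAdaptor
>>> stats = rmm.statistics.get_statistics()
>>> stats.current_bytes, stats.peak_bytes  # current drops to zero; peak is kept
(0, 16)
```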
### Memory Profiler

To profile a specific block of code, first enable memory statistics by calling
`rmm.statistics.enable_statistics()`. To profile a function, use `profiler` as
a function decorator:

```python
>>> @rmm.statistics.profiler()
... def f(size):
...     rmm.DeviceBuffer(size=size)
>>> f(1000)

>>> # By default, the profiler writes to rmm.statistics.default_profiler_records
>>> print(rmm.statistics.default_profiler_records.report())
Memory Profiling
================

Legends:
  ncalls       - number of times the function or code block was called
  memory_peak  - peak memory allocated in function or code block (in bytes)
  memory_total - total memory allocated in function or code block (in bytes)

Ordered by: memory_peak

ncalls     memory_peak    memory_total  filename:lineno(function)
     1           1,008           1,008  :1(f)
```

To profile a code block, use `profiler` as a context manager:

```python
>>> with rmm.statistics.profiler(name="my code block"):
...     rmm.DeviceBuffer(size=20)
>>> print(rmm.statistics.default_profiler_records.report())
Memory Profiling
================

Legends:
  ncalls       - number of times the function or code block was called
  memory_peak  - peak memory allocated in function or code block (in bytes)
  memory_total - total memory allocated in function or code block (in bytes)

Ordered by: memory_peak

ncalls     memory_peak    memory_total  filename:lineno(function)
     1           1,008           1,008  :1(f)
     1              32              32  my code block
```

The `profiler` supports nesting:

```python
>>> with rmm.statistics.profiler(name="outer"):
...     buf1 = rmm.DeviceBuffer(size=10)
...     with rmm.statistics.profiler(name="inner"):
...         buf2 = rmm.DeviceBuffer(size=10)
>>> print(rmm.statistics.default_profiler_records.report())
Memory Profiling
================

Legends:
  ncalls       - number of times the function or code block was called
  memory_peak  - peak memory allocated in function or code block (in bytes)
  memory_total - total memory allocated in function or code block (in bytes)

Ordered by: memory_peak

ncalls     memory_peak    memory_total  filename:lineno(function)
     1           1,008           1,008  :1(f)
     1              32              32  my code block
     1              32              32  outer
     1              16              16  inner
```
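Because the profiler simply wraps the current memory resource, it also
captures allocations made by third-party libraries that have been configured
to use RMM (see above). A sketch, assuming statistics are enabled and CuPy has
been pointed at RMM with `rmm_cupy_allocator`:

```python
>>> import cupy
>>> from rmm.allocators.cupy import rmm_cupy_allocator
>>> cupy.cuda.set_allocator(rmm_cupy_allocator)

>>> with rmm.statistics.profiler(name="cupy work"):
...     a = cupy.zeros(1000)  # 8,000 bytes allocated through the current RMM resource
```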