Skip to content

Commit

Permalink
RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz
Browse files Browse the repository at this point in the history
64k pages introduce the situation in this diagram when the HCA 4k page
size is being used:

 +-------------------------------------------+ <--- 64k aligned VA
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+
 |                   o                       |
 |                                           |
 |                   o                       |
 |                                           |
 |                   o                       |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+ <--- Live HCA page
 |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO| <--- offset
 |                                           | <--- VA
 |                MR data                    |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+
 |                   o                       |
 |                                           |
 |                   o                       |
 |                                           |
 |                   o                       |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+

The VA addresses are coming from rdma-core in this diagram can be
arbitrary, but for 64k pages, the VA may be offset by some number of HCA
4k pages and followed by some number of HCA 4k pages.

The current iterator doesn't account for either the preceding 4k pages or
the following 4k pages.

Fix the issue by extending the ib_block_iter to contain the number of DMA
pages like comment [1] says and by using __sg_advance to start the
iterator at the first live HCA page.

The changes are contained in a parallel set of iterator start and next
functions that are umem aware and specific to umem since there is one user
of the rdma_for_each_block() without umem.

These two fixes prevents the extra pages before and after the user MR
data.

Fix the preceding pages by using the __sq_advance field to start at the
first 4k page containing MR data.

Fix the following pages by saving the number of pgsz blocks in the
iterator state and downcounting on each next.

This fix allows for the elimination of the small page crutch noted in the
Fixes.

Fixes: 10c75cc ("RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()")
Link: https://lore.kernel.org/r/20231129202143.1434-2-shiraz.saleem@intel.com
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  • Loading branch information
mmarcini authored and jgunthorpe committed Dec 5, 2023
1 parent 2b78832 commit 4fbc3a5
Show file tree
Hide file tree
Showing 3 changed files with 9 additions and 7 deletions.
6 changes: 0 additions & 6 deletions drivers/infiniband/core/umem.c
Original file line number Diff line number Diff line change
Expand Up @@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
return page_size;
}

/* rdma_for_each_block() has a bug if the page size is smaller than the
* page size used to build the umem. For now prevent smaller page sizes
* from being returned.
*/
pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);

/* The best result is the smallest page size that results in the minimum
* number of required pages. Compute the largest page size that could
* work based on VA address bits that don't change.
Expand Down
9 changes: 8 additions & 1 deletion include/rdma/ib_umem.h
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,13 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
{
__rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
umem->sgt_append.sgt.nents, pgsz);
biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
}

static inline bool __rdma_umem_block_iter_next(struct ib_block_iter *biter)
{
return __rdma_block_iter_next(biter) && biter->__sg_numblocks--;
}

/**
Expand All @@ -92,7 +99,7 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
*/
#define rdma_umem_for_each_dma_block(umem, biter, pgsz) \
for (__rdma_umem_block_iter_start(biter, umem, pgsz); \
__rdma_block_iter_next(biter);)
__rdma_umem_block_iter_next(biter);)

#ifdef CONFIG_INFINIBAND_USER_MEM

Expand Down
1 change: 1 addition & 0 deletions include/rdma/ib_verbs.h
Original file line number Diff line number Diff line change
Expand Up @@ -2850,6 +2850,7 @@ struct ib_block_iter {
/* internal states */
struct scatterlist *__sg; /* sg holding the current aligned block */
dma_addr_t __dma_addr; /* unaligned DMA address of this block */
size_t __sg_numblocks; /* ib_umem_num_dma_blocks() */
unsigned int __sg_nents; /* number of SG entries */
unsigned int __sg_advance; /* number of bytes to advance in sg in next step */
unsigned int __pg_bit; /* alignment of current block */
Expand Down

0 comments on commit 4fbc3a5

Please sign in to comment.