Add batch_by, to allow grouping elements together by their size #992

JosephLenton opened this issue Sep 18, 2024 · 2 comments

JosephLenton commented Sep 18, 2024

Hey, I would like to propose adding a method batch_by (I don't know if this is the best name, but I'll use it here).

  • batch_by would allow you to group elements into batches by their size.
  • You provide a maximum size, and each batch aims to stay within that size.
  • If an element is too big on its own, it is placed into a batch of one. I'm not 100% sure this is the right behaviour, but it is what I have used in the past.
  • The batches preserve the order of the Iterator; it builds a batch as it goes and does not look ahead to find items that would fit into earlier batches.
  • I would also propose try_batch_by for dealing with Iterators of Result.

The function signature would be something like:

fn batch_by<N, F>(self, max_batch_size: N, batch_fun: F) -> BatchBy<Self, F, N>
where
    Self: Sized,
    F: Fn(&Self::Item) -> N,
    N: Num + PartialOrd + Copy,

(I'm not 100% sure about the need for Num. I used it in an implementation I wrote, but I believe it could probably be dropped.)
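
For illustration, here is a minimal sketch of the grouping rule described above, written as a plain free function rather than the proposed adapter. The name batch_by_size, the usize sizes, and the eager Vec<Vec<T>> return type are placeholders for the sketch, not part of the proposal:

// Minimal sketch of the grouping rule, assuming usize sizes.
fn batch_by_size<T, F>(
    items: impl IntoIterator<Item = T>,
    max_batch_size: usize,
    size_of: F,
) -> Vec<Vec<T>>
where
    F: Fn(&T) -> usize,
{
    let mut batches: Vec<Vec<T>> = Vec::new();
    let mut current: Vec<T> = Vec::new();
    let mut current_size = 0;

    for item in items {
        let size = size_of(&item);
        // Close the current batch before it would exceed the limit, but always
        // accept at least one item, so an oversized item forms a batch of one.
        if !current.is_empty() && current_size + size > max_batch_size {
            batches.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += size;
        current.push(item);
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

For example, batch_by_size(vec!["aa", "bbbb"], 4, |s| s.len()) would yield [["aa"], ["bbbb"]], since the two items together exceed the limit.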

Pseudo Example Code

const MAX_BATCH_SEND_SIZE: usize = 100;

// Data we are sending
let data_to_send = vec![
    "short-data",
    "very-very-very-...-very-long-data",
    "medium-length-data",
    "short-again",
    "medium-length-data-again",

    // ... imagine lots more data ...
];

// This is the function in use
let batches = data_to_send
    .into_iter()
    .batch_by(MAX_BATCH_SEND_SIZE, |data| data.len());

// Send batches of data
for batch in batches {
    let batch_to_send = batch.collect::<Vec<&'static str>>();
    send(batch_to_send).await?;
}

Motivations

I have personally needed this on several projects when grouping things to be sent on to an external service. For example, on a real-world project I needed to batch data into 10 MB groups to send to ElasticSearch.

Comments

AFAIK there is nothing like this in Itertools. There are ways to group by key, or to find the largest or smallest, but no means to say 'put these into groups of 10 MB or less'.

I have written this before, so if there is interest I would be more than happy to write a PR for Itertools.

phimuemue commented Sep 18, 2024

Hi, thanks for the idea.

I wonder if chunk_by, batching or (a generalized) coalesce might do the job, too. Did you try these to see how far you get?
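
For reference, a sketch of how batching could express this, reusing data_to_send and MAX_BATCH_SEND_SIZE from the example above; the iterator is wrapped in peekable first so an item can be inspected before it is committed to a batch:

use itertools::Itertools;

let batches = data_to_send
    .into_iter()
    .peekable()
    .batching(|it| {
        let mut batch = Vec::new();
        let mut batch_size = 0;
        while let Some(item) = it.peek() {
            let size = item.len();
            // Close the batch before it would exceed the limit, but always
            // accept at least one item so an oversized item gets a batch of one.
            if !batch.is_empty() && batch_size + size > MAX_BATCH_SEND_SIZE {
                break;
            }
            batch_size += size;
            batch.push(it.next().unwrap());
        }
        // Returning None ends the outer iterator once the input is exhausted.
        if batch.is_empty() { None } else { Some(batch) }
    });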

Related: #436

JosephLenton (Author) commented

One can achieve this using chunk_by by counting generations:

let mut current_batch_size = 0;
let mut generation = 0;
let batches = data_to_send
    .into_iter()
    .chunk_by(move |data| {
        let size = data.len();

        // Start a new generation (i.e. a new batch) once adding this
        // item would push the running total over the limit.
        if current_batch_size + size > MAX_BATCH_SEND_SIZE {
            current_batch_size = 0;
            generation += 1;
        }

        current_batch_size += size;

        // Consecutive items with the same generation end up in the same chunk.
        generation
    });

I actually ended up writing this twice because I got the logic wrong the first time. In practice I would put this into a helper function, so I can wrap it with tests.
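
Such a helper could look roughly like this (batch_by_len is a placeholder name, and it collects eagerly just to keep the sketch and its test simple):

use itertools::Itertools;

// Placeholder helper wrapping the generation-counting trick above, so the
// grouping logic lives in one place and can be unit-tested.
fn batch_by_len(items: Vec<&str>, max: usize) -> Vec<Vec<&str>> {
    let mut current_batch_size = 0;
    let mut generation = 0u32;
    items
        .into_iter()
        .chunk_by(move |data| {
            let size = data.len();
            if current_batch_size + size > max {
                current_batch_size = 0;
                generation += 1;
            }
            current_batch_size += size;
            generation
        })
        .into_iter()
        .map(|(_generation, group)| group.collect())
        .collect()
}

#[test]
fn oversized_items_get_their_own_batch() {
    let batches = batch_by_len(vec!["aa", "bbbbbbbbbb", "cc", "cc"], 5);
    assert_eq!(batches, vec![vec!["aa"], vec!["bbbbbbbbbb"], vec!["cc", "cc"]]);
}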
