C# 在 linq 中创建批处理

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/13731796/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-10 09:37:15  来源:igfitidea点击:

Create batches in linq

c#linq

提问by BlakeH

Can someone suggest a way to create batches of a certain size in linq?

有人可以建议一种在 linq 中创建特定大小批次的方法吗?

Ideally I want to be able to perform operations in chunks of some configurable amount.

理想情况下,我希望能够以一些可配置数量的块执行操作。

采纳答案by Sergey Berezovskiy

You don't need to write any code. Use MoreLINQBatch method, which batches the source sequence into sized buckets (MoreLINQ is available as a NuGet package you can install):

您无需编写任何代码。使用MoreLINQBatch 方法,它将源序列分批处理到大小的桶中(MoreLINQ 可作为您可以安装的 NuGet 包使用):

int size = 10;
var batches = sequence.Batch(size);

Which is implemented as:

其实现为:

public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
                  this IEnumerable<TSource> source, int size)
{
    TSource[] bucket = null;
    var count = 0;

    foreach (var item in source)
    {
        if (bucket == null)
            bucket = new TSource[size];

        bucket[count++] = item;
        if (count != size)
            continue;

        yield return bucket;

        bucket = null;
        count = 0;
    }

    if (bucket != null && count > 0)
        yield return bucket.Take(count).ToArray();
}

回答by L.B

public static class MyExtensions
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> items,
                                                       int maxItems)
    {
        return items.Select((item, inx) => new { item, inx })
                    .GroupBy(x => x.inx / maxItems)
                    .Select(g => g.Select(x => x.item));
    }
}

and the usage would be:

用法是:

List<int> list = new List<int>() { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

foreach(var batch in list.Batch(3))
{
    Console.WriteLine(String.Join(",",batch));
}

OUTPUT:

输出:

0,1,2
3,4,5
6,7,8
9

回答by Nick Whaley

All of the above perform terribly with large batches or low memory space. Had to write my own that will pipeline (notice no item accumulation anywhere):

以上所有这些在大批量或低内存空间的情况下都表现得非常糟糕。不得不写我自己的管道(注意任何地方都没有项目积累):

public static class BatchLinq {
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size) {
        if (size <= 0)
            throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");

        using (IEnumerator<T> enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
                yield return TakeIEnumerator(enumerator, size);
    }

    private static IEnumerable<T> TakeIEnumerator<T>(IEnumerator<T> source, int size) {
        int i = 0;
        do
            yield return source.Current;
        while (++i < size && source.MoveNext());
    }
}

Edit:Known issue with this approach is that each batch must be enumerated and enumerated fully before moving to the next batch. For example this doesn't work:

编辑:这种方法的已知问题是,在移动到下一个批次之前,必须对每个批次进行枚举和枚举。例如,这不起作用:

//Select first item of every 100 items
Batch(list, 100).Select(b => b.First())

回答by nichom

    static IEnumerable<IEnumerable<T>> TakeBatch<T>(IEnumerable<T> ts,int batchSize)
    {
        return from @group in ts.Select((x, i) => new { x, i }).ToLookup(xi => xi.i / batchSize)
               select @group.Select(xi => xi.x);
    }

回答by Kaushik

I'm joining this very late but i found something more interesting.

我很晚才加入这个,但我发现了一些更有趣的东西。

So we can use here Skipand Takefor better performance.

所以我们可以在这里使用SkipTake获得更好的性能。

public static class MyExtensions
    {
        public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> items, int maxItems)
        {
            return items.Select((item, index) => new { item, index })
                        .GroupBy(x => x.index / maxItems)
                        .Select(g => g.Select(x => x.item));
        }

        public static IEnumerable<T> Batch2<T>(this IEnumerable<T> items, int skip, int take)
        {
            return items.Skip(skip).Take(take);
        }

    }

Next I checked with 100000 records. The looping only is taking more time in case of Batch

接下来我检查了 100000 条记录。循环只会在以下情况下花费更多时间Batch

Code Of console application.

控制台应用程序的代码。

static void Main(string[] args)
{
    List<string> Ids = GetData("First");
    List<string> Ids2 = GetData("tsriF");

    Stopwatch FirstWatch = new Stopwatch();
    FirstWatch.Start();
    foreach (var batch in Ids2.Batch(5000))
    {
        // Console.WriteLine("Batch Ouput:= " + string.Join(",", batch));
    }
    FirstWatch.Stop();
    Console.WriteLine("Done Processing time taken:= "+ FirstWatch.Elapsed.ToString());


    Stopwatch Second = new Stopwatch();

    Second.Start();
    int Length = Ids2.Count;
    int StartIndex = 0;
    int BatchSize = 5000;
    while (Length > 0)
    {
        var SecBatch = Ids2.Batch2(StartIndex, BatchSize);
        // Console.WriteLine("Second Batch Ouput:= " + string.Join(",", SecBatch));
        Length = Length - BatchSize;
        StartIndex += BatchSize;
    }

    Second.Stop();
    Console.WriteLine("Done Processing time taken Second:= " + Second.Elapsed.ToString());
    Console.ReadKey();
}

static List<string> GetData(string name)
{
    List<string> Data = new List<string>();
    for (int i = 0; i < 100000; i++)
    {
        Data.Add(string.Format("{0} {1}", name, i.ToString()));
    }

    return Data;
}

Time taken Is like this.

花费的时间是这样的。

First - 00:00:00.0708 , 00:00:00.0660

第一 - 00:00:00.0708 , 00:00:00.0660

Second (Take and Skip One) - 00:00:00.0008, 00:00:00.0008

第二个(选择并跳过一个) - 00:00:00.0008, 00:00:00.0008

回答by user4698855

Same approach as MoreLINQ, but using List instead of Array. I haven't done benchmarking, but readability matters more to some people:

与 MoreLINQ 相同的方法,但使用 List 而不是 Array。我没有做过基准测试,但可读性对某些人来说更重要:

    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        List<T> batch = new List<T>();

        foreach (var item in source)
        {
            batch.Add(item);

            if (batch.Count >= size)
            {
                yield return batch;
                batch.Clear();
            }
        }

        if (batch.Count > 0)
        {
            yield return batch;
        }
    }

回答by Matthew Strawbridge

If you start with sequencedefined as an IEnumerable<T>, and you know that it can safely be enumerated multiple times (e.g. because it is an array or a list), you can just use this simple pattern to process the elements in batches:

如果您从sequence定义为 an开始IEnumerable<T>,并且您知道它可以安全地枚举多次(例如,因为它是一个数组或列表),您可以使用这个简单的模式来批量处理元素:

while (sequence.Any())
{
    var batch = sequence.Take(10);
    sequence = sequence.Skip(10);

    // do whatever you need to do with each batch here
}

回答by infogulch

This is a fully lazy, low overhead, one-function implementation of Batch that doesn't do any accumulation. Based on (and fixes issues in) Nick Whaley's solutionwith help from EricRoller.

这是 Batch 的一种完全惰性、低开销、单一功能的实现,不进行任何累积。在EricRoller 的帮助下,基于(并修复了)Nick Whaley 的解决方案

Iteration comes directly from the underlying IEnumerable, so elements must be enumerated in strict order, and accessed no more than once. If some elements aren't consumed in an inner loop, they are discarded (and trying to access them again via a saved iterator will throw InvalidOperationException: Enumeration already finished.).

迭代直接来自底层 IEnumerable,因此元素必须以严格的顺序枚举,并且访问次数不能超过一次。如果某些元素在内部循环中没有被消耗,它们将被丢弃(并且尝试通过保存的迭代器再次访问它们将 throw InvalidOperationException: Enumeration already finished.)。

You can test a complete sample at .NET Fiddle.

您可以在.NET Fiddle测试完整的示例。

public static class BatchLinq
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");
        using (var enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
            {
                int i = 0;
                // Batch is a local function closing over `i` and `enumerator` that
                // executes the inner batch enumeration
                IEnumerable<T> Batch()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }

                yield return Batch();
                while (++i < size && enumerator.MoveNext()); // discard skipped items
            }
    }
}

回答by leat

I wrote a custom IEnumerable implementation that works without linq and guarantees a single enumeration over the data. It also accomplishes all this without requiring backing lists or arrays that cause memory explosions over large data sets.

我编写了一个自定义的 IEnumerable 实现,它在没有 linq 的情况下工作并保证对数据进行单一枚举。它还可以完成所有这些,而无需在大型数据集上导致内存爆炸的后备列表或数组。

Here are some basic tests:

下面是一些基本的测试:

    [Fact]
    public void ShouldPartition()
    {
        var ints = new List<int> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        var data = ints.PartitionByMaxGroupSize(3);
        data.Count().Should().Be(4);

        data.Skip(0).First().Count().Should().Be(3);
        data.Skip(0).First().ToList()[0].Should().Be(0);
        data.Skip(0).First().ToList()[1].Should().Be(1);
        data.Skip(0).First().ToList()[2].Should().Be(2);

        data.Skip(1).First().Count().Should().Be(3);
        data.Skip(1).First().ToList()[0].Should().Be(3);
        data.Skip(1).First().ToList()[1].Should().Be(4);
        data.Skip(1).First().ToList()[2].Should().Be(5);

        data.Skip(2).First().Count().Should().Be(3);
        data.Skip(2).First().ToList()[0].Should().Be(6);
        data.Skip(2).First().ToList()[1].Should().Be(7);
        data.Skip(2).First().ToList()[2].Should().Be(8);

        data.Skip(3).First().Count().Should().Be(1);
        data.Skip(3).First().ToList()[0].Should().Be(9);
    }

The Extension Method to partition the data.

对数据进行分区的扩展方法。

/// <summary>
/// A set of extension methods for <see cref="IEnumerable{T}"/>. 
/// </summary>
public static class EnumerableExtender
{
    /// <summary>
    /// Splits an enumerable into chucks, by a maximum group size.
    /// </summary>
    /// <param name="source">The source to split</param>
    /// <param name="maxSize">The maximum number of items per group.</param>
    /// <typeparam name="T">The type of item to split</typeparam>
    /// <returns>A list of lists of the original items.</returns>
    public static IEnumerable<IEnumerable<T>> PartitionByMaxGroupSize<T>(this IEnumerable<T> source, int maxSize)
    {
        return new SplittingEnumerable<T>(source, maxSize);
    }
}

This is the implementing class

这是实现类

    using System.Collections;
    using System.Collections.Generic;

    internal class SplittingEnumerable<T> : IEnumerable<IEnumerable<T>>
    {
        private readonly IEnumerable<T> backing;
        private readonly int maxSize;
        private bool hasCurrent;
        private T lastItem;

        public SplittingEnumerable(IEnumerable<T> backing, int maxSize)
        {
            this.backing = backing;
            this.maxSize = maxSize;
        }

        public IEnumerator<IEnumerable<T>> GetEnumerator()
        {
            return new Enumerator(this, this.backing.GetEnumerator());
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return this.GetEnumerator();
        }

        private class Enumerator : IEnumerator<IEnumerable<T>>
        {
            private readonly SplittingEnumerable<T> parent;
            private readonly IEnumerator<T> backingEnumerator;
            private NextEnumerable current;

            public Enumerator(SplittingEnumerable<T> parent, IEnumerator<T> backingEnumerator)
            {
                this.parent = parent;
                this.backingEnumerator = backingEnumerator;
                this.parent.hasCurrent = this.backingEnumerator.MoveNext();
                if (this.parent.hasCurrent)
                {
                    this.parent.lastItem = this.backingEnumerator.Current;
                }
            }

            public bool MoveNext()
            {
                if (this.current == null)
                {
                    this.current = new NextEnumerable(this.parent, this.backingEnumerator);
                    return true;
                }
                else
                {
                    if (!this.current.IsComplete)
                    {
                        using (var enumerator = this.current.GetEnumerator())
                        {
                            while (enumerator.MoveNext())
                            {
                            }
                        }
                    }
                }

                if (!this.parent.hasCurrent)
                {
                    return false;
                }

                this.current = new NextEnumerable(this.parent, this.backingEnumerator);
                return true;
            }

            public void Reset()
            {
                throw new System.NotImplementedException();
            }

            public IEnumerable<T> Current
            {
                get { return this.current; }
            }

            object IEnumerator.Current
            {
                get { return this.Current; }
            }

            public void Dispose()
            {
            }
        }

        private class NextEnumerable : IEnumerable<T>
        {
            private readonly SplittingEnumerable<T> splitter;
            private readonly IEnumerator<T> backingEnumerator;
            private int currentSize;

            public NextEnumerable(SplittingEnumerable<T> splitter, IEnumerator<T> backingEnumerator)
            {
                this.splitter = splitter;
                this.backingEnumerator = backingEnumerator;
            }

            public bool IsComplete { get; private set; }

            public IEnumerator<T> GetEnumerator()
            {
                return new NextEnumerator(this.splitter, this, this.backingEnumerator);
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return this.GetEnumerator();
            }

            private class NextEnumerator : IEnumerator<T>
            {
                private readonly SplittingEnumerable<T> splitter;
                private readonly NextEnumerable parent;
                private readonly IEnumerator<T> enumerator;
                private T currentItem;

                public NextEnumerator(SplittingEnumerable<T> splitter, NextEnumerable parent, IEnumerator<T> enumerator)
                {
                    this.splitter = splitter;
                    this.parent = parent;
                    this.enumerator = enumerator;
                }

                public bool MoveNext()
                {
                    this.parent.currentSize += 1;
                    this.currentItem = this.splitter.lastItem;
                    var hasCcurent = this.splitter.hasCurrent;

                    this.parent.IsComplete = this.parent.currentSize > this.splitter.maxSize;

                    if (this.parent.IsComplete)
                    {
                        return false;
                    }

                    if (hasCcurent)
                    {
                        var result = this.enumerator.MoveNext();

                        this.splitter.lastItem = this.enumerator.Current;
                        this.splitter.hasCurrent = result;
                    }

                    return hasCcurent;
                }

                public void Reset()
                {
                    throw new System.NotImplementedException();
                }

                public T Current
                {
                    get { return this.currentItem; }
                }

                object IEnumerator.Current
                {
                    get { return this.Current; }
                }

                public void Dispose()
                {
                }
            }
        }
    }

回答by MrD at KookerellaLtd

So with a functional hat on, this appears trivial....but in C#, there are some significant downsides.

所以戴上功能帽,这看起来微不足道……但是在 C# 中,有一些明显的缺点。

you'd probably view this as an unfold of IEnumerable (google it and you'll probably end up in some Haskell docs, but there may be some F# stuff using unfold, if you know F#, squint at the Haskell docs and it will make sense).

您可能会将此视为 IEnumerable 的展开(谷歌它,您可能会在一些 Haskell 文档中结束,但可能有一些 F# 使用展开的东西,如果您了解 F#,请眯眼查看 Haskell 文档,它会使感觉)。

Unfold is related to fold ("aggregate") except rather than iterating through the input IEnumerable, it iterates through the output data structures (its a similar relationship between IEnumerable and IObservable, in fact I think IObservable does implement an "unfold" called generate...)

展开与折叠(“聚合”)有关,除了它不是遍历输入 IEnumerable,而是遍历输出数据结构(它在 IEnumerable 和 IObservable 之间有类似的关系,实际上我认为 IObservable 确实实现了一个名为 generate 的“展开”。 ..)

anyway first you need an unfold method, I think this works (unfortunately it will eventually blow the stack for large "lists"...you can write this safely in F# using yield! rather than concat);

无论如何,首先你需要一个展开方法,我认为这是有效的(不幸的是,它最终会炸毁大型“列表”的堆栈......你可以在 F# 中使用 yield! ​​而不是 concat 安全地编写它);

    static IEnumerable<T> Unfold<T, U>(Func<U, IEnumerable<Tuple<U, T>>> f, U seed)
    {
        var maybeNewSeedAndElement = f(seed);

        return maybeNewSeedAndElement.SelectMany(x => new[] { x.Item2 }.Concat(Unfold(f, x.Item1)));
    }

this is a bit obtuse because C# doesn't implement some of the things functional langauges take for granted...but it basically takes a seed and then generates a "Maybe" answer of the next element in the IEnumerable and the next seed (Maybe doesn't exist in C#, so we've used IEnumerable to fake it), and concatenates the rest of the answer (I can't vouch for the "O(n?)" complexity of this).

这有点迟钝,因为 C# 没有实现一些函数式语言认为理所当然的东西......但它基本上需要一个种子,然后生成 IEnumerable 中下一个元素和下一个种子(也许是在 C# 中不存在,所以我们使用 IEnumerable 来伪造它),并连接答案的其余部分(我不能保证“O(n?)”的复杂性)。

Once you've done that then;

一旦你这样做了;

    static IEnumerable<IEnumerable<T>> Batch<T>(IEnumerable<T> xs, int n)
    {
        return Unfold(ys =>
            {
                var head = ys.Take(n);
                var tail = ys.Skip(n);
                return head.Take(1).Select(_ => Tuple.Create(tail, head));
            },
            xs);
    }

it all looks quite clean...you take the "n" elements as the "next" element in the IEnumerable, and the "tail" is the rest of the unprocessed list.

这一切看起来都很干净……您将“n”元素作为 IEnumerable 中的“下一个”元素,“尾部”是未处理列表的其余部分。

if there is nothing in the head...you're over...you return "Nothing" (but faked as an empty IEnumerable>)...else you return the head element and the tail to process.

如果头部没有任何东西......你结束了......你返回“Nothing”(但假装为空的IEnumerable>)......否则你返回头部元素和尾部以进行处理。

you probably can do this using IObservable, there's probably a "Batch" like method already there, and you can probably use that.

你可能可以使用 IObservable 来做到这一点,那里可能已经有一个类似“批处理”的方法,你可能可以使用它。

If the risk of stack overflows worries (it probably should), then you should implement in F# (and there's probably some F# library (FSharpX?) already with this).

如果堆栈溢出的风险令人担忧(可能应该),那么您应该在 F# 中实现(并且可能已经有一些 F# 库(FSharpX?))。

(I have only done some rudimentary tests of this, so there may be the odd bugs in there).

(我只做了一些基本的测试,所以那里可能有奇怪的错误)。