Original URL: http://stackoverflow.com/questions/1489533/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Process Large Textfiles Quickly w/ VBA
Asked by Fink
I'm having a hard time speeding up the processing of a very large text file (~100 MB or so). I've been diligent about using the ReDim Preserve calls, and yet the function still takes 5 minutes or so to run. The text file is basically sub-reports which I'm trying to parse out. I only have access to the large file. What is a person to do? Is VBA just that slow? Here is the code; the "report" object is a class I created. Most of the reports are just a couple hundred lines, so that's why I chose 1000 for the UBound:
Public Function GetPages(originalFilePath As String) As Collection
    Dim myReport As report
    Dim reportPageCollection As Collection
    Dim startLine As Long
    Dim endLine As Long
    Dim fso As FileSystemObject
    Dim file As TextStream
    Dim lineStr As String
    Dim index As Long
    Dim lines() As String

    Set fso = New FileSystemObject
    Set reportPageCollection = New Collection 'initialize the collection
    Set file = fso.OpenTextFile(originalFilePath, ForReading)
    ReDim lines(0 To 1000)

    lineStr = file.ReadLine 'skip the first line so the loop doesn't add a blank report
    lines(0) = lineStr
    index = 1

    Do Until file.AtEndOfLine 'loop through from the start line to find the end line
        lineStr = file.ReadLine
        If lineStr Like "1JOBNAME:*" Then 'next report, so we want to return an array of the single line
            'load this page into our report page collection for further processing
            Set myReport = New report
            myReport.setDataLines = lines() 'fill in 'ReportPage' array
            reportPageCollection.Add myReport 'add our report to the collection
            'set up array for new report
            ReDim lines(0 To 1000)
            index = 0
            lines(index) = lineStr
            index = index + 1
        Else
            '============================ store into array
            If index = UBound(lines) Then
                ReDim Preserve lines(0 To UBound(lines) + 1000)
            End If
            lines(index) = lineStr
            index = index + 1
            '============================
        End If
    Loop

    file.Close
    Set fso = Nothing
    Set GetPages = reportPageCollection
End Function
Any help is appreciated. Thanks!
Answered by Mike Woodhouse
I just grabbed a 73-meg, 1.2m line text file from my C:\ drive. It took 6 seconds to read through the whole thing, line by line in Excel VBA (doing nothing but reading). So the speed problem isn't obviously file-IO related.
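That read-only pass can be reproduced with a minimal timing loop like the following (a sketch; Timer, FileSystemObject and ForReading are standard VBA/Scripting Runtime members, the path is an assumption):

```vba
' Minimal sketch of the read-only timing test described above.
Public Sub TimeReadThrough(path As String)
    Dim fso As New FileSystemObject
    Dim t As Single
    Dim lineStr As String
    t = Timer
    With fso.OpenTextFile(path, ForReading)
        Do Until .AtEndOfStream
            lineStr = .ReadLine ' do nothing but read
        Loop
        .Close
    End With
    Debug.Print "Elapsed: " & Timer - t & "s"
End Sub
```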
A few observations:
- I'm nervous about having a variable named "file" when File is a class within the Scripting Runtime;
- Do Until file.AtEndOfLine stops almost immediately: you're at the end of a line as soon as you've read one. I think you want Do Until file.AtEndOfStream;
- the rest of your code looks OK, although I'd move all the stuff about adding lines to arrays into a method on your report class;
- is the file physically local? Or are you reading from a network drive? That might account for the problem. If so, consider reading the whole thing into a string and splitting it. 100MB isn't really that big: it took 9 seconds to do that with my 73MB file;
- you don't need to create a collection variable: GetPages already wants to be that collection.
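The "read the whole thing into a string and split it" idea mentioned above might look like this minimal sketch (untested; the function name is hypothetical, and it assumes Windows vbCrLf line endings):

```vba
' Sketch: read the whole file in one pass, then split on line endings.
Public Function GetAllLines(originalFilePath As String) As String()
    Dim fso As New FileSystemObject
    Dim content As String
    With fso.OpenTextFile(originalFilePath, ForReading)
        content = .ReadAll ' one bulk read instead of thousands of ReadLine calls
        .Close
    End With
    GetAllLines = Split(content, vbCrLf) ' assumes Windows line endings
End Function
```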
So your code might shrink to something like this:
Public Function GetPages(originalFilePath As String) As Collection
    Dim myReport As report
    Dim lineStr As String

    Set GetPages = New Collection 'initialize the collection
    With New FileSystemObject ' no need to store an object
        With .OpenTextFile(originalFilePath, ForReading) ' ditto
            Set myReport = New report
            myReport.AddLine .ReadLine
            Do Until .AtEndOfStream
                lineStr = .ReadLine
                If lineStr Like "1JOBNAME:*" Then
                    GetPages.Add myReport
                    Set myReport = New report
                End If
                myReport.AddLine lineStr ' all the array business happens here - much tidier
            Loop
            GetPages.Add myReport ' don't drop the final report
        End With ' TextStream goes out of scope & closes
    End With ' FileSystemObject goes out of scope, disappears
End Function
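The AddLine method used above isn't shown in the answer; a sketch of what it might look like on the report class (the member names dataLines and lineCount are assumptions, standing in for whatever setDataLines populated in the question):

```vba
' Hypothetical report.AddLine: grows the internal array in 1000-line
' chunks, the same way the question's loop does.
Private dataLines() As String
Private lineCount As Long

Public Sub AddLine(ByVal lineStr As String)
    If lineCount = 0 Then ReDim dataLines(0 To 1000) ' first line: allocate the initial chunk
    If lineCount > UBound(dataLines) Then
        ReDim Preserve dataLines(0 To UBound(dataLines) + 1000) ' grow in chunks, not per line
    End If
    dataLines(lineCount) = lineStr
    lineCount = lineCount + 1
End Sub
```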
Is there anything there that helps?
Answered by Oorang
There are a few tweaks you could make; the FSO object is known to be slower than VB's native I/O. But I don't see anything really heinous in here. Before we go micro-optimizing, let me ask a more basic question... Would these files happen to be on a shared drive or an FTP site? If so, consider copying them down to a temp folder before processing them.
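The copy-to-temp step suggested here could be sketched like this (GetSpecialFolder, GetFileName and CopyFile are standard Scripting Runtime members; the helper name is an assumption):

```vba
' Sketch: copy a (possibly remote) file to the local temp folder
' before parsing it, then work on the local copy.
Public Function CopyToTemp(sourcePath As String) As String
    Dim fso As New FileSystemObject
    Dim tempPath As String
    ' 2 = TemporaryFolder for GetSpecialFolder
    tempPath = fso.GetSpecialFolder(2).Path & "\" & fso.GetFileName(sourcePath)
    fso.CopyFile sourcePath, tempPath, True ' True = overwrite an existing copy
    CopyToTemp = tempPath ' parse this local path instead of the network one
End Function
```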