Original question: http://stackoverflow.com/questions/13993530/
Notice: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Better Search for a string in all files using C#
Asked by LCJ
After consulting many blogs and articles, I arrived at the following code for searching for a string in all files inside a folder. It works fine in my tests.
QUESTIONS
- Is there a faster approach for this (using C#)?
- Is there any scenario that will fail with this code?
Note: I tested only with very small files, and with only a few of them.
CODE
    using System;
    using System.Collections.Generic;
    using System.IO;

    class Program
    {
        static void Main()
        {
            string sourceFolder = @"C:\Test";
            string searchWord = ".class1";

            // Collect every file under the source folder first, then search each one.
            List<string> allFiles = new List<string>();
            AddFileNamesToList(sourceFolder, allFiles);
            foreach (string fileName in allFiles)
            {
                string contents = File.ReadAllText(fileName);
                if (contents.Contains(searchWord))
                {
                    Console.WriteLine(fileName);
                }
            }
            Console.WriteLine(" ");
            System.Console.ReadKey();
        }

        public static void AddFileNamesToList(string sourceDir, List<string> allFiles)
        {
            string[] fileEntries = Directory.GetFiles(sourceDir);
            foreach (string fileName in fileEntries)
            {
                allFiles.Add(fileName);
            }

            //Recursion
            string[] subdirectoryEntries = Directory.GetDirectories(sourceDir);
            foreach (string item in subdirectoryEntries)
            {
                // Avoid "reparse points"
                if ((File.GetAttributes(item) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
                {
                    AddFileNamesToList(item, allFiles);
                }
            }
        }
    }
Accepted answer by VladL
Instead of File.ReadAllText(), it is better to use
File.ReadLines(@"C:\file.txt");
It returns an IEnumerable<string> that is yielded lazily, so you do not have to read the whole file if your string is found before the last line of the text file is reached.
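As a minimal sketch of this suggestion, the per-file check could look like the following. The class and method names (LineSearch, FileContains) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.IO;
using System.Linq;

static class LineSearch
{
    // Hypothetical helper: returns true as soon as one line contains searchWord.
    // File.ReadLines is lazy, so lines after the first match are never read.
    public static bool FileContains(string path, string searchWord)
    {
        return File.ReadLines(path)
                   .Any(line => line.Contains(searchWord));
    }
}
```

One difference from ReadAllText plus Contains: because the file is examined line by line, a search string that spans a line break will not be found.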
Answer by Brannon
I think your code will fail with an exception if you lack permission to open a file.
Compare it with the code here: http://bgrep.codeplex.com/releases/view/36186
That latter code supports
- regular expression search and
- filters for file extensions
Both are things you should probably consider.
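The points in this answer could be combined roughly as follows. This is only a sketch and is not the bgrep code: FilteredSearch and FindMatches are hypothetical names, and skipping unreadable files silently is an assumption about the desired behavior.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static class FilteredSearch
{
    // Hypothetical helper: returns the files whose extension is in 'extensions'
    // and that contain at least one line matching 'pattern'.
    public static List<string> FindMatches(
        IEnumerable<string> files, Regex pattern, ISet<string> extensions)
    {
        var matches = new List<string>();
        foreach (string file in files)
        {
            // Path.GetExtension includes the leading dot, e.g. ".cs".
            if (!extensions.Contains(Path.GetExtension(file)))
                continue;

            try
            {
                if (File.ReadLines(file).Any(line => pattern.IsMatch(line)))
                    matches.Add(file);
            }
            catch (UnauthorizedAccessException)
            {
                // No permission to open the file: skip it instead of crashing.
            }
            catch (IOException)
            {
                // File is locked or disappeared during the search: skip it.
            }
        }
        return matches;
    }
}
```

Passing a HashSet<string> built with StringComparer.OrdinalIgnoreCase makes the extension filter case-insensitive, which is usually what is wanted on Windows.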
Answer by Jason Meckley
The main problem here is that you are searching all the files in real time for every search. There is also the possibility of file access conflicts if two or more users are searching at the same time.
To dramatically improve performance, I would index the files ahead of time and re-index them as they are edited or saved. Store the index using something like Lucene.NET, then query the index (again using Lucene.NET) and return the file names to the user. That way the user never queries the files directly.
If you follow the links in this SO post, you may get a head start on implementing the indexing. I didn't follow the links myself, but they are worth a look.
Just a heads up, this will be an intense shift from your current approach and will require
- a service to monitor/index the files
- the UI project
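To illustrate the index-then-query idea without depending on Lucene.NET, here is a deliberately tiny in-memory inverted index. It is a hypothetical stand-in, not how Lucene.NET works internally: TinyIndex and its methods are invented for this sketch, nothing is persisted, and only whole tokens can be looked up, not arbitrary substrings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for a real index such as Lucene.NET.
sealed class TinyIndex
{
    // Maps each token to the set of files that contain it.
    private readonly Dictionary<string, HashSet<string>> _filesByWord =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', '(', ')', '{', '}' };

    // Call once ahead of time, and again whenever the file is edited or saved.
    public void IndexFile(string path)
    {
        Remove(path); // drop stale entries from a previous version of the file
        string[] words = File.ReadAllText(path)
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            HashSet<string> files;
            if (!_filesByWord.TryGetValue(word, out files))
            {
                files = new HashSet<string>();
                _filesByWord[word] = files;
            }
            files.Add(path);
        }
    }

    public void Remove(string path)
    {
        foreach (HashSet<string> files in _filesByWord.Values)
            files.Remove(path);
    }

    // Answers a query from the index alone; no file is opened here.
    public List<string> Search(string word)
    {
        HashSet<string> files;
        return _filesByWord.TryGetValue(word, out files)
            ? new List<string>(files)
            : new List<string>();
    }
}
```

In a setup like the one described, the monitoring service would call IndexFile when a file changes (System.IO.FileSystemWatcher is one way to get such notifications), and the UI project would only ever call Search.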
Answer by Serj-Tm
Instead of Contains, it is better to use the Boyer-Moore search algorithm.
Failure scenario: a file without read permission.
Answer by Scott Chamberlain
I wrote something very similar. Here are a couple of changes I would recommend.
- Use Directory.EnumerateDirectories instead of GetDirectories. It returns immediately with an IEnumerable, so you don't need to wait for it to finish reading all of the directories before processing.
- Use ReadLines instead of ReadAllText. This loads only one line at a time into memory, which matters a lot if you hit a large file.
- If you are using a new enough version of .NET, use Parallel.ForEach. This allows you to search multiple files at once.
- You may not be able to open the file. You need to check for read permissions or add to the manifest that your program requires administrative privileges (you should still check, though).
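Applied to the text search from the question, the recommendations above might be combined like this. It is a sketch under assumptions: ParallelTextSearch and Find are hypothetical names, and unreadable files are simply skipped.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

static class ParallelTextSearch
{
    // Hypothetical helper: returns the files under 'folder' that contain searchWord.
    public static ConcurrentBag<string> Find(string folder, string searchWord)
    {
        // ConcurrentBag is safe to add to from several worker threads at once.
        var matches = new ConcurrentBag<string>();

        // EnumerateFiles is lazy, so files are handed out as they are discovered.
        Parallel.ForEach(
            Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories),
            file =>
            {
                try
                {
                    if (File.ReadLines(file).Any(line => line.Contains(searchWord)))
                        matches.Add(file);
                }
                catch (UnauthorizedAccessException)
                {
                    // No read permission for this file: skip it.
                }
                catch (IOException)
                {
                    // File is locked or was removed meanwhile: skip it.
                }
            });

        return matches;
    }
}
```

The try/catch only covers opening and reading individual files. A subdirectory that cannot be listed can still make the enumeration itself throw, so a more robust version would walk the directories manually.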
I was creating a binary search tool; here are some snippets of what I wrote to give you a hand.
    private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
    {
        Parallel.ForEach(Directory.EnumerateFiles(_folder, _filter, SearchOption.AllDirectories), Search);
    }

    // _array contains the binary pattern I am searching for.
    private void Search(string filePath)
    {
        if (Contains(filePath, _array))
        {
            // filePath points at a match.
        }
    }

    private static bool Contains(string path, byte[] search)
    {
        // I am doing ReadAllBytes because I am doing a binary search, not a text search.
        // There are no "lines" to separate out on.
        var file = File.ReadAllBytes(path);

        // The upper bound of Parallel.For is exclusive, so add 1 to also test
        // a match that sits at the very end of the file.
        var result = Parallel.For(0, file.Length - search.Length + 1, (i, loopState) =>
        {
            if (file[i] == search[0])
            {
                byte[] localCache = new byte[search.Length];
                Array.Copy(file, i, localCache, 0, search.Length);
                if (Enumerable.SequenceEqual(localCache, search))
                    loopState.Stop();
            }
        });

        // The loop only ends early when Stop() was called, that is, when a match was found.
        return result.IsCompleted == false;
    }
This uses two nested parallel loops. The design is terribly inefficient and could be greatly improved by using the Boyer-Moore search algorithm, but I could not find a binary implementation, and I did not have the time to implement it myself when I originally wrote this.
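For reference, a byte-array search in this family can be fairly short. The sketch below implements Boyer-Moore-Horspool, a simplified variant that uses only the bad-character shift table, not the full Boyer-Moore algorithm. The names BytePatternSearch and IndexOf are hypothetical and do not come from the original tool.

```csharp
using System;

static class BytePatternSearch
{
    // Boyer-Moore-Horspool: returns the index of the first occurrence of
    // 'pattern' in 'data', or -1 if there is none.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;
        if (pattern.Length > data.Length) return -1;

        // shift[b] = how far the window may slide when byte b is aligned
        // with the last position of the pattern.
        int[] shift = new int[256];
        for (int i = 0; i < shift.Length; i++)
            shift[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int last = pattern.Length - 1;
        int pos = 0;
        while (pos <= data.Length - pattern.Length)
        {
            // Compare the window against the pattern from right to left.
            int j = last;
            while (j >= 0 && data[pos + j] == pattern[j])
                j--;
            if (j < 0)
                return pos; // every byte matched

            pos += shift[data[pos + last]];
        }
        return -1;
    }
}
```

With something like this, the Contains method could reduce to checking whether BytePatternSearch.IndexOf(File.ReadAllBytes(path), search) is at least 0, leaving only the outer per-file loop parallel.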