C# Tesseract OCR simple example

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/16598390/

Time: 2020-08-10 01:25:46  Source: igfitidea

Tesseract OCR simple example

c# ocr tesseract

Asked by Will Robinson

Hi, can anyone give me a simple example of testing Tesseract OCR, preferably in C#?
I tried the demo found here. I downloaded the English dataset and unzipped it to the C drive, then modified the code as follows:

string path = @"C:\pic\mytext.jpg";
Bitmap image = new Bitmap(path);
Tesseract ocr = new Tesseract();
ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // If digit only
ocr.Init(@"C:\tessdata\", "eng", false); // To use correct tessdata
List<tessnet2.Word> result = ocr.DoOCR(image, Rectangle.Empty);
foreach (tessnet2.Word word in result)
    Console.WriteLine("{0} : {1}", word.Confidence, word.Text);

Unfortunately the code doesn't work. The program dies at the "ocr.Init(..." line. I couldn't get an exception even with try-catch.

I was able to run VietOCR, but that is a very large project for me to follow. I need a simple example like the one above.

Thanks

Answer by Rachel

Try updating the line to:

ocr.Init(@"C:\", "eng", false); // the path here should be the parent folder of tessdata
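
Because the question reports that the program dies inside Init without a catchable exception, it may help to check the folder layout before calling it. The following is a minimal sketch that uses only System.IO; the class and method names are invented for illustration, and it assumes the language files are named "<lang>.<something>" inside a folder called tessdata.

```csharp
using System;
using System.IO;

static class TessdataCheck
{
    // Returns true if 'parentFolder' contains a "tessdata" subfolder holding
    // at least one file whose name starts with "<lang>." (hypothetical helper).
    public static bool HasLanguageData(string parentFolder, string lang)
    {
        string dataDir = Path.Combine(parentFolder, "tessdata");
        return Directory.Exists(dataDir)
            && Directory.GetFiles(dataDir, lang + ".*").Length > 0;
    }

    static void Main()
    {
        // Assumed layout: the language files were unzipped to C:\tessdata,
        // so the parent folder is C:\
        string parent = @"C:\";
        Console.WriteLine(HasLanguageData(parent, "eng")
            ? "Found eng data under " + Path.Combine(parent, "tessdata")
            : "No eng data found under " + Path.Combine(parent, "tessdata"));
    }
}
```

If the check fails, the path passed to Init is probably not the parent of the tessdata folder.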

Answer by Will Robinson

OK. I found the solution here: tessnet2 fails to load, in the answer given by Adam.

Apparently I was using the wrong version of tessdata. I was following the source page instructions intuitively, and that caused the problem.

It says:

Quick Tessnet2 usage

  1. Download binary here, add a reference of the assembly Tessnet2.dll to your .NET project.

  2. Download language data definition file here and put it in tessdata directory. Tessdata directory and your exe must be in the same directory.

After you download the binary and follow the link to download the language file, you see many language files, but none of them is the right version. You need to select all versions and go to the next page to find the correct version (tesseract-2.00.eng)! They should either update the binary download link to version 3 or put the version 2 language file on the first page, or at least mention in bold that this version issue is a big deal!
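
One way to catch this mismatch early is to look at which files are actually in the tessdata folder. The sketch below rests on an assumption that is not stated in the answer: that Tesseract 2.x language data is a set of separate files (for example eng.inttemp and eng.unicharset), while 3.x data ships as a single eng.traineddata file. The class and method names are invented for illustration.

```csharp
using System;
using System.IO;

static class TessdataVersion
{
    // Guesses which Tesseract generation the language files in 'tessdataDir'
    // belong to, based only on file names (assumption, see above).
    public static string Describe(string tessdataDir, string lang)
    {
        bool v3 = File.Exists(Path.Combine(tessdataDir, lang + ".traineddata"));
        bool v2 = File.Exists(Path.Combine(tessdataDir, lang + ".inttemp"))
               && File.Exists(Path.Combine(tessdataDir, lang + ".unicharset"));
        if (v2 && v3) return "both";
        if (v2) return "2.x";
        if (v3) return "3.x or later";
        return "none";
    }

    static void Main(string[] args)
    {
        // Hypothetical default location; pass another folder as the first argument.
        string dir = args.Length > 0 ? args[0] : @"C:\tessdata";
        Console.WriteLine("eng data in " + dir + ": " + Describe(dir, "eng"));
    }
}
```

With tessnet2 (built on Tesseract 2), a result of "3.x or later" would point to the version problem described above.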

Anyway I found it. Thanks everyone.

Answer by Prasad

I had the same problem; now it's resolved. My tesseract2 folder contains subfolders for 32-bit and 64-bit. I copied the files from the 64-bit folder (as my system is 64-bit) to the main folder ("Tesseract2") and to the bin/Debug folder. Now my solution is working fine.
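
When choosing between the two sets of files, note that what usually matters for native DLLs is the bitness of the running process, which can be 32-bit even on a 64-bit system. A minimal sketch for checking this from code follows; the helper name and the returned folder labels are hypothetical and should be adapted to the actual folder names in your download.

```csharp
using System;

static class BitnessCheck
{
    // Maps the process bitness to a folder label (labels are placeholders).
    public static string NativeFolderFor(bool is64BitProcess)
    {
        return is64BitProcess ? "64-bit" : "32-bit";
    }

    static void Main()
    {
        Console.WriteLine("64-bit OS: " + Environment.Is64BitOperatingSystem);
        Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
        Console.WriteLine("Copy files from the "
            + NativeFolderFor(Environment.Is64BitProcess) + " folder");
    }
}
```

If the two printed values differ, the project's platform target (for example x86) is probably forcing a 32-bit process, in which case the 32-bit files are the ones to copy.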

Answer by Kaushal B

This worked for me. I had 3-4 more PDF-to-text extractors, and if one does not work the other one will ... Tesseract in particular. This code can be used on Windows 7, 8, and Server 2008. Hope this is helpful to you.

    do
    {
    // Sleep or Pause the Thread for 1 sec, if service is running too fast...
    Thread.Sleep(millisecondsTimeout: 1000);
    Guid tempGuid = ToSeqGuid();
    string newFileName = tempGuid.ToString().Split('-')[0];
    string outputFileName = appPath + "\\pdf2png\\" + fileNameithoutExtension + "-" + newFileName +
                            ".png";
    extractor.SaveCurrentImageToFile(outputFileName, ImageFormat.Png);
    // Create text file here using Tesseract
    foreach (var file in Directory.GetFiles(appPath + "\\pdf2png"))
    {
        try
        {
            var pngFileName = Path.GetFileNameWithoutExtension(file);
            string[] myArguments =
            {
                "/C tesseract ", file,
                " " + appPath + "\\png2text\\" + pngFileName
            }; // /C makes cmd.exe run the command and then close automatically
            string strParam = String.Join(" ", myArguments);

            var myCmdProcess = new Process();
            var theProcess = new ProcessStartInfo("cmd.exe", strParam)
            {
                CreateNoWindow = true,
                UseShellExecute = false,
                RedirectStandardOutput = true,
                RedirectStandardError = true,
                WindowStyle = ProcessWindowStyle.Minimized
            }; // Keep the cmd.exe window minimized
            myCmdProcess.StartInfo = theProcess;
            myCmdProcess.Exited += myCmdProcess_Exited;
            myCmdProcess.Start();

            //if (process)
            {
                /*
                MessageBox.Show("cmd.exe process started: " + Environment.NewLine +
                                "Process Name: " + myCmdProcess.ProcessName +
                                Environment.NewLine + " Process Id: " + myCmdProcess.Id
                                + Environment.NewLine + "process.Handle: " +
                                myCmdProcess.Handle);
                */
                Process.EnterDebugMode();
                //ShowWindow(hWnd: process.Handle, nCmdShow: 2);
                /*
                MessageBox.Show("After EnterDebugMode() cmd.exe process Exited: " +
                                Environment.NewLine +
                                "Process Name: " + myCmdProcess.ProcessName +
                                Environment.NewLine + " Process Id: " + myCmdProcess.Id
                                + Environment.NewLine + "process.Handle: " +
                                myCmdProcess.Handle);
                */
                myCmdProcess.WaitForExit(60000);
                /*
                MessageBox.Show("After WaitForExit() cmd.exe process Exited: " +
                                Environment.NewLine +
                                "Process Name: " + myCmdProcess.ProcessName +
                                Environment.NewLine + " Process Id: " + myCmdProcess.Id
                                + Environment.NewLine + "process.Handle: " +
                                myCmdProcess.Handle);
                */
                myCmdProcess.Refresh();
                Process.LeaveDebugMode();
                //myCmdProcess.Dispose();
                /*
                MessageBox.Show("After LeaveDebugMode() cmd.exe process Exited: " +
                                Environment.NewLine);
                */
            }


            //process.Kill();
            // Waits for the process to complete its task and exit automatically
            Thread.Sleep(millisecondsTimeout: 1000);

            // This works fine in Windows 7 Environment, and not in Windows 8
            // Try following code block
            // Check whether the process has not completely exited

            if (!myCmdProcess.HasExited)
            {
                //process.WaitForExit(2000); // Try to wait for exit 2 more seconds
                /*
                MessageBox.Show(" Process of cmd.exe was exited by WaitForExit(); Method " +
                                Environment.NewLine);
                */
                try
                {
                    // If not, then Kill the process
                    myCmdProcess.Kill();
                    //myCmdProcess.Dispose();
                    //if (!myCmdProcess.HasExited)
                    //{
                    //    myCmdProcess.Kill();
                    //}

                    MessageBox.Show(" Process of cmd.exe exited ( Killed ) successfully " +
                                    Environment.NewLine);
                }
                catch (System.ComponentModel.Win32Exception ex)
                {
                    MessageBox.Show(
                        " Exception: System.ComponentModel.Win32Exception " +
                        ex.ErrorCode + Environment.NewLine);
                }
                catch (NotSupportedException notSupporEx)
                {
                    MessageBox.Show(" Exception: NotSupportedException " +
                                    notSupporEx.Message +
                                    Environment.NewLine);
                }
                catch (InvalidOperationException invalidOperation)
                {
                    MessageBox.Show(
                        " Exception: InvalidOperationException " +
                        invalidOperation.Message + Environment.NewLine);
                    foreach (
                        var textFile in Directory.GetFiles(appPath + "\\png2text", "*.txt",
                            SearchOption.AllDirectories))
                    {
                        loggingInfo += textFile +
                                       " In Reading Text from generated text file by Tesseract " +
                                       Environment.NewLine;
                        strBldr.Append(File.ReadAllText(textFile));
                    }
                    // Delete text file after reading text here
                    Directory.GetFiles(appPath + "\\pdf2png").ToList().ForEach(File.Delete);
                    Directory.GetFiles(appPath + "\\png2text").ToList().ForEach(File.Delete);
                }
            }
        }
        catch (Exception exception)
        {
            MessageBox.Show(
                " Caught Exception in Generating image do{...}while{...} function " +
                Environment.NewLine + exception.Message + Environment.NewLine);
        }
    }
    // Delete png image here
    Directory.GetFiles(appPath + "\\pdf2png").ToList().ForEach(File.Delete);
    Thread.Sleep(millisecondsTimeout: 1000);
    // Read text from text file here
    foreach (var textFile in Directory.GetFiles(appPath + "\\png2text", "*.txt",
        SearchOption.AllDirectories))
    {
        loggingInfo += textFile +
                       " In Reading Text from generated text file by Tesseract " +
                       Environment.NewLine;
        strBldr.Append(File.ReadAllText(textFile));
    }
    // Delete text file after reading text here
    Directory.GetFiles(appPath + "\\png2text").ToList().ForEach(File.Delete);
    } while (extractor.GetNextImage()); // Advance image enumeration...
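
The core of the loop above is a call to the tesseract command-line tool with an image path and an output base name. As a smaller variation, not the answer's own code, the tool can probably be started directly instead of through cmd.exe. The sketch below assumes tesseract.exe is on the PATH and that it writes its result to "<outputBase>.txt"; the class and method names are invented for illustration.

```csharp
using System;
using System.Diagnostics;

static class TesseractCli
{
    // Builds the argument string for: tesseract <image> <outputBase>
    // Both paths are quoted so that spaces in them survive.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\"";
    }

    // Runs tesseract and returns its exit code, or -1 if it was killed on timeout.
    public static int Run(string imagePath, string outputBase, int timeoutMs)
    {
        var info = new ProcessStartInfo("tesseract", BuildArguments(imagePath, outputBase))
        {
            CreateNoWindow = true,
            UseShellExecute = false
        };
        using (var process = Process.Start(info))
        {
            if (!process.WaitForExit(timeoutMs))
            {
                process.Kill();
                return -1;
            }
            return process.ExitCode;
        }
    }

    static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("usage: <image> <outputBase>");
            return;
        }
        Console.WriteLine("exit code: " + Run(args[0], args[1], 60000));
    }
}
```

Output is not redirected here, so there are no stdout or stderr pipes to drain, and WaitForExit with a timeout takes the place of the fixed Thread.Sleep calls.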

Answer by Muadzir Aziz

In my case all of this worked, except for correct character recognition.

But you need to consider these few things:

  • Use the correct tessnet2 library.
  • Use the correct tessdata language version.
  • tessdata should be somewhere outside your application folder, where you can pass the full path in the Init parameter. Use ocr.Init(@"c:\tessdata", "eng", true);
  • Debugging will give you a headache. You then need to update your app.config. (I can't put the XML code here. Give me your email and I will email it to you.)

Hope this helps.

Answer by Alex G

Here's a great working example project: Tesseract OCR Sample (Visual Studio) with Leptonica Preprocessing

Tesseract OCR 3.02.02 API can be confusing, so this guides you through including the Tesseract and Leptonica dll into a Visual Studio C++ Project, and provides a sample file which takes an image path to preprocess and OCR. The preprocessing script in Leptonica converts the input image into black and white book-like text.

Setup

To include this in your own projects, you will need to reference the header files and lib and copy the tessdata folders and dlls.

Copy the tesseract-include folder to the root folder of your project. Now click on your project in Visual Studio Solution Explorer, and go to Project>Properties.

VC++ Directories>Include Directories:

..\tesseract-include\tesseract;..\tesseract-include\leptonica;$(IncludePath)

C/C++>Preprocessor>Preprocessor Definitions:

_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)

C/C++>Linker>Input>Additional Dependencies:

..\tesseract-include\libtesseract302.lib;..\tesseract-include\liblept168.lib;%(AdditionalDependencies)

Now you can include headers in your project's file:

include

include

Now copy the two dll files in tesseract-include and the tessdata folder in Debug to the Output Directory of your project.

When you initialize tesseract, you need to specify the location of the parent folder (!important) of the tessdata folder if it is not already the current directory of your executable file. You can copy my script, which assumes tessdata is installed in the executable's folder.

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    api->Init("D:\\tessdataParentFolder\\", ...

Sample

You can compile the provided sample, which takes one command line argument of the image path to use. The preprocess() function uses Leptonica to create a black and white book-like copy of the image which makes tesseract work with 90% accuracy. The ocr() function shows the functionality of the Tesseract API to return a string output. The toClipboard() can be used to save text to clipboard on Windows. You can copy these into your own projects.
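
To give a feel for what "black and white" preprocessing means, here is a deliberately simplified C# sketch. It is not the Leptonica routine from the sample: it applies a plain fixed threshold to an array of grayscale values, and the class name, method name, and threshold of 128 are all assumptions chosen for illustration.

```csharp
using System;

static class Binarize
{
    // Maps each grayscale value (0-255) to 0 (black) or 255 (white):
    // values below 'threshold' become black, all others white.
    public static byte[] Threshold(byte[] gray, byte threshold)
    {
        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
        {
            result[i] = gray[i] < threshold ? (byte)0 : (byte)255;
        }
        return result;
    }

    static void Main()
    {
        byte[] pixels = { 12, 200, 127, 128, 255 };
        // prints 0,255,0,255,255
        Console.WriteLine(string.Join(",", Threshold(pixels, 128)));
    }
}
```

Real preprocessing usually needs more than this (for example adapting the threshold to uneven lighting), which is what a library such as Leptonica provides.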

Answer by Adolfo Alejandro Araya

A simple example of testing Tesseract OCR in C#:

    public static string GetText(Bitmap imgsource)
    {
        var ocrtext = string.Empty;
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        {
            using (var img = PixConverter.ToPix(imgsource))
            {
                using (var page = engine.Process(img))
                {
                    ocrtext = page.GetText();
                }
            }
        }

        return ocrtext;
    }

Info: The tessdata folder must exist in the output directory: bin\Debug\
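
A relative path such as "./tessdata" is resolved against the current working directory, which is not always the folder containing the executable. As a hedged sketch, the path could instead be built from the application's base directory; the class and method names below are invented for illustration, and only System and System.IO are used.

```csharp
using System;
using System.IO;

static class TessdataPath
{
    // Returns the absolute path of a "tessdata" folder under 'baseDirectory'.
    public static string Resolve(string baseDirectory)
    {
        return Path.GetFullPath(Path.Combine(baseDirectory, "tessdata"));
    }

    static void Main()
    {
        // BaseDirectory is normally the folder the executable was loaded from,
        // for example bin\Debug\ when running from Visual Studio.
        string dataPath = Resolve(AppDomain.CurrentDomain.BaseDirectory);
        Console.WriteLine(dataPath + (Directory.Exists(dataPath) ? " exists" : " is missing"));
    }
}
```

The resolved value could then be passed to the TesseractEngine constructor in place of @"./tessdata".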

Answer by Doppelganger

I was able to get it to work by following these instructions.

  • Download the sample code: Tesseract sample code

  • Unzip it to a new location

  • Open ~\tesseract-samples-master\src\Tesseract.Samples.sln (I used Visual Studio 2017)

  • Install the Tesseract NuGet package for that project (or uninstall/reinstall, as I had to): NuGet Tesseract

  • Uncomment the last two meaningful lines in Tesseract.Samples.Program.cs: Console.Write("Press any key to continue . . . "); Console.ReadKey(true);

  • Run (hit F5)

  • You should see the sample's output in a Windows console window