
Disclaimer: this page is a translation/mirror of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/33881175/


Remove background noise from image to make text more clear for OCR

Tags: java, c++, opencv, ocr

Asked by Zy0n

I've written an application that segments an image based on the text regions within it, and extracts those regions as I see fit. What I'm attempting to do is clean the image so OCR (Tesseract) gives an accurate result. I have the following image as an example:


(image)

Running this through Tesseract gives a wildly inaccurate result. However, cleaning up the image (using Photoshop) so that it looks like the following:


(image)

gives exactly the result I would expect. The first image is already being run through the following method to clean it up to that point:


public Mat cleanImage (Mat srcImage) {
    // stretch the contrast, binarize with Otsu, then erode/dilate to remove small noise
    Core.normalize(srcImage, srcImage, 0, 255, Core.NORM_MINMAX);
    Imgproc.threshold(srcImage, srcImage, 0, 255, Imgproc.THRESH_OTSU);
    Imgproc.erode(srcImage, srcImage, new Mat());
    Imgproc.dilate(srcImage, srcImage, new Mat(), new Point(0, 0), 9);
    return srcImage;
}

What more can I do to clean the first image so it resembles the second image?


Edit: This is the original image before it's run through the cleanImage function.


(image)

Answered by dhanushka

My answer is based on the following assumptions. It's possible that none of them holds in your case.


  • It's possible for you to impose a threshold for bounding box heights in the segmented region. Then you should be able to filter out other components.
  • You know the average stroke widths of the digits. Use this information to minimize the chance that the digits are connected to other regions. You can use distance transform and morphological operations for this.

This is my procedure for extracting the digits:


  • Apply an Otsu threshold to the image (otsu)
  • Take the distance transform (dist)
  • Threshold the distance-transformed image using the stroke-width (= 8) constraint (sw2)
  • Apply a morphological operation to disconnect touching components (ws2op)
  • Filter bounding-box heights and make a guess where the digits are


stroke-width = 8 (bb); stroke-width = 10 (bb2)


EDIT


  • Prepare a mask using the convex hull of the found digit contours (mask)
  • Copy the digits region to a clean image using the mask


stroke-width = 8 (cl1)


stroke-width = 10 (cl2)


My Tesseract knowledge is a bit rusty. As I remember you can get a confidence level for the characters. You may be able to filter out noise using this information if you still happen to detect noisy regions as character bounding boxes.

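For illustration, here is a minimal sketch of querying per-word confidences. It assumes the Tess4J wrapper (net.sourceforge.tess4j), which is not part of the original question or answer; the file name and the confidence threshold are placeholders.

import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.Word;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;

public class ConfidenceFilter {
    public static void main(String[] args) throws Exception {
        Tesseract tesseract = new Tesseract();   // set datapath/language as needed
        BufferedImage img = ImageIO.read(new File("cleaned.png"));  // placeholder file name
        // word-level results with confidence (0-100) and bounding boxes
        List<Word> words = tesseract.getWords(img, ITessAPI.TessPageIteratorLevel.RIL_WORD);
        for (Word w : words) {
            if (w.getConfidence() > 60) {        // illustrative threshold
                System.out.println(w.getText() + " @ " + w.getBoundingBox());
            }
        }
    }
}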

C++ Code


// includes/namespaces added for completeness
#include <opencv2/opencv.hpp>
using namespace cv;
using namespace std;

Mat im = imread("aRh8C.png", 0);
// apply Otsu threshold
Mat bw;
threshold(im, bw, 0, 255, CV_THRESH_BINARY_INV | CV_THRESH_OTSU);
// take the distance transform
Mat dist;
distanceTransform(bw, dist, CV_DIST_L2, CV_DIST_MASK_PRECISE);
Mat dibw;
// threshold the distance transformed image
double SWTHRESH = 8;    // stroke width threshold
threshold(dist, dibw, SWTHRESH/2, 255, CV_THRESH_BINARY);
Mat kernel = getStructuringElement(MORPH_RECT, Size(3, 3));
// perform opening, in case digits are still connected
Mat morph;
morphologyEx(dibw, morph, CV_MOP_OPEN, kernel);
dibw.convertTo(dibw, CV_8U);
// find contours and filter
Mat cont;
morph.convertTo(cont, CV_8U);

Mat binary;
cvtColor(dibw, binary, CV_GRAY2BGR);

const double HTHRESH = im.rows * .5;    // height threshold
vector<vector<Point>> contours;
vector<Vec4i> hierarchy;
vector<Point> digits; // points corresponding to digit contours

findContours(cont, contours, hierarchy, CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));
for(int idx = 0; idx >= 0; idx = hierarchy[idx][0])
{
    Rect rect = boundingRect(contours[idx]);
    if (rect.height > HTHRESH)
    {
        // append the points of this contour to digit points
        digits.insert(digits.end(), contours[idx].begin(), contours[idx].end());

        rectangle(binary, 
            Point(rect.x, rect.y), Point(rect.x + rect.width - 1, rect.y + rect.height - 1),
            Scalar(0, 0, 255), 1);
    }
}

// take the convexhull of the digit contours
vector<Point> digitsHull;
convexHull(digits, digitsHull);
// prepare a mask
vector<vector<Point>> digitsRegion;
digitsRegion.push_back(digitsHull);
Mat digitsMask = Mat::zeros(im.rows, im.cols, CV_8U);
drawContours(digitsMask, digitsRegion, 0, Scalar(255, 255, 255), -1);
// expand the mask to include any information we lost in earlier morphological opening
morphologyEx(digitsMask, digitsMask, CV_MOP_DILATE, kernel);
// copy the region to get a cleaned image
Mat cleaned = Mat::zeros(im.rows, im.cols, CV_8U);
dibw.copyTo(cleaned, digitsMask);

EDIT


Java Code


// imports added for completeness (OpenCV 2.4-style Java API, as used in the original answer)
import org.opencv.core.*;
import org.opencv.highgui.Highgui;
import org.opencv.imgproc.Imgproc;
import java.util.ArrayList;
import java.util.List;

Mat im = Highgui.imread("aRh8C.png", 0);
// apply Otsu threshold
Mat bw = new Mat(im.size(), CvType.CV_8U);
Imgproc.threshold(im, bw, 0, 255, Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);
// take the distance transform
Mat dist = new Mat(im.size(), CvType.CV_32F);
Imgproc.distanceTransform(bw, dist, Imgproc.CV_DIST_L2, Imgproc.CV_DIST_MASK_PRECISE);
// threshold the distance transform
Mat dibw32f = new Mat(im.size(), CvType.CV_32F);
final double SWTHRESH = 8.0;    // stroke width threshold
Imgproc.threshold(dist, dibw32f, SWTHRESH/2.0, 255, Imgproc.THRESH_BINARY);
Mat dibw8u = new Mat(im.size(), CvType.CV_8U);
dibw32f.convertTo(dibw8u, CvType.CV_8U);

Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
// open to remove connections to stray elements
Mat cont = new Mat(im.size(), CvType.CV_8U);
Imgproc.morphologyEx(dibw8u, cont, Imgproc.MORPH_OPEN, kernel);
// find contours and filter based on bounding-box height
final double HTHRESH = im.rows() * 0.5; // bounding-box height threshold
List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
List<Point> digits = new ArrayList<Point>();    // contours of the possible digits
Imgproc.findContours(cont, contours, new Mat(), Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_SIMPLE);
for (int i = 0; i < contours.size(); i++)
{
    if (Imgproc.boundingRect(contours.get(i)).height > HTHRESH)
    {
        // this contour passed the bounding-box height threshold. add it to digits
        digits.addAll(contours.get(i).toList());
    }   
}
// find the convexhull of the digit contours
MatOfInt digitsHullIdx = new MatOfInt();
MatOfPoint hullPoints = new MatOfPoint();
hullPoints.fromList(digits);
Imgproc.convexHull(hullPoints, digitsHullIdx);
// convert hull index to hull points
List<Point> digitsHullPointsList = new ArrayList<Point>();
List<Point> points = hullPoints.toList();
for (Integer i: digitsHullIdx.toList())
{
    digitsHullPointsList.add(points.get(i));
}
MatOfPoint digitsHullPoints = new MatOfPoint();
digitsHullPoints.fromList(digitsHullPointsList);
// create the mask for digits
List<MatOfPoint> digitRegions = new ArrayList<MatOfPoint>();
digitRegions.add(digitsHullPoints);
Mat digitsMask = Mat.zeros(im.size(), CvType.CV_8U);
Imgproc.drawContours(digitsMask, digitRegions, 0, new Scalar(255, 255, 255), -1);
// dilate the mask to capture any info we lost in earlier opening
Imgproc.morphologyEx(digitsMask, digitsMask, Imgproc.MORPH_DILATE, kernel);
// cleaned image ready for OCR
Mat cleaned = Mat.zeros(im.size(), CvType.CV_8U);
dibw8u.copyTo(cleaned, digitsMask);
// feed cleaned to Tesseract

Answered by Hazem Abdullah

I think you need to do more work in the pre-processing stage to make the image as clear as possible before calling Tesseract.


My ideas for doing that are the following:


1- Extract and find the contours in the image (check this and this)


2- Each contour has a width, height and area, so you may filter the contours according to width, height and area (check this and this). You may also use parts of the contour analysis code here to filter the contours, and you may delete the contours that are not similar to a "letter or number" contour using template contour matching (a rough sketch of this kind of size filtering appears after this list).


3- After filtering the contours, you can check where the letters and numbers are in the image, so you may need to use some text detection methods like here


4- All you need to do now is remove the non-text areas, and the contours that are not good, from the image


5- Now you can create your own binarization method, or use Tesseract's binarization, and then run OCR on the image.

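As an illustration of steps 2 and 4, here is a rough sketch of contour filtering with the OpenCV Java bindings; the size and area limits are placeholders that would need tuning for the actual document.

import org.opencv.core.Mat;
import org.opencv.core.MatOfPoint;
import org.opencv.core.Rect;
import org.opencv.imgproc.Imgproc;
import java.util.ArrayList;
import java.util.List;

public class ContourFilter {
    // keep only contours whose bounding box and area fall inside illustrative limits
    static List<MatOfPoint> filterContours(Mat binary) {
        List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
        Imgproc.findContours(binary.clone(), contours, new Mat(),
                Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);
        List<MatOfPoint> kept = new ArrayList<MatOfPoint>();
        for (MatOfPoint c : contours) {
            Rect r = Imgproc.boundingRect(c);
            double area = Imgproc.contourArea(c);
            // placeholder thresholds: keep plausibly character-sized shapes only
            if (r.height > 20 && r.height < 100 && r.width < 2 * r.height
                    && area > 50 && area < 5000) {
                kept.add(c);
            }
        }
        return kept;
    }
}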

Sure, these are the best steps to do this; you may use only some of them, and that may be enough for you.


Other ideas:


  • You may use different ways to do this; the best idea is to find a way to detect the digit and character locations using different methods like template matching, or feature-based methods like HOG.

  • You may first binarize your image to get a binary image, then apply opening with horizontal and vertical line structuring elements; this will help you detect the edges afterwards, segment the image, and then run OCR (a rough sketch of this and of the Hough-transform idea below appears after this list).

  • After detecting all the contours in the image, you may also use a Hough transformation to detect any kind of line and defined curve like this one; this way you can detect the characters that are aligned, so you can segment the image and run OCR after that.

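To make the last two ideas more concrete, here is a rough sketch, using the OpenCV Java bindings, of opening with horizontal/vertical line structuring elements and of a probabilistic Hough transform; the kernel lengths and Hough parameters are placeholders.

import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgproc.Imgproc;

public class LineExtraction {
    // opening with long thin kernels keeps only horizontal / vertical strokes
    static void extractLines(Mat binary, Mat horizontal, Mat vertical) {
        Mat hKernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(25, 1));
        Mat vKernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(1, 25));
        Imgproc.morphologyEx(binary, horizontal, Imgproc.MORPH_OPEN, hKernel);
        Imgproc.morphologyEx(binary, vertical, Imgproc.MORPH_OPEN, vKernel);
    }

    // detect straight segments on an edge image with a probabilistic Hough transform
    static Mat detectLines(Mat edges) {
        Mat lines = new Mat();   // endpoints (x1, y1, x2, y2) of the detected segments
        Imgproc.HoughLinesP(edges, lines, 1, Math.PI / 180, 50, 30, 10);
        return lines;
    }
}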

A much easier way:


1- Do binarization (image)


2- Apply some morphological operations to separate the contours:


(image)

3- Invert the colors in the image (this may be done before step 2)


(image)

4- Find all contours in the image


(image)

5- Delete all contours whose width is greater than their height, the very small contours, the very large ones, and the non-rectangular contours


(image)

Note: you may use the text detection methods (or HOG or edge detection) instead of steps 4 and 5


6- Find the large rectangle that contains all the remaining contours in the image


(image)

7- You may do some extra pre-processing to enhance the input for Tesseract, and then call the OCR. (I advise you to crop the image and use that as the input to the OCR; I mean crop the yellow rectangle rather than feeding the whole image, which will also improve the results.)

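A rough sketch of this whole pipeline (steps 1-7), using the OpenCV Java bindings, is below; the thresholds and the very crude contour filter are placeholders and would need tuning for the actual document.

import org.opencv.core.Mat;
import org.opencv.core.MatOfPoint;
import org.opencv.core.Rect;
import org.opencv.core.Size;
import org.opencv.imgproc.Imgproc;
import java.util.ArrayList;
import java.util.List;

public class EasyPipeline {
    static Mat cropTextRegion(Mat gray) {
        // 1-3: binarize (inverted, so text becomes white) and open to separate contours
        Mat bw = new Mat();
        Imgproc.threshold(gray, bw, 0, 255, Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);
        Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
        Imgproc.morphologyEx(bw, bw, Imgproc.MORPH_OPEN, kernel);

        // 4-5: find contours and keep only plausibly character-shaped ones (placeholder limits)
        List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
        Imgproc.findContours(bw.clone(), contours, new Mat(),
                Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE, maxX = 0, maxY = 0;
        boolean found = false;
        for (MatOfPoint c : contours) {
            Rect r = Imgproc.boundingRect(c);
            if (r.height >= r.width && r.height > 15 && r.height < gray.rows() / 2) {
                // 6: grow a rectangle that encloses every surviving contour
                minX = Math.min(minX, r.x);
                minY = Math.min(minY, r.y);
                maxX = Math.max(maxX, r.x + r.width);
                maxY = Math.max(maxY, r.y + r.height);
                found = true;
            }
        }
        if (!found) {
            return gray;   // nothing plausible found, fall back to the full image
        }
        // 7: crop the enclosing rectangle and hand only that region to the OCR
        return new Mat(gray, new Rect(minX, minY, maxX - minX, maxY - minY));
    }
}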

Answered by MarkusAtCvlabDotDe

Would that image help you?


(image)

The algorithm producing that image would be easy to implement. I am sure that if you tweak some of its parameters, you can get very good results for this kind of image.


I tested all the images with tesseract:


  • Original image : Nothing detected
  • Processed image #1 : Nothing detected
  • Processed image #2 : 12-14 (exact match)
  • My processed image : y'1'2-14/j

Answered by Yannis Douros

Just a little bit of thinking out of the box:


I can see from your original image that it's a rather rigorously preformatted document; it looks like a road tax badge or something like that, right?


If the assumption above is correct, then you could implement a less generic solution: the noise you are trying to get rid of comes from features of the specific document template, so it occurs in specific, known regions of your image. In fact, so does the text.


In that case, one way to go about it is to define the boundaries of the regions where you know that there is such "noise", and just white them out.

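A minimal sketch of that idea, assuming the OpenCV 2.4-style Java API used elsewhere on this page; the rectangle coordinates are entirely made up and would come from measuring the template once.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Rect;
import org.opencv.core.Scalar;

public class TemplateMask {
    // paint known "noise" regions of the fixed template white so they never reach the OCR
    static void whiteOutKnownRegions(Mat gray) {
        Rect[] noiseRegions = {
            new Rect(0, 0, 120, 40),     // hypothetical logo area
            new Rect(0, 200, 300, 30)    // hypothetical watermark strip
        };
        for (Rect r : noiseRegions) {
            // Core.rectangle in the 2.4 Java API (Imgproc.rectangle in OpenCV 3+)
            Core.rectangle(gray, r.tl(), r.br(), new Scalar(255), -1);
        }
    }
}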

Then, follow the rest of the steps that you are already following: Do the noise reduction that will remove the finest detail (i.e. the background pattern that looks like the safety watermark or hologram in the badge). The result should be clear enough for Tesseract to process without trouble.


Just a thought anyway. Not a generic solution, I acknowledge that, so it depends on what your actual requirements are.


Answered by Gowthaman

The font size should not be too big or too small; approximately, it should be in the range of 10-12 pt (i.e., a character height roughly above 20 and below 80 pixels). You can downsample the image and try Tesseract again. Also, a few fonts are not trained in Tesseract, and issues may arise if the text is not in one of the trained fonts.

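As a small illustrative sketch (not part of the original answer), the image could be rescaled so that a measured character height lands inside that rough 20-80 px range; the 30 px target below is an arbitrary choice.

import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgproc.Imgproc;

public class Rescale {
    // scale so a measured character height (e.g. from a bounding box) ends up near ~30 px
    static Mat scaleToCharHeight(Mat src, double measuredCharHeight) {
        double factor = 30.0 / measuredCharHeight;   // 30 px target is illustrative
        Mat dst = new Mat();
        Imgproc.resize(src, dst, new Size(), factor, factor, Imgproc.INTER_AREA);
        return dst;
    }
}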