.NET: Extract text and text rectangle coordinates from a PDF file using iTextSharp

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4577789/

Time: 2020-09-03 15:05:12  Source: igfitidea

Extract text and text rectangle coordinates from a Pdf file using itextsharp

.net pdf itextsharp

Asked by Pakhu

I'm trying to get all the words and their location coordinates from a PDF file. I've succeeded using the Acrobat API on .NET. Now I'm trying to get the same result using a free API, such as iTextSharp (the .NET version). I can get the text (line by line) with PRTokeniser, but I have no idea how to get the coordinates of a line, let alone of each word.

Answered by greenhat

My account is too new to reply to Mark Storer's answer.

I wasn't able to use the LocationTextExtractionStrategy directly (I think I must be doing something wrong). When I used it I was able to get the text, but I couldn't figure out how to get the coords for each string (or line of strings).

I ended up subclassing the LocationTextExtractionStrategy and exposing the data I wanted, because it does have that data internally.

I also wanted it in .net... so here is a sloppy C# version of what I put together.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using iTextSharp.text.pdf.parser;

namespace PdfHelper
{
    /// <summary>
    /// Taken from http://www.java-frameworks.com/java/itext/com/itextpdf/text/pdf/parser/LocationTextExtractionStrategy.java.html
    /// </summary>
    class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
    {
        private List<TextChunk> m_locationResult = new List<TextChunk>();
        private List<TextInfo> m_TextLocationInfo = new List<TextInfo>();
        public List<TextChunk> LocationResult 
        {
            get { return m_locationResult; }
        }
        public List<TextInfo> TextLocationInfo
        {
            get { return m_TextLocationInfo; }
        }

        /// <summary>
        /// Creates a new LocationTextExtractionStrategyEx
        /// </summary>
        public LocationTextExtractionStrategyEx()
        {
        }

        /// <summary>
        /// Returns the result so far
        /// </summary>
        /// <returns>a String with the resulting text</returns>
        public override String GetResultantText()
        {
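            m_locationResult.Sort();
            // Rebuild the line list from scratch so repeated calls don't append duplicate entries.
            m_TextLocationInfo.Clear();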

            StringBuilder sb = new StringBuilder();
            TextChunk lastChunk = null;
            TextInfo lastTextInfo = null;
            foreach (TextChunk chunk in m_locationResult)
            {
                if (lastChunk == null)
                {
                    sb.Append(chunk.Text);
                    lastTextInfo = new TextInfo(chunk);
                    m_TextLocationInfo.Add(lastTextInfo);
                }
                else
                {
                    if (chunk.sameLine(lastChunk))
                    {
                        float dist = chunk.distanceFromEndOf(lastChunk);

                        if (dist < -chunk.CharSpaceWidth)
                        {
                            sb.Append(' ');
                            lastTextInfo.addSpace();
                        }
                        //append a space if the trailing char of the prev string wasn't a space && the 1st char of the current string isn't a space
                        else if (dist > chunk.CharSpaceWidth / 2.0f && chunk.Text[0] != ' ' && lastChunk.Text[lastChunk.Text.Length - 1] != ' ')
                        {
                            sb.Append(' ');
                            lastTextInfo.addSpace();
                        }
                        sb.Append(chunk.Text);
                        lastTextInfo.appendText(chunk);
                    }
                    else
                    {
                        sb.Append('\n');
                        sb.Append(chunk.Text);
                        lastTextInfo = new TextInfo(chunk);
                        m_TextLocationInfo.Add(lastTextInfo);
                    }
                }
                lastChunk = chunk;
            }
            return sb.ToString();
        }

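        /// <summary>
        /// Records each rendered text chunk together with its baseline, ascent line, and descent line.
        /// </summary>
        /// <param name="renderInfo">render info for the text chunk being drawn</param>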
        public override void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment segment = renderInfo.GetBaseline();
            TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetAscentLine(), renderInfo.GetDescentLine());
            m_locationResult.Add(location);
        }

        public class TextChunk : IComparable, ICloneable
        {
            string m_text;
            Vector m_startLocation;
            Vector m_endLocation;
            Vector m_orientationVector;
            int m_orientationMagnitude;
            int m_distPerpendicular;
            float m_distParallelStart;
            float m_distParallelEnd;
            float m_charSpaceWidth;

            public LineSegment AscentLine;
            public LineSegment DecentLine;

            public object Clone()
            {
                TextChunk copy = new TextChunk(m_text, m_startLocation, m_endLocation, m_charSpaceWidth, AscentLine, DecentLine);
                return copy;
            }

            public string Text
            {
                get { return m_text; }
                set { m_text = value; }
            }
            public float CharSpaceWidth
            {
                get { return m_charSpaceWidth; }
                set { m_charSpaceWidth = value; }
            }
            public Vector StartLocation
            {
                get { return m_startLocation; }
                set { m_startLocation = value; }
            }
            public Vector EndLocation
            {
                get { return m_endLocation; }
                set { m_endLocation = value; }
            }

            /// <summary>
            /// Represents a chunk of text, its orientation, and location relative to the orientation vector
            /// </summary>
            /// <param name="txt"></param>
            /// <param name="startLoc"></param>
            /// <param name="endLoc"></param>
            /// <param name="charSpaceWidth"></param>
            public TextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth, LineSegment ascentLine, LineSegment decentLine)
            {
                m_text = txt;
                m_startLocation = startLoc;
                m_endLocation = endLoc;
                m_charSpaceWidth = charSpaceWidth;
                AscentLine = ascentLine;
                DecentLine = decentLine;

                m_orientationVector = m_endLocation.Subtract(m_startLocation).Normalize();
                m_orientationMagnitude = (int)(Math.Atan2(m_orientationVector[Vector.I2], m_orientationVector[Vector.I1]) * 1000);

                // see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
                // the two vectors we are crossing are in the same plane, so the result will be purely
                // in the z-axis (out of plane) direction, so we just take the I3 component of the result
                Vector origin = new Vector(0, 0, 1);
                m_distPerpendicular = (int)(m_startLocation.Subtract(origin)).Cross(m_orientationVector)[Vector.I3];

                m_distParallelStart = m_orientationVector.Dot(m_startLocation);
                m_distParallelEnd = m_orientationVector.Dot(m_endLocation);
            }

            /// <summary>
            /// true if this location is on the same line as the other text chunk
            /// </summary>
            /// <param name="textChunkToCompare">the location to compare to</param>
            /// <returns>true if this location is on the same line as the other</returns>
            public bool sameLine(TextChunk textChunkToCompare)
            {
                if (m_orientationMagnitude != textChunkToCompare.m_orientationMagnitude) return false;
                if (m_distPerpendicular != textChunkToCompare.m_distPerpendicular) return false;
                return true;
            }

            /// <summary>
            /// Computes the distance between the end of 'other' and the beginning of this chunk
            /// in the direction of this chunk's orientation vector.  Note that it's a bad idea
            /// to call this for chunks that aren't on the same line and orientation, but we don't
            /// explicitly check for that condition for performance reasons.
            /// </summary>
            /// <param name="other"></param>
            /// <returns>the distance, along this chunk's orientation, between the end of 'other' and the beginning of this chunk</returns>
            public float distanceFromEndOf(TextChunk other)
            {
                float distance = m_distParallelStart - other.m_distParallelEnd;
                return distance;
            }

            /// <summary>
            /// Compares based on orientation, perpendicular distance, then parallel distance
            /// </summary>
            /// <param name="obj"></param>
            /// <returns></returns>
            public int CompareTo(object obj)
            {
                if (obj == null) throw new ArgumentException("Object is not a TextChunk");

                TextChunk rhs = obj as TextChunk;
                if (rhs != null)
                {
                    if (this == rhs) return 0;

                    int rslt;
                    rslt = m_orientationMagnitude - rhs.m_orientationMagnitude;
                    if (rslt != 0) return rslt;

                    rslt = m_distPerpendicular - rhs.m_distPerpendicular;
                    if (rslt != 0) return rslt;

                    // note: it's never safe to check floating point numbers for equality, and if two chunks
                    // are truly right on top of each other, which one comes first or second just doesn't matter
                    // so we arbitrarily choose this way.
                    rslt = m_distParallelStart < rhs.m_distParallelStart ? -1 : 1;

                    return rslt;
                }
                else
                {
                    throw new ArgumentException("Object is not a TextChunk");
                }
            }
        }

        public class TextInfo
        {
            public Vector TopLeft;
            public Vector BottomRight;
            private string m_Text;

            public string Text
            {
                get { return m_Text; }
            }

            /// <summary>
            /// Create a TextInfo.
            /// </summary>
            /// <param name="initialTextChunk"></param>
            public TextInfo(TextChunk initialTextChunk)
            {
                TopLeft = initialTextChunk.AscentLine.GetStartPoint();
                BottomRight = initialTextChunk.DecentLine.GetEndPoint();
                m_Text = initialTextChunk.Text;
            }

            /// <summary>
            /// Add more text to this TextInfo.
            /// </summary>
            /// <param name="additionalTextChunk"></param>
            public void appendText(TextChunk additionalTextChunk)
            {
                BottomRight = additionalTextChunk.DecentLine.GetEndPoint();
                m_Text += additionalTextChunk.Text;
            }

            /// <summary>
            /// Add a space to the TextInfo.  This will leave the endpoint out of sync with the text.
            /// The assumption is that you will add more text after the space which will correct the endpoint.
            /// </summary>
            public void addSpace()
            {
                m_Text += ' ';
            }


        }
    }
}

I added a TextLocationInfo property which hands back a List of lines of text plus the coords for those lines (upper left and lower right), which you can use as a bounding box. Note that the list is only filled in when GetResultantText() runs.
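
For what it's worth, here is a minimal sketch of how the class above might be driven, assuming iTextSharp 5.x and code living in the same assembly. `BoundingBoxDemo`, `PrintLineBoxes`, `path`, and `pageNumber` are hypothetical names for illustration. It relies on `PdfTextExtractor.GetTextFromPage` calling `GetResultantText()` on the strategy, which is what populates `TextLocationInfo`.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfHelper
{
    static class BoundingBoxDemo
    {
        // Prints one bounding box per extracted line of a single page.
        // "path" and "pageNumber" are hypothetical parameters for this sketch.
        public static void PrintLineBoxes(string path, int pageNumber)
        {
            PdfReader reader = new PdfReader(path);
            try
            {
                var strategy = new LocationTextExtractionStrategyEx();

                // The returned text is ignored here; the call is made for its side effect
                // of filling strategy.TextLocationInfo.
                PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);

                foreach (LocationTextExtractionStrategyEx.TextInfo line in strategy.TextLocationInfo)
                {
                    float left = line.TopLeft[Vector.I1];
                    float top = line.TopLeft[Vector.I2];
                    float right = line.BottomRight[Vector.I1];
                    float bottom = line.BottomRight[Vector.I2];
                    Console.WriteLine("({0}, {1}) - ({2}, {3}): {4}",
                        left, top, right, bottom, line.Text);
                }
            }
            finally
            {
                reader.Close();
            }
        }
    }
}
```

The coordinates come back in PDF user space, where by default the origin is the lower-left corner of the page and y grows upward, so `top` will normally be larger than `bottom`.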

I also saw something odd while I was first playing around. It looked like I got the same coords when I pulled the startPoint and endPoint from the baseline (I think the right thing to do, and the thing I did, was to pull those points from ascentLine and DecentLine). On my initial pass I just used the baseline, and it's odd that I didn't see a difference in the resulting coords. So, a word to the wary... I'm not sure the coords I'm providing are right. I just think they are/should be.

Answered by Mark Storer

You'll want to use the com.itextpdf.text.pdf.parser package classes. They track the current transformation, color, font, and so forth.

Sadly, these classes weren't covered in the new book, so you're left with the JavaDoc, and mentally converting it all from Java to C#, which isn't much of a stretch.

So you'll want to plug a LocationTextExtractionStrategy into a PdfTextExtractor.
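
In C# that plumbing might look roughly like the sketch below, assuming iTextSharp 5.x. `ExtractDemo`, `DumpText`, and `path` are hypothetical names; the iTextSharp calls are `PdfReader`, `NumberOfPages`, and `PdfTextExtractor.GetTextFromPage`.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class ExtractDemo
{
    // Prints the text of every page, ordered by position on the page.
    // "path" is a hypothetical parameter naming a PDF file on disk.
    public static void DumpText(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            // Page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so chunks from earlier pages don't carry over.
                var strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                Console.WriteLine("--- page {0} ---", page);
                Console.WriteLine(text);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Out of the box this only returns text. Getting at the coordinates means overriding `RenderText` in a subclass and recording what the `TextRenderInfo` reports.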

This will give you the strings and locations as they are presented in the pdf. It will be up to you to interpret that as words (and paragraphs if need be, ouch).
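
As a rough illustration of that "interpret it as words" step, here is a self-contained sketch that splits one extracted line into words and estimates each word's horizontal extent. `WordBox` and `WordSplitter` are hypothetical types, not part of iTextSharp, and the loud assumption is that the text is horizontal and every character has the same width, which is only an approximation for proportional fonts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical value type for this sketch; not an iTextSharp class.
public struct WordBox
{
    public string Text;
    public float Left;
    public float Right;
}

public static class WordSplitter
{
    // Splits a line on whitespace and spreads the line's width evenly over its
    // characters to estimate where each word starts and ends.
    public static List<WordBox> Split(string line, float left, float right)
    {
        var words = new List<WordBox>();
        if (string.IsNullOrEmpty(line)) return words;

        float charWidth = (right - left) / line.Length;
        int start = -1; // index of the current word's first character, or -1 if between words

        for (int i = 0; i <= line.Length; i++)
        {
            bool atBreak = i == line.Length || char.IsWhiteSpace(line[i]);
            if (!atBreak && start < 0)
            {
                start = i;
            }
            else if (atBreak && start >= 0)
            {
                words.Add(new WordBox
                {
                    Text = line.Substring(start, i - start),
                    Left = left + start * charWidth,
                    Right = left + i * charWidth
                });
                start = -1;
            }
        }
        return words;
    }
}
```

For example, `WordSplitter.Split("ab cd", 0f, 50f)` yields "ab" spanning 0 to 20 and "cd" spanning 30 to 50. Anything more accurate needs per-glyph positions from the renderer rather than an even split.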

Keep in mind that PDF doesn't know anything about text layout. Every character can be placed individually. If someone were so inclined (and they'd have to be a few tacos short of a combo platter to do so) they could draw all the 'a's on a given page, then all the 'b's, and so forth.

More realistically, someone might draw all the text on the page that uses FontA, then everything for FontB, and so on. This can produce more efficient content streams. Keep in mind that italic and bold (and bold italic) are all separate fonts. If someone marks only part of a word as bold (or whatever), then that Logical Word has to be broken up into at least two drawing commands.

But lots of folks just write out their text into PDF in logical order... which is Very Handy for folks who are stuck trying to parse it, but you Must Not Expect It, because you will invariably run into some oddball PDF that doesn't.
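
To make the point concrete, here is a self-contained sketch of rebuilding reading order from positions instead of trusting the order in which chunks were drawn. `Chunk`, `ReadingOrder`, and `yTolerance` are hypothetical names. It assumes horizontal, left-to-right text in default PDF user space (y grows upward), and it is a simplified stand-in for the idea behind a location-based strategy, not iTextSharp's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for this sketch: a drawn string and the start of its baseline.
public sealed class Chunk
{
    public readonly string Text;
    public readonly float X;
    public readonly float Y;

    public Chunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ReadingOrder
{
    // Groups chunks into lines by baseline height, then orders each line left to right.
    // Chunks whose Y differs from a line's first chunk by at most yTolerance share that line.
    public static List<string> ToLines(IEnumerable<Chunk> chunks, float yTolerance)
    {
        var lines = new List<List<Chunk>>();

        // Higher Y first, because y grows upward in default PDF user space.
        foreach (Chunk chunk in chunks.OrderByDescending(c => c.Y))
        {
            List<Chunk> current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - chunk.Y) <= yTolerance)
            {
                current.Add(chunk);
            }
            else
            {
                lines.Add(new List<Chunk> { chunk });
            }
        }

        return lines
            .Select(line => string.Concat(line.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}
```

Fed chunks in a font-grouped drawing order, such as "World" at (60, 700), "Hello " at (10, 700.2), "line" at (40, 680), and "Second " at (10, 680), a call with a tolerance of 1 gives back "Hello World" and then "Second line".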
