Extract text by line from PDF using iTextSharp C#
Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the source URL and author information, and attribute it to the original authors (not me): StackOverFlow
Source: http://stackoverflow.com/questions/15748800/
Asked by Xander
I need to run some analysis by extracting data from a PDF document.
Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned the text as a single long line.
Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data line by line would be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();
Answered by adebayo
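Try:
// C# form of the call (the Java spelling getTextFromPage/split does not compile against iTextSharp)
string page = PdfTextExtractor.GetTextFromPage(reader, 2);
string[] s1 = page.Split('\n');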
Answered by Kumar Sandeep
Use LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy. The text extracted by LocationTextExtractionStrategy contains a newline character at the end of each line.
// renderFilter (a RenderFilter) and pageno (the 1-based page number) are assumed to be defined by the surrounding code.
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader, pageno, strategy);
string[] lines = pdftext.Split('\n');
return lines;
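Splitting on '\n' alone can leave stray '\r' characters or empty entries if the extracted text happens to contain Windows line endings or blank lines. The following is a minimal sketch of a more tolerant splitter. LineSplitter and SplitLines are hypothetical names introduced here for illustration; they are not part of iTextSharp, and the sketch only assumes the extracted page text is an ordinary string.

```csharp
using System;

// Hypothetical helper (not an iTextSharp type): splits extracted page text into lines.
public static class LineSplitter
{
    public static string[] SplitLines(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return new string[0];
        }

        // Accept both "\r\n" and "\n" as separators and drop empty entries.
        return text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

It could be called as string[] lines = LineSplitter.SplitLines(pdftext); in place of the plain Split('\n') call. Dropping empty entries is a design choice: keep them if blank lines carry meaning for your analysis.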
Answered by Snziv Gupta
public void ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Create a fresh strategy for each page; a reused LocationTextExtractionStrategy
            // keeps the text of earlier pages and returns it again.
            ITextExtractionStrategy strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
            string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
            string[] lines = page.Split('\n');
            foreach (string line in lines)
            {
                MessageBox.Show(line);
            }
        }
    }
}
Answered by Silent Sojourner
LocationTextExtractionStrategy automatically inserts '\n' into the output text. However, it sometimes inserts '\n' where it shouldn't. In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is this method:
public virtual bool SameLine(ITextChunkLocation other) {
    return OrientationMagnitude == other.OrientationMagnitude &&
           DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted when there is only a small difference between DistPerpendicular and other.DistPerpendicular, so you need to relax the comparison to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10
Alternatively, you can put that piece of code in the RenderText method of your custom TextExtractionStrategy/RenderListener class.
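To illustrate the tolerance idea in isolation, here is a minimal sketch that does not depend on iTextSharp. ChunkPosition is a hypothetical stand-in for a text chunk location, and the tolerance parameter is an assumption added for illustration (the answer above uses a fixed value of 10); how you wire such a check into a custom strategy depends on your iTextSharp version.

```csharp
using System;

// Hypothetical stand-in for a text chunk location; not an iTextSharp type.
public sealed class ChunkPosition
{
    public int OrientationMagnitude { get; }
    public int DistPerpendicular { get; }

    public ChunkPosition(int orientationMagnitude, int distPerpendicular)
    {
        OrientationMagnitude = orientationMagnitude;
        DistPerpendicular = distPerpendicular;
    }

    // Two chunks are on the same line when they share an orientation and their
    // perpendicular distances differ by less than the given tolerance.
    public bool SameLine(ChunkPosition other, int tolerance)
    {
        return OrientationMagnitude == other.OrientationMagnitude
            && Math.Abs(DistPerpendicular - other.DistPerpendicular) < tolerance;
    }
}
```

With a tolerance of 10, chunks at perpendicular distances 100 and 104 count as the same line, while 100 and 120 do not. A tolerance that is too large will merge genuinely separate lines, so the value usually has to be tuned against the font size of the documents being parsed.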
Answered by supersoka
I know this is an older post, but I spent a lot of time trying to figure this out, so I'm sharing it for anyone who googles this in the future:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"Your said path\the file name.pdf";
            string outPath = @"the output said path\the text file name.txt";
            int pagesToScan = 2;
            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);
                // Open the output file once in append mode instead of reopening it for every line.
                using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
                {
                    // Use page <= reader.NumberOfPages to scan every page of the PDF.
                    for (int page = 1; page <= pagesToScan; page++)
                    {
                        ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
                        strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                        // Round-trips the text through the system default encoding; on .NET Framework this
                        // can replace characters outside the ANSI code page with '?', so drop it if that matters.
                        strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                        // Create the string array, storing the page text line by line.
                        string[] lines = strText.Split('\n');
                        foreach (string line in lines)
                        {
                            file.WriteLine(line);
                        }
                    }
                }
                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}
I had the program read a PDF from a set path and write the result to a text file, but you can adapt that to do anything with the lines. This builds on Snziv Gupta's answer.
Answered by dodgy_coder
None of the other code samples here worked for me, probably because of changes in the iText 7 API.
This minimal example works:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());