Java: How to export PDF form fields to XML automatically
Note: This page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/21009608/
Asked by Michael
I have a PDF file that includes form fields, and I need to export the data into an XML file AUTOMATICALLY. Here is a screenshot of a sample form I created for testing:
Note: Exporting it MANUALLY with Acrobat Professional works great: click Tools > Form > Export Form Data and choose the xml extension for the output file. This is the result I get when I export it manually:
<?xml version="1.0" encoding="UTF-8"?>
<fields>
<first_name>John</first_name>
<last_name>Doe</last_name>
</fields>
However, I need to automate it, e.g. with a Python script, a Java implementation, or a command-line tool. Any ideas which libraries or tools I could use to export form field data to XML? The tool or library should be open source so that I can integrate it into my workflow.
I already tried the Python pdfminer library, which helped me export the static parts of the PDF file (like Static form header, First name: and Last name:). But how do I export the form field data (in my case, the content of the form fields first_name and last_name)?
EDIT: Feel free to download the sample.pdf file here.
Accepted answer by jimmyp.smith
How about Apache PDFBox? It is open source and could fit your needs, since the website says "Extract forms data from PDF forms or prefill a PDF form."
EDIT: Check out the PrintFields example.
Answer by annaskulimowska
In Java there are a few libraries for working with PDF, but in general it is hard to get formatted information out of a PDF. I have never implemented this myself, but Qoppa looks good and seems to be advanced, although it is not free. It contains jPDFFields, which should be useful for extracting values from form fields. There is also a similar thread with some information about a command-line tool.
I hope it will be helpful for you.
Answer by James Kingsbery
In bash, you can do this (at least with my version of these tools, less 444 and cat 8.13):
less ~/Downloads/sample.pdf | cat
I get output that looks like this:
Static form header
First name: John
Last name: Doe
You can then parse this fairly easily using Java/Python/awk/whatever.
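As a minimal sketch of that parsing step in Java, assuming the text really has the `Label: value` shape shown in the output above. The class name `LabelValueParser` and the rule for turning a label into a field name (lower-case, spaces replaced by underscores) are illustrative assumptions, not something the tools above define.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class LabelValueParser {

    // Turns lines such as "First name: John" into the entry first_name -> John.
    // Lines without a colon (for example the static header) are skipped.
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String label = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (label.isEmpty()) {
                continue;
            }
            // Hypothetical naming rule: lower-case, whitespace becomes underscores.
            String key = label.toLowerCase(Locale.ROOT).replaceAll("\\s+", "_");
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "Static form header\nFirst name: John\nLast name: Doe\n";
        System.out.println(parse(sample)); // prints {first_name=John, last_name=Doe}
    }
}
```

Note that this only recovers field names from the visible labels, so it will not necessarily match the real form field names stored in the PDF.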
Alternatively, if you don't want to rely on the behavior of particular versions of these tools (I'm not sure whether they always do this), you can look at less's source code to see how it does it.
Answer by Guy Gavriely
I had much success using pdfminer:
pdf2txt.py -o out.xml -t xml sample.pdf
and then parse it using XPath and join the strings. To use it from your own code, follow the code here.
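A rough Java sketch of the "XPath and join strings" step, using only the JDK. It assumes the XML contains `<textline>` elements whose `<text>` children each hold one character, which is how I understand pdfminer's XML output; check this against your own out.xml. The class name `TextLineJoiner` is made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextLineJoiner {

    // Returns one string per <textline>, built by joining its <text> children.
    static List<String> lines(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList textLines =
                (NodeList) xpath.evaluate("//textline", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < textLines.getLength(); i++) {
            // Assumed layout: each <text> child carries a single character.
            NodeList chars = (NodeList) xpath.evaluate(
                    "text", textLines.item(i), XPathConstants.NODESET);
            StringBuilder joined = new StringBuilder();
            for (int j = 0; j < chars.getLength(); j++) {
                joined.append(chars.item(j).getTextContent());
            }
            String line = joined.toString().trim();
            if (!line.isEmpty()) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample in the assumed layout, not real pdfminer output.
        String sample = "<pages><page><textbox><textline>"
                + "<text>J</text><text>o</text><text>h</text><text>n</text>"
                + "</textline></textbox></page></pages>";
        System.out.println(lines(sample));
    }
}
```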
Other than that, there is a new kid on the block called tabula, written in Ruby, which I haven't had the chance to use yet but which is supposed to be great.
I understand that you are unwilling to use a paid service, but it is still worth mentioning that Adobe has a conversion service that, at the time of writing, costs $2 a month. Check it out, just saying...
Answer by Jonathan
For a Java solution, you could use iText to read the fields and then something like jackson-dataformat-xml to write the results as XML. A somewhat basic example of this would be:
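import com.fasterxml.jackson.dataformat.xml.XmlMapper;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import java.util.HashMap;
import java.util.Map;

public class PdfFieldsToXml {
    public static void main(String[] args) throws Exception {
        // read fields (the package names above assume iText 5)
        final PdfReader reader = new PdfReader("/path/to/my.pdf");
        final AcroFields fields = reader.getAcroFields();
        final Map<String, Object> values = new HashMap<>();
        for (String fieldName : fields.getFields().keySet()) {
            values.put(fieldName, fields.getField(fieldName));
        }
        reader.close();
        // write the map as XML with Jackson
        final XmlMapper mapper = new XmlMapper();
        final String result = mapper.writeValueAsString(values);
        System.out.println(result);
    }
}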
There is definitely some room for improvement here, but it may be a good enough starting point.
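One possible improvement, sketched here with the JDK's own `XMLStreamWriter` instead of Jackson, is to write a `<fields>` root element so that the output has the same shape as the manual Acrobat export in the question. The class name `FieldsXmlWriter` is hypothetical, and the sketch assumes every field name is already a valid XML element name, which is not guaranteed for real PDF forms.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class FieldsXmlWriter {

    // Writes <fields><name>value</name>...</fields> for the given field values.
    static String toXml(Map<String, String> values) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument("UTF-8", "1.0");
        xml.writeStartElement("fields");
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // Assumption: the field name is usable as an element name as-is.
            xml.writeStartElement(entry.getKey());
            // writeCharacters escapes &, < and > in the value.
            xml.writeCharacters(entry.getValue() == null ? "" : entry.getValue());
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("first_name", "John");
        values.put("last_name", "Doe");
        System.out.println(toXml(values));
    }
}
```

A `LinkedHashMap` keeps the fields in insertion order; with the `HashMap` used above, the element order in the output is not defined.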