Linux: Extract text from doc and docx
Original URL: http://stackoverflow.com/questions/5540886/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Extract text from doc and docx
Asked by Alexandre Mota
I would like to know how I can read the contents of a doc or docx file. I'm using a Linux VPS and PHP, but if there is a simpler solution using another language, please let me know, as long as it works on a Linux web server.
Answer by Lalaka
Answer by no_freedom
This is a .DOCX solution only. For .DOC or .PDF you'll need to use something else, such as pdf2text.php for PDF.
function docx2text($filename) {
    return readZippedXML($filename, "word/document.xml");
}

function readZippedXML($archiveFile, $dataFile) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    // Open received archive file
    if (true === $zip->open($archiveFile)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}

echo docx2text("test.docx"); // Save this contents to file
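Because strip_tags() simply deletes every tag, text from adjacent paragraphs, tabs and line breaks ends up joined with nothing in between. Below is a minimal sketch of one way around that, assuming the usual WordprocessingML layout (w:p paragraphs containing w:r runs with w:t, w:tab and w:br children). The function names documentXmlToText and docxToText are hypothetical, and text boxes, footnotes and headers are not handled.

```php
<?php
// WordprocessingML main namespace used by word/document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: turn the XML of word/document.xml into plain text,
// one line per paragraph, keeping tabs and manual line breaks.
function documentXmlToText(string $xml): string
{
    if ($xml === '') {
        return '';
    }
    $doc = new DOMDocument();
    // LIBXML_NONET blocks network access; entities are not expanded.
    if (!$doc->loadXML($xml, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', W_NS);

    $paragraphs = [];
    foreach ($xpath->query('//w:body//w:p') as $p) {
        $text = '';
        // Only look at children of runs, so tab-stop definitions
        // in paragraph properties are not mistaken for tab characters.
        foreach ($xpath->query('.//w:r/w:t | .//w:r/w:tab | .//w:r/w:br', $p) as $node) {
            if ($node->localName === 't') {
                $text .= $node->textContent;
            } elseif ($node->localName === 'tab') {
                $text .= "\t";
            } else {
                $text .= "\n";
            }
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}

// Hypothetical helper: read word/document.xml out of a .docx and convert it.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : documentXmlToText($xml);
}
```

Usage would look like echo docxToText("test.docx"); the same "return an empty string on failure" convention as above is kept.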
Answer by chiptuned
My solution is Antiword for .doc and docx2txt for .docx
Assuming a Linux server that you control, download each one, extract it, then install it. I installed each one system-wide:
Antiword: make global_install
docx2txt: make install
Then use these tools to extract the text into a string in PHP:
//for .doc
$text = shell_exec('/usr/local/bin/antiword -w 0 ' .
escapeshellarg($docFilePath));
//for .docx
$text = shell_exec('/usr/local/bin/docx2txt.pl ' .
escapeshellarg($docxFilePath) . ' -');
docx2txt requires Perl.
no_freedom's solution does extract text from docx files, but it can butcher whitespace. Most files I tested had places where words that should be separated had no space between them. That is not good when you want to run full-text search over the documents you're processing.
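The two shell_exec() calls above can be wrapped in one helper that picks the tool by file extension and reports failure instead of silently returning nothing. This is only a sketch: the binary paths are the ones from this answer, and buildExtractCommand and extractText are hypothetical names, not part of Antiword or docx2txt.

```php
<?php
// Binary locations as used in the answer above; adjust to your install.
const ANTIWORD_BIN = '/usr/local/bin/antiword';
const DOCX2TXT_BIN = '/usr/local/bin/docx2txt.pl';

// Hypothetical helper: build the shell command for a file,
// or return null when the extension is not supported.
function buildExtractCommand(string $path): ?string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    switch ($ext) {
        case 'doc':
            // Same flags as in the answer above.
            return ANTIWORD_BIN . ' -w 0 ' . escapeshellarg($path);
        case 'docx':
            // The trailing "-" makes docx2txt write to stdout.
            return DOCX2TXT_BIN . ' ' . escapeshellarg($path) . ' -';
        default:
            return null;
    }
}

// Hypothetical helper: run the command and return the text, or null on failure.
function extractText(string $path): ?string
{
    $command = buildExtractCommand($path);
    if ($command === null || !is_readable($path)) {
        return null;
    }
    $output = shell_exec($command);
    // shell_exec() does not return a string when the command fails
    // or produces no output.
    return is_string($output) ? $output : null;
}
```

A caller can then write $text = extractText($uploadedPath); and treat null as "could not extract", which keeps the extension check and the escaping in one place.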
Answer by M Khalid Junaid
Here I have added a solution to get the text from .doc and .docx Word files
How to extract text from word file .doc,docx php
For .doc
private function read_doc() {
    $fileHandle = fopen($this->filename, "r");
    $line = @fread($fileHandle, filesize($this->filename));
    fclose($fileHandle);
    // Split the raw bytes on carriage returns
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE) || (strlen($thisline) == 0)) {
            // Skip chunks that are empty or contain a NUL byte
        } else {
            $outtext .= $thisline . " ";
        }
    }
    // Keep only a whitelist of ASCII characters
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}
For .docx
private function read_docx() {
    // Note: the procedural zip_* functions are deprecated as of PHP 8.0
    $striped_content = '';
    $content = '';
    $zip = zip_open($this->filename);
    if (!$zip || is_numeric($zip)) return false;
    while ($zip_entry = zip_read($zip)) {
        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;
        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        zip_entry_close($zip_entry);
    } // end while
    zip_close($zip);
    // Turn table cell boundaries into spaces and paragraph ends into line breaks
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);
    return $striped_content;
}
Answer by Mohini
I used docx2txt to extract docx file content. My code is as follows:
if ($extention == "docx")
{
    $docxFilePath = "/var/www/vhosts/abc.com/httpdocs/writers/filename.docx";
    // Keep the command on one line: a newline inside the string would end the shell command early
    $content = shell_exec('/var/www/vhosts/abc.com/httpdocs/docx2txt/docx2txt.pl ' . escapeshellarg($docxFilePath) . ' -');
}
Answer by kadutskyi
I made a few small improvements to the doc-to-txt converter function
private function read_doc() {
    $line_array = array();
    $fileHandle = fopen( $this->filename, "r" );
    $line = @fread( $fileHandle, filesize( $this->filename ) );
    $lines = explode( chr( 0x0D ), $line );
    $outtext = "";
    foreach ( $lines as $thisline ) {
        $pos = strpos( $thisline, chr( 0x00 ) );
        if ( $pos !== false ) {
        } else {
            $line_array[] = preg_replace( "/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline );
        }
    }
    return implode("\n",$line_array);
}
Now it preserves empty rows, and the txt file is laid out row by row.
Answer by Luke Madhanga
Parse .docx, .odt, .doc and .rtf documents
I wrote a library that parses docx, odt and rtf documents, based on answers here and elsewhere.
The major improvement I have made to the .docx and .odt parsing is that the library processes the XML that describes the document and attempts to conform it to HTML tags, i.e. em and strong tags. This means that if you're using the library for a CMS, text formatting is not lost.
You can get it here
Answer by Ilya P
You can use Apache Tika as a complete solution; it provides a REST API.
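For the Apache Tika route, a rough sketch of calling the REST API from PHP might look like the following. It assumes a Tika server is already running and reachable at http://localhost:9998 (its default port), that allow_url_fopen is enabled, and that the /tika endpoint returns plain text when asked with an Accept: text/plain header. The function names tikaRequestOptions and tikaExtractText are hypothetical.

```php
<?php
// Assumption: a Tika server is listening here; change to match your setup.
const TIKA_URL = 'http://localhost:9998/tika';

// Hypothetical helper: stream-context options for a PUT of the raw document bytes.
function tikaRequestOptions(string $bytes): array
{
    return [
        'http' => [
            'method'  => 'PUT',
            'header'  => [
                'Accept: text/plain',
                // Generic type, leaving format detection to the server.
                'Content-Type: application/octet-stream',
            ],
            'content' => $bytes,
            'timeout' => 30,
        ],
    ];
}

// Hypothetical helper: send a file to Tika and return the extracted text,
// or null when the file cannot be read or the request fails.
function tikaExtractText(string $path, string $url = TIKA_URL): ?string
{
    $bytes = @file_get_contents($path);
    if ($bytes === false) {
        return null;
    }
    $context = stream_context_create(tikaRequestOptions($bytes));
    $text = @file_get_contents($url, false, $context);
    return $text === false ? null : $text;
}
```

Since the server detects the file type itself, the same call should work for both .doc and .docx.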
Another good library is RawText, as it can run OCR on images and extract text from any doc. It's non-free, and it works over a REST API.
Sample code for extracting your file with RawText:
$result = $rawText->extract($your_file);