Linux: extracting text from doc and docx

Question

Asked by Alexandre Mota

I would like to know how I can read the contents of a doc or docx file. I'm using a Linux VPS and PHP, but if there is a simpler solution in another language, please let me know, as long as it works on a Linux web server.

Answer 1

Answered by Lalaka

Try Apache POI. It works well from Java. I suppose you won't have any difficulty installing Java on Linux.

Answer 2

Answered by no_freedom

This is a .DOCX-only solution. For .DOC or .PDF you'll need to use something else, such as pdf2text.php for PDF.

function docx2text($filename) {
    return readZippedXML($filename, "word/document.xml");
}

function readZippedXML($archiveFile, $dataFile) {
    // Create new ZIP archive
    $zip = new ZipArchive;

    // Open received archive file
    if (true === $zip->open($archiveFile)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }

    // In case of failure return empty string
    return "";
}

echo docx2text("test.docx"); // Prints the extracted text; save it to a file if needed

Answer 3

Answered by chiptuned

My solution is Antiword for .doc and docx2txt for .docx.

Assuming a Linux server that you control, download each one, extract it, then install it. I installed each one system-wide:

Antiword: make global_install
docx2txt: make install

Then use these tools to extract the text into a string in PHP:

//for .doc
$text = shell_exec('/usr/local/bin/antiword -w 0 ' . 
    escapeshellarg($docFilePath));

//for .docx
$text = shell_exec('/usr/local/bin/docx2txt.pl ' . 
    escapeshellarg($docxFilePath) . ' -');

docx2txt requires Perl.
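If you handle both formats in one place, the two calls above could be wrapped in a small dispatcher. This is only a sketch: buildExtractCommand and extractText are hypothetical names, and the binary paths are simply the ones from the snippet above, so adjust them to wherever your install put the tools.

```php
<?php
// Hypothetical helper: picks the extraction command from the file extension.
// The binary paths are assumptions taken from the snippet above.
function buildExtractCommand(string $path): ?string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    switch ($ext) {
        case 'doc':
            // -w 0 makes antiword put each paragraph on a single line
            return '/usr/local/bin/antiword -w 0 ' . escapeshellarg($path);
        case 'docx':
            // the trailing "-" sends the docx2txt output to stdout
            return '/usr/local/bin/docx2txt.pl ' . escapeshellarg($path) . ' -';
        default:
            return null; // unsupported extension
    }
}

// Hypothetical wrapper: runs the command and returns the text, or '' on failure.
function extractText(string $path): string
{
    $command = buildExtractCommand($path);
    if ($command === null) {
        return '';
    }
    $output = shell_exec($command);
    return is_string($output) ? $output : '';
}
```

Keeping the command building separate from shell_exec makes it easy to log or inspect the exact command before running it.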

no_freedom's solution does extract text from docx files, but it can butcher whitespace. Most files I tested had places where words that should be separated had no space between them. That is not good when you want to run a full-text search over the documents you're processing.
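If you would rather stay in pure PHP, one way to reduce the run-together words is to turn paragraph ends, line breaks and tabs into whitespace before stripping the tags. The following is a rough sketch, not a full WordprocessingML parser; documentXmlToText is a hypothetical name and it expects the raw contents of word/document.xml as a string.

```php
<?php
// Hypothetical helper: turns the raw contents of word/document.xml into plain text.
function documentXmlToText(string $xml): string
{
    // Mark paragraph ends, explicit line breaks and tabs before the tags are stripped.
    $xml = preg_replace('~</w:p>~', "\n", $xml);
    $xml = preg_replace('~<w:br\b[^>]*/>~', "\n", $xml);
    // Only the attribute-less run-level tab; tab stop definitions carry attributes.
    $xml = preg_replace('~<w:tab/>~', "\t", $xml);

    // Remove the remaining markup, then decode XML entities such as &amp;.
    $text = strip_tags($xml);
    return html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

To use it, read the entry with ZipArchive (for example `$zip->getFromName('word/document.xml')`) and pass the resulting string in.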

Answer 4

Answered by M Khalid Junaid

Here I have added a solution for getting the text from .doc and .docx Word files.

How to extract text from Word files (.doc, .docx) in PHP

For .doc

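private function read_doc() {
    // Read the whole file in binary mode
    $fileHandle = fopen($this->filename, "rb");
    if ($fileHandle === false) {
        return "";
    }
    $line = @fread($fileHandle, filesize($this->filename));
    fclose($fileHandle);

    // Split the raw bytes on carriage returns
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks containing NUL bytes (binary data)
        if (strlen($thisline) == 0 || strpos($thisline, chr(0x00)) !== false) {
            continue;
        }
        $outtext .= $thisline . " ";
    }
    // Keep only basic ASCII text characters
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}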

For .docx

private function read_docx() {
    $content = '';

    // Note: the procedural zip_* functions are deprecated as of PHP 8.0
    $zip = zip_open($this->filename);
    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {
        // Only the main document part is needed
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;
        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        zip_entry_close($zip_entry);
    }

    zip_close($zip);

    // Separate table cells with a space and paragraphs with a line break
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);

    return strip_tags($content);
}

Answer 5

Answered by Mohini

I used docx2txt to extract the content of docx files. My code is as follows:

if ($extention == "docx")
{
    $docxFilePath = "/var/www/vhosts/abc.com/httpdocs/writers/filename.docx";
    // Keep the command on one line: a newline inside the string would split it into two shell commands
    $content = shell_exec('/var/www/vhosts/abc.com/httpdocs/docx2txt/docx2txt.pl '
        . escapeshellarg($docxFilePath) . ' -');
}

Answer 6

Answered by kadutskyi

I made a few small improvements to the doc-to-txt converter function:

private function read_doc() {
    $line_array = array();
    $fileHandle = fopen( $this->filename, "rb" );
    $line       = @fread( $fileHandle, filesize( $this->filename ) );
    fclose( $fileHandle );
    $lines      = explode( chr( 0x0D ), $line );
    foreach ( $lines as $thisline ) {
        // Skip chunks containing NUL bytes (binary data); empty lines are kept
        if ( strpos( $thisline, chr( 0x00 ) ) !== false ) {
            continue;
        }
        $line_array[] = preg_replace( "/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline );
    }

    return implode( "\n", $line_array );
}

It now preserves empty lines, so the resulting text keeps the document's line-by-line layout.

Answer 7

Answered by Luke Madhanga

Parse .docx, .odt, .doc and .rtf documents

I wrote a library that parses the docx, odt and rtf documents based on answers here and elsewhere.

The major improvement I have made to the .docx and .odt parsing is that the library processes the XML that describes the document and attempts to map it to HTML tags, i.e. em and strong tags. This means that if you're using the library for a CMS, text formatting is not lost.

You can get it here

Answer 8

Answered by Ilya P

You can use Apache Tika as a complete solution; it provides a REST API.
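As a rough sketch of what calling Tika from PHP could look like, assuming a Tika server is already running locally on its default port 9998: tikaExtractText is a hypothetical name, and the server URL and timeout are assumptions you would adapt to your setup.

```php
<?php
// Sketch: send a document to a running Apache Tika server and return the
// plain text it extracts. The default server URL is an assumption.
function tikaExtractText(string $path, string $serverUrl = 'http://localhost:9998'): string
{
    if (!is_readable($path)) {
        return '';
    }

    $context = stream_context_create([
        'http' => [
            'method'  => 'PUT',
            // Ask for plain text; let Tika detect the actual document type
            'header'  => "Accept: text/plain\r\nContent-Type: application/octet-stream\r\n",
            'content' => file_get_contents($path),
            'timeout' => 30,
        ],
    ]);

    // PUT /tika returns the extracted text of the uploaded document
    $text = @file_get_contents(rtrim($serverUrl, '/') . '/tika', false, $context);
    return $text === false ? '' : $text;
}
```

This uses PHP's built-in HTTP stream wrapper, so it needs allow_url_fopen enabled but no extra extension.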

Another good library is RawText, as it can run OCR over images and extract text from any document. It's non-free, and it works over a REST API.

Sample code for extracting your file with RawText:

$result = $rawText->extract($your_file);