Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/7358637/

Date: 2020-08-26 02:33:19  Source: igfitidea

Reading a DOC file in PHP

php

Asked by no_freedom

I'm trying to read .doc and .docx files in PHP. Everything works, but on the last line I get garbage characters. Please help. Here is the code, which was written by someone else.

function parseWord($userDoc)
{
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));
    // Split the raw bytes on carriage returns (0x0D)
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        // Keep only non-empty chunks that contain no NUL byte
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Drop everything except a small set of ASCII characters
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}

$userDoc = "k.doc";

Here is a screenshot (image not included).

Accepted answer by Steve-o

DOC files are not plain text.

Try a library such as PHPWord (old CodePlex site).

NB: This answer has been updated multiple times as PHPWord has changed hosting and functionality.
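
To illustrate the "not plain text" point: a legacy .doc file is an OLE2 compound document and a .docx file is a ZIP archive, so a quick look at the first bytes tells you which parser you need. The sketch below is not from the original answers; `detectWordFormat` and its return values are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper (not part of the original thread): guess the
// container format of a Word file from its first bytes.
function detectWordFormat(string $path): string
{
    if (!is_readable($path)) {
        return 'unreadable';
    }
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        return 'unreadable';
    }
    $magic = (string) fread($fh, 8);
    fclose($fh);

    // Legacy .doc files are OLE2 compound documents with this signature.
    if ($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'doc';
    }
    // .docx files are ZIP archives, which start with "PK\x03\x04".
    if (strncmp($magic, "PK\x03\x04", 4) === 0) {
        return 'docx-or-other-zip';
    }
    return 'unknown';
}
```

Any ZIP archive starts with the same four bytes, so the second result only says "this might be a .docx", which is why the return value is named that way.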

Answered by user1817444

You can read .docx files in PHP but you can't read .doc files. Here is the code to read .docx files:

function read_file_docx($filename){

    $striped_content = '';
    $content = '';

    if(!$filename || !file_exists($filename)) return false;

    $zip = zip_open($filename);

    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    //echo $content;
    //echo "<hr>";
    //file_put_contents('1.xml', $content);

    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);

    return $striped_content;
}
$filename = "filepath";// or /var/www/html/file.docx

$content = read_file_docx($filename);
if($content !== false) {

    echo nl2br($content);
}
else {
    echo 'Couldn\'t read the file. Please check that file.';
}
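
As far as I know, the procedural `zip_open()` / `zip_read()` functions used above are deprecated as of PHP 8.0. Here is a minimal sketch of the same idea with the `ZipArchive` class instead. `readDocxText` is a hypothetical name, and the paragraph handling is simplified compared to the function above (it does not treat table cells specially).

```php
<?php
// Sketch only: same approach as read_file_docx(), using ZipArchive.
// readDocxText is a hypothetical name, not from the original answer.
function readDocxText(string $filename)
{
    if (!is_file($filename)) {
        return false;
    }
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false;
    }
    // The main document body lives in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }
    // Add a line break after each paragraph before stripping the markup.
    $xml = str_replace('</w:p>', "</w:p>\r\n", $xml);
    return strip_tags($xml);
}
```

It returns `false` on any failure, so it can be used the same way as `read_file_docx()` above.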

Answered by Davinder Singh

I am using this function and it works well for me :) Try it:

function read_doc_file($filename) {
    if (file_exists($filename)) {
        if (($fh = fopen($filename, 'r')) !== false) {
            $headers = fread($fh, 0xA00);

            // 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n1 = ( ord($headers[0x21C]) - 1 );

            // 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );

            // 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );

            // 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );

            // Total length of text in the document
            $textLength = ($n1 + $n2 + $n3 + $n4);

            $extracted_plaintext = fread($fh, $textLength);
            fclose($fh);

            // simple print character stream without new lines
            //echo $extracted_plaintext;

            // if you want to see your paragraphs in a new line, do this
            return nl2br($extracted_plaintext);
            // need more spacing after each paragraph use another nl2br
        }
    }
    // File is missing or could not be opened
    return false;
}

Answered by hugsbrugs

Decoding in pure PHP never worked for me, so here is my solution: http://wvware.sourceforge.net/

Install the package:

sudo apt-get install wv

Use it in PHP:

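// Write the text next to the source file, swapping the extension
$output = preg_replace('/\.doc$/i', '.txt', $filename);
// Escape both paths so spaces or shell metacharacters cannot break the command
shell_exec('/usr/bin/wvText ' . escapeshellarg($filename) . ' ' . escapeshellarg($output));
$text = file_get_contents($output);
// Convert to UTF-8 if needed
if (!mb_detect_encoding($text, 'UTF-8', true)) {
    // Same conversion as utf8_encode(), which is deprecated since PHP 8.2
    $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}
unlink($output);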

Answered by Kratos.vn

I'm using soffice to convert the .doc file to .txt and then reading the converted text file:

soffice --convert-to txt test.doc

You can see more here.
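
For calling this from PHP, something like the following might work. It is only a sketch: it assumes `soffice` is on the PATH, that the `--headless` and `--outdir` options are available in your LibreOffice version, and that the output file gets the input's base name with a .txt extension. `buildSofficeCommand` and `convertDocToText` are hypothetical names.

```php
<?php
// Hypothetical helper: build the soffice command line for converting one
// document to plain text. Assumes `soffice` is on the PATH.
function buildSofficeCommand(string $docPath, string $outDir): string
{
    // escapeshellarg() protects against spaces and shell metacharacters.
    return 'soffice --headless --convert-to txt --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

// Hypothetical wrapper: run the conversion and return the text, or false.
function convertDocToText(string $docPath, string $outDir)
{
    shell_exec(buildSofficeCommand($docPath, $outDir));
    // Assumption: the output is named after the input, with .txt appended.
    $txtPath = $outDir . '/' . pathinfo($docPath, PATHINFO_FILENAME) . '.txt';
    return is_file($txtPath) ? file_get_contents($txtPath) : false;
}
```

Usage would be along the lines of `$text = convertDocToText('test.doc', sys_get_temp_dir());`, checking for `false` before using the result.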

Answered by xchiltonx

I also used it, but for accents (and single quotes like ') it would put a ? instead, so my PDO MySQL didn't like it. I finally figured it out by adding:

mb_convert_encoding($extracted_plaintext,'UTF-8');

So the final version should read:

function getRawWordText($filename) {
    if(file_exists($filename)) {
        if(($fh = fopen($filename, 'r')) !== false ) {
            $headers = fread($fh, 0xA00);
            $n1 = ( ord($headers[0x21C]) - 1 );// 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );// 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );// 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );// 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $textLength = ($n1 + $n2 + $n3 + $n4);// Total length of text in the document
            $extracted_plaintext = fread($fh, $textLength);
            fclose($fh);
            $extracted_plaintext = mb_convert_encoding($extracted_plaintext,'UTF-8');
             // if you want to see your paragraphs in a new line, do this
             // return nl2br($extracted_plaintext);
             return ($extracted_plaintext);
        } else {
            return false;
        }
    } else {
        return false;
    }  
}

This works fine for reading Word .doc files into a utf8_general_ci MySQL database :)

Hope this helps someone else.
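
One caveat worth noting: when the third argument is omitted, `mb_convert_encoding()` assumes the input is already in PHP's internal encoding, so it may help to name the source encoding explicitly. The sketch below assumes the extracted bytes are Windows-1252, which is only a guess about a given file (Word can also store text as UTF-16). `toUtf8` is a hypothetical helper name.

```php
<?php
// Sketch: convert extracted bytes to UTF-8, naming the source encoding
// explicitly. Windows-1252 is an assumption about the file, not a fact.
function toUtf8(string $bytes, string $sourceEncoding = 'Windows-1252'): string
{
    // Already valid UTF-8? Then leave the text alone.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding);
}
```

With this, an accented character or a curly single quote stored as a single Windows-1252 byte comes out as the matching UTF-8 sequence rather than a question mark.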
