Reading/Writing a MS Word file in PHP

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/188452/

Time: 2020-08-24 21:51:34  Source: igfitidea

php ms-word read-write

Asked by UnkwnTech

Is it possible to read and write Word (2003 and 2007) files in PHP without using a COM object? I know that I can:

$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file);

but Word will read it as an HTML file not a native .doc file.

Accepted answer by Stefan Gehrig

Reading binary Word documents would involve creating a parser according to the published file format specifications for the DOC format. I don't think this is a really feasible solution.

You could use the Microsoft Office XML formats for reading and writing Word files - this is compatible with the 2003 and 2007 versions of Word. For reading you have to ensure that the Word documents are saved in the correct format (it's called Word 2003 XML-Document in Word 2007). For writing you just have to follow the openly available XML schema. I've never used this format for writing out Office documents from PHP, but I'm using it for reading in an Excel worksheet (naturally saved as XML-Spreadsheet 2003) and displaying its data on a web page. As the files are plain XML data, it's no problem to navigate within them and figure out how to extract the data you need.
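
As a rough illustration of the reading side, here is a minimal sketch that collects paragraph text from a Word 2003 XML document. It assumes PHP 7+ with the DOM extension and that the document uses the usual Word 2003 `w:` namespace; `readWordMl2003()` is a hypothetical helper name, not part of any library.

```php
<?php
// Hypothetical helper: returns one string per paragraph found in a
// Word 2003 XML (WordprocessingML) document passed in as a string.
function readWordMl2003(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    // Namespace used by Word 2003 XML documents.
    $xpath->registerNamespace('w', 'http://schemas.microsoft.com/office/word/2003/wordml');

    $paragraphs = [];
    foreach ($xpath->query('//w:body//w:p') as $p) {
        $text = '';
        // The text of a paragraph is spread over one or more w:t elements.
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}
```

Something like `readWordMl2003(file_get_contents('report.xml'))` (the file name is made up) would then give an array of paragraph strings; formatting, tables and fields are ignored in this sketch.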

The other option - a Word 2007 only option (if the OpenXML file formats are not installed in your Word 2003) - would be to resort to OpenXML. As databyss pointed out here, the DOCX file format is just a ZIP archive with XML files included. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated, I think - it just depends on how much time you'll invest.
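
To give an idea of what reading a DOCX involves, here is a minimal sketch that opens the ZIP archive and pulls the text out of the main document part. It assumes PHP 7+ with the zip and DOM extensions enabled, and that the main part is stored as `word/document.xml` (the usual location, though strictly it is defined by the package relationships); `docxToText()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the plain text of a .docx file,
// one line per paragraph, or an empty string on failure.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    // Assumption: the main document part lives at word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return '';
    }

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $lines = [];
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        // Concatenate all text runs inside this paragraph.
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $lines[] = $text;
    }
    return implode("\n", $lines);
}
```

This only recovers text; styles, images, headers and footers live in other parts of the archive and are not touched here.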

Perhaps you can have a look at PHPExcel, which is a library able to write to and read from Excel 2007 files using the OpenXML standard. It could give you an idea of the work involved in reading and writing OpenXML Word documents.

Answer by Stefan Gehrig

This works with versions earlier than Office 2007 and it's pure PHP, no COM crap. I'm still trying to figure out 2007.

<?php



/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc) 
{
    // Open in binary mode so the bytes are read unchanged on every platform
    $fileHandle = fopen($userDoc, "rb");
    $line = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $lines = explode(chr(0x0D),$line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Keep only non-empty slices that contain no NUL byte
        if (strlen($thisline) > 0 && strpos($thisline, chr(0x00)) === false) {
            $outtext .= $thisline . " ";
        }
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
} 

$userDoc = "cv.doc";

$text = parseWord($userDoc);
echo $text;


?>

Answer by Mantichora

You can use Antiword; it is a free MS Word reader for Linux and most popular operating systems.

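$document_file = 'c:\file.doc';
// Escape the file name so spaces or shell metacharacters cannot break the command
$text_from_doc = shell_exec('/usr/local/bin/antiword ' . escapeshellarg($document_file));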

Answer by WIlson

Just updating the code

<?php

/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc) 
{
    // Open in binary mode so the bytes are read unchanged on every platform
    $fileHandle = fopen($userDoc, "rb");
    $word_text = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $line = "";
    $tam = filesize($userDoc);
    $nulos = 0;
    $caracteres = 0;
    for($i=1536; $i<$tam; $i++)
    {
        $line .= $word_text[$i];

        // Compare against the NUL byte itself; a loose == 0 matches other characters too
        if( $word_text[$i] === chr(0x00) )
        {
            $nulos++;
        }
        else
        {
            $nulos=0;
            $caracteres++;
        }

        if( $nulos>1996)
        {   
            break;  
        }
    }

    //echo $caracteres;

    $lines = explode(chr(0x0D),$line);
    //$outtext = "<pre>";

    $outtext = "";
    foreach($lines as $thisline)
    {
        $tam = strlen($thisline);
        if( !$tam )
        {
            continue;
        }

        $new_line = ""; 
        for($i=0; $i<$tam; $i++)
        {
            $onechar = $thisline[$i];
            if( $onechar > chr(240) )
            {
                continue;
            }

            if( $onechar >= chr(0x20) )
            {
                $caracteres++;
                $new_line .= $onechar;
            }

            if( $onechar == chr(0x14) )
            {
                $new_line .= "</a>";
            }

            if( $onechar == chr(0x07) )
            {
                $new_line .= "\t";
                if( isset($thisline[$i+1]) )
                {
                    if( $thisline[$i+1] == chr(0x07) )
                    {
                        $new_line .= "\n";
                    }
                }
            }
        }
        // replace field codes with a hyperlink
        $new_line = str_replace("HYPERLINK", "<a href=", $new_line);
        $new_line = str_replace("\o", ">", $new_line);
        $new_line .= "\n";

        // image links
        $new_line = str_replace("INCLUDEPICTURE", "<br><img src=", $new_line);
        $new_line = str_replace("\*", "><br>", $new_line);
        $new_line = str_replace("MERGEFORMATINET", "", $new_line);


        $outtext .= nl2br($new_line);
    }

    return $outtext;
} 

$userDoc = "Cultura.doc";
$text = parseWord($userDoc);

echo $text;


?>

Answer by Joe Lencioni

I don't know about reading native Word documents in PHP, but if you want to write a Word document in PHP, WordprocessingML (aka WordML) might be a good solution. All you have to do is create an XML document in the correct format. I believe Word 2003 and 2007 both support WordML.
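
For a sense of scale, here is a minimal sketch of writing such a document. It assumes the Word 2003 WordML namespace and emits only bare paragraphs with no styling; `buildWordMl()` is a hypothetical helper name, and a real document would normally carry more markup (styles, fonts, document properties).

```php
<?php
// Hypothetical helper: builds a minimal Word 2003 XML (WordML) document
// with one w:p paragraph per input string.
function buildWordMl(array $paragraphs): string
{
    // The closing "?" . ">" is split only to keep editors and highlighters happy.
    $xml  = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?' . '>' . "\n";
    // Processing instruction that tells Windows to open the file with Word.
    $xml .= '<?mso-application progid="Word.Document"?' . '>' . "\n";
    $xml .= '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">';
    $xml .= '<w:body>';
    foreach ($paragraphs as $paragraph) {
        // Escape &, < and > so the text cannot break the XML.
        $escaped = htmlspecialchars($paragraph, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= '<w:p><w:r><w:t>' . $escaped . '</w:t></w:r></w:p>';
    }
    $xml .= '</w:body></w:wordDocument>';
    return $xml;
}
```

A call such as `file_put_contents('hello.xml', buildWordMl(['Hello from PHP', 'Fish & chips']));` (the file name is made up) should produce a file that Word 2003 or 2007 can open, under the assumptions above.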

Answer by Sergey Kornilov

Most probably you won't be able to read Word documents without COM.

Writing was covered in this topic

Answer by databyss

2007 might be a bit complicated as well.

The .docx format is a zip file that contains a few folders with other files in them for formatting and other stuff.

Rename a .docx file to .zip and you'll see what I mean.

So if you can work within zip files in PHP, you should be on the right path.
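
The same look inside the archive can be done from PHP without renaming anything. This is a small sketch, assuming PHP 7+ with the zip extension (ZipArchive) enabled; `listDocxParts()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: lists the entry names stored inside a .docx
// (or any other ZIP archive); returns an empty array if it cannot be opened.
function listDocxParts(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return [];
    }
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    return $names;
}
```

Running it on a typical DOCX should show entries such as `word/document.xml`, which is usually where the body text lives.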

Answer by databyss

www.phplivedocx.org is a SOAP-based service, which means you always need to be online to test your files. It also lacks sufficient usage examples. Strangely, I found out only 2 days after downloading it (it additionally requires Zend Framework) that it is a SOAP-based program (cursed me!). I think that without COM it's just not possible on a Linux server, and the only option is to convert the DOC file into another format that PHP can parse.

Answer by Eduardo

One way to manipulate Word files with PHP that you may find interesting is with the help of PHPDocX. You can see how it works by having a look at its online tutorial. You can insert or extract contents or even merge multiple Word files into a single one.

Answer by Eduardo

phpLiveDocx is a Zend Framework component and can read and write DOC and DOCX files in PHP on Linux, Windows and Mac.

See the project web site at:

http://www.phplivedocx.org
