Spiga

Extracting text from Word Documents in PHP with COM objects

by Gabi Solomon

I don't know if you know, but I have a pet project called infostud.ro, a website where students grade their university teachers. It also has a student essay section where members can download essays and papers written by other members as inspiration for their own projects. Almost all of those papers are in Word format. To help them out, I wanted each file to have a description, so they can preview some of its contents before downloading and see whether it is going to help them or not. Until recently that description was entered manually by the users, but I decided to give them a hand and make PHP read the files and use the first X characters of each file as the description.

So I started googling for ways to read Word files, since simply opening the file would return a bunch of strange characters. From my past browsing I remembered reading something about COM objects and how you can read Word documents with them.

As I discovered, COM (Component Object Model) is an interface standard for software components introduced by Microsoft. Sounds pretty fancy, but what it means is that through COM objects another program can communicate with Microsoft applications (Word, Excel, IE) and perform various commands as if a user were operating them. The catch is that this only works on a Windows machine that has the application installed.
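To make that a bit more concrete, here is a minimal sketch of talking to Word from PHP. It assumes a Windows machine with Word installed and the COM extension enabled (php_com_dotnet on recent PHP versions). The function name wordVersion() is something I made up for the illustration; it is not part of PHP or of Word.

[php]
// Hypothetical helper: asks Word for its version string through COM.
// Returns null when the COM extension is not loaded (e.g. not on Windows).
function wordVersion(): ?string
{
    if (!class_exists('COM')) {
        return null;
    }

    // Starts Word in the background and hands back an automation object.
    // Throws an exception if Word is not installed.
    $word = new COM('word.application');
    try {
        // Properties of the COM object read like normal PHP properties.
        return (string) $word->Version;
    } finally {
        // Ask Word to quit again, whatever happened above.
        $word->Quit();
        $word = null;
    }
}

$version = wordVersion();
echo $version === null ? "COM is not available here\n" : "Word version: $version\n";
[/php]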

Now that I knew what I was looking for, I started looking for documentation on how to use a COM object in PHP to open a Word document. I thought I would have to read tons of documentation, but it turns out that using a COM object is pretty straightforward.

For my needs I actually found two ways of reading the text from a Word document.

The first way was to convert the Word document to a TXT file.

[php]
// Word needs absolute paths, so resolve the file name first.
$filename = realpath("file.doc");
$TXTfilename = $filename . ".txt";

$word = new COM("word.application") or die("Unable to instantiate Word object");
$word->Documents->Open($filename);

// the '2' parameter specifies saving in txt format (wdFormatText)
$word->Documents[1]->SaveAs($TXTfilename, 2);
$word->Documents[1]->Close(false);
$word->Quit();
// Drop the reference so PHP can release the COM object.
$word = NULL;
unset($word);

$content = file_get_contents($TXTfilename);
unlink($TXTfilename);
[/php]

Although this method worked pretty well, as I was about to close the pages Google had found I saw a different approach to the problem that I liked more than the one above (and it showed me new COM commands; I couldn't find a list of them anywhere).

[php]

$word = new COM("word.application") or die ("Could not initialise MS Word object.");
$word->Documents->Open(realpath("Sample.doc"));

// Extract content.
$content = (string) $word->ActiveDocument->Content;

echo $content;

$word->ActiveDocument->Close(false);

$word->Quit();
$word = null;
unset($word);

[/php]

As the author of that approach points out, this uses a small trick: if you inspect $word->ActiveDocument->Content you will find that it is an empty object (a variant). If you assign the value to a variable you'll get an empty string, as the variant object has no real __toString(). The workaround in PHP is to explicitly type cast the value to a string and let PHP/COM take care of finding the real value.

As a last pointer, if you look at the script you will see that the last 3 commands are meant to destroy the COM object and release the memory taken by it, since it takes about 10-15MB upon initialization because it opens a full instance of WINWORD.exe.
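If you want to be sure the cleanup also runs when something goes wrong halfway, one option is to wrap the second method in a function with try/finally. This is only a sketch under the same assumptions as before (Windows, Word installed, COM extension loaded), and extractWordText() is a hypothetical name of my own.

[php]
// Hypothetical wrapper around the second method.
function extractWordText(string $path): string
{
    // Word wants an absolute path; realpath() returns false for missing files.
    $fullPath = realpath($path);
    if ($fullPath === false) {
        throw new InvalidArgumentException("File not found: $path");
    }
    if (!class_exists('COM')) {
        throw new RuntimeException('The COM extension is not available');
    }

    $word = new COM('word.application');
    try {
        $word->Documents->Open($fullPath);
        // The explicit cast turns the variant into the document text.
        $content = (string) $word->ActiveDocument->Content;
        $word->ActiveDocument->Close(false);
        return $content;
    } finally {
        // Runs even if Open() throws, so Word is asked to quit either way.
        $word->Quit();
        $word = null;
    }
}
[/php]

Calling it would then look like echo extractWordText("Sample.doc"); and a missing file is reported before Word is even started.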

So, in just ten lines of code you can get the text out of an MS Word document, easy as ever!
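Since my goal was to use the first X characters as the description, the last step could look something like the sketch below. makeDescription() and its default of 200 characters are my own illustration, and it assumes the mbstring extension and UTF-8 text. By default PHP's COM class hands back strings in the system code page, so for UTF-8 you would probably want to pass CP_UTF8 as the third argument when creating the COM object.

[php]
// Hypothetical helper: turns raw document text into a short preview.
// Assumes the mbstring extension and UTF-8 input.
function makeDescription(string $text, int $maxChars = 200): string
{
    // Word marks paragraphs with "\r"; collapse every whitespace run to one space.
    // If the input is not valid UTF-8, preg_replace() returns null and we keep the text as is.
    $clean = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    if (mb_strlen($clean, 'UTF-8') <= $maxChars) {
        return $clean;
    }
    // Cut on characters rather than bytes so multibyte letters stay intact.
    return rtrim(mb_substr($clean, 0, $maxChars, 'UTF-8')) . '...';
}

echo makeDescription("First paragraph.\rSecond   paragraph.", 20);
// prints "First paragraph. Sec..."
[/php]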

Hope this helped you,

Cheers

  • kasunshashi

    Great, Thanks a lot. I tried so many other examples without any success. Both of your codes works great for text. Only problem is when the doc has some tables. We cannot get that styling. Anyway this is enough for my requirement. Thanks a lot

  • karim

    hello how to compare string variable with the text getting from word
