How could I get the plain text content of a word document? (It's for indexing all of the words in it, so no need for any formatting.)
Is it safe to assume you've already tried a StreamReader?
I can't figure the format out. I looked at it; a bunch of FFs, a bunch of 00s, the text, a bunch of 00s, a bunch of meta-data, etc.
Try this:
You would need to add a reference to Microsoft.Office.Interop.Word in your project.
vb.net Code:
' At the top of the file:
Imports Microsoft.Office.Interop

' Start Word and open the document. Documents.Open returns the Document,
' so the variable should not be created with New.
Dim wdApp As New Word.Application
Dim wdDoc As Word.Document = wdApp.Documents.Open("C:\Temp\Test.doc")
' This is the simplest way to get all text. Alternatively you may use paragraphs etc. too.
Dim myText As String = wdDoc.Range.Text
wdDoc.Close()
wdApp.Quit(SaveChanges:=False)
I was hoping not to have to use the Word control... I have to index over 50 thousand files. Thanks, though. I'll use it. :)
Does anyone have another way?
You would need to use either Word or some third-party control capable of reading Word files.
When you need to open that many files, don't start the Word application for each file. Start it only once and reuse it. That should be quite fast, probably 15 minutes or so.
1. Open Word.Application only once.
2. Open a Word document, read its text, and close it.
3. Repeat for all files.
4. Close the Word Application.
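The four steps above could look roughly like the sketch below. It is untested against your files and makes some assumptions: the module name BatchExtract, the function name ExtractAll, and the rootFolder parameter are made up for illustration, and it assumes the project already references Microsoft.Office.Interop.Word. Files that Word refuses to open are simply skipped.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Runtime.InteropServices
Imports Word = Microsoft.Office.Interop.Word

Module BatchExtract
    ' Hypothetical helper: maps each .doc path under rootFolder to its plain text.
    Function ExtractAll(ByVal rootFolder As String) As Dictionary(Of String, String)
        Dim results As New Dictionary(Of String, String)

        ' Step 1: start Word only once.
        Dim wdApp As New Word.Application()
        wdApp.Visible = False
        Try
            ' Step 3: repeat for all files.
            For Each path As String In Directory.GetFiles(rootFolder, "*.doc", SearchOption.AllDirectories)
                ' Documents.Open takes ByRef Object parameters, so pass an Object copy.
                Dim fileName As Object = path
                Dim wdDoc As Word.Document = Nothing
                Try
                    ' Step 2: open the document read-only, read its text, close it.
                    wdDoc = wdApp.Documents.Open(FileName:=fileName, ReadOnly:=True, AddToRecentFiles:=False)
                    results(path) = wdDoc.Content.Text
                Catch ex As COMException
                    ' Assumption: a file Word cannot open is skipped rather than aborting the run.
                Finally
                    If wdDoc IsNot Nothing Then wdDoc.Close(SaveChanges:=False)
                End Try
            Next
        Finally
            ' Step 4: close the Word application, even if something failed.
            wdApp.Quit(SaveChanges:=False)
        End Try

        Return results
    End Function
End Module
```

Calling ExtractAll("C:\Temp") would then give you one string per document to feed into the indexer. With 50 thousand files you may prefer to index each text inside the loop instead of holding them all in the dictionary.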
I know. It's just that it usually takes 5 minutes, so it's kind of a big drop.