How to extract the text from MS Office documents in Linux?
I need a way to extract the text from all of the MS Office document types (Word, Excel, PowerPoint) in Linux. I envision that there might be several different approaches to accomplish this, such as a Bash or Python script, or converting them to PDF and then extracting the text using a tool such as pdftotext. This seems like a commonplace requirement. Is there an established procedure or tool to accomplish this easily?
9 Answers
catdoc can convert .doc, .xls and .ppt files to text. A second option would be wvWare.
For more utilities, check http://www.linux.com/archive/articles/52385 for Word-to-text converters.
Cool, thanks. catdoc is OK, but it can't convert .xls or .ppt to text; I use xls2csv and apache-tika for them. Check them out!
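For reference, the catdoc package also ships companion tools for the other binary formats, so a minimal sketch covering all three (file names are placeholders) could look like:

catdoc report.doc > report.txt      # Word binary format to plain text
xls2csv figures.xls > figures.csv   # Excel binary format to CSV
catppt slides.ppt > slides.txt      # PowerPoint binary format to plain text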
AbiWord can convert from the command line between any of the file formats it knows.
Convert from Word to plain text:

abiword --to=txt myfile.doc

Make a PDF from a Word file:

abiword --to=pdf myfile.doc

And so on. The results in these cases would be myfile.txt or myfile.pdf. If you want to specify the output name, you can do that too:

abiword --to=txt --to-name=output.txt myfile.doc
In fairness to other answers, it should be noted that AbiWord uses wvWare to handle Word documents, but even the wvWare homepage recommends using AbiWord instead for most conversions.
I hate word processors. This is the main reason I have AbiWord installed.
You might also be interested in unoconv, which is a similar tool supporting formats OpenOffice knows (which would include spreadsheets and the like), but I have no experience with it personally.
I finally found the perfect tool for scripting document parsing: apache-tika. It can parse a gazillion non-text formats into text, which is very cool!
(Mac Homebrew users: brew install tika)
The command-line interface works like this:
tika --text something.docx > something.txt
+1: Apache Tika is a serious open-source project, also works on Windows, works from the command line, has a GUI with drag and drop, opens anything (Word, Excel, PowerPoint, PDF, SVG), and extracts the document's metadata as well. After trying most of the tools above, Apache Tika is what I was looking for. This should be the accepted answer (I don't know if you can accept your own answer).
With LibreOffice you can do:
libreoffice --invisible --convert-to pdf file1.ppt file2.ppt
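If plain text rather than PDF is the goal, the resulting PDFs can then be run through pdftotext (mentioned in the question); a rough sketch, with placeholder file names:

libreoffice --invisible --convert-to pdf slides.ppt
pdftotext slides.pdf slides.txt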
You could use CUPS (a virtual printer) and print to it, e.g. using lp.
Interesting, can that convert any printable document to PDF? Can you point me to an example doing that for .doc or .xls?
wv is one option and IIRC OpenOffice can be told from the command line to export as pdf and exit.
If you want to use Apache Tika in a Python project, please check out this blog post.
Docsplit is the perfect tool to extract the text from a PDF. It's a Ruby gem, so you need Ruby and RubyGems installed on your Linux system before using the docsplit command.
If your system doesn't have Ruby and RubyGems, follow these instructions.
You must be root to install the software (assuming you want it to be available to all users).
Install Ruby on Linux: yum install ruby
Install RubyGems: download the latest RubyGems package, then:
tar xzf rubygems-x.x.xx.tgz
cd rubygems-x.x.xx
ruby setup.rb
Now that RubyGems is installed, you should have the gem command (gem is the command used to interact with the RubyGems package system). Test it out by running a gem command such as gem --version.
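From there, a minimal sketch of installing and using Docsplit (the gem name is docsplit; the default output location may vary) would be:

gem install docsplit
docsplit text document.pdf   # extracted text is written to document.txt, by default in the current directory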
Converting .docx files to plain text and preserving line breaks to maintain line number references to source document: howto & implications?
I'm exporting MS Word content to plain text for use with text/file utilities. I have a constraint: the line-numbering feature has been enabled in Word, and any reference to line numbers in the final output must match that numbering (illustrated in the source document with an excerpt from Poe). Obviously, that sort of numbering doesn't break lines at newlines; Word breaks «lines» where the text wraps at the right margin. A script like docx2txt doesn't seem to account for this by default and breaks lines at newlines, so if I number lines with grep -n, the lines won't match the source document's line numbers. It's not exactly clear from the documentation how I would need to edit the Perl script to convert the files the way I need in this case:
our $config_newLine = "\n";     # Alternative is "\r\n".
our $config_lineWidth = 80;     # Line width, used for short line justification.
In the end I exported from Word itself, with these options:
- Unicode (UTF-8)
- Insert line breaks, end lines with CR/LF
- Allow character substitution
And now indeed when I use the .txt files there is a perfect match between line numbers in the source numbering feature and the grep -n output.
- Is there any specific configuration or process I should know about for docx2txt or a similar command-line utility which would have allowed me to convert my .docx files to plain text while preserving line breaks, without resorting to Word like I did?
- What are the best practices, if any, for exporting MS Word documents (which may contain accented characters) to plain text for use with file/text utilities, with respect to line breaks and formatting? And are there any negative implications of the settings I chose for exporting, i.e. inserting CR/LF?
As suggested I provide a sample. In this rar archive, I bundled a .docx file with simple paragraphs, and its exported .txt file using Word with the aforementioned options. The latter can be compared with a default run of docx2txt on the source file.
Can you not save it as a txt file from Word? If it gives you bad formatting then I would suggest using vim or emacs to fix the problem (because I’m sure it is patterned).
@Steven Walton Thank you, yes it works when I export to txt from Word. But my point is that I don't want to have to use Word; I wish I could rely on the script alone. I want a process that works for batch conversion.
1 Answer
docx2txt works on the information in the .docx file, which is a zipped set of XML files.
With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit text into 80 columns (the width is configurable), without any regard for font or font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then exporting to .txt via Open/LibreOffice would also be unlikely to produce the same layout, although it tries to do a good job¹.
So docx2txt, or any other command-line utility, including command-line-driven Open/LibreOffice processing, is not guaranteed to convert the text to the same layout as exporting from Word does².
If you want to (or are forced by client requirements to) render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and having had incompatible results with other tools, including OpenOffice, I resorted to installing a Windows VM on the host Linux server. On the client VM a program watches for incoming files to be converted on the host; it starts and drives Word to do the conversion and then copies the result back⁴.
Decisions about using CR/LF or LF only, and UTF-8 or some other encoding, for the .txt files largely depend on how the resulting files are used. If the resulting files are used on Windows, I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8, and will not barf on the BOM and/or will use that information. You should test all your target applications for compatibility if they are known up front.
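For the Linux side, a small sketch of how you might inspect and, if needed, normalise an exported file (file name is a placeholder; GNU sed assumed):

file export.txt                                    # reports encoding, BOM and CRLF line terminators
sed -i 's/\r$//; 1s/^\xEF\xBB\xBF//' export.txt    # strip CRs and a leading BOM for LF-only tools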
¹ This sort of incompatibility is the primary reason some of my friends cannot switch to Linux from Windows, although they would like to. They have to use Microsoft Word, as Open/LibreOffice every once in a while mangles texts they exchange with clients.
² You can install all the fonts used in the Word files and might get lucky for some texts, some of the time.
³ Rendering PDFs from .doc/.docx
⁴ The program uses GUI automation, as if someone were clicking its menus, and doesn't attempt to drive Word via an API. I am pretty sure the latter can be done as well, and it would have the advantage of not breaking things if Word were upgraded.
How to extract just plain text from .doc & .docx files? [closed]
Does anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx? I've found this; I wondered if there were any other suggestions?
If we have Software Recommendations, why not migrate this there? I also searched for software for similar tasks and did not find the best answer there, but I can recommend pandoc as the best solution; it even converts tables correctly. So I suggest reopening the question.
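For what it's worth, a pandoc invocation along these lines (file names are placeholders) produces plain text from a .docx:

pandoc -f docx -t plain article.docx -o article.txt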
You obviously aren't on a Mac, but if you were, you could use textutil at the command line to quickly get plain text from various proprietary document types.
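On macOS that would be something like (file name is a placeholder):

textutil -convert txt report.docx   # writes report.txt alongside the source file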
6 Answers
If you want the pure plain text (my requirement), then all you need is
unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
It unzips the docx file, gets the actual document and then strips all the XML tags. Obviously all formatting is lost.
I like this command, but often newlines are still useful data to have in the final version. Therefore I used the following command instead: unzip -p document.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g' Note the additional sed expression, replacing the XML representation of paragraph ends with the actual newline character, and I edited the last sed expression so it does not strip newline characters. This makes the above command far more useful for diff-ing Word documents.
Thanks Rob! @Jeff: I agree, but the following command works better for me in practice: unzip -p document.docx word/document.xml | sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'
Very nice. Is it also possible to edit the XML data inside the Word document without corrupting it? And how?
@einpoklum The first bit of the command gets the raw XML, so that will work fine. The second bit gets all the non-XML-tag strings and separates them with a new line. So as long as sed does not barf on esoteric character sets, you should be fine. Please post a reply if you find that is not the case.
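On the question of editing the XML without corrupting the document: since a .docx is just a ZIP archive, one rough approach (paths are placeholders) is to unpack it, edit word/document.xml, and zip the tree back up:

mkdir doc_tmp && cd doc_tmp
unzip ../document.docx              # unpack the whole archive
"$EDITOR" word/document.xml         # edit the XML
zip -r ../document-edited.docx .    # repack; Word opens the result as long as the structure is left intact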
LibreOffice
One option is libreoffice/openoffice in headless mode (make sure all other instances of libreoffice are closed first):
libreoffice --headless --convert-to "txt:Text (encoded):UTF8" mydocument.doc
Since the openoffice command line syntax is a bit too complicated, there is a handy wrapper which can make the process easier: unoconv.
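As a quick sketch of unoconv in use (file names are placeholders), you pass an output format and one or more input files:

unoconv -f txt report.doc    # single file to plain text
unoconv -f txt *.docx        # batch conversion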
Apache POI
Another option is Apache POI, a well-supported Java library which, unlike antiword, can read, create and convert .doc, .docx, .xls, .xlsx, .ppt and .pptx files.
Here is the simplest possible Java code for converting a .doc or .docx document to plain text:
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.xmlbeans.XmlException;

public class WordToTextConverter {

    public static void main(String[] args) {
        try {
            convertWordToText(args[0], args[1]);
        } catch (ArrayIndexOutOfBoundsException aiobe) {
            System.out.println("Usage: java WordToTextConverter <source_doc> <target_txt_file>");
        }
    }

    public static void convertWordToText(String src, String desc) {
        try {
            FileInputStream fs = new FileInputStream(src);
            final POITextExtractor extractor = ExtractorFactory.createExtractor(fs);
            FileWriter fw = new FileWriter(desc);
            fw.write(extractor.getText());
            fw.flush();
            fs.close();
            fw.close();
        } catch (IOException | OpenXML4JException | XmlException e) {
            e.printStackTrace();
        }
    }
}

Maven dependencies (pom.xml):

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>my.wordconv</groupId>
  <artifactId>my.wordconv.converter</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi</artifactId>
      <version>3.17</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
      <version>3.17</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-scratchpad</artifactId>
      <version>3.17</version>
    </dependency>
  </dependencies>
</project>
NOTE: You will need to add the Apache POI libraries to the classpath. On Ubuntu/Debian the libraries can be installed with sudo apt-get install libapache-poi-java; this will install them under /usr/share/java. For other systems you'll need to download the library and unpack the archive to a folder that you should use instead of /usr/share/java. If you use Maven/Gradle (the recommended option), then include the org.apache.poi dependencies as shown in the snippet above.
The same code will work for both .doc and .docx as the required converter implementation will be chosen by inspecting the binary stream.
Compile the class above (assuming it's in the default package and the Apache POI jars are under /usr/share/java):

javac -cp "/usr/share/java/*:." WordToTextConverter.java

Then run the conversion:

java -cp "/usr/share/java/*:." WordToTextConverter doc.docx doc.txt
A clonable gradle project which pulls all necessary dependencies and generates the wrapper shell script (with gradle installDist ).
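As a rough sketch of what that looks like in use (the project name wordconv below is just a placeholder), installDist drops a launcher script under build/install:

gradle installDist
build/install/wordconv/bin/wordconv doc.docx doc.txt   # "wordconv" stands in for whatever the project is named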