Docx to text linux

Convert doc to txt via commandline

We’re looking for a program that can convert a .doc or .docx document to a .txt file. We’re working on Linux and want to start a website that converts user-uploaded doc files. We don’t want to use OpenOffice/LibreOffice because we have had bad experiences with it. Pandoc can’t handle .doc files :/ Does anyone have an idea?

4 Answers

You will have to use two different command-line tools, depending on whether you are working with the .doc or the .docx format: catdoc for .doc and docx2txt for .docx.

The latter, when run on foo.docx, will produce a file called foo.txt in the same directory as the original.

I’m not sure which Linux distribution you are using, but both catdoc and docx2txt are available from the Ubuntu repositories, for example.
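For the website use case in the question, the choice between the two tools can be made from the file extension. The following is a minimal Python sketch, not a definitive implementation. It assumes catdoc and docx2txt are installed and on the PATH (on some systems the command may be called docx2txt.sh instead), that catdoc writes the extracted text to standard output, and that docx2txt accepts - as the output file name, as its manual page describes. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(path):
    """Return the argv list for the converter matching the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".doc":
        # catdoc prints the extracted text to standard output.
        return ["catdoc", str(path)]
    if suffix == ".docx":
        # "-" as the outfile name asks docx2txt to write to standard output.
        return ["docx2txt", str(path), "-"]
    raise ValueError(f"unsupported file type: {suffix!r}")


def convert_to_text(path):
    """Run the converter and return its raw output bytes.

    Requires the external tools to be installed; raises CalledProcessError
    if the tool exits with a non-zero status.
    """
    result = subprocess.run(build_command(path), capture_output=True, check=True)
    return result.stdout
```

For user uploads it is probably safer to store each file under a server-generated name and pass an absolute path, so that a file name beginning with a dash cannot be mistaken for an option.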

Thanks for the info. Unfortunately for me, brew install docx2txt didn’t work, the ‘catdoc’ command is not available, and I need to use ‘docx2txt.sh’ instead of ‘docx2txt’.

It turns out catdoc got relegated to the boneyard, but one can build it from source; details here: apple.stackexchange.com/a/294259/36790

Here is a Perl project which claims to do it. I have also done a lot of this by hand, using XSLT on document.xml. The .docx file itself is just a zip file; you can unzip it and inspect the elements. I will say that this is not hard to do for specific files, but it is very hard to do in the general case, because of the lack of documentation for how Word stores things internally and the variance in internal representation.
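The "it is just a zip file" point can be illustrated with a short Python sketch that uses only the standard library. It reads word/document.xml from the archive and collects the text of each paragraph. The function name is hypothetical, and this is a deliberately naive approach: it ignores headers, footers, footnotes, numbering and anything nested in text boxes, which is where the general case gets hard.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Extract paragraph text from a .docx given a path or a file-like object."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        parts = []
        for node in para.iter():
            if node.tag == f"{{{W_NS}}}t":
                parts.append(node.text or "")   # a run of literal text
            elif node.tag == f"{{{W_NS}}}tab":
                parts.append("\t")              # explicit tab
            elif node.tag == f"{{{W_NS}}}br":
                parts.append("\n")              # hard break inside a paragraph
        paragraphs.append("".join(parts))
    # One output line per paragraph; soft (rendered) line breaks are not stored.
    return "\n".join(paragraphs)
```

Calling docx_to_text("foo.docx") returns the body text with one line per paragraph.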


Docx to text linux

NAME

docx2txt - convert Microsoft OOXML files to plain text.

SYNOPSIS

docx2txt [ infile.docx|-|-h ] [ outfile.txt|- ]
docx2txt < infile.docx
docx2txt < infile.docx > outfile.txt

DESCRIPTION

This manual page briefly documents the docx2txt commands.

docx2txt is a tool that attempts to generate equivalent (ASCII) text files from Microsoft .docx documents, preserving some formatting and document information (which MS text conversion drops) along with appropriate character conversions for a good (ASCII) text experience. It is a platform-independent solution consisting of (core) Perl and (wrapper) Unix/Windows shell scripts and a configuration file to control the output text appearance to a fair extent. It can very conveniently be used to build a Web-based docx document conversion service. With unzippers like CakeCmd that can deal with corrupt Zip archives, this tool can extract text from corrupt docx documents in many cases where the MS word processor fails to even open them.

OPTIONS

-h    As the first argument, to get this usage information.
-     As the infile name, to read the docx file from STDIN.
-     As the outfile name, to dump the text on STDOUT.

Output is saved in infile.txt if the second argument is omitted.

AUTHOR

docx2txt was written by Sandeep Kumar <shimple0@yahoo.com>. This manual page was written by Khalid El Fathi <khalid@elfathi.fr>, for the Debian project (and may be used by others).

February 25, 2012 DOCX2TXT(1)
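The STDIN/STDOUT options above suggest one way a web service could convert an upload without writing temporary files. The following Python sketch assumes a docx2txt executable on the PATH that behaves as this manual page describes; the function name, the timeout value and the choice to decode the output as UTF-8 are assumptions made for illustration.

```python
import subprocess

# Per the OPTIONS section: "-" as infile reads from STDIN,
# "-" as outfile writes the text to STDOUT.
DOCX2TXT_CMD = ("docx2txt", "-", "-")


def docx_bytes_to_text(data, command=DOCX2TXT_CMD, timeout=30):
    """Pipe .docx bytes through the converter and return the decoded text."""
    result = subprocess.run(
        list(command),
        input=data,            # uploaded file content, sent on STDIN
        capture_output=True,   # collect STDOUT and STDERR
        timeout=timeout,       # do not let a bad upload hang the request
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

The command parameter exists so that a different executable name, such as docx2txt.sh, can be substituted where needed.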

© 2019 Canonical Ltd. Ubuntu and Canonical are registered trademarks of Canonical Ltd.

Converting .docx files to plain text and preserving line breaks to maintain line number references to source document: howto & implications?

[Image: screenshot of the source document with Word's line numbering feature enabled]

I’m exporting MS Word content to plain text for use with text and file utilities. I have a constraint: the line numbering feature has been enabled in Word, and any reference to line numbers in the final output must match that numbering. So enter «numbering lines»: (Poe, E.A.)

Obviously, for Word that sort of numbering doesn’t break lines at newlines; it breaks «lines» at the right margin (or something). A script like docx2txt doesn’t seem to account for this by default and breaks lines at newlines. So if I use grep -n, the line numbers won’t match those of the source line numbering feature, as illustrated above.

It’s not exactly clear from the documentation how I would need to edit the Perl script to convert the files the way I need to in this case:

our $config_newLine = "\n";     # Alternative is "\r\n".
our $config_lineWidth = 80;     # Line width, used for short line justification.

In the end I exported the files to plain text from Word itself, with these options:

  • Unicode (UTF-8)
  • Insert line breaks + end lines with (CR/LF)
  • Allow character substitution

And now indeed when I use the .txt files there is a perfect match between line numbers in the source numbering feature and the grep -n output.

  • Is there any specific configuration/process I should know about docx2txt or a similar command line utility which would have allowed me to convert my .docx files to plain text while preserving line breaks, without resorting to Word like I did?
  • What are the best practices, if any, for exporting MS Word documents (which may contain accented characters) to plain text for use with file/text utilities, with respect to line breaks and formatting; and are there any negative implications with the settings I chose for exporting i.e. inserting CR/LF?

As suggested I provide a sample. In this rar archive, I bundled a .docx file with simple paragraphs, and its exported .txt file using Word with the aforementioned options. The latter can be compared with a default run of docx2txt on the source file.

Can you not save it as a txt file from Word? If it gives you bad formatting then I would suggest using vim or emacs to fix the problem (because I’m sure it is patterned).

@Steven Walton Thank you, yes, it works when I export to txt from Word. But my point is that I don’t want to have to use Word. I wish I could rely only on the script to do that. I want a process I can run in batch.

1 Answer

docx2txt works on the information in the docx file which is a zipped set of XML files.

With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit text in 80 columns (the width is configurable), without any regard for font and font size. If your .docx refers to fonts from a Windows system that are not available on Unix/Linux, then doing the export to .txt via Open/LibreOffice would also be unlikely to result in the same layout, although it tries to do a good job¹.
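To make the difference concrete, here is a small Python sketch of wrapping by column count alone. It is not docx2txt's actual algorithm, only an illustration using the standard textwrap module; the function name and the sample text are chosen for the example. The number of output lines depends solely on the chosen width, whereas Word's line numbers depend on how the text renders with its font and margins.

```python
import textwrap


def wrap_paragraphs(paragraphs, width=80):
    """Break each paragraph into lines of at most `width` characters."""
    lines = []
    for paragraph in paragraphs:
        # An empty paragraph still occupies one line in the output.
        lines.extend(textwrap.wrap(paragraph, width=width) or [""])
    return lines


paragraph = (
    "Once upon a midnight dreary, while I pondered, weak and weary, "
    "over many a quaint and curious volume of forgotten lore"
)

# The same paragraph yields a different number of lines at each width,
# so line numbers from grep -n change with the wrapping width.
for width in (80, 40):
    print(width, len(wrap_paragraphs([paragraph], width=width)))
```

No width setting can be expected to reproduce Word's numbering in general, because with a proportional font the number of characters that fit on a rendered line varies from line to line.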

So neither docx2txt nor any other command-line utility, including command-line driven Open/LibreOffice processing, is guaranteed to convert the text to the same layout as exporting from Word does².

If you want (or are forced by client requirements) to render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and getting incompatible results from other tools, including OpenOffice, I resorted to installing a Windows VM on the host Linux server. On the client VM, a program watches for incoming files to be converted on the host, then starts and drives Word to do the conversion and copies back the result⁴.

Decisions about using CR/LF or LF only, and UTF-8 or some other encoding, for the .txt largely depend on how the resulting files are used. If the resulting files are used on Windows, I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8, but will not barf on the BOM, and may even use that information. You should test all your target applications for compatibility if those are known up front.
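If the text ends up being post-processed with a script, the Windows-oriented choice described above (CR/LF, UTF-8 and a BOM) could be produced as follows. This is a minimal Python sketch with a hypothetical function name; it assumes the input text uses plain "\n" line endings.

```python
def write_windows_text(path, text):
    """Write text as UTF-8 with a BOM and CR/LF line endings."""
    # "utf-8-sig" prepends the UTF-8 byte order mark (EF BB BF) when writing.
    # newline="\r\n" translates every "\n" in the text to CR/LF on output,
    # so the input should not already contain "\r\n".
    with open(path, "w", encoding="utf-8-sig", newline="\r\n") as handle:
        handle.write(text)
```

Reading such a file back with encoding="utf-8-sig" strips the BOM again, which is one way to check that a downstream tool will cope with accented characters.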

¹ This sort of incompatibility is the primary reason some of my friends cannot change to Linux from Windows, although they would like to. They have to use Microsoft Word, as Open/LibreOffice every once in a while mangles texts they exchange with clients.
² You can install all the fonts used in the Word files and might get lucky for some texts, some of the time.
³ Rendering PDFs from .doc/.docx
⁴ The program uses GUI automation (as if someone is clicking its menus) and doesn’t attempt to drive Word via an API. I am pretty sure the latter can be done as well, and it would have the advantage of not breaking things if Word gets upgraded.
