Converting HTML to Odt, Doc, Docx

Converting to odt is pretty easy after installing pandoc.

After that comes the relatively hard part: from odt (or even html) you can script (Open|Libre)Office via e.g. unoconv, or let AbiWord do it:

abiword --to=doc filename.odt

Also see this thread, and this blog post.
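
As a rough sketch of that two-step route (HTML to ODT with pandoc, then ODT to DOC with AbiWord), this is how it could be driven from Python. The function names are made up for illustration, the pandoc flags are the usual -s / -f / -o, and the assumption that AbiWord writes the .doc next to its input has not been verified here:

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def pandoc_command(src: str, dst: str) -> List[str]:
    """Build a pandoc call; pandoc picks the output format from dst's extension."""
    return ["pandoc", "-s", "-f", "html", src, "-o", dst]


def abiword_command(src: str, target_format: str = "doc") -> List[str]:
    """Build the AbiWord call quoted above (abiword --to=doc filename.odt)."""
    return ["abiword", "--to=" + target_format, src]


def html_to_doc(html_path: str) -> Path:
    """Run HTML -> ODT -> DOC; raises if a tool is missing or a step fails."""
    for tool in ("pandoc", "abiword"):
        if shutil.which(tool) is None:
            raise RuntimeError(tool + " is not installed or not on PATH")
    odt_path = str(Path(html_path).with_suffix(".odt"))
    subprocess.run(pandoc_command(html_path, odt_path), check=True)
    subprocess.run(abiword_command(odt_path), check=True)
    # Assumption: AbiWord writes the converted file next to its input.
    return Path(odt_path).with_suffix(".doc")
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.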

Differences between command-line and GUI converting (*.html, *.odt, *.doc)

If you want to preserve the HTML, I’d still maintain LiveDocx might just prove useful. I did some more digging, and stumbled on phpdocx. At the bottom of the page there’s a link that shows you how to embed HTML.
Both LiveDocx and phpdocx offer examples on their respective sites. I suggest you browse through those.

SO showed up a few times, too. Interesting questions might be:

  • OpenTbs convert html tags to MS Word tags
  • How can I convert a docx document to html using php?

I know the latter is the opposite of what you’re trying to do, but don’t write it off just for that reason. Often, it’s quite helpful to look at things from another perspective.

Since your last comment leads me to believe you haven’t actually gotten round to coding (“I just need a script” etc.), I would like to say that SO is not a code generator. When you’re done reading about phpdocx and LiveDocx, perhaps you should read what makes a good question.

I found what you were looking for, I think, here. If you want to use the php-cli, my guess would be setting your script’s output stream to a file, and using the headers found below (copy-paste from link).

header("Content-type: application/vnd.ms-word");
header("Content-Disposition: attachment; Filename=SaveAsWordDoc.doc");

Sorry if I came across a bit harsh, with the remark on SO not being a code generator, and the link to ‘what makes a good question’. Didn’t mean to bash you.

Sorry, the previous example would only work as a download link. Below is a working script that produces a .doc file from an HTML string:

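#!/bin/php -n
<?php
// Stream context carrying the Word content type (the duplicated header line is dropped).
$opts = array('file' => array('header' => 'Content-type: application/vnd.ms-word' . "\r\n"));
$resource = stream_context_create($opts);
$doc = fopen('asDoc.doc', 'w+', false, $resource);
if (!$doc) {
    die('Could not open asDoc.doc for writing');
}
// The markup was stripped from the source; the tags and the meta tag below are an assumed reconstruction.
$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>';
$html .= '<body>

<h1>Hello, world</h1>

</body></html>';
fwrite($doc, $html);
fclose($doc);
exit();
?>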

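The headers are defined in the stream context, in the first two lines of code. As far as I can tell, though, PHP’s file:// wrapper does not document a header option, so those headers probably have no effect on the file that is written; the meta tag is likely what matters when Word opens the result. All else is pretty basic.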
All functions you need are here, so refer to their man pages for more info if you want to know what does what.

Converting HTML + JavaScript to Word/OpenOffice document programmatically

Thanks to all for the answers! In the end I made it work with PhantomJS (custom script converts JS generated parts to images) and Pandoc (which converts the resulting HTML to DOCX).

How to convert html to doc/docx with OpenOffice or LibreOffice?

I am trying to convert an HTML file to doc or docx. I used some of the available PHP classes, but the results are not good. I wonder if I can convert an HTML file to doc using some office software like LibreOffice or OpenOffice? LibreOffice (AbiWord) “converts” HTML to a doc file but without formatting: I see the HTML source code when I open the doc file.

abiword --plugin AbiCommand convert /home/qqq/test/test.htm /home/qqq/test/test.doc doc 
soffice --headless --convert-to doc:DOC /home/qqq/test/test.htm 

I have to press Ctrl+C to get back to the console. Any ideas how to convert HTML to doc? I can open the HTML in the desktop version of these two apps and save it as doc, but it does not seem to work in the console version.

IIRC, libre/openoffice won’t work unless there’s only one instance of it open. That might be your problem with the second command.

pandoc does this quite well, but you specifically limited your question to OO / LO so I’m not putting this as an answer.

@Jan: installed and tested. It does not work. It renames the html to docx but without actually converting it. 🙁 I am open to any solution.

Just re-checked, works as it is supposed to: pandoc -s -r html page.html -o output.docx . It’s a very complex program and you have to get used to it, though.

If you open the output.docx in a text editor you will see that it is just the HTML source code. And the source code is also broken by pandoc. The HTML in the docx is not the same as my source HTML, and the result looks terrible.

3 Answers

This is a very late response, but I hope it helps a few people.

LibreOffice makes use of filters

In the code example below (converting HTML to PDF), you’ll notice optional secondary arguments, such as --infilter=writerglobal8_HTML and :writer_web_pdf_Export:

soffice --headless --infilter=writerglobal8_HTML --convert-to pdf:writer_web_pdf_Export --outdir /SomeFolder/Output /SomeFolder/Example.html 

These are known as ‘filters’. They correspond to the sub-formats of an individual file type. For example, a .doc could be a Microsoft Office 95 document or a 2003 document. The filter basically gives LibreOffice a clue as to what format the file is in.

There are two types of filter: an in filter (basically ‘what sub-format is the incoming file’) and an out filter (what format the outgoing, i.e. exported, file is in).

In your case, valid DOC filters include (where spaces are included, you’d likely need to use quotations, but I often find quotes don’t work, so your mileage may vary):

MS WinWord 5
MS WinWord 6.0
MS Word 2003 XML
MS Word 2007 XML
MS Word 2007 XML Template
MS Word 2007 XML VBA
MS Word 95
MS Word 95 Vorlage
MS Word 97
MS Word 97 Vorlage
MS_Works
MS_Write

You can work out some of the file filters from the (somewhat disorganised) .ODS sheet LibreOffice supplies found here.
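
For what it’s worth, the quoting trouble with filter names that contain spaces can be sidestepped by not going through a shell at all. A minimal sketch in Python, assuming soffice is on the PATH; the helper names are hypothetical, and whether a given LibreOffice build accepts a particular filter name (e.g. MS Word 97) is taken from the list above, not verified:

```python
import subprocess
from typing import List, Optional


def soffice_convert_command(
    src: str,
    out_dir: str,
    target_ext: str,
    out_filter: Optional[str] = None,
    in_filter: Optional[str] = None,
) -> List[str]:
    """Build a headless soffice conversion command as an argument list."""
    cmd = ["soffice", "--headless"]
    if in_filter:
        cmd.append("--infilter=" + in_filter)
    # The out filter is appended to the extension after a colon, e.g. "doc:MS Word 97".
    target = target_ext + ":" + out_filter if out_filter else target_ext
    cmd += ["--convert-to", target, "--outdir", out_dir, src]
    return cmd


def convert(src: str, out_dir: str, target_ext: str, **filters: str) -> None:
    """Run the conversion; each list element reaches soffice as one argument."""
    subprocess.run(soffice_convert_command(src, out_dir, target_ext, **filters), check=True)
```

Calling convert("/SomeFolder/Example.html", "/SomeFolder/Output", "doc", out_filter="MS Word 97", in_filter="writerglobal8_HTML") would pass "doc:MS Word 97" as a single argument, spaces included.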

CSS 2.1 compliant HTML to Microsoft Word conversion?

I’ve found a superb HTML to PDF converter in Prince XML. Now I’m looking for something of similar quality to produce Word documents from HTML + CSS. This is on PHP/Linux.

6 Answers

Three options depending on what you need to do:

  1. For simple cases, you can just write out the HTML to a .doc file. Sample here. That’s limited, though, and prompts the user to save as HTML if they make updates.
  2. If you can require Word 2007, you can generate Office Open XML, which is basically a zip file which holds XML documents. I haven’t found a library that can do that, but you can get started by renaming a sample .docx file to a .zip file and looking at what’s included, then generate that from PHP. Some info on that in this SO question.
  3. If you need to support Word 2003, you need to work with Word 2003’s XML format. It’s different than the 2007 format, but is at least forward compatible (so it’ll work on 2007 as well). The simplest way is to save as a Word 2003 XML document from Word, then open the document in a text editor, then get to work writing XSLT that will convert your HTML to the correct XML. I’ve done it, and it worked, but it was a lot of work. There’s info on the format here.
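
To make option 2 a bit more concrete, here is a minimal sketch (in Python rather than PHP) of writing the three parts that, to my understanding, a bare-bones .docx package needs: the content-types file, the package relationships, and word/document.xml. The function names are made up, it only handles plain paragraphs, and mapping HTML and CSS onto WordprocessingML is the actual hard part it does not attempt:

```python
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

DOCUMENT_TEMPLATE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>{body}</w:body>
</w:document>"""


def write_minimal_docx(path, paragraphs):
    """Write a .docx containing one plain-text paragraph per list item."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">' + escape(text) + "</w:t></w:r></w:p>"
        for text in paragraphs
    )
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as docx:
        docx.writestr("[Content_Types].xml", CONTENT_TYPES)
        docx.writestr("_rels/.rels", RELS)
        docx.writestr("word/document.xml", DOCUMENT_TEMPLATE.format(body=body))


def list_parts(path):
    """Return the part names inside a .docx, same as renaming it to .zip and looking."""
    with zipfile.ZipFile(path) as docx:
        return docx.namelist()
```

Running list_parts on a .docx saved from Word shows considerably more parts (styles, settings, theme), which is a good way to see what a full generator would have to produce.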

None of those are all that easy, so it might be worth buying a software product that does the conversion for you.

More info on this question is available in this SO question, as well.

OpenOffice can be used in server/headless mode to produce documents in lots of formats.

It might be easier to look for a PDF -> DOC converter instead, if you already have one-half of the problem solved. That said, I don’t know of any good PDF -> DOC converters either 🙁

You might want to try sending the HTML file as DOC (similar to how it’s described here, but with the proper Content-Type header :-)) and let the end-user’s system do the conversion (AFAIK you can do something similar with Excel too).

Here is one alternative for pdf => doc. Haven’t tried it, good luck!

I had to do this a few years ago and ended up rolling my own custom solution. I created a Word document in the format I wanted, saved as HTML, and then added code where required to retrieve text from the database and format the way MS Word likes it. I forced a header to make the client think it was receiving a Word Document instead of an HTML file. Microsoft Word happily opened the file as if it were a regular Word Document.
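
The “force a header” trick can be sketched in a few lines. This is a hypothetical WSGI app in Python, not the code I actually used; the file name report.doc and the HTML body are placeholders, and application/msword is used as the content type where other answers in this thread use application/vnd.ms-word:

```python
from wsgiref.simple_server import make_server

# Placeholder report; in practice this would be the HTML saved from Word
# with the database values filled in.
REPORT_HTML = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><h1>Hello, world</h1></body></html>"
)


def word_download_app(environ, start_response):
    """Serve plain HTML, but labelled so the browser hands it to Word."""
    body = REPORT_HTML.encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "application/msword"),
            ("Content-Disposition", 'attachment; filename="report.doc"'),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Run a local test server; stop it with Ctrl+C."""
    with make_server(host, port, word_download_app) as server:
        server.serve_forever()
```

The payload is still HTML, so Word converts it on opening; nothing is converted on the server.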

If it were feasible to output a DOCX file instead, you could do an XSL transform.

Site design / logo © 2023 Stack Exchange Inc; user contributions licensed under CC BY-SA . rev 2023.7.14.43533


Notes on exporting HTML document with images to Word on Linux

We have a web application (on an Ubuntu web server) with a WYSIWYG HTML editor that lets users create a report template with custom tags. These custom tags are then resolved to technical graph images and embedded into the report. The application allows users to

  1. preview the report with graphs in the web browser
  2. schedule generation of the report, which is then exported to a Word document and delivered as an email attachment.

To generate the HTML document, the obvious choice is an IMG tag with embedded base64 image content, especially for preview purposes: the whole HTML document can be sent to the client’s web browser without worrying about how to resolve the IMG src paths under the document root. However, this approach raises another issue. Currently, none of the conversion tools on Linux, such as AbiWord, LibreOffice, OpenOffice, or wkhtmltopdf, can fully export an HTML document with embedded images: the images are missing when the result is opened in MS Word (even opening the HTML document directly in MS Word shows no images). This type of HTML document is only supported by web browsers.

The only alternative is to construct the HTML document with IMG tags that use relative paths. This works much better with AbiWord: the exported Word document shows the images in MS Word. MS Word itself can also display this type of HTML document with images.

As a result, we create two versions of each HTML document with different types of IMG tags: embedded IMG for preview, and IMG src links for export to other document formats and for archiving.
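
Producing the second version from the first can be automated. The following is a minimal sketch in Python, not our production code: the function name and the img1, img2, ... naming scheme are made up, it relies on a simple regular expression rather than a real HTML parser, and it assumes the rewritten HTML is saved in the same directory as the extracted images so that the relative paths resolve.

```python
import base64
import re
from pathlib import Path

# Matches src="data:image/<subtype>;base64,<payload>" with either quote style.
DATA_URI = re.compile(
    r"""src=(["'])data:image/(?P<subtype>[A-Za-z0-9.+-]+);base64,(?P<payload>[A-Za-z0-9+/=\s]+)\1"""
)


def externalize_images(html, out_dir, prefix="img"):
    """Write each embedded base64 image to out_dir and point src at the file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match):
        nonlocal counter
        counter += 1
        ext = match.group("subtype").split("+")[0]  # e.g. "svg+xml" -> "svg"
        name = "{}{}.{}".format(prefix, counter, ext)
        (out / name).write_bytes(base64.b64decode(match.group("payload")))
        quote = match.group(1)
        return "src={q}{n}{q}".format(q=quote, n=name)

    return DATA_URI.sub(replace, html)
```

The returned string is the export version; the original string with embedded images remains the preview version.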
