Djvu to txt linux

How to extract text from DjVu and other ebook formats (possibly in Python)

I have a collection of ebooks in djvu, pdf, and chm formats, and I am looking for a way to search their content for keywords. I have been researching and found a couple of suggestions for parsing pdf content, but there seems to be no way to convert the content of a djvu file into text. Does anyone know a way to decode djvu content into text so that I can search it easily? Thanks.


Assuming the djvu files contain OCR-ed text, a fast way on Linux to get that out is to use Popen to run djvutxt and grab the output.
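A minimal sketch of that approach, assuming djvutxt is installed and on the PATH. It uses subprocess.run, which wraps Popen; the helper names djvutxt_command and extract_text are made up for this illustration.

```python
import shutil
import subprocess


def djvutxt_command(djvu_path, executable="djvutxt"):
    """Build the argument list; with no output file, djvutxt prints to stdout."""
    return [executable, str(djvu_path)]


def extract_text(djvu_path, executable="djvutxt"):
    """Return the hidden text layer of a DjVu file as a string."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    result = subprocess.run(
        djvutxt_command(djvu_path, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Assumption: the plain output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("book.djvu") (a hypothetical file name) would return the OCR text, or an empty string if the file has no text layer.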

The text in a .djvu file is compressed with a djvu-specific compression algorithm, bzz, for which no simple C interface exists that you could load as a shared object in Python. It is a C++ implementation based on some framework.

Shameless self promotion: I contributed to Calibre the conversion from OCR-ed .djvu, which uses djvutxt in this way. However, it falls back to my pure Python decoder implementation (sloooow) if djvutxt is not available. So you could use that code if you cannot use djvutxt.

I have not yet put out the Python source separately from Calibre. But after downloading and extracting Calibre’s source:

curl -L http://status.calibre-ebook.com/dist/src | tar xvJ
find . | fgrep djvu

The relevant files are djvu_input.py, djvu.py and djvubzzdec.py.
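Once the text has been extracted by either route, the keyword search the question asks about could be sketched as follows. The function name find_keyword and the choice of case-insensitive, line-based matching are assumptions for this example, not part of Calibre or djvutxt.

```python
def find_keyword(text, keyword):
    """Return (line_number, line) pairs for lines containing keyword, ignoring case."""
    needle = keyword.casefold()
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.casefold()
    ]
```

Matching line by line keeps each hit short enough to show as context; a search that has to span line breaks would need to work on the whole string instead.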


djvutxt (1) — Linux Manuals

To display the djvutxt manual on Linux, run: $ man 1 djvutxt

NAME

djvutxt — Extract the hidden text from DjVu documents.

SYNOPSIS

djvutxt [options] inputdjvufile [outputtxtfile]

DESCRIPTION

Program djvutxt decodes the hidden text layer of a DjVu document inputdjvufile and prints it into file outputtxtfile or on the standard output. The hidden text layer is usually generated with the help of optical character recognition software.


Without options -detail and -escape, this program simply outputs the UTF-8 text. Option -detail causes the output of S-expressions describing the text and its location. Option -escape uses C-style escape sequences to represent nonprintable or non-ASCII characters.
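As an illustration of the escaped output, here is a small sketch that turns it back into ordinary text. It assumes each escape is a backslash followed by three octal digits and stands for one byte of the UTF-8 encoding; that byte-level reading is an assumption, and decode_octal_escapes is a hypothetical helper name.

```python
import re

# A backslash followed by three octal digits, limited to byte values 0-255.
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def decode_octal_escapes(escaped):
    """Replace \\ooo sequences with the bytes they name, then decode as UTF-8."""
    raw = escaped.encode("ascii")  # escaped output is assumed to be pure ASCII
    data = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return data.decode("utf-8")
```

For example, the two escapes \303\251 are the UTF-8 bytes of "é", and \134 is the backslash itself.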

OPTIONS

--page=pagespec
Specify which pages should be processed. When this option is not specified, the text of all pages of the document is concatenated into the output file. The page specification pagespec contains one or more comma-separated page ranges. A page range is either a page number, or two page numbers separated by a dash. For instance, specification 1-10 outputs pages 1 to 10, and specification 1,3,99999-4 outputs pages 1 and 3, followed by all the document pages in reverse order up to page 4.

--detail=keyword
This option causes djvutxt to output S-expressions specifying the position of the text in the page. See the manual page djvused(1) for a description of the output format. Argument keyword specifies the maximum level of detail for which text location is reported. The recognized values are: page, column, region, para, line, word, and char. All other values are interpreted as char.

--escape
Output escape sequences of the form \ooo for all non-ASCII or nonprintable UTF-8 characters and for the backslash character.
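The page specification rules above can be illustrated with a short sketch that expands a pagespec into the list of pages it selects. It is not djvutxt's own parser: clamping out-of-range numbers to the last page is an assumption inferred from the 99999-4 example, and expand_pagespec is a hypothetical name.

```python
def expand_pagespec(spec, page_count):
    """Expand a spec such as '1,3,99999-4' into an ordered list of page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first_text, last_text = part.split("-", 1)
        else:
            first_text = last_text = part
        # Assumption: page numbers outside the document are clamped to its bounds.
        first = min(max(int(first_text), 1), page_count)
        last = min(max(int(last_text), 1), page_count)
        step = 1 if last >= first else -1  # a descending range is walked in reverse
        pages.extend(range(first, last + step, step))
    return pages
```

With a six-page document, expand_pagespec("1,3,99999-4", 6) gives [1, 3, 6, 5, 4], matching the reverse-order behaviour described for that specification.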

REMARKS

CREDITS

This program was initially written by Andrei Erofeev and was then improved by Bill Riemers and many others. It was then rewritten to use the ddjvuapi by Leon Bottou.

SEE ALSO

  • djvutoxml (1) — DjVuLibre XML Tools.
  • djvu (1) — DjVu and DjVuLibre.
  • djvudigital (1) — creates DjVu files from PS or PDF files.
  • djvudump (1) — Display internal structure of DjVu files.
  • djvuextract (1) — Extract chunks from DjVu image files.
  • djvumake (1) — Assemble DjVu image files.
  • djvups (1) — Convert DjVu documents to PostScript.
  • djvused (1) — Multi-purpose DjVu document editor.
  • djvuserve (1) — Generate indirect DjVu documents on the fly.


DJVU to TXT Converter

Convert DJVU to TXT online, from any device with a modern browser like Chrome and Firefox.

Convert your DJVU files online. You can convert your DJVU documents from any platform (Windows, Linux, macOS). No registration is needed. Just drag and drop your DJVU file onto the upload form, choose the desired output format, and click the Convert button. Once the conversion is completed, you can download your TXT file.

You can even perform more advanced conversions. For example, you can convert password-protected documents: just expand LoadOptions and enter the password of your file. Or you can add a watermark to the converted TXT file: expand ConvertOptions and fill in the fields for watermarking.

Converted TXT files are stored in the cloud. Privacy is 100% guaranteed. All documents are removed from the cloud after 24 hours.

You can convert your DJVU documents from anywhere, from any machine or even from a mobile device. The DJVU converter is always available online and is completely free.

  • Convert WORD to PDF, EXCEL to PDF, PDF to WORD, POWERPOINT to IMAGE, VSDX to PDF, HTML to DOCX, EPUB to PDF, RTF to DOCX, XPS to PDF, ODT to DOCX, ODP to PPTX and many more document formats
  • A simple way to instantly convert DJVU to TXT
  • Convert DJVU from anywhere — it works on all platforms including Windows, MacOS, Android and iOS

Free Document

Free Document Conversion, Viewer, Merger app for Windows

  • Easily convert, view or merge unlimited files on your own Windows PC.
  • Process Word, Excel, PowerPoint, PDF and more than 100 file formats.
  • No limit of file size.
  • Batch conversion of multiple files.
  • One app with rich features like Conversion, Viewer, Merger, Parser, Comparison, Signature
  • Regular FREE updates with new features coming every month

DJVU Graphics File format

DjVu, pronounced “déjà vu”, is a graphics file format intended for scanned documents and books, especially those that contain a combination of text, drawings, images and photographs. It was developed by AT&T Labs. It uses multiple techniques such as image layer separation of text and background images, progressive loading, arithmetic coding and lossy compression for bitonal images.

TXT Plain Text File Format

A file with the .TXT extension represents a text document that contains plain text in the form of lines. Paragraphs in a text document are recognized by carriage returns and are used for better arrangement of file contents. A standard text document can be opened in any text editor or word processing application on different operating systems. All the text contained in such a file is in human-readable format and is represented by a sequence of characters.

How to convert DJVU to TXT

  • Open our free DJVU to TXT converter website.
  • Click inside the file drop area to upload a DJVU file, or drag & drop a DJVU file.
  • Click the Convert button. Your DJVU files will be uploaded and converted to TXT.
  • A download link for the result files will be available immediately after conversion.
  • You can also send a link to the TXT file to your email address.
  • Note that the file will be deleted from our servers after 24 hours, and download links will stop working after this time period.

FAQ

First, you need to add a file for conversion: drag & drop your DJVU file or click inside the white area to choose a file. Then click the "Convert" button. When DJVU to TXT conversion is completed, you can download your TXT file.

Of course! The download link for TXT files will be available immediately after conversion. We delete uploaded files after 24 hours, and the download links will stop working after this time period. No one has access to your files. File conversion (including DJVU to TXT) is absolutely safe.

Yes, you can use our free DJVU to TXT converter on any operating system that has a web browser. Our DJVU to TXT converter works online and does not require any software installation.


Detailed examples are available at GitHub in the form of executable projects. If you are only interested in DJVU to TXT conversion then check .NET & Java examples.



DJVU to TXT Converter


How to convert DJVU files to TXT online?

Step-by-step guide to convert djvu to txt using AnyConv. It works on PC (Windows, Mac, Linux) and mobile devices (iPhone, Android).

  • Upload the DJVU file.
  • Convert DJVU to TXT.
  • Download your TXT file.

FAQ

First, you need to add a file for conversion: drag and drop your DJVU file or click the "Choose File" button. Then click the "Convert" button. When the DJVU to TXT conversion is completed, you can download your TXT file.

Of course! We delete uploaded files immediately, and converted ones after 1 hour. No one has access to your files. File conversion (including DJVU to TXT) is absolutely safe.

Yes, you can use AnyConv on any operating system that has a web browser. Our DJVU to TXT converter works online and does not require software installation.

Information about DJVU and TXT file formats

| File format | DJVU | TXT |
|---|---|---|
| Full name | DJVU Ebook Format | Raw text file |
| File extension | .djvu, .djv | .txt |
| MIME type | image/vnd.djvu, image/x-djvu | text/plain |
| Developed by | AT&T Labs - Research | Microsoft |
| Type of format | Document file formats | Plain text |
| Description | DjVu is a computer file format designed primarily to store scanned documents, especially those containing a combination of text, line drawings, indexed color images, and photographs. | The name text file refers to a file format that allows only plain text content with very little formatting (e.g., no bold or italic types). Such files can be viewed and edited on text terminals or in simple text editors. |
| Technical details | DjVu uses image layer separation of text and background/images technologies, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows for high-quality, readable images to be stored in a minimum of space to be made available on the web. | Text file refers to a type of container, while plain text refers to a kind of content. Text files can contain plain text, but they are not limited to such. At a generic level of description, there are two kinds of computer files: text files and binary files. |
| Associated programs | SumatraPDF, Evince | Notepad, TextEdit, WordPad, UltraEdit |
| Wiki | https://en.wikipedia.org/wiki/DjVu | https://en.wikipedia.org/wiki/Text_file |
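
For locating DjVu files in a collection before extracting their text, the two file extensions listed above are enough for a first pass. The sketch below is illustrative only; is_djvu_path and find_djvu_files are hypothetical names, and matching on the extension alone will miss misnamed files.

```python
from pathlib import Path

# Extensions taken from the format table: .djvu and .djv
DJVU_EXTENSIONS = {".djvu", ".djv"}


def is_djvu_path(path):
    """True if the file name ends in a DjVu extension, ignoring case."""
    return Path(path).suffix.lower() in DJVU_EXTENSIONS


def find_djvu_files(root):
    """Return a sorted list of DjVu files found anywhere under root."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_djvu_path(p)
    )
```

Calling find_djvu_files on the collection's root folder would return every matching file, ready to be handed to a text extractor.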
