OCR PDF for Linux

Contents

Best OCR Apps for Linux
Tesseract OCR
gImageReader
Paperwork
OCRFeeder
gscan2pdf
Conclusion
About the author
Nitesh Kumar
OCR PDF for Linux
How to OCR to searchable PDF in Linux
1. Tesseract & PDFsandwich
2. PDF X-Change Viewer
Conclusion

Best OCR Apps for Linux

This article covers a list of useful optical character recognition (OCR) software available for Linux. OCR software attempts to detect the text content of non-text files, whose content can be viewed or read but cannot be selected or copied. For instance, OCR software can identify text in images, PDFs and other scanned documents in digital file formats using various algorithms and AI-based solutions.

OCR software is especially useful for converting and preserving old documents, as it can identify text and create digital copies. The identified text may not be 100% accurate, but OCR software greatly reduces the need for manual edits by extracting as much text as possible. Manual edits can be made later to improve accuracy further and create one-to-one replicas. Most OCR software can extract text into separate files, though some also support superimposing a hidden text layer on the original files. Superimposed text lets you read the content in its original print and format while still allowing you to select and copy text. This technique is especially common for digitizing old documents into PDF format.

Tesseract OCR

Tesseract OCR is a free and open source OCR engine available for Linux. Sponsored by Google and maintained by many volunteers, it is probably the most comprehensive OCR suite available, and it can even beat some paid, proprietary solutions. It provides command line tools as well as an API that you can integrate into your own programs. It can detect text in many languages with good accuracy. It comes with a set of pre-trained data that can be used to identify and extract text. You can also use your own trained data if you need a custom solution, or you can get more models from third parties. Tesseract OCR comes with multiple detection engines; which of them you can use depends on the installation method.

To install Tesseract OCR in Ubuntu, install the tesseract-ocr package with apt, for example by running sudo apt install tesseract-ocr.

You can install it in other Linux distributions from the default repositories through the package manager. A universal AppImage file and more installation instructions are available from the project's documentation.

Tesseract OCR comes with support for detecting English language content by default. To enable additional languages, you may have to install more language packs; the installation instructions mentioned above cover this as well. In Ubuntu, you can find the language packages directly by running apt-cache search tesseract-ocr.

The command outputs the package names of the available language packs, which follow the pattern tesseract-ocr- plus a language code. Install the ones you need with apt, for example sudo apt install tesseract-ocr-spa for Spanish.

You can get a list of all installed language packs by running tesseract --list-langs.
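A script can use the same listing to confirm that a language is available before starting a long OCR run. The sketch below assumes that tesseract --list-langs prints one header line followed by one language code per line, and it merges stderr into stdout because some older releases appear to print the list there. The function name has_lang is made up for illustration.

```shell
#!/bin/bash
# has_lang CODE: succeed if tesseract reports CODE among its installed languages.
# Assumption: the first output line is a header, the rest are language codes.
has_lang() {
    tesseract --list-langs 2>&1 | tail -n +2 | grep -qxF -- "$1"
}

# Example use:
#   has_lang spa || echo "Spanish data missing; install tesseract-ocr-spa"
```

The match is on whole lines, so asking for "en" does not succeed just because "eng" is installed.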

Once the main Tesseract OCR package and the additional language packages have been installed, you can start detecting text in image files. Tesseract does not read PDF input directly, so the pages of a PDF must first be converted to images. To extract text, use commands in the following formats:

$ tesseract image.png output -l eng

$ tesseract image.png output -l eng+spa

$ tesseract image.png output -l eng pdf

The first command extracts text from the “image.png” file using the “eng” language data and stores it in a file called “output.txt” (Tesseract adds the extension to the output name you give). The second command parses the image using multiple language packs. The third command creates a PDF file, “output.pdf”, with a text layer superimposed on the image.
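The three command forms can be wrapped in small helpers when many images need the same treatment. This is only a sketch: join_langs and ocr_image are hypothetical names, and the choice to name the output after the image (scan.png becomes scan.txt or scan.pdf) is an assumption, not something Tesseract does on its own.

```shell
#!/bin/bash
# join_langs CODE...: join language codes with "+", the form tesseract's -l option takes.
join_langs() {
    local IFS='+'
    echo "$*"
}

# ocr_image IMAGE LANGS [pdf]: write IMAGE's text next to it, using the image
# name without its extension as the output base (scan.png -> scan.txt or scan.pdf).
ocr_image() {
    local image=$1 langs=$2 format=${3:-}
    local base=${image%.*}
    if [ -n "$format" ]; then
        tesseract "$image" "$base" -l "$langs" "$format"
    else
        tesseract "$image" "$base" -l "$langs"
    fi
}

# Example use:
#   ocr_image scan.png "$(join_langs eng spa)" pdf
```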

For more information on command line usage of Tesseract OCR, run tesseract --help or man tesseract.

gImageReader

gImageReader is a graphical client for the Tesseract OCR engine mentioned above. You can use it to run most of the command line options and actions supported by Tesseract OCR, including extracting text from multiple files, spell-checking the extracted text and performing post-processing on the identified text.

To install gImageReader in Ubuntu, install the gimagereader package, for example by running sudo apt install gimagereader.

You can install it in other Linux distributions from the default repositories through the package manager. More distribution-specific packages are available from the project's page.

Paperwork

Paperwork is a free and open source document manager. You can use it to efficiently manage your library of documents, especially if you have a large collection. It also comes with a built-in OCR mode that uses “Pyocr”, a Python module based on the Tesseract and Cuneiform OCR engines. Other main features of Paperwork include the ability to edit scanned documents, a search bar for searching the document library, document sorting, scanner support, and so on.

To install Paperwork in Ubuntu, install the paperwork-gtk package, for example by running sudo apt install paperwork-gtk.

You can install it in other Linux distributions from the default repositories through the package manager. A universal Flatpak package is also available.

OCRFeeder

OCRFeeder is a free and open source graphical OCR software maintained by the GNOME team. It supports recognizing text in numerous languages and can export content in numerous file formats. It supports many OCR engines, including Tesseract OCR, GOCR, Ocrad and Cuneiform. It also allows you to do some post-processing to improve formatting and layout of the extracted text content.

To install OCRFeeder in Ubuntu, install the ocrfeeder package, for example by running sudo apt install ocrfeeder.

You can install it in other Linux distributions from the default repositories through the package manager. A universal Flatpak package is also available.

Note that in my testing, OCRFeeder installed from the Ubuntu repositories came with only one OCR engine. The Flatpak build came with all four supported OCR engines, though it downloaded around 2 GB of data; the package in the Ubuntu repository was much smaller.

gscan2pdf

gscan2pdf is a free and open source graphical utility that can identify and extract text from a variety of file formats. It can directly work with scanners to scan papers and then export OCR detected text content into PDF files. It also supports multiple OCR engines including Tesseract OCR, GOCR, Ocropus and Cuneiform, as long as packages for these engines are installed on your system. Other than direct scanning of papers, you can also import image files and extract text from them.

To install gscan2pdf in Ubuntu, install the gscan2pdf package, for example by running sudo apt install gscan2pdf.

You can install it in other Linux distributions from the default repositories through the package manager. Source code and executable binaries are also available from the project's website.

Conclusion

These are some of the most useful command line and graphical OCR tools available for Linux. Tesseract OCR is the most actively developed and most comprehensive tool for detecting text, and it should be enough for most of your needs. You can still try the other apps mentioned in this article if you are not satisfied with its results.

About the author

Nitesh Kumar

I am a freelance software developer and content writer who loves Linux, open source software and the free software community.


OCR PDF for Linux

How to OCR to searchable PDF in Linux

Posted by: Cornelius One Transistor Posted on: December 31, 2015 Updated on: 2017-12-25T20:46:56Z

Free software solutions for Linux that can run OCR on PDF documents and convert them to searchable PDF.

There are multiple OCR (optical character recognition) engines for Linux, but most have a major drawback. They can only export plain text of the OCR’ed image and do not support embedding text into the PDF in order to make a searchable PDF.

By searchable PDF, we refer to a scanned PDF document that contains invisible OCR’ed text over the scanned image. The text should have the right size so that it sits over the text portions of the image. Every word in the text layer should overlay exactly the portion of the image that contains that word.

Here are two software solutions that are able to create searchable PDFs. One is a native Linux OCR engine and the other is a free PDF reader with OCR capabilities running in Wine.

1. Tesseract & PDFsandwich

Tesseract is the first and currently the only OCR engine for Linux that supports direct searchable PDF output (starting from version 3.03). The only problem is that it accepts only image input, so you can’t feed it a PDF document. You can install it on APT-based Linux distributions (like Ubuntu) using the following command:

sudo apt-get install tesseract-ocr tesseract-ocr-all

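If you have a bunch of images produced by a scanner, you can write a simple script that will OCR each image into a single-page searchable PDF and then join the pages into a single PDF document:

#!/bin/bash
# OCR_LANG is used instead of LANG so the locale variable is left alone
OCR_LANG=eng   # replace with your language code
shopt -s nullglob
for f in *.tif; do
    echo "Running OCR on $f"
    # --psm 1: automatic page segmentation with orientation detection
    # (Tesseract 3.x spells this option -psm)
    tesseract "$f" "$f" -l "$OCR_LANG" --psm 1 pdf
done
echo "Joining files into single PDF."
pdftk *.tif.pdf cat output ../outdocument.pdf
# remove only the single-page files generated above
rm -f *.tif.pdf

This script takes all .tif files from the directory where it is run and processes them with tesseract. Each page is written as a .tif.pdf file next to its image, the pages are joined into outdocument.pdf in the parent directory, and the single-page files are deleted afterwards. To use it, you also need pdftk installed. Copy the above snippet into a new file ocr.sh, make it executable (chmod +x ocr.sh), then place it in the folder with the scanned images and run it.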

Things get complicated if you already have a PDF document that you want to make searchable. In order to use tesseract, it must be exported to images. And to do this, you must know the resolution of the scanned image. And this can be a problem if you didn’t scan the document and have no idea what resolution it is.
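The manual route can be sketched as follows. It assumes pdftoppm from poppler-utils is installed (a tool this article does not otherwise use) and that 300 DPI is an acceptable guess when the scan resolution is unknown. pdf_to_tiffs is a hypothetical helper name.

```shell
#!/bin/bash
# pdf_to_tiffs PDF [DPI] [PREFIX]: render every page of PDF to a TIFF image.
# pdftoppm names the files PREFIX-<page number>.tif.
pdf_to_tiffs() {
    local pdf=$1 dpi=${2:-300} prefix=${3:-page}
    pdftoppm -r "$dpi" -tiff "$pdf" "$prefix"
}

# Example use, followed by OCR of each rendered page:
#   pdf_to_tiffs input_document.pdf 300 page
#   for f in page-*.tif; do tesseract "$f" "$f" -l eng pdf; done
```

Rendering at too low a resolution makes small text harder to recognize, which is why the DPI is exposed as a parameter rather than fixed.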

The pdfsandwich script automates these steps: it converts the PDF pages to images, runs tesseract on them and assembles a searchable PDF. Three of its options are worth knowing:

-nopreproc is useful when the PDF already contains processed images and you don’t want any other processing. Note that by default, pdfsandwich will convert your document to black and white! Using this option you avoid any kind of conversion.
-resolution has a default value of 300 DPI. It is used when converting PDF pages to images, and 300 is a good value. If your document contains small text and you know or believe it was scanned at a higher DPI, specify that value.
-lang must always be specified if you need to OCR in a language other than English. This parameter is passed to tesseract. The availability of languages depends on the installed tesseract-ocr-* packages.

pdfsandwich -lang eng input_document.pdf
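The three options can be combined in one call. The wrapper below is a sketch with a made-up name, make_searchable, and its defaults (English, 300 DPI, no preprocessing) are assumptions chosen to match the descriptions above.

```shell
#!/bin/bash
# make_searchable PDF [LANG] [DPI]: run pdfsandwich without preprocessing,
# so the page images are not converted to black and white.
make_searchable() {
    local input=$1 lang=${2:-eng} dpi=${3:-300}
    pdfsandwich -lang "$lang" -resolution "$dpi" -nopreproc "$input"
}

# Example use (Romanian document rendered at 400 DPI):
#   make_searchable input_document.pdf ron 400
```

Unless told otherwise, pdfsandwich normally writes its result to a new file with _ocr added to the input name, so the original document is left untouched.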

2. PDF X-Change Viewer

This is a free PDF reader with a lot of other functions, provided by Tracker Software. It is a Windows-only application that runs in Wine. I tested the viewer in Wine 1.6, 1.7 and 1.8 and it worked great in all these versions. Yet the OCR engine only worked with Wine 1.8, which is available from a PPA.

To install it in Linux, you must have Wine 1.8 installed (wine1.8:i386 package) and download the following files from Tracker Software:

Portable PDF Viewer archive: Portable version (ZIP) | 8 MB

Portable PDF Viewer OCR engine: Portable Version (OCR Lang Files) | 8 MB

Additional OCR languages: choose a package that contains the language(s) you are interested in.

Extract the ZIP file by right clicking it and choosing Extract Here. You should get a folder PDFX_Vwr_Port. Extract the OCR Lang files archive and you will get an ocrdats folder. Put this folder in the PDFX_Vwr_Port folder. You can now start PDFXCview.exe with wine and you can OCR English, German, French and Spanish documents.

If you want additional languages, extract the Additional language packs archive. You will get an exe file. Don’t launch it because it will not install. Instead install innoextract package and extract it. Here is what I did with the EU language pack:

innoextract OCRAdditionalLangsEU.exe

You will get two folders (code:SetAppFolder|inst and code:SetEditorFolder|inst) with identical content. A language pack contains two files: .lng and _pxvocr.dat. You need to copy both files to the ocrdats folder. For example, to run OCR in Romanian, I copied rom.lng and ron_pxvocr.dat from one of those two folders.
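The copy step can be scripted if you install several languages. This is a minimal sketch: copy_lang_files is a hypothetical helper, and the folder names in the usage comment are taken from the description above, so adjust them to whatever innoextract actually produces on your system.

```shell
#!/bin/bash
# copy_lang_files SRC DEST FILE...: copy the named language files from the
# extracted installer folder SRC into the viewer's ocrdats folder DEST.
copy_lang_files() {
    local src=$1 dest=$2 name
    shift 2
    mkdir -p "$dest" || return 1
    for name in "$@"; do
        cp -- "$src/$name" "$dest/" || return 1
    done
}

# Example use (file names as in the Romanian example):
#   innoextract OCRAdditionalLangsEU.exe
#   copy_lang_files "code:SetAppFolder|inst" PDFX_Vwr_Port/ocrdats rom.lng ron_pxvocr.dat
```

The function stops at the first file it cannot copy, so a mistyped file name is noticed before the viewer is started.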

OCR in PDF X-Change Viewer

To launch OCR, load a document in the viewer and press the OCR button (1). Select page range (2), choose a language (3) and start (4).

Notes: in Wine 1.6, PDF X-Change Viewer crashed when launching OCR (on clicking the OK button). In Wine 1.7 it crashed after reaching 99% OCR progress. In Wine 1.8 it works without issues.

Conclusion

As you can see, you can OCR documents and make searchable PDFs on Linux, and do it with free software. Comparing the two applications presented here, PDF X-Change Viewer is faster than Tesseract. Processing time trades off against accuracy, though, and Tesseract is known for being highly accurate.
