Replace text in pdf linux

Command line tool to search and replace text on a PDF

I have a PDF that has my name as an obnoxious watermark throughout a rather long file. I tried replacing the text with blanks in LibreOffice Draw, but although my name does appear as text, the find-and-replace function tanks my computer, eating significant RAM and CPU time. Is there a command-line way to remove strings from a PDF? Hmm, can sed do that?

Accepted answer will work only in rare cases

Sorry, but the answer given by @dessert is about as wrong as it could be as general advice. It will not work for the general case of text replacement in PDFs (watermarks or not), and you would have to be very lucky to encounter one of the rare PDFs where it does work. (Moreover, watermarks inserted by LibreOffice are frequently converted into vector or pixel graphics, even if they look like text when printed or viewed on screen. I will not discuss that case any further; below I deal only with real text content in a PDF.)

Reasons

The reasons for this are these:

  1. What appears to be ASCII text in the visual representation of its content in a PDF viewer, very likely will not be ASCII text inside the PDF source code. Instead it may be hex encoded.
  2. Additionally, an ASCII string’s individual characters might be placed on the page in a consecutive order, but they may easily be placed individually, with each having its own coordinate information sprinkled in between the individual characters.
  3. Also, the hex encoding of the ASCII (and non-ASCII) character table (the «mapping») will not be predictable, and it may change from font to font.

Hence in all these cases your sed command will not succeed — not even after uncompressing the PDF.

Example

Here is an example for the «string» Watermark, how it can appear inside a PDF created with LibreOffice:

56.8 726.989 Td /F2 16 Tf[29-26-1]TJ 

I’ll dissect for you what that means:

  • 56.8 726.989 Td : Td is an operator to move the text positioning on the page; 56.8 726.989 are the x-/y-coordinates to describe that exact position.
  • /F2 16 Tf : Tf is an operator to set a certain font as well as its size as the currently active one; in this case it is the font tagged elsewhere with the name /F2 and its size should be 16 pt.
  • [29-26-1]TJ : TJ is an operator to show text while at the same time allowing for individual glyph positioning. The hex snippets enclosed by angle brackets have the following meanings, according to the ‘charmap’ table specific to that PDF and the font used:
    • first snippet: this is the ‘W’.
    • second snippet: this is the ‘at’.
    • third snippet: this is the ‘er’.
    • fourth snippet: this is the ‘m’.
    • fifth snippet: this is the ‘ark’.

    The numbers in between these hex snippets (29, -2, 6 and -1) are correction values that determine the spacing between the individual characters.

    Now show me how you would replace that «string» with something else using sed. Remember that when you deal with an arbitrary PDF, you know neither the encoding nor the placement correction numbers in advance. You can only find out by opening its source code in an editor and analysing its content.
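To make the point concrete, here is a minimal shell sketch. The glyph codes in angle brackets are made-up placeholders (the real ones depend on the individual PDF and font), and the helper names count_matches and try_replace are purely illustrative. It shows only that a literal search for the visible word finds nothing in a line like this, so sed has nothing to replace.

```shell
# Hypothetical content-stream line: the hex glyph codes are invented
# for illustration and do not come from any real PDF.
stream='56.8 726.989 Td /F2 16 Tf[<01>29<0203>-2<0405>6<06>-1<020507>]TJ'

# Print how many lines of the text in $2 contain the literal string $1.
# grep exits non-zero when nothing matches, hence the "|| true".
count_matches() {
    printf '%s\n' "$2" | grep -c -F -- "$1" || true
}

# Attempt the naive replacement; the line comes back untouched
# because the word never occurs literally in it.
try_replace() {
    printf '%s\n' "$1" | sed 's/Watermark/Draft/'
}

count_matches 'Watermark' "$stream"   # prints 0
try_replace "$stream"                 # prints the line unchanged
```

A replacement that actually worked would need the font's character map to translate the search word into glyph codes first, and would still have to cope with the spacing numbers between them.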

    Executive Summary

    No, there is no command line way to reliably remove unwanted strings from a PDF, unless all of the following apply:

    (a) you are a PDF expert who is skilled at reading PDF source code;

    (b) you are prepared to analyse the PDF file in question individually;

    (c) you use a text editor to modify its contents after uncompressing the PDF source code.

    WARNING: The answer currently marked as ‘accepted’ might have worked for the specific PDF of the OP. However, it will not work in the general case. Don’t take the «recipe» it advertises for granted!

    Use this utility to replace text in a PDF file without having to edit the file with a bespoke PDF editor

    gerritonagoodday/pdfreplacetxt

    README.md

    Use this utility to replace text in a PDF file without having to edit the file with a bespoke PDF editor. It can only do basic replacements using regular expressions, such as you might apply to a text file with the ‘sed’ command, e.g.:

    sed -e 's/[search pattern]/[replacement pattern]/' -i textfile 

    The equivalent for a PDF file is:

    pdfreplacetxt file.pdf "[search pattern]" "[replacement pattern]" 
    • Do not assume that line breaks inside a PDF file are where they appear to be when a PDF viewer displays the file. It is therefore important not to use open-ended regexes such as «.*» in a search pattern. Generally, be very specific about what you want to replace. Just like normal ‘sed’, this tool does not match across lines either.
    • If the replacement string is much longer than the searched-for string, the resulting document may lose some of its formatting.
    • If the letters of a word in the searched-for string are written in different fonts, sizes or offsets, the chances are pretty slim that the searched-for string will be found.
    • In rare cases, the text to be removed is encapsulated inside a font in the PDF file, so it can’t readily be identified and removed using this process.
    • Sometimes a PDF file contains an entire bitmap image and no text. This is especially the case for PDF documents that were created from scanned paper documents. This utility will not help in such cases.
    • Take care not to specify a PDF control term to be replaced, as this can potentially corrupt your entire PDF file.
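
Most of these caveats follow from the general approach that a tool like this has to take. The sketch below is an assumption about that approach, not the actual source of pdfreplacetxt: uncompress the PDF with pdftk, run sed over the raw bytes, then have pdftk write a recompressed copy. The function name pdf_sed and its argument order are hypothetical.

```shell
# Usage: pdf_sed INPUT.pdf PATTERN REPLACEMENT RESULT.pdf
# PATTERN and REPLACEMENT go into a sed "s" command unescaped,
# so they must not contain a "/" in this simple sketch.
pdf_sed() {
    input=$1 pattern=$2 replacement=$3 result=$4
    workdir=$(mktemp -d) || return 1

    # Step 1: expand compressed streams so the page content is visible.
    # Step 2: edit byte-wise (LC_ALL=C keeps sed from choking on binary data).
    # Step 3: have pdftk write a recompressed copy of the edited file.
    pdftk "$input" output "$workdir/plain.pdf" uncompress &&
        LC_ALL=C sed -e "s/$pattern/$replacement/g" \
            "$workdir/plain.pdf" > "$workdir/edited.pdf" &&
        pdftk "$workdir/edited.pdf" output "$result" compress
    status=$?

    rm -rf "$workdir"
    return "$status"
}
```

Because sed works line by line on whatever bytes the content stream happens to contain, the pattern only matches when the text is stored literally and contiguously, which is exactly what the caveats above warn about.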

    You need to have the PDF Toolkit, a.k.a. «pdftk», installed. Use one of the following installation commands, depending on your Linux distribution:

    yum install pdftk
    apt-get install pdftk
    emerge pdftk

    Copy these files into your /usr/local/bin directory:

    $ sudo cp pdfreplacetxt* /usr/local/bin

    For example:

    pdfreplacetxt ebook.pdf "09 October" "10 October"

    To change the face value of a PDF cheque (that’s a «check» to you lot in America) from $100 to $1000, you could do this (but you shouldn’t):

    pdfreplacetxt cheque.pdf "100\.00" "1000\.00"
    pdfreplacetxt cheque.pdf "HUNDRED" "THOUSAND"

    BTW, don’t do this, it’s called fraud. I am using this as an example:

    • To illustrate that most PDF documents are not secured and can easily be adulterated.
    • Because cheques should not be used as a modern payment method anyway. Cheques and the people who use them are both stupid. And the banks that still advocate cheques are greedy.

    To become a Big-Time criminal and apply this to all the $100 cheques in a directory tree, do this to secure your place in jail next to Bernie Madoff:

    find . -name "*.pdf" -exec pdfreplacetxt {} "100\.00" "1000\.00" \; -print
    find . -name "*.pdf" -exec pdfreplacetxt {} "HUNDRED" "THOUSAND" \; -print

    Gerrit Hoekstra. You can contact me via https://github.com/gerritonagoodday

    This work was created from 100%-recycled electrons. No animals were hurt during the production of this work, except when I forgot to feed my cats that one time. The cats and I are on speaking terms again.

    How to replace a single page in a pdf using another pdf in linux?

    The output consists of the first 12 pages of inA.pdf , followed by page 3 of inB.pdf and then pages 14 until end of inA.pdf .
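
A command matching that description can be sketched as follows. The handle syntax (A=, B=, cat, output) is the one quoted in the comments further down; the helper names page_ranges and replace_page are hypothetical, and the sketch assumes the page being replaced is not the last page of the first document.

```shell
# Print the pdftk page-range list that swaps page N of document A
# for page M of document B. Usage: page_ranges N M
page_ranges() {
    n=$1 m=$2
    ranges=""
    if [ "$n" -gt 1 ]; then
        ranges="A1-$((n - 1)) "
    fi
    printf '%s\n' "${ranges}B${m} A$((n + 1))-end"
}

# Usage: replace_page inA.pdf N inB.pdf M out.pdf
# The command substitution is left unquoted on purpose so that each
# range becomes a separate pdftk argument.
replace_page() {
    pdftk "A=$1" "B=$3" cat $(page_ranges "$2" "$4") output "$5"
}

page_ranges 13 3   # prints: A1-12 B3 A14-end
```

With these helpers, replace_page inA.pdf 13 inB.pdf 3 out.pdf would run pdftk A=inA.pdf B=inB.pdf cat A1-12 B3 A14-end output out.pdf.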

    Many Linux distributions provide a PDFtk package you can download and install using their package manager.

    Thanks! Note that A= and B= must be uppercase. When I tried lowercase, pdftk thought a= was part of the filename.

    Didn’t work for me:

    pdftk A=./inA.pdf B=./inB.pdf A1-4 B1 A6-end output out.pdf
    Error: Unable to find file.
    Error: Failed to open PDF file: A1-4
    Error: Unable to find file.
    Error: Failed to open PDF file: B1
    Error: Unable to find file.
    Error: Failed to open PDF file: A6-end
    Errors encountered. No output created.
    Done. Input errors, so no output created.

    I used qpdf instead and it worked.

    @azbarcea: the cat is missing. The complete command is: pdftk A=./inA.pdf B=./inB.pdf cat A1-4 B1 A6-end output out.pdf . While qpdf certainly has its merits, pdftk allows additional rotation parameters for the cat operation. Try pdftk --help

    This will not preserve bookmarks. As a workaround, you can use dump_data (check the docs) to dump the bookmarks into a text file, and update_info to push those bookmarks into your output PDF file. Note that the pdftk data dump only preserves page numbers, not offsets within the page. So if a bookmark links to something at the bottom of a page, then after going through dump_data and update_info the link will go to the top of that page.
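
That workaround might look roughly like the sketch below. dump_data and update_info are pdftk operations; the function name restore_bookmarks and the temporary file handling are hypothetical. It also assumes the page numbering is unchanged, as it is when exactly one page is swapped for one page, and that the installed pdftk version writes bookmarks through update_info.

```shell
# Usage: restore_bookmarks ORIGINAL.pdf MERGED.pdf RESULT.pdf
# Copies bookmarks and Info metadata from ORIGINAL.pdf into a copy of
# MERGED.pdf, written to RESULT.pdf.
restore_bookmarks() {
    original=$1 merged=$2 result=$3
    info=$(mktemp) || return 1

    # Dump the metadata (including bookmarks) of the original file,
    # then apply that dump to the merged file.
    pdftk "$original" dump_data output "$info" &&
        pdftk "$merged" update_info "$info" output "$result"
    status=$?

    rm -f "$info"
    return "$status"
}
```

For the page-replacement case above, this would be called as restore_bookmarks inA.pdf out.pdf out_with_bookmarks.pdf.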
