Linux split pdf page

Split pages in pdf

I have a scanned PDF file in which two real pages appear together on one virtual page. The scan quality is good. The problem is that I have to zoom in when reading and drag from left to right.
Is there a command (convert, pdftk, ...) or a script that can convert this PDF file into one with normal pages (one page of the book = one page in the PDF file)?

For the record, the reverse operation (joining multiple pages) can be done from the command line (rather than with «print to file») using pdfnup, from the pdfjam suite.

11 Answers

Just an addition since I had issues with the python script (and several other solutions): for me mutool worked great. It’s a simple and small addition shipped with the elegant mupdf reader. So you can try:

mutool poster -y 2 input.pdf output.pdf 

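With -y 2, each page is cut into a top and a bottom half. To cut each page into a left and a right half, which is what a side-by-side two-page spread usually needs, replace y with x. And you can, of course, combine the two for more complex layouts.

Really happy to have found this (after years of daily mupdf usage) 🙂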

Installing mupdf and mutool from source

( mutool comes shipped with mupdf starting from version 1.4: http://www.mupdf.com/news)

wget http://www.mupdf.com/downloads/mupdf-1.8-source.tar.gz
tar -xvf mupdf-1.8-source.tar.gz
cd mupdf-1.8-source
sudo make prefix=/usr/local install

Or go to the downloads page to find a newer version.

Installing mutool from a Linux distribution package

On Debian, the package containing mutool is mupdf-tools :

apt-get install mupdf-tools 

I had a DjVu file. I turned it into PostScript (quite fast), then into a PDF (turtle slow), and finally mutool cut it so fast I thought it hadn’t worked. It had!

This one is the easiest and the best. mutool was made for this. Also, beware of -y ; I think in most cases what you want is -x .

This utility is very fast, however I have a problem with the page order. The command allocates the right page at first position and the left page in the second one. Can somebody help me with this issue?
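If the halves do come out in right-then-left order, one possible workaround is to swap every consecutive pair of pages afterwards. Below is a minimal sketch of that reordering in plain Python. The name swap_pairs is a hypothetical helper (it is not part of mutool or any PDF library), and the sketch assumes every input page was split into exactly two output pages.

```python
def swap_pairs(page_count):
    """Return 0-based page indices with each consecutive pair swapped.

    A trailing unpaired page (odd page_count) keeps its position.
    """
    order = []
    for first in range(0, page_count - 1, 2):
        order.extend([first + 1, first])
    if page_count % 2:
        order.append(page_count - 1)
    return order


if __name__ == "__main__":
    # For a 6-page output, print the 1-based page order after swapping.
    print([i + 1 for i in swap_pairs(6)])
```

The resulting order can then be handed to any tool that copies pages by number, for instance as the page list of a pdftk cat call.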

Here’s a small Python script using the old PyPdf library that does the job neatly. Save it in a script called un2up (or whatever you like), make it executable ( chmod +x un2up ), and run it as a filter ( un2up <2up.pdf >1up.pdf ).

#!/usr/bin/env python
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    q = copy.copy(p)
    (w, h) = p.mediaBox.upperRight
    p.mediaBox.upperRight = (w/2, h)
    q.mediaBox.upperLeft = (w/2, h)
    output.addPage(p)
    output.addPage(q)
output.write(sys.stdout)

Ignore any deprecation warnings; only the PyPdf maintainers need be concerned with those.

If the input is oriented in an unusual way, you may need to use different coordinates when truncating the pages. See Why my code not correctly split every page in a scanned pdf?
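As a rough illustration of what "different coordinates" can mean, the sketch below computes the two half-boxes from a media box given as (left, bottom, right, top) in PDF points. The name split_box is a hypothetical helper, not part of PyPdf. It assumes that a box wider than it is tall should be cut into left and right halves and anything else into top and bottom halves, and it takes the box origin into account instead of assuming (0, 0).

```python
def split_box(left, bottom, right, top):
    """Split a (left, bottom, right, top) box into two halves.

    Returns (first, second): left/right halves for a landscape box,
    top/bottom halves (top first) otherwise.
    """
    width = right - left
    height = top - bottom
    if width > height:
        mid = left + width / 2
        return (left, bottom, mid, top), (mid, bottom, right, top)
    mid = bottom + height / 2
    return (left, mid, right, top), (left, bottom, right, mid)


if __name__ == "__main__":
    # A4 landscape spread, 842 x 595 pt, with the origin at (0, 0).
    print(split_box(0, 0, 842, 595))
```

Each returned tuple can be used as the new media box of one copy of the page.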

Just in case it’s useful, here’s my earlier answer which uses a combination of two tools plus some manual intervention:

  • Pdfjam (at least version 2.0), based on the pdfpages LaTeX package, to crop the pages;
  • Pdftk, to put the left and right halves back together.

Both tools are needed because as far as I can tell pdfpages isn’t able to apply two different transformations to the same page in one stream. In the call to pdftk , replace 42 by the number of pages in the input document ( 2up.pdf ).

pdfjam -o odd.pdf --trim '0cm 0cm 14.85cm 0cm' --scale 1.141 2up.pdf
pdfjam -o even.pdf --trim '14.85cm 0cm 0cm 0cm' --scale 1.141 2up.pdf
pdftk O=odd.pdf E=even.pdf cat $(i=1; while [ $i -le 42 ]; do echo O$i E$i; i=$(($i+1)); done) output all.pdf

In case you don’t have pdfjam 2.0, it’s enough to have a PDFLaTeX installation with the pdfpages package (on Ubuntu, you need texlive-latex-recommended and perhaps texlive-fonts-recommended), and use the following driver file driver.tex :

\batchmode
\documentclass{minimal}
\usepackage{pdfpages}
\begin{document}
\includepdfmerge[trim=0cm 0cm 14.85cm 0cm,scale=1.141]{2up.pdf,-}
\includepdfmerge[trim=14.85cm 0cm 0cm 0cm,scale=1.141]{2up.pdf,-}
\end{document}

Then run the following commands, replacing 42 by the number of pages in the input file (which must be called 2up.pdf ):

pdflatex driver
pdftk driver.pdf cat $(i=1; pages=42; while [ $i -le $pages ]; do echo $i $(($pages+$i)); i=$(($i+1)); done) output 1up.pdf
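The two shell loops passed to pdftk only build an interleaved page list. For readers who find the inline loops hard to follow, here is a minimal Python sketch of the same two orderings. The function names are hypothetical, and the page count is passed in by hand just as the number 42 is above.

```python
def interleaved_handles(pages):
    """Page list for `pdftk O=odd.pdf E=even.pdf cat ...`: O1 E1 O2 E2 ..."""
    args = []
    for i in range(1, pages + 1):
        args += [f"O{i}", f"E{i}"]
    return args


def interleaved_numbers(pages):
    """Page list for `pdftk driver.pdf cat ...`: 1 N+1 2 N+2 ..."""
    args = []
    for i in range(1, pages + 1):
        args += [str(i), str(pages + i)]
    return args


if __name__ == "__main__":
    print(" ".join(interleaved_handles(3)))  # O1 E1 O2 E2 O3 E3
    print(" ".join(interleaved_numbers(3)))  # 1 4 2 5 3 6
```

The first form matches the pdfjam variant, where the left and right halves live in two separate files; the second matches the pdflatex variant, where all left halves are followed by all right halves in a single file.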

The PyPdf library works perfectly. I only changed the script a little and ran it with python conv_pdf.py res.pdf . How would you run your script with its shebang from the command line?

I’d like to try the version with pdfjam too (because of the slight scaling), but after installing the pdfjam package my shell won’t recognize the pdfjam command.

@xralf: My python script just reads from standard input and writes to standard output. The pdfjam version requires pdfjam 2.0; it’s only a small wrapper around pdfpages, and I’ve added the bit of LaTeX it generates so you can use that directly. The scaling issue is probably solvable with pypdf, it could be a page size issue (I may or may not be able to help if you give more details on what’s happening and especially the page sizes involved).

Thank you, the difference is a very slightly worse resolution, but this doesn’t matter. I will come back to it when I know more about LaTeX (it’s too complex for me now, and the solution with PyPdf is really good).

@Gilles Very useful script. I expected to see something like that in pdfjam or pdftk. Anyway, some people may want to modify it to split pages along the other axis and use a different ordering. This is possible by changing a few lines and using q.mediaBox.lowerRight = (w, h/2)

Imagemagick can do it in one step:

$ convert in.pdf -crop 50%x0 +repage out.pdf 

It looks like convert uses a raster image as its intermediate format. That causes a blurry look even when the original PDF contains vector objects.

Does anyone know of a way to do this without rasterizing the page contents along the way, or at least to set a higher resolution?


This rendered the text into images and created a PDF from the images. Maybe nice for pictures, but useless for text extraction.

Based on answer from Gilles and how to find PDF page count I wrote

#!/bin/bash
pdforiginal=$1
pdfood=$pdforiginal.odd.pdf
pdfeven=$pdforiginal.even.pdf
pdfout=output_$1
# Second and third arguments; the defaults (0 and 1) are assumed
margin=${2:-0}
scale=${3:-1}

# "NumberOfPages: N" -> N
pages=$(pdftk $pdforiginal dump_data | grep NumberOfPages | awk '{print $2}')

# "Page size: W x H pts" -> W (page width in points)
pagesize=$(pdfinfo $pdforiginal | grep "Page size" | awk '{print $3}')
margin=$(echo $pagesize/2-$margin | bc -l)

pdfjam -o $pdfood --trim "0cm 0cm ${margin}pt 0cm" --scale $scale $pdforiginal
pdfjam -o $pdfeven --trim "${margin}pt 0cm 0cm 0cm" --scale $scale $pdforiginal

pdftk O=$pdfood E=$pdfeven cat $(i=1; while [ $i -le $pages ]; do echo O$i E$i; i=$(($i+1)); done) output $pdfout

rm $pdfood $pdfeven

The margin adjustment and the scale are passed as arguments after the file name, for example 50 to adjust the margin and 1.2 for the scale.
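The trim arithmetic in that script (half the page width minus the margin adjustment) can be sketched in Python as follows. The name half_width_trim is a hypothetical helper, and the sample line assumes pdfinfo reports the size in the usual "Page size: W x H pts" form; other pdfinfo versions or locales may format it differently.

```python
import re


def half_width_trim(pdfinfo_output, margin=0.0):
    """Return the trim in points: half the page width minus `margin`.

    Expects a line like 'Page size:      842 x 595 pts (A4)'.
    """
    match = re.search(r"Page size:\s+([\d.]+)\s+x\s+([\d.]+)\s+pts", pdfinfo_output)
    if match is None:
        raise ValueError("no 'Page size' line found")
    width = float(match.group(1))
    return width / 2 - margin


if __name__ == "__main__":
    sample = "Pages:          42\nPage size:      842 x 595 pts (A4)\n"
    print(half_width_trim(sample, margin=50))  # 371.0
```

The value returned here corresponds to the number the shell script puts in front of pt in the --trim option.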

ImageMagick’s Convert command can help you to crop your file in 2 parts. See http://www.imagemagick.org/Usage/crop/

If I were you, I’d write a (shell) script like this:

  1. Split your file with pdfsam: 1 page = 1 file on disk. (The format doesn’t matter. Choose one that ImageMagick knows. I’d just take PS or PDF.)
  2. For each page, crop the first half and put it to a file named $A
  3. Crop the second half and put it to a file named $B. You get 1A.pdf, 1B.pdf, 2A.pdf, 2B.pdf, etc.
  4. Now, assemble this again in a new PDF. There are many methods to do this.

Wouldn’t using ImageMagick rasterize the files? And you should explain that last part inline, especially for the benefit of the non-francophones in the audience.

Because you don’t need to understand French. It just shows how you can use ImageMagick’s convert, pdftk, or ghostscript (gs) alone to achieve this goal. I like using pdftk. «Rasterizing» doesn’t matter as it’s a scanned document.

Here’s a variation of the PyPDF code posted by Gilles. This function will work no matter what the page orientation is:

import copy
import math
import pyPdf

def split_pages(src, dst):
    src_f = file(src, 'r+b')
    dst_f = file(dst, 'w+b')

    input = pyPdf.PdfFileReader(src_f)
    output = pyPdf.PdfFileWriter()

    for i in range(input.getNumPages()):
        p = input.getPage(i)
        q = copy.copy(p)
        q.mediaBox = copy.copy(p.mediaBox)

        x1, x2 = p.mediaBox.lowerLeft
        x3, x4 = p.mediaBox.upperRight

        x1, x2 = math.floor(x1), math.floor(x2)
        x3, x4 = math.floor(x3), math.floor(x4)
        x5, x6 = math.floor(x3/2), math.floor(x4/2)

        if x3 > x4:
            # horizontal
            p.mediaBox.upperRight = (x5, x4)
            p.mediaBox.lowerLeft = (x1, x2)

            q.mediaBox.upperRight = (x3, x4)
            q.mediaBox.lowerLeft = (x5, x2)
        else:
            # vertical
            p.mediaBox.upperRight = (x3, x4)
            p.mediaBox.lowerLeft = (x1, x6)

            q.mediaBox.upperRight = (x3, x6)
            q.mediaBox.lowerLeft = (x1, x2)

        output.addPage(p)
        output.addPage(q)

    output.write(dst_f)
    src_f.close()
    dst_f.close()

The best solution was mutool (see above):

sudo apt install mupdf-tools pdftk 
mutool poster -y 2 input.pdf output.pdf 

but then you need to rotate the pages left:

pdftk output.pdf cat 1-endleft output rotated.pdf 

I made a webpage for this purpose just now: here. From mobile or desktop, upload a PDF containing two-page spreads, and it will split each such page into two of half the width. Click on the new filename to download the resulting PDF.

Everything is self-contained in a single HTML file so if you don’t want to rely on webpages that may go away (not that I have any intention of taking it down), you can save the webpage and use it offline. It also seems that the file size doesn’t double, unlike some of the other solutions.


The main reason I wrote this, rather than using one of the existing answers on this page directly, is that I noticed that in many PDFs containing two-page spreads, there are also some pages that correspond to a single page. For example, the first page may be the book cover. We don’t want to slice such pages in half; we want to split only the pages that are genuinely made of two pages side-by-side (identifiable as being wider than they are tall).

Alternatively, here are a couple of Python scripts implementing that feature (it shouldn’t make a difference if your PDF contains only two-page spreads, but otherwise, replace the width > height check with True if you don’t need it):

#!/usr/bin/env python3
import copy
import sys
import PyPDF2

'''Run as: python3 un2up.py foo.pdf foo-split.pdf
Generates foo-split.pdf.'''

input = PyPDF2.PdfFileReader(open(sys.argv[1], 'rb'))
output = PyPDF2.PdfFileWriter()
for p in [input.getPage(i) for i in range(0, input.getNumPages())]:
    (w, h) = p.mediaBox.upperRight
    if w > h:
        # Two-page spread: emit the left half, then the right half
        q = PyPDF2.pdf.PageObject.createBlankPage(
            None, p.mediaBox.getWidth(), p.mediaBox.getHeight())
        q.mergePage(p)
        p.mediaBox.upperRight = (w/2, h)
        q.mediaBox.upperLeft = (w/2, h)
        output.addPage(p)
        output.addPage(q)
    else:
        # Single page (e.g. a cover): keep it unchanged
        output.addPage(p)
output.write(open(sys.argv[2], 'wb'))

'''Run as: python3 unspread.py foo.pdf
Generates file `unspread.foo.pdf`.'''
import sys
import os
from pdfrw import PdfReader, PdfWriter, PageMerge

def splitpage(src):
    '''Split a page into two (left and right).'''
    # Convert the MediaBox entries to float so the comparison is numeric
    x, y, width, height = (float(v) for v in src.MediaBox)
    if width > height:
        # Yield a result for each half of the page
        for x_pos in (0, 0.5):
            yield PageMerge().add(src, viewrect=(x_pos, 0, 0.5, 1)).render()
    else:
        yield src

inpfn = sys.argv[1]
outfn = 'unspread.' + os.path.basename(inpfn)
writer = PdfWriter(outfn)
for page in PdfReader(inpfn).pages:
    writer.addpages(splitpage(page))
writer.write()


Split PDF document from command line in Linux?

You (should) know that Pdftk is nothing more than a very old version of iText (a Java-PDF library) compiled with GCJ and extended with some command line functionality.

The keywords in the above statement are «VERY OLD».

$ java -classpath /path/to/Multivalent20091027.jar tool.pdf.Split -page 1 input.pdf
Exception in thread "main" java.lang.NoClassDefFoundError: tool/pdf/Split
Caused by: java.lang.ClassNotFoundException: tool.pdf.Split
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: tool.pdf.Split. Program will exit.

It turns out this is a somewhat tricky piece of software: even though it’s on SourceForge, and it says here that

Practical Thought generously provides these tools for free use on the command line

The browser is open source. The document tools are a free bonus and not open source.

All releases of Multivalent linked from the official sourceforge site are missing the tools package.

(edit: there seems to be an old Multivalent version with the tools included, see the SO link; but as it looks somewhat like abandonware, I’d rather not use it)

Finally, I’d like to avoid tools that are essentially front ends for LaTeX like pdfjam.

Are there any options for such a PDF splitting command line tool under Linux?
