Scanning to PDF in Linux

I scan documents for two main reasons:

  1. to have backup copies of my airplane’s technical logs (a plane can lose tens of thousands of dollars of value if the logs are lost); and
  2. to allow me to submit expense claims to customers by e-mail, using scanned receipts.

It’s very easy to scan individual pages to just about any format in Linux using graphical frontends like XSane or The Gimp, but when there’s more than one page, nothing beats PDF for ease of use at the receiver’s end (especially when you’ll be sending the file to an admin assistant running Windows and reading e-mail in Outlook). After a bit of experimentation, I found a few steps that actually work:

  • In the XSane preview window, preset the area to Letter size, choosing any resolution you want (150 or 300 dpi are probably the best choices).
  • Save your scans in the format of your choice.
  • Use the convert utility from ImageMagick to merge all of the scanned pages into a PostScript file. It is critical to use the -density option with your scan DPI so that the pages come out the right size, e.g. “convert -density 150 *.tiff output.ps”.
  • Use the ps2pdf utility from Ghostscript to convert the PostScript file to PDF, e.g. “ps2pdf output.ps output.pdf”.
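
The last two steps can be wrapped in a small shell function. This is only a sketch: the name scans_to_pdf and its argument order are made up for illustration, and it assumes ImageMagick's convert and Ghostscript's ps2pdf are on the PATH.

```shell
#!/bin/sh
# Sketch: merge scanned pages into one PDF by way of PostScript.
# Usage (hypothetical helper): scans_to_pdf DPI OUTPUT.pdf PAGE...
scans_to_pdf() {
    dpi=$1
    out=$2
    shift 2
    ps="${out%.pdf}.ps"    # intermediate PostScript file next to the output
    # -density must match the scan resolution so pages keep their physical size
    convert -density "$dpi" "$@" "$ps" &&
        ps2pdf "$ps" "$out" &&
        rm -f "$ps"
}
```

With pages scanned at 150 dpi, a call might look like “scans_to_pdf 150 output.pdf page-*.tiff”.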

I’ve tried many other approaches (including using the libtiff utilities with all compression options, and using convert to go straight to PDF), and they all result in either huge or malformed PDF files. This is the one approach that works for me.

There must be a tool out there, GUI or command line, that will allow me to batch scan multipage documents straight into PDF without all this messing around. I haven’t found it, but I’ll be happy to hear about such a tool in comments.

About David Megginson

Scholar, tech guy, Canuck, open-source/data/information zealot, urban pedestrian, language geek, tea drinker, pater familias, red tory, amateur musician, private pilot.

21 Responses to Scanning to PDF in Linux

  1. Tommy says:

    I found this post with a Google search… I’ve been dreaming of the same tool for a long time now, and after MUCH experimentation came upon the same process you describe. I’ve been using the same process successfully for over a year. It actually works very well, producing very usable and quite compact pdf images, with file sizes smaller than ones created on the Acrobat 4 I was using on my old Mac.

    One thing I do, because I tend to use ancient throwaway PCs with limited processing power: since ImageMagick doesn’t care what the source files are, it’s fastest to scan in XSane’s default format (pnm) rather than tiff, so you don’t have to wait for the image to convert after each page is scanned. As you have noticed, you MUST include the -density option or you won’t get the results you expect from ImageMagick.

    If you use KDE, it’s pretty easy to install a script that you can call up in Konqueror with a right-click contextual menu. I created one I named “Convert to PDF” which performed the convert and ps2pdf steps. I’m sure I could use the same script in GNOME but I haven’t gotten around to figuring it out yet.

    One important issue I have not yet resolved (though I may be close): unfortunately ps2pdf seems to have US Letter set as the default page size. Normally this does not cause me a problem, because 99% of the time I’m scanning from US Letter to US Letter documents. HOWEVER, if I’m scanning a small booklet (or probably a legal size document, though I haven’t done that lately), ps2pdf apparently ignores the boundingbox of the original images, converting it to US Letter. This has the very unfortunate side effect of putting the resulting image in the lower left corner of the page (starting at 0,0 coordinates for PostScript). This may be a bug introduced in later versions of Ghostscript, because it doesn’t seem to be widely acknowledged…

    I have found the gs commands to set the height and width that work with ps2pdf. Additionally there are commands in imagemagick and gs that will extract the original image’s height and width. SO it’s probably just a matter of writing a script to do all the parsing and building up a command line.
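
The page-size script described above might look roughly like this. It is a sketch under assumptions, not a tested fix for the problem: the helper names are hypothetical, it takes the size of the first image for every page, and it relies on ImageMagick's “identify -format '%w %h'” plus Ghostscript's -dDEVICEWIDTHPOINTS, -dDEVICEHEIGHTPOINTS and -dFIXEDMEDIA switches. PostScript measures in points, 72 to the inch, so a pixel dimension converts as pixels × 72 / dpi.

```shell
#!/bin/sh
# px_to_pt PIXELS DPI: convert scan pixels to PostScript points (72 per inch),
# rounded down to a whole point.
px_to_pt() {
    echo $(( $1 * 72 / $2 ))
}

# ps2pdf_fit DPI FIRST_IMAGE IN.ps OUT.pdf (hypothetical helper):
# size the PDF page to the first scanned image instead of US Letter.
ps2pdf_fit() {
    dpi=$1
    image=$2
    ps=$3
    pdf=$4
    # identify prints "WIDTH HEIGHT" in pixels; [0] selects the first frame
    size=$(identify -format '%w %h' "${image}[0]")
    width=${size% *}
    height=${size#* }
    ps2pdf -dDEVICEWIDTHPOINTS="$(px_to_pt "$width" "$dpi")" \
           -dDEVICEHEIGHTPOINTS="$(px_to_pt "$height" "$dpi")" \
           -dFIXEDMEDIA "$ps" "$pdf"
}
```

As a check on the arithmetic, a 1275 × 1650 pixel scan at 150 dpi works out to 612 × 792 points, which is US Letter.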

    Anyway, if I make any progress on this in the next little bit I’ll try to report back. I’m glad to find someone else looking for this function from their linux boxes!

  2. dave says:

    You don’t actually need to perform the postscript step – ImageMagick’s convert will convert directly to pdf and can allow choice of paper size, for example (using international paper sizes not the strange US ones):
    convert -page A4 *.jpg out.pdf

    One gotcha with convert though is its memory allocation – the defaults for memory expect that it has a whole Gig of pure memory to chew up. If it fails to allocate memory it will just die on you with no error message. To get around this, use the -limit flags:
    convert -page A4 -limit memory 256MiB -limit map 512MiB *.jpg out.pdf

  3. david says:

    When I skipped the Postscript step, I ended up with very large PDF files; going to PostScript, then using ps2pdf, created much smaller output (better compression?).
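
One way to test the compression guess is to ask convert for a specific compression when writing the PDF directly, then compare the result with the PostScript route. A minimal sketch, assuming grayscale or color scans where lossy JPEG is acceptable; the function name direct_pdf and the quality value 50 are illustrative choices, not settings measured here.

```shell
#!/bin/sh
# direct_pdf DPI OUTPUT.pdf PAGE... (hypothetical helper)
# Writes the PDF in one step, with JPEG-compressed page images.
direct_pdf() {
    dpi=$1
    out=$2
    shift 2
    # -density applies to the input pages; -compress and -quality to the output
    convert -density "$dpi" "$@" -compress jpeg -quality 50 "$out"
}
```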

  4. Benjamin Kay says:

    I’m using XSane 0.991 on Gentoo Linux. In the main XSane dialog there is a “target” icon that looks like a crosshairs, and there is a drop-down menu just to the right of that. Click on the drop-down menu and choose “Multipage”.

    A new dialog called “xsane multipage project” will appear. You may choose any name for your project; xsane will create a folder with that name to store temporary files in and eventually output a pdf file with that name.

    Choose “PDF” under the “Multipage document filetype” drop-down menu. Then click on the “Create project” button. Now you can use the main XSane dialog to scan pages into the project one at a time by clicking scan (optionally, choose the paper size in the preview window). The multipage project dialog has buttons to edit and remove images from the project and to change what order they go in.

    When you are satisfied with your project, click the “Save multipage file” button. This will produce a multipage PDF with the project’s name. You can then click “Delete project” to get rid of the project within XSane and remove the temporary directory created by XSane. This will NOT delete the multipage PDF file. You can click the “Create project” button again to start over.

    This essentially does the same thing as the process David outlined, only it does it entirely within the XSane GUI and probably doesn’t need as many external applications.

  5. david says:

    Thanks, Benjamin — I’m going to try that.

  6. Nils Halvard Lunde says:

    thankx, folks. googling for “linux scanning pdf” (without the quotes, of course)
    brought up this article, and it was just what i needed!
    i used xsane under fedora5 (just typing xsane at the command line)
    with an old hp scsi scanner, and it worked with almost no problems.

    and because i scanned line arts (b/w text only pages), it also scanned blazingly fast.

    actually, i worked around one minor problem:
    after scanning 6 pages, i pressed view image in the project dialog for one of the images,
    and xsane stopped abruptly. however, i just restarted xsane, selected multipage again, wrote the
    exact same project name in the project name input dialog, and the image file list showed up again.
    i could not create the pdf file immediately, but then i scanned one page,
    then deleted it, and now i could save all pages as a pdf file.
    nice to know if xsane freaks out after you have scanned several pages.

    you made my day, people!

    — nils halvard smallsoft com

  7. I find that using the scanimage command for the first step works the best; the two commands after it turn its output into a PDF.
    A typical scan would look like this:

    scanimage --format tiff --batch --batch-double --resolution 200 --source ADF --mode Grayscale --batch-start=3
    convert -density 200 *.tif output.ps
    ps2pdf output.ps output.pdf

    One could potentially write a script that would automate the whole thing.

  8. mac jones says:

    This script worked better for me…..
    note the “--” (double hyphen) before each scanimage option

    #!/bin/sh

    #scan a batch
    scanimage --format pnm --batch --batch-prompt --resolution 150 --mode Gray

    #convert the raw file to postscript
    convert -density 150 *.pnm out.ps

    #convert the postscript to pdf
    ps2pdf out.ps out.pdf

    #remove raw scan files
    rm *.pnm

    #remove old ps files
    rm out.ps

  9. Naranek says:

    Thanks for a great article. This is just what I needed. I changed --batch-prompt to --batch-count=1 so it automatically scans only one page at a time. Had some problems with ps2pdf cropping my A4 scan, but that was solved with the -sPAPERSIZE=a4 switch.
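
For reference, the paper-size switch goes before the input file. A small sketch, where ps2pdf_paper is a hypothetical wrapper and the paper name is lowercased because Ghostscript's names (a4, letter, legal) are lowercase:

```shell
#!/bin/sh
# ps2pdf_paper PAPER IN.ps OUT.pdf (hypothetical helper)
# PAPER is a Ghostscript paper name such as a4, letter or legal.
ps2pdf_paper() {
    paper=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    ps2pdf -sPAPERSIZE="$paper" "$2" "$3"
}
```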

  10. Here’s a nice German tutorial on how to automate the whole thing (with scanimage, as mentioned above): http://www.pro-linux.de/t_office/dokumente-in-pdf-scannen.html — The script is in English 🙂

    I got, however, the best results (much work but small files) by scanning the pages (in my case: from a magazine) to Gimp in a high resolution (e.g. 360 dpi), then scaling them down so that the text was still readable, then converting them to an indexed format (instead of RGB) and saving them as PNG (best format for images with a small number of different colors). Of course, you have to experiment with convert -density after scaling the images…
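
The same idea can be tried from the command line instead of Gimp. This is a sketch with hypothetical helper names; it assumes ImageMagick's -resize and -colors options and the PNG8: output prefix, and the scale and color count are yours to tune. When an image is scaled to a percentage of its size, its effective resolution shrinks by the same percentage, and that is the value to pass to -density afterwards.

```shell
#!/bin/sh
# scaled_density DPI PERCENT: effective resolution after scaling, rounded down.
scaled_density() {
    echo $(( $1 * $2 / 100 ))
}

# shrink_page IN OUT.png PERCENT COLORS (hypothetical helper):
# scale a scanned page down and save it as an indexed (palette) PNG.
shrink_page() {
    convert "$1" -resize "$3%" -colors "$4" "PNG8:$2"
}
```

A page scanned at 360 dpi and scaled to 50% would then go into the PostScript step with “convert -density 180”.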

  11. Mac Jones says:

    Here’s my graphical script that’s grown over time, the best thing about it is that tools like beagle can find the pdf’s from the meta data you enter ! Needs pdftk and zenity apps.

    Cheers
    Mac
    New Zealand

    #!/bin/sh

    #scan a batch
    #pick the scan mode, Color or Gray, from a radio list
    #echo "Starting !"

    colour=$(zenity --list --radiolist --column "-" --column "Scan" TRUE Gray FALSE Color)

    LIMIT=10 # not used below
    a=0
    cont=1

    until [ "$cont" -eq 0 ]
    do
        printf '%s ' "$a"
        a=$((a + 1))
        if zenity --question --text "OK to scan a page, Cancel to finish, Page=$a"
        then
            cont=1
            #zero-pad the page number so *.pnm stays in scan order past page 9
            scanimage --format pnm --resolution 150 --mode "$colour" > "$(printf '%03d' "$a").pnm"
        else
            cont=0
        fi

    done # No surprises, so far.

    echo "."
    #convert the raw files to postscript
    convert -density 150 *.pnm out.ps | zenity --progress --auto-close

    #convert the postscript to pdf
    ps2pdf out.ps out.pdf | zenity --progress --auto-close

    #remove raw scan files
    rm *.pnm

    #remove old ps files
    rm out.ps

    #add the metadata
    printf '\a'
    #echo "Please enter a name for the PDF file (** no .pdf on end)"
    nm=$(zenity --entry --text "Enter file name, (no .pdf on the end)")
    #echo "Please enter Metadata for searching"
    meta=$(zenity --entry --text "Meta data for searching" --entry-text="$nm")

    echo "InfoKey: Producer" > tmp
    echo "InfoValue: $meta" >> tmp
    echo "InfoKey: Keywords" >> tmp
    echo "InfoValue: $meta" >> tmp
    echo "InfoKey: Title" >> tmp
    echo "InfoValue: $nm" >> tmp

    #update the metadata
    pdftk out.pdf update_info tmp output "$nm.pdf"

    #rm metadata file and pdf
    rm tmp
    rm out.pdf

    zenity --info --text="All done, $nm.pdf is ready!"

  12. matt avila says:

    Thanks to all for the comments and suggestions. I’m running SANE 0.991 on SUSE and need to scan, convert, and save “small as practical” pdf’s. While SANE has the capability to scan multiple page docs into multiple page PDF’s – the filesizes are really huge!

    Running through convert (after scanning multiple pages into *.pnm files) and then running that output through ps2pdf yields some relatively small PDF files that are VERY usable. I dumped Redmond’s bloatware about a month ago and am attempting to replace the functionality of PaperPort. I typically scan, OCR, and save PDF’s of all the paperwork that comes through the mail on a weekly basis – this can grow quite large (in paper) and doing this allows me to quickly search, retrieve, and also have a continuity plan should the house burn to the ground (make a second backup of the files). Worst case is I lose a month or so of documentation – the most recent is always easiest to replace.

    So what my testing shows is: scan with SANE, run the project files through convert, then run that output through ps2pdf. This cuts the filesize by about 80% or better… very handy when you’re working with 50+ page documents.

  13. gordon says:

    While I am grateful to the contributors for this assistance, I am also struck by the absurd nature of the process involved. One can do this in a single step for any large document (without having to fiddle with ADF settings) using a number of programs on either Windows or OS X. The obvious route is to pay for a copy of Acrobat, which is a lot cheaper (for a business) than going through the experimentation etc necessary to automate a procedure of this nature. There are other cheaper programs available that will do the same thing without requiring all of the activation nonsense that Paperport and other programs based on the Zeon pdf printer driver require.

    Further, I have found that editing pdf files in anything more than the most primitive way is also difficult on a Linux system. On top of that, there is the problem of getting programs to print directly to pdf files outside the kdeprinter system, which is trivial in Windows or OS X. It can be done but the messing around with cups-pdf required is obscure and unfriendly for any but experienced users (and the resulting installation gets messed up on every upgrade).

    It is worth bearing in mind that pdf files are becoming the standard document format for lots of business use because they don’t contain the kind of compromising information that gets retained with Word and other files. Sadly, the very primitive support for scanning, creating and manipulating pdf files is a very large barrier to the adoption of Linux as a desktop system for serious business use – and I have tried and use 4-5 different distributions including SuSE which is the best for this purpose in my experience.

  14. Gaurav Goyal says:

    You can look at PDF Studio (http://www.qoppa.com/psindex.html). It’s a commercial Java-based solution.

  15. Jeffrey Ratcliffe says:

    Try sourceforge.net/projects/gscan2pdf

  16. Hey y’all, googling “linux scanimage to pdf” here.

    Problem, # scanimage --format tiff > scan.tiff // is yielding

    scanimage: unrecognized option `--format’

    Even though it’s documented in the man page.

    Running RHEL v.4.

    scanimage -format tiff > scan.tiff yields a file named scan.tiff with “ormat” in it.
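
One possible workaround, without knowing the cause of the error: check whether this scanimage build advertises the option at all, and if not, take its default PNM output and let ImageMagick write the TIFF. The helper names below are hypothetical, and the sketch assumes convert is installed.

```shell
#!/bin/sh
# has_format_option: read scanimage's help text on stdin and succeed
# if it mentions the --format option.
has_format_option() {
    grep -q -e '--format'
}

# scan_tiff OUT.tiff (hypothetical helper)
scan_tiff() {
    if scanimage --help 2>&1 | has_format_option; then
        scanimage --format=tiff > "$1"
    else
        # scanimage writes PNM by default; convert reads it from stdin
        scanimage | convert pnm:- "$1"
    fi
}
```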

    Any ideas?

    TY
    G

  17. Felix Lechner says:

    Hi there, just thought i would post my script here, too. It works well for me. I keep no paper records anymore at home or work. Have a second script “scanpdf-a4″ with the scan area adjusted for A4. Make sure to change your SANE device in the script. Sincerely, Felix

    #!/bin/sh

    device="genesys:libusb:004:005"
    tmpdir="/tmp"
    basename="scanpdf"
    counter=0
    color=0
    scanarea="-l 0 -t 0 -x 216 -y 279" # US letter
    #scanarea="-l 0 -t 0 -x 210 -y 297" # A4

    if [ ! -d "$tmpdir" ]
    then
        echo "Need existing tempdir at $tmpdir."
        exit 1
    fi

    # loop until the user answers "d"
    until [ $counter -eq -1 ]
    do
        printf 'Black & White/Color/Done [B/c/d]? '
        read answer
        if [ -z "$answer" ] || [ "$answer" = "b" ] || [ "$answer" = "B" ]
        then
            color=0
        elif [ "$answer" = "c" ] || [ "$answer" = "C" ]
        then
            color=1
        elif [ "$answer" = "d" ] || [ "$answer" = "D" ]
        then
            break
        else
            continue
        fi
        counter=`expr $counter + 1`
        echo "Scanning page $counter"
        # zero-padded page number keeps the files in scan order
        filename=$(printf '%s-%03d' "$basename" "$counter")
        if [ $color -eq 1 ]
        then
            mode="--mode Color"
            resolution="--resolution 150"
        else
            mode="--mode Gray"
            resolution="--resolution 150"
        fi
        # $mode, $resolution and $scanarea are left unquoted on purpose
        # so that each one splits into separate options
        scanimage -d "$device" $mode $resolution $scanarea > "$tmpdir/$filename.pnm"
        convert -quality 50 "$tmpdir/$filename.pnm" "$tmpdir/$filename.jpg"
        rm "$tmpdir/$filename.pnm"
    done

    if [ $counter -eq 0 ]
    then
        echo "Nothing scanned."
        exit
    fi

    savein=""
    while [ -z "$savein" ]
    do
        printf 'Please name the PDF file to save in: '
        read savein
    done

    convert "$tmpdir"/$basename*.jpg "$savein".pdf
    rm "$tmpdir"/$basename*.jpg
    acroread "$savein".pdf > /dev/null 2>&1 &
    echo "Successfully scanned $counter pages and created $savein.pdf."

  18. Luke O'Connell says:

    Some interesting information on this page. I am STILL trying to find a way to create searchable PDFs under Linux. Like the author, I scan all my paperwork and rely on that scan being searchable. I have looked high and low but to no avail, there just does not seem to be a way… unless you are ready to pay hundreds for a corporate solution (there are a few out there). Pretty much the last dependency between Redmond and me.

  19. Gary Thompson says:

    I’ve just discovered gscan2pdf and am very pleased with it. I recommend it if you’re tired of the bloated zlib compressed PDF’s created by Xsane.

  20. A. Pater says:

    I second Luke O’Connell.
    It’s a little strange that there’s no Open Source or Freeware program under POSIX that produces PDF’s with a hidden (OCR’ed) text layer over the scanned images. It can’t be that hard to program such a thing, though that is only my uneducated guess, since it is already possible (e.g. with the outstanding Jarnal program, a Java-based app) to add a visible text layer to a PDF under POSIX systems.
    The author of gscan2pdf says he wants to add such a function to his app in the future. Hope he will be able to do just that!

  21. david says:

    I scanned a one-page document in greyscale at 300 dpi using xsane’s multipage support and ended up with a 2.6 MB PDF file. The second time, I saved from xsane in PostScript format, then used ps2pdf to produce a PDF, and ended up with just a 0.5 MB file. I’ll try gscan2pdf.
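
A comparison like this is easy to repeat for other routes. The helpers below are a hypothetical sketch that reads the two file sizes with wc -c and reports the saving as a whole-number percentage, rounded down.

```shell
#!/bin/sh
# percent_saved BEFORE AFTER: how much smaller AFTER is, as a percentage of BEFORE.
percent_saved() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

# compare_pdfs BIG.pdf SMALL.pdf (hypothetical helper): sizes in bytes.
compare_pdfs() {
    # $(( ... )) strips the padding some wc implementations print
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    echo "$before -> $after bytes ($(percent_saved "$before" "$after")% smaller)"
}
```

Going from 2.6 MB to 0.5 MB, as above, is a saving of about 80%.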
