Comments on: Scanning to PDF in Linux

By: david

david — Thu, 05 Apr 2007 15:23:54 +0000

I scanned a one-page document in greyscale at 300 dpi using xsane’s multipage support and ended up with a 2.6 MB PDF file. The second time, I saved from xsane in PostScript format, then used ps2pdf to produce a PDF, and ended up with a file of just 0.5 MB. I’ll try gscan2pdf.

By: A. Pater

A. Pater — Tue, 20 Mar 2007 19:33:59 +0000

I second Luke O’Connell.
It’s a little strange that there’s no open-source or freeware program for POSIX systems that produces PDFs with a hidden (OCR’ed) text layer over the scanned images. It can’t be that hard to program such a thing, though that is only my uneducated guess, since it is already possible to add a visible text layer to a PDF under POSIX systems (e.g. with the outstanding Jarnal program, a Java-based app).
The author of gscan2pdf says he wants to add such a function to his app in the future. Hope he will be able to do just that!

By: Gary Thompson

Gary Thompson — Fri, 23 Feb 2007 12:43:46 +0000

I’ve just discovered gscan2pdf and am very pleased with it. I recommend it if you’re tired of the bloated zlib-compressed PDFs created by XSane.

By: Luke O'Connell

Luke O'Connell — Tue, 02 Jan 2007 14:05:13 +0000

Some interesting information on this page. I am STILL trying to find a way to create searchable PDFs under Linux. Like the author, I scan all my paperwork and rely on that scan being searchable. I have looked high and low, but to no avail; there just does not seem to be a way… unless you are ready to pay hundreds for a corporate solution (there are a few out there). Pretty much the last dependency between Redmond and me.

By: Felix Lechner

Felix Lechner — Sat, 02 Dec 2006 21:26:15 +0000

Hi there, just thought I would post my script here, too. It works well for me. I no longer keep paper records at home or work. I have a second script, “scanpdf-a4”, with the scan area adjusted for A4. Make sure to change your SANE device in the script. Sincerely, Felix

#!/bin/sh

device="genesys:libusb:004:005" # change this to your SANE device
tmpdir="/tmp"
basename="scanpdf"
counter=0
color=0
scanarea="-l 0 -t 0 -x 216 -y 279" # US letter
#scanarea="-l 0 -t 0 -x 210 -y 297" # A4

if [ ! -d "$tmpdir" ]
then
    echo "Need existing tempdir at $tmpdir."
    exit 1
fi

# The counter never reaches -1, so the loop runs until the user answers "d".
until [ "$counter" -eq -1 ]
do
    printf "Black & White/Color/Done [B/c/d]? "
    read answer
    if [ -z "$answer" ] || [ "$answer" = "b" ] || [ "$answer" = "B" ]
    then
        color=0
    elif [ "$answer" = "c" ] || [ "$answer" = "C" ]
    then
        color=1
    elif [ "$answer" = "d" ] || [ "$answer" = "D" ]
    then
        break
    else
        continue
    fi
    counter=`expr $counter + 1`
    echo "Scanning page $counter"
    # Zero-padded page number, e.g. scanpdf-001, so the files sort in page order.
    filename=`printf '%s-%03d' "$basename" "$counter"`
    if [ $color -eq 1 ]
    then
        mode="--mode Color"
        resolution="--resolution 150"
    else
        mode="--mode Gray"
        resolution="--resolution 150"
    fi
    # $mode, $resolution and $scanarea stay unquoted so they split into separate options.
    scanimage -d "$device" $mode $resolution $scanarea > "$tmpdir/$filename.pnm"
    convert -quality 50 "$tmpdir/$filename.pnm" "$tmpdir/$filename.jpg"
    rm "$tmpdir/$filename.pnm"
done

if [ $counter -eq 0 ]
then
    echo "Nothing scanned."
    exit
fi

savein=""
while [ -z "$savein" ]
do
    printf "Please name the PDF file to save in: "
    read savein
done

# Combine all scanned pages into one PDF, then open it in the viewer.
convert "$tmpdir/$basename"*.jpg "$savein.pdf"
rm "$tmpdir/$basename"*.jpg
acroread "$savein.pdf" > /dev/null 2>&1 &
echo "Successfully scanned $counter pages and created $savein.pdf."
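
The page numbering and the separate A4 copy of the script above could be folded into two small shell functions. This is only a minimal sketch and not part of Felix’s script: the function names page_name and scan_area are hypothetical, and the millimetre values are the ones from the script’s two scanarea lines.

```sh
# Hypothetical helpers; the names are illustrative, not from the original script.

# page_name BASENAME COUNTER -> zero-padded file stem, e.g. "scanpdf-007"
page_name() {
    printf '%s-%03d\n' "$1" "$2"
}

# scan_area PAPER -> scanimage geometry options (millimetres) for that paper size
scan_area() {
    case "$1" in
        letter) echo "-l 0 -t 0 -x 216 -y 279" ;;
        a4)     echo "-l 0 -t 0 -x 210 -y 297" ;;
        *)      echo "unknown paper size: $1" >&2; return 1 ;;
    esac
}
```

With these, a single script could take the paper size as its first argument, for example scanarea=`scan_area "${1:-letter}"`, instead of keeping two nearly identical copies.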

By: Gabriel Richards

Gabriel Richards — Sat, 28 Oct 2006 04:30:18 +0000

Hey y’all, googling “linux scanimage to pdf” here.

Problem: # scanimage --format tiff > scan.tiff is yielding

scanimage: unrecognized option `--format'

Even though it’s documented in the man page.

Running RHEL v.4.

scanimage -format tiff > scan.tiff yields a file named scan.tiff with “ormat” in it.

Any ideas?

TY
G
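
One way to narrow this down is to check whether the installed scanimage lists the option in its own help text. The sketch below is a guess at a diagnostic, not a confirmed fix: has_option is a hypothetical helper name, and it only inspects whatever text is piped into it. The pattern requires the option to stand alone, so that --format is not mistaken for the longer --formatted-device-list. As for the single-hyphen attempt, a plausible reading is that scanimage parsed -format as its short option -f (the formatted device list) with the argument "ormat", which would explain the contents of scan.tiff.

```sh
# has_option OPTION: succeed if OPTION appears as a whole word in the help
# text read from standard input. Hypothetical helper, for illustration only.
has_option() {
    grep -Eq -e "(^|[^[:alnum:]-])$1([^[:alnum:]-]|\$)"
}

# Usage, assuming scanimage is installed:
#   scanimage --help | has_option --format && echo "--format is supported"
```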

By: Jeffrey Ratcliffe

Jeffrey Ratcliffe — Sun, 03 Sep 2006 19:10:09 +0000

Try sourceforge.net/projects/gscan2pdf

By: Gaurav Goyal

Gaurav Goyal — Sat, 02 Sep 2006 14:25:05 +0000

You can look at PDF Studio (http://www.qoppa.com/psindex.html). It’s a commercial Java-based solution.

By: gordon

gordon — Fri, 01 Sep 2006 07:01:19 +0000

While I am grateful to the contributors for this assistance, I am also struck by the absurd nature of the process involved. One can do this in a single step for any large document (without having to fiddle with ADF settings) using a number of programs on either Windows or OS X. The obvious route is to pay for a copy of Acrobat, which is a lot cheaper (for a business) than going through the experimentation etc. necessary to automate a procedure of this nature. There are other, cheaper programs available that will do the same thing without requiring all of the activation nonsense that PaperPort and other programs based on the Zeon PDF printer driver require.

Further, I have found that editing PDF files in anything more than the most primitive way is also difficult on a Linux system. On top of that, there is the problem of getting programs to print directly to PDF files outside the kdeprinter system, which is trivial in Windows or OS X. It can be done, but the messing around with cups-pdf that it requires is obscure and unfriendly for any but experienced users (and the resulting installation gets messed up on every upgrade).

It is worth bearing in mind that PDF files are becoming the standard document format for a lot of business use because they don’t contain the kind of compromising information that gets retained with Word and other files. Sadly, the very primitive support for scanning, creating and manipulating PDF files is a very large barrier to the adoption of Linux as a desktop system for serious business use. I have tried, and still use, 4-5 different distributions, including SuSE, which is the best for this purpose in my experience.

By: matt avila

matt avila — Sat, 26 Aug 2006 23:49:49 +0000

Thanks to all for the comments and suggestions. I’m running SANE 0.991 on SUSE and need to scan, convert, and save “small as practical” PDFs. While SANE can scan multi-page documents into multi-page PDFs, the file sizes are really huge!

Running the scans through convert (after scanning multiple pages into *.pnm files) and then running that output through ps2pdf yields relatively small PDF files that are VERY usable. I dumped Redmond’s bloatware about a month ago and am attempting to replace the functionality of PaperPort. I typically scan, OCR, and save PDFs of all the paperwork that comes through the mail on a weekly basis. This can grow quite large (in paper), and doing this lets me search and retrieve documents quickly and gives me a continuity plan should the house burn to the ground (I make a second backup of the files). Worst case is I lose a month or so of documentation, and the most recent is always easiest to replace.

So what my testing shows is: scan with SANE, run the project files through convert, then run that output through ps2pdf. This cuts the file size by about 80% or better, which is very handy when you’re working with 50+ page documents.
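
To check a saving like this on your own scans, the sizes of the two PDFs can be compared directly. The following is a small sketch rather than anything from the comment above: file_size and saving_percent are hypothetical helper names, and the result is rounded down to a whole percent.

```sh
# file_size FILE -> size in bytes
file_size() {
    wc -c < "$1" | tr -d ' '
}

# saving_percent BEFORE AFTER -> whole-number percentage by which AFTER is smaller
saving_percent() {
    before=$(file_size "$1")
    after=$(file_size "$2")
    [ "$before" -gt 0 ] || return 1
    echo $(( (before - after) * 100 / before ))
}

# Usage, with two hypothetical files produced by the two routes:
#   saving_percent xsane-direct.pdf via-ps2pdf.pdf
```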

By: Mac Jones

Mac Jones — Mon, 29 May 2006 11:01:34 +0000

Here’s my graphical script that’s grown over time. The best thing about it is that tools like beagle can find the PDFs from the metadata you enter! It needs the pdftk and zenity apps.

Cheers
Mac
New Zealand

#!/bin/sh

# Scan a batch of pages, with a zenity dialog for each step.
# Needs scanimage, ImageMagick (convert), ps2pdf, pdftk and zenity.

# Ask for the scan mode: Gray or Color.
colour=`zenity --list --radiolist --column "-" --column "Scan" TRUE Gray FALSE Color`
[ -n "$colour" ] || exit 1 # dialog was cancelled

a=0
cont=1

until [ "$cont" -eq 0 ]
do
    echo -n "$a "
    a=$((a + 1))
    if zenity --question --text "OK to scan a page, Cancel to finish, Page=$a"
    then
        cont=1
        # Zero-padded file name so that *.pnm sorts in page order past page 9.
        page=`printf '%03d' "$a"`
        scanimage --format pnm --resolution 150 --mode "$colour" > "$page.pnm"
    else
        cont=0
    fi
done # No surprises, so far.

echo "."
# convert the raw files to PostScript
convert -density 150 *.pnm out.ps | zenity --progress --auto-close

# convert the PostScript to PDF
ps2pdf out.ps out.pdf | zenity --progress --auto-close

# remove raw scan files
rm *.pnm

# remove old ps files
rm out.ps

# ask for the metadata
printf '\a'
nm=`zenity --entry --text "Enter file name, (no .pdf on the end)"`
meta=`zenity --entry --text "Meta data for searching" --entry-text="$nm"`

echo "InfoKey: Producer" > tmp
echo "InfoValue: $meta" >> tmp
echo "InfoKey: Keywords" >> tmp
echo "InfoValue: $meta" >> tmp
echo "InfoKey: Title" >> tmp
echo "InfoValue: $nm" >> tmp

# update the metadata
pdftk out.pdf update_info tmp output "$nm.pdf"

# remove the metadata file and the intermediate PDF
rm tmp
rm out.pdf

zenity --info --text="All done, $nm.pdf is ready!"
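
The metadata step above can be isolated into one function that prints the key/value pairs, which makes it easy to inspect before pdftk sees it. This is a sketch under assumptions: pdf_info is a hypothetical name, it mirrors the InfoKey/InfoValue layout used in the script (leaving out the Producer entry), and the exact layout your pdftk version expects is best checked against the output of pdftk with its dump_data operation.

```sh
# pdf_info TITLE KEYWORDS: print metadata pairs in the layout used by the
# script above. Hypothetical helper, for illustration only.
pdf_info() {
    printf 'InfoKey: Title\nInfoValue: %s\n' "$1"
    printf 'InfoKey: Keywords\nInfoValue: %s\n' "$2"
}

# Usage, assuming out.pdf exists and pdftk is installed:
#   pdf_info "$nm" "$meta" > info.txt
#   pdftk out.pdf update_info info.txt output "$nm.pdf"
```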

By: Christoph Lange

Christoph Lange — Thu, 11 May 2006 09:02:08 +0000

Here’s a nice German tutorial on how to automate the whole thing (with scanimage, as mentioned above): http://www.pro-linux.de/t_office/dokumente-in-pdf-scannen.html — The script is in English 🙂

I got, however, the best results (much work but small files) by scanning the pages (in my case: from a magazine) to Gimp in a high resolution (e.g. 360 dpi), then scaling them down so that the text was still readable, then converting them to an indexed format (instead of RGB) and saving them as PNG (best format for images with a small number of different colors). Of course, you have to experiment with convert -density after scaling the images…
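
The scale-down and indexed-colour steps could probably be scripted with ImageMagick instead of doing them by hand in Gimp. Note that this swaps in a different tool from the one described above, so results may differ. The sketch assumes a 50% scale and a 16-colour palette purely as starting points, and run and shrink_page are hypothetical helper names. Setting DRY_RUN=1 prints the command instead of running it.

```sh
# run COMMAND...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# shrink_page IN OUT.png [PERCENT] [COLORS]
# Scale the page down, then reduce it to a small palette before saving as PNG.
# The defaults (50%, 16 colours) are assumptions to experiment with.
shrink_page() {
    run convert "$1" -resize "${3:-50}%" -colors "${4:-16}" "$2"
}

# Usage: shrink_page page1.pnm page1.png 40 8
```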

By: Naranek

Naranek — Tue, 11 Apr 2006 17:23:13 +0000

Thanks for a great article. This is just what I needed. I changed --batch-prompt to --batch-count=1 so it automatically scans only one page at a time. I had some problems with ps2pdf cropping my A4 scan, but that was solved with the -sPAPERSIZE=a4 switch.

By: mac jones

mac jones — Fri, 24 Mar 2006 03:51:55 +0000

This script worked better for me…..
note the "--" (double hyphen) for the scanimage options

#!/bin/sh

#scan a batch
scanimage --format pnm --batch --batch-prompt --resolution 150 --mode Gray

#convert the raw file to postscript
convert -density 150 *.pnm out.ps

#convert the postscript to pdf
ps2pdf out.ps out.pdf

#remove raw scan files
rm *.pnm

#remove old ps files
rm out.ps
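
Because the script above ends with rm *.pnm, it is safest to run it in a directory that holds nothing else. One way to guarantee that is to do the work in a throwaway directory. This is a sketch only: in_scratch_dir is a hypothetical helper name, and it relies on mktemp -d, which is widely available but not part of POSIX.

```sh
# in_scratch_dir COMMAND [ARGS...]: run COMMAND inside a fresh temporary
# directory, delete that directory afterwards, and return COMMAND's status.
in_scratch_dir() {
    scratch=$(mktemp -d) || return 1
    status=0
    ( cd "$scratch" && "$@" ) || status=$?
    rm -rf "$scratch"
    return "$status"
}

# Usage: the final PDF must be written outside the scratch directory,
# so pass an absolute path for it.
#   in_scratch_dir sh -c 'scanimage --format pnm --batch --batch-prompt \
#       --resolution 150 --mode Gray; convert -density 150 *.pnm out.ps \
#       && ps2pdf out.ps "$1"' sh "$PWD/out.pdf"
```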

By: Harout Hedeshian

Harout Hedeshian — Tue, 21 Mar 2006 03:15:16 +0000

I find that using the scanimage command for the first step works best. It writes the scanned pages straight to image files, which can then be converted to PDF.
A typical scan would look like this:

scanimage --format tiff --batch --batch-double --resolution 200 --source ADF --mode Grayscale --batch-start=3
convert -density 200 *.tif output.ps
ps2pdf output.ps output.pdf

One could potentially write a script that would automate the whole thing.
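
Such a script might look roughly like the sketch below. It strings together the three commands from this comment, minus --batch-double and --batch-start, which are specific to that particular job. The names run and scan_adf_to_pdf are hypothetical, the --source and --mode values depend on the scanner backend, and the out*.tif pattern assumes scanimage’s default batch file names for TIFF output. Setting DRY_RUN=1 prints the commands without touching the scanner.

```sh
# run COMMAND...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# scan_adf_to_pdf OUTPUT.pdf [DPI]
# Batch-scan from the document feeder, then convert the pages to one PDF.
scan_adf_to_pdf() {
    out=$1
    dpi=${2:-200}
    run scanimage --format tiff --batch --source ADF \
        --resolution "$dpi" --mode Grayscale || return 1
    run convert -density "$dpi" out*.tif scan.ps || return 1
    run ps2pdf scan.ps "$out"
}

# Usage, from an otherwise empty working directory:
#   scan_adf_to_pdf bills.pdf 200
```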

By: Nils Halvard Lunde

Nils Halvard Lunde — Wed, 15 Mar 2006 20:33:07 +0000

thankx, folks. googling for “linux scanning pdf” (without the quotes, of course)
brought up this article, and it was just what i needed!
i used xsane under fedora5 (just typing xsane at the command line)
with an old hp scsi scanner, and it worked with almost no problems.

and because i scanned line arts (b/w text only pages), it also scanned blazingly fast.

actually, i worked around one minor problem:
after scanning 6 pages, i pressed view image in the project dialog for one of the images,
and xsane stopped abruptly. however, i just restarted xsane, selected multipage again, wrote the
exact same project name in the project name input dialog, and the image file list showed up again.
i could not create the pdf file immediately, but then i scanned one page,
then deleted it, and now i could save all pages as a pdf file.
nice to know if xsane freaks out after you have scanned several pages.

you made my day, people!

— nils halvard smallsoft com

By: david

david — Tue, 14 Mar 2006 02:50:23 +0000

Thanks, Benjamin — I’m going to try that.

By: Benjamin Kay

Benjamin Kay — Tue, 14 Mar 2006 01:21:15 +0000

I’m using XSane 0.991 on Gentoo Linux. In the main XSane dialog there is a “target” icon that looks like a crosshairs, and there is a drop-down menu just to the right of that. Click on the drop-down menu and choose “Multipage”.

A new dialog called “xsane multipage project” will appear. You may choose any name for your project; xsane will create a folder with that name to store temporary files in and eventually output a PDF file with that name.

Choose “PDF” under the “Multipage document filetype” drop-down menu. Then click on the “Create project” button. Now you can use the main XSane dialog to scan pages into the project one at a time by clicking scan (optionally, choose the paper size in the preview window). The multipage project dialog has buttons to edit and remove images from the project and to change what order they go in.

When you are satisfied with your project, click the “Save multipage file” button. This will produce a multipage PDF with the project’s name. You can then click “Delete project” to get rid of the project within XSane and remove the temporary directory created by XSane. This will NOT delete the multipage PDF file. You can click the “Create project” button again to start over.

This essentially does the same thing as the process David outlined, only it does it entirely within the XSane GUI and probably doesn’t need as many external applications.

By: david

david — Tue, 28 Feb 2006 00:44:06 +0000

When I skipped the Postscript step, I ended up with very large PDF files; going to PostScript, then using ps2pdf, created much smaller output (better compression?).

By: dave

dave — Mon, 27 Feb 2006 23:24:48 +0000

You don’t actually need to perform the postscript step – ImageMagick’s convert will convert directly to pdf and can allow choice of paper size, for example (using international paper sizes not the strange US ones):
convert -page A4 *.jpg out.pdf

One gotcha with convert though is its memory allocation – the defaults for memory expect that it has a whole Gig of pure memory to chew up. If it fails to allocate memory it will just die on you with no error message. To get around this, use the -limit flags:
convert -page A4 -limit memory 256 -limit map 512 *.jpg out.pdf
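
The direct route with memory limits can be wrapped up so the flags are not retyped each time. This is a sketch with assumptions: run and images_to_pdf are hypothetical names, and the limits are written with explicit MiB suffixes because current ImageMagick releases read a bare number as bytes, whereas the bare 256 and 512 above were presumably megabytes at the time. Setting DRY_RUN=1 prints the command instead of running it.

```sh
# run COMMAND...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# images_to_pdf OUTPUT.pdf IMAGE...
# Convert images straight to an A4 PDF with capped memory use.
# The 256MiB/512MiB values are assumptions carried over from the comment above.
images_to_pdf() {
    out=$1
    shift
    run convert -limit memory 256MiB -limit map 512MiB -page A4 "$@" "$out"
}

# Usage: images_to_pdf out.pdf *.jpg
```

If the resulting PDF is much larger than one produced via ps2pdf, as reported elsewhere in this thread, that route remains available as a second step.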