So the state I'm in released a bunch of data in PDF form, but to make matters worse, most (all?) of the PDFs appear to be letters typed in Office, printed/faxed, and then scanned (our government at its best, eh?). At first I thought I was crazy, but then I started seeing numerous PDFs that are 'tilted', like someone didn't get them on the scanner properly. So I figured the next best thing to getting the actual text out of them would be to turn each page into an image.

Obviously this needs to be automated, and I'd prefer to stick with Python if possible. If Ruby or Perl have some form of implementation that's just too awesome to pass up, I can go that route. I've tried pyPDF for text extraction, but that obviously didn't do me much good. I've tried swftools, but the images I'm getting from that are just shy of completely unusable; it seems like the fonts get ruined in the conversion. I also don't really care about the image format on the way out, as long as the images are relatively lightweight and readable.

Converting PDF to images automatically

Answers

If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image; you should extract the image from the PDF. Most likely, all of the data in each PDF is essentially one giant image per page, wrapped in PDF verbosity to make it readable in Acrobat.

Try the simple expedient of finding the image in the PDF and copying the bytes out: Extracting JPGs from PDFs. The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.
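To give an idea of what "copying the bytes out" can look like, here is a minimal Python 3 sketch. It is not the code from the linked article, just the same idea under stated assumptions: the scans are stored as plain JPEG (DCTDecode) streams, each one starts with the JPEG start marker within a few bytes of the `stream` keyword, and nothing is encrypted or wrapped in a second filter. The function name `extract_jpegs` is made up for illustration.

```python
JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image marker


def extract_jpegs(pdf_bytes):
    """Return the raw bytes of every JPEG found inside a PDF stream object.

    Heuristic only: assumes the JPEG data begins within 20 bytes of the
    "stream" keyword and ends before the matching "endstream".
    """
    images = []
    pos = 0
    while True:
        stream_at = pdf_bytes.find(b"stream", pos)
        if stream_at < 0:
            break
        # Look for the JPEG start marker just after the "stream" keyword.
        start = pdf_bytes.find(JPEG_START, stream_at, stream_at + 20)
        if start < 0:
            pos = stream_at + len(b"stream")
            continue
        end_stream = pdf_bytes.find(b"endstream", start)
        if end_stream < 0:
            break
        # The last end marker before "endstream" closes the image.
        end = pdf_bytes.rfind(JPEG_END, start, end_stream)
        if end >= 0:
            images.append(pdf_bytes[start:end + len(JPEG_END)])
        pos = end_stream + len(b"endstream")
    return images
```

To use it, read a file with `open(path, "rb").read()`, pass the bytes in, and write each returned item to its own `.jpg` file. If the list comes back empty, the scans are probably stored with a different filter (CCITT fax or Flate, for example), and one of the rendering approaches in the other answers is the better bet.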


Ghostscript is ideal for converting PDF files to images. It is reliable and has many configurable options. It's also available under the GPL or a commercial license. You can call it from the command line or use its native API.
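As a rough sketch of the command-line route from Python: the snippet below assumes the Ghostscript executable is on your PATH as `gs` (on Windows it is usually `gswin64c`), and the helper names, the 150 dpi default, and the `png16m` device choice are just illustrative picks, not recommendations from Ghostscript itself.

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern, dpi=150, device="png16m", gs="gs"):
    """Build the argument list for rendering every page of a PDF to images.

    out_pattern should contain a printf-style page counter, e.g. "page-%03d.png".
    """
    return [
        gs,
        "-dNOPAUSE",            # don't wait for a keypress between pages
        "-dBATCH",              # exit when the input has been processed
        "-dSAFER",              # restrict file access from within the PDF
        "-sDEVICE=" + device,   # output device, e.g. png16m, pnggray, jpeg
        "-r" + str(dpi),        # rendering resolution in dpi
        "-sOutputFile=" + out_pattern,
        pdf_path,
    ]


def pdf_to_images(pdf_path, out_pattern, **options):
    """Run Ghostscript and raise CalledProcessError if it fails."""
    subprocess.run(ghostscript_command(pdf_path, out_pattern, **options), check=True)
```

Something like `pdf_to_images("letter.pdf", "letter-%03d.png")` should then leave one PNG per page next to the script. For black-and-white scans, `device="pnggray"` tends to give smaller files.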


You could call e.g. pdftoppm from the command-line (or using Python's subprocess module) and then convert the resulting PPM files to the desired format using e.g. ImageMagick (again, using subprocess or some bindings if they exist).
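A minimal sketch of that pipeline with `subprocess`, assuming both `pdftoppm` and ImageMagick's `convert` are on your PATH (ImageMagick 7 installs the tool as `magick` instead). The helper names and the 150 dpi default are my own choices for illustration.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=150):
    """Argument list that renders each page to <out_prefix>-<page>.ppm."""
    return ["pdftoppm", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def convert_command(src, dst, convert="convert"):
    """Argument list for ImageMagick; the output format follows dst's extension."""
    return [convert, str(src), str(dst)]


def pdf_to_png(pdf_path, out_dir, dpi=150):
    """Render a PDF to PPM pages, convert each to PNG, and return the PNG paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / Path(pdf_path).stem
    subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)

    png_paths = []
    for ppm in sorted(out_dir.glob(prefix.name + "-*.ppm")):
        png = ppm.with_suffix(".png")
        subprocess.run(convert_command(ppm, png), check=True)
        ppm.unlink()  # PPM files are uncompressed and large, so drop them
        png_paths.append(png)
    return png_paths
```

Calling `pdf_to_png("letter.pdf", "out")` should leave files like `out/letter-1.png` behind; the exact page-number padding depends on how many pages the PDF has.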


Here's an alternative approach to turning a .pdf file into images: use an image printer. I've successfully used the function below to "print" PDFs to JPEG images with XsPDF. However, there are MANY image printers out there, so pick the one you like. Some of the code may need to be altered slightly depending on the image printer you pick and the default file-saving format it uses.