Overcoming MiniMagick’s Incorrect Count of Pages

TL;DR: Use .verbose before calling .pages on a MiniMagick::Image object.

I was investigating an error with uploaded PDFs. We use the MiniMagick gem to process files from some of our customers into correctly formatted images. From GitHub, MiniMagick is: “A ruby wrapper for ImageMagick or GraphicsMagick command line.”

The Error

The specific error I was getting was very cryptic, unfortunately. From the Resque Worker:

Failed to manipulate with MiniMagick, maybe it is not an image? Original
Error: `convert -density 200 /path/temporary_file.pdf[1]
/path/new_file.png` failed with error: convert.im6: Postscript delegate
failed `/path/temporary_file.pdf': No such file or directory @
error/pdf.c/ReadPDFImage/677. convert.im6: no images defined
`/path/new_file.png' @ error/convert.c/ConvertImageCommand/3044.

This made it sound like the file was of an unsupported type. That happens occasionally, when a customer attempts to upload something other than a document. One customer dragged the Internet Explorer icon into the upload area. Unfortunately, there is no way to convert the entire Internet Explorer program into an image.

Investigating

Looking deeper, we had to get a copy of the original PDF. Fortunately, the customer was gracious enough to send us a digital copy. We were also able to find the original file still in our S3 file system. For a control, I used a PDF that MiniMagick had already successfully converted. I tried inspecting the metadata of the PDF to see if that revealed the problem. While I found that it was created by Epson Scan (more than likely from an Epson printer/scanner), the inspection didn’t give me what I was looking for: incorrect metadata. On to some Rails Console testing of the PDF culprit.

Trying the gem out in Console revealed some weird behavior with the rogue PDF:

app(dev)> image = ::MiniMagick::Image.open("/path/original_file.pdf")
 **** Warning: can't process font stream, loading font by the name.
 **** Error reading a content stream. The page may be incomplete.
 **** File did not complete the page properly and may be damaged.

 **** This file had errors that were repaired or ignored.
 **** The file was produced by:
 **** >>>> ��EPSON Scan <<<<
 **** Please notify the author of the software that produced this
 **** file that it does not conform to Adobe's published PDF
 **** specification.

#<MiniMagick::Image:0x0000000c74b088 @path="/path/temporary_file.pdf", @tempfile=#<Tempfile:/path/temporary_file.pdf (closed)>, @info=#<MiniMagick::Image::Info:0x0000000c74b060 @path="/path/temporary_file.pdf", @info={}>>

What is this error with the font stream? Well, the output says that the errors were repaired or ignored. Also, MiniMagick thought that this PDF had three pages:

app(dev)> image.pages
 **** Warning: can't process font stream, loading font by the name.
 **** Error reading a content stream. The page may be incomplete.
 **** File did not complete the page properly and may be damaged.

 **** This file had errors that were repaired or ignored.
 **** The file was produced by:
 **** >>>> ��EPSON Scan <<<<
 **** Please notify the author of the software that produced this
 **** file that it does not conform to Adobe's published PDF
 **** specification.

[
 [0] #<MiniMagick::Image:0x0000000c7376f0 @path="/path/temporary_file.pdf[0]", @tempfile=nil, @info=#<MiniMagick::Image::Info:0x0000000c7376c8 @path="/path/temporary_file.pdf[0]", @info={}>>,
 [1] #<MiniMagick::Image:0x0000000c737600 @path="/path/temporary_file.pdf[1]", @tempfile=nil, @info=#<MiniMagick::Image::Info:0x0000000c7375d8 @path="/path/temporary_file.pdf[1]", @info={}>>,
 [2] #<MiniMagick::Image:0x0000000c737510 @path="/path/temporary_file.pdf[2]", @tempfile=nil, @info=#<MiniMagick::Image::Info:0x0000000c7374e8 @path="/path/temporary_file.pdf[2]", @info={}>>
]

I knew that this PDF only had a single page. Why was it returning an array with three elements?

MiniMagick Code

I had to look into the source code for the gem. I needed to work backwards to whatever MiniMagick was using as its source of record. It doesn’t actually pull the number of pages from the metadata. Instead, the count is found by counting the number of lines returned from the identify method:

def layers
  layers_count = identify.lines.count
  layers_count.times.map do |idx|
    MiniMagick::Image.new("#{path}[#{idx}]")
  end
end
alias pages layers
alias frames layers

What did I get when I called .identify?

app(dev)> image.identify
 **** Warning: can't process font stream, loading font by the name.
 **** Error reading a content stream. The page may be incomplete.
 **** File did not complete the page properly and may be damaged.

 **** This file had errors that were repaired or ignored.
 **** The file was produced by:
 **** >>>> ��EPSON Scan <<<<
 **** Please notify the author of the software that produced this
 **** file that it does not conform to Adobe's published PDF
 **** specification.

"Can't find the font file /usr/share/fonts/truetype/fonts-japanese-mincho.ttf\nCan't find the font file /usr/share/fonts/truetype/fonts-japanese-mincho.ttf\n/path/temporary_file.pdf PDF 612x792 612x792+0+0 16-bit Bilevel DirectClass 61KB 0.000u 0:00.000"

Well, there are the specifics on that font stream error. Epson Scan was apparently trying to use some odd characters. I doubt the PDF was actually trying to use a Japanese font. I’m not certain, but I would venture a guess that the unknown characters in the warning (the black diamond with a question mark) were causing the error messages. Two characters, two messages.

MiniMagick was counting those error messages as pages. Could those errors be coming directly from ImageMagick? Possibly. However, the gem should probably account for this possibility, and only return a single page for a single page PDF.
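To make the miscount concrete, here is a minimal sketch that applies the same line-counting rule to the identify output shown above, with no ImageMagick involved. The second helper, filtered_page_count, is hypothetical and not part of MiniMagick; it assumes every real page line begins with the file's path, which holds for the output in this post but may not hold in general.

```ruby
# The identify output from this post: two font warnings, then one real page line.
IDENTIFY_OUTPUT =
  "Can't find the font file /usr/share/fonts/truetype/fonts-japanese-mincho.ttf\n" \
  "Can't find the font file /usr/share/fonts/truetype/fonts-japanese-mincho.ttf\n" \
  "/path/temporary_file.pdf PDF 612x792 612x792+0+0 16-bit Bilevel DirectClass 61KB 0.000u 0:00.000"

# Same rule as the gem's layers method: one page per line of output.
def naive_page_count(output)
  output.lines.count
end

# Hypothetical alternative (an assumption, not MiniMagick's API):
# count only the lines that start with the path of the file being identified.
def filtered_page_count(output, path)
  output.lines.count { |line| line.start_with?(path) }
end

puts naive_page_count(IDENTIFY_OUTPUT)                                # prints 3
puts filtered_page_count(IDENTIFY_OUTPUT, "/path/temporary_file.pdf") # prints 1
```

The warnings account for the two extra "pages". Filtering on the path is only one possible guard, and I have not checked it against every format ImageMagick can identify.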

Method Missing

I started looking through the MiniMagick code and poking around. I tried calling .details on my image, but unfortunately, it failed with a NoMethodError on nil. I’m guessing that it asks for the info from each page. After running out of better ideas, I tried something based on what I had seen in the code:

image.verbose

I didn’t realize what I had done when I called that. I was just trying out weird stuff. Why did the call work at all? MiniMagick has defined .method_missing to pass the “missing” methods on to Mogrify (a dangerous, but powerful feature of Ruby). This, in turn, translates any method you call on the image into an extra parameter, command line style. So, .verbose becomes -verbose for ImageMagick. I was hoping to receive some additional info on the missing fonts, or at least a true listing of how many pages there are. Here’s a sample output:
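For readers who have not used method_missing, here is a toy illustration of the idea. It is not MiniMagick's code: the FlagBuilder class and its args accessor are made up for this sketch, and the real gem does more work before anything reaches ImageMagick.

```ruby
# Hypothetical stand-in for a command builder; NOT MiniMagick's actual class.
class FlagBuilder
  attr_reader :args

  def initialize
    @args = []
  end

  # Any method that isn't defined becomes a "-name" flag,
  # followed by whatever arguments were passed to it.
  def method_missing(name, *values)
    @args << "-#{name}"
    @args.concat(values.map(&:to_s))
    self # returning self is what makes the calls chainable
  end

  # Keep respond_to? consistent with the catch-all above.
  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

builder = FlagBuilder.new
builder.verbose.density(200)
puts builder.args.join(" ") # prints -verbose -density 200
```

The catch-all is also where the danger comes from: a typo in a method name silently turns into a flag instead of raising NoMethodError.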

app(dev)> image.verbose
"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" "-sOutputFile=/path/file_name1" "-f/path/file_name2" "-f/path/file_name3"
 **** Warning: can't process font stream, loading font by the name.
 **** Error reading a content stream. The page may be incomplete.
 **** File did not complete the page properly and may be damaged.

 **** This file had errors that were repaired or ignored.
 **** The file was produced by:
 **** >>>> ��EPSON Scan <<<<
 **** Please notify the author of the software that produced this
 **** file that it does not conform to Adobe's published PDF
 **** specification.

/path/file_name PNG 612x792 612x792+0+0 8-bit DirectClass 261KB 0.010u 0:00.010
/path/temporary_file.pdf PDF 612x792 612x792+0+0 16-bit DirectClass 261KB 0.000u 0:00.000
/path/temporary_file.pdf PDF 612x792 612x792+0+0 16-bit DirectClass 393KB 0.100u 0:00.030
#<MiniMagick::Image:0x0000000c74b088 @path="/path/temporary_file.pdf", @tempfile=#<Tempfile:/path/temporary_file.pdf (closed)>, @info=#<MiniMagick::Image::Info:0x0000000c74b060 @path="/path/temporary_file.pdf", @info={}>>

So, I got additional info, but nothing useful for what I wanted: the number of pages. At this point, it still output the extra warning about the page being incomplete. Surprise: I didn’t know it yet, but I had somehow solved my issue.

What? That solved it?

Yes. For some reason, sending -verbose to ImageMagick stops it from sending errors about the missing fonts (I would assume that the fonts are still missing, but it’s not complaining). I only found this out later, when I had to recheck what had happened (I thought I had mixed up my variables in Console). Calling .pages now returns:

app(dev)> image.pages
[
 [0] #<MiniMagick::Image:0x0000000c727660 @path="/path/temporary_file.pdf[0]", @tempfile=nil, @info=#<MiniMagick::Image::Info:0x0000000c727638 @path="/path/temporary_file.pdf[0]", @info={}>>
]

This is because what is returned from .identify is only the info that you would expect. No font stream errors are being returned.

Final Thoughts

The great thing about this approach is that this method is chainable in Ruby. You only have to call it once, so if you are going to be converting each page of a PDF into an image, you might as well call image.verbose.pages.

The worst part about this solution is that I don’t know why it works. I admit that I am not an expert. There is probably something that is happening in ImageMagick that is changing what is returned, and it has a good reason. I realize this is just a workaround, but from what I can tell, this fixes errors that may be happening with bad PDFs that couldn’t successfully be converted before.

Run into this before? Had a similar experience? Or, can you explain ImageMagick? Leave me a comment!

 
