Extract original images from PDF files
Posted: 03 Feb 2023 01:14
I never post scripts, but this time is the exception because it was so damn painful to get this working. I've looked everywhere for an application that can extract the images from PDF files without re-converting them and found none except mutool. The problem is that its extract command doesn't offer an output path option and doesn't name the files by their pages, so I had to do this manually. I started with an overcomplicated XYplorer script and batch file, but found another way with the help of sebras from mutool: thanks to him I realized it's possible to execute JS scripts with mutool.
This tiny script needs mutool and exiftool, plus sounder (optional), all of which you have to download from their official sites.
They must be included in your PATH; you can do this by simply copying the executables to "C:\Windows". The exiftool executable must be renamed to "exiftool.exe".
If sounder is not included, the script will silently fail to play a sound when there is an error (see point 5 below).
The sound was downloaded from freesound.org; I don't remember which one it was.
The script also makes use of 2 extra files: the JavaScript script and one batch file that executes mutool and exiftool. All the files are easy to read, or at least I tried.
- Extracts all the images from all the currently listed PDFs to a subfolder named after the PDF file, in the current path.
- Number padding is automatically detected based on the number of pages and images per page.
- Copies the dates from the PDF file to the images, and then the dates from the metadata.
- Extracts the original JPG files and all the other formats as PNG files, so there is no quality loss.
- Warns if there is any page in any PDF that has no image.
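For anyone curious how the padding in point 2 can be worked out, here is a minimal sketch. It is not the actual script; the function names and the "page-NN_image-NN" naming pattern are made up for illustration. It sticks to ES5 because mutool's JS engine does not have newer helpers such as padStart.

```javascript
// Left-pad a positive integer with zeros to a fixed width (ES5, no padStart).
function zeroPad(n, width) {
    var s = String(n);
    while (s.length < width) s = "0" + s;
    return s;
}

// Number of digits needed to print the largest value, e.g. 120 -> 3.
function digitsFor(maxValue) {
    return String(Math.max(1, maxValue)).length;
}

// imagesPerPage[i] is the number of images found on page i + 1.
// The "page-..._image-..." pattern is hypothetical, not the script's real one.
function buildNames(imagesPerPage) {
    var i, j, maxImages = 0, names = [];
    for (i = 0; i < imagesPerPage.length; i++) {
        if (imagesPerPage[i] > maxImages) maxImages = imagesPerPage[i];
    }
    var pageWidth = digitsFor(imagesPerPage.length);  // based on page count
    var imageWidth = digitsFor(maxImages);            // based on busiest page
    for (i = 0; i < imagesPerPage.length; i++) {
        for (j = 1; j <= imagesPerPage[i]; j++) {
            names.push("page-" + zeroPad(i + 1, pageWidth) +
                       "_image-" + zeroPad(j, imageWidth));
        }
    }
    return names;
}
```

With 10 pages and at most 12 images on a page, both numbers get 2 digits, so the files sort correctly in any file manager.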
If you want to use the JS script in other ways, it accepts 3 arguments:
- PDF file
- Output path
- Anything; this is just to make the script output an error if there is at least one page that has no image.
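As a rough illustration of how the three arguments could be handled, here is a small sketch. It is not the real script: parseArgs and pagesWithoutImages are hypothetical names. As far as I know, scripts started with "mutool run" receive their arguments in the global scriptArgs array, so in practice that array is what you would pass in.

```javascript
// Interpret the three arguments described above.
// args is a plain array of strings (scriptArgs when run under mutool run).
function parseArgs(args) {
    if (args.length < 2) {
        throw new Error("usage: <PDF file> <output path> [anything]");
    }
    return {
        pdfFile: args[0],
        outputPath: args[1],
        // The value of the third argument is ignored; only its presence matters.
        warnOnEmptyPages: args.length > 2
    };
}

// Return the 1-based numbers of the pages that contain no image.
// imagesPerPage[i] is the number of images found on page i + 1.
function pagesWithoutImages(imagesPerPage) {
    var empty = [];
    for (var i = 0; i < imagesPerPage.length; i++) {
        if (imagesPerPage[i] === 0) empty.push(i + 1);
    }
    return empty;
}
```

A call would then look something like "mutool run extract.js book.pdf out x", where extract.js is a placeholder file name and the trailing "x" only switches the empty-page error on.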
Previously the script didn't successfully extract images other than DCT-encoded ones (JPG); now it extracts the original JPG files and extracts all the other formats to PNG without any quality loss.
As far as I know there is no way to extract TIF images as TIF files, WEBP images as WEBP files, and so on. Once they are included in a PDF file they become raw data which mutool has to interpret, and in this case it does so by outputting PNG. There are some formats that can be extracted as-is, such as JBIG2, but they are useless as there is no viewer.
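The JPG-or-PNG decision can be pictured like this. It is only a sketch under my own assumptions, not what the script literally does: outputExtension is a made-up name, and it only treats an image as a pass-through JPG when DCTDecode is its sole filter. The filter names themselves (DCTDecode, FlateDecode, JBIG2Decode and so on) are the standard ones from the PDF format.

```javascript
// Pick the output extension from the value of a PDF image's /Filter entry.
// filter may be a single name, an array of names (a filter chain), or undefined.
function outputExtension(filter) {
    var only = filter;
    if (Array.isArray(filter)) {
        // A chain with more than one filter is not plain JPEG data on disk,
        // so this sketch conservatively sends it down the PNG route.
        only = filter.length === 1 ? filter[0] : undefined;
    }
    // DCTDecode marks JPEG data, which can be saved without re-encoding.
    return only === "DCTDecode" ? "jpg" : "png";
}
```

Everything that is not a plain DCT stream ends up as PNG, which is lossless, so nothing is lost even though the original container format (TIF, WEBP, etc.) cannot be recovered.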