Extract page sizes from large PDFs - python-2.7

I need to extract the number of pages and their sizes in px/mm/cm/some-unit from PDF files using Python (sadly, 2.7, because it's a legacy project). The problem is that the files can be truly huge (hundreds of MiBs) because they'll contain large images.
I do not care for this content and I really want just a list of page sizes from the file, with as little consumption of RAM as possible.
I found quite a few libraries that can do this (including, but not limited to, the ones in the answers here), but none of them say anything about memory usage, and I suspect that most of them, if not all, read the whole file into memory before doing anything with it, which doesn't fit my purpose.
Are there any libraries that extract only structure and give me the data that I need without clogging my RAM?

pyvips can do this. It loads the file structure when you open the PDF and only renders each page when you ask for pixels.
For example:
#!/usr/bin/python
import sys
import pyvips

i = 0
while True:
    try:
        x = pyvips.Image.new_from_file(sys.argv[1], dpi=300, page=i)
        print("page =", i)
        print("width =", x.width)
        print("height =", x.height)
    except pyvips.Error:
        # no such page -- we've run off the end of the document
        break
    i += 1
libvips 8.7, due in another week or so, adds a new metadata item called n-pages that you can use to get the length of the document. Until it is released, though, you need to keep incrementing the page number until you get an error.
Using this PDF, when I run the program I see:
$ /usr/bin/time -f %M:%e ./sizes.py ~/pics/r8.pdf
page = 0
width = 2480
height = 2480
page = 1
width = 2480
height = 2480
page = 2
width = 4960
height = 4960
...
page = 49
width = 2480
height = 2480
55400:0.19
So it opened 50 pages in 0.2 s of real time, with a total peak memory use of 55 MB. That's with Python 3, but it works fine with Python 2 as well. The dimensions are in pixels at 300 DPI.
If you set page to -1, it'll load all the pages in the document as a single very tall image. All the pages need to be the same size for this though, sadly.

Inspired by the other answer, I found that libvips, which is suggested there, uses poppler (it can fall back to some other library if it cannot find poppler).
So, instead of the super-powerful pyvips, which seems great for many types of documents, I went with plain poppler, which has multiple Python bindings. I picked pdflib and came up with this solution:
from sys import argv
from pdflib import Document

doc = Document(argv[1])
for num, page in enumerate(doc, start=1):
    print(num, tuple(2.54 * x / 72 for x in page.size))
The 2.54 * x / 72 part converts from points (1/72 of an inch, the unit PDF uses for page sizes) to centimetres, nothing more.
Speed and memory test on a 264MiB file with one huge image per page:
$ /usr/bin/time -f %M\ %e python t2.py big.pdf
1 (27.99926666666667, 20.997333333333337)
2 (27.99926666666667, 20.997333333333337)
...
56 (27.99926666666667, 20.997333333333337)
21856 0.09
Just for reference, if anyone is looking for a pure Python solution, I made a crude one which is available here. It is not thoroughly tested and is much, much slower than this (around 30 seconds for the file above).

Related

MPEG 2 and 2.5 - problems calculating frame sizes in bytes

I have a console program which I have used for years, for (among other things) displaying info about certain audio-file formats, including mp3. I used data from the mpeghdr site to calculate the frame sizes, in order to further calculate playing time for the tracks. The equation that I got from mpeghdr was:
// Read the BitRate, SampleRate and Padding of the frame header.
// For Layer I files use this formula:
//
// FrameLengthInBytes = (12 * BitRate / SampleRate + Padding) * 4
//
// For Layer II & III files use this formula:
//
// FrameLengthInBytes = 144 * BitRate / SampleRate + Padding
This works well for most mp3 files, but there has always been a small subset for which this equation fails. Recently, I've been looking at a set of very small mp3 files, and have found that for these files the formula fails much more often, so I'm trying to finally nail down what is going on. All of these mp3 files were generated using Lame V3.100, with default settings, on Windows 7 64-bit.
In all cases, I can successfully find the first frame header, but when I use the above formula to calculate the offset to the next frame header, it is sometimes not correct.
As an example, I have a file 'wolf howl.mp3'; analysis tools such as MPEGAudioInfo show the frame size as 288 bytes. When I run my program, though, it shows the length of the first frame as 576 bytes (2 * 288). When I look at the mp3 file in a hex editor, with the first frame at 0x154, I can see that the next frame really is 288 bytes later, yet the calculation does in fact produce 576 bytes...
File info:
mpegV2.5, layer III
frame: bitrate=32, sample_rate=8000, pad=0, bytes=576
mtemp->frame_length_in_bytes =
    (144 * (mtemp->bitrate * 1000) / mtemp->sample_rate) + mtemp->padding_bit;
which equals 576
I've looked at numerous other references, and they all show this equation...
At first I thought it was an issue with MPEG 2.5, which is an unofficial standard, but I have seen this with MPEG 2 files as well. It only happens with small files, though.
Does anyone have any insights on what I am missing here?
//**************************************
Later notes:
I thought maybe audio format would be relevant to this issue, so I dumped channel_mode and mode_extension for each of my test files (3 calculate properly, 2 don't). Sadly, all of them are cmode=3, mode_ext=0
(i.e., last byte of the header is 0xC4)... so that doesn't help...
Okay, I found the answer to this question... it was in the MPEGAudioInfo program on the CodeProject site. Here is the vital key:
//*************************************************************************************
// This reference data is from MPEGAudioInfo app
// Samples per Frame / 8
static const u32 m_dwCoefficients[2][3] =
{
    { // MPEG 1
        12,     // Layer1 (must be multiplied with 4, because of slot size)
        144,    // Layer2
        144     // Layer3
    },
    { // MPEG 2, 2.5
        12,     // Layer1 (must be multiplied with 4, because of slot size)
        144,    // Layer2
        72      // Layer3
    }
};
It is unfortunate that none of the reference pages mention this detail!
My program now successfully calculates frame sizes for all of my mp3 files, including the small ones.
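For reference, the corrected calculation can be sketched in Python like this (a sketch based on the coefficient table above; the function name and argument layout are my own):

```python
# Samples-per-frame / 8 coefficients from the MPEGAudioInfo table.
# Keyed by MPEG version, then by layer number.
COEFFICIENTS = {
    1:   {1: 12, 2: 144, 3: 144},   # MPEG 1
    2:   {1: 12, 2: 144, 3: 72},    # MPEG 2
    2.5: {1: 12, 2: 144, 3: 72},    # MPEG 2.5
}

def frame_length(version, layer, bitrate_kbps, sample_rate, padding):
    """Frame length in bytes for one MPEG audio frame."""
    coef = COEFFICIENTS[version][layer]
    length = coef * (bitrate_kbps * 1000) // sample_rate + padding
    if layer == 1:
        length *= 4  # Layer I counts in 4-byte slots
    return length
```

For the 'wolf howl.mp3' case above (MPEG 2.5, Layer III, 32 kbps, 8000 Hz, no padding), this gives 72 * 32000 / 8000 = 288 bytes, matching what MPEGAudioInfo reports.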
I had the same problem. Some documents I've read don't halve the coefficient in the frame-size formula for MPEG 2.5 Layer III, but some source code I've encountered does.
It's hard to find any authoritative proof.
I have nothing better than this link:
https://link.springer.com/chapter/10.1007/978-1-4615-0327-9_12
(it would be better to share that link as a comment, but I have insufficient reputation)

webRTC : How to apply webRTC's VAD on audio through samples obtained from WAV file

Currently, I am parsing wav files and storing samples in std::vector<int16_t> sample. Now, I want to apply VAD (Voice Activity Detection) on this data to find out the "regions" of voice, and more specifically the start and end of words.
The parsed wav files are 16 kHz, 16-bit PCM, mono. My code is in C++.
I have searched a lot about it but could not find proper documentation regarding webRTC's VAD functions.
From what I have found, the function that I need to use is WebRtcVad_Process(). Its prototype is:
int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame,
                      size_t frame_length)
From what I found here : https://stackoverflow.com/a/36826564/6487831
Each frame of audio that you send to the VAD must be 10, 20 or 30 milliseconds long.
Here's an outline of an example that assumes audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:
int is_voiced = WebRtcVad_Process(vad, 16000, audio_frame, 160);
It makes sense:
1 sample = 2 bytes = 16 bits
SampleRate = 16000 samples/sec = 16 samples/ms
For 10 ms, number of samples = 160
So, based on that I have implemented this :
const int16_t *temp = sample.data();
for (int i = 0, ms = 0; i < sample.size(); i += 160, ms++)
{
    int isActive = WebRtcVad_Process(vad, 16000, temp, 160); // 10 ms window
    std::cout << ms << " ms : " << isActive << std::endl;
    temp = temp + 160; // processed 160 samples
}
Now, I am not really sure if this is correct, and I am also unsure whether it gives me the right output.
So,
Is it possible to use the samples parsed directly from the wav files, or does it need some processing?
Am I looking at the correct function to do the job?
How to use the function to properly perform VAD on the audio stream?
Is it possible to distinguish between the spoken words?
What is the best way to check if the output I am getting is correct?
If not, what is the best way to do this task?
I'll start by saying that no, I don't think you will be able to segment an utterance into individual words using VAD. From the article on speech segmentation in Wikipedia:
One might expect that the inter-word spaces used by many written
languages like English or Spanish would correspond to pauses in their
spoken version, but that is true only in very slow speech, when the
speaker deliberately inserts those pauses. In normal speech, one
typically finds many consecutive words being said with no pauses
between them, and often the final sounds of one word blend smoothly or
fuse with the initial sounds of the next word.
That said, I'll try to answer your other questions.
You need to decode the WAV file, which could be compressed, into raw PCM audio data before running VAD. See e.g. Reading and processing WAV file data in C/C++. Alternately, you could use something like sox to convert the WAV file to raw audio before running your code. This command will convert a WAV file of any format to 16 KHz, 16-bit PCM in the format that WebRTCVAD expects:
sox my_file.wav -r 16000 -b 16 -c 1 -e signed-integer -B my_file.raw
It looks like you are using the right function. To be more specific, you should be doing this:
#include "webrtc/common_audio/vad/include/webrtc_vad.h"
// ...
VadInst *vad;
WebRtcVad_Create(&vad);
WebRtcVad_Init(vad);
const int16_t *temp = sample.data();
for (int i = 0, ms = 0; i < sample.size(); i += 160, ms += 10)
{
    int isActive = WebRtcVad_Process(vad, 16000, temp, 160); // 10 ms window
    std::cout << ms << " ms : " << isActive << std::endl;
    temp = temp + 160; // processed 160 samples (320 bytes)
}
To see if it's working, you can run known files and see if you get the results you expect. For example, you could start by processing silence and confirm that you never (or rarely--this algorithm is not perfect) see a voiced result come back from WebRtcVad_Process. Then you could try a file that is all silence except for one short utterance in the middle, etc. If you want to compare to an existing test, the py-webrtcvad module has a unit test that does this; see the test_process_file function.
To do word-level segmentation, you will probably need to find a speech recognition library that does it or gives you access to the information that you need to do it. E.g. this thread on the Kaldi mailing list seems to talk about how to segment by words.

Convert VERY large ppm files to JPEG/JPG/PNG?

So I wrote a C++ program that produces very high resolution pictures (fractals).
I use fstream to save all the data in a .ppm file.
Everything works fine, but when I go to a really high resolution (38400x21600) the ppm file is ~8 gigabytes.
With my 16 gigabytes of RAM, however, I am still not able to convert that picture. I downloaded a couple of converters, but they couldn't handle it. Even GIMP crashed when I tried to "export as...".
So, does anyone know a good converter that can handle really large ppm files? In fact, I even want to go above 100 Gigabytes. I don't care if it's slow, it should just work.
If there is no such converter: is there a better way to write the data than with std::ofstream? For example, is there a library that produces a PNG file directly?
Thanks for your help!
Edit: I also asked myself what might be the best format for saving these large images. I did some research, and JPEG looks attractive (small size, still good quality). But maybe there is a better format? Let me know. Thanks.
A few thoughts...
An 8-bit PPM file of 38400x21600 should take 2.3GB. A 16-bit PPM file of the same dimensions requires twice as much, i.e. 4.6GB so I am not sure where you got 8GB from.
VIPS is excellent for processing large images, and if I take a 38400x21600 PPM file, and use the following command in Terminal (i.e. at the command-line), I can see it peaks at 58MB of RAM to do the conversion from PPM to JPEG...
vips jpegsave fractal.ppm fractal.jpg --vips-leak
memory: high-water mark 58.13 MB
That takes 31 seconds on a reasonable spec iMac and produces a 480MB file from my (random) data, so you would expect your result to be much smaller, since mine is pretty incompressible.
ImageMagick, on the other hand, takes around 11 GB peak working set of memory (see the maximum resident set size below, reported in bytes) and does the same conversion in 74 seconds:
/usr/bin/time -l convert fractal.ppm fractal.jpg
73.81 real 69.46 user 4.16 sys
11616595968 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
4051124 page reclaims
4 page faults
0 swaps
0 block input operations
106 block output operations
0 messages sent
0 messages received
0 signals received
9 voluntary context switches
11791 involuntary context switches
Go to the Baby X resource compiler and download the JPEG encoder, savejpeg.c. It takes an RGB buffer which has to be flat in memory. Hack it into a version that accepts a stream of 16x16 blocks, then write your own ppm loader that loads one 16-pixel-high strip at a time.
Now the system will scale up to huge images which don't fit in memory. How you're going to display them I don't know, but the JPEG will be to specification.
https://github.com/MalcolmMcLean/babyxrc
I'd suggest that a more efficient and faster solution would be to simply get more RAM - 128GB is not prohibitively expensive these days (or add swap space).

Reaching limitations creating a gallery of plots

In a script I'm using, the code generates a figure in which a number of subplots are created. Usually it creates a rectangular grid of plots, but for its current use, the horizontal parameter has only 1 value, and the vertical parameter has considerably more values than it has had previously. This is causing my program to crash while running, because (presumably) the vertical dimension is too large. The code that's causing the issue is:
#can't get past the first line here
self.fig1 = plt.figure('Title',figsize=(4.6*numXparams,2.3*numYparams))
self.gs = gridspec.GridSpec(numYparams,numXparams)
self.gs.update(left=0.03, right=0.97, top=0.9, bottom=0.1, wspace=0.5, hspace=0.5)
and then later in a nested for loop running over both params:
ax = plt.subplot(self.gs[par0, par1])
The error I'm getting is:
X Error of failed request: badAlloc (insufficient resources for operation)
Major opcode of failed request: 53 (X_CreatePixmap)
Serial number of failed request: 295
Current serial number in output stream: 296
My vertical parameter currently has 251 values in it, so I can see how 251*2.3 inches could lead to trouble. I added in the 2.3*numYparams because the plots were overlapping, but I don't know how to create the figure any smaller without changing how the plots are arranged in the figure. It is important for these plots to stay in a vertically oriented column.
There are a couple of errors in your code. Fixing them allowed me to generate the figure you are asking for.
# I needed the figsize keyword here
self.fig1 = plt.figure(figsize=(4.6*numXparams, 2.3*numYparams))
# You had x and y switched around here
self.gs = gridspec.GridSpec(numYparams, numXparams)
self.gs.update(left=0.03, right=0.97, top=0.9, bottom=0.1, wspace=0.5, hspace=0.5)
# I ran this loop
for i in range(numYparams):
    ax = self.fig1.add_subplot(self.gs[i, 0])  # note the y coord in the gridspec comes first
    ax.text(0.5, 0.5, i)  # just an identifier
self.fig1.savefig('column.png', dpi=50)  # had to drop the dpi, because you can't have a png that tall!
and this is the top and bottom of the output figure:
Admittedly, there was a lot of space above the first and below the last subplot, but you can fix that by playing with the figure dimensions or gs.update
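The badAlloc itself comes from the X server refusing to allocate a pixmap that tall (X_CreatePixmap in the error), so another option is to render off-screen with the Agg backend and never touch X at all. A self-contained sketch (I've shrunk numYparams so it runs quickly; the question uses 251):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering: no X pixmap is ever created
import matplotlib.pyplot as plt
from matplotlib import gridspec

numXparams, numYparams = 1, 12  # the question uses numYparams = 251

fig = plt.figure(figsize=(4.6 * numXparams, 2.3 * numYparams))
gs = gridspec.GridSpec(numYparams, numXparams,
                       left=0.03, right=0.97, top=0.9, bottom=0.1,
                       wspace=0.5, hspace=0.5)
for i in range(numYparams):
    ax = fig.add_subplot(gs[i, 0])  # y coordinate first in the gridspec
    ax.text(0.5, 0.5, str(i))       # just an identifier
fig.savefig("column.png", dpi=50)
```

Because nothing is ever drawn to the screen, the only remaining limit is the output format's maximum image height, which is why the dpi still has to be kept modest.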

Running textcleaner against ImageMagick compressed images

I am trying to use the textcleaner script to clean up real-life images that I am using with OCR. The issue I am having is that the images sent to me are sometimes rather large (3.5-5 MB, 12 MP pictures). The command I run, textcleaner -g -e none -f <int 10-100> -o 5 result1.jpg out1.jpg, takes about 10 seconds at -f 10 and minutes or more at -f 100.
To get around this I tried using ImageMagick to compress the image so it was much smaller. Using convert -strip -interlace Plane -gaussian-blur 0.05 -quality 50% main.jpg result1.jpg I was able to take a 3.5 MB file and convert it almost losslessly to ~400 KB. However, when I run textcleaner on this new file it STILL acts like it's a 3.5 MB file (the times are almost exactly the same). I have tested the same textcleaner settings against a file that was ~400 KB to begin with, and it is almost instant, while -f 100 takes about 12 seconds.
I am about out of ideas. I would like to follow the example here, as I am in almost exactly the same situation. However, at the current speed of transformation an entire OCR process could take over 10 minutes, when I need it to be around 30 seconds.