scanning plot through a large data file using python - python-2.7

I have a large (10-100GB) data file of 16-bit integer data, which represents a time series from a data acquisition device. I would like to write a piece of python code that scans through it, plotting a moving window of a few seconds of this data. Ideally, I would like this to be as continuous as possible.
The data is sampled at 4MHz, so to plot a few seconds of data involves plotting ~10 million data points on a graph. Unfortunately I cannot really downsample since the features I want to see are sparse in the file.
matplotlib is not really designed to do this. It is technically possible, and I have a semi-working matplotlib solution that lets me plot any particular time window, but it's far too slow and cumbersome for a continuous scan of incrementally changing data: redrawing the figure takes several seconds, which is far too long.
Can anyone suggest a Python package or approach for doing this?

PyQtGraph is faster than Matplotlib, but I don't know if it can plot 10 million points a second. It also includes several methods to downsample your data, so one of them might still be useful to you. Note that it requires Qt and PyQt.
Still, you have between 5e9 and 5e10 data samples. If you can simultaneously plot 10 million of them, that still means making between 500 and 5000 plots. Are you really going to inspect them all visually? You might consider implementing some kind of feature detection.
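As a rough illustration of the scanning-window idea with PyQtGraph (a minimal sketch, not tested against your data; "data.bin" is a stand-in file name, and the 4 MHz rate comes from the question), memory-mapping the file means only the visible window is ever read from disk:

```python
import numpy as np
import pyqtgraph as pg
from pyqtgraph.Qt import QtCore

# Memory-map the file so only the current window is read from disk.
data = np.memmap("data.bin", dtype=np.int16, mode="r")
rate = 4000000
window = 2 * rate                 # a two-second window (~8 million points)
step = rate // 10                 # advance 0.1 s of data per redraw

app = pg.mkQApp()
plot = pg.plot(title="Scanning window")
plot.setDownsampling(auto=True, mode="peak")  # "peak" keeps sparse spikes visible
plot.setClipToView(True)
curve = plot.plot(pen="y")
pos = 0

def advance():
    global pos
    curve.setData(data[pos:pos + window])
    pos = (pos + step) % (len(data) - window)

timer = QtCore.QTimer()
timer.timeout.connect(advance)
timer.start(100)                  # redraw every 100 ms
app.exec_()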

Something that has worked for me on a similar problem (time-varying heat maps) was to run a batch job overnight producing several thousand such plots, saving each as a separate image. At 10 s a figure, you can produce 3600 in 10 h. You can then simply scan through the images, which could provide the insight you're looking for.
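A minimal sketch of such a batch job with matplotlib (file name and window size are assumptions); reusing the figure and only swapping the line's y-data is much faster than redrawing from scratch each time:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")             # headless backend, suitable for overnight batch jobs
import matplotlib.pyplot as plt

data = np.memmap("data.bin", dtype=np.int16, mode="r")   # "data.bin" is a stand-in
window = 8000000                  # two seconds at 4 MHz

fig, ax = plt.subplots(figsize=(16, 4))
line, = ax.plot(data[:window])
ax.set_ylim(-32768, 32767)        # full int16 range, so the axes never rescale

for i, start in enumerate(range(0, len(data) - window, window)):
    line.set_ydata(data[start:start + window])   # reuse the artists instead of redrawing
    ax.set_title("samples %d to %d" % (start, start + window))
    fig.savefig("frame_%05d.png" % i)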

Related

Bokeh + Flask: Flask can't display a large array produced by a heavy computing script in Bokeh, but it works in a light script with the same array size

I'm developing a web application using Flask and use Bokeh to visualize the result. It reads a TIFF file as an array, processes it with some mathematical computation, maps the result to a colormap, and displays it with Bokeh; the array has a size of 7771x7631. When I try to use only 1/8 of the total script, which also results in a 7771x7631 array, it works fine. But when I try to run the whole script, which is quite heavy and needs two minutes to run completely, and which also results in a 7771x7631 array, the browser doesn't display any plot with Bokeh. The script itself is fine and works as it should, because when I slice the resulting array to half size (4000x4000) it works. I really don't know why this happens. Please help me.
This is not a reasonable size for image plots in Bokeh. It is well outside the intended/expected usage size (not even close, really). I can imagine a variety of reasons it might not work on any given browser, starting with browser-specific memory or canvas size limitations, browser bugs, etc. An RGBA array of that size is alone nearly a quarter of a gigabyte. It has to be received as a base64-encoded string (which is slightly larger), which is then converted into JS typed arrays. Both of those have to exist simultaneously, at least long enough to decode things, which means the browser must accommodate more than half a gigabyte at once [1]. Many browsers simply won't. There is nothing at all that the project can do about this.
If you have data of this size, you really must look into some kind of downsampling approach that gets the images sent to the browser down to a more manageable size, one that more closely matches the pixel dimensions of the actual plot canvas. If you send such huge images to a smaller canvas, the browser will end up subsampling regardless, just to fit things on the canvas. The only difference is that you have no control over the process in that case, so it is better to take an active role in how the downsampling is accomplished. A tool like Datashader, which works well with Bokeh, might be useful.
[1] If this were a Bokeh server application, the data could be sent in a raw binary format directly into typed array buffers, which would avoid roughly half that cost. It might help some, but I would still consider this usage very extreme.
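As a rough illustration of the downsampling advice (a minimal sketch, not Datashader itself), a plain block average can bring a 7771x7631 array down near canvas size before it is handed to Bokeh's image glyph:

```python
import numpy as np

def block_mean(arr, factor):
    # Trim to a multiple of factor, then average factor x factor blocks.
    h, w = arr.shape
    h, w = h - h % factor, w - w % factor
    return arr[:h, :w].reshape(h // factor, factor,
                               w // factor, factor).mean(axis=(1, 3))

big = np.random.rand(7771, 7631)      # stand-in for the computed array
small = block_mean(big, 8)            # ~971 x 953, a manageable image size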

How does image resolution affect result and accuracy in Keras?

I'm using Keras (with a Tensorflow backend) for an image classification project. I have a total of almost 40,000 high-resolution (1920x1080) images that I use as training input data. Training takes about 45 minutes, and this is becoming a problem, so I was thinking that I might be able to speed things up by lowering the resolution of the image files. Looking at the code (I didn't write it myself), it seems all images are resized to 30x30 pixels anyway before processing.
I have two general questions about this.
Is it reasonable to expect this to improve the training speed?
Would resizing the input image files affect the accuracy of the image classification?
1. Of course it will affect training speed, since the spatial dimensions are among the most important factors in a model's speed.
2. It will certainly affect accuracy too, but exactly how much depends on many other aspects, such as what objects you are classifying and what dataset you are working with.
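For reference, a minimal sketch of resizing at load time (the directory layout is hypothetical); with target_size the model only ever sees small inputs, whatever the source resolution, so there is no need to resize the files on disk:

```python
from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator(rescale=1.0 / 255)
train = gen.flow_from_directory(
    "data/train",              # hypothetical directory of class subfolders
    target_size=(30, 30),      # every image is resized on load
    batch_size=32,
    class_mode="categorical",
)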

Appropriate image file format for losslessly compressing series of screenshots

I am building an application which takes a great number of screenshots while "recording" the operations performed by the user on the Windows desktop.
For obvious reasons I'd like to store this data in as efficient a manner as possible.
At first I thought about using the PNG format to get this done. But I stumbled upon this: http://www.olegkikin.com/png_optimizers/
The best algorithms only managed a 3 to 5 percent improvement on an image of GUI icons. This is highly discouraging, and it shows that I'm going to need to do better, because just using PNG will not allow me to use previous frames to help the compression ratio; the file size will continue to grow linearly with time.
I thought about solving this with a bit of a hack: Just save the frames in groups of some number, side by side. For example I could just store the content of 10 1280x1024 captures in a single 1280x10240 image, then the compression should be able to take advantage of repetitions across adjacent images.
But the problem with this is that the algorithms used to compress PNG are not designed for it. I would be arbitrarily placing images at 1024-pixel intervals from each other, and only 10 of them can be grouped together at a time. From what I have gathered after a few minutes scanning the PNG spec, the compression operates on individual scanlines (which are filtered) and is then chunked together, so there is actually no way that info from 1024 pixels above could be referenced from down below.
So I've found the MNG format which extends PNG to allow animations. This is much more appropriate for what I am doing.
One thing that I am worried about is how much support there is for "extending" an image/animation with new frames. The nature of the data generation in my application is that new frames get added to a list periodically. But I do have a simple semi-solution to this problem, which is to cache a chunk of recently generated data and incrementally produce an "animation", say, every 10 frames. This will allow me to tie up only 10 frames' worth of uncompressed image data in RAM, not as good as offloading it to the filesystem immediately, but it's not terrible. After the entire process is complete (or even using free cycles in a free thread, during execution) I can easily go back and stitch the groups of 10 together, if it's even worth the effort to do it.
Here is my actual question that everything has been leading up to: is MNG the best format for my requirements? Those requirements are:
1. a C/C++ implementation available with a permissive license,
2. 24/32-bit color and 4+ megapixel resolution (some folks run 30-inch monitors),
3. lossless or near-lossless (retains text clarity) compression, with provisions to reference previous frames to aid that compression.
For example, here is another option that I have thought about: video codecs. I'd like to have lossless quality, but I have seen examples of h.264/x264 reproducing remarkably sharp stills, and its performance is such that I can capture at a much faster interval. I suspect that I will just need to implement both of these and do my own benchmarking to adequately satisfy my curiosity.
If you have access to a PNG compression implementation, you could easily optimize the compression without having to use the MNG format by just preprocessing the "next" image as a difference from the previous one. This is naive but effective if the screenshots don't change much, and compressing "almost empty" PNGs will greatly decrease the storage space required.
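A minimal sketch of that preprocessing idea, assuming Pillow and NumPy and hypothetical file names: each stored frame is the wrapped (mod-256) difference from its predecessor, so unchanged regions become zeros, which the deflate stage compresses very well.

```python
import numpy as np
from PIL import Image

# Hypothetical input names; all frames must share the same dimensions.
frames = [np.array(Image.open("shot_%03d.png" % i)) for i in range(10)]

Image.fromarray(frames[0]).save("out_000.png")        # first frame stored whole
for i in range(1, len(frames)):
    delta = frames[i] - frames[i - 1]                 # uint8 arithmetic wraps mod 256
    Image.fromarray(delta).save("out_%03d.png" % i)
# To reconstruct: frame[i] = frame[i-1] + delta[i], with the same wraparound.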

Need fast c++ qt/qwt scatter plot

I have a huge array of 2D points (about 3 million pairs), which I need to render at reasonable speed in a Qt-based application.
I've tried using QGraphicsScene, but it's very slow even with 400,000 primitives, so I was looking into the qwt library instead.
It has a scatter plot example screenshot on its SourceForge page, which looks like exactly what I need, but I can find neither any actual code that can be used for this, nor a corresponding API in the qwt docs; they mention only different types of curves.
So it would be good to get some pointers for scatter plot examples and some advice on its performance.
Suggestions for other c++ qt-compatible plotting libraries which can cope with this amount of data are also welcome.
The scatter plot is contained in the "realtime" example: what you want is the IncrementalPlot class.
I'd also suggest that drawing all 3 million points isn't reasonable, since modern screens only have about 2 million pixels :) Thus it seems better to simplify the plot beforehand by merging adjacent points into one, with a threshold dependent on the zoom factor.
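A sketch of that merging step in NumPy terms (random data here, just to show the shape of the approach): bin the points onto a screen-sized grid and draw at most one marker, or an intensity, per occupied cell:

```python
import numpy as np

pts = np.random.randn(3000000, 2)     # stand-in for the real 2D points
counts, xedges, yedges = np.histogram2d(pts[:, 0], pts[:, 1], bins=(1920, 1080))
occupied = np.argwhere(counts > 0)    # at most ~2 million cells to actually draw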
As viens pointed out, generating scatter plots with 3 million points is probably not a good idea.
I have achieved good performance generating 3D scatter plots with 30,000 points using OpenGL.
OpenGL is fast and integrates well with Qt. However, it is a low level API that forces you to do a lot of tedious coding.
VTK may be another option.
MathGL is a free (GPL) cross-platform plotting library. It is written in C++ and has a Qt widget. It is also rather fast, but 3 million points take about 30 seconds to plot on my laptop.
I'd suggest using OpenGL, as @vines said, and in particular exploiting display lists (glGenLists) or vertex buffers. A few million points as primitive vertices shouldn't be that difficult.

Help with FFT(Fast Fourier Transforms) and/or DSP

I'm trying to build a screen-flashing application that flashes the screen according to the music (which will be frequencies, such as healing frequencies, etc.).
I already made the player and know how I will make the screen flash, but I need to make the screen flash super fast according to the music; for example, if the music speeds up, the screen flash will flash faster. I understand that I would achieve this with an FFT or DSP (as I only need to know when the frequency rises above some value, let's say 20 Hz, to change the color and make the screen flash).
But I've found that I understand NOTHING, much less how to implement it in my application.
Can somebody help me learn either or both of them? My email is sismetic_chaos@hotmail.com. I really need help; I've been stuck for about 3 days, not coding or doing anything at all, just trying to understand, but I don't.
PS:My application is written in C++ and Qt.
PS:Thanks for taking the time to read this and the willingness to help.
Edit: Thanks to all for the answers. The problem is in no way solved yet, but I appreciate all the answers; I didn't think I would get so many answers and so much info. Thanks to you all.
This is a difficult problem, requiring more than an FFT. I'll briefly describe how I implemented beat detection when I was writing software for professional DJ equipment.
First of all, you'll need to cut down the amount of data you're dealing with, since there are only two or three beats per second, but tens of thousands of samples. You'll also need to look at different frequency ranges, since some types of music carry the tempo in the bassline, and others in percussion or other instruments. So pass the signal through several band-pass filters (I chose 8 filters, each covering one octave, from low bass to high treble), and then downsample each band by averaging the power over a few hundred samples.
Every few seconds, you'll have a thousand or so samples in each band. Your next tool is an autocorrelation, to identify repetitive patterns in the music. The peaks of the autocorrelation tell you what the beat is more or less likely to be; but you'll need to make up some heuristics to compare all the frequency bands to find a beat that you can be confident in, and to avoid misleading syncopations. If you can manage that, then you'll have a reasonable guess at the tempo, but no idea of the phase (i.e. exactly when to flash the screen).
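A toy version of that autocorrelation step, assuming `band` holds a few seconds of downsampled power values at `rate` samples per second (a sketch of the idea, not the production code described above):

```python
import numpy as np

def estimate_bpm(band, rate, lo_bpm=60, hi_bpm=180):
    band = band - band.mean()
    ac = np.correlate(band, band, mode="full")[len(band) - 1:]  # lags >= 0
    lags = np.arange(len(ac))
    # Only consider lags corresponding to plausible tempos.
    valid = (lags > rate * 60.0 / hi_bpm) & (lags < rate * 60.0 / lo_bpm)
    best_lag = lags[valid][np.argmax(ac[valid])]
    return 60.0 * rate / best_lag      # beats per minute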
Now you can look at a smoothed version of the audio data for peaks, some of which are likely to correspond to beats. Initially, look for the strongest peak over the course of a few seconds and take that as a downbeat. In conjunction with the tempo you estimated in the first stage, you can predict when the next beat is due, measure where you actually saw something like a beat, and adjust your estimate to more closely match the data. You can also maintain a confidence level based on how well the predicted beats match the measured peaks; if that drops too low, restart the beat detection from scratch.
There are a lot of fiddly details to this, and it took me some weeks to get it working nicely. It is a difficult problem.
Or for a simple visualisation effect, you could simply detect peaks and flash the screen for each one; it will probably look good enough.
The output of an FFT will give you the frequency spectrum of an audio sample, but extracting the tempo from the FFT output is probably not the way you want to go.
One thing you can do is to use peak detection to identify the volume "spikes" that typically occur on the "down-beats" of the music. If you can identify the down-beats, then you can use a resource like bpmdatabase.com to find the tempo of the song. The tempo will tell you how fast to flash and the peaks you detected will tell you when to start flashing. Have your app monitor your flashes to make sure that they generally occur at the same time as a peak (if the two start to diverge, then the tempo may have changed mid-song).
That may sound straightforward, but this is actually a very non-trivial thing to implement. You might want to read this SO question for more information. There are some quality links in the answers there.
If I'm completely mis-interpreting what you are trying to do and you need to do FFTs for something different, then you might want to look at using one of the existing FFT libraries to do the heavy lifting for you. Some examples are FFTW and KissFFT.
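For completeness, here is what the FFT output mentioned above looks like in practice (shown in NumPy as a sketch; the 44.1 kHz rate and 440 Hz test tone are just demo values):

```python
import numpy as np

rate = 44100
t = np.arange(rate) / float(rate)
chunk = np.sin(2 * np.pi * 440 * t)               # one second of a 440 Hz tone
spectrum = np.abs(np.fft.rfft(chunk))             # magnitude per frequency bin
freqs = np.fft.rfftfreq(len(chunk), d=1.0 / rate)
print(freqs[np.argmax(spectrum)])                 # prints ~440.0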
It sounds like maybe you're trying to get your visualizer to flash the screen in time with the music somehow. I don't think calculating the FFT is going to help you here. At any given instant, there will be many simultaneous frequency components, all over the audio spectrum (roughly 20 Hz to 20 kHz). But you're likely to be a lot more interested in the musical tempo (beats per minute, more like 5 Hz or below), and that's not going to show up anywhere in an FFT of the raw audio signal.

You probably need something much simpler: some sort of real-time peak detection. Whenever you see a peak greater than some threshold above the average volume, make your screen flash.

Of course, more complicated visualizations might well take advantage of the FFT, but not the one you're describing.
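A minimal sketch of that kind of real-time peak detection (the 2.5x threshold and smoothing factor are arbitrary demo values, and `state` is just a dict carried between calls):

```python
import numpy as np

def should_flash(samples, state, threshold=2.5, smoothing=0.95):
    # Flash when the chunk's energy jumps well above a running average.
    energy = float(np.mean(np.square(samples.astype(np.float64))))
    avg = state.get("avg", energy)
    state["avg"] = smoothing * avg + (1.0 - smoothing) * energy
    return energy > threshold * avg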
My recommendation would be to find a library that does this for you. Unless you have a lot of mathematics to back you up, I think you will be wasting a ton of your time trying to learn FFTs when all you really want is some sort of "bass hits per minute" number to which you can adjust your graphics accordingly.
Check out this similar post:
here
It took me about three weeks to understand the mathematics behind FFTs and then another week to write something in Matlab using those concepts. If you are discouraged after three days, don't try to roll your own.
I hope this is helpful advice and not discouraging.
-Brian J. Stinar-
As previous answers have noted, an FFT is probably not the tool you need in order to solve your problem, which requires tempo detection rather than spectral analysis.
For an example of what can be done using an FFT, and of how a particular FFT implementation was integrated into a Qt application, take a look at this blog post, which describes the spectrum analyzer demo I developed. Code for the demo is shipped with Qt itself, in the demos/spectrum directory.