how compressed can I get a heightmap? - compression

I've been playing around with simple terrain generation (the diamond-square algorithm) and I got to thinking about how large I can practically make it.
If I wanted to generate a continent, then a 1000 by 1000 km square would be big enough, but if I also wanted high resolution it quickly results in humongous file sizes: 1000 by 1000 km is a million square kilometers, and if I store a point for every meter, every square kilometer is a million square meters, so a trillion points in total.
If I use unsigned shorts (2 bytes each, enough for a max altitude of 10,000 meters) and I do my math right, that's 2TB of data. Of course I couldn't store it all in RAM at once, but even with HDD space getting cheaper every day, a 2TB heightmap is not practical.
I got to thinking about compressing the data, but I've never done compression before and have no clue how far I could shrink it down. If it only goes from 2 to 1.9 TB, it wouldn't be worth it. What compression methods work best without loss of data?
I'm willing to reduce the size and resolution of the heightmap, but I'd like to make it as large as practical.

What you can do depends a lot on your needs. If you don't need all geometrical data all the time and can spare a lot of CPU time, as little as a couple of bytes for your random seed will suffice to reliably regenerate the full terrain. You could also go ahead and divide your continent into patches and store a seed for each patch. That way you can recreate smaller bits of the continent reliably without having to go through the expense of creating the whole continent each time you only need a fraction of the geometry.
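Here's a minimal sketch of the per-patch idea in Python/NumPy. Everything in it is illustrative: generate_patch stands in for your diamond-square routine, and patch_seed is just one hypothetical way to derive a stable per-patch seed from a single world seed.

    import hashlib
    import numpy as np

    WORLD_SEED = 42        # the only thing you actually have to store on disk
    PATCH_SIZE = 1025      # 2^n + 1, the size diamond-square likes

    def patch_seed(world_seed, px, py):
        """Derive a stable 64-bit seed for patch (px, py) from the world seed."""
        digest = hashlib.sha256(f"{world_seed}:{px}:{py}".encode()).digest()
        return int.from_bytes(digest[:8], "little")

    def generate_patch(px, py):
        """Recreate patch (px, py) on demand; placeholder for your diamond-square code."""
        rng = np.random.default_rng(patch_seed(WORLD_SEED, px, py))
        # ... run diamond-square here, drawing its random offsets from rng ...
        return rng.random((PATCH_SIZE, PATCH_SIZE))   # placeholder heightfield

    tile = generate_patch(3, 7)   # always reproduces the same terrain for patch (3, 7)

One caveat: patches generated from independent seeds won't automatically line up along their shared edges, so you'll need to seed or stitch the borders consistently to avoid seams.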

Why does JPEG compression process images in 8x8 blocks?

Why does JPEG compression process the image in 8x8 blocks instead of applying the Discrete Cosine Transform to the whole image?
8x8 was chosen after numerous experiments with other sizes.
The conclusions of those experiments were:
1. Matrices larger than 8x8 are harder to do mathematical operations on (transforms and so on), are poorly supported by hardware, or take longer to process.
2. Matrices smaller than 8x8 don't contain enough information to carry through the rest of the pipeline, which results in poor quality in the compressed image.
Because that would take "forever" to decode. I don't remember fully now, but I think you need at least as many coefficients as there are pixels in the block, and if you code the whole image as a single block, every pixel has to iterate through all the DCT coefficients of the entire image. With 8x8 blocks a naive decode touches only 64 coefficients per pixel no matter how big the image is; with one whole-image block it touches as many coefficients per pixel as there are pixels.
I'm not very good at big-O calculations, but I guess the complexity would be O("forever"). ;-)
For modern video codecs I think they've started using 16x16 blocks instead.
One good reason is that images (or at least the kind of images humans like to look at) have a high degree of information correlation locally, but not globally.
Every relatively smooth patch of skin, or piece of sky or grass or wall eventually ends in a sharp edge and is replaced by something entirely different. This means you still need a high frequency cutoff in order to represent the image adequately rather than just blur it out.
Now, because Fourier-like transforms such as the DCT "jumble" all the spatial information, you wouldn't be able to throw away any intermediate coefficients either, nor the high-frequency components "you don't like".
There are of course other ways to try to discard visual noise and reconstruct edges at the same time, by preserving high-frequency components only where they are needed, or by doing some iterative reconstruction of the image at finer levels of detail. You might want to look into scale-space representation and wavelet transforms.
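To make the "local" argument concrete, here is a rough Python/SciPy sketch of block-wise DCT compression on a 2-D grayscale array: transform each 8x8 block, keep only the largest coefficients, and invert. It only illustrates the blocking and coefficient truncation; real JPEG quantizes the coefficients in a much more careful way.

    import numpy as np
    from scipy.fft import dctn, idctn

    def compress_blocks(image, block=8, keep=10):
        """DCT each block, keep only the `keep` largest coefficients, invert."""
        h, w = image.shape
        out = np.zeros((h, w))
        for y in range(0, h - h % block, block):
            for x in range(0, w - w % block, block):
                tile = image[y:y+block, x:x+block].astype(float)
                coeffs = dctn(tile, norm="ortho")
                # zero everything except the `keep` largest-magnitude coefficients
                thresh = np.sort(np.abs(coeffs), axis=None)[-keep]
                coeffs[np.abs(coeffs) < thresh] = 0.0
                out[y:y+block, x:x+block] = idctn(coeffs, norm="ortho")
        return out

Because each block is transformed independently, throwing away coefficients only smears detail within that 8x8 neighbourhood; a sharp edge three blocks away is untouched.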

Parameters to improve a music frequency analyzer

I'm using an FFT on audio data to output an analyzer display, like you'd see in Winamp or Windows Media Player. However, the output doesn't look that great. I'm plotting on a logarithmic scale and I average the linear results from the FFT into the corresponding logarithmic bins. As an example, I'm using bins like:
16k, 8k, 4k, 2k, 1k, 500, 250, 125, 62, 31, 15 [Hz]
Then I plot the magnitude (dB) against frequency [Hz]. The graph definitely 'reacts' to the music, and I can see the response of a drum sample or a high pitched voice. But the graph is very 'saturated' close to the lower frequencies, and overall doesn't look much like what you see in applications, which tend to be more evenly distributed. I feel that apps that display visual output tend to do different things to the data to make it look better.
What things could I do to the data to make it look more like the typical music player app?
Some useful information:
I downsample to a single channel at 32kHz and use a time window of 35ms, which means the FFT gets ~1100 points. I've varied these values to experiment (e.g. tried 16kHz, and increased/decreased the interval length) but I get similar results.
With an FFT of 1100 points, you probably aren't able to capture the low frequencies with much frequency resolution (the bin spacing is roughly 32000 / 1100 ≈ 29 Hz, which is wider than your entire bottom octave).
Think about it, 30 Hz corresponds to a period of 33ms, which at 32kHz is roughly 1000 samples. So you'll only be able to capture about 1 period in this time.
Thus, you'll need a longer FFT window to capture those low frequencies with sharp frequency resolution.
You'll likely need a time window of 4000 samples or more to start getting noticeably more frequency resolution at the low frequencies. This will be fine too, since you'll still get about 8-10 spectrum updates per second.
Another option, if you want very fast updates for the high-frequency bins but good frequency resolution at the low frequencies, is to update the high-frequency bins more quickly (for example with the windows you're currently using) but compute the low-frequency bins less often (and with the larger windows necessary for good frequency resolution).
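Here is a small Python/NumPy sketch of the long-window, log-binned magnitude computation described above. The names are assumptions: x is your mono float signal at 32 kHz, the band edges mirror the bins from the question, and the 4096-sample window is the "longer FFT" suggestion.

    import numpy as np

    RATE = 32000
    N = 4096          # ~128 ms window: about 8 Hz per bin, and ~8 updates per second

    def log_spectrum(frame, edges=(31, 62, 125, 250, 500, 1000,
                                   2000, 4000, 8000, 16000)):
        """Return one dB value per logarithmic band for a single frame."""
        windowed = frame * np.hanning(len(frame))
        mag = np.abs(np.fft.rfft(windowed))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / RATE)
        bands = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            sel = (freqs >= lo) & (freqs < hi)
            power = np.mean(mag[sel] ** 2) if np.any(sel) else 0.0
            bands.append(10 * np.log10(power + 1e-12))   # small offset avoids log(0)
        return bands

    # bands = log_spectrum(x[pos:pos + N])   # slide pos forward for each display update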
I think a lot of these applications have variable FFT bins.
What you could do is start with very wide, evenly spaced FFT bins like you have now and keep track of the number of elements that land in each bin. If some of the bins are barely used at all (usually the higher frequencies), widen those bins so that they are larger (and thus cover more frequency entries) and shrink the low-frequency bins.
I have worked on projects where we just spent a lot of time tuning bins for specific input sources, but it is much nicer to have the software adjust in real time.
A typical visualizer would use constant-Q bandpass filters, not a single FFT.
You could emulate a set of constant-Q bandpass filters by multiplying the FFT results by a set of constant-Q filter responses in the frequency domain, then sum. For low frequencies, you should use an FFT longer than the significant impulse response of the lowest frequency filter. For high frequencies, you can use shorter FFTs for better responsiveness. You can slide any length FFTs along at any desired update rate by overlapping (re-using) data, or you might consider interpolation. You might also want to pre-window each FFT to reduce "spectral leakage" between frequency bands.
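As a rough illustration of the weighting idea (not a true constant-Q transform), here is a Python/NumPy sketch that applies one Gaussian response per band to the magnitude spectrum of a single long frame; the centre frequencies, Q value, and sample rate are all assumptions you would tune.

    import numpy as np

    def constant_q_bands(frame, rate=32000, q=4.0,
                         centres=(31, 62, 125, 250, 500, 1000, 2000, 4000, 8000, 16000)):
        """Approximate constant-Q band levels (dB) by weighting FFT magnitudes."""
        mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / rate)
        levels = []
        for fc in centres:
            bw = fc / q                                  # constant Q: bandwidth grows with frequency
            weights = np.exp(-0.5 * ((freqs - fc) / bw) ** 2)
            levels.append(10 * np.log10(np.sum((weights * mag) ** 2) + 1e-12))
        return levels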

How much data counts as mass data? How many dimensions count as high-dimensional?

I'm about to start a master's degree, and my advisor's research direction is data mining for high-dimensional mass data.
But I still can't picture what mass data actually is, or how many dimensions count as high-dimensional.
Thanks!
Mass data? Well, you can consider that all of Google's requests, taken as a stream, constitute mass data.
High dimensions? Imagine a Google engineer studying a few topics like "five-legged dogs". He can treat every user as a dimension and compute some correlation measures. And there are a lot of users.
Now, back to the point: there are no clear definitions of mass data or of high dimensions. However, you can consider that:
If you have so much data that you cannot store all of it on one machine (I'm talking about HDD space, not just RAM), it's mass data.
If your algorithms begin to fail because of the curse of dimensionality, it's high-dimensional. 1,000,000 dimensions is surely high-dimensional, and you can often consider 1,000 to be high-dimensional too.

What is the correct formula for amplifying WaveForm audio?

I am wondering what the correct formula for amplifying WaveForm audio is, in C++.
Let's say there's 16-bit waveform data like the following:
0x0000, 0x2000, 0x3000, 0x2000, 0x0000, (negative part), ...
For acoustic reasons, simply doubling the numbers won't make the audio sound twice as loud, like this:
0x0000, 0x4000, 0x6000, 0x4000, 0x0000, (doubled negative part), ...
If someone knows audio modification well, please let me know.
If you double all the sample values it will indeed sound "twice as loud", that is, 6dB louder. Of course, you need to be careful to avoid distortion due to clipping; that's the main reason why all professional audio processing software today uses float samples internally.
You may need to get back to integers when finally outputting the sound data. If you're just writing a plugin for some DAW (as I would recommend if you want to program simple yet effective sound FX), it will do all this stuff for you: you just get a float, do something with it, and output a float again. But if you want to, for instance, directly write a .wav file, you first need to limit the output so that everything above 0dB (which is +-1 in a usual float stream) is clipped to just +-1. Then you can multiply by the largest value your target integer type can hold (32767 for signed 16-bit) and just cast to that type. Done.
Anyway, you're certainly right that it's important to scale your volume knob logarithmically rather than linearly (many consumer-grade programs don't, which is just stupid because you end up using values very close to the left end of the knob's range most of the time), but that has nothing to do with the amplification calculation itself; it's just because we perceive the loudness of signals on a logarithmic scale. The loudness itself is still determined by simple multiplication of the sound pressure by a constant factor, which in turn is proportional to the voltage in the analog circuitry and to the values of the digital samples in any DSP.
Another thing: I don't know how far you're intending to go, but if you want to do this really properly you should not just clip away peaks that are over 0dB (the clipping sounds very harsh), but implement a proper compressor/limiter. This would then automatically prevent clipping by reducing the level of the loudest parts. You don't want to overdo this either (popular music is usually over-compressed anyway, and as a result a lot of the dynamic musical expression is lost), but it is still a "less dangerous" way of increasing the audio level.
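For reference, here is a sketch of the float-domain gain, hard clip, and int16 conversion described above, written in Python/NumPy for brevity (the arithmetic maps directly to C++); the dB-based gain parameter is my own choice, not something from the question.

    import numpy as np

    def amplify(samples_int16, gain_db):
        x = samples_int16.astype(np.float32) / 32768.0   # int16 -> float in roughly [-1, 1)
        x *= 10.0 ** (gain_db / 20.0)                    # dB gain -> linear factor (+6 dB ~ x2)
        x = np.clip(x, -1.0, 1.0)                        # crude limiter: hard clip at 0 dBFS
        return (x * 32767.0).astype(np.int16)            # back to int16 for the .wav file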
I've always used linear multiplication for this and it never failed. It even works for fade-outs, for example...
so

    float amp = 1.2f;
    short sample;                              // the next input sample
    // cast the product, not just amp (otherwise the gain truncates to 1),
    // and clamp if amp can push the result past SHRT_MAX
    short newSample = (short)(amp * sample);

If you want your fade-out to be linear, then in the sample processing loop do

    amp -= 0.03;

and if you want it to be logarithmic, in the sample processing loop do

    amp *= 0.97;

until amp reaches some small value (amp < 0.1).
This may just be a perception problem. Your ears (and eyes; look up gamma w.r.t. video) don't perceive loudness in a linear response to the input. A good model is that your ears perceive roughly a ln(n) increase in loudness for an n-fold increase in level. Look up the difference between linear pots and audio pots.
Anyway, I don't know if that matters here because your output amp may account for it, but if you want the result to be perceived as twice as loud you may have to make it e^2 times as loud, which may put you in the realm of clipping.

detecting pauses in a spoken word audio file using pymad, pcm, vad, etc

First I am going to broadly state what I'm trying to do and ask for advice. Then I will explain my current approach and ask for answers to my current problems.
Problem
I have an MP3 file of a person speaking. I'd like to split it up into segments roughly corresponding to a sentence or phrase. (I'd do it manually, but we are talking hours of data.)
If you have advice on how to do this programmatically or pointers to some existing utilities, I'd love to hear it. (I'm aware of voice activity detection and I've looked into it a bit, but I didn't see any freely available utilities.)
Current Approach
I thought the simplest thing would be to scan the MP3 at certain intervals and identify places where the average volume was below some threshold. Then I would use some existing utility to cut up the mp3 at those locations.
I've been playing around with pymad and I believe that I've successfully extracted the PCM (pulse code modulation) data for each frame of the mp3. Now I am stuck because I can't really seem to wrap my head around how the PCM data translates to relative volume. I'm also aware of other complicating factors like multiple channels, big endian vs little, etc.
Advice on how to map a group of pcm samples to relative volume would be key.
Thanks!
PCM is a time-framed encoding of sound: for each time frame you get a peak level. (If you want a physical reference for this: the peak level corresponds to how far the microphone membrane was pushed out of its resting position at that given time.)
Let's ignore the fact that PCM can use unsigned values for 8-bit samples, and focus on signed values. If the value is > 0, the membrane was on one side of its resting position; if it is < 0, it was on the other side. The bigger the displacement from rest (no matter to which side), the louder the sound.
Most voice classification methods start with one very simple step: They compare the peak level to a threshold level. If the peak level is below the threshold, the sound is considered background noise.
Looking at the parameters in Audacity's Silence Finder, the silence level should be that threshold. The next parameter, Minimum silence duration, is obviously the length of a silence period that is required to mark a break (or in your case, the end of a sentence).
If you want to code a similar tool yourself, I recommend the following approach:
Divide your sound sample into discrete sets of a specific duration. I would start with 1/10, 1/20 or 1/100 of a second.
For each of these sets, compute the maximum peak level.
Compare this maximum peak to a threshold (the silence level in Audacity). The threshold is something you have to determine yourself, based on the specifics of your sound sample (loudness, background noise etc.). If the max peak is below your threshold, this set is silence.
Now analyse the series of classified sets: Calculate the length of silence in your recording. (length = number of silent sets * length of a set). If it is above your Minimum silence duration, assume that you have the end of a sentence here.
The main point in coding this yourself instead of continuing to use Audacity is that you can improve your classification by using advanced analysis methods. One very simple metric you can apply is called zero crossing rate, it just counts how often the sign switches in your given set of peak levels (i.e. your values cross the 0 line). There are many more, all of them more complex, but it may be worth the effort. Have a look at discrete cosine transformations for example...
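Here is a Python/NumPy sketch of that chunked peak-threshold classification. Everything specific in it is an assumption: pcm is the decoded mono int16 NumPy array (e.g. what you pull out of pymad), and the sample rate, threshold, and chunk length are placeholders you would tune exactly like Audacity's Silence Finder parameters.

    import numpy as np

    RATE = 44100                  # assumed sample rate of the decoded MP3
    CHUNK = RATE // 20            # 1/20 s per set, as suggested above

    def find_pauses(pcm, threshold=500, min_silence_s=0.5):
        """Return (start, end) sample indices of pauses at least min_silence_s long."""
        silent = []
        for i in range(len(pcm) // CHUNK):
            chunk = pcm[i * CHUNK:(i + 1) * CHUNK].astype(np.int32)
            silent.append(np.max(np.abs(chunk)) < threshold)
            # a per-chunk zero-crossing-rate check would be a cheap refinement here
        pauses, run_start = [], None
        for i, is_silent in enumerate(silent + [False]):   # the sentinel flushes a trailing run
            if is_silent and run_start is None:
                run_start = i
            elif not is_silent and run_start is not None:
                if (i - run_start) * CHUNK >= min_silence_s * RATE:
                    pauses.append((run_start * CHUNK, i * CHUNK))
                run_start = None
        return pauses

The returned sample ranges are the candidate sentence breaks; you'd then cut the MP3 at (for example) the midpoint of each pause.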
Just wanted to update this. I'm having moderate success using Audacity's Silence Finder. However, I'm still interested in this problem. Thanks.
PCM is simply a sequence of sampled amplitudes: each value is a signed number proportional to the instantaneous level of the waveform (it is not a stream of up/down bits; that scheme is delta modulation).
To estimate amplitude, take a window of samples and compute the peak absolute value or the RMS over that window. Plot that over time and you should be able to pick out the spots where the amplitude is lower.
You may also try to use a Fourier transform to estimate where the signals are most distinct.