How to pan audio sample data naturally? - c++

I'm developing a Flutter plugin that targets only Android for now. It's a kind of synthesis tool: users can load an audio file into memory, adjust its pitch (not pitch shift), and play multiple sounds with minimal delay using the audio library Oboe.
I managed to get PCM data from the audio files that the MediaCodec class supports, and I also succeeded in handling pitch by manipulating playback through direct access to the PCM array.
This PCM data is stored as a float array with samples ranging from -1.0 to 1.0. I now want to support a panning feature, just like the built-in Android class SoundPool, and I'm planning to follow how SoundPool handles panning. There are two values I have to pass to SoundPool when panning: left and right. Both are floats and must range from 0.0 to 1.0.
For example, if I pass (1.0F, 0.0F), users hear the sound only in the left ear, while (1.0F, 1.0F) is normal (centered). Panning wasn't a problem... until I ran into stereo sounds. I know what to do to pan stereo PCM data, but I don't know how to make the panning sound natural.
If I shift the sound all the way to the left, the right channel must also be played on the left side; conversely, if I shift it all the way to the right, the left channel must be played on the right side. I also noticed that there is something called a pan rule (pan law), which says the sound should be a little louder when it's shifted to one side (about +3 dB). I tried to find a way to perform a natural panning effect, but I really couldn't find an algorithm or reference for it.
Below is the structure of the float stereo PCM array. I didn't modify the array when decoding the audio files, so it should be the common interleaved layout:
[left_channel_sample_0, right_channel_sample_0, left_channel_sample_1, right_channel_sample_1,
...,
left_channel_sample_n, right_channel_sample_n]
and I have to pass this PCM array to the audio stream as in the C++ code below:
void PlayerQueue::renderStereo(float * audioData, int32_t numFrames) {
    for(int i = 0; i < numFrames; i++) {
        //When audio file is stereo...
        if(player->isStereo) {
            if((offset + i) * 2 + 1 < player->data.size()) {
                audioData[i * 2] += player->data.at((offset + i) * 2);
                audioData[i * 2 + 1] += player->data.at((offset + i) * 2 + 1);
            } else {
                //PCM data reached end
                break;
            }
        } else {
            //When audio file is mono...
            if(offset + i < player->data.size()) {
                audioData[i * 2] += player->data.at(offset + i);
                audioData[i * 2 + 1] += player->data.at(offset + i);
            } else {
                //PCM data reached end
                break;
            }
        }
        //Prevent clipping: keep samples in [-1.0, 1.0]
        if(audioData[i * 2] > 1.0)
            audioData[i * 2] = 1.0;
        else if(audioData[i * 2] < -1.0)
            audioData[i * 2] = -1.0;
        if(audioData[i * 2 + 1] > 1.0)
            audioData[i * 2 + 1] = 1.0;
        else if(audioData[i * 2 + 1] < -1.0)
            audioData[i * 2 + 1] = -1.0;
    }
    //Add numFrames to offset, so playback continues from here in the next callback
    offset += numFrames;
    if(offset >= player->data.size()) {
        offset = 0;
        queueEnded = true;
    }
}
I excluded the playback-manipulation (pitch) calculations to simplify the code. As you can see, I have to pass the PCM data to the audioData float array manually. I add the PCM data so I can mix multiple sounds, including multiple instances of the same sound.
How can I perform a panning effect with this PCM array? It would be nice to follow the SoundPool mechanism, but anything that pans properly is fine. (For example, the pan value could simply range from -1.0 to 1.0, with 0 meaning centered.)
When applying the pan rule, what is the relationship between PCM values and decibels? I know how to make a sound louder, but I don't know how to make it louder by an exact number of decibels. Is there a formula for this?

Pan rules or pan laws are implemented a bit differently from manufacturer to manufacturer.
One frequently used implementation is that when a sound is panned fully to one side, that side is played at full volume, whereas the other side is attenuated completely. If the sound is played at the center, both sides are attenuated by roughly 3 decibels.
To do this you can multiply the sound source by the calculated amplitude, e.g. (untested pseudo code):
player->data.at((offset + i) * 2) * 1.0; // left signal at full volume
player->data.at((offset + i) * 2 + 1) * 0.0; // right signal fully attenuated
To get the desired amplitudes you can use the cos function for the left channel and the sin function for the right channel.
Notice that when the input to sin and cos is pi/4, the amplitude is about 0.707 on both sides. This gives you the roughly 3 dB attenuation on both sides at the center.
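(For the decibel question: a linear gain factor g corresponds to 20 * log10(g) dB, so g = 10^(dB / 20). That is why 0.707 is about -3 dB, and boosting by +3 dB means multiplying the samples by roughly 1.41.)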
So all that is left to do is to map the pan range [-1, 1] to the angle range [0, pi/2].
For example, assuming you have a pan value in the range [-1, 1], where -1 means fully left and +1 means fully right (untested pseudo code):
pan_mapped = ((pan + 1) / 2.0) * (pi / 2.0);
left_amplitude  = cos(pan_mapped);
right_amplitude = sin(pan_mapped);
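Applied to the mono branch of renderStereo from the question, this could look roughly like the sketch below. It is only a sketch under the assumption that the player object carries a pan value in [-1, 1]; that member and the helper name are illustrative, not part of the original code.

// Equal-power pan: pan in [-1, 1], -1 = fully left, +1 = fully right.
// Both gains are about 0.707 (-3 dB) at pan = 0. Requires <cmath>.
static void panGains(float pan, float &leftGain, float &rightGain) {
    const float kHalfPi = 1.57079632679f;
    const float angle = (pan + 1.0f) * 0.5f * kHalfPi;
    leftGain  = std::cos(angle);
    rightGain = std::sin(angle);
}

// Inside the mono branch of renderStereo:
//     float leftGain, rightGain;
//     panGains(player->pan, leftGain, rightGain);      // player->pan is assumed to exist
//     float sample = player->data.at(offset + i);
//     audioData[i * 2]     += sample * leftGain;
//     audioData[i * 2 + 1] += sample * rightGain;

Computing the gains once per render call (rather than per sample) keeps the cost negligible.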
UPDATE:
Another frequently used option (e.g. in the Pro Tools DAW) is to have a pan setting on each side, effectively treating the stereo source as two mono sources. This allows you to place the left source freely in the stereo field without affecting the right source.
To do this you would do the following, where left_pan and right_pan are angles in [0, pi/2], with 0 meaning fully left and pi/2 fully right: (untested pseudo code)
left_output  += left_source(i) * cos(left_pan)
right_output += left_source(i) * sin(left_pan)
left_output  += right_source(i) * cos(right_pan)
right_output += right_source(i) * sin(right_pan)
The settings of these two pans are up to the operator and depend on the recording and the desired effect.
How you map this to a single pan control is up to you. I would just advise that when the pan is 0 (centered), the left channel is played only on the left side and the right channel only on the right side. Otherwise you would interfere with the original stereo recording.
One possibility would be that the segment [-1, 0) controls the right source's pan, leaving the left source untouched, and vice versa for [0, 1]:
import math

hPi = math.pi / 2.0

def stereoPan(x):
    if x < 0.0:
        print("left source:")
        print(1.0)                     # amplitude to left channel
        print(0.0)                     # amplitude to right channel
        print("right source:")
        print(math.sin(abs(x) * hPi))  # amplitude to left channel
        print(math.cos(abs(x) * hPi))  # amplitude to right channel
    else:
        print("left source:")
        print(math.cos(x * hPi))       # amplitude to left channel
        print(math.sin(x * hPi))       # amplitude to right channel
        print("right source:")
        print(0.0)                     # amplitude to left channel
        print(1.0)                     # amplitude to right channel
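For the stereo case, here is a sketch of how those gains could be wired into the stereo branch of renderStereo. Again, player->pan is an assumed member, the struct and function names are illustrative, and std::sin/std::cos need <cmath>.

// pan in [-1, 1]: negative pans toward the left, positive toward the right.
// Mirrors the stereoPan() logic above: the source channel on the far side is
// folded into the near output with equal-power gains, the near channel stays put.
struct StereoPanGains {
    float leftToLeft, leftToRight;    // gains applied to the left source channel
    float rightToLeft, rightToRight;  // gains applied to the right source channel
};

static StereoPanGains stereoPanGains(float pan) {
    const float kHalfPi = 1.57079632679f;
    StereoPanGains g;
    if (pan < 0.0f) {                                  // panning left
        g.leftToLeft  = 1.0f;                          // left source untouched
        g.leftToRight = 0.0f;
        g.rightToLeft  = std::sin(-pan * kHalfPi);     // right source moves left
        g.rightToRight = std::cos(-pan * kHalfPi);
    } else {                                           // panning right (or centered)
        g.leftToLeft  = std::cos(pan * kHalfPi);       // left source moves right
        g.leftToRight = std::sin(pan * kHalfPi);
        g.rightToLeft  = 0.0f;                         // right source untouched
        g.rightToRight = 1.0f;
    }
    return g;
}

// In the stereo branch of renderStereo:
//     StereoPanGains g = stereoPanGains(player->pan);
//     float l = player->data.at((offset + i) * 2);
//     float r = player->data.at((offset + i) * 2 + 1);
//     audioData[i * 2]     += l * g.leftToLeft  + r * g.rightToLeft;
//     audioData[i * 2 + 1] += l * g.leftToRight + r * g.rightToRight;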

The following is not meant to contradict anything in the excellent answer given by @ruff09. I'm just going to add some thoughts and theory that I think are relevant when trying to emulate panning.
I'd like to point out that simply using volume differences has a couple of drawbacks. First off, it doesn't match the real-world phenomenon. Imagine you are walking down a sidewalk and right there on the street, on your right, is a worker with a jackhammer. We could make the sound 100% volume on the right and 0% on the left. But in reality much of what we hear from that source also arrives at the left ear, drowning out other sounds.
If you omit left-ear volume for the jackhammer to get maximum right-pan, then even quiet sounds on the left remain audible (which is absurd), since they are not competing with any jackhammer content on the left track. If you do keep some left-ear volume for the jackhammer, then the volume-based panning effect swings the perceived location back toward the center. Dilemma!
How do our ears tell locations apart in such situations? I know of two processes that can be incorporated into a panning algorithm to make it more "natural." One is a filtering component: high frequencies whose wavelengths are smaller than the width of our head get attenuated on the far side, so you could apply some differential low-pass filtering to your sounds. The other is timing: in our scenario, the jackhammer sounds reach the right ear a fraction of a millisecond before they reach the left, so you could also add a bit of delay based on the panning angle. The time-based cue works most clearly for frequency content with wavelengths larger than our head (i.e., some high-pass filtering would also be a component).
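To give an idea of the delay component, here is a minimal sketch of a per-sample interaural delay for a mono source, assuming a 48 kHz stream (the maximum interaural time difference of roughly 0.6-0.7 ms is about 30 samples at 48 kHz). The class and member names are illustrative; in practice you would combine this with the amplitude gains above and smooth pan changes to avoid clicks.

#include <cmath>
#include <vector>

// Crude interaural-time-difference cue for a mono source: the ear facing away
// from the source hears the signal a handful of samples later.
// pan in [-1, 1]; -1 = source fully left, +1 = source fully right.
class ItdPanner {
public:
    explicit ItdPanner(int maxDelaySamples = 30)          // ~0.6 ms at 48 kHz
        : buffer_(maxDelaySamples + 1, 0.0f), writePos_(0) {}

    void process(float in, float pan, float &outL, float &outR) {
        buffer_[writePos_] = in;
        const int size  = (int)buffer_.size();
        const int delay = (int)std::lround(std::fabs(pan) * (size - 1));
        const float delayed = buffer_[(writePos_ + size - delay) % size];
        if (pan >= 0.0f) { outL = delayed; outR = in; }      // source on the right: left ear lags
        else             { outL = in;      outR = delayed; } // source on the left: right ear lags
        writePos_ = (writePos_ + 1) % size;
    }

private:
    std::vector<float> buffer_;
    int writePos_;
};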
There has also been a great deal of work on how the shape of our ears filters sounds differently depending on direction. I think we learn to use this as we grow up by subconsciously associating different timbres with different locations (this especially pertains to elevation and front-versus-back ambiguities in stereo).
There are big computation costs, though, so simplifications such as sticking with purely amplitude-based panning are the norm. Thus, for sounds in a 3D world, it is probably best to prefer mono source content for items that need dynamic location changes, and to use stereo content only for background music or ambient content that doesn't need dynamic panning based on player location.
I want to do some more experimenting with dynamic time-based panning combined with a bit of amplitude panning, to see if this can be used effectively with stereo cues. Implementing a dynamic delay is a little tricky, but not as costly as filtering. I'm wondering whether there are ways to record (preprocess) a sound source to make it more amenable to incorporating real-time filter- and time-based manipulation for effective panning.

Related

Mapping mouse position from full screen to smaller frame

I have been stuck on remapping my mouse position to a new frame for a few days and I am unsure what to do. I will provide images to describe my issue. The main problem is that I want to click on an object in my program and have the program highlight the object I select (in 3D space). I have this working perfectly when my application is in full-screen mode. I recently started rendering my scene into a smaller frame so that I can have editor tools on the sides (like Unity). Here is the transition (graphically) I made from working to not working:
So essentially the mouse coordinates go from (0, 0) to (screenWidth, screenHeight). I want to map these coordinates to the range (frameStartX, frameStartY) to (frameStartX + frameWidth, frameStartY + frameHeight). I did some research on linearly transforming a number to scale it to a new range, so I thought I could do this:
float frameMousePosX = (mousePosX - 0) / (screenWidth - 0) * ((frameWidth + frameStartX) - frameStartX) + frameStartX;
float frameMousePosY = (mousePosY - 0) / (screenHeight - 0) * ((frameHeight + frameStartY) - frameStartY) + frameStartY;
I assumed this would work, but it doesn't. It's not even close. I am really unsure what to do to get this transformation right.
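(One thing worth checking with a formula like this is integer division: if mousePosX and screenWidth are integer types, mousePosX / screenWidth truncates to 0 before the multiplication, which is one common reason a remap like this ends up wildly off. A generic float remap helper along these lines avoids that; the names are illustrative:)

// Maps v from [inMin, inMax] to [outMin, outMax]; everything is a float so the
// division does not truncate.
float remap(float v, float inMin, float inMax, float outMin, float outMax) {
    return (v - inMin) / (inMax - inMin) * (outMax - outMin) + outMin;
}

// e.g. frameMousePosX = remap((float)mousePosX, 0.0f, (float)screenWidth,
//                             (float)frameStartX, (float)(frameStartX + frameWidth));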
Assuming the transformation works, I would want it to read (0, 0) at the bottom left of the frame, which is (x, y) in the attached image, and then reach its maximum at the top right of the framed scene.
Any help would be extremely appreciated.

Raytracer : High FOV distortion

I'm currently working on a C++ raytracer and I'm running into a classic raytracing problem: with a high vertical FOV, shapes get more and more distorted the nearer they are to the edges of the image.
I know why this distortion happens, but I don't know how to resolve it (of course, reducing the FOV is an option, but I think there is something to change in my code). I've browsed different computing forums but didn't find any way to resolve it.
Here's a screenshot to illustrate my problem.
I think the problem is that the view plane I'm projecting my rays onto isn't actually flat, but I don't know how to resolve this. If you have any tips for resolving it, I'm open to suggestions.
I'm using a right-handed coordinate system.
The camera system vectors, the direction vector, and the light vector are normalized.
If you need some code to check something, I'll put the part you ask about in an answer.
Code of the ray generation:
// PixelScreenX = (pixelX + 0.5) / imageWidth
// PixelCameraX = (2 * PixelScreenX - 1) * ImageAspectRatio * tan(fov / 2)
float x = (2 * (i + 0.5f) / (float)options.width - 1) *
          options.imageAspectRatio * options.scale;

// PixelScreenY = (pixelY + 0.5) / imageHeight
// PixelCameraY = (1 - 2 * PixelScreenY) * tan(fov / 2)
float y = (1 - 2 * (j + 0.5f) / (float)options.height) * options.scale;

Vec3f dir;
options.cameraToWorld.multDirMatrix(Vec3f(x, y, -1), dir);
dir.normalize();

newColor = _renderer->castRay(options.orig, dir, objects, options);
There is nothing wrong with your projection. It produces exactly what it should produce.
Let's consider the following figure to see how all the quantities interact:
We have the camera position, the field of view (as an angle), and the image plane. The image plane is the plane you are projecting your 3D scene onto; essentially, it represents your screen. When you are viewing your rendering on the screen, your eye serves as the camera. It sees the projected image and, if it is positioned at the right point, it will see exactly what it would see if the actual 3D scene were there (neglecting effects like depth of field, etc.).
Obviously, you cannot modify your screen (you could change the window size, but let's stick with a constant-size image plane). There is then a direct relationship between the camera's position and the field of view: as the field of view increases, the camera moves closer and closer to the image plane. Like this:
Thus, if you increase your field of view in code, you need to move your eye closer to the screen to get the correct perception. You can actually try that with your image: move your eye very close to the screen (I'm talking about 3 cm). If you look at the outer spheres now, they actually look like real balls again.
In summary, the field of view should approximately match the geometry of the viewing setup. For a given screen size and average viewing distance, it can be calculated easily. If your viewing setup does not match the assumptions in your code, 3D perception will break down.
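For example, a rough sketch of that calculation (assuming the viewer sits at distance viewingDistance from a screen of height screenHeight, both in the same units):

#include <cmath>

// Vertical FOV (in radians) that matches a screen of height `screenHeight`
// viewed from distance `viewingDistance`.
double naturalVerticalFov(double screenHeight, double viewingDistance) {
    return 2.0 * std::atan((screenHeight * 0.5) / viewingDistance);
}

// e.g. a 30 cm tall monitor viewed from 60 cm away:
// naturalVerticalFov(30.0, 60.0) is about 0.49 rad, i.e. roughly 28 degrees.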

C++ - Smooth speedup and slowing of objects

I am dealing with positions of objects in Cocos2dx, but this question applies to virtually every situation in which a smooth start and stop is necessary.
Here's what I am looking for:
Given an origin position of x = 0 and a final position of x = 8, I want the object to accelerate slowly at first, move faster the further it gets from the start, and then slow down as it reaches the end. Is there a smoothing algorithm for this?
There are lots of algorithms for this. One idea is to set up a linear interpolation:
x(t) = (1.0 - t) * x0 + t * x1;
If you feed evenly spaced values of t from 0.0 to 1.0, you'll get a smooth, linear animation.
If you want a slow start and a slow end, you can instead use t = sin(theta)/2.0 + 0.5 for theta going from -pi/2 to pi/2.
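A small sketch of that easing, where x0 and x1 are the two endpoints and u in [0, 1] is the raw animation progress (the function name is illustrative):

#include <cmath>

// Ease-in/ease-out interpolation between x0 and x1.
// u is the raw progress in [0, 1]; the sine remap slows both the start and the end.
float easeInOut(float x0, float x1, float u) {
    const float kPi = 3.14159265358979f;
    float theta = -kPi / 2.0f + u * kPi;          // maps [0, 1] to [-pi/2, pi/2]
    float t = std::sin(theta) / 2.0f + 0.5f;      // eased progress, still in [0, 1]
    return (1.0f - t) * x0 + t * x1;
}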
A second-order smooth path has constant acceleration during the first half, then constant deceleration during the second part.
This means you accelerate from x = 0 to x = 4. The formula for the first half is x(t) = a*t*t, so your choice of the acceleration parameter a directly determines the time needed: the halfway point x = 4 is reached at t1 = sqrt(4/a). If you decelerate at the same rate, you arrive at x = 8 after twice that time, and the formula for the second half is x(t) = 8 - a*(2*t1 - t)^2.
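A sketch of that piecewise motion as a position-versus-time function (using the same a and the fixed distance of 8 from the answer; the function name is illustrative):

#include <cmath>

// Constant acceleration to the midpoint x = 4, then symmetric deceleration to x = 8.
float accelDecelPosition(float a, float t) {
    const float tHalf = std::sqrt(4.0f / a);      // time at which x = 4 is reached
    const float tEnd  = 2.0f * tHalf;             // total travel time
    if (t <= tHalf)
        return a * t * t;                         // speeding up
    if (t >= tEnd)
        return 8.0f;                              // arrived
    const float r = tEnd - t;                     // time remaining
    return 8.0f - a * r * r;                      // slowing down
}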

Draw sound wave with possibility to zoom in/out

I'm writing a sound editor for my graduation project. I'm using BASS to extract samples from MP3, WAV, OGG, etc. files and to add DSP effects like echo, flanger, and so on. Simply speaking, I made a framework that applies an effect from position1 to position2 and handles cut/paste management.
Now my problem is that I want to create a control, similar to the one in Cool Edit Pro, that draws a waveform representation of the song and has the ability to zoom in/out, select portions of the waveform, etc. After a selection I can do something like:
TInterval EditZone = WaveForm->GetSelection();
where TInterval has this form:
struct TInterval
{
    long Start;
    long End;
};
I'm a beginner when it comes to sophisticated drawing, so any hint on how to create a waveform representation of a song, using the sample data returned by BASS, with the ability to zoom in/out, would be appreciated.
I'm writing my project in C++, but I can understand C# and Delphi code, so feel free to post snippets in those two languages as well :)
Thanks, DrOptix
By zoom, I presume you mean horizontal zoom rather than vertical. The way audio editors do this is to scan the waveform, breaking it up into time windows where each pixel in X represents some number of samples. It can be a fractional number, but you can get away with disallowing fractional zoom ratios without annoying the user too much. Once you zoom out a bit, each window's max sample value is essentially always positive and its min value always negative.
For each pixel on the screen, you need to know the minimum sample value and the maximum sample value for that pixel. So you need a function that scans the waveform data in chunks and keeps track of the accumulated max and min for each chunk.
This is a slow process, so professional audio editors keep a pre-calculated table of min and max values at some fixed zoom ratio, typically 512:1 or 1024:1. When you are drawing at a zoom ratio of more than 1024 samples/pixel, you use the pre-calculated table; below that ratio you get the data directly from the file. If you don't do this, you will find that your drawing code gets too slow when you zoom out.
It's worthwhile to write code that handles all of the channels of the file in a single pass when doing this scanning; slowness here will make your whole program feel sluggish. It's the disk I/O that matters here, the CPU has no trouble keeping up, so straightforward C++ code is fine for building the min/max tables, but you don't want to go through the file more than once and you want to read it sequentially.
Once you have the min/max tables, keep them around. You want to go back to the disk as little as possible, and many of the reasons for repainting your window will not require you to rescan the min/max tables. The memory cost of holding on to them is not that high compared to the disk I/O cost of building them in the first place.
Then you draw the waveform by drawing a series of 1-pixel-wide vertical lines between the max value and the min value for the time represented by each pixel. This should be quite fast if you are drawing from pre-built min/max tables.
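A minimal sketch of such a min/max (peak) table for one channel, assuming the decoded samples are already in memory as floats (with BASS you would fill that buffer through its decoding API; the names here are illustrative):

#include <algorithm>
#include <cstddef>
#include <vector>

struct MinMax { float min, max; };

// Builds one MinMax entry per `samplesPerBucket` samples of a single channel.
// Drawing a column then just means finding the buckets it covers and taking
// the overall min and max of those buckets.
std::vector<MinMax> buildPeakTable(const std::vector<float>& samples,
                                   std::size_t samplesPerBucket = 1024) {
    std::vector<MinMax> table;
    for (std::size_t i = 0; i < samples.size(); i += samplesPerBucket) {
        MinMax mm{samples[i], samples[i]};
        const std::size_t end = std::min(samples.size(), i + samplesPerBucket);
        for (std::size_t j = i; j < end; ++j) {
            mm.min = std::min(mm.min, samples[j]);
            mm.max = std::max(mm.max, samples[j]);
        }
        table.push_back(mm);
    }
    return table;
}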
I've recently done this myself. As Marius suggests, you need to work out how many samples fall under each column of pixels. You then work out the minimum and maximum and plot a vertical line from the maximum to the minimum.
As a first pass this seemingly works fine. The problem you'll get is that as you zoom out, it starts to take too long to retrieve the samples from disk. As a solution to this, I built a "peak" file alongside the audio file. The peak file stores the minimum/maximum pairs for groups of n samples. Playing with n until you get the right value is up to you; personally I found 128 samples to be a good trade-off between size and speed. It's also worth remembering that, unless you are drawing a control larger than 65536 pixels in size, you needn't store this peak information as anything more than 16-bit values, which saves a bit of space.
Wouldn't you just plot the sample points on a 2D canvas? You should know how many samples there are per second for a file (read it from the header), and then plot the values on the y axis. Since you want to be able to zoom in and out, you need to control the number of samples per pixel (the zoom level). Next you take the average of the sample points for each pixel (for example, the average of every 5 points if you have 5 samples per pixel). Then you can use a 2D drawing API to draw lines between the points.
Using the open-source NAudio package:
using System;
using System.Collections.Generic;
using System.Linq;
using NAudio.Wave;

public class WavReader2
{
    private readonly WaveFileReader _objStream;

    public WavReader2(String sPath)
    {
        _objStream = new WaveFileReader(sPath);
    }

    // Returns one min/max pair per pixel column, given the samples-per-pixel zoom level.
    public List<SampleRangeValue> GetPixelGraph(int iSamplesPerPixel)
    {
        List<SampleRangeValue> colOutputValues = new List<SampleRangeValue>();
        if (_objStream != null)
        {
            _objStream.Position = 0;
            int iBytesPerSample = (_objStream.WaveFormat.BitsPerSample / 8) * _objStream.WaveFormat.Channels;
            int iNumPixels = (int)Math.Ceiling(_objStream.SampleCount / (double)iSamplesPerPixel);
            byte[] aryWaveData = new byte[iSamplesPerPixel * iBytesPerSample];
            _objStream.Position = 0; // startPosition + (e.ClipRectangle.Left * iBytesPerSample * iSamplesPerPixel);

            for (float iPixelNum = 0; iPixelNum < iNumPixels; iPixelNum += 1)
            {
                int iBytesRead = _objStream.Read(aryWaveData, 0, iSamplesPerPixel * iBytesPerSample);
                if (iBytesRead == 0)
                    break;

                // Collect the 16-bit sample values covered by this pixel column.
                List<short> colValues = new List<short>();
                for (int n = 0; n < iBytesRead; n += 2)
                {
                    short iSampleValue = BitConverter.ToInt16(aryWaveData, n);
                    colValues.Add(iSampleValue);
                }
                float fLowPercent = (float)((float)colValues.Min() / ushort.MaxValue);
                float fHighPercent = (float)((float)colValues.Max() / ushort.MaxValue);
                colOutputValues.Add(new SampleRangeValue(fHighPercent, fLowPercent));
            }
        }
        return colOutputValues;
    }
}

public struct SampleRangeValue
{
    public float HighPercent;
    public float LowPercent;

    public SampleRangeValue(float fHigh, float fLow)
    {
        HighPercent = fHigh;
        LowPercent = fLow;
    }
}

Optimizing a pinhole camera rendering system

I'm making a software rasterizer for school, and I'm using an unusual rendering method instead of traditional matrix calculations. It's based on a pinhole camera. I have a few points in 3D space, and I convert them to 2D screen coordinates by taking the vector between each point and the camera and normalizing it:
Vec3 ray_to_camera = (a_Point - plane_pos).Normalize();
This gives me a directional vector towards the camera. I then turn that direction into a ray by placing the ray's origin on the camera and performing a ray-plane intersection with a plane slightly behind the camera.
Vec3 plane_pos = m_Position + (m_Direction * m_ScreenDistance);
float dot = ray_to_camera.GetDotProduct(m_Direction);

if (dot < 0)
{
    float time = (-m_ScreenDistance - plane_pos.GetDotProduct(m_Direction)) / dot;

    // if time is smaller than 0 the ray is either parallel to the plane or misses it
    if (time >= 0)
    {
        // retrieving the actual intersection point
        a_Point -= (m_Direction * ((a_Point - plane_pos).GetDotProduct(m_Direction)));

        // subtracting the plane origin from the intersection point
        // puts the point at world origin (0, 0, 0)
        Vec3 sub = a_Point - plane_pos;

        // the axes are calculated by saying the directional vector of the camera
        // is the new z axis
        projected.x = sub.GetDotProduct(m_Axis[0]);
        projected.y = sub.GetDotProduct(m_Axis[1]);
    }
}
This works wonderfully, but I'm wondering: can the algorithm be made any faster? Right now, for every triangle in the scene, I have to normalize three vectors (one per vertex):
float length = 1 / sqrtf(GetSquaredLength());
x *= length;
y *= length;
z *= length;
Even with a fast reciprocal square root approximation (1 / sqrt(x)) that's going to be very demanding.
My questions are thus:
Is there a good way to approximate the three normals?
What is this rendering technique called?
Can the three vertex points be approximated using the normal of the centroid? ((v0 + v1 + v2) / 3)
Thanks in advance.
P.S. "You will build a fully functional software rasterizer in the next seven weeks with the help of an expert in this field. Begin." I ADORE my education. :)
EDIT:
Vec2 projected;

// the plane is behind the camera
Vec3 plane_pos = m_Position + (m_Direction * m_ScreenDistance);
float scale = m_ScreenDistance / (m_Position - plane_pos).GetSquaredLength();

// times -100 because of the squared length instead of the length
// (which would involve a square root)
projected.x = a_Point.GetDotProduct(m_Axis[0]) * scale * -100;
projected.y = a_Point.GetDotProduct(m_Axis[1]) * scale * -100;

return projected;
This returns the correct results; however, the model is now independent of the camera position. :(
It's a lot shorter and faster though!
This is called a ray tracer (a rather typical assignment for a first computer graphics course*), and you can find a lot of interesting implementation details in the classic Foley/van Dam textbook (Computer Graphics: Principles and Practice). I strongly suggest you buy or borrow this textbook and read it carefully.
*Just wait until you get started on reflections and refraction... Now the fun begins!
It is difficult to understand exactly what your code is doing, because it seems to perform a lot of redundant operations! However, if I understand what you say you're trying to do, you are:
finding the vector from the pinhole to the point
normalizing it
projecting backwards along the normalized vector to an "image plane" (behind the pinhole, natch!)
finding the vector to this point from a central point on the image plane
doing dot products on the result with "axis" vectors to find the x and y screen coordinates
If the above description represents your intentions, then the normalization should be redundant -- you shouldn't have to do it at all! If removing the normalization gives you bad results, you are probably doing something slightly different from your stated plan... in other words, it seems likely that you have confused yourself along with me, and that the normalization step is "fixing" it to the extent that it looks good enough in your test cases, even though it probably still isn't doing quite what you want it to.
The overall problem, I think, is that your code is massively overengineered: you are writing all your high-level vector algebra as code to be executed in the inner loop. The way to optimize this is to work out all your vector algebra on paper, find the simplest expression possible for your inner loop, and precompute all the necessary constants for this at camera setup time. The pinhole camera specs would only be the inputs to the camera setup routine.
Unfortunately, unless I miss my guess, this should reduce your pinhole camera to the traditional, boring old matrix calculations. (ray tracing does make it easy to do cool nonstandard camera stuff -- but what you describe should end up perfectly standard...)
Your code is a little unclear to me (plane_pos?), but it does seem that you could cut out some unnecessary calculation.
Instead of normalizing the ray (scaling it to length 1), why not scale it so that the z component is equal to the distance from the camera to the plane? In fact, just scale x and y by this factor; you don't need z.
float scale = distance_to_plane/z;
x *= scale;
y *= scale;
This will give the x and y coordinates on the plane, no sqrt(), no dot products.
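In other words, a sketch of the idea in camera space (assuming the point has already been expressed relative to the camera, with the camera looking down the z axis; the struct and function names here are illustrative, not the question's types):

struct Vec2f { float x, y; };
struct Vec3f { float x, y, z; };

// Projects a camera-space point onto an image plane at distance planeDistance
// in front of the camera using similar triangles: no normalization, no sqrt.
Vec2f projectPinhole(const Vec3f& p, float planeDistance) {
    const float scale = planeDistance / p.z;   // assumes p.z != 0 (point not on the camera plane)
    return Vec2f{ p.x * scale, p.y * scale };
}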
Well, off the bat, you can calculate the normals for every triangle when your program starts up. Then, when you're actually running, you just have to look the normals up. This sort of startup calculation to save costs later happens a lot in graphics; it's why we have long loading screens in a lot of our video games!