Writing multiple files slows down after x seconds - C++

I have code which gets frames from a camera and then saves them to disk. The structure of the code is: multiple threads malloc and copy their frames into new memory and enqueue the buffers. Finally, another thread removes frames from the queue and writes them (using the ffmpeg API, raw video, no compression) to their files (actually I'm using my own memory pool, so malloc is only called when I need more buffers). I can have up to 8 files/cams open and enqueuing at the same time.
The problem is that for the first 45 sec everything works fine: there's never more than one frame on the queue. But after that my queue gets backed up and processing takes just a few ms longer, resulting in increased RAM usage, because I cannot save the frames fast enough and have to malloc more memory to store them.
I have an 8-core, 16 GB RAM, Windows 7 64-bit computer (NTFS, lots of free space on the second disk drive). The disk is supposed to be able to write up to 6 Gbit/s. To save my data in time I need to be able to write at 50 MB/s. I tested disk speed with "PassMark PerformanceTest": with 8 threads writing files simultaneously, exactly like ffmpeg saves files (synchronized, uncached I/O), it was able to achieve 100 MB/s. So why aren't my writes able to achieve that?
Here's how the ffmpeg writes look in the Process Monitor logs:
Time of Day Operation File# Result Detail
2:30:32.8759350 PM WriteFile 8 SUCCESS Offset: 749,535,120, Length: 32,768
2:30:32.8759539 PM WriteFile 8 SUCCESS Offset: 749,567,888, Length: 32,768
2:30:32.8759749 PM WriteFile 8 SUCCESS Offset: 749,600,656, Length: 32,768
2:30:32.8759939 PM WriteFile 8 SUCCESS Offset: 749,633,424, Length: 32,768
2:30:32.8760314 PM WriteFile 8 SUCCESS Offset: 749,666,192, Length: 32,768
2:30:32.8760557 PM WriteFile 8 SUCCESS Offset: 749,698,960, Length: 32,768
2:30:32.8760866 PM WriteFile 8 SUCCESS Offset: 749,731,728, Length: 32,768
2:30:32.8761259 PM WriteFile 8 SUCCESS Offset: 749,764,496, Length: 32,768
2:30:32.8761452 PM WriteFile 8 SUCCESS Offset: 749,797,264, Length: 32,768
2:30:32.8761629 PM WriteFile 8 SUCCESS Offset: 749,830,032, Length: 32,768
2:30:32.8761803 PM WriteFile 8 SUCCESS Offset: 749,862,800, Length: 32,768
2:30:32.8761977 PM WriteFile 8 SUCCESS Offset: 749,895,568, Length: 32,768
2:30:32.8762235 PM WriteFile 8 SUCCESS Offset: 749,928,336, Length: 32,768, Priority: Normal
2:30:32.8762973 PM WriteFile 8 SUCCESS Offset: 749,961,104, Length: 32,768
2:30:32.8763160 PM WriteFile 8 SUCCESS Offset: 749,993,872, Length: 32,768
2:30:32.8763352 PM WriteFile 8 SUCCESS Offset: 750,026,640, Length: 32,768
2:30:32.8763502 PM WriteFile 8 SUCCESS Offset: 750,059,408, Length: 32,768
2:30:32.8763649 PM WriteFile 8 SUCCESS Offset: 750,092,176, Length: 32,768
2:30:32.8763790 PM WriteFile 8 SUCCESS Offset: 750,124,944, Length: 32,768
2:30:32.8763955 PM WriteFile 8 SUCCESS Offset: 750,157,712, Length: 32,768
2:30:32.8764072 PM WriteFile 8 SUCCESS Offset: 750,190,480, Length: 4,104
2:30:32.8848241 PM WriteFile 4 SUCCESS Offset: 750,194,584, Length: 32,768
2:30:32.8848481 PM WriteFile 4 SUCCESS Offset: 750,227,352, Length: 32,768
2:30:32.8848749 PM ReadFile 4 END OF FILE Offset: 750,256,128, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
2:30:32.8848989 PM WriteFile 4 SUCCESS Offset: 750,260,120, Length: 32,768
2:30:32.8849157 PM WriteFile 4 SUCCESS Offset: 750,292,888, Length: 32,768
2:30:32.8849319 PM WriteFile 4 SUCCESS Offset: 750,325,656, Length: 32,768
2:30:32.8849475 PM WriteFile 4 SUCCESS Offset: 750,358,424, Length: 32,768
2:30:32.8849637 PM WriteFile 4 SUCCESS Offset: 750,391,192, Length: 32,768
2:30:32.8849880 PM WriteFile 4 SUCCESS Offset: 750,423,960, Length: 32,768, Priority: Normal
2:30:32.8850400 PM WriteFile 4 SUCCESS Offset: 750,456,728, Length: 32,768
2:30:32.8850727 PM WriteFile 4 SUCCESS Offset: 750,489,496, Length: 32,768, Priority: Normal
This looks very efficient. However, according to DiskMon, the actual disk writes are ridiculously fragmented, seeking back and forth, which may account for the slow speed. See the graph of the write speed according to this data (~5 MB/s).
Time Write duration Sector Length MB/sec
95.6 0.00208855 1490439632 896 0.409131784
95.6 0.00208855 1488197000 128 0.058447398
95.6 0.00009537 1482323640 128 1.279965529
95.6 0.00009537 1482336312 768 7.679793174
95.6 0.00009537 1482343992 384 3.839896587
95.6 0.00009537 1482350648 768 7.679793174
95.6 0.00039101 1489278984 1152 2.809730729
95.6 0.00039101 1489393672 896 2.185346123
95.6 0.0001812 1482349368 256 1.347354443
95.6 0.0001812 1482358328 896 4.715740549
95.6 0.0001812 1482370616 640 3.368386107
95.6 0.0001812 1482378040 256 1.347354443
95.6 0.00208855 1488197128 384 0.175342193
95.6 0.00208855 1488202512 640 0.292236989
95.6 0.00208855 1488210320 1024 0.467579182
95.6 0.00009537 1482351416 256 2.559931058
95.6 0.00009537 1482360120 896 8.959758703
95.6 0.00009537 1482371896 640 6.399827645
95.6 0.00009537 1482380088 256 2.559931058
95.7 0.00039101 1489394568 1152 2.809730729
95.7 0.00039101 1489396744 352 0.858528834
95.7 0.00039101 1489507944 544 1.326817289
95.7 0.0001812 1482378296 768 4.042063328
95.7 0.0001812 1482392120 768 4.042063328
95.7 0.0001812 1482400568 512 2.694708885
95.7 0.00208855 1488224144 768 0.350684386
95.7 0.00208855 1488232208 384 0.175342193
I'm pretty confident it's not my code, because I timed everything and, for example, enqueuing takes a few µs, suggesting that threads don't get stuck waiting for each other. It must be the disk writes. So the question is how I can improve my disk writes and what I can do to profile the actual disk writes (remember that I rely on the FFmpeg DLLs to save, so I cannot access the low-level writing functions directly). If I cannot figure it out, I'll dump all the frames into a single sequential binary file (which should increase I/O speed) and then split it into video files in post-processing.
I don't know how much of my disk I/O is being cached (CacheSet only shows the cache size for disk C), but the following image from Performance Monitor, taken at 0 and 45 sec into the video (just before my queue starts piling up), looks weird to me. Basically, the modified and standby sets grew from very little to this large value. Is that my data being cached? Is it possible that data only starts being written to disk at 45 sec, so that suddenly everything slows down?
(FYI, LabVIEW is the program that loads my DLL.)
I'd appreciate any help.
M.

With CreateFile it looks like you want one or both of these parameters:
FILE_FLAG_NO_BUFFERING
FILE_FLAG_WRITE_THROUGH
http://msdn.microsoft.com/en-us/library/cc644950(v=vs.85).aspx
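For reference, passing those flags when the output file is opened might look roughly like this (a minimal sketch with a placeholder helper name; since you write through the FFmpeg DLLs you may not control this call directly, and FILE_FLAG_NO_BUFFERING additionally requires sector-aligned buffers and write lengths):
#include <windows.h>
// Hypothetical helper: open a file for unbuffered, write-through output.
HANDLE OpenForDirectWrite(const wchar_t* path)
{
    return ::CreateFileW(path,
                         GENERIC_WRITE,
                         0,                        // no sharing
                         NULL,
                         CREATE_ALWAYS,
                         FILE_ATTRIBUTE_NORMAL |
                             FILE_FLAG_NO_BUFFERING |
                             FILE_FLAG_WRITE_THROUGH,
                         NULL);
}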
Your delayed performance hit occurs when the OS starts pushing data to the disk.
6 Gbit/s is the rated capability of the SATA link (SATA 3.0), not of the actual device connected or of the physical platters or flash RAM underneath.
A common problem with AV systems is that a constant high-rate stream of writes can get periodically interrupted by disk housekeeping tasks. There used to be special AV disks you could purchase that don't do this; these days you can purchase disks with special high-throughput firmware explicitly intended for security video recording.
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=210671&NewLang=en

The problem is the repeated malloc and free, which puts a load on the system. I suggest creating a buffer pool, i.e. allocating N buffers in the initialization stage and reusing them instead of mallocing and freeing the memory. Since you have mentioned ffmpeg, to give an example from multimedia: in GStreamer, buffer management is done with buffer pools, and in a GStreamer pipeline buffers are usually taken from and passed around between buffer pools. Most multimedia systems do this.
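For illustration, a minimal sketch of such a pool (the names are assumptions, not the asker's code) could look like this:
#include <cstddef>
#include <mutex>
#include <vector>
// Fixed pool of frame buffers, allocated once and recycled.
class FramePool {
public:
    FramePool(std::size_t count, std::size_t frameBytes)
        : storage_(count, std::vector<unsigned char>(frameBytes)) {
        for (auto& buf : storage_)
            free_.push_back(&buf);
    }
    std::vector<unsigned char>* acquire() {   // returns nullptr when the pool is exhausted
        std::lock_guard<std::mutex> lock(m_);
        if (free_.empty()) return nullptr;
        auto* buf = free_.back();
        free_.pop_back();
        return buf;
    }
    void release(std::vector<unsigned char>* buf) {
        std::lock_guard<std::mutex> lock(m_);
        free_.push_back(buf);
    }
private:
    std::mutex m_;
    std::vector<std::vector<unsigned char>> storage_;   // owns the memory
    std::vector<std::vector<unsigned char>*> free_;     // buffers currently available
};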
Regarding:
The problem is that for the first 45 sec everything works fine: there's never more than one frame on the queue. But after that my queue gets backed up and processing takes just a few ms longer, resulting in increased RAM usage, because I cannot save the frames fast enough and have to malloc more memory to store them.
The application is thrashing at this point, and calling malloc will only make matters worse. I suggest implementing a producer-consumer model, where one side waits on the other as needed. In your case, set a threshold of N buffers: if there are N buffers in the queue, new frames from the camera are not enqueued until existing buffers have been processed.
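A minimal sketch of such a bounded queue (the names are assumptions, and Frame is a stand-in for your buffer type) could be:
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>
struct Frame { std::vector<unsigned char> data; };
// Producers block in push() once maxFrames buffers are pending, so memory
// use stays bounded instead of growing when the writer falls behind.
class BoundedFrameQueue {
public:
    explicit BoundedFrameQueue(std::size_t maxFrames) : maxFrames_(maxFrames) {}
    void push(Frame f) {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [&] { return q_.size() < maxFrames_; });
        q_.push(std::move(f));
        notEmpty_.notify_one();
    }
    Frame pop() {
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [&] { return !q_.empty(); });
        Frame f = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return f;
    }
private:
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::queue<Frame> q_;
    std::size_t maxFrames_;
};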
Another idea: instead of writing raw frames, why not write encoded data? Assuming you want video, you can at least write an elementary H.264 stream (and ffmpeg comes with a good H.264 encoder!), or, even better, if you have access to an MPEG-4 muxer, an MP4 file. This will reduce the memory requirements and the I/O load dramatically.

I am copying a directory with 100,000+ files to an exFAT disk. The copy starts at around 60 MB/s and degrades to about 1-2 MB/s. For me the quick & dirty fix is to put the computer to sleep during the operation and then wake it up. The original speed returns instantly. Not very sophisticated, but it works for me.

I think it's ultimately a consequence of using an "unwise" write mode.
Why does it work with 8 threads writing in your synthetic test? Because in that case it matters little if a thread blocks: you're pushing data towards the drive at full throttle. It's not surprising that you get 100 MB/s that way, which is roughly the reported real-life speed of the drive you named.
What happens in the "real" program? You have 8 producers and a single consumer that pushes data to disk. You use non-buffered, synchronized writes, which means your writer thread blocks for however long it takes the drive to receive all the data; it does nothing else in the meantime. The producers keep producing. Well, that's fine, you say: you only need 50 MB/s or so, and the drive can do 100 MB/s.
Yes, except... that's the maximum with several threads hammering the drive concurrently. You have zero concurrency here, so you don't get those 100 MB/s to begin with. Or maybe you do, for a while.
And then the drive needs to do something. Something? Something, anything, whatever; say, a seek. Or you get scheduled out for an instant, or for whatever reason the controller or the OS doesn't push data for a tenth of a millisecond, and when you look again the disk has spun on, so you need to wait one full rotation (no joke, that happens!). Maybe there are just one or two sectors in the way on a fragmented disk, who knows. It doesn't matter what... something happens.
This takes, say, anywhere from one to ten milliseconds, and your thread does nothing during that time; it's still waiting for the blocking WriteFile call to finish. Producers keep producing and keep pulling memory blocks from the pool. The pool runs out of memory blocks and allocates more. Producers keep producing, and producing. Meanwhile, your writer thread keeps the buffer locked and still does nothing.
See where this is going? That cannot be sustained forever. Even if it temporarily recovers, it's always fighting uphill. And losing.
Caching (including write caching) doesn't exist for no reason. Things in the real world do not always go as fast as we wish, both in terms of bandwidth and latency.
On the internet, we can easily crank up bandwidth to gigabit uplinks, but we are bound by latency, and there is absolutely nothing we can do about it. That's why transatlantic communications still suck like they did 30 years ago, and they will still suck in 30 years (the speed of light very likely won't change).
Inside a computer, things are very similar. You can easily push two-digit gigabytes per second over a PCIe bus, but doing so in the first place takes a painfully long time. Graphics programmers know that only too well.
The same applies to disks, and it is no surprise that exactly the same strategy is applied to solve the issue: Overlap transfers, do them asynchronously. That's what virtually every operating system does by default. Data is copied to buffers, and written out lazily, and asynchronously.
That, however, is exactly what you disable by using unbuffered writes, which is almost always (with very, very few singular exceptions) a bad idea.
For some reason, people have kept around this mindset of DMA doing magic things and being so totally superior speed-wise. Which maybe was even true at some point in the distant past, and maybe still is true for some very rare corner cases (e.g. reading once from optical disks). In all other cases, it's a disastrous anti-optimization.
Things that could run in parallel run sequentially. Latencies add up. Threads block, and do nothing.
Simply using "normal" writes means data is copied to buffers, and WriteFile returns almost instantly (well, with a delay equivalent to a memcpy). The drive never (well, hopefully, can never be 100% sure) starves, and your writer thread never blocks, doing nothing. Memory blocks don't stack up in the queue, and pools don't run out, they don't need to allocate more.

Related

Audio manipulation and deleting part of the audio

I'm new to audio coding. I have succeeded in recording from the microphone and saving each 10 seconds to a file with a SaveRecordtoFile function (this works with no problem).
Now I want to delete, for example, 2 seconds from the recorded data so my output will be 8 seconds instead of 10. In the randomTime array, 0 marks the seconds which I want to be deleted...
In a for-loop I copy the data of waveHeader->lpData into a new buffer if (randomTime[i] == '1').
It seems this is a correct algorithm and should work, but the problem is the output: some of the outputs are good (about 70% or more) but some of them are corrupted.
I think I have a mistake in the code, but I have debugged this code for some days and I don't understand what the problem is.
And since 70% or more of my outputs are good, I don't think it's a problem with bytes or samples.
Your code can break a sample apart; after that the stream is out of sync and you hear a loud noise.
How does that happen? Your sample size is 4 bytes, so you must never copy anything that is not a multiple of 4. 10 seconds of audio take 10 × 48000 × 4 = 1,920,000 bytes. However, Sleep(10000) will always be close to 10 seconds but not exactly 10 seconds, so you can get, say, 1,920,012 bytes. Then you do:
dwSamplePerSec = waveHeader->dwBytesRecorded / 10; // 10 Secs
which returns 192,001 (not a multiple of 4), and the stream gets out of sync. If you're lucky, you receive 1,920,040 bytes for 10 seconds, which stays a multiple of 4 after division by 10, and you're OK.
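A minimal sketch of the fix (assuming 16-bit stereo, i.e. a 4-byte sample frame; the helper name is made up) is to round the per-slice byte count down to a whole number of sample frames before copying:
#include <windows.h>
// Bytes per one-second slice, rounded down so a 4-byte sample frame is never split.
DWORD BytesPerSlice(const WAVEHDR* waveHeader, DWORD seconds, DWORD blockAlign = 4)
{
    DWORD bytes = waveHeader->dwBytesRecorded / seconds;  // e.g. 1,920,012 / 10 = 192,001
    return bytes - (bytes % blockAlign);                  // 192,000: a multiple of 4
}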

memory mapped file access is very slow

I am writing to a 930GB file (preallocated) on a Linux machine with 976 GB memory.
The application is written in C++ and I am memory mapping the file using Boost Interprocess. Before starting the code I set the stack size:
ulimit -s unlimited
The writing was very fast a week ago, but today it is running slow. I don't think the code has changed, but I may have accidentally changed something in my environment (it is an AWS instance).
The application ("write_data") doesn't seem to be using all the available memory. "top" shows:
Tasks: 559 total, 1 running, 558 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 98.5%id, 1.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1007321952k total, 149232000k used, 858089952k free, 286496k buffers
Swap: 0k total, 0k used, 0k free, 142275392k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4904 root 20 0 2708m 37m 27m S 1.0 0.0 1:47.00 dockerd
56931 my_user 20 0 930g 29g 29g D 1.0 3.1 12:38.95 write_data
57179 root 20 0 0 0 0 D 1.0 0.0 0:25.55 kworker/u257:1
57512 my_user 20 0 15752 2664 1944 R 1.0 0.0 0:00.06 top
I thought the resident size (RES) should include the memory mapped data, so shouldn't it be > 930 GB (size of the file)?
Can someone suggest ways to diagnose the problem?
Memory mappings generally aren't eagerly populated. If some other program had forced the file into the page cache, you'd see good performance from the start; otherwise you'd see poor performance as the file was paged in.
Given you have enough RAM to hold the whole file in memory, you may want to hint to the OS that it should prefetch the file, replacing the many small reads triggered by page faults with larger bulk reads. The posix_madvise API can be used to provide this hint by passing POSIX_MADV_WILLNEED as the advice over the whole mapping, indicating that it should be prefetched.
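A minimal sketch of that hint (the address and length would come from wherever you created the Boost mapping; the function name is made up):
#include <cstddef>
#include <sys/mman.h>
// Ask the kernel to prefetch the whole mapping so later page faults find
// the pages already resident.
void prefetch_mapping(void* addr, std::size_t length)
{
    posix_madvise(addr, length, POSIX_MADV_WILLNEED);
}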

How can I stream multiple files at the same time using HTTP::Server?

I'm working on an HTTP service that serves big files. I noticed that parallel downloads are not possible: the process serves only one file at a time, and all other downloads wait until the previous ones finish. How can I stream multiple files at the same time?
require "http/server"
server = HTTP::Server.new(3000) do |context|
context.response.content_type = "application/data"
f = File.open "bigfile.bin", "r"
IO.copy f, context.response.output
end
puts "Listening on http://127.0.0.1:3000"
server.listen
Request one file at a time:
$ ab -n 10 -c 1 127.0.0.1:3000/
[...]
Percentage of the requests served within a certain time (ms)
50% 9
66% 9
75% 9
80% 9
90% 9
95% 9
98% 9
99% 9
100% 9 (longest request)
Request 10 files at once:
$ ab -n 10 -c 10 127.0.0.1:3000/
[...]
Percentage of the requests served within a certain time (ms)
50% 52
66% 57
75% 64
80% 69
90% 73
95% 73
98% 73
99% 73
100% 73 (longest request)
The problem here is that neither the file read nor the write to context.response.output ever blocks. Crystal's concurrency model is based on cooperatively scheduled fibers, where switching fibers only happens when IO blocks. Reading from the disk using nonblocking IO is impossible, which means the only part that could block is writing to context.response.output. However, disk IO is far slower than network IO on the same machine, meaning the write never blocks, because ab reads at a rate much faster than the disk can provide data, even from the disk cache. This example is practically the perfect storm to break Crystal's concurrency.
In the real world, it's much more likely that clients of the service will be across the network from the machine, making the response write block occasionally. Furthermore, if you were reading from another network service or a pipe/socket, the read would also block. Another solution would be to use a thread pool to implement nonblocking file IO, which is what libuv does. As a side note, Crystal moved to libevent because libuv doesn't allow a multithreaded event loop (i.e. having any thread resume any fiber).
Calling Fiber.yield to pass execution to any pending fiber is the correct solution here. Here's an example of how to block (and yield) while reading files:
def copy_in_chunks(input, output, chunk_size = 4096)
  size = 1
  while size > 0
    size = IO.copy(input, output, chunk_size)
    Fiber.yield
  end
end
File.open("bigfile.bin", "r") do |file|
  copy_in_chunks(file, context.response)
end
This is a transcription of the discussion here: https://github.com/crystal-lang/crystal/issues/4628
Props to GitHub users #cschlack, #RX14 and #ysbaddaden

Windows shared memory access time slow

I am currently using shared memory with two mapped files (1.9 GB for the first one and 600 MB for the second) in a piece of software.
I am using a process that reads data from the first file, processes the data and writes the results to the second file.
I have noticed a strong delay sometimes (for reasons unknown to me) when reading from or writing to the mapped view with the memcpy function.
Mapped files are created this way :
m_hFile = ::CreateFileW(SensorFileName,
                        GENERIC_READ | GENERIC_WRITE,
                        0,
                        NULL,
                        CREATE_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL,
                        NULL);
m_hMappedFile = CreateFileMapping(m_hFile,
                                  NULL,
                                  PAGE_READWRITE,
                                  dwFileMapSizeHigh,
                                  dwFileMapSizeLow,
                                  NULL);
And memory mapping is done this way :
m_lpMapView = MapViewOfFile(m_hMappedFile,
                            FILE_MAP_ALL_ACCESS,
                            dwOffsetHigh,
                            dwOffsetLow,
                            m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow values are "matched" to the allocation granularity from the system info.
The process reads about 300 KB N times, stores that in a buffer, processes it, and then writes the processed contents of the previous buffer, again 300 KB N times, to the second file.
I have two different memory views (created/moved with the MapViewOfFile function) with a default size of 10 MB.
For the memory view size, I tested 10 KB, 100 KB, 1 MB, 10 MB and 100 MB. Statistically there is no difference: 80% of the time the reading process is as described below (~200 ms), but the writing process is really slow.
Normally :
1/ Reading is done in ~200ms.
2/ Processing is done in 2.9 seconds.
3/ Writing is done in ~200ms.
I can see that 80% of the time either reading or writing (in the worst case both) is slow and takes between 2 and 10 seconds.
Example: for writing, I am using the code below
for (unsigned int i = 0; i < N; i++)   // N = 500~3k
{
    // Check the position of the memory view for ponderation
    if (###)
        MoveView(iOffset);
    if (m_lpMapView)
    {
        memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
        // uiSize = ~300 kBytes
        memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
    }
    else
        return uiANNStatus;
}
After using the GetTickCount function to pinpoint where the delay is, I can see that the second memcpy call is always the one taking most of the time.
So far I am seeing N (for the test I used N = 500) calls to memcpy taking up to 10 seconds in the worst case when using those shared memories.
I made a temporary program that performed the same number of memcpy calls on the same amount of data and could not reproduce the problem.
For the tests I used the following conditions; they all show the same delay:
1/ I can see this on various computers, 32-bit or 64-bit, from Windows 7 to Windows 10.
2/ Using the main thread or multiple threads (up to 8, with critical sections for synchronization) for reading/writing.
3/ OS on SATA or SSD, the software's memory-mapped files physically on a SATA or SSD hard disk, and, if on an external hard disk, tests done through USB 1, USB 2 or USB 3.
I am kindly asking what you think my mistake is that makes memcpy go this slow.
Best regards.
I found a solution that works for me, but it might not be the case for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces an update of the mapped file.
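In outline (a rough sketch reusing the member handles and status variable from the question's code; the error handling is illustrative only), the calls after the loop look like this:
// Flush the dirty pages of the view once per batch, after the for loop,
// rather than after each memcpy.
if (!::FlushViewOfFile(m_lpMapView, 0))   // 0 = flush the entire mapped view
    return uiANNStatus;
if (!::FlushFileBuffers(m_hFile))         // push the data through to the device
    return uiANNStatus;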
I no longer have the "random" delay, but instead of the expected 200 ms I have an average of 400 ms, which is good enough for my application.
After doing some tests I saw that calling them too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flushes should be used carefully.
Thanks.

WriteFile failure with error code 87 on a disk with 4096 bytes per sector

A WriteFile() Win32 call with an input buffer size of 512 bytes fails when I try to write to a disk that has 4096 bytes per sector (a 3 TB disk). The same WriteFile with an input buffer size of 4096 works fine.
Can anybody explain this behavior?
For low-level I/O operations, your buffer size must be an integer multiple of the sector size, in your case k * 4096. Most likely your hard drive is a recent one; such drives are called "Advanced Format" and have 4096 bytes per sector. Mine doesn't mind if I use 512 because it's old. Try using the GetDiskFreeSpace function to learn more about your hard drive.
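For instance, a minimal sketch (the drive letter is a placeholder) that queries the sector size so you can size your write buffers accordingly:
#include <windows.h>
#include <cstdio>
int main()
{
    DWORD sectorsPerCluster = 0, bytesPerSector = 0;
    DWORD freeClusters = 0, totalClusters = 0;
    // "D:\\" stands in for the root of the disk you are writing to.
    if (::GetDiskFreeSpaceW(L"D:\\", &sectorsPerCluster, &bytesPerSector,
                            &freeClusters, &totalClusters))
    {
        std::printf("Bytes per sector: %lu\n", bytesPerSector);
        // WriteFile buffers for unbuffered I/O should be a multiple of this value.
    }
    return 0;
}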