fstream.write() gets stuck for a long time? - c++

fstream.write() gets stuck while a directory is being removed. It seems that my HDD is too busy when it has to write and remove at the same time. How can I avoid this stall? Is there any timeout parameter for the write() function? Below is my code:
#include <fstream>
#include <iostream>
#include <windows.h>   // INT64, VOID, Sleep()
#include <process.h>   // _beginthreadex()

// getData(), GetFreeSpace() and ClearData() are helpers not shown here.
UINT32 __stdcall remove_old_data(VOID* _pArguments);

int main()
{
    std::ios_base::sync_with_stdio(false);
    std::fstream myfile("sample.txt", std::ios::out | std::ios::binary);

    // Start thread to remove old data
    INT64* paramsInput = new INT64[2];
    char* dir = "D:\\";
    paramsInput[0] = (INT64)dir;
    paramsInput[1] = 50; // GB
    _beginthreadex(NULL, 0, &remove_old_data, (VOID*)paramsInput, 0, NULL);

    int size = 0;
    char* data = NULL;
    while (true)
    {
        data = NULL;
        size = getData(data); // data is available every 10 ms
        if (size > 0 && data != NULL) // size ~= 30 KB
        {
            myfile.write(data, size); // write data to file
        }
    }
}

UINT32 __stdcall remove_old_data(VOID* _pArguments)
{
    INT64* params = (INT64*)_pArguments;
    char* dir = (char*)params[0];
    int freeSpaceThreshold = (int)params[1];
    delete[] params;

    while (true)
    {
        int curFreeSpace = GetFreeSpace(dir);
        if (curFreeSpace < freeSpaceThreshold)
        {
            // remove old files and directories here
            ClearData(dir); // file size is about 10 MB, 40,000 files in dir
        }
        Sleep(10000);
    }
}

It's a bit tough to say exactly what the bottleneck is in your case, although there could be several. Deleting files can be time consuming, particularly when doing several in a directory with many files. Writes to a mostly full drive are also slower, as it takes the OS longer to find empty space to store new data, and those chunks are smaller.
Here are some suggestions to improve the performance, in no particular order:
1) Use an SSD. This eliminates almost all the latency for hard disk access.
2) Use OS API functions for file access, and use unbuffered writes. This will avoid filling the disk cache with data that is only written and not read again, allowing the directory information to stay cached.
3) Use multiple subdirectories to store your data. File access in a directory can slow down if the directory size gets to be too large.
4) Cache the 30K data chunks locally, keeping multiple chunks queued up for writing, and only write them out when remove_old_data is not cleaning up the directory.
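As a rough illustration of suggestion 4, here is a minimal sketch that queues the 30 KB chunks and only drains them to disk while the cleanup thread is idle. The std::atomic<bool> flag, the bounded loop, and the stubbed getData() are my own placeholders, not part of the original code.

#include <atomic>
#include <deque>
#include <fstream>
#include <vector>

std::atomic<bool> cleaning{false}; // the cleanup thread would set this around ClearData()

int getData(char*& data) // stand-in for the question's helper
{
    static char chunk[30 * 1024];
    data = chunk;
    return sizeof(chunk);
}

int main()
{
    std::fstream myfile("sample.txt", std::ios::out | std::ios::binary);
    std::deque<std::vector<char>> pending; // chunks waiting for the disk

    for (int i = 0; i < 1000; ++i) // stands in for the question's while (true) loop
    {
        char* data = nullptr;
        int size = getData(data);
        if (size > 0 && data != nullptr)
            pending.emplace_back(data, data + size); // queue a copy of the chunk

        // Touch the disk only while the cleanup thread is idle,
        // and drain everything that has queued up in one burst.
        if (!cleaning.load(std::memory_order_acquire))
        {
            while (!pending.empty())
            {
                myfile.write(pending.front().data(),
                             static_cast<std::streamsize>(pending.front().size()));
                pending.pop_front();
            }
        }
    }
}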

Related

How to concurrently write to a file in C++ (in other words, what's the fastest way to write to a file)

I'm building a graphics engine, and I need to write the result image to a .bmp file. I'm storing the pixels in a vector<Color>, along with the width and the height of the image. Currently I'm writing the image as follows (I didn't write this code myself):
std::ostream &img::operator<<(std::ostream &out, EasyImage const &image) {
    //temporarily enable exceptions on output stream
    enable_exceptions(out, std::ios::badbit | std::ios::failbit);
    //declare some struct-vars we're going to need:
    bmpfile_magic magic;
    bmpfile_header file_header;
    bmp_header header;
    uint8_t padding[] =
        {0, 0, 0, 0};
    //calculate the total size of the pixel data
    unsigned int line_width = image.get_width() * 3; //3 bytes per pixel
    unsigned int line_padding = 0;
    if (line_width % 4 != 0) {
        line_padding = 4 - (line_width % 4);
    }
    //lines must be aligned to a multiple of 4 bytes
    line_width += line_padding;
    unsigned int pixel_size = image.get_height() * line_width;
    //start filling the headers
    magic.magic[0] = 'B';
    magic.magic[1] = 'M';
    file_header.file_size = to_little_endian(pixel_size + sizeof(file_header) + sizeof(header) + sizeof(magic));
    file_header.bmp_offset = to_little_endian(sizeof(file_header) + sizeof(header) + sizeof(magic));
    file_header.reserved_1 = 0;
    file_header.reserved_2 = 0;
    header.header_size = to_little_endian(sizeof(header));
    header.width = to_little_endian(image.get_width());
    header.height = to_little_endian(image.get_height());
    header.nplanes = to_little_endian(1);
    header.bits_per_pixel = to_little_endian(24); //3 bytes or 24 bits per pixel
    header.compress_type = 0; //no compression
    header.pixel_size = pixel_size;
    header.hres = to_little_endian(11811); //11811 pixels/meter or 300dpi
    header.vres = to_little_endian(11811); //11811 pixels/meter or 300dpi
    header.ncolors = 0; //no color palette
    header.nimpcolors = 0; //no important colors
    //okay that should be all the header stuff: let's write it to the stream
    out.write((char *) &magic, sizeof(magic));
    out.write((char *) &file_header, sizeof(file_header));
    out.write((char *) &header, sizeof(header));
    //okay let's write the pixels themselves:
    //they are arranged left->right, bottom->top, b,g,r
    //this is the main bottleneck
    for (unsigned int i = 0; i < image.get_height(); i++) {
        //loop over all lines
        for (unsigned int j = 0; j < image.get_width(); j++) {
            //loop over all pixels in a line
            //we cast &color to char*. since the color fields are ordered blue,green,red they should be written automatically
            //in the right order
            out.write((char *) &image(j, i), 3 * sizeof(uint8_t));
        }
        if (line_padding > 0)
            out.write((char *) padding, line_padding);
    }
    //okay we should be done
    return out;
}
As you can see, the pixels are being written one by one. This is quite slow; I put some timers in my program and found that the writing was my main bottleneck.
I tried to write entire (horizontal) lines at once, but I did not find how to do it (the best I found was this).
Secondly, I wanted to write to the file using multithreading (not sure whether I need threads or processes), using OpenMP. But that means I would need to specify which byte offset to write to, I think, and I couldn't solve that.
Lastly, I thought about writing to the file immediately whenever I drew an object, but then I had the same issue with writing to specific locations in the file.
So, my question is: what's the best (fastest) way to tackle this problem? (I'm compiling this for Windows and Linux.)
The fastest method to write to a file is to use hardware assist. Write your output to memory (a.k.a. buffer), then tell the hardware device to transfer from memory to the file (disk).
The next fastest method is to write all the data to a buffer then block write the data to the file. If you want other tasks or threads to execute during your writing, then create a thread that writes the buffer to the file.
When writing to a file, the more data per transaction, the more efficient the write will be. For example, 1 write of 1024 bytes is faster than 1024 writes of one byte.
The idea is to keep the data streaming. Slowing down the transfer rate may be faster than a burst write, delay, burst write, delay, etc.
Remember that the disk is essentially a serial device (unless you have a special hard drive). Bits are laid down on the platters using a bit stream. Writing data in parallel will have adverse effects because the head will have to be moved between the parallel activities.
Remember that if you use more than one core, there will be more traffic on the data bus. The transfer to the file will have to pause while other threads/tasks are using the data bus. So, if you can, block all tasks, then transfer your data. :-)
I've written programs that copy from slow memory to fast memory, then transferred from fast memory to the hard drive. That was also using interrupts (threads).
Summary
Fast writing to a file involves:
Keep the data streaming; minimize the pauses.
Write in binary mode (no translations, please).
Write in blocks (format into memory as necessary before writing the block).
Maximize the data in a transaction.
Use separate writing thread, if you want other tasks running "concurrently".
The hard drive is a serial device, not parallel. Bits are written to the platters in a serial stream.
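Applied to the BMP loop above, the "write in blocks" point might look like the sketch below: assemble one scan line into a buffer and issue a single write() per line instead of one per pixel. It reuses the question's own names (image(j, i), line_width, out), assumes the Color fields really are the 3 bytes blue, green, red as the original comment states, and is meant as a drop-in replacement for the pixel loop, not tested code.

#include <cstring> // std::memcpy
#include <vector>
// ...
std::vector<char> line(line_width, 0); // one padded scan line; the padding bytes stay zero
for (unsigned int i = 0; i < image.get_height(); i++) {
    char *p = line.data();
    for (unsigned int j = 0; j < image.get_width(); j++) {
        std::memcpy(p, &image(j, i), 3); // blue, green, red
        p += 3;
    }
    out.write(line.data(), line_width); // pixels plus padding in one call
}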

Truly asynchronous file IO in C++

I have a super fast M.2 drive. How fast is it? It doesn’t matter because I cannot utilize this speed anyway. That’s why I’m asking this question.
I have an app that needs a real lot of memory. So much that it won’t fit in RAM. Fortunately it is not needed all at once. Instead it is used to save intermediate results from computations.
Unfortunately the application is not able to write and reads this data fast enough. I tried using multiple reader and writer threads but it only made it worse (later I read that it is because of this).
So my question is: Is it possible to have truly asynchronous file IO in C++ to fully exploit those advertised gigabytes per second? If it is than how (in a cross platform way)?
You could also recommend a library that’s good with tasks like that if you know one because I believe that there is no point in reinventing the wheel.
Edit:
Here is code that shows how I do file IO in my program. It isn't from the mentioned program, because that wouldn't be minimal, but it illustrates the problem nevertheless. Do not mind Windows.h; it is used only to set thread affinity. In the actual program I also set affinity, which is why I included it.
#include <fstream>
#include <thread>
#include <memory>
#include <string>
#include <Windows.h> // for SetThreadAffinityMask()

void stress_write(unsigned bytes, int num)
{
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned i = 0; i < bytes; ++i)
    {
        out << char(i);
    }
}

void lock_thread(unsigned core_idx)
{
    SetThreadAffinityMask(GetCurrentThread(), 1LL << core_idx);
}

int main()
{
    std::ios_base::sync_with_stdio(false);
    lock_thread(0);
    auto worker_count = std::thread::hardware_concurrency() - 1;
    std::unique_ptr<std::thread[]> threads = std::make_unique<std::thread[]>(worker_count); // faster than std::vector
    for (int i = 0; i < worker_count; ++i)
    {
        threads[i] = std::thread(
            [](unsigned idx) {
                lock_thread(idx);
                stress_write(1'000'000'000, idx);
            },
            i + 1
        );
    }
    stress_write(1'000'000'000, 0);
    for (int i = 0; i < worker_count; ++i)
    {
        threads[i].join();
    }
}
As you can see, it's just plain old fstream. On my machine this uses 100% CPU, but only 7-9% disk (around 190 MB/s). I am wondering if it could be increased.
The easiest thing to get (up to) a 10x speed up is to change this:
void stress_write(unsigned bytes, int num)
{
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned i = 0; i < bytes; ++i)
    {
        out << char(i);
    }
}
to this:
void stress_write(unsigned bytes, int num)
{
    constexpr auto chunk_size = (1u << 12u); // tune as needed
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned chunk = 0; chunk < (bytes + chunk_size - 1) / chunk_size; ++chunk)
    {
        char chunk_buff[chunk_size];
        auto count = (std::min)(bytes - chunk_size * chunk, chunk_size);
        for (unsigned j = 0; j < count; ++j)
        {
            unsigned i = j + chunk_size * chunk;
            chunk_buff[j] = char(i); // processing
        }
        out.write(chunk_buff, count);
    }
}
where we group writes into chunks of up to 4096 bytes before sending them to the std::ofstream.
The streaming operations have a number of annoying, hard for compilers to elide, virtual calls that dominate performance when you are writing only a handful of bytes at a time.
By chunking data into larger pieces we make the vtable lookups rare enough that they no longer dominate.
See this SO post for more details as to why.
To get the last iota of performance, you may have to use something like boost.asio or access your platform's raw async file I/O libraries.
But when you are working at under 10% of the drive bandwidth while maxing out your CPU, aim at the low-hanging fruit first.
Chunking the I/O is indeed the most important optimization here and should suffice in most cases. However, the direct answer to the exact question asked about asynchronous IO is the following.
Boost::Asio added support for file operations in version 1.21.0. The interface is similar to the rest of Asio.
First, we need to create an object representing a file. The most common use cases would use either a random_access_file or a stream_file. For this example code, a streaming file is enough.
Reading is done through async_read_some, but the usual async_read helper function can be used to read a specific number of bytes at once.
If the operating system supports that, these operations do indeed run in the background and use little processor time. Both Windows and Linux do support this.
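A minimal sketch of what that can look like for writing (my own example, not the answer's original code; it assumes Boost 1.78 or later built with file support, which on Linux means io_uring):

#include <boost/asio.hpp>
#include <array>
#include <iostream>

namespace asio = boost::asio;

int main()
{
    asio::io_context ctx;

    // stream_file is the sequential-access flavour; random_access_file also exists.
    asio::stream_file file(ctx, "temp_async.bin",
                           asio::stream_file::write_only
                           | asio::stream_file::create
                           | asio::stream_file::truncate);

    std::array<char, 4096> buf{}; // data to be written
    asio::async_write(file, asio::buffer(buf),
        [](const boost::system::error_code& ec, std::size_t n) {
            if (ec)
                std::cerr << "write failed: " << ec.message() << '\n';
            else
                std::cout << "wrote " << n << " bytes\n";
        });

    ctx.run(); // completion handler runs here; the I/O itself proceeds in the background
}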

How to write a large binary file to a disk

I am writing a program which requires writing a large binary file (about 12 GiB or more) to a disk. I have created a small test program to test this functionality. Although allocating the RAM for the buffer is not a problem, my program does not write the data to the file: the file remains empty, even for 3.72 GiB files.
//size_t bufferSize=1000;      //ok
//size_t bufferSize=100000000; //ok
size_t bufferSize=500000000;   //fails although it is under 4GiB, which shouldn't cause problems anyway

double mem=double(bufferSize)*double(sizeof(double))/std::pow(1024.,3.);
cout<<"Total memory used: "<<mem<<" GiB"<<endl;

double *buffer=new double[bufferSize];
/* //enable if you want to fill the buffer with random data
printf("\r[%i %%]",0);
for (size_t i=0;i<(size_t)bufferSize;i++)
{
    if ((i+1)%100==0) printf("\r[%i %%]",(size_t)(100.*double(i+1)/bufferSize));
    buffer[i]=rand() % 100;
}
*/
cout<<endl;

std::ofstream outfile ("largeStuff.bin",std::ofstream::binary);
outfile.write ((char*)buffer,((size_t)(bufferSize*double(sizeof(double)))));
outfile.close();
delete[] buffer;
I actually compiled and ran the code exactly as you have pasted there and it works. It creates a 4GB file.
If you are on a FAT32 filesystem the max filesize is 4GB.
Otherwise I suggest you check:
The amount of free disk space you have.
Whether your user account has any disk usage limits in place.
The amount of free RAM you have.
Whether there are any runtime errors.
@enhzflep's suggestion about the number of prints (although that is commented out)
It seems that you want to have a buffer that contains the whole file's contents prior to writing it.
You're doing it wrong, though: the virtual memory requirements are essentially double what they need to be. Your process retains the buffer, but when you write that buffer to disk it gets duplicated in the operating system's buffers. Now, most OSes will notice that you write sequentially and may discard their buffers quickly, but still: it's rather wasteful.
Instead, you should create an empty file, grow it to its desired size, then map its view into memory, and do the modifications on the file's view in memory. For 32 bit hosts your file size is limited to <1GB. For 64 bit hosts, it's limited by the filesystem only. On modern hardware, creating and filling a 1GB file that way takes on the order of 1 second (!) if you have enough free RAM available.
Thanks to the wonders of RAII, you don't need to do anything special to release the mapped memory, or to close/finalize the file. By leveraging boost you can avoid writing platform-specific code, too.
// https://github.com/KubaO/stackoverflown/tree/master/questions/mmap-boost-40308164
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/filesystem.hpp>
#include <cassert>
#include <cstdint>
#include <fstream>

namespace bip = boost::interprocess;

void fill(const char * fileName, size_t size) {
    using element_type = uint64_t;
    assert(size % sizeof(element_type) == 0);

    std::ofstream().open(fileName); // create an empty file
    boost::filesystem::resize_file(fileName, size);

    auto mapping = bip::file_mapping{fileName, bip::read_write};
    auto mapped_rgn = bip::mapped_region{mapping, bip::read_write};
    const auto mmaped_data = static_cast<element_type*>(mapped_rgn.get_address());
    const auto mmap_bytes = mapped_rgn.get_size();
    const auto mmap_size = mmap_bytes / sizeof(*mmaped_data);
    assert(mmap_bytes == size);

    element_type n = 0;
    for (auto p = mmaped_data; p < mmaped_data + mmap_size; ++p)
        *p = n++;
}

int main() {
    const uint64_t G = 1024ULL*1024ULL*1024ULL;
    fill("tmp.bin", 1*G);
}

Loop Around File Mapping Kills Performance

I have a circular buffer which is backed with file mapped memory (the buffer is in the size range of 8GB-512GB).
I am writing to (8 instances of) this memory in a sequential manner from the beginning to the end at which point it loops around back to the beginning.
It works fine until it reaches the end where it needs to perform two file mappings and loop around the memory, at which point IO performance is totally trashed and doesn't recover (even after several minutes). I can't quite figure it out.
using namespace boost::interprocess;

class mapping
{
public:
    mapping()
    {
    }

    mapping(file_mapping& file, mode_t mode, std::size_t file_size, std::size_t offset, std::size_t size)
        : offset_(offset)
        , mode_(mode)
    {
        const auto aligned_size        = page_ceil(size + page_size());
        const auto aligned_file_size   = page_floor(file_size);
        const auto aligned_file_offset = page_floor(offset % aligned_file_size);
        const auto region1_size        = std::min(aligned_size, aligned_file_size - aligned_file_offset);
        const auto region2_size        = aligned_size - region1_size;

        if (region2_size)
        {
            const auto region1_address = mapped_region(file, read_only, 0, (region1_size + region2_size) * 2).get_address();
            const auto region2_address = reinterpret_cast<char*>(region1_address) + region1_size;

            region1_ = mapped_region(file, mode, aligned_file_offset, region1_size, region1_address);
            region2_ = mapped_region(file, mode, 0, region2_size, region2_address);
        }
        else
        {
            region1_ = mapped_region(file, mode, aligned_file_offset, region1_size);
            region2_ = mapped_region();
        }

        size_ = region1_.get_size() + region2_.get_size();
        offset_ = aligned_file_offset;
    }

    auto offset() const -> std::size_t { return offset_; }
    auto size() const -> std::size_t { return size_; }
    auto data() const -> const void* { return region1_.get_address(); }
    auto data() -> void* { return region1_.get_address(); }

    auto flush(bool async = true) -> void
    {
        region1_.flush(async);
        region2_.flush(async);
    }

    auto mode() const -> mode_t { return mode_; }

private:
    std::size_t offset_ = 0;
    std::size_t size_ = 0;
    mode_t mode_;
    mapped_region region1_;
    mapped_region region2_;
};

struct loop_mapping::impl final
{
    std::tr2::sys::path file_path_;
    file_mapping file_mapping_;
    std::size_t file_size_;
    std::size_t map_size_ = page_floor(256000000ULL);
    std::shared_ptr<mapping> mapping_ = std::shared_ptr<mapping>(new mapping());
    std::shared_ptr<mapping> prev_mapping_;
    bool write_;

public:
    impl(std::tr2::sys::path path, bool write)
        : file_path_(std::move(path))
        , file_mapping_(file_path_.string().c_str(), write ? read_write : read_only)
        , file_size_(page_floor(std::tr2::sys::file_size(file_path_)))
        , write_(write)
    {
        REQUIRE(file_size_ >= map_size_ * 3);
    }

    ~impl()
    {
        prev_mapping_.reset();
        mapping_.reset();
    }

    auto data(std::size_t offset, std::size_t size, boost::optional<bool> write_opt) -> void*
    {
        offset = offset % page_floor(file_size_);
        REQUIRE(size < file_size_ - map_size_ * 3);

        const auto write = write_opt.get_value_or(write_);
        REQUIRE(!write || write_);

        if ((write && mapping_->mode() == read_only) || offset < mapping_->offset() || offset + size >= mapping_->offset() + mapping_->size())
        {
            auto new_mapping = std::make_shared<loop::mapping>(file_mapping_, write ? read_write : read_only, file_size_, page_floor(offset), std::max(size + page_size(), map_size_));

            if (mapping_)
                mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
            if (prev_mapping_)
                prev_mapping_->flush(false);

            prev_mapping_ = std::move(mapping_);
            mapping_ = std::move(new_mapping);
        }

        return reinterpret_cast<char*>(mapping_->data()) + offset - mapping_->offset();
    }
};
-
// 8 processes to 8 different files 128GB each.
loop_mapping loop(...);
for (auto n = 0; true; ++n)
{
    auto src = get_new_data(5000000/8);
    auto dst = loop.data(n * 5000000/8, 5000000/8, true);
    std::memcpy(dst, src, 5000000/8); // This becomes very slow after loop around.
    std::this_thread::sleep_for(std::chrono::seconds(1));
}
Any ideas?
Target System:
1x 3TB Seagate Constellation ES.3
2x Xeon E5-2400 (6 cores, 2.6 GHz)
6x 8 GB DDR3 1600 MHz ECC
Windows Server 2012
8 buffers each 8 to 512GiB in size on a system with 48GiB of physical memory means that your mapping will have to be swapped. No surprise there.
The issue, as you have already remarked yourself, is that prior to being able to write to a page, you encounter a fault, and the page is read in. That doesn't happen on the first run, since merely a zero page is used. To make matters worse, reading in pages again competes with write-behind of dirty pages.
Now, there is unluckily no way of telling Windows "I'm going to overwrite this anyway", nor is there any way of making the disk load your stuff faster. However, you can start the transfer earlier (maybe when you're 3/4 through the buffer).
Windows Server 2012 (which you're using) supports PrefetchVirtualMemory which is a somewhat half-assed substitute for POSIX madvise(MADV_WILLNEED).
That is, of course, not exactly what you want to do when you already know that you will overwrite the complete memory page (or several of them) anyway, but it is as good as you can get. It's worth a try in any case.
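A sketch of how that call is used (the address and size here are placeholders for whatever part of your mapping you are about to wrap around into):

#include <windows.h> // PrefetchVirtualMemory, Windows 8 / Server 2012 and later

// Ask the memory manager to start paging in a range we are about to write.
// Call this ahead of time (e.g. when you are 3/4 through the current region),
// as suggested above; it is a hint, not a guarantee.
void prefetch(void* address, SIZE_T bytes)
{
    WIN32_MEMORY_RANGE_ENTRY range;
    range.VirtualAddress = address;
    range.NumberOfBytes  = bytes;
    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
}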
Ideally, you would want to do something like a destructive madvise(MADV_DONTNEED) as implemented e.g. under Linux (and I believe FreeBSD, too) immediately before you overwrite a page, but I am not aware of any way of doing this under Windows (...short of destroying the view and the mapping and mapping from scratch, but then you throw away all data, so that's a bit useless).
Even with prefetching early you will still be limited by disk I/O bandwidth, but at least you can hide the latency.
Another "obvious" (but probably not that easy) solution would be to make the consumer faster. That would allow for a smaller buffer to begin with, and even on a huge buffer it would keep the working set smaller (both producer and consumer force pages into RAM while accessing them, so if the consumer accesses data with less delay after the producer has written them, they will both be using mostly the same set of pages.) Smaller working sets fit into RAM more easily.
But I realize that you probably didn't choose a several-gigabyte buffer for no reason.
Since your code is devoid of any comments, filled with auto variables, not compilable as-is, and I don't have 512 GB available on my PC to test it anyway, this will remain a passing thought off the top of my head.
Each of your processes only writes a few hundred KB/s, so there should be ample time to flush that to disk in the background.
However, it seems you are asking the boost mapping system to flush either synchronously or asynchronously the previous chunk depending on your mysterious offset computations:
mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
I guess the rollover triggers a synchronous flush, which is a likely culprit for the sudden slowdown.
What the operating system does at this point depends on the boost implementation, which is not described (or at least in a way obvious enough for me to get it after a cursory look at their man page).
If boost stuffed your 48 GB of memory with unflushed pages, you could certainly experience a sudden and prolonged deceleration.
At least worth a comment in your code if this mysterious line does something clever and completely different I missed entirely.
If you are able to back the memory mapping with the page file rather than a specific file, you can use the MEM_RESET flag with VirtualAlloc to prevent Windows from paging in the old contents.
The main issue I would anticipate in using this approach is that you can't easily recover the disk space when you are done. It may also require the system's page file settings to be changed; I believe it will work with the default settings, but not if a maximum page file size has been set.
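For illustration, the MEM_RESET call itself is a one-liner; the region here is assumed to be pagefile-backed committed memory, as the answer describes:

#include <windows.h>

// Mark a range as no longer containing interesting data. Windows can then
// drop the pages instead of writing them out or faulting the old contents
// back in before the next overwrite.
void discard_old_contents(void* region, SIZE_T bytes)
{
    // The protection value is ignored when MEM_RESET is used, but it must still be valid.
    VirtualAlloc(region, bytes, MEM_RESET, PAGE_NOACCESS);
}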
I am going to assume that by "Loop around" you mean that the RAM got full.
What happens is that until the RAM gets full, all you have to do is allocate a page and write to it (RAM speed); after the RAM gets full, every page allocation turns into two actions:
1. writing the dirty page back (DISK speed)
2. allocating a page (RAM speed)
And worst case you also have to bring the page from the file in the disk (DISK speed) if you are reading something from it.
So instead of working only in RAM speed (page allocation), every page allocation runs in DISK speed.
This doesn't happen with 2x8GB because it is small enough for all of the memory of both files to remain fully in RAM.
The problem here, it turns out, is that when you overwrite a valid page in memory, the page first has to be read from the drive before being overwritten. There is no way to get around this issue, as far as I know, when using memory-mapped files.
The reason it doesn't happen during the first pass is that the pages being overwritten are not "valid" and thus they do not need to be read back.

Use sendfile() to copy file with threads or other efficient copy file method

I'm trying to use the Linux system call sendfile() to copy a file using threads.
I'm interested in optimizing these parts of the code:
fseek(fin, size * (number) / MAX_THREADS, SEEK_SET);
fseek(fout, size * (number) / MAX_THREADS, SEEK_SET);
/* ... */
fwrite(buff, 1, len, fout);
Code:
void* FileOperate::FileCpThread::threadCp(void *param)
{
    Info *ft = (Info *)param;
    FILE *fin = fopen(ft->fromfile, "r+");
    FILE *fout = fopen(ft->tofile, "w+");

    int size = getFileSize(ft->fromfile);
    int number = ft->num;

    fseek(fin, size * (number) / MAX_THREADS, SEEK_SET);
    fseek(fout, size * (number) / MAX_THREADS, SEEK_SET);

    char buff[1024] = {'\0'};
    int len = 0;
    int total = 0;

    while((len = fread(buff, 1, sizeof(buff), fin)) > 0)
    {
        fwrite(buff, 1, len, fout);
        total += len;
        if(total > size/MAX_THREADS)
        {
            break;
        }
    }

    fclose(fin);
    fclose(fout);
    return NULL; // thread routine must return a value
}
File copying is not CPU bound; if it were, you're likely to find that the limitation is at the kernel level and nothing you can do at the user level would parallelize it.
Such "improvements" done on mechanical drives will in fact degrade the throughput. You're wasting time seeking along the file instead of reading and writing it.
If the file is long and you don't expect to need the read or written data anytime soon, it might be tempting to use the O_DIRECT flag on open. That's a bad idea, since the O_DIRECT API is essentially broken by design.
Instead, you should use posix_fadvise on both source and destination files, with POSIX_FADV_SEQUENTIAL and POSIX_FADV_NOREUSE flags. After the write (or sendfile) call is finished, you need to advise that the data is not needed anymore - pass POSIX_FADV_DONTNEED. That way the page cache will only be used to the extent needed to keep the data flowing, and the pages will be recycled as soon as the data has been consumed (written to disk).
The sendfile will not push file data over to the user space, so it further relaxes some of the pressure from memory and processor cache. That's about the only other sensible improvement you can make for copying of files that's not device-specific.
Choosing a sensible chunk size is also desirable. Given that modern drives push over 100 MB/s, you might want to push a megabyte at a time, and always a multiple of the 4096-byte page size; thus 4096*256 is a decent starting chunk size to handle in a single sendfile or read/write call.
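Putting the fadvise and sendfile advice together, a single-threaded copy loop might look roughly like this (Linux-specific; sendfile() to a regular file needs a reasonably recent kernel, and error handling is trimmed for brevity):

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int copy_file(const char* from, const char* to)
{
    int in  = open(from, O_RDONLY);
    int out = open(to, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0)
        return -1;

    // Tell the kernel how the source will be accessed.
    posix_fadvise(in, 0, 0, POSIX_FADV_SEQUENTIAL);
    posix_fadvise(in, 0, 0, POSIX_FADV_NOREUSE);

    struct stat st;
    fstat(in, &st);

    const size_t chunk = 4096 * 256; // ~1 MiB, a multiple of the page size
    off_t offset = 0;
    while (offset < st.st_size)
    {
        ssize_t n = sendfile(out, in, &offset, chunk); // advances offset
        if (n <= 0)
            break;
        // Let the page cache recycle what we have already consumed.
        posix_fadvise(in,  offset - n, n, POSIX_FADV_DONTNEED);
        posix_fadvise(out, offset - n, n, POSIX_FADV_DONTNEED);
    }

    close(in);
    close(out);
    return offset == st.st_size ? 0 : -1;
}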
Read parallelization, as you propose it, only makes sense on RAID 0 volumes, and only when both the input and output files straddle the physical disks. You can then have one thread per the lesser of the number of source and destination volume physical disks straddled by the file. That's only necessary if you're not using asynchronous file I/O. With async I/O you wouldn't need more than one thread anyway, especially not if the chunk sizes are large (megabyte+) and the single-thread latency penalty is negligible.
There's no sense in parallelizing the copy of a single file on SSDs, unless you are on some very odd system indeed.