Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
Improve this question
[Sorry for the confusion: The original post had "SSD" instead of "HDD" in the title, but I figured out that I performed the tests on an HDD by accident, as I was accessing the wrong mounting point. On an SSD this phenomenon did not occur. Still interesting that it happens for an HDD though.]
I am using the following code for reading in a loop from a given number of files of constant size. All files to be read exist and reading is successful.
It is clear that varying the file size has an effect on fMBPerSecond, because when reading files smaller than the page size, still the whole page is read. However, nNumberOfFiles has an effect on fMBPerSecond as well, which is what I do not understand. Apparently, it is not nNumberOfFiles itself that has an effect, but it is the product nNumberOfFiles * nFileSize.
But why should it have an effect? The files are opened/read/closed sequentially in a loop.
I tested with nFileSize = 65536. When choosing nNumberOfFiles = 10000 (or smaller) I get something around fMBPerSecond = 500 MB/s. With nNumberOfFiles = 20000 I get something around fMBPerSecond = 100 MB/s, which is a dramatic loss in performance.
Oh, and I should mention that I clear the disk cache before reading by executing:
sudo sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
Any ideas what is happening here behind the scenes would be welcome.
Pedram
void Read(int nFileSize, int nNumberOfFiles)
{
char szFilePath[4096];
unsigned char *pBuffer = new unsigned char[nFileSize];
Helpers::get_timer_value(true);
for (int i = 0; i < nNumberOfFiles; i++)
{
sprintf(szFilePath, "files/test_file_%.4i", i);
int f = open(szFilePath, O_RDONLY);
if (f)
{
if (read(f, pBuffer, (ssize_t) nFileSize) != (ssize_t) nFileSize)
printf("error: could not read file '%s'\n", szFilePath);
close(f);
}
else
{
printf("error: could not open file for reading '%s'\n", szFilePath);
}
}
const unsigned int t = Helpers::get_timer_value();
const float fMiliseconds = float(t) / 1000.0f;
const float fMilisecondsPerFile = fMiliseconds / float(nNumberOfFiles);
const float fBytesPerSecond = 1000.0f / fMilisecondsPerFile * float(nFileSize);
const float fMBPerSecond = fBytesPerSecond / 1024.0f / 1024.0f;
printf("t = %.8f ms / %.8i bytes - %.8f MB/s\n", fMilisecondsPerFile,
nFileSize, fMBPerSecond);
delete [] pBuffer;
}
There are several SSD models, especially the more expensive datacenter models, that combine an internal DRAM cache with the (slower) persistent NAND cells. As long as the data you read fits in the DRAM cache, you'll get faster response.
Related
I'm building a graphics engine, and I need to write te result image to a .bmp file. I'm storing the pixels in a vector<Color>. While also saving the width and the heigth of the image. Currently I'm writing the image as follows(I didn't write this code myself):
std::ostream &img::operator<<(std::ostream &out, EasyImage const &image) {
//temporaryily enable exceptions on output stream
enable_exceptions(out, std::ios::badbit | std::ios::failbit);
//declare some struct-vars we're going to need:
bmpfile_magic magic;
bmpfile_header file_header;
bmp_header header;
uint8_t padding[] =
{0, 0, 0, 0};
//calculate the total size of the pixel data
unsigned int line_width = image.get_width() * 3; //3 bytes per pixel
unsigned int line_padding = 0;
if (line_width % 4 != 0) {
line_padding = 4 - (line_width % 4);
}
//lines must be aligned to a multiple of 4 bytes
line_width += line_padding;
unsigned int pixel_size = image.get_height() * line_width;
//start filling the headers
magic.magic[0] = 'B';
magic.magic[1] = 'M';
file_header.file_size = to_little_endian(pixel_size + sizeof(file_header) + sizeof(header) + sizeof(magic));
file_header.bmp_offset = to_little_endian(sizeof(file_header) + sizeof(header) + sizeof(magic));
file_header.reserved_1 = 0;
file_header.reserved_2 = 0;
header.header_size = to_little_endian(sizeof(header));
header.width = to_little_endian(image.get_width());
header.height = to_little_endian(image.get_height());
header.nplanes = to_little_endian(1);
header.bits_per_pixel = to_little_endian(24);//3bytes or 24 bits per pixel
header.compress_type = 0; //no compression
header.pixel_size = pixel_size;
header.hres = to_little_endian(11811); //11811 pixels/meter or 300dpi
header.vres = to_little_endian(11811); //11811 pixels/meter or 300dpi
header.ncolors = 0; //no color palette
header.nimpcolors = 0;//no important colors
//okay that should be all the header stuff: let's write it to the stream
out.write((char *) &magic, sizeof(magic));
out.write((char *) &file_header, sizeof(file_header));
out.write((char *) &header, sizeof(header));
//okay let's write the pixels themselves:
//they are arranged left->right, bottom->top, b,g,r
// this is the main bottleneck
for (unsigned int i = 0; i < image.get_height(); i++) {
//loop over all lines
for (unsigned int j = 0; j < image.get_width(); j++) {
//loop over all pixels in a line
//we cast &color to char*. since the color fields are ordered blue,green,red they should be written automatically
//in the right order
out.write((char *) &image(j, i), 3 * sizeof(uint8_t));
}
if (line_padding > 0)
out.write((char *) padding, line_padding);
}
//okay we should be done
return out;
}
As you can see, the pixels are being written one by one. This is quite slow, I put some timers in my program, and found that the writing was my main bottleneck.
I tried to write entire (horizontal) lines, but I did not find how to do it(best I found was this.
Secondly, I wanted to write to the file using multithreading(not sure if I need to use threading or processing). using openMP. But that means I need to specify which byte address to write to, I think, which I couldn't solve.
Latstly, I thought about immidiatly writing to the file whenever I drew an object, but then I had the same issue with writing to specific locations in the file.
So, my question is: what's the best(fastest) way to tackle this problem. (Compiling this for windows and linux)
The fastest method to write to a file is to use hardware assist. Write your output to memory (a.k.a. buffer), then tell the hardware device to transfer from memory to the file (disk).
The next fastest method is to write all the data to a buffer then block write the data to the file. If you want other tasks or threads to execute during your writing, then create a thread that writes the buffer to the file.
When writing to a file, the more data per transaction, the more efficient the write will be. For example, 1 write of 1024 bytes is faster than 1024 writes of one byte.
The idea is to keep the data streaming. Slowing down the transfer rate may be faster than a burst write, delay, burst write, delay, etc.
Remember that the disk is essentially a serial device (unless you have a special hard drive). Bits are laid down on the platters using a bit stream. Writing data in parallel will have adverse effects because the head will have to be moved between the parallel activities.
Remember that if you use more than one core, there will be more traffic on the data bus. The transfer to the file will have to pause while other threads/tasks are using the data bus. So, if you can, block all tasks, then transfer your data. :-)
I've written programs that copy from slow memory to fast memory, then transferred from fast memory to the hard drive. That was also using interrupts (threads).
Summary
Fast writing to a file involves:
Keep the data streaming; minimize the pauses.
Write in binary mode (no translations, please).
Write in blocks (format into memory as necessary before writing the block).
Maximize the data in a transaction.
Use separate writing thread, if you want other tasks running "concurrently".
The hard drive is a serial device, not parallel. Bits are written to the platters in a serial stream.
I tested the next code with Google Benchmark framework to measure memory access latency with different array sizes:
int64_t MemoryAccessAllElements(const int64_t *data, size_t length) {
for (size_t id = 0; id < length; id++) {
volatile int64_t ignored = data[id];
}
return 0;
}
int64_t MemoryAccessEvery4th(const int64_t *data, size_t length) {
for (size_t id = 0; id < length; id += 4) {
volatile int64_t ignored = data[id];
}
return 0;
}
And I get next results (the results are averaged by google benchmark, for big arrays, there is about ~10 iterations and much more work performed for smaller one):
There is a lot of different stuff happened on this picture and unfortunately, I can't explain all changes in the graph.
I tested this code on single core CPU with next caches configuration:
CPU Caches:
L1 Data 32K (x1), 8 way associative
L1 Instruction 32K (x1), 8 way associative
L2 Unified 256K (x1), 8 way associative
L3 Unified 30720K (x1), 20 way associative
At this pictures we can see many changes in graph behavior:
There is a spike after 64 bytes array size that can be explained by the fact, that cache line size is 64 bytes long and with an array of size more than 64 bytes we experience one more L1 cache miss (which can be categorized as a compulsory cache miss)
Also, the latency increasing near the cache size bounds that is also easy to explain - at this moment we experience capacity cache misses
But there is a lot of questions about results that I can't explain:
Why latency for the MemoryAccessEvery4th decreasing after array exceeded ~1024 bytes?
Why we can see another peak for the MemoryAccessAllElements around 512 bytes? It is an interesting point because at this moment we started to access more than one set of cache lines (8 * 64 bytes in one set). But is it really caused by this event and if it is than how it can be explained?
Why we can see latency increasing after passing the L2 cache size while benchmarking MemoryAccessEvery4th but there is no such difference with the MemoryAccessAllElements?
I've tried to compare my results with the results from gallery of processor cache effects and what every programmer should know about memory, but I can't fully describe my results with reasoning from this articles.
Can someone help me to understand the internal processes of CPU caching?
UPD:
I use the following code to measure performance of memory access:
#include <benchmark/benchmark.h>
using namespace benchmark;
void InitializeWithRandomNumbers(long long *array, size_t length) {
auto random = Random(0);
for (size_t id = 0; id < length; id++) {
array[id] = static_cast<long long>(random.NextLong(0, 1LL << 60));
}
}
static void MemoryAccessAllElements_Benchmark(State &state) {
size_t size = static_cast<size_t>(state.range(0));
auto array = new long long[size];
InitializeWithRandomNumbers(array, size);
for (auto _ : state) {
DoNotOptimize(MemoryAccessAllElements(array, size));
}
delete[] array;
}
static void CustomizeBenchmark(benchmark::internal::Benchmark *benchmark) {
for (int size = 2; size <= (1 << 24); size *= 2) {
benchmark->Arg(size);
}
}
BENCHMARK(MemoryAccessAllElements_Benchmark)->Apply(CustomizeBenchmark);
BENCHMARK_MAIN();
You can find slightly different examples in the repository, but actually the basic approach for the benchmark in the question is same.
I have a very resource intensive code, that I made, so I can split the workload over multiple pthreads. While everything works, the computation is done faster, etc. What I'm guessing happens is that other processes on that processor core get so slow, that they crash after a few seconds of runtime.
I already managed to kill random processes like Chrome tabs, the Cinnamon DE or even the entire OS (Kernel?).
Code: (It's late, and I'm too tired to make a pseudo code, or even comments..)
-- But it's a brute force code, not so much for cracking, but for testing passwords and or CPU IPS.
Any ideas how to fix this, while still keeping as much performance as possible?
static unsigned int NTHREADS = std::thread::hardware_concurrency();
static int THREAD_COMPLETE = -1;
static std::string PASSWORD = "";
static std::string CHARS;
static std::mutex MUTEX;
void *find_seq(void *arg_0)
{
unsigned int _arg_0 = *((unsigned int *) arg_0);
std::string *str_CURRENT = new std::string(" ");
while (true)
{
for (unsigned int loop_0 = _arg_0; loop_0 < CHARS.length() - 1; loop_0 += NTHREADS)
{
str_CURRENT->back() = CHARS[loop_0];
if (*str_CURRENT == PASSWORD)
{
THREAD_COMPLETE = _arg_0;
return (void *) str_CURRENT;
}
}
str_CURRENT->back() = CHARS.back();
for (int loop_1 = (str_CURRENT->length() - 1); loop_1 >= 0; loop_1--)
{
if (str_CURRENT->at(loop_1) == CHARS.back())
{
if (loop_1 == 0)
str_CURRENT->assign(str_CURRENT->length() + 1, CHARS.front());
else
{
str_CURRENT->at(loop_1) = CHARS.front();
str_CURRENT->at(loop_1 - 1) = CHARS[CHARS.find(str_CURRENT->at(loop_1 - 1)) + 1];
}
}
}
};
}
Areuz,
Can you post the full code? I suspect the issue is the NTHREADS value. On my Ubuntu box, the value is set to 8 which is the number of cores in the /proc/cpuinfo file. Kicking off 8 'hot' threads on my box hogs 100% of the CPU. The kernel will time slice for its own critical processes but in general all other processes will starve for CPU.
Check out the max processor value in /etc/cpuinfo and go at least one lower then that. The CPU's are numbered 0-7 on my box, so 7 would be the max for me. The actual max might be 3 since 4 of my cores are hyper-threads. For completely CPU processes, hyper-threading generally doesn't help.
Bottom line, don't hog all the CPU, it will destabilize the system.
--Matt
Thank you for your answers and especially Matthew Fisher for his suggestion to try it on another system.
After some trial and error I decided to pull back my CPU overclock that I thought was stable (I had it for over a year) and that solved this weird behaviour. I guess that I've never ran such a CPU intensive and (I'm guessing) efficient (In regards to not throttling the full CPU by yielding) script to see this happen.
As Matthew suggested I need to come up with a better way than to just constantly check the THREAD_COMPLETE variable with a while true loop, but I hope to resolve that in the comments.
Full and updated code for future visitors is here: pastebin.com/jbiYyKBu
Currently, I am working on real time interface with Visual Studio C++.
I faced problem is, when buffer is running for data store, that time .exe is not responding at the point data store in buffer. I collect data as 130Hz from motion sensor. I have tried to increase virtual memory of computer, but problem was not solved.
Code Structure:
int main(){
int no_data = 0;
float x_abs;
float y_abs;
int sensorID = 0;
while (1){
// Define Buffer
char before_trial_output_data[][8 * 4][128] = { { { 0, }, }, };
// Collect Real Time Data
x_abs = abs(inchtocm * record[sensorID].y);
y_abs = abs(inchtocm * record[sensorID].x);
//Save in buffer
sprintf(before_trial_output_data[no_data][sensorID], "%d %8.3f %8.3f\n",no_data,x_abs,y_abs);
//Increment point
no_data++;
// Break While loop, Press ESc key
if (GetAsyncKeyState(VK_ESCAPE)){
break;
}
}
//Data Save in File
printf("\nSaving results to 'RecordData.txt'..\n");
FILE *fp3 = fopen("RecordData.dat", "w");
for (i = 0; i<no_data-1; i++)
fprintf(fp3, output_data[i][sensorID]);
fclose(fp3);
printf("Complete...\n");
}
The code you posted doesn't show how you allocate more memory for your before_trial_output_data buffer when needed. Do you want me to guess? I guess you are using some flavor of realloc(), which needs to allocate ever-increasing amount of memory, fragmenting your heap terribly.
However, in order for you to save that data to a file later on, it doesn't need to be in continuous memory, so some kind of list will work way better than an array.
Also, there is no provision in your "pseudo" code for a 130Hz reading; it processes records as fast as possible, and my guess is - much faster.
Is your prinf() call also a "pseudo code"? Otherwise you are looking for trouble by having mismatch of the % format specifications and number and type of parameters passed in.
I have a circular buffer which is backed with file mapped memory (the buffer is in the size range of 8GB-512GB).
I am writing to (8 instances of) this memory in a sequential manner from the beginning to the end at which point it loops around back to the beginning.
It works fine until it reaches the end where it needs to perform two file mappings and loop around the memory, at which point IO performance is totally trashed and doesn't recover (even after several minutes). I can't quite figure it out.
using namespace boost::interprocess;
class mapping
{
public:
mapping()
{
}
mapping(file_mapping& file, mode_t mode, std::size_t file_size, std::size_t offset, std::size_t size)
: offset_(offset)
, mode_(mode)
{
const auto aligned_size = page_ceil(size + page_size());
const auto aligned_file_size = page_floor(file_size);
const auto aligned_file_offset = page_floor(offset % aligned_file_size);
const auto region1_size = std::min(aligned_size, aligned_file_size - aligned_file_offset);
const auto region2_size = aligned_size - region1_size;
if (region2_size)
{
const auto region1_address = mapped_region(file, read_only, 0, (region1_size + region2_size) * 2).get_address();
const auto region2_address = reinterpret_cast<char*>(region1_address) + region1_size;
region1_ = mapped_region(file, mode, aligned_file_offset, region1_size, region1_address);
region2_ = mapped_region(file, mode, 0, region2_size, region2_address);
}
else
{
region1_ = mapped_region(file, mode, aligned_file_offset, region1_size);
region2_ = mapped_region();
}
size_ = region1_.get_size() + region2_.get_size();
offset_ = aligned_file_offset;
}
auto offset() const -> std::size_t { return offset_; }
auto size() const -> std::size_t { return size_; }
auto data() const -> const void* { return region1_.get_address(); }
auto data() -> void* { return region1_.get_address(); }
auto flush(bool async = true) -> void
{
region1_.flush(async);
region2_.flush(async);
}
auto mode() const -> mode_t { return mode_; }
private:
std::size_t offset_ = 0;
std::size_t size_ = 0;
mode_t mode_;
mapped_region region1_;
mapped_region region2_;
};
struct loop_mapping::impl final
{
std::tr2::sys::path file_path_;
file_mapping file_mapping_;
std::size_t file_size_;
std::size_t map_size_ = page_floor(256000000ULL);
std::shared_ptr<mapping> mapping_ = std::shared_ptr<mapping>(new mapping());
std::shared_ptr<mapping> prev_mapping_;
bool write_;
public:
impl(std::tr2::sys::path path, bool write)
: file_path_(std::move(path))
, file_mapping_(file_path_.string().c_str(), write ? read_write : read_only)
, file_size_(page_floor(std::tr2::sys::file_size(file_path_)))
, write_(write)
{
REQUIRE(file_size_ >= map_size_ * 3);
}
~impl()
{
prev_mapping_.reset();
mapping_.reset();
}
auto data(std::size_t offset, std::size_t size, boost::optional<bool> write_opt) -> void*
{
offset = offset % page_floor(file_size_);
REQUIRE(size < file_size_ - map_size_ * 3);
const auto write = write_opt.get_value_or(write_);
REQUIRE(!write || write_);
if ((write && mapping_->mode() == read_only) || offset < mapping_->offset() || offset + size >= mapping_->offset() + mapping_->size())
{
auto new_mapping = std::make_shared<loop::mapping>(file_mapping_, write ? read_write : read_only, file_size_, page_floor(offset), std::max(size + page_size(), map_size_));
if (mapping_)
mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
if (prev_mapping_)
prev_mapping_->flush(false);
prev_mapping_ = std::move(mapping_);
mapping_ = std::move(new_mapping);
}
return reinterpret_cast<char*>(mapping_->data()) + offset - mapping_->offset();
}
}
-
// 8 processes to 8 different files 128GB each.
loop_mapping loop(...);
for (auto n = 0; true; ++n)
{
auto src = get_new_data(5000000/8);
auto dst = loop.data(n * 5000000/8, 5000000/8, true);
std::memcpy(dst, src, 5000000/8); // This becomes very slow after loop around.
std::this_thread::sleep_for(std::chrono::seconds(1));
}
Any ideas?
Target System:
1x 3TB Seagate Constellation ES.3
2x Xeon E5-2400 (6 core, 2.6Ghz)
6x 8GB DDR3 1600Mhz ECC
Windows Server 2012
8 buffers each 8 to 512GiB in size on a system with 48GiB of physical memory means that your mapping will have to be swapped. No surprise there.
The issue, as you have already remarked yourself, is that prior to being able to write to a page, you encounter a fault, and the page is read in. That doesn't happen on the first run, since merely a zero page is used. To make matters worse, reading in pages again competes with write-behind of dirty pages.
Now, there is unluckily no way of telling Windows "I'm going to overwrite this anyway", nor is there any way of making the disk load your stuff faster. However, you can start the transfer earlier (maybe when you're 3/4 through the buffer).
Windows Server 2012 (which you're using) supports PrefetchVirtualMemory which is a somewhat half-assed substitute for POSIX madvise(MADV_WILLNEED).
That is, of course, not exactly what you want to do when you already know that you will overwrite the complete memory page (or several of them) anyway, but it is as good as you can get. It's worth a try in any case.
Ideally, you would want to do something like a destructive madvise(MADV_DONTNEED) as implemented e.g. under Linux (and I believe FreeBSD, too) immediately before you overwrite a page, but I am not aware of any way of doing this under Windows (...short of destroying the view and the mapping and mapping from scratch, but then you throw away all data, so that's a bit useless).
Even with prefetching early you will still be limited by disk I/O bandwidth, but at least you can hide the latency.
Another "obvious" (but probably not that easy) solution would be to make the consumer faster. That would allow for a smaller buffer to begin with, and even on a huge buffer it would keep the working set smaller (both producer and consumer force pages into RAM while accessing them, so if the consumer accesses data with less delay after the producer has written them, they will both be using mostly the same set of pages.) Smaller working sets fit into RAM more easily.
But I realize that you probably didn't choose a several-gigabyte buffer for no reason.
Since your code is devoid of any comment, filled with auto variables, not compilable as is and I don't have 512Gb available on my PC to test it anyway, this will remain a passing tought off the top of my head.
each of your process only writes a few hundreds Kb/s, so there should be ample time to flush that to disk in the background.
However, it seems you are asking the boost mapping system to flush either synchronously or asynchronously the previous chunk depending on your mysterious offset computations:
mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
I guess the rollover triggers a synchronous flush, which is a likely culprit for the sudden slowdown.
What the operating system does at this point depends on the boost implementation, which is not described (or at least in a way obvious enough for me to get it after a cursory look at their man page).
If boost stuffed your 48 Gb of memory with unflushed pages, you could certainly experience a sudden and prolonged deceleration.
At least worth a comment in your code if this mysterious line does something clever and completely different I missed entirely.
If you are able to back the memory mapping with the page file rather than a specific file, you can use the MEM_RESET flag with VirtualAlloc to prevent Windows from paging in the old contents.
The main issue I would anticipate in using this approach is that you can't easily recover the disk space when you are done. It may also require the system's page file settings to be changed; I believe it will work with the default settings, but not if a maximum page file size has been set.
I am going to assume that by "Loop around" you mean that the RAM got full.
What happens is that until the RAM get full, all you have to do is allocate a page and write in it (RAM speed), after the RAM gets full every page allocation turns to 2 actions:
1. you have to write the dirty page back (DISK speed)
2. and allocate a page (RAM speed)
And worst case you also have to bring the page from the file in the disk (DISK speed) if you are reading something from it.
So instead of working only in RAM speed (page allocation), every page allocation runs in DISK speed.
This doesnt happen with 2x8GB because it is small enough for all of the memory of both files to remain fully in the RAM.
The problem here it turns out is that when overwrite a valid page in memory the page first has to be read from the drive before being overwritten. There is no way to get around this issue as far as I know when using memory mapped files.
The reason it doesn't happen during the first pass is that the pages being overwritten are not "valid" and thus they do not need to be read back.