How many I/O operations (one or many) does SAS need to load a dataset from hard disk into a hash table?
My understanding is:
There are three components involved when data is loaded from hard disk into a hash table, related as follows:
Data on hard disk ---> memory buffer ---> PDV ---> hash table (in memory)
An I/O operation is counted each time data is loaded from hard disk into the memory buffer.
If the memory buffer is smaller than the data on disk, it takes more than one load (possibly many) to move the data from disk into the buffer before it is transferred to the PDV.
But if the memory buffer is larger than the data on disk, all the data can be loaded from disk in one go.
Is my understanding correct?
Related
I want to read and store a large CSV file into a map. I started by just reading the file and seeing how long it takes to process. This is my loop:
// Test the extraction itself; looping on eof() runs once more after the last failed read.
while (gFile >> data) {
}
It is taking me ~35 mins to process the csv file that contains 35 million lines and six columns. Is there any way to speed this up? Pretty new to SO, so apologies if not asking correctly.
Background
Files are stream devices or concepts. The most efficient way to read a file is to keep the data streaming (flowing). Every transaction has an overhead. The larger the data transfer, the less impact the overhead has. So, the goal is to keep the data flowing.
Memory faster than file access
Searching memory is many times faster than searching a file. So, searching memory for a "word" or delimiter is going to be faster than reading a file character by character to find the delimiter.
Method 1: Line by line
Using std::getline is much faster than using operator>>. Although the input code may read a block of data, you are only performing one transaction to read a record versus one transaction per column. Remember: keep the data flowing, and searching memory for the columns is faster.
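A minimal sketch of Method 1, assuming a plain comma-separated file with no quoted fields or embedded commas. The names split_fields and read_records are made up for this illustration: one std::getline call fetches a record, and the columns are then found by searching the line in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>
#include <vector>

// Split one record into fields by searching the in-memory line for `delim`.
// Assumption: no quoting or escaping, as in a simple numeric CSV.
std::vector<std::string> split_fields(const std::string& line, char delim = ',') {
    assert(delim != '\n');  // std::getline has already removed the newline
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// One stream transaction per record; empty lines are skipped.
std::vector<std::vector<std::string>> read_records(std::istream& in) {
    std::vector<std::vector<std::string>> records;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        records.push_back(split_fields(line));
    }
    return records;
}
```

For the 35 million line file you would probably convert each field into its final type right away instead of keeping a vector of strings per record, since the per-field allocations are likely to dominate otherwise.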
Method 2: Block reading
In the spirit of keeping the stream flowing, read a block of memory into a buffer (large buffer). Process the data from the buffer. This is more efficient than reading line by line because you can read in multiple lines of data with one transaction, reducing the overhead of a transaction.
One caveat is that you may have a record cross buffer boundaries, so you'll need to come up with an algorithm to handle that. The execution penalty is small and only happens once per transaction (consider this part of the overhead of a transaction).
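Here is one possible way to handle the boundary caveat, as a sketch only (C++17). The function name for_each_line_blockwise and the callback shape are assumptions for this example. The tail of a block that does not end in a newline is carried over and completed by the next block.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>
#include <string_view>
#include <vector>

// Read `in` in blocks of `block_size` bytes and call `on_line` once per line.
// A record cut by a block boundary is kept in `carry` until its newline arrives.
void for_each_line_blockwise(std::istream& in, std::size_t block_size,
                             const std::function<void(std::string_view)>& on_line) {
    assert(block_size > 0);
    std::vector<char> block(block_size);
    std::string carry;  // unfinished record from the previous block
    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;
        std::string_view view(block.data(), got);
        std::size_t start = 0;
        while (true) {
            std::size_t nl = view.find('\n', start);
            if (nl == std::string_view::npos) break;
            if (carry.empty()) {
                on_line(view.substr(start, nl - start));  // whole line inside this block
            } else {
                carry.append(view.substr(start, nl - start));
                on_line(carry);  // line that started in an earlier block
                carry.clear();
            }
            start = nl + 1;
        }
        carry.append(view.substr(start));  // keep the incomplete tail
    }
    if (!carry.empty()) on_line(carry);  // last line without a trailing newline
}
```

The string_view passed to the callback is only valid during the call, so copy anything you need to keep. A block size in the range of hundreds of kilobytes to a few megabytes is a reasonable starting point, but that is something to measure rather than assume.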
Method 3: Multiple threads
In the spirit of keeping the data streaming, you could create multiple threads. One thread is in charge of reading the data into a buffer while another thread processes the data from the buffer. This technique has a better chance of keeping the data flowing.
Method 4: Double buffering & multiple threads
This takes Method 3 above and adds multiple buffers. The reading thread can fill up one buffer then start filling a second buffer. The data processing thread will wait until the first buffer is filled before processing the data. This technique is used to better match the speed of reading data to the speed of processing the data.
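A rough sketch of Methods 3 and 4 together, under the assumption of one reader thread and one processing thread. The name read_double_buffered is invented here, and instead of two fixed buffers it uses a small queue that never holds more than max_in_flight filled buffers, which gives the same overlap of reading and processing.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <istream>
#include <mutex>
#include <queue>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>
#include <thread>
#include <utility>

// A reader thread fills buffers while the calling thread processes them.
// With max_in_flight == 2 the reader can fill one buffer while another waits.
// Sketch limitation: `process` must not throw, or the reader is never joined.
void read_double_buffered(std::istream& in, std::size_t buffer_size,
                          const std::function<void(const std::string&)>& process,
                          std::size_t max_in_flight = 2) {
    assert(buffer_size > 0 && max_in_flight > 0);
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> filled;
    bool done = false;

    std::thread reader([&] {
        while (in) {
            std::string buf(buffer_size, '\0');
            in.read(&buf[0], static_cast<std::streamsize>(buf.size()));
            buf.resize(static_cast<std::size_t>(in.gcount()));
            if (buf.empty()) break;
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return filled.size() < max_in_flight; });
            filled.push(std::move(buf));
            cv.notify_all();
        }
        std::lock_guard<std::mutex> lock(m);
        done = true;  // tell the consumer that no more buffers will arrive
        cv.notify_all();
    });

    while (true) {
        std::string buf;
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !filled.empty() || done; });
            if (filled.empty()) break;  // reader finished and queue drained
            buf = std::move(filled.front());
            filled.pop();
            cv.notify_all();  // a slot is free again for the reader
        }
        process(buf);  // runs outside the lock, while the reader keeps reading
    }
    reader.join();
}
```

Buffers arrive in file order because there is a single producer and a single consumer. Records that straddle two buffers still need the boundary handling described under Method 2.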
Method 5: Memory mapped files
With a memory mapped file, the operating system handles the reading of the file into memory on demand. Less code that you have to write, but you don't get as much control over when the file is read into memory. This is still faster than reading field by field.
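As an illustration of Method 5, here is a POSIX-only sketch (Linux, macOS, BSD; Windows would need CreateFileMapping and MapViewOfFile instead). The function name count_lines_mmap is an assumption for the example. It maps the file read-only, hints that access will be sequential, and scans the mapping like ordinary memory.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count newline characters in a file through a read-only memory mapping.
// Returns -1 if the file cannot be opened or mapped.
long count_lines_mmap(const char* path) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {  // mmap rejects a zero-length mapping
        close(fd);
        return 0;
    }
    std::size_t size = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;
    // Advisory hint only; the OS is free to ignore it.
    posix_madvise(p, size, POSIX_MADV_SEQUENTIAL);
    const char* data = static_cast<const char*>(p);
    long lines = 0;
    for (std::size_t i = 0; i < size; ++i) {
        if (data[i] == '\n') ++lines;  // pages are faulted in on first touch
    }
    munmap(p, size);
    return lines;
}
```

The same pointer and size can be handed to a CSV parser instead of the counting loop; the point is that no explicit read calls appear in your code.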
Let's start with the bottlenecks.
Reading from disk
Decoding the data
Store in map
Memory speed
Amount of memory
Read from disk
Read till you drop. If you can't read fast enough to use all the bandwidth of the disk, you can go faster. Ignore all other steps and only read.
Start by adding buffers to your instream
Set hints for reading
Use mmap
4 GB is a trivial size; if you don't already have 32 GB, upgrade
Too slow? Buy an M.2 disk.
Still too slow? Then get more exotic: change the disk driver, dump the OS, mirror the disks. Only your $£€ is the limit.
Decode the data
If your data is in lines that are all the same length, then all decodes can be done in parallel, limited only by memory bandwidth.
If the line lengths vary only a little, finding the end of each line can be done in parallel, followed by a parallel decode.
If the order of the lines doesn't matter for the final map, just split the file into #hardwarethreads parts and let each thread process its part until the first newline in the next thread's part.
Memory bandwidth will most likely be reached long before the CPU is anywhere near used up.
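The "split the file into parts" idea might look like the sketch below (C++17), assuming the file contents are already in memory, for example via mmap or one large read. The names split_at_newlines and count_lines_parallel are hypothetical, and counting newlines stands in for the real decode step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <string_view>
#include <thread>
#include <vector>

// Cut `data` into at most `parts` chunks. Each nominal boundary is moved
// forward to just after the next newline, so no line is split between chunks.
std::vector<std::string_view> split_at_newlines(std::string_view data, std::size_t parts) {
    assert(parts > 0);
    std::vector<std::string_view> chunks;
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= parts && begin < data.size(); ++i) {
        std::size_t end = data.size();  // the last chunk takes the remainder
        if (i < parts) {
            std::size_t target = std::max(begin, data.size() * i / parts);
            std::size_t nl = data.find('\n', target);
            if (nl != std::string_view::npos) end = nl + 1;
        }
        chunks.push_back(data.substr(begin, end - begin));
        begin = end;
    }
    return chunks;
}

// Process every chunk on its own thread; each thread writes only its own slot.
std::size_t count_lines_parallel(std::string_view data, std::size_t parts) {
    std::vector<std::string_view> chunks = split_at_newlines(data, parts);
    std::vector<std::size_t> counts(chunks.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        workers.emplace_back([&chunks, &counts, i] {
            counts[i] = static_cast<std::size_t>(
                std::count(chunks[i].begin(), chunks[i].end(), '\n'));
        });
    }
    for (std::thread& w : workers) w.join();
    return std::accumulate(counts.begin(), counts.end(), std::size_t{0});
}
```

In practice you would pass std::thread::hardware_concurrency() as parts. Fewer chunks than requested can come back when the lines are long relative to the chunk size.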
Store in map
Hopefully you have thought about this map in advance, as none of the std maps are thread safe.
If you don't care about order, a std::array can be used and you can run at full memory bandwidth.
Let's say you want to use std::unordered_map. The problem is that it needs to update its size after each write, so effectively you are limited to one thread writing to it.
You could use one thread at a time to update while the others precompute the hash of the record.
Having one thread write has the problem that nearly every write will be a cache miss, severely limiting speed.
So if that is not fast enough, roll your own hash_map, without a size that must be updated on every write.
To ensure thread safety you also need to protect the write. Having one mutex makes you as slow as or slower than the single writer.
You could try to make it lock-free and wait-free ... if you're not an expert you will get a severe headache instead.
If you have selected a bucket design for your hash, then you could create X times as many mutexes as writer threads and use the hash value to select the mutex. The extra mutexes increase the likelihood that two threads won't collide.
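The last point, several mutexes selected by hash value, can be sketched as below. This is a simplified stand-in rather than a hand-rolled hash table: each shard is an ordinary std::unordered_map behind its own mutex, and the class name StripedMap and the string-to-int mapping are assumptions for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// A map split into independently locked shards. The key's hash selects the
// shard, so two writers only contend when their keys land in the same shard.
class StripedMap {
public:
    explicit StripedMap(std::size_t shard_count) : shards_(shard_count) {
        assert(shard_count > 0);
    }

    void insert(const std::string& key, int value) {
        Shard& s = shards_[std::hash<std::string>{}(key) % shards_.size()];
        std::lock_guard<std::mutex> lock(s.mutex);
        s.map[key] = value;
    }

    // Locks one shard at a time, so the result is only exact once writers are done.
    std::size_t size() {
        std::size_t total = 0;
        for (Shard& s : shards_) {
            std::lock_guard<std::mutex> lock(s.mutex);
            total += s.map.size();
        }
        return total;
    }

private:
    struct Shard {
        std::mutex mutex;
        std::unordered_map<std::string, int> map;
    };
    std::vector<Shard> shards_;  // sized once in the constructor, never resized
};
```

Choosing the shard count as a multiple of the number of writer threads follows the "X times" suggestion above; the right multiple is something to measure.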
Memory speed
Each line will be transferred at least 4 times over the memory bus: once from the disk to RAM (at least once more if the driver is not good), once when the data is decoded, once when the map makes a read request, and once more when the map writes.
A good setup can save one memory access if the driver writes to cache, so that the decode does not result in an LLC miss.
Amount of memory
you should have enough memory to hold the total file, the data structure and some intermediate data.
Check if RAM is cheaper than your programming time.
I need to work with a huge data matrix file that is larger than available RAM. For example, the matrix has 2500 rows and 1 million columns, which comes to ~20 GB. Basically I only need to read the data into memory; there are no write operations at all.
I thought memory mapping would work. But it turned out that's not very efficient as the RAM would blow up. This is because the OS will always automatically cache the data (pages) into memory until the RAM is full. After that, as in data-larger-than-RAM case, there will be page faults, hence page-in and page-out process, which is essentially disk-read/write and slows the speed.
I need to point out that I would also want to randomly read some subset of data, say just Row 1000 to 1500 and Column 1000 to 5000.
[EDIT]
The data file is a txt file, well formatted like a matrix. Basically, I need to read in the big data matrix and compute the cross product with another vector, column by column.
[END EDIT]
My questions are:
Are there other alternative approaches? Could direct reading chunk-by-chunk be better?
Or is there a smart way to programmatically free up page caches before RAM is full in memory mapping? I just think it might be better if we could page out data from memory before RAM is full.
Is there a way to read data file column-by-column?
Thank you very much in advance!
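To make the chunk-by-chunk option concrete, here is one possible sketch, assuming the txt file stores one matrix row per line with whitespace-separated numbers and that the vector y has one entry per row. The function name crossprod_streaming is made up. Each result entry is r[j] = sum over i of x[i][j] * y[i], so every row can be folded into r and then discarded, and only one row is in memory at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Accumulate r[j] = sum_i x[i][j] * y[i] while holding a single row in memory.
// Assumption: rectangular matrix, one row per line, whitespace-separated values.
std::vector<double> crossprod_streaming(std::istream& in, const std::vector<double>& y) {
    std::vector<double> r;
    std::string line;
    std::size_t row = 0;
    while (row < y.size() && std::getline(in, line)) {
        std::istringstream fields(line);
        double x = 0.0;
        std::size_t col = 0;
        while (fields >> x) {
            if (row == 0) r.push_back(0.0);  // the first row fixes the column count
            assert(col < r.size());          // later rows must not be longer
            r[col] += x * y[row];
            ++col;
        }
        ++row;
    }
    return r;
}
```

With 1 million columns, the row buffer and the result vector are a few megabytes each, far below the ~20 GB of the full matrix. Reading a rectangular subset such as rows 1000 to 1500 would still have to skip over the earlier lines, because a text file has no index of where each row starts.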
Is it possible to store column-oriented tables in memory in MemSQL? By default, row-oriented tables are stored in memory and column-oriented tables on disk.
MemSQL columnstore tables are always disk-backed, however columnstore data is of course cached in memory, so if all your data happens to fit in memory you will get in-memory performance. (The disk only needs to be involved in that writes must persist to disk for durability, and after restart data must be loaded from disk before it can be read, just like for any durable in-memory store.)
In the rowstore, we use data structures and algorithms (e.g. lockfree skiplists) that take advantage of the fact that the data is in-memory to improve performance on point reads and writes, especially with high concurrency, but columnstore query execution works on fast scans over blocks of data and batch writes, which works well whether the data resides in-memory or on-disk.
I'm writing streams of images to a hard disk using std::fstream. Since most hard disk drives have a 32MB cache, is it more efficient to create a buffer to accumulate image data up to 32MB and then write to disk, or is it as efficient to just write every image onto the disk?
The cache is used as a read/write cache to alleviate problems due to queuing. Here are my experiences with disks:
If the disk is not an SSD, then it's better to write serially than to seek between files. Seeking is a killer for I/O performance.
Disks typically write in sector-sized units. Sector sizes are usually 512 bytes or 4 KB (newer disks). Try to write data one sector at a time.
Bunching I/O is always faster than multiple small I/Os. The simple reason is that the processor on the disk has a smaller queue to flush.
Whatever you can serve from memory, serve. Use disk only if necessary. You can always modify or invalidate the cache entry on write, depending on your reliability policy. Make sure you don't swap, so your memory cache size must be reasonable to begin with.
If you're doing this I/O management yourself, make sure you don't double-buffer with your OS page cache. Use O_DIRECT for this.
Use non-blocking I/O (O_NONBLOCK) if reliability isn't an issue.
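If you want to try the "bunching" idea from this list, a sketch could look like the following. The class name BatchedWriter and its interface are invented for the example. It collects small writes in memory and hands them to the stream in one larger write once the configured capacity would be exceeded.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>  // std::ostringstream, handy for trying this without a file
#include <vector>

// Collect small writes in memory and pass them to the stream in larger batches.
class BatchedWriter {
public:
    BatchedWriter(std::ostream& out, std::size_t capacity)
        : out_(out), capacity_(capacity) {
        assert(capacity > 0);
        buffer_.reserve(capacity);
    }
    BatchedWriter(const BatchedWriter&) = delete;
    BatchedWriter& operator=(const BatchedWriter&) = delete;
    ~BatchedWriter() { flush(); }

    void write(const char* data, std::size_t size) {
        if (buffer_.size() + size > capacity_) flush();  // keeps output in order
        if (size >= capacity_) {
            // Too large to be worth batching: write it straight through.
            out_.write(data, static_cast<std::streamsize>(size));
            return;
        }
        buffer_.insert(buffer_.end(), data, data + size);
    }

    void flush() {
        if (buffer_.empty()) return;
        out_.write(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
        buffer_.clear();
    }

private:
    std::ostream& out_;
    std::size_t capacity_;
    std::vector<char> buffer_;
};
```

Note that std::fstream already buffers internally and the OS page cache sits below that, so whether an extra layer like this helps for image-sized writes is exactly the kind of thing to measure on the target disk.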
Every part of your system, from fstream down to the disk driver knows more about specific efficiency than your application even has access to.
You couldn't improve upon the various buffering schemes if you tried, so don't bother.
I am writing an application to monitor a file and then match some pattern in that file.
I want to know what is the fastest way to read a file in C++.
Is reading line by line faster, or is reading a chunk of the file faster?
Your question has more to do with the performance of hardware, operating systems and run-time libraries than with programming languages. When you start reading a file, the OS is probably loading the file in chunks anyway, since the file is stored that way on disk. It makes sense for the OS to load each chunk entirely on first access and cache it, rather than reading the chunk, extracting the requested data and discarding the rest.
Which is faster? Line by line or chunk at a time? As always with these things, the answer is not something you can predict. The only way to know for sure is to write a line-by-line version and a chunk-at-a-time version and profile them (measure how long each version takes).
In general, reading large amounts of a file into a buffer, then parsing the buffer is a lot faster than reading individual lines. The actual proof is to profile code that reads line by line, then profile code reading in large buffers. Compare the profiles.
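A small sketch of such a comparison, with invented names (count_newlines_per_char, count_newlines_chunked, elapsed_ms). It uses counting newlines as a stand-in for the real parsing work and std::chrono::steady_clock for timing. A real profiler will tell you more than wall-clock timing, but this is enough for a first comparison.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>
#include <vector>

// Count newlines by pulling one character at a time from the stream.
std::size_t count_newlines_per_char(std::istream& in) {
    std::size_t n = 0;
    char c;
    while (in.get(c)) {
        if (c == '\n') ++n;
    }
    return n;
}

// Count newlines by reading `chunk_size` bytes at a time and scanning the buffer.
std::size_t count_newlines_chunked(std::istream& in, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<char> buf(chunk_size);
    std::size_t n = 0;
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        for (std::size_t i = 0; i < got; ++i) {
            if (buf[i] == '\n') ++n;
        }
    }
    return n;
}

// Run `f` once and return the elapsed wall-clock time in milliseconds.
template <class F>
double elapsed_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

To compare, open the same file twice as std::ifstream in binary mode and wrap each call in elapsed_ms. Run each variant more than once, because the second pass over a file is usually served from the OS cache and will look faster regardless of the method.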
The foundation for this justification is:
Reduction of I/O Transactions
Keeping the Hard Drive Spinning
Parsing Memory Is Faster
I improved the performance of one application from 65 minutes down to 2 minutes by applying these techniques.
Reduction of I/O Transactions
Reducing the I/O transactions results in fewer calls to the operating system, reducing the time spent there. It also reduces the number of branches in your code, which improves the performance of the instruction pipeline in your processor. And it reduces traffic to the hard drive: the hard drive has fewer commands to process, so it has less overhead.
Keeping the Hard Drive Spinning
To access a file, the hard drive has to ramp up the motors to a decent speed (which takes time), position the head to the desired track and sector, and read the data. Positioning the head and ramping up the motor is overhead time required by all transactions. The overhead in reading the data is very little. The objective is to read as much data as possible in one transaction because this is where the hard drive is most efficient. Reducing the number of transactions will reduce the wait times for ramping up the motors and positioning the heads.
Although modern computers have caches for both data and commands, reducing the number of transactions will speed things up. Larger "payloads" allow more efficient use of their caches and do not require the overhead of sorting the requests.
Parsing Memory Is Faster
Always, reading from memory is faster than reading from an external source. Reading a second line of text from a buffer requires incrementing a pointer. Reading a second line from a file requires an I/O transaction to get the data into memory. If your program has memory to spare, haul the data into memory then search the memory.
Too Much Data Negates The Performance Savings
There is a finite amount of RAM on the computer for applications to share. Accessing more memory than this may cause the computer to "page", or forward the request to the hard drive (also known as virtual memory). In this case, there may be little savings because the hard drive is accessed anyway (by the operating system, without the knowledge of your program). Profiling will give you a good indication of the optimum size of the data buffer.
The application I optimized was reading one byte at a time from a 2 GB file. The performance greatly improved when I changed the program to read 1 MB chunks of data. This also allowed for additional performance gains from loop unrolling.
Hope this helps.
You could try to map the file directly to memory using a memory-mapped-file, and then use standard C++ logic to find the patterns that you want.
The OS (or even the C++ class you use) probably reads the file in chunks and caches it, even if you read it line by line, to improve performance by minimizing disk access (from the operating system's point of view, it is faster to read data from a memory buffer than from a hard disk device).
Note that a good way to improve the performance of your programs (if it is really time critical) is to minimize the number of calls to operating system functions (which manage its resources).