HDFS: difference between one big file vs. several small files

Could someone explain what the difference would be between storing one big file (let's say 512MB) and four 128MB files for one table, if the block size is 128MB? I know that the 512MB file will be split into 4 blocks. But in general, what would be the advantages/disadvantages of storing the data in one file?
Thanks for the explanation.

Related

Sorting data in RAM

Suppose you are in an embedded environment and you have, oh, 1MB of RAM to play with. Now let's pretend you have a JSON-based file management system in which each file produced by the device (let's say its metadata files) is entered into this JSON file (or files). You let this device do whatever it does and a month later it comes back with 10,000 files and entries stored into the file system and JSON file. Each entry consumes around 200 bytes, so you have 10,000 * 200 = 2MB. Now you want to sort all those files by some piece of data, let's say by file name, which is 100 bytes each.
In order to sort, maybe alphabetically, do you need to load all 10,000 file names into RAM at once, or are there sequential ways to sort this kind of data? Maybe by first sorting it into subfiles and then sorting those files further? Is it even possible?
This is a C++ environment.
In order to sort, maybe alphabetically, do you need to load all 10,000 file names into RAM at once,…
No, you do not.
… or are there sequential ways to sort this kind of data?
Of course ways exist to sort external data, although they are not necessarily “sequential.” Sorting data that is not all in main memory at once is called external sorting.
Maybe by first sorting it into subfiles and then sorting those files further?
Stack Overflow is for answering specific questions or problems. Whether algorithms exist and what they are called are specific questions, so I have answered those. What the algorithms are and what their properties and benefits are is a general question, so you should do further research on your own regarding them.
Is it even possible?
Yes.
If you need to access them in a specific order, it might be a good idea to just store them in this way on the file system in the first place.
As @Eric mentioned, this is not a specific question. You should just improve your C/C++ skills in order to answer these questions for yourself. There are a lot of free resources on the web.
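As a rough illustration of the external sorting mentioned above, here is a minimal sketch of an external merge sort. It assumes the names sit one per line in a file; the file names, chunk size, and temp-file naming are all illustrative placeholders, not part of the original question.

    // Minimal sketch of an external merge sort, assuming the names sit one per
    // line in "names.txt" and temp files may be written to the working directory.
    // Chunk size and file names are illustrative placeholders.
    #include <algorithm>
    #include <fstream>
    #include <functional>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        const std::size_t kChunk = 2000;   // how many names to hold in RAM at once
        std::ifstream in("names.txt");
        std::vector<std::string> chunk;
        std::vector<std::string> runFiles;
        std::string line;

        // Pass 1: read a bounded chunk, sort it in RAM, spill it to a temp file.
        auto spill = [&]() {
            std::sort(chunk.begin(), chunk.end());
            std::string name = "run" + std::to_string(runFiles.size()) + ".tmp";
            std::ofstream out(name);
            for (const auto& s : chunk) out << s << '\n';
            runFiles.push_back(name);
            chunk.clear();
        };
        while (std::getline(in, line)) {
            chunk.push_back(line);
            if (chunk.size() == kChunk) spill();
        }
        if (!chunk.empty()) spill();

        // Pass 2: k-way merge of the sorted runs using a min-heap over the streams.
        using Entry = std::pair<std::string, std::size_t>;   // (value, run index)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        std::vector<std::ifstream> runs;
        for (const auto& f : runFiles) runs.emplace_back(f);
        for (std::size_t i = 0; i < runs.size(); ++i)
            if (std::getline(runs[i], line)) heap.push({line, i});

        std::ofstream out("names_sorted.txt");
        while (!heap.empty()) {
            Entry top = heap.top();
            heap.pop();
            out << top.first << '\n';
            if (std::getline(runs[top.second], line)) heap.push({line, top.second});
        }
    }

With 2,000 names of ~100 bytes each per run, the sort pass stays around 200KB of RAM, comfortably inside a 1MB budget, and the merge holds only one line per run in memory at a time.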

Most memory efficient way to transpose a large file in C++

I have an input file which is 40,000 columns by 2 million rows. This file is roughly 70GB in memory and thus too large to fit in memory in one go.
I need to effectively transpose this file; however, there are some lines which are junk and should not be added to the output.
What I have currently implemented uses ifstream and a nested getline, which effectively reads the whole file into memory (and thus lets the OS handle memory management), and then writes out the transpose. This works in an acceptable timescale, but obviously gives the application a large memory footprint.
I now have to run this program on a cluster which makes me specify memory requirements ahead of time, and thus a large memory footprint increases job queuing time in the cluster.
I feel there has to be a more memory efficient approach to doing this. One thought I had was using mmap, which would allow me to do the transposition without reading the file into memory at all. Are there any other alternatives?
To be clear, I am happy to use any language and any method that can do this in a reasonable amount of time (my current program takes around 4 minutes on this large file on a local workstation).
Thanks
I would probably do this with a pre-processing pass over the file that only needs to hold one line at a time in its working set.
Filter the junk and make every line the same (binary) size.
Now, you can memory map the temp file, and stride the columns as rows for the output.
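A minimal sketch of that two-pass idea, assuming the input is whitespace-separated numeric text and the column count is known up front; the file names, the junk test, and the dimensions are illustrative placeholders, and the mmap call is POSIX-specific.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t cols = 40000;               // known column count

        // Pass 1: filter junk lines and write fixed-size binary records.
        std::ifstream in("input.txt");
        std::ofstream tmp("fixed.bin", std::ios::binary);
        std::string line;
        std::size_t rows = 0;
        while (std::getline(in, line)) {
            if (line.empty() || line[0] == '#') continue;   // hypothetical junk test
            std::istringstream ss(line);
            std::vector<double> vals(cols);
            for (auto& v : vals) ss >> v;
            tmp.write(reinterpret_cast<const char*>(vals.data()),
                      static_cast<std::streamsize>(cols * sizeof(double)));
            ++rows;
        }
        tmp.close();

        // Pass 2: memory-map the fixed-width temp file and stride column-wise,
        // so each output row is one input column. The OS can evict pages as it
        // goes, so the working set stays far below the file size.
        int fd = open("fixed.bin", O_RDONLY);
        std::size_t bytes = rows * cols * sizeof(double);
        const double* m = static_cast<const double*>(
            mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0));
        std::ofstream out("transposed.txt");
        for (std::size_t c = 0; c < cols; ++c) {
            for (std::size_t r = 0; r < rows; ++r)
                out << m[r * cols + c] << (r + 1 < rows ? ' ' : '\n');
        }
        munmap(const_cast<double*>(m), bytes);
        close(fd);
    }

Because every row in the temp file has the same byte length, element (r, c) lives at offset (r * cols + c) * sizeof(double), which is what makes the strided second pass possible without loading the whole file.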
I think the best way for you to do this would be to parse each line first and determine whether it is junk or not. After this, you could write the remaining lines to the output. This may take more time, but it would save a lot of memory and keep you from spending memory on lines that are completely useless to the output you are trying to produce. However, using mmap would also be a great way to achieve your goal.
Hope this helps!!

What is the best way to read huge data file that is larger than RAM in C++?

I need to work with a huge data matrix file that is larger than the available RAM. For example, the matrix has 2,500 rows and 1 million columns, which comes to ~20 GB. Basically I only need to read the data into memory; there are no write operations at all.
I thought memory mapping would work. But it turned out not to be very efficient, as RAM usage would blow up. This is because the OS will automatically cache the data (pages) in memory until RAM is full. After that, as in the data-larger-than-RAM case, there will be page faults, hence page-in and page-out activity, which is essentially disk reads/writes and slows things down.
I should point out that I would also want to randomly read some subset of the data, say just rows 1000 to 1500 and columns 1000 to 5000.
[EDIT]
The data file is a txt file, well formatted as a matrix. Basically, I need to read in the big data matrix and compute a cross product with another vector, column by column.
[END EDIT]
My questions are:
Are there other alternative approaches? Could reading chunk-by-chunk directly be better?
Or is there a smart way to programmatically free up the page cache before RAM fills up when memory mapping? I just think it might be better if we could page data out of memory before RAM is full.
Is there a way to read the data file column by column?
Thank you very much in advance!
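A minimal sketch of the chunk-by-chunk / random-subset idea raised in the questions above, assuming the text matrix has first been converted once into a fixed-width binary file of doubles stored row-major; the file name, dimensions, and ranges are illustrative placeholders.

    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <vector>

    // Read rows [r0, r1) and columns [c0, c1) without touching the rest of the file.
    std::vector<double> readBlock(const char* path,
                                  std::size_t nCols,
                                  std::size_t r0, std::size_t r1,
                                  std::size_t c0, std::size_t c1) {
        std::ifstream in(path, std::ios::binary);
        std::vector<double> block((r1 - r0) * (c1 - c0));
        std::vector<double>::iterator dst = block.begin();
        for (std::size_t r = r0; r < r1; ++r) {
            // Seek straight to the first wanted column of this row.
            std::uint64_t offset = (static_cast<std::uint64_t>(r) * nCols + c0)
                                   * sizeof(double);
            in.seekg(static_cast<std::streamoff>(offset));
            in.read(reinterpret_cast<char*>(&*dst),
                    static_cast<std::streamsize>((c1 - c0) * sizeof(double)));
            dst += static_cast<std::ptrdiff_t>(c1 - c0);
        }
        return block;
    }

    int main() {
        // e.g. rows 1000..1500 and columns 1000..5000 of a 2500 x 1000000 matrix
        std::vector<double> sub = readBlock("matrix.bin", 1000000,
                                            1000, 1500, 1000, 5000);
        (void)sub;   // use the block, e.g. for the column-wise cross product
    }

seekg jumps straight to the wanted bytes of each row, so only the requested block is ever read from disk and RAM usage is bounded by the block size, not the file size.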

Memory Mapped Files and Max File Size

I am using boost::iostreams::mapped_file_source to create memory-mapped files, more than 1024 of them. To my surprise, when I have created around 1024 memory-mapped files my program throws an exception stating there are too many files open. After some research I found that Ubuntu uses a default limit of 1024 open file descriptors per process (found via ulimit -n). Unfortunately, I need all of the files to be open at the same time. Does anyone know a way around this? Is it possible to make the files not count towards the limit somehow? I was thinking of trying to combine them into a single file; however, I would like to avoid that if possible due to performance concerns. And I would also like not to modify the operating system by changing that value. Any pointers in the right direction are much appreciated!
Why do you need many mapped files open? That seems very inefficient. Maybe you can map (regions of) a single large file?
Q. I was thinking of trying to combine them into a single file; however, I would like to avoid that if possible due to the performance
That's ... nonsense. The performance could basically only increase.
One particular thing to keep in mind is to align the different regions inside your "big mapped file" to a multiple of your memory page/disk block size. 4k should be a good starting point for this coarse alignment.
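A minimal sketch of that suggestion, assuming Boost.Iostreams is available: the combined file is mapped once, and each former small file becomes a 4k-aligned region inside it, so only one mapping and one file descriptor are used no matter how many logical files there are. The file name and region size are illustrative placeholders.

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <cstddef>
    #include <iostream>

    int main() {
        const std::size_t alignment  = 4096;   // page/block alignment, as suggested
        const std::size_t regionSize = ((100000 + alignment - 1) / alignment)
                                       * alignment;   // round each region up to 4k

        // One mapping, one file descriptor, regardless of how many regions it holds.
        boost::iostreams::mapped_file_source file("combined.dat");
        const char* base = file.data();

        std::size_t regionCount = file.size() / regionSize;
        for (std::size_t i = 0; i < regionCount; ++i) {
            const char* region = base + i * regionSize;   // start of logical file i
            std::cout << "region " << i << " first byte: "
                      << static_cast<int>(region[0]) << '\n';
        }
    }

Offsets inside the mapping are plain pointer arithmetic, so no extra descriptors are consumed per logical file.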

reading big data in C++

I'm using C++ to read large files with over 30,000 lines and 3,000 columns (a 30,000 x 3,000 matrix). I'm using a 2D vector to push the read data into. But I need to do this process a couple of times. Is there any way to optimize the reading process?
I will give you some ideas, but not an exact solution, because I do not know the full details of your system.
If you have such a big data file and only some of the data changes between readings, try using a database approach.
For performance, you can use concurrent file reading (read the same file part by part using multiple threads; see the sketch below).
If you need to process the data as well, use separate thread(s) for the processing, possibly linked to the readers by a queue or parallel queues.
If your data length is fixed (such as fixed-length numbers) and you know which locations changed, try to read only the changed data instead of reading and processing the whole file again and again.
If none of the above helps, use memory mapping. If you are looking for portability, Boost Memory-Mapped Files will help reduce your work.
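A minimal sketch of the concurrent part-by-part reading mentioned above, assuming the matrix has been stored once as a flat binary file of doubles; the file name and thread count are illustrative placeholders. Each thread opens its own stream, seeks to its slice, and reads it independently.

    #include <cstddef>
    #include <fstream>
    #include <thread>
    #include <vector>

    int main() {
        const char* path = "matrix.bin";
        const unsigned nThreads = 4;

        // Determine the file size once.
        std::ifstream probe(path, std::ios::binary | std::ios::ate);
        const std::size_t totalBytes = static_cast<std::size_t>(probe.tellg());
        const std::size_t nValues = totalBytes / sizeof(double);

        std::vector<double> data(nValues);
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < nThreads; ++t) {
            workers.emplace_back([&, t] {
                // Each thread gets its own stream so seeks don't interfere.
                std::ifstream in(path, std::ios::binary);
                std::size_t begin = nValues * t / nThreads;
                std::size_t end   = nValues * (t + 1) / nThreads;
                in.seekg(static_cast<std::streamoff>(begin * sizeof(double)));
                in.read(reinterpret_cast<char*>(data.data() + begin),
                        static_cast<std::streamsize>((end - begin) * sizeof(double)));
            });
        }
        for (auto& w : workers) w.join();
    }

The slices are non-overlapping, so the threads write into disjoint parts of the shared vector and no locking is needed.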
Memory mapping is OK here, since there are only read operations.