I have a 35 MB file with ~1 million lines. I loaded every line into APC as a key (10-12 characters) with an integer as the value. When I fetch them, about 80% of the entries don't exist; only the lines from the end of the file are there. Is it possible that APC's memory got full and the first entries were overwritten by the new ones?
apc.shm_size: 512M
512 MB should be enough to store the data from a 35 MB file, shouldn't it?
Is there any visualization tool (for WAMP) to see the list of entries and the memory usage of APC?
EDIT
I made some modifications to the code to see when the first entry would be overwritten. It happened at about the 220,000th row. I then increased apc.shm_size to 1024M, and the number of stored rows roughly doubled. Why do I need more than 2 GB to store 35 MB of data in memory?
If I increase apc.shm_size from 1024M to 1025M, Apache won't start.
I'm UNLOADing tables from Redshift to S3 for backup. So I am checking to make sure the files are complete if we need them again.
I just did UNLOAD on a table that has size = 1,056 according to:
SELECT "table", size, tbl_rows
FROM svv_table_info;
According to the documentation, the size is "in 1 MB data blocks", so this table is using 1,056 MB. But after copying it to S3, the file size is 154 MB (viewed in the AWS console).
I copied it back to Redshift and all the rows are there, so this has to do with the "1 MB data blocks". This is related to how it's saved in the file system, yes?
Can someone please explain? Thank you.
So you're asking why the SVV_TABLE_INFO view claims that the table consumes 1 GB, but when you dump it to disk the result is only 154 MB?
There are two primary reasons. The first is that you're actively updating the table but not vacuuming it. When a row is updated or deleted, Redshift actually appends a new row (yes, stored as columns) and tombstones the old row. To reclaim this space, you have to regularly vacuum the table. While Redshift will do some vacuuming in the background, this may not be enough, or it may not have happened at the time you're looking.
The second reason is that there's overhead required to store table data. Each column in a table is stored as a list of 1 MB blocks, at least one block per slice (and there are multiple slices per node). Depending on the size of your cluster and the column data type, this may lead to a lot of wasted space.
For example, if you're storing 32-bit integers, one 1 MB block can hold 262,144 of them, so about 4 blocks are needed to store 1,000,000 values (which is probably close to the number of rows in your table). But if you have a 4-node cluster with 2 slices per node (i.e., dc2.large), you'll actually need at least 8 blocks, because the column is partitioned across all slices.
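As a rough illustration of that block arithmetic (deliberately simplified; real Redshift accounting carries extra per-table and compression overhead), here is a small sketch, written in C++ only for convenience, using the numbers from this example:

#include <cstdint>
#include <cstdio>

int main() {
    // Numbers from the example above: 1,000,000 rows of a 4-byte integer
    // column on a 4-node dc2.large cluster (2 slices per node = 8 slices).
    const std::int64_t block_bytes = 1024 * 1024;  // one 1 MB block
    const std::int64_t rows        = 1000000;
    const std::int64_t value_bytes = 4;
    const int          slices      = 8;

    std::int64_t rows_per_slice  = (rows + slices - 1) / slices;       // 125,000
    std::int64_t bytes_per_slice = rows_per_slice * value_bytes;       // 500,000
    std::int64_t blocks_per_slice =
        (bytes_per_slice + block_bytes - 1) / block_bytes;             // at least 1

    std::printf("blocks for this column: %lld (%lld per slice x %d slices)\n",
                static_cast<long long>(blocks_per_slice * slices),     // 8 blocks
                static_cast<long long>(blocks_per_slice), slices);
}

Even though only about 4 MB of raw integer data is involved, the column still occupies 8 blocks (8 MB), because every slice needs its own block.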
You can see the number of blocks that each column uses in STV_BLOCKLIST.
I have a text file which looks like below:
0.001 ETH Rx 1 1 0 B45678810000000000000000AF0000 555
0.002 ETH Rx 1 1 0 B45678810000000000000000AF 23
0.003 ETH Rx 1 1 0 B45678810000000000000000AF156500
0.004 ETH Rx 1 1 0 B45678810000000000000000AF00000000635254
I need a way to read this file, form a structure, and send it to a client application.
Currently, I do this with the help of a circular queue from Boost.
The need here is to access different data at different times.
Ex: If I want to access the data at 0.03 sec while I am currently at 100 sec, how can I do this in the best way, instead of keeping track of the file pointer or loading the whole file into memory, which causes a performance bottleneck? (Consider that I have a file of about 2 GB with the above kind of data.)
Usually the best practice for handling large files depends on the platform architecture (x86/x64) and the OS (Windows/Linux, etc.).
Since you mentioned Boost, have you considered using a Boost memory-mapped file?
Boost Memory Mapped File
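For instance, here is a minimal sketch using boost::iostreams::mapped_file_source; the file name and the line handling are placeholders. The OS pages the file in on demand, so even a 2 GB file is not copied into memory up front:

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstring>
#include <iostream>

int main() {
    // Map the capture file into the address space; nothing is read eagerly.
    boost::iostreams::mapped_file_source file("capture.log");  // placeholder name

    const char* begin = file.data();
    const char* end   = begin + file.size();

    // Example access pattern: walk the mapped region line by line.
    std::size_t lines = 0;
    for (const char* p = begin; p < end; ) {
        const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!nl) nl = end;
        ++lines;               // parse / hand the line [p, nl) to the client here
        p = nl + 1;
    }
    std::cout << lines << " lines mapped from " << file.size() << " bytes\n";
}

Random access then becomes pointer arithmetic into the mapped region instead of seek-and-read calls.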
It all depends on:
a. how frequently the data is accessed
b. the data access pattern
Splitting the file
If you only need to access the data once in a while, then this 2 GB log design is fine. If not, the logger can be tuned to generate logs at periodic intervals, or a separate piece of logic can split the 2 GB file into smaller files as needed. Fetching the right ranged log file, reading it, and sorting out the needed lines then becomes easier, since far fewer bytes have to be read.
Cache
For very frequent data access, maintaining a cache is a nice solution for faster responses, though as you said it has its own bottlenecks. The size and layout of the cache depend entirely on (b), the data access pattern. Also, the larger the cache, the slower the response can become, so it should be kept at an optimum.
Database
If the search pattern is unordered or grows dynamically with usage, then a database will work. Again, it will not give responses as fast as a small cache.
A mix of a database, with a table organization tailored to the type of query, plus a smaller cache layer will give the optimum result.
Here is the solution I found:
Used circular buffers (Boost lock-free buffers) to parse the file and to hold the structured form of each line (see the sketch after this list).
Used separate threads:
One continuously parses the file and pushes lines to the lock-free queue.
One continuously reads from the buffer, processes each line, forms a structure, and pushes it to another queue.
Whenever the user needs random data based on time, I move the file pointer to the particular line and read only that line.
Both threads have mutex/wait mechanisms to stop parsing once the predefined buffer limit is reached.
The user can get data at any time, and there is no need to store the complete file contents. As soon as a frame has been read, I delete it from the queue, so the file size doesn't matter. The parallel threads that fill the buffers mean I don't spend time re-reading the file each time.
If I want to move to another line: move the file pointer, wipe the existing data, and start the threads again.
Note:
The only issue now is moving the file pointer to a particular line.
I need to parse line by line until I reach that point.
If there is a way to move the file pointer directly to the required line, it would be helpful. Binary search or any efficient search algorithm could be used to get what I want.
I would appreciate it if anybody could give a solution for this new issue!
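For the note above, here is a hedged sketch of one possible approach, assuming the timestamps in the first column are increasing (as in the sample) so that binary search applies; all names are illustrative. One sequential pass records the byte offset of every line, and afterwards any time can be reached with std::lower_bound plus a single seek:

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

struct LineRef {
    double         timestamp;   // first column of the line
    std::streamoff offset;      // byte position where the line starts
};

// One-time pass over the file: remember where every line begins.
std::vector<LineRef> build_index(const std::string& path) {
    std::vector<LineRef> index;
    std::ifstream in(path, std::ios::binary);
    std::string line;
    std::streamoff pos = 0;
    while (std::getline(in, line)) {
        if (!line.empty())
            index.push_back({std::stod(line.substr(0, line.find(' '))), pos});
        pos = in.tellg();
    }
    return index;
}

// Seek to the first line whose timestamp is >= t and return its text.
std::string read_at(const std::string& path,
                    const std::vector<LineRef>& index, double t) {
    auto it = std::lower_bound(index.begin(), index.end(), t,
        [](const LineRef& l, double v) { return l.timestamp < v; });
    if (it == index.end()) return {};
    std::ifstream in(path, std::ios::binary);
    in.seekg(it->offset);
    std::string line;
    std::getline(in, line);
    return line;
}

The index costs one full read of the file, but after that a jump to any time is a binary search over the in-memory index and a single read of one line.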
I want to understand how a MapReduce job processes data blocks. As per my understanding, one mapper is invoked for each data block.
Let me put my query with an example.
Suppose I have a long text file stored in HDFS in 4 blocks (64 MB each) on 4 nodes (let's forget about replication here).
In this case, 4 map tasks will be invoked, one on each machine (all 4 data nodes/machines).
Here is the question: this splitting may result in a partial record being stored across two blocks. For example, the last record may be stored partially in block 1 (at the end) and partially in block 2.
In this case, how does the MapReduce program ensure that the complete record gets processed?
I hope I have been able to put my query clearly.
Please read this article at http://www.hadoopinrealworld.com/inputsplit-vs-block/
This is an excerpt from the same article I posted above.
A block is a hard division of data at the block size. So if the block size in the cluster is 128 MB, each block of the dataset will be 128 MB, except for the last block, which can be smaller if the file size is not exactly divisible by the block size. So a block is a hard cut at the block size, and blocks can end even before a logical record ends.
Consider that the block size in your cluster is 128 MB and each logical record in your file is about 100 MB (yes... huge records).
So the first record fits in block 1 with no problem, since the record size of 100 MB is well within the block size of 128 MB. However, the 2nd record cannot fit entirely in that block, so record 2 will start in block 1 and end in block 2.
If you assign a mapper to block 1, in this case, the mapper cannot process record 2, because block 1 does not have the complete record 2. That is exactly the problem an InputSplit solves. In this case, InputSplit 1 will have both record 1 and record 2. InputSplit 2 does not start with record 2, since record 2 is already included in InputSplit 1, so InputSplit 2 will have only record 3. As you can see, record 3 is divided between blocks 2 and 3, but InputSplit 2 will still have the whole of record 3.
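As a quick check of that example (plain arithmetic, shown as code only for convenience), the block each record starts and ends in can be computed directly; block numbers below are 1-based to match the text:

#include <cstdint>
#include <cstdio>

int main() {
    const std::int64_t block = 128LL * 1024 * 1024;   // 128 MB block size
    const std::int64_t rec   = 100LL * 1024 * 1024;   // 100 MB logical record
    for (int i = 0; i < 3; ++i) {
        std::int64_t start = i * rec;                 // first byte of record i+1
        std::int64_t end   = start + rec - 1;         // last byte of record i+1
        std::printf("record %d: starts in block %lld, ends in block %lld\n",
                    i + 1,
                    static_cast<long long>(start / block + 1),
                    static_cast<long long>(end / block + 1));
    }
}

It prints that record 1 sits entirely in block 1, record 2 spans blocks 1-2, and record 3 spans blocks 2-3, which is exactly the situation the InputSplits resolve.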
Blocks are physical chunks of data stored on disk, whereas an InputSplit is not a physical chunk of data. An InputSplit is a Java class with pointers to start and end locations in blocks. So when a mapper tries to read the data, it knows exactly where to start reading and where to stop. The start of an InputSplit can be in one block and its end in another.
Happy Hadooping
I have a file as follows:
The file consists of 2 parts: header and data.
The data part is separated into equally sized pages. Each page holds data for a specific metric, and multiple pages (which need not be consecutive) may be needed to hold the data for a single metric. Each page consists of a page header and a page body. The page header has a field called "next page", which is the index of the next page holding data for the same metric. The page body holds the real data. All pages have the same fixed size: a 20-byte header and an 800-byte body (if the amount of data is less than 800 bytes, the rest is filled with zeros).
The header part consists of 20,000 elements, each holding information about a specific metric (point 1 -> point 20000). An element has a field called "first page", which is the index of the first page holding data for that metric.
The file can be up to 10 GB.
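A hedged sketch of that layout as C++ structs; the field names, widths, and the exact packing of the 20-byte page header are assumptions, only the overall sizes (20,000 header elements, 20-byte page header, 800-byte page body) come from the description above:

#include <cstdint>

#pragma pack(push, 1)
struct HeaderElement {        // one of the 20,000 metric entries in the header part
    char     name[16];        // hypothetical metric identifier
    uint32_t first_page;      // index of the first page holding this metric's data
};

struct PageHeader {           // fixed 20 bytes
    uint32_t next_page;       // index of the next page for the same metric
    uint32_t bytes_used;      // hypothetical: how much of the body is real data
    uint8_t  reserved[12];    // filler up to the fixed 20-byte header size
};

struct Page {
    PageHeader header;
    uint8_t    body[800];     // fixed 800-byte body, zero-padded when short
};
#pragma pack(pop)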
Requirement: re-order the data in the file in the shortest time possible, so that the pages holding data for a single metric are consecutive, ordered from metric 1 to metric 20000 alphabetically (the header part must be updated accordingly).
An apparent approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially when reading the data from the file.
Are there any more efficient ways?
One possible solution is to create an index from the file, containing the page number and the page's metric, which is what you need to sort on. Create this index as an array, so that the first entry (index 0) corresponds to the first page, the second entry (index 1) to the second page, and so on.
Then you sort the index by the metric.
Once sorted, you end up with a new array whose first, second, etc. entries are in the desired order, and you read the input file, writing to the output file in the order of the sorted index.
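A hedged sketch of that approach, assuming the page/header layout sketched in the question; how a page's metric is discovered (e.g., by walking each header element's next-page chain while building the index) is left as a comment:

#include <algorithm>
#include <cstdint>
#include <vector>

struct IndexEntry {
    uint32_t page;     // page number in the input file (entry 0 -> page 0, ...)
    uint32_t metric;   // metric this page belongs to, filled while scanning headers
};

void sort_index(std::vector<IndexEntry>& index) {
    // stable_sort keeps pages of one metric in the order they were indexed,
    // which is chain order if the index was built by following next_page.
    std::stable_sort(index.begin(), index.end(),
        [](const IndexEntry& a, const IndexEntry& b) {
            return a.metric < b.metric;
        });
    // Afterwards: for each entry, copy page `entry.page` from the input file
    // to the next consecutive slot of the output file, and record the new
    // first-page index in the rewritten header part.
}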
An apparent approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially when reading the data from the file.
Are there any more efficient ways?
Yes. After you get a working solution, measure its efficiency, then decide which parts you wish to optimize. What and how you optimize will depend greatly on the results you get here (i.e., on where your bottlenecks are).
A few generic things to consider:
if you have one set of steps that reads the data for a single metric and moves it to the output, you should be able to parallelize it (have 20 sets of steps instead of one).
a 10 GB file will take a while to process regardless of what hardware you run your code on (conceivably you could run it on a supercomputer, but I am ignoring that case). You / your client may accept a slower solution if it displays its progress / shows a progress bar.
do not use string comparisons for sorting;
Edit (addressing comment)
Consider performing the read as follows (see the sketch below):
create a list of block offsets for the blocks you want to read
create a pool of worker threads of fixed size (for example, 10 workers)
each idle worker receives the file name and a block offset, creates a std::ifstream instance on the file, reads the block, and returns it to a receiving object (and then requests another block number, if any are left)
the pages that have been read should be passed to a central structure that manages/stores them
Also consider managing the memory for the blocks separately (for example, preemptively allocate chunks of multiple blocks when you know how many blocks will be read).
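A minimal sketch of that worker pool, assuming each worker just pulls the next offset from a shared counter; the collect callback is a stand-in for the receiving object and must be thread-safe:

#include <atomic>
#include <cstdint>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

constexpr std::size_t kBlockSize = 820;   // 20-byte page header + 800-byte body

void read_blocks(const std::string& path,
                 const std::vector<std::uint64_t>& offsets,
                 void (*collect)(std::uint64_t, const std::vector<char>&)) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < 10; ++w) {                      // fixed pool of 10 workers
        workers.emplace_back([&] {
            std::ifstream in(path, std::ios::binary);   // one stream per worker
            std::vector<char> block(kBlockSize);
            for (;;) {
                std::size_t i = next.fetch_add(1);      // claim the next block
                if (i >= offsets.size()) break;
                in.seekg(static_cast<std::streamoff>(offsets[i]));
                in.read(block.data(), static_cast<std::streamsize>(block.size()));
                collect(offsets[i], block);             // hand off to the receiver
            }
        });
    }
    for (auto& t : workers) t.join();
}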
I first read the header part, then sort the metrics in alphabetical order. For each metric in the sorted list, I read all of its data from the input file and write it to the output file. To remove the bottleneck in the reading step, I used memory mapping. The results showed that with memory mapping, the execution time for a 5 GB input file was reduced 5-6 times compared with not using memory mapping. This temporarily solves my problem; however, I will also consider the suggestions of #utnapistim.
The number of map() tasks spawned is equal to the number of 64 MB blocks of input data. Suppose we have 2 input files of 1 MB each; both files will be stored in a single block. But when I run my MR program with 1 namenode and 2 job nodes, I see 2 map() tasks spawned, one for each file. So is this because the system tried to split the job between the 2 nodes, i.e.,
number of map() tasks spawned = number of 64 MB blocks of input data * number of job nodes?
Also, in the MapReduce tutorial it is written that for a 10 TB file with a block size of 128 MB, 82,000 maps will be spawned. However, according to the logic that the number of maps depends only on block size, 78,125 maps should be spawned (10 TB / 128 MB). I don't understand how the few extra maps come about. It would be great if anyone could share their thoughts on this. Thanks. :)
By default, one mapper is spawned per input file, and if the size of an input file is greater than the split size (which is normally kept the same as the block size), then the number of mappers for that file will be ceil(file size / split size).
Now say you have 5 input files and the split size is kept at 64 MB:
file1 - 10 MB
file2 - 30 MB
file3 - 50 MB
file4 - 100 MB
file5 - 1500 MB
number of mappers launched:
file1 - 1
file2 - 1
file3 - 1
file4 - 2
file5 - 24
total mappers - 29
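That per-file arithmetic is just ceil(file size / split size); here is a tiny sketch (code only for convenience) that reproduces the table above:

#include <cstdio>

int main() {
    const int split_mb   = 64;                          // split size from the answer
    const int files_mb[] = {10, 30, 50, 100, 1500};     // the five input files
    int total = 0;
    for (int size : files_mb) {
        int mappers = (size + split_mb - 1) / split_mb; // ceil division
        total += mappers;
        std::printf("%4d MB -> %d mapper(s)\n", size, mappers);
    }
    std::printf("total mappers: %d\n", total);          // prints 29
}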
Additionally, the input split size and block size are not always honored. If an input file is gzip-compressed, it is not splittable, so if one of the gzip files is 1500 MB, it will not be split. It is better to use block compression with Snappy or LZO along with the sequence file format.
Also, the input split size is not used if the input is an HBase table. For an HBase table, the only splitting done is to maintain a correct region size for the table. If the table is not properly distributed, manually split it into multiple regions.
The number of mappers depends on just one thing: the number of InputSplits created by the InputFormat you are using (the default is TextInputFormat, whose records are delimited by \n). It does not depend on the number of nodes, or the file, or the block size (64 MB or whatever). It's very good if the split is equal to the block, but this is just an ideal situation and cannot always be guaranteed. The MapReduce framework tries its best to optimize the process, and in that process things happen such as creating just 1 mapper for an entire file (if the file size is less than the block size). Another optimization can be to create fewer mappers than the number of splits. For example, if your file has 20 lines and you are using TextInputFormat, you might think you'd get 20 mappers (reasoning that the number of mappers equals the number of splits and that TextInputFormat splits on \n), but this does not happen: there would be unwanted overhead in creating 20 mappers for such a small file.
And if the size of a split is greater than the block size, the remaining data is pulled in from the other, remote block on a different machine in order to be processed.
About the MapReduce tutorial:
If you have 10 TB of data, then:
(10 * 1024 * 1024) MB / 128 MB = 81,920 mappers, which is almost 82,000. (The 78,125 figure comes from treating 10 TB as 10,000,000 MB; the tutorial counts 1 TB as 1024 * 1024 MB.)
Hope this clears some of the things up.