MapReduce basics

I have a text file of 300 MB and a block size of 128 MB.
So a total of 3 blocks (128 + 128 + 44 MB) would be created.
Correct me if I'm wrong - for MapReduce, the default input split size is the same as the block size, i.e. 128 MB, and it can be configured.
Now the record reader will read through each split and create key-value pairs, where the key is the byte offset and the value is a single line (TextInputFormat).
My question is: if my block ends in the middle of a line and the rest of the line sits in another block, will the remainder of the line be fetched from the other node, or will the remaining part be processed on that other node?
Also, how will the second node understand that its first line has already been taken for processing and that it doesn't need to process it again?
E.g.
This is stackoverflow.This (end of block 1/input split) is a map reduce example. (end of line)

3 mappers will be generated for this scenario. Hadoop uses a pointer at the end of every block which indicates the location of the next block, so mapper 1 will process the complete line, even the part that falls in block 2, and mapper 2 will start processing by skipping that partial line.
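To make that boundary rule concrete, here is a minimal C++ sketch (Hadoop itself is Java; the function and the in-memory string are made up purely for illustration) of how a line record reader can guarantee that each line is processed exactly once:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Each split processes the lines that *start* inside it. A non-first
    // split skips its (possibly partial) first line, because the previous
    // split's reader already consumed it, and every split is allowed to
    // read past its own end to finish the last line it started.
    std::vector<std::string> linesForSplit(const std::string& file,
                                           std::size_t splitStart,
                                           std::size_t splitEnd) {
        std::size_t pos = splitStart;
        if (splitStart != 0) {                       // not the first split:
            pos = file.find('\n', splitStart);       // skip the partial line
            if (pos == std::string::npos) return {};
            ++pos;
        }
        std::vector<std::string> lines;
        while (pos < file.size() && pos <= splitEnd) {
            std::size_t nl = file.find('\n', pos);   // may cross splitEnd
            if (nl == std::string::npos) nl = file.size();
            lines.push_back(file.substr(pos, nl - pos));
            pos = nl + 1;
        }
        return lines;
    }

With the example sentence above, the reader for block 1 would emit the whole line "This is stackoverflow.This is a map reduce example.", and the reader for block 2 would skip straight past it.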

Related

Rope data structure & Lines

I'm using a Rope to store a large amount (GB's) of text. The text can be tens of millions of lines long.
The rope itself is extremely fast inserting at any position, and is also fast getting a character at a specific position.
However, how would I get where a specific line (\n for this case) starts? For example, how would I get where line 15 starts? There are a couple options that I can see.
1. Don't have any extra data. Whenever you want, say, the 15th line, you iterate through all the characters in the Rope, find the newlines, and when you reach the 15th newline, you stop.
2. Store the start and length of each line in a vector. So you would have your Rope data structure containing all the characters, and then a separate std::vector<line>. The line structure would just consist of 2 fields: start and length. Start represents where the line starts inside of the Rope, and length is the length of the line. To get where the 15th line starts, just do lines[14].start.
Problems:
#1 is a horrible way to do it. It's extremely slow because you have to go through all of the characters.
#2 is also not good. Although getting where a line starts is extremely fast (O(1)), every time you insert a line, you have to shift all the line entries that come after it, which is O(N). Also, storing this means that every line takes up an extra 16 bytes of data (assuming start and length are 8 bytes each). That means if you have 13,000,000 lines, it would take up about 200 MB of extra memory. You could use a linked list, but that just makes access slow.
Is there any better & more efficient way of storing the line positions for quick access & insert? (Preferably O(log(n)) for inserting & accessing lines)
I was thinking of using a BST, and more specifically a RB-Tree, but I'm not entirely sure how that would work with this. I saw VSCode do this but with a PieceTable instead.
Any help would be greatly appreciated.
EDIT:
The answer that #interjay provided seems good, but how would I handle CRLF if the CR and LF were split between 2 leaf nodes?
I also noticed ropey, which is a rust library for the Rope. I was wondering if there was something similar but for C++.
In each rope node (both leaves and internal nodes), in addition to holding the number of characters in that subtree, you can also put the total number of newlines contained in the subtree.
Then finding a specific newline will work exactly the same way as finding the node holding a specific character index. You would look at the "number of newlines" field instead of the "number of characters" field.
All rope operations will work mostly the same. When creating a new internal node, you just need to sum its children's newline counts. The complexity of all operations stays the same.
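To illustrate the idea, here is a minimal C++ sketch (the node layout and function names are my own, not from any particular library) of a rope-like tree whose nodes cache both character and newline counts, so finding where line k starts is the same O(log n) descent as finding character k:

    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <string>

    // Leaves hold text chunks; every node caches the number of characters
    // and newlines contained in its whole subtree.
    struct Node {
        std::string chunk;                  // non-empty only for leaves
        std::unique_ptr<Node> left, right;  // children of internal nodes
        std::size_t chars = 0;              // characters in this subtree
        std::size_t newlines = 0;           // '\n' characters in this subtree
    };

    std::unique_ptr<Node> makeLeaf(std::string text) {
        auto n = std::make_unique<Node>();
        n->chars = text.size();
        n->newlines = static_cast<std::size_t>(
            std::count(text.begin(), text.end(), '\n'));
        n->chunk = std::move(text);
        return n;
    }

    std::unique_ptr<Node> concat(std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
        auto n = std::make_unique<Node>();
        n->chars = l->chars + r->chars;          // sum the children's totals,
        n->newlines = l->newlines + r->newlines; // exactly like a normal rope
        n->left = std::move(l);
        n->right = std::move(r);
        return n;
    }

    // Character index at which 0-based line `lineIdx` starts: descend by the
    // newline counters exactly as you would descend by character counters.
    std::size_t lineStart(const Node* n, std::size_t lineIdx) {
        if (lineIdx == 0) return 0;
        std::size_t offset = 0;
        while (n->left) {                        // internal node
            if (lineIdx <= n->left->newlines) {
                n = n->left.get();               // target newline is on the left
            } else {
                lineIdx -= n->left->newlines;    // skip the whole left subtree
                offset += n->left->chars;
                n = n->right.get();
            }
        }
        std::size_t pos = 0;                     // scan the small leaf chunk
        while (lineIdx-- > 0)
            pos = n->chunk.find('\n', pos) + 1;
        return offset + pos;
    }

Insertion stays cheap because only the counters on the path from the modified leaf back to the root need updating, which is O(log n) in a balanced rope.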

Acquiring the first column of a CSV file without reading entire rows in C++

Let's assume that we have a header-less CSV file (I'm including a header row for clarification but the actual file doesn't contain it):
ID,BookTitle,Author,Price
110,book1,author1,price1
178,book2,author2,price2
917,book3,author3,price3
How can I acquire the ID column without having to read whole rows of data into memory? I.e., read ID 110 and add it to a vector, go to the next row (line), read ID 178 and add it to the vector, and so on.
You cannot. Files don't have rows and columns. The content is just characters and a \n denotes a line break. Hence, you cannot know where a line starts or ends without reading characters until you find a line break.
The situation is different when lines have a fixed width. Then you can skip ahead and start reading the next line.
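Just to illustrate the fixed-width case (the record width, file name, and function are invented for the example and don't match the question's file), you could then compute each row's offset and seek straight to its first field:

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // Assumes every row occupies exactly RECORD_WIDTH bytes including '\n',
    // so row i starts at byte i * RECORD_WIDTH.
    constexpr std::streamoff RECORD_WIDTH = 32;

    std::vector<std::string> readIdsFixedWidth(const std::string& path, std::size_t rows) {
        std::ifstream in(path, std::ios::binary);
        std::vector<std::string> ids;
        for (std::size_t i = 0; i < rows; ++i) {
            in.seekg(static_cast<std::streamoff>(i) * RECORD_WIDTH); // jump to row i
            std::string id;
            std::getline(in, id, ',');          // read only up to the first comma
            ids.push_back(id);
        }
        return ids;
    }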
For any future readers who might stumble upon this question:
#largest_prime's answer is the reasonable one. If your row data isn't that large (which is the case 99.99% of the time), it makes perfect sense to read it all, acquire the ID data, and finally discard the rest of the row data.
However, in the 0.01% of cases where a row might contain big data (such as a whole book's text), #z80crew's solution might work out nicely.
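For reference, a minimal sketch of that common case (names are illustrative): read each row as a line, keep the text before the first comma, and let the rest of the row go when the line buffer is reused.

    #include <fstream>
    #include <string>
    #include <vector>

    std::vector<std::string> readIds(const std::string& path) {
        std::ifstream in(path);
        std::vector<std::string> ids;
        std::string line;
        while (std::getline(in, line)) {
            // Keep only the first field; the rest of the row is discarded.
            ids.push_back(line.substr(0, line.find(',')));
        }
        return ids;
    }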

Namenode file division into blocks

I am aware that a file is split into blocks by the name node when it is stored in HDFS. But when the file is divided, there is a chance that a block will contain only part of a line. Is my understanding correct? So if I have any map operation that needs to read each line, will some part of a line be missed by the mapper?
Thanks!!
In addition to #RojoSam's answer: the SPLIT_SLOP parameter is used when reading a single file's data from other blocks. SPLIT_SLOP lets a Hadoop job read a percentage of the data from a remote host when the data is not completely available on a single datanode.
Yes, it is possible for a line to be split across two blocks. The reader used by the mapper reads, at the end of its block, the first line of the next block and processes it. If it isn't the first block, the reader always skips the first line. At least for text files; other formats work differently.
You won't miss any part of a line.

Algorithm for writing limited number of lines to text file

I have a program where I need to write text lines to a log file very frequently. I would like to limit the number of lines in the log file to 1000. When I write lines to the file, it should append them normally. Once the file reaches 1000 lines, I'd like to get rid of the first line and then append the new one. Does anyone know if there is a way to do this without rewriting the entire file each time?
Generally it's a little bit better for a case like this to remove more than one line at a time from the beginning.
That is, if your limit is 1000 lines, and you hit 1000 lines, delete the first 300 or so, and then resume writing. That way, you're not performing the delete operation with every single line written thereafter, only every 300 times. If you need to persist 1000 lines, then instead keep up to 1300 and delete 300 when 1300 is reached.
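A rough sketch of that batching idea (the class name and the 1000/1300 constants just mirror the numbers above; error handling is omitted): appends stay cheap, and the rewrite only happens once every ~300 lines.

    #include <cstddef>
    #include <deque>
    #include <fstream>
    #include <string>

    constexpr std::size_t kKeep  = 1000;   // lines the log should retain
    constexpr std::size_t kLimit = 1300;   // ceiling that triggers the trim

    class TrimmingLog {
    public:
        explicit TrimmingLog(std::string path) : path_(std::move(path)) {
            std::ifstream in(path_);
            std::string l;
            while (std::getline(in, l)) ++count_;   // count existing lines once
        }

        void append(const std::string& line) {
            { std::ofstream(path_, std::ios::app) << line << '\n'; }  // cheap append
            if (++count_ < kLimit) return;

            std::deque<std::string> lines;          // ceiling reached: trim once
            std::ifstream in(path_);
            std::string l;
            while (std::getline(in, l)) lines.push_back(l);
            while (lines.size() > kKeep) lines.pop_front();   // drop the oldest ~300
            std::ofstream out(path_, std::ios::trunc);
            for (const auto& s : lines) out << s << '\n';
            count_ = lines.size();
        }

    private:
        std::string path_;
        std::size_t count_ = 0;
    };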
Files have to be aligned to the filesystem's cluster size, so no, there's no way around it: you can append a line to a file, but you can't delete the first line without rewriting the file.
You can use two files, alternating between them.
Or use some buffer in memory and flush it periodically.
I think you still have to scan the file to find out how many lines it contains at the moment. In that case, you can put the lines into some sort of buffer that you can easily add to and delete from.
Then you do your logging, and when you are done, you could "re-write" the file from the buffer (or only the last 1000 lines).
Other alternatives are discussed above.
And yeah, try to avoid deleting line-by-line. Generally, it is a costly operation.
I've found some similar topics here and on CodeProject:
Small logger class;
Flexible logger class using standard streams in C++
http://www.codeproject.com/Articles/584794/Simple-logger-for-Cplusplus
Hope you find them useful :)
Any time you want to log, you can open the file, read your write index, jump to the position, and write the fixed-width log entry. When your index hits your upper threshold, simply set it back to 0.
There are a lot of warnings with this, though - first is that each proper log entry (assuming you close the file in between) will require an open, a read, a seek, a write, a seek, a write and a close - to find your index, go to it, write the new entry, then update your index. You also have the inherent issues of writing a fixed-size data element. Also, a human reader will depend on your content to know where the "beginning" of the file is. Most people expect "line 1" to be the first line.
I'm a much bigger advocate for simply having a few files and "rolling" them, so that each file on its own is coherent, but if you want just one file with a fixed number of lines, the circular buffer idea can work.
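Here is a hedged sketch of that circular idea (the slot layout, widths, and function name are all invented, and error handling is left out): slot 0 of the file holds the write index, the remaining slots are fixed-width entries, and writing wraps around at the threshold.

    #include <cstddef>
    #include <fstream>
    #include <iomanip>
    #include <sstream>
    #include <string>

    constexpr std::size_t kMaxLines  = 1000;  // log entries before wrapping
    constexpr std::size_t kLineWidth = 120;   // bytes per slot incl. '\n'

    void circularLog(const std::string& path, const std::string& message) {
        std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
        if (!f) {                                 // first use: create slot 0 = index 0
            std::ofstream(path) << std::setw(kLineWidth - 1) << 0 << '\n';
            f.open(path, std::ios::in | std::ios::out | std::ios::binary);
        }

        std::string slot(kLineWidth, ' ');        // read the current write index
        f.read(&slot[0], kLineWidth);
        std::size_t index = std::stoul(slot);

        std::string entry = message.substr(0, kLineWidth - 1);   // pad / truncate
        entry.resize(kLineWidth - 1, ' ');
        entry += '\n';
        f.seekp(static_cast<std::streamoff>(index + 1) * kLineWidth); // slot 0 is the index
        f.write(entry.data(), static_cast<std::streamsize>(entry.size()));

        index = (index + 1) % kMaxLines;          // wrap and store the index back
        std::ostringstream idx;
        idx << std::setw(kLineWidth - 1) << index << '\n';
        f.seekp(0);
        f.write(idx.str().data(), static_cast<std::streamsize>(kLineWidth));
    }

As the answer warns, a human reading the raw file would still need the stored index (or per-entry timestamps) to tell where the logical beginning is.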
When you only want to use one file and the lengths of the lines are not constant, there is no way to do it without rewriting the whole file.
Depending on how often you are appending to the file, I don't see any problem with doing so. 1000 lines of approx. 100 chars each are only about 100 KB, which is not too much. Additionally, you may add some hysteresis.
However:
If the line length is constant (or you hard-limit the line length to some constant), you could just overwrite the oldest line. But then you have to keep track of the positions of old/new lines in the log file.
I would use two files: the first one is where you append lines. When it gets full, rename it to the second one and start filling the first one from the beginning again.
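A minimal sketch of that two-file rotation (the file names and the limit are placeholders, and for brevity the line counter does not survive a program restart):

    #include <cstddef>
    #include <cstdio>      // std::remove, std::rename
    #include <fstream>
    #include <string>

    constexpr std::size_t kMaxLines = 1000;

    void rotatingAppend(const std::string& line) {
        static std::size_t count = 0;              // lines in the active file
        if (count >= kMaxLines) {
            std::remove("log.old.txt");            // roll the full file over
            std::rename("log.txt", "log.old.txt");
            count = 0;
        }
        std::ofstream("log.txt", std::ios::app) << line << '\n';
        ++count;
    }

At most about twice the limit of lines exist on disk at any time, and no file is ever rewritten line by line.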

Append data columnwise in C/C++

I want to add columns of data to a text file, one column in each iteration (with one space between columns). If I open the file for appending, it adds the next column at the bottom of the first column. Is it possible to append sideways?
Not all the data is available at the start. Only one column of data becomes available in each iteration, and it gets lost in the next iteration.
Consider the file to be one long stream of characters, some of which just happen to be line breaks. Append always starts at the end of the file. If I'm reading you right, you need to use seekp (seek to the new position to put new characters at) on your fstream to get to the right position before writing.
You know the format of your file, so you can calculate how much to skip in each line.
Something like this might work:
    read line
    while line != "":
        skip forward the right number of " "
        write new column
        read new line
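Here is a hedged C++ sketch of that approach. It relies on an extra assumption not stated in the question: the file is created up front with every line padded with spaces to a fixed width, so each iteration overwrites space reserved for its column rather than inserting into the middle of the file (which ordinary file I/O cannot do). Without such padding you would have to rewrite the file each iteration. All widths and names are illustrative.

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    constexpr std::streamoff LINE_WIDTH = 64;   // bytes per line incl. '\n'
    constexpr std::streamoff COL_WIDTH  = 8;    // bytes reserved per column

    void appendColumn(const std::string& path, int colIndex,
                      const std::vector<std::string>& values) {
        std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
        for (std::size_t row = 0; row < values.size(); ++row) {
            std::string cell = values[row].substr(0, COL_WIDTH - 1);
            cell.resize(COL_WIDTH - 1, ' ');    // pad the value to the column width
            cell += ' ';                        // the single separating space
            // Skip `row` whole lines, then the columns written in earlier iterations.
            f.seekp(static_cast<std::streamoff>(row) * LINE_WIDTH
                    + colIndex * COL_WIDTH);
            f.write(cell.data(), static_cast<std::streamsize>(cell.size()));
        }
    }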