NameNode file division into blocks - HDFS

I am aware that a file is split into blocks by the NameNode when it is stored in HDFS. But when the file is divided, there is a chance that a block will contain only part of a line. Is my understanding correct? If so, will a map operation that needs to read each line have the mapper miss part of a line?
Thanks!!

In addition to @RojoSam's answer: the SPLIT_SLOP parameter is used by the RecordReader to read a single file's data from other blocks. SPLIT_SLOP lets a Hadoop job read a percentage of the data from a remote host when the data is not completely available on a single DataNode.

Yes, it is possible for a line to be split across two blocks. At the end of its block, the reader used by the mapper reads on into the next block to finish the last line and processes it. If it isn't reading the first block, the reader always skips the first (partial) line. At least that is how text files are handled; other formats work differently.
You won't miss any part of a line.
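For illustration, here is a minimal C++ sketch of the same idea (this is not Hadoop's actual LineRecordReader, just the boundary logic it applies to text files): a reader skips the first, possibly partial, line unless it owns the start of the data, and reads past the end of its split to finish its last line.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Illustrative only: `data` stands in for the whole file and
    // [start, end) is one reader's split.
    std::vector<std::string> readSplit(const std::string& data,
                                       std::size_t start, std::size_t end) {
        std::size_t pos = start;
        // Not the first split: skip the (possibly partial) first line;
        // the previous reader is responsible for it.
        if (start != 0) {
            std::size_t nl = data.find('\n', start);
            if (nl == std::string::npos) return {};
            pos = nl + 1;
        }
        std::vector<std::string> lines;
        // Process every line that *starts* inside this split; the last
        // one may extend past `end`, and we read on to finish it.
        while (pos < end && pos < data.size()) {
            std::size_t nl = data.find('\n', pos);
            if (nl == std::string::npos) nl = data.size();
            lines.push_back(data.substr(pos, nl - pos));
            pos = nl + 1;
        }
        return lines;
    }

    int main() {
        std::string file = "first line\nsecond line\nthird line\n";
        // Split the "file" in the middle of "second line".
        for (auto& l : readSplit(file, 0, 15))
            std::cout << "reader A: " << l << '\n';
        for (auto& l : readSplit(file, 15, file.size()))
            std::cout << "reader B: " << l << '\n';
    }

Reader A prints "first line" and the complete "second line" even though its split ends mid-line; reader B skips the partial line and prints only "third line".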

Related

MapReduce basics

I have a 300 MB text file and a block size of 128 MB.
So a total of 3 blocks of 128 + 128 + 44 MB would be created.
Correct me if I'm wrong: for MapReduce, the default input split size is the same as the block size, 128 MB, and it can be configured.
Now the record reader will read through each split and create key-value pairs, where the key is the byte offset and the value is a single line (TextInputFormat).
The question is: if a block ends in the middle of a line and the line continues in another block, will the rest of the line be fetched from the other node, or will the remaining part of the line be processed on the other node?
Also, how will the second node know that its first line has already been taken for processing, so that it doesn't process it again?
Eg
This is stackoverflow. This (end of block 1/input split) is a map reduce example. (end of line)
Three mappers will be generated in this scenario. Hadoop keeps a pointer at the end of every block indicating the location of the next block, so mapper 1 will process the complete line, including the part that lies in block 2, and mapper 2 will start processing after skipping that line.

C++ - Randomly access lines of several text files

I have 10 text files (named file0.txt to file9.txt) with arbitrary lengths and number of lines. I need to randomly pick a file, randomly access 1-3 lines from that file, process them and repeat until all the lines of all the files have been processed. This only needs to be done once. For the sake of this question let's say "process" means print the lines. Does anyone have any suggestions on how I can go about doing this without loading all the text files into memory?
There's not really any way to 'randomly access' lines in a text file (in the sense that you can randomly access a vector), since the only way to find the lines is to scan the file linearly for newlines. This means you'll need to stream through the files at least once to reach the lines, even if you don't load them fully into memory.
You could achieve what you're describing by passing over all the files once to count the number of lines in them and then passing over them again to pull out randomly selected lines. I'm not sure what the benefit of that would be though. What are you really trying to achieve?
You can scan each file once to build an index of where every line starts, and keep that index in memory (or even persist it if you need to process the same file more than once).
Once you have the index, you can simply seek to the beginning of a line and read it up to the newline/EOF before processing, as in the sketch below.
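A minimal C++ sketch of that index-then-seek approach; the file name follows the question (file0.txt), everything else is illustrative:

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // One pass over the file, recording the byte offset of each line start.
    std::vector<std::streampos> indexLines(std::ifstream& in) {
        std::vector<std::streampos> offsets;
        std::string line;
        std::streampos pos = in.tellg();
        while (std::getline(in, line)) {
            offsets.push_back(pos);
            pos = in.tellg();
        }
        return offsets;
    }

    int main() {
        std::ifstream f("file0.txt");
        std::vector<std::streampos> offsets = indexLines(f);

        if (offsets.size() > 2) {
            f.clear();            // clear the EOF flag so we can seek again
            f.seekg(offsets[2]);  // jump straight to, say, the third line
            std::string line;
            std::getline(f, line);
            std::cout << line << '\n';
        }
    }

The index costs one linear pass per file; after that, every line access is a single seek plus one read.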
Suggestion:
1/ Make a copy of the files
2/ Erase each line once it has been read
3/ Update the number of lines in the file
That way you randomly pick a line that exists and was not already read.
Lots of reads/writes... not efficient.

Algorithm for writing limited number of lines to text file

I have a program where I need to write text lines to a log file very frequently. I would like to limit the number of lines in the log file to 1000. When I write lines to the file, it should append them normally. Once the file reaches 1000 lines, I'd like to get rid of the first line and then append the new one. Does anyone know if there is a way to do this without rewriting the entire file each time?
Generally it's a little better in a case like this to remove more than one line at a time from the beginning.
That is, if your limit is 1000 lines, then when you hit 1000 lines, delete the first 300 or so and resume writing. That way, you're not performing the delete operation on every single line written thereafter, only once every 300 writes. If you need to persist 1000 lines, then instead keep up to 1300 and delete 300 whenever 1300 is reached; a sketch follows.
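A rough C++ sketch of that batching idea (the helper name and the 1000/300 numbers follow the answer; a real logger would cache the line count in memory rather than re-read the file on every append):

    #include <cstddef>
    #include <deque>
    #include <fstream>
    #include <string>

    // Hypothetical helper: append a line, and only once the file exceeds
    // limit + slack lines, rewrite it keeping the newest `limit` lines.
    void appendWithTrim(const std::string& path, const std::string& entry,
                        std::size_t limit = 1000, std::size_t slack = 300) {
        std::ofstream(path, std::ios::app) << entry << '\n';

        std::deque<std::string> lines;
        std::ifstream in(path);
        for (std::string l; std::getline(in, l); )
            lines.push_back(l);
        if (lines.size() <= limit + slack)
            return;                            // not full yet, no rewrite

        while (lines.size() > limit)
            lines.pop_front();                 // drop the oldest batch
        std::ofstream out(path, std::ios::trunc);  // one rewrite per ~`slack` appends
        for (const std::string& l : lines)
            out << l << '\n';
    }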
All files have to be aligned to the filesystem cluster size. So, no, there's no way around it: you can append a line to a file, but you can't delete the first line without rewriting the file.
You can use two files by turns.
Or use a buffer in memory and flush it periodically.
I think you still have to scan the file to find out how many lines it contains at the moment. Given that, you can keep the lines in some sort of buffer that you can easily add to and delete from.
Then you do your logging, and when you are done, you rewrite the file from the buffer (or only its last 1000 lines).
Other alternatives are discussed above.
And yes, try to avoid deleting line by line; it is generally a costly operation.
I've found some similar topics here and on CodeProject:
Small logger class;
Flexible logger class using standard streams in C++
http://www.codeproject.com/Articles/584794/Simple-logger-for-Cplusplus
Hope you find them useful :)
Any time you want to log, you can open the file, read your write index, jump to that position, and write a fixed-width log entry. When the index hits your upper threshold, simply set it back to 0.
There are a lot of caveats, though. First, each log entry (assuming you close the file in between) requires an open, a read, a seek, a write, a seek, a write and a close: find your index, go to it, write the new entry, then update the index. You also have the inherent issues of writing a fixed-size record. And a human reader will depend on your content to know where the "beginning" of the file is; most people expect line 1 to be the oldest.
I'm a much bigger advocate of simply having a few files and "rolling" them, so that each file on its own is coherent, but if you want just one file with a fixed number of lines, the circular-buffer idea can work.
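A C++ sketch of that circular-buffer layout, assuming fixed-width entries and a one-line header that stores the write index (all names and sizes here are illustrative):

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <string>

    const std::size_t kWidth = 100;   // fixed width of one entry, incl. '\n'
    const std::size_t kSlots = 1000;  // maximum number of lines kept
    const std::size_t kHdr   = 11;    // header line: 10-char index + '\n'

    void logLine(const char* path, const std::string& msg) {
        std::FILE* f = std::fopen(path, "r+b");
        if (!f) {                            // first use: create, zero index
            f = std::fopen(path, "w+b");
            std::fputs("0         \n", f);
        }
        char hdr[kHdr + 1] = {0};
        std::fseek(f, 0, SEEK_SET);
        std::fread(hdr, 1, kHdr, f);
        std::size_t idx = std::strtoul(hdr, nullptr, 10);

        std::string rec = msg.substr(0, kWidth - 1);
        rec.resize(kWidth - 1, ' ');         // pad to the fixed width
        rec += '\n';

        std::fseek(f, kHdr + idx * kWidth, SEEK_SET);
        std::fwrite(rec.data(), 1, kWidth, f);

        idx = (idx + 1) % kSlots;            // wrap back to slot 0
        char newHdr[kHdr + 1];
        std::snprintf(newHdr, sizeof newHdr, "%-10zu\n", idx);
        std::fseek(f, 0, SEEK_SET);
        std::fwrite(newHdr, 1, kHdr, f);
        std::fclose(f);
    }

Note how this matches the caveats above: every call is an open/read/seek/write/seek/write/close, and the oldest entry is wherever the index points, not line 1.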
When you only want to use one file and the lines are not of constant length, there is no way to do this without rewriting the whole file.
Depending on how often you append to the file, I don't see any problem with doing so: 1000 lines of approximately 100 characters each are only about 100 KB, which is not much. Additionally, you may add some hysteresis.
However:
If the line length is constant (or you hard-limit it to some constant), you could just overwrite the oldest line. But then you have to keep track of the log-file positions of the old and new lines.
I would use two files: the first one is where you append lines. When it gets full, rename it to the second one and start filling the first one again from the beginning.
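A small C++ sketch of that two-file scheme (the file names are made up for illustration; a real program would also initialize the line counter from the existing file at startup):

    #include <cstddef>
    #include <cstdio>
    #include <fstream>
    #include <string>

    const char* kLog = "app.log";     // illustrative file names
    const char* kOld = "app.log.1";
    const std::size_t kMaxLines = 1000;

    std::size_t g_lines = 0;  // lines written to the current file so far

    void log(const std::string& msg) {
        if (g_lines >= kMaxLines) {
            std::remove(kOld);        // drop the previous generation
            std::rename(kLog, kOld);  // the full file becomes the old one
            g_lines = 0;              // start filling a fresh file
        }
        std::ofstream(kLog, std::ios::app) << msg << '\n';
        ++g_lines;
    }

Every write is a pure append, and the two files together always hold between 1000 and 2000 of the most recent lines.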

Efficiently read the last row of a csv file

Is there an efficient C or C++ way to read the last row of a CSV file? The naive approach involves reading in the entire file and then going to the end. Is there a quicker way this can be done (particularly if the CSV files are large)?
What you can do is guess the line length, jump to 2-3 line lengths before the end of the file, and read the remaining lines. The last line you read is the last one, as long as you read at least one full line before it (otherwise, start again with a bigger offset).
I posted some sample code for doing a similar thing (reading the last N lines) in this answer (in PHP, but it serves as an illustration).
For implementations in a variety of languages, see
C++ : c++ fastest way to read only last line of text file?
Python : Efficiently finding the last line in a text file
Perl : How can I read lines from the end of file in Perl?
C# : Get last 10 lines of very large text file > 10GB c#
PHP : how to read only 5 last line of the txt file
Java: Read last n lines of a HUGE file
Ruby: Reading the last n lines of a file in Ruby?
Objective-C : How to read data from NSFileHandle line by line?
You can try working backwards: read a block of bytes from the end of the file and look for a newline. If there is no newline in that block, read the previous block, and so on.
Note that if the size of a row is large relative to the size of the file, this may result in worse performance, because most file-caching schemes assume files are read forward.
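A C++ sketch of that backward block scan (the chunk size and function name are illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <string>

    // Read fixed-size chunks from the end of the file until a newline is
    // found; everything after that newline is the last line.
    std::string lastLine(const std::string& path) {
        std::ifstream f(path, std::ios::binary);
        f.seekg(0, std::ios::end);
        std::streamoff end = f.tellg();
        if (end <= 0) return "";

        f.seekg(end - 1);
        if (f.get() == '\n') --end;    // ignore a trailing newline

        const std::streamoff kChunk = 4096;
        std::string buf;
        std::streamoff pos = end;
        while (pos > 0) {
            std::streamoff n = std::min(kChunk, pos);
            pos -= n;
            std::string chunk(static_cast<std::size_t>(n), '\0');
            f.seekg(pos);
            f.read(&chunk[0], n);
            buf.insert(0, chunk);      // prepend: buf now covers [pos, end)
            std::size_t nl = buf.rfind('\n');
            if (nl != std::string::npos)
                return buf.substr(nl + 1);
        }
        return buf;  // the file has a single line
    }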
You can use Perl module File::ReadBackwards.
Your problem falls into the same domain as searching for a string within a file. As you rightly point out, it's not always a great idea to read the entire file into memory and then search it. But you can always do the next best thing: memory-map the file, then search backwards from the end of the file for your newline.
It's an extremely efficient mechanism with minimal memory footprint and optimal disk I/O.
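A POSIX-only C++ sketch of that memory-map approach (the function name is made up, and error handling is omitted for brevity):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>
    #include <string>

    // Map the whole file read-only and walk backwards from the end to the
    // newline preceding the last line.
    std::string lastLineMmap(const char* path) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        std::size_t len = static_cast<std::size_t>(st.st_size);
        const char* p = static_cast<const char*>(
            mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0));

        std::size_t end = len;
        if (end > 0 && p[end - 1] == '\n') --end;  // skip trailing newline
        std::size_t start = end;
        while (start > 0 && p[start - 1] != '\n') --start;

        std::string line(p + start, p + end);
        munmap(const_cast<char*>(p), len);
        close(fd);
        return line;
    }

Only the pages actually touched near the end of the mapping are faulted in, which is what makes this cheap even for large files.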
Read with what and on what? On a Unix system, if you want the last line, it is as simple as
tail -n1 file.csv
If you want this approach from within your C++ app, you can do something like
system("tail -n1 file.csv")
if you want a quick and dirty way to accomplish this task.

Seeking to a line in a file in g++

Is there a way that I can seek to a certain line in a file to read or write data?
Let's say I want to write some data starting on the 10th line in a text file. There might be some data already in the first few lines, or the file could even be empty. Is there a way I can seek directly to the line I want without having to worry about what's already in the file?
Only if the lines are all the same length (seek to 9 * bytes_per_line, as in the sketch below). Otherwise, you'll just have to scan your way to the appropriate spot in the file.
Also be wary of writing into the middle of a file. It may not do what you expect: rather than inserting new lines, it simply overwrites whatever content is already there and won't respect existing line boundaries.
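A C++ sketch of the fixed-length case (the 80-character width is an assumption for illustration):

    #include <fstream>
    #include <string>

    // Assumes every line is exactly kLineLen characters plus a '\n';
    // only then does "line 10" sit at a computable byte offset.
    const int kLineLen = 80;  // illustrative width

    void writeTenthLine(const char* path, const std::string& text) {
        std::fstream f(path, std::ios::in | std::ios::out);
        f.seekp(9 * (kLineLen + 1));  // lines 1-9 occupy 9*(kLineLen+1) bytes
        std::string rec = text.substr(0, kLineLen);
        rec.resize(kLineLen, ' ');    // pad so later lines never shift
        f << rec;                     // overwrite in place, no insertion
    }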
You can seek to a position in a file, but that position must be a character offset from the start, end or current position - see for example fseek(). There is no way of seeking to a particular line, unless all the lines are exactly the same length.
No, you have to process the data to find the line delimiters (unless you have fixed length lines). Have a look at getline(), ftell() and fseek(). http://www.pixelbeat.org/programming/readline/cpp.cpp
The easiest way is to read the file into memory, inserting each line into a vector of strings, then modify/add whatever you want and rewrite each line to a new file (assuming the file fits in memory).
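A minimal C++ sketch of that read-modify-rewrite approach (the file name and replacement text are illustrative):

    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> lines;
        {
            std::ifstream in("data.txt");
            for (std::string l; std::getline(in, l); )
                lines.push_back(l);
        }
        if (lines.size() < 10)
            lines.resize(10);   // pad with empty lines if the file is short
        lines[9] = "new content for the 10th line";  // edit in memory

        std::ofstream out("data.txt", std::ios::trunc);
        for (const std::string& l : lines)
            out << l << '\n';
    }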