I have two text files of 1 MB each stored in HDFS as inputs to my MapReduce program. In the following line, the input pair for map() is <LongWritable, Text>.
class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
What is the LongWritable key here? How is it assigned by MapReduce? Does each line of text in the input file have its own key, or is a single key assumed for the complete text of the file (i.e., all the lines at once)?
1- The LongWritable key here is the byte offset of the line, i.e. the position in the file at which the line starts, not the line number. These two are very different things.
2- It is not assumed by the MapReduce framework itself. It is the job of the InputFormat you use in your MR job to decide it. The default is TextInputFormat, which gives the byte offset of the line as the key, as in your case.
3- Well, it again depends on your InputFormat. If you are using TextInputFormat then each line will be treated as the value with its offset as the key. But you could have your own custom InputFormat which may give you just one key for the whole file and all the lines of that file altogether as the value.
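To make the difference between byte offset and line number concrete, here is a small C++ sketch (not Hadoop code) that produces the kind of (offset, line) pairs described above. The function name offset_keyed_lines is made up for this illustration, and it assumes '\n' line endings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, for illustration only: pairs each line of `text`
// with the byte offset at which that line starts (assumes '\n' endings).
std::vector<std::pair<std::int64_t, std::string>>
offset_keyed_lines(const std::string& text) {
    std::vector<std::pair<std::int64_t, std::string>> records;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const std::size_t nl = text.find('\n', pos);
        const std::size_t line_end = (nl == std::string::npos) ? text.size() : nl;
        // Key = offset of the first byte of the line, value = the line itself.
        records.emplace_back(static_cast<std::int64_t>(pos),
                             text.substr(pos, line_end - pos));
        if (nl == std::string::npos) break;  // last line had no trailing newline
        pos = nl + 1;                        // next line starts after the '\n'
    }
    return records;
}
```

For the text "hello world\nfoo\nbar baz\n" the keys come out as 0, 12 and 16, whereas line numbers would be 0, 1 and 2.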
It is the byte offset of the line within the file, not the line number.
Related
I have a very long file in which (if it helps) every line can be assumed to have the same format. I want to read a specific line of the file. Is it possible in C++ to move the file pointer to that line via binary search, instead of starting at the top of the file, reading line by line and counting lines? That is, is it possible to access some line_of_file pointer and move it by binary search? If not in C++, is this possible in assembly language or some other language?
You cannot usefully use binary search to find a line by line number in a text file, because text files are not indexed by line number. In other words, there is no way to figure out the line number of a given offset in the file other than starting at the beginning, reading every character, and counting the number of newline characters.
There is only one exception, and in that case binary search won't help you either. If every line in the file has exactly the same length, then you can find the offset of a specific line by multiplying that length by the line number (using 0 as the number of the first line). Don't forget to include the newline character in the line length. You can use istream::seekg or ostream::seekp to position the next input or output operation, respectively. (You need to use the two-argument version. Some other warnings apply on platforms which translate newline characters to multicharacter sequences; here's looking at you, Windows.)
I have a C++ program that displays item codes with their corresponding item descriptions and prices on the screen. It asks the user to enter the code of the item purchased by a customer, then looks for a match among the item codes stored in items.txt.
How can I output only a specific line from a text file after the user inputs the item code?
You need to read the file line-by-line (std::getline), extract (depending on the exact format, e.g. by searching for a whitespace in the string) and compare the code and then return the corresponding line on a match.
It is not possible to access lines from a text file directly by index or content.
This is assuming that you mean the file contains lines in the form
code1 item1
code2 item2
//...
If the code is just the index of the line, then you only need to call std::getline in a loop with a loop counter for the current index of the line.
If you do this multiple times on the same file, you should probably parse the whole content first line-by-line into a std::vector<std::string> or a std::(unordered_)map<std::string, std::string> or something similar to avoid the costly repeated iteration.
Depending on the use case, maybe it would be even better to parse the data into a database first and then query the database, even if it is only e.g. sqlite or something like that.
I would like to read a text file in C++ in following manner:
Ignore the entire first line as it is simply meant as an introduction.
Only read the following lines from a specific position.
That starting position for reading is a fixed one and remains the same for every line; however, the numbers after that may be of variable length. I need to save all of these numbers from line 2 to line n into an Array.
At the moment I can read a regular 2D Array with getline.
How can I work around these things?
An example for a line I want to read could be:
Person1: 25 988.3 0.0023 7
To set the file to a position, use std::ifstream::seekg().
To set the file to the beginning of a line, you must read and count the line endings. Many text files have variable length text lines.
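Since the line starts cannot be computed, a simpler route for this particular question is to read sequentially: discard the first line, then parse each remaining line from the fixed column onward. This is a sketch of that alternative rather than of seekg; read_numbers and start_col are names invented here, and the numbers are assumed to be whitespace-separated.

```cpp
#include <cstddef>
#include <istream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: skips the first (introductory) line, then collects the
// numbers found from column `start_col` onward on every following line.
std::vector<std::vector<double>> read_numbers(std::istream& in,
                                              std::size_t start_col) {
    std::vector<std::vector<double>> rows;
    // Throw away everything up to and including the first newline.
    in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    std::string line;
    while (std::getline(in, line)) {
        if (line.size() <= start_col) continue;  // nothing after the prefix
        std::istringstream fields(line.substr(start_col));
        std::vector<double> row;
        double value = 0.0;
        while (fields >> value) row.push_back(value);  // variable-length numbers
        rows.push_back(row);
    }
    return rows;
}
```

For a line such as "Person1: 25 988.3 0.0023 7", a start_col of 8 skips the "Person1:" prefix and yields four numbers for that row.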
How can I work around these things?
You can't, unless you can ensure that all of the data lines after the first line are all the same length.
If you can't ensure that, then all you can do is read through all of the preceding lines.
An alternative I have employed in the past is to generate an 'index' of line start positions in a secondary file in binary format (so that I CAN jump directly to the right place in that file), and use that to jump to the right place in the text file. Of course that means that you need to regenerate that index file every time you replace/amend the data file.
I am aware that a file is split into blocks by the name node when it is stored in HDFS. But when the file is divided, there is a chance that a block will contain only part of a line. Is my understanding correct? If so, and I have a map operation that needs to read each line, will the mapper miss part of a line?
Thanks!!
In addition to @RojoSam's answer: the SPLIT_SLOP parameter is used by the RecordReader to read data of a single file from other blocks. SPLIT_SLOP allows a Hadoop job to read a percentage of the data from a remote host if the data is not completely available on a single datanode.
Yes, it is possible for a line to be split across two blocks. When the reader used by the mapper reaches the end of its block, it reads the first line of the next block and processes it. If it isn't reading the first block, the reader always skips the first line. That holds at least for text files; other formats work differently.
You won't miss any part of a line.
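The skip-and-read-ahead rule can be illustrated with a small in-memory simulation. This is a C++ sketch of the idea only, not Hadoop's actual reader: read_split is a made-up function, a split is modeled as the byte range [start, end) of a string, and '\n' line endings are assumed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simulation: returns the lines that a reader responsible for
// bytes [start, end) of `data` would hand to its mapper.
std::vector<std::string> read_split(const std::string& data,
                                    std::size_t start, std::size_t end) {
    std::size_t pos = start;
    if (start != 0) {
        // Not the first split: always discard the first (possibly partial)
        // line, because the previous split's reader has already consumed it.
        const std::size_t nl = data.find('\n', pos);
        if (nl == std::string::npos) return {};
        pos = nl + 1;
    }
    std::vector<std::string> lines;
    // Keep reading while a line *starts* at or before `end`; the last such
    // line may run past the end of the split into the next one.
    while (pos <= end && pos < data.size()) {
        const std::size_t nl = data.find('\n', pos);
        if (nl == std::string::npos) {
            lines.push_back(data.substr(pos));
            break;
        }
        lines.push_back(data.substr(pos, nl - pos));
        pos = nl + 1;
    }
    return lines;
}
```

With the 26-byte input "alpha\nbravo\ncharlie\ndelta\n" cut into the ranges [0, 10), [10, 20) and [20, 26), the first reader returns alpha and bravo (bravo crosses the boundary), the second returns charlie and delta, and the third returns nothing, so every line is processed exactly once.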
I have to extract information from a text file.
In the text file there is a list of strings.
This is an example of a string: AAA101;2015-01-01 00:00:00;0.784
The value after the last ; is a non-integer value which changes from line to line, so every line has a different length in characters.
I want to map all of these lines into a structured vector so that I can access a specific line any time I need it without scanning the whole file again.
I did some research and found some threads about a command that lets me jump to a specific line of a text file, but I read that it only works if every line has the same length in characters as the others.
I was thinking about converting all the lines in the file to a fixed format so that I can map the file as I want, but I hope there is a better and quicker way.
You can try TStringList*. It creates a list of AnsiStrings. Then each AnsiString can be accessed via ->operator [](numberOfTheLine).
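If TStringList is not available (it belongs to C++Builder's VCL), a standard C++ alternative is to parse the file once into a std::vector of structs. The sketch below is that alternative, not the TStringList approach; the Record struct, its field names and load_records are invented for the example, and the format is assumed to be three ';'-separated fields as in the sample line.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record layout for lines like "AAA101;2015-01-01 00:00:00;0.784".
struct Record {
    std::string id;
    std::string timestamp;
    double value;
};

// Hypothetical helper: parses every well-formed line into a Record so that
// line i can later be accessed as records[i] without touching the file again.
std::vector<Record> load_records(std::istream& in) {
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Record r;
        r.value = 0.0;
        // Split on ';' twice, then read the trailing number.
        if (std::getline(fields, r.id, ';') &&
            std::getline(fields, r.timestamp, ';') &&
            (fields >> r.value)) {
            records.push_back(r);
        }
        // Lines that do not match the expected format are skipped.
    }
    return records;
}
```

Because the lines are split on the delimiter rather than on character positions, their differing lengths do not matter.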