Creating input splits (HADOOP) - mapreduce

I have a 39 MB file and I set the block size to 36 MB. When the file is uploaded to HDFS, it is successfully stored in two blocks. Now when I run a MapReduce job (a simple reading job) on this file, the job counters show:
"INFO mapreduce.JobSubmitter: number of splits:1"
That is, it is treating the 2 blocks as a single split, so I looked around and found the formula for calculating the split size, which is as follows:
split size = max(minsize,min(maxsize,blocksize))
where minsize = mapreduce.input.fileinputformat.split.minsize and maxsize = mapreduce.input.fileinputformat.split.maxsize.
Now in my MR code I set the following properties:
Configuration conf = new Configuration();
conf.set("mapreduce.input.fileinputformat.split.minsize", "1");
conf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728");
That is, minsize = 1 byte and maxsize = 128 MB, so according to the formula the split size should be 36 MB and there should be two splits, but I still get the same counter output:
"INFO mapreduce.JobSubmitter: number of splits:1"
Can anyone explain why ?

The last split of a file is allowed to overflow by up to 10%.
This factor is called SPLIT_SLOP and it is set to 1.1.
In this scenario,
39 MB (remaining bytes) / 36 MB (input split size) ≈ 1.08, which is less than 1.1 (SPLIT_SLOP),
so the entire file is considered as a single split.
Snippet on how splits are divided:
long bytesRemaining = FileSize;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
    String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length-bytesRemaining, splitSize, clusterMap);
    splits.add(makeSplit(path, length-bytesRemaining, splitSize, splitHosts[0], splitHosts[1]));
    bytesRemaining -= splitSize;
}
Refer to the getSplits() method of FileInputFormat to see how splits are computed for each file.
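As a quick sanity check, here is a minimal Python sketch (illustration only, not Hadoop code) of the split-size formula and the SPLIT_SLOP rule described above, using the numbers from this scenario:

# Sketch of the split computation: the split-size formula plus the SPLIT_SLOP rule.
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than splitSize

def compute_splits(file_size, block_size, min_size=1, max_size=128 * 1024 * 1024):
    split_size = max(min_size, min(max_size, block_size))
    splits = []
    bytes_remaining = file_size
    while float(bytes_remaining) / split_size > SPLIT_SLOP:
        splits.append(split_size)
        bytes_remaining -= split_size
    if bytes_remaining > 0:
        splits.append(bytes_remaining)  # the last (possibly oversized) split
    return splits

# 39 MB file with 36 MB blocks: 39/36 is about 1.08 < 1.1, so a single 39 MB split
print(compute_splits(39 * 1024 * 1024, 36 * 1024 * 1024))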

Related

Python 2.7 - Append processed file contents from multiple files to one large CSV file with original filename headers separating them

I have not done any programming in about 12 years and have been asked by one of my colleagues to help with what is apparently a basic Python 2.7 script. My question is very similar to what this person asked (though it has not been answered):
Python - Batch combine Multiple large CSV, filter data, skip header, appending vertically into a single CSV
I need to prompt the user for the folder path, read in each file from that folder (there are hundreds of CSV files), process it, and then write the processed output of every file into a single CSV file, with each file's output preceded by the filename it was read from and separated by a blank line.
It would result in something like this:
CHEM_0_5
etc etc
etc etc
etc etc
LAW_4_1
etc etc
etc etc
LAW_7_3
etc etc
etc etc
Currently the script has to be edited with the name of the file it has to read, saved, and then run. Then the contents of the output file have to be manually copied into a new csv file. It is very tedious and time consuming.
This is what I currently have. Please note I have removed some of the processing from the example.
import time
import datetime
x = 0
stamp = 0
compare = 1
values = []
## INSERT NAME OF FILE YOU WANT TO CLEAN
g = open('CHEM_0_5.csv', 'r')
for line in g:
    lis = [line.split() for line in g]
    lis.pop(0)
    lis.pop(0)
timestamps = []
results = []
x = 0
# cl and ts are built by the processing that was removed from this example
for i in cl:
    ## INSERT WHAT YOU WANT TO SAVE THE FILE AS
    fd = open('new.csv', 'a')
    fd.write(str(ts[x]) + "," + str(i) + "\n")
    fd.close()
    x = x + 1
g.close()
I have been trying to re-learn Python while searching for answers, but given that I don't really know what I'm doing, I feel that is something better left until after I've completed this task for my colleague.
Thank you for taking the time to read my submission!
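A rough sketch of the batch workflow described above (assuming Python 2.7; process_file is a hypothetical placeholder for the per-file cleaning that was removed from the example):

import glob
import os

def process_file(path):
    # Hypothetical placeholder for the per-file cleaning logic.
    with open(path, 'r') as f:
        return [line.strip() for line in f if line.strip()]

folder = raw_input("Enter the folder path: ")   # Python 2.7 prompt
out_path = 'combined_output.csv'                # written to the current directory

with open(out_path, 'w') as out:
    for csv_path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        name = os.path.splitext(os.path.basename(csv_path))[0]
        out.write(name + "\n")                  # filename header, e.g. CHEM_0_5
        for row in process_file(csv_path):
            out.write(row + "\n")
        out.write("\n")                         # blank line between files

Adjust process_file to perform the real cleaning and return the rows to write for each file.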

BadDataError when editing a .dbf file using dbf package

I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a unix system. The converter txt2dbf is used to convert shapefile attribute tables (text file) to a .dbf.
Unfortunately, something has gone wrong (probably a separator/field length error) because there are 2 problems with the output .dbf files, as follows:
1. Some fields of the dbf contain data that should not be there. This data has "spilled over" from neighbouring fields.
2. An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").
This is an example of the first record in the output dbf (retrieved using dbview unix package):
Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:
Here's what I expected:
Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.
Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed for each model run, so "fixing" the .dbf files is the only option.
My approaches to the above problems are:
1. Extract the information from the fields that do exist to a new variable using dbf.add_fields and dbf.write (python package dbf), then delete the old incorrect fields using dbf.delete_fields.
2. Delete the unwanted additional field.
This is what I've tried:
with dbf.Table(db) as db:
    db.add_fields("TRAJNUMc C(4)")  # create new fields
    db.add_fields("YYYYMMDDc C(8)")
    db.add_fields("TIMEc C(4)")
    for record in db:  # extract data from fields
        dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
        dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
        dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
    db.delete_fields('Trajnum')  # delete the incorrect fields
    db.delete_fields('Yyyymmdd')
    db.delete_fields('Time')
    db.delete_fields('1000 201')  # delete the unwanted field
    db.pack()
But this produces the following error:
dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)
Given the apparent problem that there has been with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information that I need (frustrating because I can see that it exists)?
EDIT:
Rather than attempting to edit the 'bad' .dbf files, it seems a better approach is to 1. extract the required data from the bad files to a text file and then 2. write it to a new dbf. (See Ethan Furman's comments/answer below.)
EDIT:
An example of a faulty .dbf file that I need to fix/recover data from can be found here:
https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0
An example .txt file from which the faulty dbf files were created can be found here:
https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0
To fix the data and recreate the original text file, this snippet should help:
import dbf
table = dbf.Table('/path/to/scramble/table.dbf')
with table:
    fixed_data = []
    for record in table:
        # convert to str/bytes while skipping the delete-flag byte
        data = record._data[1:].tostring()
        trajnum = data[:4]
        ymd = data[4:12]
        time = data[12:16]
        level = data[16:].strip()
        # keep the four fields of this record together as one row
        fixed_data.append([trajnum, ymd, time, level])
new_file = open('repaired_data.txt', 'w')
for line in fixed_data:
    new_file.write(','.join(line) + '\n')
new_file.close()
Assuming all your data files look like your sample (the big IF being the data has no embedded commas), then this rough code should help translate your text files into dbfs:
raw_data = open('some_text_file.txt').read().split('\n')
final_table = dbf.Table(
    'dest_table.dbf',
    'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
    )
with final_table:
    for line in raw_data:
        fields = line.split(',')
        final_table.append(tuple(fields))
# table has been populated and closed
Of course, you could get fancier and use actual date and number fields if you want to:
# dbf field string becomes
'trajnum N; yyyymmdd D; time C(4); level N'
# appending data loop becomes
for line in raw_data:
    trajnum, ymd, time, level = line.split(',')
    trajnum = int(trajnum)
    ymd = dbf.Date(ymd[:4], ymd[4:6], ymd[6:])
    level = int(level)
    final_table.append((trajnum, ymd, time, level))

To count total number of rows in a csv/.txt file and write it to a new csv file in python

I want to count the total number of rows in a csv/.txt file and write that count to a new csv file, then clean the file and write a second column to the new file with the total number of rows after cleaning. (I already have the code for cleaning; I only need help with accepting a file and writing the total rows to a new file before and after cleaning.) I have attached my code below, which writes only the column name to the new csv file and doesn't print the result.
import csv
data = open('/anusha.csv', 'r')
#numline = len(file.readlines(data))
#print(numline)
before_clean = []
with open('out_anusha.csv', 'w') as f1:
    for row in data:
        f1 = len(file.readlines(data))
        before_clean.append(f1)
    writer = csv.writer(f1)
    f1.write("Before_clean")
Any help is appreciated!
One way to count the number of lines in a file, without reading the whole file in Python, is to use the wc utility if this program is supposed to run on a *nix system.
You can refer to Running "wc -l <filename>" within Python Code
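For example, a small sketch of that approach (assuming a *nix system with wc available; the file names follow the question, with a hypothetical cleaned file):

import subprocess

def count_lines(path):
    # wc -l prints "<count> <filename>"; take the first field
    out = subprocess.check_output(['wc', '-l', path])
    return int(out.split()[0])

before = count_lines('/anusha.csv')
# ... run the cleaning step here, writing the cleaned rows to a new file ...
after = count_lines('/anusha_cleaned.csv')   # hypothetical cleaned file

with open('out_anusha.csv', 'w') as f:
    f.write("Before_clean,After_clean\n")
    f.write("%d,%d\n" % (before, after))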

Proper reading of MP3 file disrupted by ID3 tags

My semester project is due this Thursday and I have a major problem with reading an MP3 file (the project is about sound analysis; don't ask me what exactly it is about or why I'm doing it so late).
First, I read the first 10 bytes to check for ID3 tags. If they're present, I'll just skip to the first MP3 header - or at least that's the big idea. Here is how I compute the ID3 tag size:
if (inbuf[0] == 'I' && inbuf[1] == 'D' && inbuf[2] == '3') // inbuf contains the first 10 bytes from the file
{
    int size = inbuf[3] * 2097152 + inbuf[4] * 16384 + inbuf[5] * 128 + inbuf[6]; // Will change to binary shifts later
    // Do something else with it - skip rest of ID3 tags etc
}
It works OK for files without ID3 tags and for some files with them, but for some other files ffmpeg (which I use for decoding) returns a "no header" error, which means it didn't find the MP3 header correctly. I know that because if I remove the ID3 tag from such an .mp3 file (with Winamp, for example), no errors occur. The conclusion is that the size calculation above isn't always valid.
So the question is: how do I find out exactly how big the entire ID3 part of an .mp3 file is (all possible tags, album picture and whatever else)? I have looked everywhere, but I just keep finding the algorithm I posted above, sometimes with a mention of a 10-byte footer I need to take into account; however, it often seems that more than 10 extra bytes have to be skipped before a proper MP3 frame is found.
The size of an ID3v1 tag is always a fixed 128 bytes.
I found the following description:
If you sum the size of all these fields we see that 30+30+30+4+30+1 equals 125 bytes and not 128 bytes. The missing three bytes can be found at the very beginning of the tag, before the song title. These three bytes are always "TAG" and are the identification that this is indeed an ID3 tag. The easiest way to find an ID3v1/1.1 tag is to look for the word "TAG" 128 bytes from the end of a file.
Source: http://id3.org/ID3v1
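For example, a minimal Python sketch of that check (assuming the file is at least 128 bytes long):

def has_id3v1(path):
    # An ID3v1 tag occupies the last 128 bytes of the file and starts with "TAG"
    with open(path, 'rb') as f:
        f.seek(-128, 2)   # 2 = seek relative to the end of the file
        return f.read(3) == b'TAG'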
There is another version, called ID3v2:
One of the design goals was that ID3v2 should be very flexible and expandable...
Since each frame can be 16MB and the entire tag can be 256MB you'll probably never again be in the same situation as when you tried to write a useful comment in the old ID3 being limited to 30 characters.
ID3v2 always starts at the beginning of an audio file, as you can read here: http://id3.org/ID3v2Easy
ID3v2/file identifier "ID3"
ID3v2 version $03 00
ID3v2 flags %abc00000
ID3v2 size 4 * %0xxxxxxx
The ID3v2 tag size is encoded with four bytes where the most significant bit (bit 7) is set to zero in every byte, making a total of 28 bits. The zeroed bits are ignored, so a 257 bytes long tag is represented as $00 00 02 01.
bool LameDecoder::skipDataIfRequired()
{
    auto data = m_file.read(3);
    Q_ASSERT(data.size() == 3);
    if (data.size() != 3)
        return false;
    if (memcmp(data.constData(), "ID3", 3))
    {
        m_file.seek(0);
        return true;
    }
    // ID3v2 tag is detected; skip it
    m_file.seek(3+2+1);
    data = m_file.read(4);
    if (data.size() != 4)
        return false;
    qint32 size = (data[0] << (7*3)) | (data[1] << (7*2)) |
                  (data[2] << 7) | data[3];
    m_file.seek(3+2+1+4+size);
    return true;
}
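For comparison, the same synchsafe size calculation as a small Python sketch (illustration only; the spec's 257-byte example is used as a check):

def id3v2_tag_size(header):
    # header: the first 10 bytes of the file - "ID3", version (2 bytes), flags (1), size (4)
    if header[:3] != b'ID3':
        return 0
    size = 0
    for b in header[6:10]:
        size = (size << 7) | (b & 0x7F)      # 7 significant bits per byte
    footer = 10 if header[5] & 0x10 else 0   # ID3v2.4 footer flag adds another 10 bytes
    return 10 + size + footer                # header + tag body (+ footer)

# The spec's example: a 257-byte tag body is encoded as $00 00 02 01
assert id3v2_tag_size(b'ID3\x03\x00\x00\x00\x00\x02\x01') == 10 + 257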

Sampling random lines from a CSV

I'm working with a large CSV file. How can I take a random sample of rows, say 200 in total, and recombine them into a CSV with the same structure as the original?
The procedure I would use is as follows:
1. Generate 200 unique random numbers between 0 and the number of lines in the CSV file.
2. Read each line of the CSV file and keep track of which line number you are reading. If the line number matches one of the numbers generated above, output it.
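A minimal sketch of that procedure (hypothetical file names; the header row is copied through unconditionally):

import random

with open('big.csv') as f:
    total = sum(1 for _ in f) - 1                        # data rows, excluding the header

chosen = set(random.sample(range(1, total + 1), 200))   # 200 unique line numbers

with open('big.csv') as src, open('sample.csv', 'w') as dst:
    for lineno, line in enumerate(src):
        if lineno == 0 or lineno in chosen:              # keep the header plus the chosen rows
            dst.write(line)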
Use the Reservoir Sampling technique, a random sampling method that does not require all records to be held in memory or the actual number of records to be known in advance. With it, you stream in your records one by one and probabilistically select them into the sample. Once the stream is exhausted, output the final sample records. The technique guarantees that each record in the stream has the same probability of being in the final sample; that is to say, it generates a simple random sample.
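A short sketch of reservoir sampling 200 rows (hypothetical file names; the header row is kept out of the reservoir and written separately):

import random

k = 200
reservoir = []

with open('big.csv') as f:
    header = next(f)                    # keep the header out of the sample
    for i, line in enumerate(f):
        if i < k:
            reservoir.append(line)      # fill the reservoir with the first k rows
        else:
            j = random.randint(0, i)    # row i survives with probability k / (i + 1)
            if j < k:
                reservoir[j] = line

with open('sample.csv', 'w') as out:
    out.write(header)
    out.writelines(reservoir)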
You can use the random module's random.sample method to pick random line offsets, as shown below.
import random

# Fetching line offsets.
# Courtesy: Adam Rosenfield's tip about how to read a HUGE text file.
# http://stackoverflow.com/questions/620367/
# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
with open('your_file') as f:
    for line in f:
        line_offset.append(offset)
        offset += len(line)

# Part where you pick the random lines and copy to your new file
# My 2 cents.
randoffsets = random.sample(line_offset, 200)
with open('your_file') as f:
    for k in randoffsets:
        f.seek(k)
        f.readline()  # and append to your new file
You could try to use linecache if it works for you, but since linecache reads the entire file into memory I'm not sure how well it would work for a 6 GB file.
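If you do want to try linecache despite that caveat, a tiny sketch might look like this (hypothetical file names; line 1 is assumed to be the header):

import linecache
import random

num_lines = sum(1 for _ in open('your_file'))
chosen = random.sample(range(2, num_lines + 1), 200)   # skip line 1, the header

with open('sample.csv', 'w') as out:
    out.write(linecache.getline('your_file', 1))        # header
    for n in chosen:
        out.write(linecache.getline('your_file', n))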