How to get this output from Pig Latin in MapReduce

I want to get the following output from Pig Latin / Hadoop
((39,50,60,42,15,Bachelor,Male),5)
((40,35,HS-grad,Male),2)
((39,45,15,30,12,7,HS-grad,Female),6)
from the following data sample
(image: data sample for the adult data set)
I have written the following Pig Latin script:
sensitive = LOAD '/mdsba/sample2.csv' using PigStorage(',') as (AGE,EDU,SEX,SALARY);
BV= group sensitive by (EDU,SEX) ;
BVA= foreach BV generate group as EDU, COUNT (sensitive) as dd:long;
Dump BVA ;
Unfortunately, the results come out like this
((Bachelor,Male),5)
((HS-grad,Male),2)

Then try to project the AGE data too.
Something like this:
BVA= foreach BV generate
     sensitive.AGE as AGE,
     FLATTEN(group) as (EDU,SEX),
     COUNT(sensitive) as dd:long;
Another suggestion is to specify the datatype when you load the data.
sensitive = LOAD '/mdsba/sample2.csv' using PigStorage(',') as (AGE:int,EDU:chararray,SEX:chararray,SALARY:chararray);
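Putting the two suggestions together, a minimal sketch of the revised script could look like this (the path and column names are taken from the question; note that sensitive.AGE comes back as a bag, so each output row will read like ({(39),(50),(60),(42),(15)},Bachelor,Male,5) rather than the exact tuple layout shown at the top of the question):
sensitive = LOAD '/mdsba/sample2.csv' USING PigStorage(',')
            AS (AGE:int, EDU:chararray, SEX:chararray, SALARY:chararray);
-- group by education and sex
BV  = GROUP sensitive BY (EDU, SEX);
-- carry the bag of ages through, flatten the group key, and count the rows
BVA = FOREACH BV GENERATE
          sensitive.AGE AS AGES,
          FLATTEN(group) AS (EDU, SEX),
          COUNT(sensitive) AS dd:long;
DUMP BVA;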

Related

REGEX certain rows of Data and keep the rest

I'm trying to extract the following data and convert it to a final column in BigQuery.
Raw Data
SAY LOWERS = BAD.Q
Virginia
SAY LOWERS = BAD.U
Oregon
Georgia
SAY LOWERS = BAD.U
SAY LOWERS = BAD.A
California
Final Version
BAD.Q
Virginia
BAD.U
Oregon
Georgia
BAD.U
BAD.A
California
Basically, I'm trying to remove "SAY LOWERS = " from every row that contains it, keep whatever comes after it, and leave the rows that don't have that phrase unchanged.
This answer covers how to run regexp_replace in Google BigQuery; here is the query adapted for your use case:
SELECT regexp_replace(your_column_name, r'SAY LOWERS = ', '') final_column_name
FROM your_table_name
You don't need a regex to remove a constant string from another one. Just use REPLACE:
SELECT REPLACE(your_column, 'SAY LOWERS = ', '') AS final_column
FROM your_table

counting genres in pig

I'm working with the movies.dat dataset provided by movielensdata. The first 5 rows of the data are:
1:Toy Story (1995):Adventure|Animation|Children|Comedy|Fantasy
2:Jumanji (1995):Adventure|Children|Fantasy
3:Grumpier Old Men (1995):Comedy|Romance
4:Waiting to Exhale (1995):Comedy|Drama|Romance
5:Father of the Bride Part II (1995):Comedy
I want to count the exact number of occurrences of each genre. To do this, the following MapReduce (Python) code is sufficient.
#!/usr/bin/env python
import sys
#mapper
for line in sys.stdin:
    for genre in line.strip().split(":")[-1].split("|"):
        print("{x}\t1".format(x=genre))
#!/usr/bin/env python
#reducer
import sys
genre_dict={}
for line in sys.stdin:
    data=line.strip().split("\t")
    if len(data)!=2:
        continue
    else:
        if data[0] not in genre_dict.keys():
            genre_dict[data[0]]=1
        else:
            genre_dict[data[0]]+=1
a=list(genre_dict.items())
a.sort(key=lambda x:x[1],reverse=True)
for genre,count in a:
    print("{x}\t{y}".format(x=genre,y=count))
Any suggestions for a Pig query to do the same task?
Thanks in advance...
TOKENIZE and FLATTEN can help you out here. The TOKENIZE operator in Pig takes a string and a delimiter, splits the string into parts based on the delimiter and puts the parts into a bag. The FLATTEN operator in Pig takes a bag and explodes each element in the bag into a new record. The code will look as follows:
--Load your initial data and split into columns based on ':'
data = LOAD 'path_to_data' USING PigStorage(':') AS (index:long, name:chararray, genres:chararray);
--Split & Explode each individual genre into a separate record
dataExploded = FOREACH data GENERATE FLATTEN(TOKENIZE(genres, '|')) AS genre;
--GROUP and get counts for each genre
dataWithCounts = FOREACH (GROUP dataExploded BY genre) GENERATE
group AS genre,
COUNT(dataExploded) AS genreCount;
DUMP dataWithCounts;
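If you also want the genres ordered by their counts, the way the Python reducer sorts before printing, one more statement on top of the aliases above should do it (a sketch):
--Sort genres by count, highest first
sortedCounts = ORDER dataWithCounts BY genreCount DESC;
DUMP sortedCounts;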

REGEX_EXTRACT_ALL not returning correct results in APACHE PIG

I want to transform unstructured data into structured form. The data is of the following form (showing 1 row of data):
Agra - Ahmedabad### Sat, 24 Jan### http://www.cleartrip.com/m/flights/results?from=AGR&to=AMD&depart_date=24/01/2015&adults=1&childs=0&infants=0&class=Economy&airline=&carrier=&intl=n&page=loaded Air India### 15:30 -
14:35### 47h 5m, 3 stops , AI 406### Rs. 30,336###
and I want to extract the data in the following format using APACHE PIG
(Agra - Ahmedabad,Sat, 24 Jan,http://www.cleartrip.com/m/flights/results?from=AGR&to=AMD&depart_date=24/01/2015&adults=1&childs=0&infants=0&class=Economy&airline=&carrier=&intl=n&page=loaded,Air India,15:30 - 14:35,47h 5m, 3 , AI 406 , 30,336)
I am using the following lines in APACHE PIG:
A = LOAD '/prodqueue_cleartrip_23rdJan15.txt' using PigStorage as (value: chararray);
B = foreach A generate REGEX_EXTRACT_ALL('value', '([^#]+)#+\\s+([^#]+)#+\\s+([^\\s]+)\\s+([^#]+)#+\\s+([0-9]{1,2}:[0-9]{1,2}\\s-\\n[0-9]{1,2}:[0-9]{1,2})#+\\s+([^,]+),\\s([0-9]+)\\sstops\\s,\\s([^#]+)#+\\s+Rs.\\s([^#]+)#+');
C = LIMIT B 5;
The output I am getting is this:
()
()
()
()
()
What is the mistake?
This may just be a typo in your question, but
REGEX_EXTRACT_ALL('value', '([^#]+)#+\\s...
would just search the literal string "value". You probably want to take out the single quotes so that it references the field value.
REGEX_EXTRACT_ALL(value, '([^#]+)#+\\s...

BadDataError when editing a .dbf file using dbf package

I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a unix system. The converter txt2dbf is used to convert shapefile attribute tables (text file) to a .dbf.
Unfortunately, something has gone wrong (probably a separator/field length error) because there are 2 problems with the output .dbf files, as follows:
Some fields of the dbf contain data that should not be there. This data has "spilled over" from neighbouring fields.
An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").
This is an example of the first record in the output dbf (retrieved using dbview unix package):
Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:
Here's what I expected:
Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.
Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed for each model run, so "fixing" the .dbf files is the only option.
My approaches to the above problems are:
Extract the information from the fields that do exist to a new variable using dbf.add_fields and dbf.write (python package dbf), then delete the old incorrect fields using dbf.delete_fields.
Delete the unwanted additional field.
This is what I've tried:
with dbf.Table(db) as db:
    db.add_fields("TRAJNUMc C(4)") #create new fields
    db.add_fields("YYYYMMDDc C(8)")
    db.add_fields("TIMEc C(4)")
    for record in db: #extract data from fields
        dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
        dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
        dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
    db.delete_fields('Trajnum') # delete the incorrect fields
    db.delete_fields('Yyyymmdd')
    db.delete_fields('Time')
    db.delete_fields('1000 201') #delete the unwanted field
    db.pack()
But this produces the following error:
dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)
Given the apparent problem that there has been with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information that I need (frustrating because I can see that it exists)?
EDIT:
Rather than attempting to edit the 'bad' .dbf files, it seems a better approach to 1. extract the required data from the bad files to a text file and then 2. write it to a new dbf. (See Ethan Furman's comments/answer below).
EDIT:
An example of a faulty .dbf file that I need to fix/recover data from can be found here:
https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0
An example .txt file from which the faulty dbf files were created can be found here:
https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0
To fix the data and recreate the original text file, this snippet should help:
import dbf

table = dbf.Table('/path/to/scramble/table.dbf')
with table:
    fixed_data = []
    for record in table:
        # convert to str/bytes while skipping delete flag
        data = record._data[1:].tostring()
        trajnum = data[:4]
        ymd = data[4:12]
        time = data[12:16]
        level = data[16:].strip()
        # keep the four fields of this record together as one row
        fixed_data.append([trajnum, ymd, time, level])
new_file = open('repaired_data.txt', 'w')
for line in fixed_data:
    new_file.write(','.join(line) + '\n')
new_file.close()
Assuming all your data files look like your sample (the big IF being the data has no embedded commas), then this rough code should help translate your text files into dbfs:
raw_data = open('some_text_file.txt').read().split('\n')
final_table = dbf.Table(
        'dest_table.dbf',
        'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
        )
with final_table:
    for line in raw_data:
        if not line:   # skip blank lines such as a trailing newline
            continue
        fields = line.split(',')
        final_table.append(tuple(fields))
# table has been populated and closed
Of course, you could get fancier and use actual date and number fields if you want to:
# dbf string becomes
'trajnum N; yyyymmdd D; time C(4); level N'
# appending data loop becomes
for line in raw_data:
    trajnum, ymd, time, level = line.split(',')
    trajnum = int(trajnum)
    ymd = dbf.Date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:]))
    level = int(level)
    final_table.append((trajnum, ymd, time, level))

date validation using regular expression in pig - getting ERROR 1003: Unable to find an operator for alias

I have sampledata.csv which contains data as below,
2,4/1/2010,5.97
2,4/6/2010,12.71
2,4/7/2010,34.52
2,4/12/2010,7.89
2,4/14/2010,17.17
2,4/16/2010,9.25
2,4/19/2010,26.74
I want to filter the data in pig script so that only data with valid date are considered.
Say if the date is like '4//2010' or '/9/2010', then it has to be filtered out.
Below is the pig script I have written and the output I am getting while dumping the data.
script:
data = load 'sampledata.csv' using PigStorage(',') as (custid:int, date:chararray,amount:float);
cleadata = FILTER data by REGEX_EXTRACT(date, '(([1-9])|(1[0-2]))/(([0-2][1-9])|([3][0-1]))/([1-9]{4})', 1) != null;
Output:
2014-09-14 18:21:30,587 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1003: Unable to find an operator for alias cleandata
I am a beginner in Pig scripting. If you have come across this kind of error, please let me know how to resolve it.
Here is the solution to your problem. I have also modified the regex; you can change it according to your needs.
input.txt
2,04/1/0000,5.97
2,04/1/2010,5.97
2,44/6/2010,12.71
2,4/07/2010,34.52
2,4/\12/2010,7.89
2,4/14/2010/,17.17
2,/16/2010,9.25
2,4/19//2010,26.74
2,4//19/2010,26.74
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (custid:int,date:chararray,amount:float);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(date, '(0?[1-9]|1[0-2])/([1-2][0-9]|[3][0-1]|0?[1-9])/([1-2][0-9]{3})')) AS (month,day,year);
C = FOREACH B GENERATE CONCAT(month,'/',day,'/',year) AS extractedDate;
D = FILTER C BY extractedDate is not null;
DUMP D;
Output:
(04/1/2010)
(4/07/2010)
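If you need to keep custid and amount alongside the valid dates rather than only the reconstructed date, a small variation on the same idea should work (a sketch that reuses alias A and the regex from above; the extracted tuple is kept only for the null test and then dropped):
E = FOREACH A GENERATE custid, date, amount,
    REGEX_EXTRACT_ALL(date, '(0?[1-9]|1[0-2])/([1-2][0-9]|[3][0-1]|0?[1-9])/([1-2][0-9]{3})') AS parts;
F = FILTER E BY parts IS NOT NULL;
G = FOREACH F GENERATE custid, date, amount;
DUMP G;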