Group tuples by field1 OR field2 in PIG? - mapreduce

Given tuples with two fields:
(x,y)
group them together whenever they share a common value in either field x or field y.
For example, given:
(1,a)
(2,a)
(2,b)
(3,c)
(3,d)
If done correctly, this would result in the list being split into two groups.
group1:
(1,a)
(2,a)
(2,b)
group2:
(3,c)
(3,d)
I was thinking something like this:
a = LOAD 'givenTuples' USING PigStorage(',');
b = GROUP a by $0;
matchleft = foreach b {
--join $1 of b tuples with $0 of a;
}
But a join can't be used in a nested foreach. I'm guessing this problem can't be solved in MapReduce without an unknown number of MapReduce passes.
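The grouping described above amounts to finding connected components: each x value and each y value is a node, and every tuple links its x to its y. The following is a minimal single-machine sketch in Python using union-find, not a Pig or MapReduce solution; the function name `group_by_either_field` is made up for illustration. A distributed version would typically have to repeat a label-propagation step until nothing changes, which is consistent with the guess that the number of passes is not known in advance.

```python
from collections import defaultdict


def group_by_either_field(pairs):
    """Group (x, y) tuples that are linked through a shared x or a shared y."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag values with their field so an x value never collides with a y value.
    for x, y in pairs:
        union(("x", x), ("y", y))

    groups = defaultdict(list)
    for x, y in pairs:
        groups[find(("x", x))].append((x, y))
    return sorted(groups.values())


pairs = [(1, "a"), (2, "a"), (2, "b"), (3, "c"), (3, "d")]
groups = group_by_either_field(pairs)
for group in groups:
    print(group)
```

For the example input this yields the two groups listed in the question.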

Related

Reading Values from a Bag of Tuples in pig

I have UDF output like this.
Sample records:
({(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,5),(Todd,10),(Todd,20),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10)})
({(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,5),(Jon,10),(Jon,20),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,5),(Jon,20),(Jon,1)})
Schema for UDF: name:chararray (one single column)
Now I want to read this bag of tuples and generate output such as:
Todd,240
Jon,422
I stored the output of the UDF in a temp file and read it back using a different schema:
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
After that I tried to use a foreach with dot notation to compute the sum:
X = foreach D generate B.T.name,SUM(B.T.denom);
2017-03-04 13:52:59,507 ERROR org.apache.pig.tools.grunt.Grunt: ERROR
1128: Cannot find field T in name:chararray,denom:int Details at
logfile: /home/training/pig_1488648405070.log
Can you please let me know how to compute it? I am new to Apache Pig, so I am not sure how to traverse a bag of tuples and compute the sum.
You need to GROUP the dataset on name before performing SUM.
To be able to GROUP, first FLATTEN the bag:
flattened = FOREACH D GENERATE FLATTEN(B);
dump flattened;
...
(Todd,10)
(Todd,10)
(Jon,1)
(Jon,1)
....
Then GROUP them on name:
grouped = GROUP flattened by name;
dump grouped;
(Jon,{(Jon,1),(Jon,20),(Jon,5),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,20),(Jon,10),(Jon,5),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1)})
(Todd,{(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,20),(Todd,10),(Todd,5),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1)})
And apply SUM() over the result:
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
dump final_sum;
(Jon,106)
(Todd,100)
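The same flatten, group, and sum steps can be sketched in plain Python to check the arithmetic. This only illustrates the logic and is not Pig code; the denominations are copied from the sample records in the question, and the variable names are made up.

```python
from collections import defaultdict

# Denominations copied from the two sample bags in the question.
todd = [1, 1, 1, 1, 1, 5, 10, 20, 10, 10, 10, 10, 10, 10]
jon = [1, 1, 1, 1, 1, 5, 10, 20, 10, 10, 10, 10, 5, 20, 1]
bags = [[("Todd", d) for d in todd], [("Jon", d) for d in jon]]

# FLATTEN: turn the bags into one flat list of (name, denom) tuples.
flattened = [record for bag in bags for record in bag]

# GROUP by name and SUM the denom field in one pass.
totals = defaultdict(int)
for name, denom in flattened:
    totals[name] += denom
totals = dict(totals)

print(totals)
```

The totals agree with the dump of final_sum: 100 for Todd and 106 for Jon.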

PIG regex extract then filter the unnamed regex tuple

I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to use REGEX_EXTRACT_ALL to turn it into the elements of a tuple, separated by ",". Then I need to filter out the ones that contain structure or location.
However, I got an error saying that the regex result type can't be filtered. Any ideas?
By the way, the end goal is to parse out the longest hierarchy, like (topic|news|politics|elections|primary)
Update: here is the script:
data = load '/web/visit_log/20160303'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#section as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
The syntax for the filter with matches seems incorrect. The data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');
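The stated end goal (drop the structure and location entries, then keep the longest hierarchy) can be sketched in Python to make the intended logic concrete. This is not Pig code; the function name `longest_hierarchy` is hypothetical, the path list is an abridged copy of the sample string with the escaped slashes removed, and it assumes the last two fused entries in the sample are meant to be separate elements.

```python
def longest_hierarchy(paths, excluded=("structure", "location")):
    """Return the deepest path that contains none of the excluded segments."""
    # Split "topic/news/" into ["topic", "news"].
    segment_lists = [p.strip("/").split("/") for p in paths]
    kept = [segs for segs in segment_lists
            if not any(name in segs for name in excluded)]
    if not kept:
        return None
    # The longest hierarchy is the one with the most segments.
    return "|".join(max(kept, key=len))


# Abridged from the sample in the question (assumption: escapes already removed).
paths = [
    "structure/",
    "structure/home_page/",
    "topic/",
    "topic/location/",
    "topic/location/united_states/",
    "topic/news/",
    "topic/news/politics/",
    "topic/news/politics/elections/",
    "topic/news/politics/elections/primary/",
]
result = longest_hierarchy(paths)
print(result)
```

Filtering on whole path segments rather than on substrings avoids accidentally dropping a path whose segment merely contains one of the excluded words.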

Isolating the elements of a row in python

I am working on a project where I am trying to plot the rainfall pattern of various states of my country. I fetch the data from my database with this command:
cur.execute('SELECT JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DECEMBER FROM Rainfall_In_Cm where STATE_UT = %s && DISTRICT = %s' ,(state , district))
The result comes back as a list with one element (the single row from the query output):
(Decimal('17.5'), Decimal('9.9'), Decimal('8.9'), Decimal('4.0'), Decimal('9.3'), Decimal('53.8'), Decimal('227.1'), Decimal('280.90'), Decimal('125.4'), Decimal('28.1'), Decimal('5.0'), Decimal('4.7'))
Now I want all the elements in the form of a list that I can use with matplotlib to plot a graph, and I want to remove the 'Decimal' string from in front of every value. How can I do it?
Your result is a tuple, not a list, but that is not a problem.
Casting should work for these Decimal objects. You can use a list comprehension:
from decimal import Decimal

# Data from your example
foo = (Decimal('17.5'), Decimal('9.9'), Decimal('8.9'), Decimal('4.0'), Decimal('9.3'), Decimal('53.8'), Decimal('227.1'), Decimal('280.90'), Decimal('125.4'), Decimal('28.1'), Decimal('5.0'), Decimal('4.7'))
# Convert each Decimal to a plain float
bar = [float(i) for i in foo]
If you would rather have integers (note that int() truncates the fractional part), use:
bar = [int(i) for i in foo]

Match a string pattern from other data.frame

Suppose I have two dataframes a and b,
a has one column called 'detail':
pure water
wood fire
mineral water
water
fire work
and b has one column called 'type':
water
fire
Many R functions require literal input text to match against, e.g. grep('fire',a), but my question is whether there is a way to match a using b. I tried a loop but failed. The following sqldf query returned false for every match.
ab <- sqldf(select *,case when detail in (select distinct types from b) then 1 else 0 end as match) from a)
Ideally, one could use something like c <- grep(a$detail,b$types). I am not sure whether that is allowed in R, though.
Thanks in advance!
Create a type column in a and then merge on it:
merge(transform(a, type = sub(".* ", "", a$detail)), b, all = TRUE)
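Note that sub(".* ", "", ...) keeps only the last word of each detail, so "fire work" becomes "work" and finds no partner in b. As a point of comparison, here is a minimal Python sketch (not R) that assigns a type whenever one of the types appears as a whole word anywhere in the detail; the function name `match_type` is made up for illustration.

```python
def match_type(detail, types):
    """Return the first type that occurs as a whole word in detail, else None."""
    words = detail.split()
    for candidate in types:
        if candidate in words:
            return candidate
    return None


details = ["pure water", "wood fire", "mineral water", "water", "fire work"]
types = ["water", "fire"]

matches = {detail: match_type(detail, types) for detail in details}
for detail, matched in matches.items():
    print(detail, "->", matched)
```

Matching on whole words is a design choice: a substring test would also flag words such as "fireplace", which may or may not be what is wanted.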

Splitting a tuple of bags into multiple tuples using REGEX_EXTRACT_ALL in Pig

I have my input string as
{(100),(200),(300)}
and I want to split them into different columns using regex_extract_all.
I have used REGEX_EXTRACT_ALL(data,'[^{(,)}]+') and I am getting an error saying:
Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT_ALL as multiple or none of them fit. Please use an explicit cast.
What am I missing?
If it's a tuple of bags, you can't use it as a chararray field - thus you can't use REGEX_EXTRACT_ALL with it. If you want to split it into multiple tuples, use the function BagToTuple and FLATTEN.
e.g.:
...
a = load '/tmp/1.txt';
b = group a by $0;
c = foreach b generate group, FLATTEN(BagToTuple(a.$1));
...
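If the input really were a plain string rather than a bag, the character class from the question would pick out the individual values. The sketch below shows this with Python's re.findall, which returns every non-overlapping match; it is only an illustration of the pattern and makes no claim about how Pig's REGEX_EXTRACT_ALL treats it.

```python
import re

raw = "{(100),(200),(300)}"

# Match runs of characters that are not braces, parentheses or commas.
columns = re.findall(r"[^{(,)}]+", raw)
print(columns)

# Convert to integers if numeric columns are needed.
numbers = [int(value) for value in columns]
print(numbers)
```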