My UDF produces output like the following.
Sample records:
({(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,5),(Todd,10),(Todd,20),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10)})
({(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,5),(Jon,10),(Jon,20),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,5),(Jon,20),(Jon,1)})
Schema of the UDF output: name:chararray (a single column)
Now I want to read this bag of tuples and generate output like this:
Todd,240
Jon,422
I stored the output of the UDF in a temp file and read it back using a different schema:
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
After that I am trying to use a FOREACH with dot notation to compute the sum:
X = foreach D generate B.T.name,SUM(B.T.denom);
This fails with the following error:
2017-03-04 13:52:59,507 ERROR org.apache.pig.tools.grunt.Grunt: ERROR 1128: Cannot find field T in name:chararray,denom:int
Details at logfile: /home/training/pig_1488648405070.log
Can you please let me know how to do this? I am new to Apache Pig, so I am not sure how to traverse a bag of tuples and compute the sum.
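You need to GROUP the dataset on name before performing SUM, and you need to FLATTEN the bag before you can GROUP. (The error occurs because fields inside a bag are projected as B.name and B.denom; the tuple alias T is not part of the path.)
First, FLATTEN the bag so that each (name, denom) tuple becomes its own record: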
flattened = FOREACH D GENERATE FLATTEN(B);
dump flattened;
...
(Todd,10)
(Todd,10)
(Jon,1)
(Jon,1)
....
Then GROUP the records on name:
grouped = GROUP flattened by name;
dump grouped;
(Jon,{(Jon,1),(Jon,20),(Jon,5),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,20),(Jon,10),(Jon,5),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1)})
(Todd,{(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,20),(Todd,10),(Todd,5),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1)})
Finally, apply SUM() to each group:
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
dump final_sum;
(Jon,106)
(Todd,100)
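The three steps above (FLATTEN, GROUP, SUM) can be cross-checked outside Pig. The following is only a minimal Python sketch of the same logic, not Pig code; the variable names `bags`, `flattened`, and `totals` are made up for illustration, and the data is copied from the sample records in the question.

```python
from collections import defaultdict

# Two bags copied from the question's sample records (one bag per input row).
# The repeated entries are written with list multiplication for brevity.
bags = [
    [("Todd", 1)] * 5 + [("Todd", 5), ("Todd", 10), ("Todd", 20)] + [("Todd", 10)] * 6,
    [("Jon", 1)] * 5 + [("Jon", 5), ("Jon", 10), ("Jon", 20)] + [("Jon", 10)] * 4
    + [("Jon", 5), ("Jon", 20), ("Jon", 1)],
]

# Equivalent of FLATTEN(B): every tuple in every bag becomes its own record.
flattened = [record for bag in bags for record in bag]

# Equivalent of GROUP ... BY name followed by SUM(denom) per group.
totals = defaultdict(int)
for name, denom in flattened:
    totals[name] += denom

for name, total in sorted(totals.items()):
    print(name, total)
```

For this sample data the totals agree with the dump of final_sum above (Jon 106, Todd 100).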
I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to use REGEX_EXTRACT_ALL to turn it into the elements of a tuple, separated by ",". Then I need to keep only the elements that don't contain structure or location.
However, I get an error saying that a regex type can't be filtered. Any ideas?
By the way, the end goal is to parse out the longest hierarchy, like (topic|news|politics|elections|primary).
Updated script:
data = load '/web/visit_log/20160303'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#'section' as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
The syntax of the matches expression in the filter seems incorrect. The data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');
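One thing worth checking with either form: as far as I know, Pig's matches operator follows Java's whole-string regex matching, so a bare pattern like 'structure|location' only matches a value that is exactly "structure" or "location". To test whether a value contains one of the words, the pattern usually needs to be wrapped as '.*(structure|location).*'. The sketch below illustrates the difference in Python, using `re.fullmatch` as a stand-in for whole-string matching; it is not Pig code, and the `elements` list is a hypothetical sample built from paths in the question.

```python
import re

# Hypothetical sample elements taken from the string in the question.
elements = [
    "structure/home_page/",
    "topic/location/united_states/",
    "topic/news/politics/elections/primary/",
]

# Whole-string matching with the bare alternation: nothing matches,
# because no element is exactly "structure" or "location".
bare = re.compile(r"structure|location")
hits_bare = [e for e in elements if bare.fullmatch(e)]

# Wrapping the alternation in .* turns it into a "contains" test.
wrapped = re.compile(r".*(structure|location).*")
hits_wrapped = [e for e in elements if wrapped.fullmatch(e)]

# Keep only the elements that contain neither word.
kept = [e for e in elements if not wrapped.fullmatch(e)]

print(hits_bare)
print(hits_wrapped)
print(kept)
```

With the bare pattern the negated filter would keep every element, which is probably not what is intended here.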
I am working on a project where I am trying to plot the rainfall pattern of various states of my country. I fetch the data from my database with this command:
cur.execute('SELECT JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DECEMBER FROM Rainfall_In_Cm where STATE_UT = %s && DISTRICT = %s' ,(state , district))
The result comes back as a list with one element (the single row from the query output):
(Decimal('17.5'), Decimal('9.9'), Decimal('8.9'), Decimal('4.0'), Decimal('9.3'), Decimal('53.8'), Decimal('227.1'), Decimal('280.90'), Decimal('125.4'), Decimal('28.1'), Decimal('5.0'), Decimal('4.7'))
Now I want all the elements in the form of a list that I can use with matplotlib to plot a graph, and I want to remove the 'Decimal' string from in front of every value. How can I do it?
Your result is a tuple, not a list, but that is not a problem.
Casting should work for these Decimal objects. You can use a list comprehension:
from decimal import Decimal
# Data from your example
foo = (Decimal('17.5'), Decimal('9.9'), Decimal('8.9'), Decimal('4.0'), Decimal('9.3'), Decimal('53.8'), Decimal('227.1'), Decimal('280.90'), Decimal('125.4'), Decimal('28.1'), Decimal('5.0'), Decimal('4.7'))
# Convert each Decimal to a plain float
bar = [float(i) for i in foo]
If you would rather have integers (note that int() truncates the fractional part), use:
bar = [int(i) for i in foo]
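To make the difference between the two conversions concrete, here is a small self-contained sketch that converts the row from the question and pairs each value with its month. The names `row`, `months`, `values`, `truncated`, and `rainfall_by_month` are made up for illustration; the month labels are assumed from the column order in the SELECT statement.

```python
from decimal import Decimal

# The single row returned by the query in the question.
row = (Decimal('17.5'), Decimal('9.9'), Decimal('8.9'), Decimal('4.0'),
       Decimal('9.3'), Decimal('53.8'), Decimal('227.1'), Decimal('280.90'),
       Decimal('125.4'), Decimal('28.1'), Decimal('5.0'), Decimal('4.7'))

# Month labels, assumed to follow the column order of the SELECT.
months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# float() keeps the fractional part; int() drops it (truncation, not rounding).
values = [float(v) for v in row]
truncated = [int(v) for v in row]

# Pair each month with its rainfall value for easy lookup or plotting.
rainfall_by_month = dict(zip(months, values))

print(values)
print(truncated)
```

The `months` and `values` lists can then be handed to matplotlib directly, for example with `plt.bar(months, values)` after `import matplotlib.pyplot as plt`.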
Suppose I have two dataframes a and b,
a has one column called 'detail':
pure water
wood fire
mineral water
water
fire work
and b has one column called 'type':
water
fire
Many R functions need a literal pattern to match against, e.g. grep('fire',a), but my question is whether there is a way to match a using b. I tried a loop but failed. The following sqldf call gave a false result for every match:
ab <- sqldf("select *, case when detail in (select distinct type from b) then 1 else 0 end as match from a")
Ideally, one could use something like c <- grep(a$detail, b$type), but I am not sure whether that is allowed in R.
Thanks in advance!
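Create a type column in a and then merge on it. The sub() call below keeps only the last word of detail, so it assumes the type is always the last word; a row such as "fire work" yields "work" and will not match "fire":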
merge(transform(a, type = sub(".* ", "", a$detail)), b, all = TRUE)
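If the type can appear anywhere in detail rather than only as the last word, each detail has to be compared against every type. The following is a rough Python sketch of that idea, not R code; `match_type` is a hypothetical helper, and it assumes a match means the type appears as a whole word in detail.

```python
# Data from the question: column 'detail' of a and column 'type' of b.
details = ["pure water", "wood fire", "mineral water", "water", "fire work"]
types = ["water", "fire"]


def match_type(detail, types):
    """Return the first type that occurs as a whole word in detail, else None."""
    words = detail.split()
    for t in types:
        if t in words:
            return t
    return None


# One (detail, matched type) pair per row of a.
matched = [(d, match_type(d, types)) for d in details]

for detail, t in matched:
    print(detail, "->", t)
```

Unlike the last-word approach, this pairs "fire work" with "fire".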
I have my input string as
{(100),(200),(300)}
and I want to split them into different columns using regex_extract_all.
I have used REGEX_EXTRACT_ALL(data,'[^{(,)}]+') and I am getting an error saying:
Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT_ALL as multiple or none of them fit. Please use an explicit cast.
What am I missing?
If it's a bag of tuples, it isn't a chararray field, so you can't use REGEX_EXTRACT_ALL on it. If you want to split it into separate fields, use the BagToTuple function together with FLATTEN.
e.g.:
...
a = load '/tmp/1.txt';
b = group a by $0;
c = foreach b generate group, FLATTEN(BagToTuple(a.$1));
...
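Conceptually, BagToTuple followed by FLATTEN turns a bag of single-field tuples into one row with one column per value. The sketch below mimics that in Python for the input string from the question; it is an illustration only, not Pig code, and the names `raw`, `bag`, and `columns` are made up.

```python
import re

# The input from the question, treated here as plain text.
raw = "{(100),(200),(300)}"

# Model the bag as a list of single-field tuples: [(100,), (200,), (300,)].
bag = [(int(v),) for v in re.findall(r"\((\d+)\)", raw)]

# BagToTuple + FLATTEN, conceptually: one row with one column per value.
columns = tuple(value for record in bag for value in record)

print(columns)
```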