I have UDF output as follows.
Sample records:
({(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,5),(Todd,10),(Todd,20),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10)})
({(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,5),(Jon,10),(Jon,20),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,5),(Jon,20),(Jon,1)})
Schema for the UDF: name:chararray (a single column)
Now I want to read this bag of tuples and generate output as:
Todd,240
Jon,422
I stored the output of the UDF in a temp file and read it back using a different schema:
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
After that I am trying to use a foreach loop with dot-notation references to find the sum.
X = foreach D generate B.T.name,SUM(B.T.denom);
2017-03-04 13:52:59,507 ERROR org.apache.pig.tools.grunt.Grunt: ERROR
1128: Cannot find field T in name:chararray,denom:int Details at
logfile: /home/training/pig_1488648405070.log
Can you please let me know how to fix it? I am new to Apache Pig, so I am not sure how to traverse a bag of tuples and find the sum.
GROUP the dataset on name before performing the SUM.
To be able to GROUP by name, first FLATTEN the bag:
flattened = FOREACH D GENERATE FLATTEN(B);
dump flattened;
...
(Todd,10)
(Todd,10)
(Jon,1)
(Jon,1)
....
Then, GROUP them on name
grouped = GROUP flattened by name;
dump grouped;
(Jon,{(Jon,1),(Jon,20),(Jon,5),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,20),(Jon,10),(Jon,5),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1)})
(Todd,{(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,20),(Todd,10),(Todd,5),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1)})
And apply SUM() over the result
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
dump final_sum;
(Jon,106)
(Todd,100)
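Putting the steps together, a minimal end-to-end sketch (reusing the load path and schema from the question) would be:
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
flattened = FOREACH D GENERATE FLATTEN(B);   -- one row per (name, denom) tuple
grouped = GROUP flattened BY name;           -- one group per name
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
DUMP final_sum;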
In my case I have a data frame containing some biological data: a protein name, an EC number (there can be more than one), and protein domains (there can also be more than one). The data frame has a single column containing all of this data, which I would like to split into three columns. The problem is that when a line containing more than one EC number is split, the second EC number goes into the third column and the domains then disappear.
Here is my code:
val df = rdd.toDF()
val mydf = df.withColumn("_tmp", split($"value", ";")).select(
  $"_tmp".getItem(0).as("Entry"),
  $"_tmp".getItem(1).as("ECnumber"),
  $"_tmp".getItem(2).as("Domains")
)
And here is the result (shown as a screenshot in the original post): the second EC number spills into the Domains column and the domains are lost.
Based on the provided reference data, you can use the following regular expression to extract your data into independent columns:
val dataFrameValueRegex = "(\\w++);(([0-9.-]*+;)++)((\\w++;?)++)".r
For example, if the data frame value is the following:
val dataFrameValue = "A6MML6;2.1.-.-;2.1.3.16;IPR037431;IPR037432;IPR037433"
Now, using the regular expression, you can extract the independent values from the data frame value:
val dataFrameValueRegex(entry, ecNumbers, _, domains, _) = dataFrameValue
All values will be bound to the corresponding variables:
1.) entry: the entry string
2.) ecNumbers: the complete string of EC numbers separated by semicolons; note that there is a trailing semicolon at the end of the string
3.) domains: the complete string of domains separated by semicolons
Note: if for any reason the data frame value is not in the expected format, a MatchError exception will be thrown.
The code below just prints the variables:
println(s"Data value: Entry = [$entry], ECnumbers = [${ecNumbers.init}], Domains = [$domains]")
val ecNumber = ecNumbers.init.split(";")
ecNumber.foreach(e => println(s"ecNumber = [$e]"))
val domain = domains.split(";")
domain.foreach(d => println(s"Domain = [$d]"))
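If you would rather not have a MatchError thrown on malformed input, one option (a sketch, reusing the same regex as an extractor inside a match expression) is:
dataFrameValue match {
  case dataFrameValueRegex(entry, ecNumbers, _, domains, _) =>
    // Expected format: work with the captured groups as shown above.
    println(s"Entry = [$entry], ECnumbers = [${ecNumbers.init}], Domains = [$domains]")
  case other =>
    // Unexpected format: handle it here instead of failing with a MatchError.
    println(s"Unexpected data frame value: [$other]")
}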
I'm quite new to Python and I can't solve the following problem:
I have a list with tracking points for different animals (ID, date, time, lat, lon) given as strings:
aList = [[id,date,time,lat,lon],
[id2,date,time,lat,lon],
[...]]
The txt file is very big and each ID (a unique animal) occurs multiple times, e.g.:
aList = [['25','20-05-13','15:16:17','34.89932','24.09421'],
['24','20-05-13','15:16:18','35.89932','23.09421'],
['25','20-05-13','15:18:15','34.89932','24.13421'],
[...]]
What I'm trying to do is organize the IDs in a dictionary so that each unique ID is a key and all the dates, times, latitudes and longitudes are the values. Then I would like to write each individual ID to a new txt file, so that all the values for a specific ID end up in one txt file. The output should look like this:
{'25': [['20-05-13','15:16:17','34.89932','24.09421'],
        ['20-05-13','15:18:15','34.89932','24.13421'],
        [...]],
 '24': [['20-05-13','15:16:18','35.89932','23.09421'],
        [...]]
}
I have tried the following (and a lot of other solutions which didn't work):
items = {}
for line in aList:
    key, value = line[0], line[1:]
    items[key] = value
This results in each key holding only the last value in the list for that particular key:
{'25':['20-05-13','15:18:15','34.89932','24.13421'],
'24':['20-05-13','15:16:18','35.89932','23.09421']}
How can I loop through my list and assign the same IDs to the same key and all the corresponding values?
Is there any simple solution to this? Other "easier to implement" solutions are welcome!
I hope it makes sense :)
Try appending all the lists that share the same ID, as a list of lists:
aList = [['25','20-05-13','15:16:17','34.89932','24.09421'],
['24','20-05-13','15:16:18','35.89932','23.09421'],
['25','20-05-13','15:18:15','34.89932','24.13421'],
]
items = {}
for line in aList:
    key, value = line[0], line[1:]
    if key in items:
        items[key].append(value)
    else:
        items[key] = [value]
print items
OUTPUT:
{'24': [['20-05-13', '15:16:18', '35.89932', '23.09421']], '25': [['20-05-13', '15:16:17', '34.89932', '24.09421'], ['20-05-13', '15:18:15', '34.89932', '24.13421']]}
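An equivalent sketch using collections.defaultdict, which also writes one txt file per ID as the question asks (the animal_<id>.txt filename pattern is just an assumption):
from collections import defaultdict

items = defaultdict(list)
for line in aList:
    items[line[0]].append(line[1:])   # group every row under its ID

# Write each ID's rows to its own file, one comma-separated row per line.
for animal_id, rows in items.items():
    with open('animal_%s.txt' % animal_id, 'w') as f:
        for row in rows:
            f.write(','.join(row) + '\n')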
Given the following list:
colors = ['#c85200','#5f9ed1','lightgrey','#ffbc79','#006ba4','dimgray','#ff800e','#a2c8ec',
          'grey','salmon','cyan','silver']
And this list:
Hospitals=['a','b','c','d']
After I get the number of colors based on the length of the 'Hospitals' list:
num_hosp=len(Hospitals)
colrs=colors[:num_hosp]
colrs
['#c85200', '#5f9ed1', 'lightgrey', '#ffbc79']
...and zip the lists together:
hcolrs=zip(Hospitals,colrs)
Next, I'd like to be able to select 1 or more colors from hcolrs if given a list of one or more hospitals from 'Hospitals'.
Like this:
newHosps=['a','c'] #input
newColrs=['#c85200','lightgrey'] #output
Thanks in advance!
Pass the result of zip to the dict constructor to make lookup simple/fast:
# Don't need to slice colors; zip stops when shortest iterable exhausted
hosp_to_color = dict(zip(Hospitals, colors))
then use it:
newHosps = ['a','c']
newColrs = [hosp_to_color[h] for h in newHosps]
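If a requested hospital might not be in the mapping, a small variation (the 'grey' fallback is just an example) avoids a KeyError:
newColrs = [hosp_to_color.get(h, 'grey') for h in newHosps]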
I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to use REGEX_EXTRACT_ALL to turn it into elements in a tuple, separated by ",". Then I need to filter out the ones that don't contain structure and location.
However, I got an error saying it can't filter on a regex type. Any ideas?
By the way, the end goal is to parse out the longest hierarchy, like (topic|news|politics|elections|primary).
Updated script:
data = load '/web/visit_log/20160303'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#section as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
The syntax of the filter ... matches expression seems incorrect. The data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');
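Also note that Pig's matches does whole-string regex matching, so if the intent is to drop values that merely contain those words, a pattern along these lines (a sketch) may be needed:
c = filter b by not (extr matches '.*(structure|location).*');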
I have a Pig script which returns output as tuples as shown below:
(1,2,3)
(a,b,c)
I am storing this output into a file and it gets stored with the parentheses ( and ) as above. I would like to store the records in the file as below:
1,2,3
a,b,c
How can I get rid of the parentheses before using 'STORE INTO' in Pig?
DUMP displays the records as tuples; that is why you are seeing 1,2,3 enclosed in parentheses ().
Ref : http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#pl_dump
Using STORE on the alias will save the values alone.
Ref : http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#pl_store
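For example (a sketch; the alias and output path here are placeholders), storing with an explicit comma delimiter writes the plain comma-separated values without parentheses:
STORE X INTO '/some/output/path' USING PigStorage(',');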