In my case, I have a data frame containing some biological data: a protein name, an EC number (there can be more than one), and protein domains (there can also be more than one domain). The data frame has a single column containing all of this data, which I would like to split into three columns. The problem is that if a line containing more than one EC number is split, the second EC number ends up in the third column and the domains are then lost.
Here is my code:
val df = rdd.toDF()
val mydf = df.withColumn("_tmp", split($"value", ";")).select(
  $"_tmp".getItem(0).as("Entry"),
  $"_tmp".getItem(1).as("ECnumber"),
  $"_tmp".getItem(2).as("Domains")
)
And here is the result:
[screenshot of the resulting DataFrame]
Based on the provided reference data, you can use the following regular expression to extract your data into independent columns:
val dataFrameValueRegex = "(\\w++);(([0-9.-]*+;)++)((\\w++;?)++)".r
For example, if the data frame value is the following:
val dataFrameValue = "A6MML6;2.1.-.-;2.1.3.16;IPR037431;IPR037432;IPR037433"
Now, using the regular expression, you can extract the independent values from the data frame value:
val dataFrameValueRegex(entry, ecNumbers, _, domains, _) = dataFrameValue
All values will be retrieved into the corresponding variables:
1.) entry: the Entry string.
2.) ecNumbers: the complete string of EC numbers separated by semicolons. Note that there is a trailing semicolon at the end of this string.
3.) domains: the complete string of domains separated by semicolons.
Note: If for any reason the data frame value is not in the expected format, a MatchError exception will be thrown.
The code below just prints the extracted values:
println(s"Data value: Entry = [$entry], ECnumbers = [${ecNumbers.init}], Domains = [$domains]")
// .init drops the trailing semicolon before splitting
val ecNumber = ecNumbers.init.split(";")
ecNumber.foreach(e => println(s"ecNumber = [$e]"))
val domain = domains.split(";")
domain.foreach(d => println(s"Domain = [$d]"))
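To connect this back to the data frame from the question, here is a minimal sketch of how the same regex could be applied row by row through a Spark UDF so that the single-column DataFrame ends up with the three target columns. This is my own illustration rather than part of the original answer; the SparkSession setup, the column names, and the null handling for non-matching rows are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("split-columns").getOrCreate() // already available in spark-shell
import spark.implicits._

val dataFrameValueRegex = "(\\w++);(([0-9.-]*+;)++)((\\w++;?)++)".r

// Wrap the pattern match in a UDF; rows that do not match the expected
// format yield nulls instead of throwing a MatchError.
val splitValue = udf { value: String =>
  value match {
    case dataFrameValueRegex(entry, ecNumbers, _, domains, _) =>
      (entry, ecNumbers.init, domains) // .init drops the trailing semicolon
    case _ => (null, null, null)
  }
}

val df = Seq("A6MML6;2.1.-.-;2.1.3.16;IPR037431;IPR037432;IPR037433").toDF("value")

val mydf = df
  .withColumn("_tmp", splitValue($"value"))
  .select(
    $"_tmp".getField("_1").as("Entry"),
    $"_tmp".getField("_2").as("ECnumber"),
    $"_tmp".getField("_3").as("Domains"))

mydf.show(false)

With this approach, all EC numbers stay together in the ECnumber column and all domains stay in the Domains column, however many of each a row contains.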
I have UDF output as follows.
Sample records:
({(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,5),(Todd,10),(Todd,20),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10)})
({(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,5),(Jon,10),(Jon,20),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,5),(Jon,20),(Jon,1)})
Schema for the UDF: name:chararray (a single column)
Now I want to read this bag of tuples and generate output as:
Todd,240
Jon,422
I stored the output of the UDF in a temp file and read it back using a different schema:
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
After that I am trying to use a FOREACH loop and dot-notation references to find the sum.
X = foreach D generate B.T.name,SUM(B.T.denom);
2017-03-04 13:52:59,507 ERROR org.apache.pig.tools.grunt.Grunt: ERROR
1128: Cannot find field T in name:chararray,denom:int Details at
logfile: /home/training/pig_1488648405070.log
Can you please let me know how to fix this? I am new to Apache Pig, so I am not sure how to traverse a bag of tuples and compute the sum.
GROUP the dataset on name before performing SUM. First, FLATTEN the bag so that it can be grouped.
flattened = FOREACH D GENERATE FLATTEN(B);
dump flattened;
...
(Todd,10)
(Todd,10)
(Jon,1)
(Jon,1)
....
Then, GROUP them on name
grouped = GROUP flattened by name;
dump grouped;
(Jon,{(Jon,1),(Jon,20),(Jon,5),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,20),(Jon,10),(Jon,5),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1)})
(Todd,{(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,20),(Todd,10),(Todd,5),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1)})
And apply SUM() over the result
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
dump final_sum;
(Jon,106)
(Todd,100)
I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to use REGEX_EXTRACT_ALL to turn it into elements in a tuple, separated by ",". Then I need to filter out the ones that don't contain structure and location.
However, I got an error saying it can't filter on a regex type. Any ideas?
By the way, the end goal is to parse out the longest hierarchy, like (topic|news|politics|elections|primary).
Updated script:
data = load '/web/visit_log/20160303'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#section as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
The syntax for the filter's matches clause seems incorrect. The data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');
I am new to Pig and am trying to parse a JSON file with the following structure:
{"id1":197,"id2":[
{"id3":"109.11.11.0","id4":"","id5":1391233948301},
{"id3":"10.10.15.81","id4":"","id5":1313393100648},
...
]}
The above file is called jsonfile.txt
alias = load 'jsonfile.txt' using JsonLoader('id1:int,id2:[id3:chararray,id4:chararray,id5:chararray]');
This is the error I get.
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'id3' expecting RIGHT_BRACKET
Do you know what I could be doing wrong?
Your JSON schema is not correctly formatted.
The formats for complex data types are shown here:
Tuple: enclosed by (), items separated by ","
Non-empty tuple: (item1,item2,item3)
Empty tuple is valid: ()
Bag: enclosed by {}, tuples separated by ","
Non-empty bag: {(tuple1),(tuple2),(tuple3)}
Empty bag is valid: {}
Map: enclosed by [], items separated by ",", key and value separated by "#"
Non-empty map: [key1#value1,key2#value2]
Empty map is valid: []
Source: http://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore
In other words, [] aren't arrays; they're associative tables (maps), where the "#" character separates keys from values. Try using tuples (parentheses) instead.
'id1:int,id2:(id3:chararray,id4:chararray,id5:chararray)'
OR
'id1:int,id2:{(id3:chararray,id4:chararray,id5:chararray)}'
I couldn't test it and have never tried Pig, but according to the documentation, it should work just fine.
(based on the following example)
a = load 'a.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');
I have my input string as:
{(100),(200),(300)}
and I want to split it into different columns using REGEX_EXTRACT_ALL.
I have used REGEX_EXTRACT_ALL(data,'[^{(,)}]+') and I am getting an error saying:
Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT_ALL as multiple or none of them fit. Please use an explicit cast.
What am I missing?
If it's a tuple of bags, you can't use it as a chararray field, so you can't use REGEX_EXTRACT_ALL on it. If you want to split it into multiple tuples, use the BagToTuple function together with FLATTEN.
e.g.:
...
a = load '/tmp/1.txt';
b = group a by $0;
c = foreach b generate group, FLATTEN(BagToTuple(a.$1));
...