In my case, I have a data frame containing some biological data: a protein name, an EC number (there can be more than one), and protein domains (there can also be more than one domain). The data frame has a single column containing all of this data, which I would like to split into three columns. The problem is that if a line containing more than one EC number is split, the second EC number ends up in the third column and the domains are then lost.
Here is my code:
val df = rdd.toDF()
val mydf = df.withColumn("_tmp", split($"value", ";")).select(
  $"_tmp".getItem(0).as("Entry"),
  $"_tmp".getItem(1).as("ECnumber"),
  $"_tmp".getItem(2).as("Domains")
)
And here is the result:
[screenshot of the resulting DataFrame]
Based on the provided reference data, you can use the following regular expression to extract your data into independent columns:
val dataFrameValueRegex = "(\\w++);(([0-9.-]*+;)++)((\\w++;?)++)".r
For example, if the data frame value is the following:
val dataFrameValue = "A6MML6;2.1.-.-;2.1.3.16;IPR037431;IPR037432;IPR037433"
Now, using the regular expression, you can extract the independent values from the data frame value:
val dataFrameValueRegex(entry, ecNumbers, _, domains, _) = dataFrameValue
All values will be retrieved into the corresponding variables:
1.) entry: the Entry string.
2.) ecNumbers: the complete string of EC numbers separated by semicolons. Note that there is a trailing semicolon at the end of this string.
3.) domains: the complete string of domains separated by semicolons.
Note: If for any reason the data frame value is not in the expected format, a MatchError exception will be thrown.
The code below just prints the extracted values:
println(s"Data value: Entry = [$entry], ECnumbers = [${ecNumbers.init}], Domains = [$domains]")
// .init drops the trailing semicolon before splitting
val ecNumber = ecNumbers.init.split(";")
ecNumber.foreach(e => println(s"ecNumber = [$e]"))
val domain = domains.split(";")
domain.foreach(d => println(s"Domain = [$d]"))
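To connect this back to the data frame from the question, here is a minimal sketch of how the same regex could be applied row by row through a Spark UDF so that the single-column DataFrame ends up with the three target columns. This is my own illustration rather than part of the original answer; the SparkSession setup, the column names, and the null handling for non-matching rows are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("split-columns").getOrCreate() // already available in spark-shell
import spark.implicits._

val dataFrameValueRegex = "(\\w++);(([0-9.-]*+;)++)((\\w++;?)++)".r

// Wrap the pattern match in a UDF; rows that do not match the expected
// format yield nulls instead of throwing a MatchError.
val splitValue = udf { value: String =>
  value match {
    case dataFrameValueRegex(entry, ecNumbers, _, domains, _) =>
      (entry, ecNumbers.init, domains) // .init drops the trailing semicolon
    case _ => (null, null, null)
  }
}

val df = Seq("A6MML6;2.1.-.-;2.1.3.16;IPR037431;IPR037432;IPR037433").toDF("value")

val mydf = df
  .withColumn("_tmp", splitValue($"value"))
  .select(
    $"_tmp".getField("_1").as("Entry"),
    $"_tmp".getField("_2").as("ECnumber"),
    $"_tmp".getField("_3").as("Domains"))

mydf.show(false)

With this approach, all EC numbers stay together in the ECnumber column and all domains stay in the Domains column, however many of each a row contains.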
I have UDF output as follows.
Sample records:
({(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,5),(Todd,10),(Todd,20),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10)})
({(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,5),(Jon,10),(Jon,20),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,5),(Jon,20),(Jon,1)})
Schema for the UDF: name:chararray (a single column)
Now I want to read this bag of tuples and generate output as:
Todd,240
Jon,422
I stored the output of the UDF in a temp file and read it back using a different schema:
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
After that I am trying to use a FOREACH loop and dot-notation references to find the sum.
X = foreach D generate B.T.name,SUM(B.T.denom);
2017-03-04 13:52:59,507 ERROR org.apache.pig.tools.grunt.Grunt: ERROR
1128: Cannot find field T in name:chararray,denom:int Details at
logfile: /home/training/pig_1488648405070.log
Can you please let me know how to fix this? I am new to Apache Pig, so I am not sure how to traverse a bag of tuples and compute the sum.
GROUP the dataset on name before performing SUM. First, FLATTEN the bag so that it can be grouped.
flattened = FOREACH D GENERATE FLATTEN(B);
dump flattened;
...
(Todd,10)
(Todd,10)
(Jon,1)
(Jon,1)
....
Then, GROUP them on name
grouped = GROUP flattened by name;
dump grouped;
(Jon,{(Jon,1),(Jon,20),(Jon,5),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,20),(Jon,10),(Jon,5),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1)})
(Todd,{(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,20),(Todd,10),(Todd,5),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1)})
And apply SUM() over the result
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
dump final_sum;
(Jon,106)
(Todd,100)
I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to use REGEX_EXTRACT_ALL to turn it into elements in a tuple, separated by ",". Then I need to filter out the ones that don't contain structure and location.
However, I got an error saying it can't filter on a regex type. Any ideas?
By the way, the end goal is to parse out the longest hierarchy, like (topic|news|politics|elections|primary).
Updated script:
data = load '/web/visit_log/20160303'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#section as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
The syntax for the filter's matches clause seems incorrect. The data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');
I am new to Pig and am trying to parse a JSON file with the following structure:
{"id1":197,"id2":[
{"id3":"109.11.11.0","id4":"","id5":1391233948301},
{"id3":"10.10.15.81","id4":"","id5":1313393100648},
...
]}
The above file is called jsonfile.txt
alias = load 'jsonfile.txt' using JsonLoader('id1:int,id2:[id3:chararray,id4:chararray,id5:chararray]');
This is the error I get.
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'id3' expecting RIGHT_BRACKET
Do you know what I could be doing wrong?
Your JSON schema is not correctly formatted.
The formats for complex data types are shown here:
Tuple: enclosed by (), items separated by ","
Non-empty tuple: (item1,item2,item3)
Empty tuple is valid: ()
Bag: enclosed by {}, tuples separated by ","
Non-empty bag: {(tuple1),(tuple2),(tuple3)}
Empty bag is valid: {}
Map: enclosed by [], items separated by ",", key and value separated by "#"
Non-empty map: [key1#value1,key2#value2]
Empty map is valid: []
Source: http://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore
In other words, [] aren't arrays; they're associative tables (maps), where the "#" character separates keys from values. Try using tuples (parentheses) instead.
'id1:int,id2:(id3:chararray,id4:chararray,id5:chararray)'
OR
'id1:int,id2:{(id3:chararray,id4:chararray,id5:chararray)}'
I couldn't test it and have never tried Pig, but according to the documentation, it should work just fine.
(based on the following example)
a = load 'a.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');
I have my input string as:
{(100),(200),(300)}
and I want to split it into different columns using REGEX_EXTRACT_ALL.
I have used REGEX_EXTRACT_ALL(data,'[^{(,)}]+') and I am getting an error saying:
Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT_ALL as multiple or none of them fit. Please use an explicit cast.
What am I missing?
If it's a tuple of bags, you can't use it as a chararray field, so you can't use REGEX_EXTRACT_ALL on it. If you want to split it into multiple tuples, use the BagToTuple function together with FLATTEN.
e.g.:
...
a = load '/tmp/1.txt';
b = group a by $0;
c = foreach b generate group, FLATTEN(BagToTuple(a.$1));
...