Using Regex in Pig in hadoop - regex

I have a CSV file containing user (tweetid, tweets, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression work here in splitting the output in desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL as I find that much more simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
the above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired o/p: B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.

Can't comment, but from looking at this and testing it out, it looks like your quotes in the regex are different from those in the csv.
" in the csv
” in the regex code.
To get the tweetid try this:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'.*(,")',1)) AS (tweetid:long);

Related

Pandas: Grouping rows by list in CSV file?

In an effort to make our budgeting life a bit easier and help myself learn; I am creating a small program in python that takes data from our exported bank csv.
I will give you an example of what I want to do with this data. Say I want to group all of my fast food expenses together. There are many different names with different totals in the description column but I want to see it all tabulated as one "Fast Food " expense.
For instance the Csv is setup like this:
Date Description Debit Credit
1/20/20 POS PIN BLAH BLAH ### 1.75 NaN
I figured out how to group them with an or statement:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
I ultimately would like to have it read off of a list? I would like to group all my expenses into categories and check those category variable names so that it would only output from that list.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also I have looked through quite a few posts here on stack and have yet to find the answer (although I am sure I overlooked it)
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
df = pd.DataFrame({"description":['Macdonald something', 'Whataburger something', 'pizza hut something',
'Whataburger something','Macdonald something','Macdonald otherthing',],
"debit":[1.75,2.0,3.5,4.5,1.5,2.0]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})',flags=re.I)
print (df.groupby("found").sum())
#
debit
found
Macdonald 5.25
Whataburger 6.50
pizza hut 3.50
Use dynamic pattern building:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags = re.I, regex = True)]
The \b word boundaries find whole words, not partial words.
The re.escape will protect special characters and they will be parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex

Regex (re2 googlesheets) multiple values in multiline cell

Getting stuck on how to read and pretty up these values from a multiline cell via arrayformula.
Im using regex as preceding line can vary.
just formulas please, no custom code
The first column looks like a set of these:
```
[config]
name = the_name
texture = blah.dds
cost = 1000
[effect0]
value = 1000
type = ATTR_A
[effect1]
value = 8
type = ATTR_B
[feature0]
name = feature_blah
[components]
0 = comp_one,1
[resources]
res_one = 1
res_five = 1
res_four = 1
<br/>
Where to be useful elsewhere, at minimum it needs each [tag] set ([effect\d], [feature\d], ect) to be in one column each, for example the 'effects' column would look like:
ATTR_A:1000,ATTR_B:8
and so on.
Desired output can also be seen in the included spreadsheet
<br/>
<b>Here is the example spreadsheet:</b>
https://docs.google.com/spreadsheets/d/1arMaaT56S_STTvRr2OxCINTyF-VvZ95Pm3mljju8Cxw/edit?usp=sharing
**Current REGEXREPLACE**
Kinda works, finds each 'type' and 'value' great, just cant figure out how to extract just that from the rest, tried capture (and non-capturing) groups before and after but didnt work
=ARRAYFORMULA(REGEXREPLACE($A3:$A,"[\n.][effect\d][\n.](.)\n(.)","1:$1 2:$2"))
**Current SUBSTITUTE + REGEXEXTRACT + REGEXREPLACE**
A different approach entirely, also kinda works, longer form though and left with having to parse the values out of that string, where got stuck again. Idea was to use this to simplify, then regexreplace like above. Getting stuck removing content around the final matches though, and if can do that then above approach is fine too.
// First ran a substitute
=ARRAYFORMULA(SUBSTITUTE(SUBSTITUTE($A3:$A,char(10),";"),";;",char(10)))
// Then variation of this (gave up on single line 'effect/d' so broke it up to try and get it working)
=ARRAYFORMULA(IF(A3:A<>"",IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect0]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect1]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect2]);(.)$")&";;"),""))
// Then use regexreplace like above
=ARRAYFORMULA(REGEXREPLACE($B3:$B,"value = (.);type = (.);;","1:$1 2:$2"))
**--EDIT--**
Also, as my updated 'Desired Output' sheet shows (see timestamped comment below), bonus kudos if you can also extract just the values of matching 'type's to those extra columns (see spreadsheet).
All good if you cant though, just realized would need that too for lookups.
**--END OF EDIT--**
<br/>
Ive tried dozens of things, discarding each in turn, had a quick look in version history to grab out two promising attempts and shared them in separate sheets.
One of these also used SUBSTITUTE to simplify input column, im happy for a solution using either RAW or the SUBSTITUTE results.
<br/>
**Potentially Useful links:**
https://github.com/google/re2/wiki/Syntax
<br/>
<b>Just some more words:</b>
I also have looked at dozens of stackoverflow and google support pages, so tried both REGEXEXTRACT and REGEXREPLACE, both promising but missing that final tweak. And i tried dozens of tweaks already on both.
Any help would be great, and hopefully help others in future since examples with spreadsheets are great since every new REGEX seems to be a new adventure ;)
<br/>
P.S. if we can think of better title for OP, please say in comment or your answer :)
paste in B3:
=ARRAYFORMULA(SUBSTITUTE(TRIM(TRANSPOSE(QUERY(TRANSPOSE(
IF(C3:E<>"", C2:E2&":"&C3:E, )),,999^99))), " ", ", "))
paste in C3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&C2)))
paste in D3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&D2)))
paste in E3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&E2)))
paste in F3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[feature\d+\]\nname = (.*)")))
paste in G3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[components\]\n\d+ = (.*)")))
paste in H3:
=ARRAYFORMULA(IFNA(REGEXREPLACE(INDEX(SPLIT(REGEXEXTRACT(
REGEXREPLACE(A3:A, "\n", ", "), "\[resources\], (.*)"), "["),,1), ", , $", )))
spreadsheet demo
This was a fun exercise. :-)
Caveat first: I have added some "input data". Examples:
[feature1]
name = feature_active_spoiler2
[components]
0 = spoiler,1
1 = spoilerA, 2
So the output has "extra" output.
See the tab ADW's Solution.

regex to return all values not just first found one

I'm learning Pig Latin and am using regular expressions. Not sure if the regex is language agnostic or not but here is what I'm trying to do.
If I have a table with two fields: tweet id and tweet, I'd like to go through each tweet and pull out all mentions up to 3.
So if a tweet goes something like "#tim bla #sam #joe something bla bla" then the line item for that tweet will have tweet id, tim, sam, joe.
The raw data has twitter ids not the actual handles so this regex seems to return a mention (.*)#user_(\\S{8})([:| ])(.*)
Here is what I have tried:
a = load 'data.txt' AS (id:chararray, tweet:chararray);
b = foreach a generate id, LOWER(tweet) as tweet;
// filter data so only tweets with mentions
c = FILTER b BY tweet MATCHES '(.*)#user_(\\S{8})([:| ])(.*)';
// try to pull out the mentions.
d = foreach c generate id,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1}',3) as mention1,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1,2}',3) as mention2,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){2,3}',3) as mention3;
e = limit d 20;
dump e;
So in that try I was playing with quantifiers, trying to return the first, second and 3rd instance of a match in a tweet {1}, {1,2}, {2,3}.
That did not work, mention 1-3 are just empty.
So I tried changing d:
d = foreach c generate id,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)',2) as mention1,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',5) as mention2,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',8) as mention3,
But, instead of returning each user mentioned, this returned the same mention 3 times. I had expected that by cutting n pasting the expression again I'd get the second match, and pasting it a 3rd time would get the 3rd match.
I'm not sure how well I've managed to word this question but to put it another way, imagine that the function regex_extract() returned an array of matched terms. I would like to get mention[0], mention[1], mention[2] on a single line item.
Whenever you use PATTERN_EXTRACT or PATTERN_EXTRACT_ALL udf, keep in mind that it is just pure regex handled by Java.
It is easier to test the regex through a local Java test. Here is the regex I found to be acceptable :
Pattern p = Pattern.compile("#(\\S+).*?(?:#(\\S+)(?:.*?#(\\S+))?)?");
String input = "So if a tweet goes something like #tim bla #sam #joe #bill something bla bla";
Matcher m = p.matcher(input);
if(m.find()){
for(int i=0; i<=m.groupCount(); i++){
System.out.println(i + " -> " + m.group(i));
}
}
With this regex, if there is at least a mention, it will returns three fields, the seconds and/or third being null if a second/third mention is not found.
Therefore, you may use the following PIG code :
d = foreach c generate id, REGEX_EXTRACT_ALL(
tweet, '#(\\S+).*?(?:#(\\S+)(?:.*?#(\\S+))?)?');
You do not even need to filter the data first.

Pig capture matching string with regex

I am trying to capture image url's from inside tweets.
REGISTER 'hdfs:///user/cloudera/elephant-bird-pig-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-core-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-hadoop-compat-4.1.jar';
--Load Json
loadJson = LOAD '/user/cloudera/tweetwall' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map []);
B = FOREACH loadJson GENERATE flatten(json#'tweets') as (m:map[]);
tweetText = FOREACH B GENERATE FLATTEN(m#'text') as (str:chararray);
intermediate date looks like this:
(#somenameontwitter your nan makes me laugh with some of the things she comes out with like http://somepics.com/my.jpg)
then I try to do the following to get only the image url back :
x = foreach tweetText generate REGEX_EXTRACT_ALL(str, '((http)(.*)(.jpg|.bmp|.png))');
dump x;
but that doesn't seem to work. I have also been trying with filter to no avail.
Even when trying the above with .* it returns empty results () or (())
I'm not good with regex and pretty new to Pig so it could be that I'm missing something simple here that I'm just not seeing.
update
example input data
{"tweets":[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"#Beace_ your nan makes me laugh with some of the things she comes out with blabla http://t.co/b7hjMWNg is an url, but not a valid one http://www.something.com/this.jpg should be a valid url","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}]}
Try this and let me know if this works
x = foreach tweetText generate REGEX_EXTRACT(str,'.*(http://.*.[jpg|bmp|png])',1);
DUMP x;
I managed to get it working (though I doubt it is totally optimal)
x = foreach tweetText generate REGEX_EXTRACT(str,'(http://.*(.jpg|.bmp|.png))',1) as image;
filtered = FILTER x BY $0 is not null;
dump filtered;
so the initial problem was just the regex (and my lack of knowledge on the subject).
Thanks for the assistance sivasakthi jayaraman!

Stata: Efficient way to replace numerical values with string values

I have code that currently looks like this:
replace fname = "JACK" if id==103
replace lname = "MARTIN" if id==103
replace fname = "MICHAEL" if id==104
replace lname = "JOHNSON" if id==104
And it goes on for multiple pages like this, replacing an ID name with a first and last name string. I was wondering if there is a more efficient way to do this en masse, perhaps by using the recode command?
I will echo the other answers that suggest a merge is the best way to do this.
But if you absolutely must code the lines item-wise (again, messy) you can generate a long list ("pages") of replace commands by using MS Excel to "help" you write the code. Here is a picture of your Excel sheet with one example, showing the MS Excel formula:
columns:
A B C D
row: 1 last first id code
2 MARTIN JACK 103 ="replace fname=^"&B2&"^ if id=="&C2
You type that in, make sure it looks like Stata code when the formula calculates (aside from the carets), and copy the formula in column D down to the end of your list. Then copy the whole block of Stata code in column D generated by the formulas into your do-file, and do a find and replace (be careful here if you are using the caret elsewhere for mathematical uses!!) for all ^ to be replaced with ", which will end up generating proper Stata syntax.
(This is truly a brute force way of doing this, and is less dynamic in the case that there are subsequent changes to your generation list. All--apologies in advance for answering a question here advocating use of Excel :) )
You don't explain where the strings you want to add come from, but what is generally the best technique is explained at
http://www.stata.com/support/faqs/data-management/group-characteristics-for-subsets/index.html
Create an associative array of ids vs Fname,Lname
103 => JACK,MARTIN
104 => MICHAEL,JOHNSON
...
Replace
id => hash{id} ( fname & lname )
The efficiency of doing this will be taken care by the programming language used