regex to return all values not just first found one - regex

I'm learning Pig Latin and am using regular expressions. Not sure if the regex is language agnostic or not but here is what I'm trying to do.
If I have a table with two fields: tweet id and tweet, I'd like to go through each tweet and pull out all mentions up to 3.
So if a tweet goes something like "#tim bla #sam #joe something bla bla" then the line item for that tweet will have tweet id, tim, sam, joe.
The raw data has twitter ids not the actual handles so this regex seems to return a mention (.*)#user_(\\S{8})([:| ])(.*)
Here is what I have tried:
a = load 'data.txt' AS (id:chararray, tweet:chararray);
b = foreach a generate id, LOWER(tweet) as tweet;
// filter data so only tweets with mentions
c = FILTER b BY tweet MATCHES '(.*)#user_(\\S{8})([:| ])(.*)';
// try to pull out the mentions.
d = foreach c generate id,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1}',3) as mention1,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1,2}',3) as mention2,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){2,3}',3) as mention3;
e = limit d 20;
dump e;
So in that try I was playing with quantifiers, trying to return the first, second and 3rd instance of a match in a tweet {1}, {1,2}, {2,3}.
That did not work, mention 1-3 are just empty.
So I tried changing d:
d = foreach c generate id,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)',2) as mention1,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',5) as mention2,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',8) as mention3,
But, instead of returning each user mentioned, this returned the same mention 3 times. I had expected that by cutting n pasting the expression again I'd get the second match, and pasting it a 3rd time would get the 3rd match.
I'm not sure how well I've managed to word this question but to put it another way, imagine that the function regex_extract() returned an array of matched terms. I would like to get mention[0], mention[1], mention[2] on a single line item.

Whenever you use PATTERN_EXTRACT or PATTERN_EXTRACT_ALL udf, keep in mind that it is just pure regex handled by Java.
It is easier to test the regex through a local Java test. Here is the regex I found to be acceptable :
Pattern p = Pattern.compile("#(\\S+).*?(?:#(\\S+)(?:.*?#(\\S+))?)?");
String input = "So if a tweet goes something like #tim bla #sam #joe #bill something bla bla";
Matcher m = p.matcher(input);
if(m.find()){
for(int i=0; i<=m.groupCount(); i++){
System.out.println(i + " -> " + m.group(i));
}
}
With this regex, if there is at least a mention, it will returns three fields, the seconds and/or third being null if a second/third mention is not found.
Therefore, you may use the following PIG code :
d = foreach c generate id, REGEX_EXTRACT_ALL(
tweet, '#(\\S+).*?(?:#(\\S+)(?:.*?#(\\S+))?)?');
You do not even need to filter the data first.

Related

Scoring multiple TRUES in Pythton RE Search

Background
I have a list of "bad words" in a file called bad_words.conf, which reads as follows
(I've changed it so that it's clean for the sake of this post but in real-life they are expletives);
wrote (some )?rubbish
swore
I have a user input field which is cleaned and striped of dangerous characters before being passed as data to the following script, score.py
(for the sake of this example I've just typed in the value for data)
import re
data = 'I wrote some rubbish and swore too'
# Get list of bad words
bad_words = open("bad_words.conf", 'r')
lines = bad_words.read().split('\n')
combine = "(" + ")|(".join(lines) + ")"
#set score incase no results
score = 0
#search for bad words
if re.search(combine, data):
#add one for a hit
score += 1
#show me the score
print(str(score))
bad_words.close()
Now this finds a result and adds a score of 1, as expected, without a loop.
Question
I need to adapt this script so that I can add 1 to the score every time a line of "bad_words.conf" is found within text.
So in the instance above, data = 'I wrote some rubbish and swore too' I would like to actually score a total of 2.
1 for "wrote some rubbish" and +1 for "swore".
Thanks for the help!
Changing combine to just:
combine = "|".join(lines)
And using re.findall():
In [33]: re.findall(combine,data)
Out[33]: ['rubbish', 'swore']
The problem with having the multiple capturing groups as you originally were doing is that re.findall() will return each additional one of those as an empty string when one of the words is matched.

PIG regex extract then filter the unnamed regex tuple

I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to regex_extract_all to turn it into elements in a tuple and sepereated by ",". Then I need to filter out the ones don't contain structure and location.
However, I got an error that can't filter regex type. Any idea?
By the way, the ending goal is to parse out the longest hierarchy like (topic|news|politics|elections|primary)
update the script:
data = load load '/web/visit_log/20160303'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#section as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into /user/tad/sec_hir
The syntax for filter matches seems incorrect.The data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');

Using Regex in Pig in hadoop

I have a CSV file containing user (tweetid, tweets, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression work here in splitting the output in desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL as I find that much more simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
the above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired o/p: B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.
Can't comment, but from looking at this and testing it out, it looks like your quotes in the regex are different from those in the csv.
" in the csv
” in the regex code.
To get the tweetid try this:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'.*(,")',1)) AS (tweetid:long);

Pig capture matching string with regex

I am trying to capture image url's from inside tweets.
REGISTER 'hdfs:///user/cloudera/elephant-bird-pig-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-core-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-hadoop-compat-4.1.jar';
--Load Json
loadJson = LOAD '/user/cloudera/tweetwall' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map []);
B = FOREACH loadJson GENERATE flatten(json#'tweets') as (m:map[]);
tweetText = FOREACH B GENERATE FLATTEN(m#'text') as (str:chararray);
intermediate date looks like this:
(#somenameontwitter your nan makes me laugh with some of the things she comes out with like http://somepics.com/my.jpg)
then I try to do the following to get only the image url back :
x = foreach tweetText generate REGEX_EXTRACT_ALL(str, '((http)(.*)(.jpg|.bmp|.png))');
dump x;
but that doesn't seem to work. I have also been trying with filter to no avail.
Even when trying the above with .* it returns empty results () or (())
I'm not good with regex and pretty new to Pig so it could be that I'm missing something simple here that I'm just not seeing.
update
example input data
{"tweets":[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"#Beace_ your nan makes me laugh with some of the things she comes out with blabla http://t.co/b7hjMWNg is an url, but not a valid one http://www.something.com/this.jpg should be a valid url","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}]}
Try this and let me know if this works
x = foreach tweetText generate REGEX_EXTRACT(str,'.*(http://.*.[jpg|bmp|png])',1);
DUMP x;
I managed to get it working (though I doubt it is totally optimal)
x = foreach tweetText generate REGEX_EXTRACT(str,'(http://.*(.jpg|.bmp|.png))',1) as image;
filtered = FILTER x BY $0 is not null;
dump filtered;
so the initial problem was just the regex (and my lack of knowledge on the subject).
Thanks for the assistance sivasakthi jayaraman!

pig - parsing string with regex

I'm stuck on string parsing in Pig.
I have looked at the documentation around regex_extract and regex_extract_all and hoped to use one of those functions.
I have file '/logs/test.log':
cat '/logs/test.log'
user=242562&friend=6226&friend=93856&age=35&friend=35900
I want to extract the friend tags from the url, and in this case, I have 3 identical tags. regex_extract seems to only work for the first instance, which is what I expected, and for regex_extract_all, it seems like I have know the whole string pattern, which changes on each row of the source file.
It looked ok with regex_extract, but this option only gives me the first one.
[root#test]# pig -x local
A = LOAD './test.log';
B = FOREACH A GENERATE REGEX_EXTRACT($0, 'friend=([0-9]*)',1);
dump B;
(6226)
The examples I see for regex_extract_all show regex where you seek out all the tags:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL($0, 'user=([0-9]+?)&friend=([0-9]+?)&friend=([0-9]+?)&.+?'));
dump B;
(242562,6226,93856)
That seems to work, but I really just want to extract the friends - (6226,93856,35900). I also have cases where there might be more-than or less-than 3 friends per user.
Any ideas?
Also looking at using something like FLATTEN(TOKENIZE($0,'&')) and then somehow only filtering on the SUBSTRING($0,0,INDEXOF($0,'=')) == 'friend' or something like that, but wanted to see if anyone knew a good regex approach.
This can be achieved by simple string manipulations:
inputs = LOAD 'input' AS (line: chararray);
tokenized = FOREACH inputs GENERATE FLATTEN(TOKENIZE(line, '&')) AS parameter;
filtered = FILTER tokenized BY INDEXOF(parameter, 'friend=') != -1;
result = FOREACH filtered GENERATE SUBSTRING(parameter, 7, (int)SIZE(parameter)) AS friend_number;
DESCRIBE tokenized;
DUMP tokenized;
DESCRIBE filtered;
DUMP filtered;
DESCRIBE result;
DUMP result;
result:
tokenized: {parameter: chararray}
(user=242562)
(friend=6226)
(friend=93856)
(age=35)
(friend=35900)
filtered: {parameter: chararray}
(friend=6226)
(friend=93856)
(friend=35900)
result: {friend_number: chararray}
(6226)
(93856)
(35900)
Try this:
a = LOAD '/logs/test.log' USING PigStorage('&') as (f1, f2, f3, f4, f5);
b = FOREACH a GENERATE REGEX_EXTRACT(f2,'friend=([0-9]*)', 1),
REGEX_EXTRACT(f3,'friend=([0-9]*)', 1),
REGEX_EXTRACT(f5,'friend=([0-9]*)', 1);
DUMP b;