Pig: extracting email details from raw text using REGEX - regex

I am trying to extract email details from raw text using Pig.
Here's the sample data:
Sample data for email abc.123#gmail.com
Sample data for email xyz#abc.com
I am trying the REGEX approach, with a regular expression I took from: http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
Here's the script:
A = Load '----' using PigStorage as (value: chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(value, '^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z]{2,})$')) AS (f1: chararray);
dump B;
After dumping the output to the terminal, I am getting blank output:
()
()
Is there any problem in the script syntax?
Please also share some links on writing regular expressions; that would be very helpful.
Your help is appreciated, thank you.

For the following input data
abc.123#gmail.com
xyz#abc.com
the output of your code is
.123 .com
.com
So there are a couple of problems in your code:
REGEX_EXTRACT_ALL returns only the captured groups, so you need to add parentheses around the whole regex to capture the complete email address. The code should then work if you have only one token (word or email-id) on each line.
If each input line can be a sentence, then you have to tokenize first and then run the regex match on the tokens.
The reason your regex works only on a token and not on a whole line is that "^" indicates the beginning of the string and "$" indicates the end of the string, so the match succeeds only when the entire line is an email-id, which means you can have only one token per line.
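The two fixes above (one outer capturing group, then tokenizing before matching) can be sketched outside Pig. This is only a Python illustration of the idea, not Pig code; the names EMAIL_BODY, ANCHORED and extract_emails are made up for the example, and the pattern keeps '#' because the sample data uses it where a real address would have '@'.

```python
import re

# Pattern adapted from the question. The sample data uses '#' where a real
# address would have '@', so the pattern keeps '#'.
EMAIL_BODY = r"[_A-Za-z0-9+-]+(?:\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+\.[A-Za-z]{2,}"

# Anchored and wrapped in ONE capturing group: it matches only when the
# whole string is an email-id, and group 1 is the complete address.
ANCHORED = re.compile(r"^(" + EMAIL_BODY + r")$")


def extract_emails(line):
    """Tokenize on whitespace, then keep the tokens that are entirely an email-id."""
    found = []
    for token in line.split():
        m = ANCHORED.match(token)
        if m:
            found.append(m.group(1))
    return found


lines = [
    "Sample data for email abc.123#gmail.com",
    "Sample data for email xyz#abc.com",
]
for line in lines:
    # The anchored pattern fails on the full sentence but succeeds per token.
    print(ANCHORED.match(line), extract_emails(line))
```

In Pig the same shape would presumably be a TOKENIZE step followed by the regex on each token, but that part is left as described in the answer.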

Related

Regex to match forward slash surrounded by double quotes

I have a serialised string that comes from Spring-hosted endpoints. On the frontend, which is JavaScript-based, I want to turn the serialised string from the API into a string that can be parsed with JSON.parse().
Please let me know the regex to match and replace the required fields as below.
Sample string: \"address\":\"<VALUE>"\"}. I want to replace all instances of "\" that come at the end of VALUE with \"
I tried str.replaceAll('\"/\\\"', '/\\\"'), but no luck.
Here is the code. We have to escape characters to put the wanted values into the variable:
const testString = '\\\"address\\\":\\\"<VALUE>"\\\"},';
alert(testString);
// The g flag replaces every occurrence of "\" rather than only the first one.
alert(testString.replace(/\"\\\"/g, '\\\"'));
The first alert gives us the original testString:
\"address\":\"<VALUE>"\"},
and the second the modified testString
\"address\":\"<VALUE>\"},
Tested with https://www.webtoolkitonline.com/javascript-tester.html

How to write a REGEX that captures a string between a string (ie the words between specific words)

The text string below is coming through in the Integromat text parser. I'm trying to capture the values from a form a user filled out, using the built-in Integromat regex text parser.
For example, the test string comes in as (unfortunately the info is not coming through on individual lines):
Information First Name:Frank Last Name:McTester Email:mctesterxmas#rya.com
Guest Name:Debby McTester Party RegistrationNumber of Dinner Guests: 2 [http://
I need the regex to pull the value Frank, which sits between the strings First Name: and Last Name:, and so on for the other fields.
My current regex works well for emails where each of these strings is on its own line:
First Name:\s*(.*)|Last Name:\s*(.*)|Email:\s*(.*)|Guest Name:\s*(.*)|Number of Dinner Guests:\s*(.*)
But when everything is mashed up, I cannot figure out how to use regex to parse the string.
Instead of using alternatives, match the entire line with each field in order.
First Name:\s*(.*?)\s+Last Name:\s*(.*?)\s+Email:\s*(.*)|Guest Name:\s*(.*?)Number of Dinner Guests:\s*(.*)
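To make the "each field in order" idea concrete, here is a small Python sketch. It is a variant of the regex above, not the Integromat expression itself: it chains all five fields in one sequence, uses Python's (?P<name>...) group syntax, and the group names (first, last, email, guest, count) are invented for the example. The sample text is the one from the question, joined onto a single line.

```python
import re

# Sample text from the question, joined onto one line.
text = (
    "Information First Name:Frank Last Name:McTester "
    "Email:mctesterxmas#rya.com Guest Name:Debby McTester Party Registration"
    "Number of Dinner Guests: 2 [http://"
)

# Each lazy group stops at the next label, so the labels act as delimiters.
PATTERN = re.compile(
    r"First Name:\s*(?P<first>.*?)\s+"
    r"Last Name:\s*(?P<last>.*?)\s+"
    r"Email:\s*(?P<email>\S+)\s+"
    r"Guest Name:\s*(?P<guest>.*?)\s*"
    r"Number of Dinner Guests:\s*(?P<count>\d+)",
    re.DOTALL,  # lets '.' cross line breaks if the text arrives on several lines
)

match = PATTERN.search(text)
fields = match.groupdict() if match else {}
print(fields)
```

Note that the guest field takes everything up to "Number of Dinner Guests:", so with this sample it also picks up the "Party Registration" text that follows the name.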
This was the final solution if anyone else needs it.
Transaction Date: *(?<date>\S+)|^Information[^:]*:\s*(?<name>.*)Last Name:\s*(?<lastname>.*)Email:(?<email>.*)Guest\n?Name:(?<guestname>.*)|Dinner Guests:\s*(?<guestcount>\d+)
or
Transaction Date: *(?<date>[^\s\n]+)|^Information[^:]*:\s*(?<name>.*)Last Name:\s*(?<lastname>.*)Email:(?<email>.*)Guest\n?Name:(?<guestname>.*)|Dinner Guests:\s*(?<guestcount>\d+)

How to extract FirstName and LastName from html tags with regex?

I have a response body which contains
"<h3 class="panel-title">Welcome
First Last </h3>"
I want to fetch 'First Last' as output.
The regular expressions I have tried are
"Welcome(\s*([A-Za-z]+))(\s*([A-Za-z]+))"
"Welcome \s*([A-Za-z]+)\s*([A-Za-z]+)"
But I am not able to get the result. If I remove the newline and take it as
"<h3 class="panel-title">Welcome First Last </h3>" it is detected in an online regex maker.
I suspect your problem is the line break between "Welcome" and the user name. The "single-line mode" flag (?s) makes . match line breaks as well. Try these:
(?s)Welcome(\s*([A-Za-z]+))(\s*([A-Za-z]+))
(?s)Welcome \s*([A-Za-z]+)\s*([A-Za-z]+)
(This works in JMeter and any other Java- or PHP-based regex engine, but not in JavaScript. In the comments on the question you say you're using JavaScript and also JMeter. If it is a JMeter question, then this will help; if JavaScript, try one of the other answers.)
Well, usually I don't recommend regex for this kind of work; DOM parsing is the better tool.
But you can use the following regex to extract the text:
/(?:<h3.*?>)([^<]+)(?:<\/h3>)/i
See demo at https://regex101.com/r/wA2sZ9/1
This will extract First and Last names including extra spacing. I'm sure you can easily deal with spaces.
In the JMeter Regular Expression Extractor you can use:
<h3 class="panel-title">Welcome(.*?)</h3>
Then take the value using $1$.
In the data you have shown, Welcome is followed by a line break. If that is actually part of the response, then you have to use \n:
<h3 class="panel-title">Welcome\n(.*?)</h3>
Otherwise the one above is enough.
First verify this in JMeter using the regular expression tester on the response body.
Welcome([\s\S]+?)<
Try this, it will definitely work.
Regular expressions are greedy by default, try this
Welcome\s*([A-Za-z]+)\s*([A-Za-z]+)
Groups 1 and 2 contain your data
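A quick way to compare the patterns suggested in these answers is to run them against the response body with the line break kept in. The sketch below uses Python's re module, which is not the engine JMeter uses, so treat it as a rough check only. It assumes the line break comes directly after "Welcome" with no space in between, as the question's sample suggests.

```python
import re

# Response body from the question, with the line break after "Welcome".
body = '<h3 class="panel-title">Welcome\nFirst Last </h3>'

# \s* already matches the line break, so both names are captured.
two_groups = re.search(r"Welcome\s*([A-Za-z]+)\s*([A-Za-z]+)", body)
first, last = two_groups.group(1), two_groups.group(2)

# [\s\S]+? lazily takes everything up to the next '<', line break included.
lazy = re.search(r"Welcome([\s\S]+?)<", body)
full_name = lazy.group(1).strip()

# The literal space after "Welcome" in the question's second attempt cannot
# match a line break, so this search finds nothing.
with_space = re.search(r"Welcome \s*([A-Za-z]+)\s*([A-Za-z]+)", body)

print(first, last, repr(full_name), with_space)
```

The stripped full_name is one string holding both names, while the two-group pattern hands them back separately; which is more convenient depends on how the value is used afterwards.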

Powershell with regex: Unable to find and replace ALL occurences of specified string in a set of data

I am new to regular expressions and Stack Overflow. Any help would be greatly appreciated.
I am trying to remove unwanted data from a data set. The data is contained in a .csv file column with multiple cells, each cell containing data similar to this:
OSVDB #109124,OSVDB #109125,OSVDB #109126,OSVDB #109127,OSVDB #109128,OSVDB #109129,OSVDB #109130,OSVDB #109131,OSVDB #109132,OSVDB #109133,OSVDB #109134,OSVDB #109135,OSVDB #109136,OSVDB #109137,OSVDB #109138,OSVDB #109139,OSVDB #109140,OSVDB #109141,OSVDB #109142,OSVDB #109143,VMSA #2014-0012,OSVDB #102715,OSVDB #104972,OSVDB #106710,OSVDB #115364,IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007
I want to replace the above data with each occurrence of the strings beginning "IAV...". So, the above cell would read:
IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007
Below is a snippet of the script that imports the .csv and gets the column containing the data.
My regex, within PowerShell, is:
$reg1 = '$1'
$reg2 = '(IAV[A|B]\s#[0-9]{4}-[A|B]-[0-9]{4}){1,}'
ForEach-Object {$_.IAVM = [regex]::replace($_.IAVM,$reg2,$reg1); $_}
The result is:
The entire cell contents posted above.
From my understanding {1,} at the end of the regex should return each occurrence of the string pattern, but I'm returning all contents of every cell containing my regex string.
Maybe instead of trying to pick out your string you just delete the stuff you don't want? Try something like:
$reg1=''
$reg2='((OSVDB|VMSA)\s#[M-S0-9-]{6,9}[,]?)'
You have .* in that regex at the very beginning. This will capture everything up to the last match of the pattern that follows it. In your case I don't think you need that part anyway.
Also note that PowerShell has a handy -replace operator, so there's often no reason to use the static methods on the Regex type.
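For comparison, here is the same clean-up sketched in Python rather than PowerShell. It shows why replacing each match with '$1' changes nothing (every match is swapped for itself and the rest of the cell is left alone), then two ways to end up with only the IAV entries. The cell is a shortened copy of the one in the question, and the delete pattern uses [0-9-]+ as a simplification of the character class suggested above.

```python
import re

# Shortened copy of the cell from the question.
cell = (
    "OSVDB #109124,OSVDB #109125,VMSA #2014-0012,OSVDB #102715,"
    "IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007"
)

# The question's pattern, replaced with group 1: each match is replaced by
# itself, so the cell comes back unchanged.
unchanged = re.sub(r"(IAV[A|B]\s#[0-9]{4}-[A|B]-[0-9]{4}){1,}", r"\1", cell)

# Option 1: collect every IAV entry and join them back together.
# [AB] is enough; inside a character class '|' is a literal pipe.
IAV = re.compile(r"IAV[AB]\s#[0-9]{4}-[AB]-[0-9]{4}")
kept = ",".join(IAV.findall(cell))

# Option 2: delete the unwanted entries, including the comma that follows them.
removed = re.sub(r"(?:OSVDB|VMSA)\s#[0-9-]+,?", "", cell)

print(unchanged == cell)
print(kept)
print(removed)
```

Option 2 relies on the unwanted entries coming before the IAV ones, as they do in the sample; if they could also trail, a stray comma would need trimming.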

How do I extract a postcode from one column in SSIS using regular expression

I'm trying to use a custom regex clean transformation (information found here) to extract a post code from a mixed address column (Address3) and move it to a new column (Post Code).
Example of incoming data:
Address3: "London W12 9LZ"
Incoming data could be any combination of place names with a post code at the start, middle or end (or not at all).
Desired outcome:
Address3: "London"
Post Code: "W12 9LZ"
Essentially, in plain English, "move (not copy) any post code found from Address3 into Post Code".
My regex skills aren't brilliant but I've managed to get as far as extracting the post code and getting it into its own column using the following regex, matching from Address3 and replacing into Post Code:
Match Expression:
(?<stringOUT>([A-PR-UWYZa-pr-uwyz]([0-9]{1,2}|([A-HK-Ya-hk-y][0-9]|[A-HK-Ya-hk-y][0-9] ([0-9]|[ABEHMNPRV-Yabehmnprv-y]))|[0-9][A-HJKS-UWa-hjks-uw])\ {0,1}[0-9][ABD-HJLNP-UW-Zabd-hjlnp-uw-z]{2}|([Gg][Ii][Rr]\ 0[Aa][Aa])|([Ss][Aa][Nn]\ {0,1}[Tt][Aa]1)|([Bb][Ff][Pp][Oo]\ {0,1}([Cc]\/[Oo]\ )?[0-9]{1,4})|(([Aa][Ss][Cc][Nn]|[Bb][Bb][Nn][Dd]|[BFSbfs][Ii][Qq][Qq]|[Pp][Cc][Rr][Nn]|[Ss][Tt][Hh][Ll]|[Tt][Dd][Cc][Uu]|[Tt][Kk][Cc][Aa])\ {0,1}1[Zz][Zz])))
Replace Expression:
${stringOUT}
So this leaves me with:
Address3: "London W12 9LZ"
Post Code: "W12 9LZ"
My next thought is to keep the above match/replace, then add another to match anything that doesn't match the above regex. I think it might be a negative lookahead but I can't seem to make it work.
I'm using SSIS 2008 R2 and I think the regex clean transformation uses the .NET regex implementation.
Thanks.
Just solved this. As usual, it was simpler logic than I thought it should be. Instead of trying to match the non-post code strings and replace them with themselves, I have added another line matching the postcode again and replacing it with "".
So in total, I have:
Match the post code using the above regex and move it to the Post Code column
Match the post code using the above regex and replace it with "" in the Address3 column
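The two steps can be sketched in a few lines of Python to show the logic outside SSIS. The POSTCODE pattern here is a deliberately simplified stand-in for the long expression in the question (an assumption for readability, not a full UK postcode validator), and split_postcode is a hypothetical helper name.

```python
import re

# Simplified stand-in for the question's pattern (assumption: it is not a
# complete UK postcode validator).
POSTCODE = re.compile(
    r"\b[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}\b", re.IGNORECASE
)


def split_postcode(address3):
    """Return (address without the post code, post code or '')."""
    m = POSTCODE.search(address3)
    if m is None:
        return address3, ""
    post_code = m.group(0)                             # step 1: copy the match out
    remainder = POSTCODE.sub("", address3, count=1)    # step 2: blank it in place
    remainder = " ".join(remainder.split())            # tidy leftover spaces
    return remainder, post_code


for value in ["London W12 9LZ", "W12 9LZ London", "London"]:
    print(split_postcode(value))
```

The whitespace tidy-up at the end is an extra that the SSIS version would need handled separately, for example with a trim on the Address3 column.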