Extract website details from text file using regex pattern - regex

I need to extract website URLs from the text. Can you tell me what I am missing?
Data:
gmail.com
2.0
Dolphins.com.
B.TECH
62.1%.
github.com/XYZ
abcd.com
github.com/abcd
linkedin.com/in/abcd
abcd.wordpress.com/
https://xyz/stackoverflow.com
Regex pattern:
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w+/\-?=%.]+\.[\w+/\-?=%.]+', text)
Expected Output:
github.com/XYZ
abcd.com
github.com/abcd
linkedin.com/in/abcd
abcd.wordpress.com/
https://xyz/stackoverflow.com
Current output:
It is extracting all the items listed under Data. Can someone tell me what changes are required in my regex to get the expected output?

I used the regex below and it worked on regex101.com:
.*(?:https?:\/\/)?(?:www\.)?[a-z-]+\.(?:com|org)(?:\.[a-z]{2,3})?.*
But when I use it in my code with re.findall() it returns the entire text file, and when I use it with re.finditer() I get an error saying the result is not JSON serializable. I'm trying to return my output as JSON. What can be done here?
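One possible reading of the two symptoms above, sketched in Python: the leading and trailing ".*" make each match swallow the whole line, and re.finditer() yields Match objects, which the json module cannot serialize. The helper name extract_urls and the sample string are illustrative assumptions, not part of the original post, and the pattern is only the second regex from the question with ".*" removed.

```python
import json
import re

# Second pattern from the question, minus the leading/trailing ".*",
# so a match covers only the URL-like token instead of the whole line.
URL_RE = re.compile(r"(?:https?://)?(?:www\.)?[a-z-]+\.(?:com|org)(?:\.[a-z]{2,3})?")


def extract_urls(text):
    """Return matched substrings as plain strings (hypothetical helper)."""
    # finditer yields Match objects; group(0) converts each to a str,
    # which json.dumps can serialize.
    return [m.group(0) for m in URL_RE.finditer(text)]


# Shortened, made-up sample in the style of the question's data.
sample = "2.0\nB.TECH\nabcd.com\n62.1%.\n"
print(json.dumps(extract_urls(sample)))
```

Note that this pattern only allows lowercase letters and hyphens before ".com" or ".org" and does not capture a path such as "/XYZ", so it would still need extending to cover every line of the expected output.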

Related

regex match anything ends with but not contains a sequence

I'm using regex to extract JSON data from HTML, and I cannot find how to get the text that ends with '});' without including it.
Here is some sample data:
start({"main":{"ss_service":"...();"}});
Thank you for your help.
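A minimal Python sketch of one way to approach this, assuming the JSON is always wrapped in a call named start(...) as in the sample. The surrounding HTML string is made up for illustration. The closing ");" is matched outside the capturing group, so it is required but not included in the result.

```python
import json
import re

# Illustrative HTML fragment built around the sample data from the question.
html = '<script>start({"main":{"ss_service":"...();"}});</script>'

# The group captures from the first "{" to the last "}" that is followed
# by ");". The greedy ".*" is deliberate: a lazy match would stop at the
# "();" inside the string value.
match = re.search(r'start\((\{.*\})\);', html, re.DOTALL)
payload = match.group(1)
data = json.loads(payload)
print(payload)
```

Because the match is greedy, a page containing several such calls would be captured as one span, so this sketch assumes a single start(...) call per document.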

Regular expression to match string from url

I want to match the shop name from a URL. It's for URL redirection in a WordPress application.
See the examples below:
http://example.com/outlets/19-awok?page=2
http://example.com/outlets/19-awok
http://example.com/outlets/159-awok?page=3
In all cases I need to get only awok from the URL. It is the text that comes after '-' and before the query string.
I tried the rule below and it's not working:
/outlets/(\d+)-(.*)? => /shop/$2
Any help will be greatly appreciated.
You can use this regex:
/outlets/\d+-([^?]+)?
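Here [^?]+ matches everything up to, but not including, the query string, so the group captures only the shop name; the trailing ? makes that group optional.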
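To check the pattern against the three example URLs, here is a small Python sketch. Python's re module stands in for whatever engine the redirect rule actually runs on, and shop_name is a hypothetical helper name.

```python
import re

# Pattern from the answer, without the optional marker so that a URL
# with no shop name simply fails to match.
SHOP_RE = re.compile(r"/outlets/\d+-([^?]+)")


def shop_name(url):
    """Return the text after '-' and before any query string, or None."""
    m = SHOP_RE.search(url)
    return m.group(1) if m else None


urls = [
    "http://example.com/outlets/19-awok?page=2",
    "http://example.com/outlets/19-awok",
    "http://example.com/outlets/159-awok?page=3",
]
for url in urls:
    print(shop_name(url))
```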

Jmeter parameter extraction

I am getting a result in my jmeter test that I don't understand:
I am trying to extract the "totalRunning" value from this Json response:
{"notifications":[],"taskNotificationInfo":{"totalRunning":0,"totalCompleted":0,"totalCompletedWithErrors":0,"totalFailed":0,"totalPending":0,"requestTime":1458628767436,"hasRecords":false}}
My regex is configured as follows:
Reference Name: TotalRunning
Regular Expression: "totalRunning":"(.+?)"
Template: $1$
Match: 1
Default Value: 1
I keep getting the default value instead of "0" in this case.
Am I extracting it from the wrong place?
Any help would be appreciated.
There is no problem with a regex such as totalRunning":(.+?),"totalCompleted"
You only need to select the Body radio button instead of Body as a Document.
Change your regular expression as below:
Regular Expression: "totalRunning":(\d+)
The question uses "totalRunning":"(.+?)" as the regular expression. Because the value of totalRunning is not surrounded by quotes, nothing matched and the default value was picked.
Below regex can also be used:
"totalRunning":(.+?),
Change your "Field to check" to "Body", as "Body as a Document" is for binary files like Word, Excel, PDF, etc. See the "How to Extract Data From Files With JMeter" article for more details. JSON is plain text, so it should be treated as a "normal" response.
There is also a dedicated Post Processor, the JSON Path Extractor, available via the JMeter Plugins project. For complex JSON, or for multiple or conditional matches, it might be better and easier to use.
The relevant JSONPath expression will be: $..totalRunning[0]
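The difference between the two expressions can be reproduced outside JMeter. The sketch below uses Python's re module as a stand-in for JMeter's regex engine, so it only illustrates the quoting issue, not JMeter's own behaviour; the response string is the one from the question.

```python
import json
import re

response = (
    '{"notifications":[],"taskNotificationInfo":{"totalRunning":0,'
    '"totalCompleted":0,"totalCompletedWithErrors":0,"totalFailed":0,'
    '"totalPending":0,"requestTime":1458628767436,"hasRecords":false}}'
)

# Original expression: expects a quoted value, but the value is a bare number.
quoted = re.search(r'"totalRunning":"(.+?)"', response)

# Corrected expression: matches the unquoted digits.
unquoted = re.search(r'"totalRunning":(\d+)', response)

# Parsing the JSON directly avoids the regex altogether.
parsed = json.loads(response)["taskNotificationInfo"]["totalRunning"]

print(quoted, unquoted.group(1), parsed)
```

The first search finds nothing, which corresponds to JMeter falling back to the default value.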

in regex get a single match just before the match pattern?

I have a response like the one below:
{"id":9,"announcementName":"Test","announcementText":"<p>TestAssertion</p>\n","effectiveStartDate":"03/01/2016","effectiveEndDate":"03/02/2016","updatedDate":"02/29/2016","status":"Active","moduleName":"Individual Portal"}
{"id":103,"announcementName":"d3mgcwtqhdu8003","announcementText":"<p>This announcement is a test announcement”,"effectiveStartDate":"03/01/2016","effectiveEndDate":"03/02/2016","updatedDate":"02/29/2016","status":"Active","moduleName":"Individual Portal"}
{"id":113,"announcementName":"asdfrtwju3f5gh7f21","announcementText":"<p>This announcement is a test announcement”,"effectiveStartDate":"03/02/2016","effectiveEndDate":"03/03/2016","updatedDate":"02/29/2016","status":"InActive","moduleName":"Individual Portal"}
I am trying to get the id (103) of the announcement whose announcementName is d3mgcwtqhdu8003.
I am using the regex pattern below to get the id:
"id":(.*?),"announcementName":"${announcementName}","announcementText":"
But it matches everything from the first id up to the announcementName, and returns:
9,"announcementName":"Test","announcementText":"<p>TestAssertion</p>\n","effectiveStartDate":"03/01/2016","effectiveEndDate":"03/02/2016","updatedDate":"02/29/2016","status":"Active","moduleName":"Individual Portal"}
{"id":103,"announcementName":"d3mgcwtqhdu8003","announcementText":
But I want to match only from the id just before the required announcementName.
How can I do this in regex? Can someone please help me with this?
Either use appropriate JSON functions or, failing that, a simple regex like:
"id":(\d+)
will probably do as the IDs are numeric.
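A short Python sketch of why the numeric class helps: with (\d+) in place of (.*?), the group cannot run past the comma after the first id, so the match has to start at the id that sits directly before the wanted announcementName. The response below is a shortened, made-up version of the one in the question, and find_id is a hypothetical helper name.

```python
import re

# Shortened stand-in for the response in the question (fields trimmed).
response = (
    '{"id":9,"announcementName":"Test","status":"Active"}\n'
    '{"id":103,"announcementName":"d3mgcwtqhdu8003","status":"Active"}\n'
    '{"id":113,"announcementName":"asdfrtwju3f5gh7f21","status":"InActive"}'
)


def find_id(text, announcement_name):
    """Return the id directly preceding the given announcementName, or None."""
    # re.escape guards against regex metacharacters in the name.
    pattern = r'"id":(\d+),"announcementName":"' + re.escape(announcement_name) + '"'
    m = re.search(pattern, text)
    return m.group(1) if m else None


print(find_id(response, "d3mgcwtqhdu8003"))
```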

Pig: extracting email details from raw text using REGEX

I am trying to extract email details from raw text using pig.
Here's the sample data:
Sample data for email abc.123#gmail.com
Sample data for email xyz#abc.com
I am trying the REGEX method, with a regular expression I took from: http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
Here's the script:
A = Load '----' using PigStorage as (value: chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(value, '^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z]{2,})$')) AS (f1: chararray)
dump B;
After dumping the output to the terminal, I am getting blank output:
()
()
Is there any problem in the script syntax?
Please also share some links about writing regular expressions; they would be very helpful.
Your help is appreciated, thank you.
For the following input data
abc.123#gmail.com
xyz#abc.com
the output of your code is
.123 .com
.com
So there are a couple of problems in your code:
You need to add parentheses around the whole regex to capture the complete email address. The code should then work if you have only one token (word or email-id) in each line.
If each input line can be a sentence, then you have to tokenize first and then apply the regex match to the tokens.
The regex works only on a single token and not on a whole line because "^" indicates the beginning of the string and "$" indicates the end of the string, so the match succeeds only when the entire line is an email-id, which means you can have only one token per line.
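The tokenize-then-match idea can be sketched in Python (not Pig) as follows. The pattern is adapted from the one in the question: it keeps '#' as the separator, as in the sample data, wraps the whole expression in one capturing group, and moves the hyphen to the end of the first character class. extract_emails is a hypothetical helper name.

```python
import re

# Anchored pattern adapted from the question; the outer parentheses
# capture the complete address, the inner groups are non-capturing.
EMAIL_RE = re.compile(
    r"^([_A-Za-z0-9+-]+(?:\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(?:\.[A-Za-z]{2,}))$"
)


def extract_emails(line):
    """Split a line on whitespace and keep tokens that match as a whole."""
    # Because of "^" and "$", each token must be an address in its entirety.
    return [tok for tok in line.split() if EMAIL_RE.match(tok)]


lines = [
    "Sample data for email abc.123#gmail.com",
    "Sample data for email xyz#abc.com",
]
for line in lines:
    print(extract_emails(line))
```

In Pig the same two steps would be a tokenization followed by the regex match on each token; the sketch only illustrates why the anchors require tokenizing first.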