How to write regular expression in nutch?


I am using Nutch to crawl web pages, and I am having trouble writing the regular expression for the URL filter.
Crawling works fine with the following configuration:
Seed URL:
https://www.practo.com
regex-urlfilter.txt:
+^https://www.practo.com/
But I want to fetch only specific pages, such as pages that contain information about 'cardiologist'.
Example: I want to fetch pages like:
www.practo.com/hyderabad/doctor/some-name-cardiologist
i.e. I want to fetch pages whose URL ends in a certain keyword.
I am using the following regular expression:
+^https://www.practo.com(/[a-z0-9]*)*cardiologist
Please help me write the correct regular expression.

I found the answer to my question. The problem was getting the regular expression right:
+^(https|http)://([a-zA-Z0-9./-]+)cardiologist([a-zA-Z0-9-#?=])*
The following site helped me a lot in getting to the correct expression: https://regex101.com/

You can use the following:
+^https://www\.practo\.com.*cardiologist
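A quick way to compare candidate rules is to run them over sample URLs outside Nutch. The sketch below uses Python's `re` module as a stand-in for the Java regex engine; the two agree on the syntax used here. It assumes the filter searches each URL without requiring a full match, and it leaves out the leading `+`, which is the include marker in regex-urlfilter.txt rather than part of the pattern. The second sample URL and the `accepts` helper are made up for illustration.

```python
import re

# Patterns taken from the question and the answer (without Nutch's leading "+").
QUESTION_RULE = r"^https://www.practo.com(/[a-z0-9]*)*cardiologist"
ANSWER_RULE = r"^https://www\.practo\.com.*cardiologist"

# First URL is from the question; the second is a hypothetical page to reject.
SAMPLE_URLS = [
    "https://www.practo.com/hyderabad/doctor/some-name-cardiologist",
    "https://www.practo.com/hyderabad/doctor/some-name-dentist",
]


def accepts(rule: str, url: str) -> bool:
    """Return True if the rule is found in the URL (unanchored search)."""
    return re.search(rule, url) is not None


for url in SAMPLE_URLS:
    print(url, accepts(QUESTION_RULE, url), accepts(ANSWER_RULE, url))
```

The question's rule rejects the first URL because `[a-z0-9]` contains no hyphen, so `some-name-` can never be consumed; `.*` avoids that problem.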

Related

Extract Session ID from URL using Jmeter

I'm using jmeter to try and test a website. I'm currently having issues extracting information that is returned.
For instance, I send an HTTP request to:
https://intranet.company.com/Capps/f?p=101:1:
The website responds with:
https://intranet.company.com/Capps/f?p=101:1:11016690116729:::::
The new string of numbers at the end of the response is the session id that I must use to test other pages of the program. I've been trying to use a Regular Expression Extractor, but I cannot seem to pull the number off the URL. I am currently using JMeter 3.1.
Regular expressions I've tried:
f?p:101:1:([0-9]{16})::
f?p=([0-9]{1,3}):([0-9]{1,3}):([0-9]{16}):
And various similar expressions, but none have worked for me. If I set up the website with no session ids it will work, but the website is required to use session ids.
Thanks for any help you may provide,
Zwils0

You need to escape the ? sign: it is a meta-character and will be interpreted as a quantifier that makes the preceding character optional.
Your expression also looks for a 16-digit integer, while your id is 14 digits long.
I would suggest the following Regular Expression Extractor configuration:
Field to check: URL
Reference Name: anything meaningful, e.g. id
Regular Expression: f\?p=101:1:(\d+):
Template: $1$
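The suggested expression can be sanity-checked outside JMeter. This is a minimal sketch in Python, whose `re` module treats this particular pattern the same way; the variable names are illustrative, and `group(1)` plays the role of the `$1$` template.

```python
import re

# Response URL quoted in the question.
response_url = "https://intranet.company.com/Capps/f?p=101:1:11016690116729:::::"

# Expression from the answer: "?" is escaped and the id length is left open.
suggested = re.compile(r"f\?p=101:1:(\d+):")

# Expression from the question, which insists on exactly 16 digits.
fixed_length = re.compile(r"f?p=([0-9]{1,3}):([0-9]{1,3}):([0-9]{16}):")

match = suggested.search(response_url)
session_id = match.group(1) if match else None

print(session_id)                         # 11016690116729
print(fixed_length.search(response_url))  # None, because the id has only 14 digits
```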
References:
Apache JMeter: Regular Expressions
Using RegEx (Regular Expression Extractor) with JMeter
Perl 5 Regex Cheat sheet

I don't know anything about JMeter, but I guess it supports standard regex syntax. In your regular expressions, you are expecting a numeric session id with a constant length of 16. However, the session id is not necessarily 16 digits long. In your own example, it has 14 digits. If I check the session id length on my Oracle APEX cloud account, it is 13 digits long.
I guess you cannot rely on its constant length, therefore, try to use something like this:
f?p=([0-9]{1,3}):([0-9]{1,3}):([0-9]{10,16}):
Or even this:
f?p=([0-9]{1,3}):([0-9]{1,3}):([0-9]*):
Also check out the following link and scroll down a bit. Guru Jeff Kemp already did something like this.
https://jeffkemponoracle.com/2011/10/07/googlebot-apex-session-ids-and-cookies/
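To see the variable-length version at work, here is a small Python sketch run against the URL from the question. Unlike the expressions in this answer it escapes the `?`, which is an adjustment made for the sketch; the function name `parse_apex_url` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Application id, page id and session id, with the session length left flexible.
APEX_URL = re.compile(r"f\?p=([0-9]{1,3}):([0-9]{1,3}):([0-9]{10,16}):")


def parse_apex_url(url: str) -> Optional[Tuple[str, ...]]:
    """Return (app_id, page_id, session_id), or None if the URL does not match."""
    match = APEX_URL.search(url)
    return match.groups() if match else None


# URL with a session id, as returned by the server in the question.
print(parse_apex_url("https://intranet.company.com/Capps/f?p=101:1:11016690116729:::::"))
# URL without a session id, as sent in the initial request.
print(parse_apex_url("https://intranet.company.com/Capps/f?p=101:1:"))
```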

Chris Muir covered it in his comprehensive post on configuring jmeter specifically for APEX. It's dated, but I'm pretty sure it still holds up.
c) sessionId Regular Expression Extractor
f?p=([0-9]{1,3}):([0-9]{1,3}):([0-9]{16}):
http://one-size-doesnt-fit-all.blogspot.com.au/2010/05/configuring-apache-jmeter-for-apex.html
This appears to be what you've tried, but it seems like there may be other settings and considerations.

Why is this regular expression not working for redirection?

I'm trying to set up a wildcard redirect in WordPress using the Redirection plugin, which supports regular expressions.
I have a bunch of urls that look like this:
www.example.com/details.php?Pin=1234
www.example.com/details.php?Pin=5678
That I want to have redirected to the following format:
www.example.com/readers/1234/
www.example.com/readers/5678/
I have a source and a target field to enter redirection entries, and I currently have it set as such:
Source: /details.php?Pin=(.*)
Target: /readers/$1/
I believe these are the correct expressions for a simple wildcard, but no redirection is happening. I'm not too clued up on regular expressions, so would anyone be able to tell me what I'm doing wrong?

It's the ? that needs to be escaped.
Source: /details.php\?Pin=(.*)
In its present form it makes the last p in php optional, so the literal ? in the URL is never matched.
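The effect of the unescaped `?` is easy to reproduce. The sketch below uses Python's `re` module rather than the plugin itself, so the replacement is written `\1` where the Redirection plugin uses `$1`; treat it as an illustration of the regex behaviour, not of the plugin's internals.

```python
import re

old_url = "/details.php?Pin=1234"

# Unescaped: "p?" means "an optional p", so the literal "?" in the URL is never matched.
unescaped = r"/details.php?Pin=(.*)"
# Escaped: "\?" matches the literal question mark.
escaped = r"/details.php\?Pin=(.*)"

print(re.search(unescaped, old_url))  # None

new_url = re.sub(escaped, r"/readers/\1/", old_url)
print(new_url)                        # /readers/1234/
```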

This worked for me:
\/details.php\?Pin=(\d+)
/readers/$1/
Tested in Notepad++.

Regular expression to exclude part of a url?

I'm trying to create a regular expression for google analytics goals.
I need to match either of these 2 url fragments:
/order/map/egw/?code=somevalue
or
/order/map/egw/
But NOT this url:
/order/map/egw/consult/
Tried this:
/order/map/egw/$ | /order/map/egw/\?
and other variations but can't get it to match properly
Fast help greatly appreciated!

How about this regular expression?
/order/map/egw/(?!consult).*
If in the future you find that there's another sub-directory that you don't want to include, you can add a new one (e.g. the sub-directory 'wrong') like so:
/order/map/egw/(?!consult|wrong).*
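A negative lookahead like this can be tried out on the paths from the question before it goes into a goal definition. The sketch below uses Python's `re` module, which supports lookaheads; whether the target tool's regex engine accepts them is an assumption to verify separately. The `is_goal` helper is a hypothetical name.

```python
import re

# Pattern from the answer: the base path, not followed directly by "consult".
GOAL = re.compile(r"/order/map/egw/(?!consult).*")

# The three paths listed in the question.
PATHS = [
    "/order/map/egw/?code=somevalue",
    "/order/map/egw/",
    "/order/map/egw/consult/",
]


def is_goal(path: str) -> bool:
    """Return True if the pattern matches from the start of the path."""
    return GOAL.match(path) is not None


for path in PATHS:
    print(path, is_goal(path))
```

The first two paths match and the third does not, because the lookahead fails as soon as `consult` follows the base path.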

What about this? I don't know how strict you're trying to be but it should work for your use cases:
(?!.*consult)/order/map/egw/(\?.+)?
It ensures "consult" is not found in the URL and matches the base part with an optional query string.
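How strict this pattern is depends on how it is applied. The Python sketch below assumes the whole path has to match (`re.fullmatch`); with an unanchored search the same pattern would also accept longer paths that merely start with the base part. The path `/order/map/egw/other/` and the `is_goal_strict` helper are hypothetical additions.

```python
import re

# Pattern from the answer: no "consult" anywhere, base path, optional query string.
STRICT_GOAL = re.compile(r"(?!.*consult)/order/map/egw/(\?.+)?")


def is_goal_strict(path: str) -> bool:
    """Return True only if the entire path matches the pattern."""
    return STRICT_GOAL.fullmatch(path) is not None


for path in [
    "/order/map/egw/?code=somevalue",  # from the question: should match
    "/order/map/egw/",                 # from the question: should match
    "/order/map/egw/consult/",         # from the question: should not match
    "/order/map/egw/other/",           # hypothetical: rejected only with a full match
]:
    print(path, is_goal_strict(path))
```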

simple regular expression - match specific url

I'm a noob when it comes to Regular Expressions. I'm using Joomla and the Advanced Module Manager to publish a module to a specific url.
I want to publish a module only to the URL /tv-show and not to /tv-show/anything-else/blahblah
I thought the way to do it was /tv-show*, but obviously not, since it still publishes to other URLs that begin with /tv-show.
I tried many variations; please tell me where I am going wrong.

Try the following
/tv-show$
The dollar matches the end of a string.
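The difference between the two patterns shows up when both are run over the two paths from the question. This is a minimal sketch in Python's `re` module, assuming the module manager searches the URL path for the pattern. The `^/tv-show$` variant and the `/other/tv-show` path are additions for illustration, not part of the answer.

```python
import re

paths = ["/tv-show", "/tv-show/anything-else/blahblah"]

# "w*" means "zero or more w", so this still finds "/tv-show" inside longer URLs.
star = re.compile(r"/tv-show*")
# "$" pins the match to the end of the string.
dollar = re.compile(r"/tv-show$")
# Adding "^" pins the start as well, so "/other/tv-show" is rejected too.
both = re.compile(r"^/tv-show$")

for path in paths:
    print(path, bool(star.search(path)), bool(dollar.search(path)))

# Hypothetical path that ends in /tv-show but does not start with it.
print(bool(dollar.search("/other/tv-show")), bool(both.search("/other/tv-show")))
```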

Git URL Structure

I am trying to build a regular expression to match any git read+write URL structure (not just GitHub) and I wanted to check to see if I got the regex right. This is what I have so far
([A-Za-z0-9]+@|http(|s)\:\/\/)([A-Za-z0-9.]+)(:|/)([A-Za-z0-9\/]+)(\.git)?
That regex matches all of the following URLs
git@github.com:user/project.git
https://github.com/user/project.git
http://github.com/user/project.git
git@192.168.101.127:user/project.git
https://192.168.101.127/user/project.git
http://192.168.101.127/user/project.git
http://192.168.101.127/user/project
And others like non-top-level domains and single name domains (http://server/). Are there other URL structures that I should be conscious of? Also, is there a shorter way of writing the existing regex that I have?

If you are using Rails / Ruby to write your program, check this out. You might be able to get some ideas from here:
http://www.simonecarletti.com/blog/2009/04/validating-the-format-of-an-url-with-rails/
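One way to approach the question about other URL structures is to run the regex from the question over a few more forms. The sketch below does this in Python, assuming the SSH form is written `user@host:path` as git itself writes it and that the whole string must match. The extra URLs (`ssh://`, `git://`, and a repository name containing a hyphen) are illustrative additions, not taken from the question, and `is_git_url` is a hypothetical helper.

```python
import re

# Regex from the question, with "@" in the SSH-style alternative.
GIT_URL = re.compile(
    r"([A-Za-z0-9]+@|http(|s)\:\/\/)([A-Za-z0-9.]+)(:|/)([A-Za-z0-9\/]+)(\.git)?"
)

# Forms listed in the question.
from_question = [
    "git@github.com:user/project.git",
    "https://github.com/user/project.git",
    "http://192.168.101.127/user/project",
]

# Illustrative forms that git also accepts but the regex does not cover.
extra_forms = [
    "ssh://git@github.com/user/project.git",
    "git://github.com/user/project.git",
    "https://github.com/user/my-project.git",
]


def is_git_url(url: str) -> bool:
    """Return True if the entire string matches the regex."""
    return GIT_URL.fullmatch(url) is not None


for url in from_question + extra_forms:
    print(url, is_git_url(url))
```

The three extra forms are rejected: the scheme alternatives only cover `user@` and `http(s)://`, and the path character class has no hyphen. As for brevity, `https?://` is a shorter spelling of `http(|s)\:\/\/`, at the cost of one capture group.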
