How do I Regex website URLs for apache nutch?


I am trying to set up Apache Nutch to crawl only pages on a specified domain using a regex. I don't have much experience with regex and I'm having trouble working out how to express my domain as one.
The domain is
https://www.health.gov.au/
and I would like the regex to accept any web page whose URL is this domain followed by anything else.
Thanks for your time.
EDIT
For example, I would like https://www.health.gov.au/health-topics to be accepted by the regex.

You can use (https://www\.health\.gov\.au/.*).
This matches https://www.health.gov.au/ followed by any characters. The dots are escaped so that each one matches a literal "." rather than any character.
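A quick way to check such a pattern before putting it into the crawler is to run it against a few sample URLs. The sketch below uses Python's `re` module as a stand-in (Nutch itself uses Java regexes and its own filter-file syntax, which this does not reproduce). The `^` anchor, the name `is_accepted`, and the sample URLs are assumptions added for illustration.

```python
import re

# Escaped dots match a literal "." only; "^" pins the match to the start of the URL.
DOMAIN_PATTERN = re.compile(r"^https://www\.health\.gov\.au/.*")


def is_accepted(url: str) -> bool:
    """Return True if the URL starts with https://www.health.gov.au/ (hypothetical helper)."""
    return DOMAIN_PATTERN.match(url) is not None


if __name__ == "__main__":
    samples = [
        "https://www.health.gov.au/",
        "https://www.health.gov.au/health-topics",
        "https://wwwXhealth.gov.au/page",                    # rejected: dot is literal
        "https://example.com/?u=https://www.health.gov.au/",  # rejected: not at the start
    ]
    for url in samples:
        print(url, is_accepted(url))
```

Without the escaped dots, the third sample would be accepted as well, because an unescaped `.` matches any character.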

Related

Greasemonkey/Tampermonkey @match for a URL with different possible extensions with regex

I'm trying to make a script that activates on Amazon. I use several Amazon sites, though (.nl, .de, .co.uk, etc.), and I would like to use a regex to cover them all, so that the script activates on any website that starts with https://www.amazon.
I wrote this regex for the @match rule in the header:
((https?):\/\/)?(\w+)\.(amazon)\.(?P<extension>\w+(\.\w+)?)(\/.*)?
According to regex101, the regex is correct and should pick up strings like https://www.amazon.de, but the Tampermonkey script (in both Chrome and Safari) does not activate when I visit any Amazon website.
According to the Tampermonkey documentation it should support regex in @match, so why is it not working? Did I make a mistake in my regex after all? Is this regex too complex for the @match rule?

Excluding a URL With Google Analytics Regex

I am tracking several URLs on my website and I want to count only the ones beginning with /espace-debat.
Examples:
/espace-debat/debat
/espace-debat/user/random-number
/espace-debat/debats/random-number
I am creating a goal in Analytics that excludes all the other URLs.
I am thinking about this regex:
^/(?espace-debat)
I don't know how to test it.

Have you tried escaping your expression?
^\/(espace-debat)\/?
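Since the question asks how to test the expression, one option is to try it locally against sample paths. The sketch below uses Python's `re.search` as a stand-in for the Analytics matcher, which is an assumption: the two engines may differ in details. `STRICT_PATTERN` and the sample paths are hypothetical additions, not part of the answer above.

```python
import re

# The pattern from the answer above, unchanged.
ANSWER_PATTERN = re.compile(r"^\/(espace-debat)\/?")

# Hypothetical stricter variant: requires "/" or end of string after the prefix,
# so paths such as /espace-debat-archive are not counted.
STRICT_PATTERN = re.compile(r"^/espace-debat(/|$)")


def matches(pattern: re.Pattern, path: str) -> bool:
    """Return True if the pattern is found in the path (partial match, like a search)."""
    return pattern.search(path) is not None


if __name__ == "__main__":
    for path in [
        "/espace-debat/debat",
        "/espace-debat/user/123",
        "/autre/espace-debat",
        "/espace-debat-archive",
    ]:
        print(path, matches(ANSWER_PATTERN, path), matches(STRICT_PATTERN, path))
```

The trailing `\/?` in the answer's pattern is optional, so that pattern also accepts paths that merely start with the same characters; the stricter variant is one way to rule those out if that matters.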

JMeter URL Patterns to Exclude under WorkBench: not excluding the patterns given there

The URL Patterns to Exclude setting under the JMeter WorkBench is not excluding the patterns I have given there.
Can we give direct URLs? I have a list of URLs that need to be excluded from the recorded script.
Example:
'safebrowsing.google.com'
'safebrowsing-cache.google.com'
'self-repair.mozilla.org'
I am entering these directly under the patterns to exclude. Or do I need to give them as regular expressions only?
Can someone provide more information on whether a regular expression is required or a direct URL can be provided under Requests Filtering in the WorkBench?

JMeter uses Perl5-style regular expressions for defining URL patterns to include/exclude, so you can filter out all the requests to/from Google and Mozilla domains by adding the following regular expression to the URL Patterns to Exclude input of the HTTP(S) Test Script Recorder:
.*(google|mozilla).*
See the Excluding Domains From The Load Test article for more details.

If you want certain URLs to be excluded from the recorded script, follow the patterns below and add them under "URL Patterns to Exclude"; that should work.
1. For .html : .*\.html.*
2. For .gif : .*\.gif.* etc.
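The reason a bare host name does not exclude anything while `.*\.html.*` does can be illustrated with a small sketch. It assumes that an exclude pattern has to match the whole URL, which is modelled here with Python's `re.fullmatch`; Python's regex engine is only a stand-in for JMeter's Perl5-style one, and the names `EXCLUDE_PATTERNS` and `is_excluded` are hypothetical.

```python
import re

# Patterns in the style of the answers above; the surrounding ".*" lets each
# one cover the scheme, path and query string around the interesting part.
EXCLUDE_PATTERNS = [
    r".*\.html.*",
    r".*\.gif.*",
    r".*safebrowsing\.google\.com.*",
]


def is_excluded(url: str, patterns=EXCLUDE_PATTERNS) -> bool:
    """Return True if any pattern matches the entire URL (assumed matching mode)."""
    return any(re.fullmatch(p, url) for p in patterns)


if __name__ == "__main__":
    url = "http://safebrowsing.google.com/check"
    # A bare host name does not cover "http://" or "/check", so it fails to match.
    print(is_excluded(url, ["safebrowsing.google.com"]))
    print(is_excluded(url))
    print(is_excluded("http://example.com/api/data"))
```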

Using regex in QTP to match diff URL

I am running the same QTP script in the QA and Staging environments. One test case requires me to click on a PDF document, which opens in a new window. Even though the document is the same, the domain name is different in each environment. How do I match it? Can I use a regular expression to do it?
URL of the document in QA:
http://qaapp2/InfoLibrary/ViewDocument\.aspx\?documentid=81b60525-9393-45ac-9c89-2fb1b0cb4701&documentname=ICD10\+physician\+readiness\+survey\.pdf"
URL of the document in Staging:
http://stgapp2:81/InfoLibrary/ViewDocument\.aspx\?documentid=81b60525-9393-45ac-9c89-2fb1b0cb4701&documentname=ICD10\+physician\+readiness\+survey\.pdf"
If you look at the URLs, you will notice that everything is the same except the domain name:
QA: qaapp2
STG: stgapp2:81
The only common string sequence is 'app2'.
I am unable to match the URL using a regex. I used this
[(stg)|(qa)][app2]
and it is not working. Please help.

Change the regular expression to
((stg)|(qa))app2
I use the site below to verify my regex patterns.
http://www.regular-expressions.info/vbscriptexample.html
Note: Works only in IE as it is VBScript.
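The difference between the two expressions is that square brackets define a character class (one character out of a set), while parentheses group an alternation of whole strings. A minimal sketch in Python follows, used here only as a stand-in for QTP's VBScript regex engine; the shortened URLs and the "prodapp2" host are made-up examples.

```python
import re

# Two character classes: one char from "()stg|qa", then one char from "ap2".
BROKEN = re.compile(r"[(stg)|(qa)][app2]")

# Alternation of the whole strings "stg" or "qa", followed by the literal "app2".
FIXED = re.compile(r"((stg)|(qa))app2")


def first_match(pattern: re.Pattern, text: str):
    """Return the first matched substring, or None if nothing matches."""
    found = pattern.search(text)
    return found.group(0) if found else None


if __name__ == "__main__":
    for url in [
        "http://qaapp2/InfoLibrary/ViewDocument.aspx",
        "http://stgapp2:81/InfoLibrary/ViewDocument.aspx",
        "http://prodapp2/InfoLibrary/ViewDocument.aspx",  # hypothetical host
    ]:
        print(url, first_match(BROKEN, url), first_match(FIXED, url))
```

The character-class version matches the two characters "tp" inside "http" for every URL, so it cannot tell the hosts apart, whereas the grouped version matches only "qaapp2" and "stgapp2".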

Regarding crawling urls for Google search appliance

We have a requirement where we need to crawl one particular set of URLs.
Say for example we have site abc.com. We need to crawl abc.com/test/needed -- all URL matching this pattern under "needed" folder. But we don't want to crawl rest of the URLs under abc.com/test/.
I guess this will be done using RegEx. Can anyone help me with respect to RegEx?

Going from what you said in the comment, here is a pattern to match things of the form /xyz but not things of the form /xyz/imp:
/xyz(/[^i][^m][^p].*)?|/xyz/.{0,2}
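To see what this pattern accepts and rejects, it can be run against a few sample paths. The sketch below assumes whole-string matching and uses Python's `re` module, which may not behave like the GSA's own pattern syntax. `LOOKAHEAD_VARIANT` is a hypothetical alternative that relies on negative lookahead, a feature the GSA is not confirmed here to support.

```python
import re

# The pattern from the answer above, unchanged.
ORIGINAL = re.compile(r"/xyz(/[^i][^m][^p].*)?|/xyz/.{0,2}")

# Hypothetical variant: accept /xyz and anything below it unless the next
# path segment starts with "imp".
LOOKAHEAD_VARIANT = re.compile(r"/xyz(?!/imp)(/.*)?")


def accepts(pattern: re.Pattern, path: str) -> bool:
    """Return True if the pattern matches the entire path."""
    return pattern.fullmatch(path) is not None


if __name__ == "__main__":
    for path in ["/xyz", "/xyz/abc", "/xyz/imp", "/xyz/iab"]:
        print(path, accepts(ORIGINAL, path), accepts(LOOKAHEAD_VARIANT, path))
```

Under these assumptions the original pattern also rejects paths such as /xyz/iab, because each negated class excludes its letter independently of the other two; the lookahead variant rejects only paths that begin with /xyz/imp.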

The pattern that can be added to the GSA can be:
abc.com/test/needed
or
contains:abc.com/test/needed
The thing to consider is how the GSA will get to these documents. If it can't spider to the folder, it won't find the documents.

There are 3 specifications that you are allowed to make in the GSA.
Start Crawl URLs -- these tell the GSA where to start looking for links.
Follow and crawl only URL patterns -- these tell the GSA which of the URLs found from the "Start Crawl URLs" need to be followed and indexed.
Do not crawl URLs -- these are URL patterns that match the above 2 specifications but should not be crawled.
From what has been specified in the question, I think all you need to do is put in a "Start crawl" URL of "abc.com/" and a "Follow and crawl only" specification of "abc.com/test/needed/", assuming you need no other path/folder on the site crawled.
