How to set rule using regex in scrapy for extracting urls?


I want to crawl pages related to Disney on the Bloomberg website. The URLs follow this pattern:
"http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
So I have written the rule below for it:
rules = [
Rule(SgmlLinkExtractor(allow=('/news/*/disney*',)), follow=True),
]
But the above rule isn't working as I want: the crawled pages in my output are not related to Disney. Please help me fix this rule.

In a regex, /news/* matches /news followed by any number of / characters, because * is a quantifier on the preceding character, not a wildcard.
The correct regex would be:
/news/.*/disney

You likely need the following regex:
/news/[^/]+/disney.*
which, with the slashes escaped, looks like
\/news\/[^\/]+\/disney.*
This way [^/]+ matches a single path segment: it stops at the next / instead of matching anything.
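To compare the three patterns from this thread without running a crawl, here is a minimal sketch using only Python's re module. It assumes the link extractor applies its allow patterns with search semantics (a match anywhere in the URL), which re.search mirrors. The first URL is the one from the question; the other two are hypothetical.

```python
import re

# Patterns discussed in this thread; Scrapy itself is not needed to compare them.
PATTERNS = {
    "original": r"/news/*/disney*",
    "any_depth": r"/news/.*/disney",
    "one_segment": r"/news/[^/]+/disney.*",
}

# The first URL comes from the question; the other two are made up for illustration.
URLS = [
    "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney",
    "http://bloomberg.com/news/2013-07-08/markets/disney-stock",
    "http://bloomberg.com/news/2013-07-08/apple-earnings",
]


def matching_urls(pattern, urls):
    """Return the URLs in which the pattern is found anywhere (search semantics)."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        print(name, matching_urls(pattern, URLS))
```

Run against these samples, the original pattern matches none of the URLs, /news/.*/disney accepts a "disney" segment at any depth below /news/, and /news/[^/]+/disney.* accepts it only directly after the date segment.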

Related

Multiple slashes in URL replacement through regex

I am trying to create a PCRE regex that will sanitize URLs with multiple slashes, like the following:
https://www.domin.com/test1/////test2/somemoretests_67142
https://www.domin.com/test1/test2/somemoretests_67142/////
https://www.domin.com/test1/test2///somemoretests_67142
I want to replace the match with https://\2\4 so that the resulting link looks like: https://www.domin.com/test1/test2/somemoretests_67142
I have been struggling with it for the past couple of days, so any regex guru help is more than welcome :)
I have tried the following and more:
(http|https):\/\/(.*)(\/\/+)(.*)
(http|https):\/\/(.*)(\/\/){2,}(.*)
(http|https):\/\/(.*)(\/\/{2})(.*)
I am going to use these in Akamai to sanitize our URLs through a Cloudlet.

You can try:
(?<!https:\/)(?<!http:\/)(\/+$|(?<=\/)\/+)
Then substitute the first group (which is the whole match) with an empty string.
This will produce this output:
https://www.domin.com/test1/test2/somemoretests_67142
https://www.domin.com/test1/test2/somemoretests_67142
https://www.domin.com/test1/test2/somemoretests_67142
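The substitution can be checked quickly in Python before moving it to PCRE. This is a minimal sketch: the slash escapes are dropped because Python does not need them, and collapse_slashes is a hypothetical helper name. The lookbehinds are fixed-width, so Python's re module accepts them.

```python
import re

# The answer's pattern without the \/ escapes, which Python does not require.
# The two negative lookbehinds protect the "//" that follows the scheme.
EXTRA_SLASHES = re.compile(r"(?<!https:/)(?<!http:/)(/+$|(?<=/)/+)")


def collapse_slashes(url):
    """Remove trailing slashes and every slash that directly follows another slash."""
    return EXTRA_SLASHES.sub("", url)


if __name__ == "__main__":
    samples = [
        "https://www.domin.com/test1/////test2/somemoretests_67142",
        "https://www.domin.com/test1/test2/somemoretests_67142/////",
        "https://www.domin.com/test1/test2///somemoretests_67142",
    ]
    for sample in samples:
        print(collapse_slashes(sample))
```

Note that the pattern also strips a single trailing slash, which may or may not be what you want for directory-style URLs.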

Why is my Regex include filter not working (google analytics)?

In Google Analytics, I have created the following include filter:
^https:\/\/(my\..*|accounts\..*|maya\..*\/reports\/(mymessages|favorites)|maya\..*\/account\/notification|info\..*\/(heb|eng)\/management\/generalpages\/pages\/(personalfolder|registration|change_password|userssearchindex|security%20search)\.aspx).*
The goal is to include only URLs that contain the following addresses:
https://my.tase.co.il
https://accounts.tase.co.il
https://maya.tase.co.il/reports/mymessages
https://maya.tase.co.il/reports/favorites
https://maya.tase.co.il/account/notification
https://info.tase.co.il/Management/GeneralPages/Pages/PersonalFolder.aspx
https://info.tase.co.il/Management/GeneralPages/Pages/Registration.aspx
https://info.tase.co.il/Management/GeneralPages/Pages/Change_Password.aspx
https://info.tase.co.il/Management/GeneralPages/Pages/UsersSearchIndex.aspx
https://info.tase.co.il/Management/GeneralPages/Pages/Security%20Search.aspx
But for some reason I can't get it to work.
What am I doing wrong?
Thanks for your help!

The pattern does not match the links that start with info. because it requires info\..*\/(heb|eng), and none of the example URLs contains a /heb/ or /eng/ segment.
You can either remove that part or use a pattern that exactly matches the start of those URLs:
https:\/\/(?:(?:accounts|my)\.tase\.co\.il|maya\.tase\.co\.il\/(?:reports\/(?:mymessages|favorites)|account\/notification)|info\.tase\.co\.il\/Management\/GeneralPages\/Pages\/(?:PersonalFolder|Registration|Change_Password|UsersSearchIndex|Security%20Search)\.aspx).*
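Both patterns can be checked against the listed URLs outside Google Analytics. The sketch below uses Python's re module and assumes the info URLs contain a slash before Management, as the pattern above expects; unmatched is a hypothetical helper. re.IGNORECASE is passed for the original pattern only to show that letter case is not what blocks the match.

```python
import re

ORIGINAL_PATTERN = (
    r"^https:\/\/(my\..*|accounts\..*|maya\..*\/reports\/(mymessages|favorites)"
    r"|maya\..*\/account\/notification"
    r"|info\..*\/(heb|eng)\/management\/generalpages\/pages\/"
    r"(personalfolder|registration|change_password|userssearchindex|security%20search)\.aspx).*"
)

ANSWER_PATTERN = (
    r"https:\/\/(?:(?:accounts|my)\.tase\.co\.il"
    r"|maya\.tase\.co\.il\/(?:reports\/(?:mymessages|favorites)|account\/notification)"
    r"|info\.tase\.co\.il\/Management\/GeneralPages\/Pages\/"
    r"(?:PersonalFolder|Registration|Change_Password|UsersSearchIndex|Security%20Search)\.aspx).*"
)

URLS = [
    "https://my.tase.co.il",
    "https://accounts.tase.co.il",
    "https://maya.tase.co.il/reports/mymessages",
    "https://maya.tase.co.il/reports/favorites",
    "https://maya.tase.co.il/account/notification",
    "https://info.tase.co.il/Management/GeneralPages/Pages/PersonalFolder.aspx",
    "https://info.tase.co.il/Management/GeneralPages/Pages/Registration.aspx",
    "https://info.tase.co.il/Management/GeneralPages/Pages/Change_Password.aspx",
    "https://info.tase.co.il/Management/GeneralPages/Pages/UsersSearchIndex.aspx",
    "https://info.tase.co.il/Management/GeneralPages/Pages/Security%20Search.aspx",
]


def unmatched(pattern, urls, flags=0):
    """Return the URLs that the pattern fails to match from the start."""
    return [url for url in urls if not re.match(pattern, url, flags)]


if __name__ == "__main__":
    print("original misses:", unmatched(ORIGINAL_PATTERN, URLS, re.IGNORECASE))
    print("answer misses:", unmatched(ANSWER_PATTERN, URLS))
```

The original pattern misses exactly the five info. URLs, even when case is ignored, while the rewritten pattern matches all ten.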

How to fix regex url pattern

I need to fix my url pattern:
/^((http(s)?(\:\/\/)){1}(www\.)?([\w\-\.\/])*(\.[a-zA-Z]{2,4}\/?)[^\\\/#?])[^\s\b\n|]*[^\.,;:\?\!\#\^\$ -]/
I thought this regex was OK, but it does not work for URLs like https://xx.xx (without www). The 'www' part should be optional ((www.)?). Where is the bug?

The problem is not in the (www\.)? part but in the parts after it.
Take a look at the [^\\\/#?] and the [^\.,;:\?\!\#\^\$ -] parts.
Each of them requires one more character after the TLD. So a valid URL would be https://xx.xx, plus one character that is none of \ / # ?, plus one character that is none of . , ; : ? ! # ^ $ space or -. The URL becomes valid if you add those, for example https://xx.xx11.
I advise you not to write your own URL regex, because it misses a lot of cases!
For example, TLDs like .amsterdam are valid, but {2,4} rejects them. And why are you capturing so many groups?
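The behaviour is easy to reproduce. Here is a minimal sketch in Python with the pattern copied from the question minus its / delimiters; is_match is a hypothetical helper. The third sample is an added observation: a www URL passes only because www is consumed as the host and the remaining characters satisfy the two trailing classes.

```python
import re

# The question's pattern without the surrounding / delimiters.
URL_PATTERN = re.compile(
    r"^((http(s)?(\:\/\/)){1}(www\.)?([\w\-\.\/])*(\.[a-zA-Z]{2,4}\/?)[^\\\/#?])"
    r"[^\s\b\n|]*[^\.,;:\?\!\#\^\$ -]"
)


def is_match(url):
    """Return True when the pattern matches from the start of the URL."""
    return URL_PATTERN.match(url) is not None


if __name__ == "__main__":
    for url in ("https://xx.xx", "https://xx.xx11", "https://www.xx.xx"):
        print(url, is_match(url))
```

Only https://xx.xx is rejected, because nothing follows the TLD to satisfy the two trailing character classes.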

Regex - analytics filter

I'm trying to filter some URLs using gapi.client.analytics. What I want to achieve is a regex filter that covers a lot of options. The regex should keep only URLs that have this structure:
subdomain1.domain.com/some-post/
My problem is that I have some other URLs that I don't know how to exclude, like:
subdomain1.domain.com/p/code/
subdomain1.domain.com/
subdomain1.domain.com/some-author/some-name/
subdomain2.domain.com/some-post/
subdomain2.domain.com/p/code/
I tried to use ga:hostname=#subdomain1.domain.com to get links that contain only subdomain1.
I also tried ga:hostname=~^[^/]+/?[^/]+/?$ to get only those that have 2 / in the URL.
Unfortunately I couldn't manage to do what I want.

The following regex should match URLs with exactly one trailing directory:
^[a-zA-Z0-9_-]+\.domain\.com\/[a-zA-Z0-9_-]+\/$
or
^[a-zA-Z0-9_\-\.]+\/[a-zA-Z0-9_-]+\/$
to match every domain.
You can test Google Analytics regexes on analyticsmarket.com
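Here is a minimal sketch that runs the first pattern over the URLs from the question in Python. SUBDOMAIN1_ONLY is a hypothetical variation, not part of the answer, that pins the host to subdomain1; keep is likewise a made-up helper name.

```python
import re

# First pattern from the answer; slashes need no escaping in Python.
ONE_SEGMENT = re.compile(r"^[a-zA-Z0-9_-]+\.domain\.com/[a-zA-Z0-9_-]+/$")

# Hypothetical variation: the same pattern with the host fixed to subdomain1.
SUBDOMAIN1_ONLY = re.compile(r"^subdomain1\.domain\.com/[a-zA-Z0-9_-]+/$")

# URLs listed in the question.
URLS = [
    "subdomain1.domain.com/some-post/",
    "subdomain1.domain.com/p/code/",
    "subdomain1.domain.com/",
    "subdomain1.domain.com/some-author/some-name/",
    "subdomain2.domain.com/some-post/",
    "subdomain2.domain.com/p/code/",
]


def keep(regex, urls):
    """Return the URLs that the compiled pattern matches."""
    return [url for url in urls if regex.search(url)]


if __name__ == "__main__":
    print(keep(ONE_SEGMENT, URLS))
    print(keep(SUBDOMAIN1_ONLY, URLS))
```

The answer's pattern keeps the some-post URL on both subdomains; if only subdomain1 should survive, the host part has to be spelled out as in the variation.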

How to set up regex in nutch for filtering URL of techcrunch?

I want to crawl the TechCrunch pages published after 1 January 2013. The website follows the pattern
http://www.techcrunch.com/YYYY/MM/DD
So my question is how to set up the regex in Nutch's urlfilter so that I crawl only the pages I want.
+^http://www.techcrunch.com/2013/dd/dd/([a-z0-9\-A-Z]*\/)*

I don't know Nutch, but have you tried:
+^http://www.techcrunch.com/2013/[0-9]{2}/[0-9]{2}.*$
or
+^http://www.techcrunch.com/2013/[0-9]+/[0-9]+.*$

The following expressions will match the URLs you need:
Without groups
http:\/\/www.techcrunch.com\/\d{4}\/\d{2}\/\d{2}\/\w+
With groups
http:\/\/www.techcrunch.com\/(\d{4})\/(\d{2})\/(\d{2})\/(\w+)
I didn't put anchors (^$), but you can put them if you need them for the filtering.
Try them to see if any of them work.
I don't know how Nutch works, but here are a couple of suggestions about your regex that may apply: the / in the regexp may need to be escaped, and the dd parts should be \d\d so that they match two digits.
About setting up the regex, check out this answer to see if it helps you.
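The difference between dd and a digit class can be checked outside Nutch. Below is a minimal sketch in Python, under the assumption that the leading + is the filter file's include marker rather than part of the regex, so it is left out. The sample URLs are hypothetical, and these simple patterns should behave the same in Java's regex engine.

```python
import re

# The asker's attempt and the first suggested fix, both without the leading "+".
ATTEMPT = re.compile(r"^http://www.techcrunch.com/2013/dd/dd/([a-z0-9\-A-Z]*\/)*")
SUGGESTED = re.compile(r"^http://www.techcrunch.com/2013/[0-9]{2}/[0-9]{2}.*$")

# Hypothetical sample URLs, for illustration only.
URLS = [
    "http://www.techcrunch.com/2013/07/08/some-post/",
    "http://www.techcrunch.com/2012/12/31/old-post/",
    "http://www.techcrunch.com/2013/dd/dd/",
]


def accepted(regex, urls):
    """Return the URLs the filter pattern would accept."""
    return [url for url in urls if regex.search(url)]


if __name__ == "__main__":
    print("attempt:", accepted(ATTEMPT, URLS))
    print("suggested:", accepted(SUGGESTED, URLS))
```

The attempt accepts only the URL that literally contains /dd/dd/, while the suggested pattern accepts the 2013 article and rejects the 2012 one.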
