I am using data scrapers: Import.io & Portia.
They both allow you to define a regular expression for the crawler to abide by.
For example, given the URL https://weedmaps.com/dispensaries/pdi-medical,
how would I account for the ending "pdi-medical"?
I've looked all over and understand how to use regex in a JS environment, but I'm a little confused about what exactly I'd put in the input field on Portia/Import.io.
Something like this?
https://weedmaps.com/dispensaries//^[a-zA-Z0-9-_]+$/
For Portia, if you want your crawler to follow any URLs starting with https://weedmaps.com/dispensaries/, you can just add a crawling rule with the following regex:
^https?://weedmaps\.com/dispensaries/
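To sanity-check a rule like this before pasting it into the crawler, it can be tried against a few URLs locally. The following is a minimal Python sketch, not Portia's own code: it assumes the rule is applied as an unanchored search (as `re.search` does), the dots are escaped so they match a literal period, `SINGLE_SLUG` is a hypothetical stricter variant that accepts only one path segment such as pdi-medical, and every URL except the first is made up for illustration.

```python
import re

# Prefix rule from the answer, with the dots escaped to match literal periods.
FOLLOW_PREFIX = re.compile(r"^https?://weedmaps\.com/dispensaries/")

# Hypothetical stricter rule: exactly one slug (letters, digits, "-" or "_").
SINGLE_SLUG = re.compile(r"^https?://weedmaps\.com/dispensaries/[A-Za-z0-9_-]+/?$")

urls = [
    "https://weedmaps.com/dispensaries/pdi-medical",          # from the question
    "https://weedmaps.com/dispensaries/pdi-medical/reviews",  # made-up deeper page
    "https://weedmaps.com/deliveries/some-shop",              # made-up other section
]

for url in urls:
    # Print whether each rule would let the crawler follow this URL.
    print(url, bool(FOLLOW_PREFIX.search(url)), bool(SINGLE_SLUG.search(url)))
```

The prefix rule follows everything under /dispensaries/, while the stricter variant stops at the dispensary page itself; which one is wanted depends on how deep the crawl should go.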
So, I'm trying to crawl a website that has about 7,000 product pages, and the link structure is like this:
https://example.com/category/sub-category/numericid-name-of-the-product/
What I'm trying to achieve is to generate a URL list. The Kimono app has that option, and it actually splits the URL into sections, but I'm only offered default value, range, and custom list.
I tried putting in things like "/.+/" to match all the characters, but that does not work, and I couldn't find any help on this in the official KB.
I know that import.io had "{alpahnumeric}", for example, for different parts of the URL so that it matches them. Is there a way to accomplish that in the Kimono Labs app?
Try this regex: https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)
Note: you may need to escape some characters (namely / would be escaped as \/).
Also, I'm not familiar with KimonoLabs, so I don't know if this is what you're looking for exactly. Feel free to clarify.
Explanation
https://example.com/ matches literally
([^/]+)/ a bunch of not /s, followed by a / (this part appears twice)
([0-9]+)-([^/]+) numbers, a hyphen, then another bunch of not /s
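The breakdown above can be tried out in a few lines of Python. This is only a sketch for checking the pattern, not something the Kimono app is known to accept: the function name `parse_product_url` and the sample URLs are made up for illustration, and the dot in the host is escaped here so that it matches a literal period.

```python
import re

# Pattern from the answer, with the dot in the host escaped.
# "/" needs no escaping in Python's re module.
PRODUCT_URL = re.compile(r"https://example\.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)")


def parse_product_url(url):
    """Return (category, sub_category, numeric_id, name), or None if the URL does not fit."""
    match = PRODUCT_URL.match(url)
    return match.groups() if match else None


# Hypothetical sample URLs following the structure described in the question.
print(parse_product_url("https://example.com/tools/hammers/1234-claw-hammer/"))
print(parse_product_url("https://example.com/tools/hammers/"))
```

The second call returns None because a sub-category listing has no numeric ID segment, so only product pages get through.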
www.domain.com/home/processform/thankyou?order_id=9653&order_value=mobilebrand is the final URL of the thank-you page, with a unique ID.
With ^/thankyou$ as the regex, will this work to count the goal?
Use this:
regular expression /thankyou
as the destination goal.
If your page values in Google Analytics are formatted the default way, you'll want to use the following regex:
^/thankyou.*
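If you use a '$' at the end, it won't detect any of your URLs that have query parameters, like your example does. Also note that the leading '^' only matches when the page path itself starts with /thankyou; for a path like /home/processform/thankyou, drop the '^' or spell out the full path.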
One can use the regex below, which matches "/thankyou" or "/thankyou/" anywhere in a given URL:
regex \/thankyou\/?
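The suggestions in this thread can be compared side by side against the example URL. This is a rough Python sketch under one stated assumption: the goal regex is applied as an unanchored search against the request path plus query string (without the host name), which `re.search` imitates. Google Analytics' own regex engine may differ in details, so treat the output as a guide only.

```python
import re

# Page value as assumed here: request path plus query string, no host name.
page = "/home/processform/thankyou?order_id=9653&order_value=mobilebrand"

candidates = {
    "anchored both ends": r"^/thankyou$",
    "unanchored": r"/thankyou",
    "anchored at start": r"^/thankyou.*",
    "optional trailing slash": r"\/thankyou\/?",
}

# True where the pattern is found somewhere in the page value.
results = {label: bool(re.search(pattern, page)) for label, pattern in candidates.items()}

for label, matched in results.items():
    print(f"{label}: {matched}")
```

Under this assumption, both patterns that start with '^' fail, because the path begins with /home/processform rather than /thankyou; the two unanchored patterns match.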
I'm trying to create a regular expression for Google Analytics goals.
I need to match either of these two URL fragments:
/order/map/egw/?code=somevalue
or
/order/map/egw/
But NOT this url:
/order/map/egw/consult/
Tried this:
/order/map/egw/$ | /order/map/egw/\?
and other variations but can't get it to match properly
Fast help greatly appreciated!
How about this regular expression?
/order/map/egw/(?!consult).*
If in the future you find that there's another sub-directory that you don't want to include, you can add a new one (e.g. the sub-directory 'wrong') like so:
/order/map/egw/(?!consult|wrong).*
What about this? I don't know how strict you're trying to be but it should work for your use cases:
(?!.*consult)/order/map/egw/(\?.+)?
The negative lookahead ensures "consult" does not appear in the URL, and the rest matches the base path with an optional query string.
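Both lookahead patterns can be checked against the three URLs from the question with a short Python sketch. Python's `re` module supports lookaheads; whether the regex engine behind a given analytics tool does is not confirmed here, so the third pattern is a hypothetical alternative that avoids them by anchoring both ends. The helper name `matching_paths` is made up for this example.

```python
import re

paths = [
    "/order/map/egw/?code=somevalue",  # should match
    "/order/map/egw/",                 # should match
    "/order/map/egw/consult/",         # should NOT match
]

patterns = {
    "lookahead after prefix": r"/order/map/egw/(?!consult).*",
    "lookahead before prefix": r"(?!.*consult)/order/map/egw/(\?.+)?",
    # Hypothetical alternative without lookahead:
    # the base path, then either nothing or a query string.
    "no lookahead": r"^/order/map/egw/(\?.*)?$",
}


def matching_paths(pattern):
    """Return the sample paths the pattern matches somewhere (re.search semantics)."""
    return [p for p in paths if re.search(pattern, p)]


for label, pattern in patterns.items():
    print(label, matching_paths(pattern))
```

All three patterns accept the first two paths and reject the consult one. The anchored variant is stricter than the lookahead ones: it also rejects any other sub-directory, not just consult.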
I'm a noob when it comes to Regular Expressions. I'm using Joomla and the Advanced Module Manager to publish a module to a specific url.
I want to publish a module only to the URL /tv-show and not to /tv-show/anything-else/blahblah
I thought the way to do it was /tv-show*, but obviously not, since it still publishes to other URLs that begin with /tv-show.
I tried many variations; please tell me where I am going wrong.
Try the following
/tv-show$
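The dollar sign matches the end of the string, so nothing may follow /tv-show. In a regex, * is not a wildcard: it means "zero or more of the preceding character", which is why /tv-show* still matched the longer URLs.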
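The difference between the two patterns shows up in a quick Python check. This sketch assumes the pattern is applied as an unanchored search against the URL path, which may not be exactly how Advanced Module Manager evaluates it; the helper name `published_on` is made up for this example.

```python
import re

urls = ["/tv-show", "/tv-show/anything-else/blahblah"]


def published_on(pattern):
    """URLs on which the module would show, assuming an unanchored regex search."""
    return [u for u in urls if re.search(pattern, u)]


print(published_on(r"/tv-show*"))  # "w*" means zero or more "w", so both URLs match
print(published_on(r"/tv-show$"))  # "$" requires the URL to end right after "/tv-show"
```

If other pages could also end in /tv-show (for example under a different parent path), adding a leading ^ would pin the match to the start of the path as well.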
I am trying to build a regular expression to match any git read+write URL structure (not just GitHub), and I wanted to check whether I got the regex right. This is what I have so far:
([A-Za-z0-9]+@|http(|s)\:\/\/)([A-Za-z0-9.]+)(:|/)([A-Za-z0-9\/]+)(\.git)?
That regex matches all of the following URLs:
git@github.com:user/project.git
https://github.com/user/project.git
http://github.com/user/project.git
git@192.168.101.127:user/project.git
https://192.168.101.127/user/project.git
http://192.168.101.127/user/project.git
http://192.168.101.127/user/project
It also matches others, such as hosts without a top-level domain and single-name hosts (http://server/). Are there other URL structures that I should be conscious of? Also, is there a shorter way of writing the regex that I have?
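One way to probe for gaps is to run the regex over the listed URLs plus a few extra forms. The Python sketch below assumes the SSH-style form is written user@host:path and uses `fullmatch`, so a URL only counts if the whole string matches. The helper name `is_git_url` and the three extra URLs are made up for illustration; ssh:// and git:// are schemes that git itself accepts.

```python
import re

# The regex from the question, assuming the SSH-style form is user@host:path.
GIT_URL = re.compile(
    r"([A-Za-z0-9]+@|http(|s)\:\/\/)([A-Za-z0-9.]+)(:|/)([A-Za-z0-9\/]+)(\.git)?"
)


def is_git_url(url):
    """True only if the whole string matches, not just a leading part of it."""
    return GIT_URL.fullmatch(url) is not None


listed = [
    "git@github.com:user/project.git",
    "https://github.com/user/project.git",
    "http://github.com/user/project.git",
    "git@192.168.101.127:user/project.git",
    "https://192.168.101.127/user/project.git",
    "http://192.168.101.127/user/project.git",
    "http://192.168.101.127/user/project",
]

# Hypothetical extra cases, not from the question.
extra = [
    "https://github.com/user/my-project.git",  # hyphen in the repository name
    "ssh://git@github.com/user/project.git",   # explicit ssh:// scheme
    "git://github.com/user/project.git",       # git:// scheme
]

for url in listed + extra:
    print(url, is_git_url(url))
```

All seven listed URLs pass, while the three extra ones fail: the path character class has no hyphen, and the scheme alternatives cover only http and https. Widening the character classes and the scheme list would be the place to start if those forms matter.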
If you are using Rails/Ruby to write your program, check this out. You might be able to get some ideas from it:
http://www.simonecarletti.com/blog/2009/04/validating-the-format-of-an-url-with-rails/