I have my goal funnel set up and this is the regex for one of the stages: ^/shop/(.*)
This will match pages such as /shop/collections/art.html but when I look at the goal funnel, it says people are dropping out by going to pages like /shop/collections/art.html?p=2. Notice the ?p=2 is the only difference here.
I tried to do it as ^/shop/((.|\?)*) but I'm not sure that's fixing it.
How do I fix this?
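As a quick sanity check, here is a minimal Python sketch comparing the two patterns from the question. Python's re module stands in for the funnel's matcher, which is an assumption, and the helper name matches is made up for illustration. In plain regex terms "." already matches "?", so both patterns accept the paginated URL; if the funnel still reports drop-offs, the cause may lie in how the step is matched rather than in the pattern text.

```python
import re

# The original funnel pattern and the attempted variant from the question.
original = re.compile(r"^/shop/(.*)")
variant = re.compile(r"^/shop/((.|\?)*)")

urls = [
    "/shop/collections/art.html",
    "/shop/collections/art.html?p=2",
]

def matches(pattern, url):
    """Return True if the pattern is found in the URL (hypothetical helper)."""
    return pattern.search(url) is not None

for url in urls:
    # "." already matches "?", so both patterns accept both URLs.
    print(url, matches(original, url), matches(variant, url))
```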
I am trying to use the REGEXP_EXTRACT custom field to pull a portion of my URL using the page dimension in Google Data Studio and cannot figure it out. The page url structure is similar to this -
website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage
I'd like to only extract the middle section "great_practiceinfo_part2". I've tried many different formulas, but nothing seems to work. Does the page dimension work in this scenario? Any help would be much appreciated.
Thanks
It worked fine for me in Google Sheets with =REGEXEXTRACT(A3,B3), using your string website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage in A3 and the regex \/([^\/]*?)\.aspx\? in B3. I'm guessing you just need to learn more about how to build your regex pattern string.
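The same pattern can be tried outside Sheets. Below is a minimal Python sketch, assuming Python's re module behaves like the Data Studio engine for this particular pattern; the function name extract_form_name is hypothetical.

```python
import re

# Sample page value from the question.
page = "website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage"

# Capture the path segment that directly precedes ".aspx?".
pattern = re.compile(r"/([^/]*?)\.aspx\?")

def extract_form_name(url):
    """Return the captured segment, or None when the URL does not match."""
    m = pattern.search(url)
    return m.group(1) if m else None

print(extract_form_name(page))  # great_practiceinfo_part2
```

Because [^/] cannot cross a slash, the capture group can only hold the final segment before .aspx, not the webforms directory.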
I am dealing with old hacked WordPress sites that have injected spam links on images.
I have access to the database and would like to remove links that look like this:
<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>
The text inside the href attribute varies (it might be for Cialis or any other product), but the rest doesn't. I want to remove the entire link, so that the result is a single space.
I don't know regex, so I would appreciate the help. I've tried online generators but they don't seem to be working.
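One possible approach, sketched in Python under the assumption that the post content has been exported (or is being processed row by row) and that every injected link has exactly the shape shown above. The names SPAM_LINK and strip_spam_links are made up for illustration, and it would be wise to back up the database before writing anything back.

```python
import re

# Matches the injected anchor: fixed style attribute, any href value, "." as link text.
SPAM_LINK = re.compile(r'<a style="text-decoration:none" href="[^"]*">\.</a>')

def strip_spam_links(html):
    """Replace every injected link with a single space."""
    return SPAM_LINK.sub(" ", html)

sample = 'Before<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>after'
print(strip_spam_links(sample))  # Before after
```

The href value is matched with [^"]* so that any product path is covered, while ordinary links with different attributes or link text are left alone.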
I get a lot of visits to my site's main page from different keywords. Example URL formats are as follows:
/?keyword= train hard
/
/?keyword=
etc., etc.
To sum up all visits to my main page regardless of the keyword, I wanted to use a regex like ^/$. However, that didn't work. What regex should I apply to get the proper result?
What RegEx should I apply to see other sections of my website in a similar way? E.g.
/booking?keyword= or /section?keyword=any ?
Thanks in advance!
For the main page you can try: ^\/(?:\?keyword=.*)?$
Look here for example: https://regex101.com/r/5uCFun/2
For other pages similarly: ^\/booking(?:\?keyword=.*)?$
Example here: https://regex101.com/r/1dAaHL/2
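For a quick local check, the two patterns can be exercised in Python. This is only a sketch: it assumes the reporting tool's regex engine treats these patterns the same way, and section_pattern is a hypothetical helper that generalises the idea to any section name.

```python
import re

# Patterns from the answer; "/" needs no escaping in Python, so "\/" is written as "/".
HOME = re.compile(r"^/(?:\?keyword=.*)?$")
BOOKING = re.compile(r"^/booking(?:\?keyword=.*)?$")

def section_pattern(name):
    """Hypothetical helper: build the same pattern for any section name."""
    return re.compile(r"^/" + re.escape(name) + r"(?:\?keyword=.*)?$")

def is_match(pattern, path):
    """Return True if the pattern matches the path."""
    return pattern.search(path) is not None

for path in ["/", "/?keyword=", "/?keyword= train hard", "/booking?keyword="]:
    print(path, is_match(HOME, path), is_match(BOOKING, path))
```

The optional non-capturing group is what lets a single pattern cover both the bare path and the path with a keyword query string.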
OK, I asked this already, but I guess I didn't ask it the way Stack Overflow expects. Hopefully I will have more luck this time and get an answer.
I am trying to run nutch to crawl this site: http://www.tigerdirect.com/
I want it to crawl that site and all sublinks.
The problem is that it's not working. In my regex file I tried a couple of things, but none of them worked:
+^http://([a-z0-9]*\.)*tigerdirect.com/
+^http://tigerdirect.com/([a-z0-9]*\.)*
My urls.txt is:
http://tigerdirect.com
Basically what I am trying to accomplish is to crawl all the product pages on their website so I can create a search engine (I am using solr) of electronic products. Eventually I want to crawl bestbuy.com, newegg.com and other sites as well.
BTW, I followed the tutorial from here: http://wiki.apache.org/nutch/NutchTutorial and I am using the script mentioned in section 3.3 (after fixing a bug it had).
I have a background in Java, Android, and Bash, so this is a little new to me. I used to do regex in Perl 5 years ago, but that is all forgotten.
Thanks!
From your comments I see that you have crawled something before, and this is why your Nutch starts to crawl Wikipedia.
When you crawl something with Nutch, it records some metadata in a table (if you use HBase, it is a table named webpage). When you finish a crawl and start a new one, that table is scanned, and if a record's metadata says "this record can be fetched again because its next fetch time has passed", Nutch fetches those URLs along with your new URLs.
So if you want only http://www.tigerdirect.com/ crawled on your system, you have to clean up that table first. If you use HBase, start the shell:
./bin/hbase shell
and disable table:
disable 'webpage'
and finally drop it:
drop 'webpage'
I could have truncated the table instead, but I dropped it.
Next, put this into your seed.txt:
http://www.tigerdirect.com/
Then open regex-urlfilter.txt, which is located at:
nutch/runtime/local/conf
and write this line into it:
+^http://([a-z0-9]*\.)*www.tigerdirect.com/([a-z0-9]*\.)*
Put this line in place of the existing +. line.
The pattern also allows subdomains to be crawled; whether you want that is up to you.
After that you can send the results to Solr to index them and search on them. I have tried it and it works. You may still see some errors on the Nutch side, but that is a separate topic.
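To see which URLs the rule above would let through, here is a minimal Python sketch. It assumes the filter looks for the pattern in each URL the way re.search does, and it drops the leading +, which is Nutch's accept marker rather than part of the regex. The helper name accepted and the sample URLs are made up for illustration.

```python
import re

# The rule from regex-urlfilter.txt without the leading "+" accept marker.
rule = re.compile(r"^http://([a-z0-9]*\.)*www.tigerdirect.com/([a-z0-9]*\.)*")

def accepted(url):
    """Return True if the rule would accept this URL (hypothetical helper)."""
    return rule.search(url) is not None

samples = [
    "http://www.tigerdirect.com/",
    "http://www.tigerdirect.com/applications/category/",
    "http://shop.www.tigerdirect.com/",
    "http://tigerdirect.com/",
    "http://en.wikipedia.org/",
]
for url in samples:
    print(url, accepted(url))
```

Note that the rule requires the www. prefix, so a seed such as http://tigerdirect.com/ would be filtered out; the seed and the rule need to agree on the host name.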
You've got a / at the end of both of your regexes but your URL doesn't.
http://tigerdirect.com/ will match, http://tigerdirect.com will not.
+^http://tigerdirect.com/([a-z0-9]*\.)*
Try moving that trailing slash inside the parens:
+^http://tigerdirect.com(/[a-z0-9]*\.)*
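The effect of the trailing slash can be checked with a short Python sketch. As before, the leading + is left out because it is Nutch syntax, and the assumption is that Python's re module is close enough to the Java regex engine for these simple patterns.

```python
import re

# Original rule (slash required after the host) and the suggested rewrite.
with_slash = re.compile(r"^http://tigerdirect.com/([a-z0-9]*\.)*")
slash_inside = re.compile(r"^http://tigerdirect.com(/[a-z0-9]*\.)*")

seed = "http://tigerdirect.com"

# The seed from urls.txt has no trailing slash, so the original rule rejects it.
print(with_slash.search(seed) is not None)        # False
print(slash_inside.search(seed) is not None)      # True
print(with_slash.search(seed + "/") is not None)  # True
```

An alternative that keeps the original rule unchanged is to add the trailing slash to the seed URL instead.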
I'm trying to build a funnel for pages with dynamic URLs. My regex-fu is terrible. I'm trying to see how users do on one of our wizards. The URLs I care about all have each project's name in them.
/projects/<PROJECT_NAME>/wizard_steps/1
/projects/<PROJECT_NAME>/wizard_steps/2
/projects/<PROJECT_NAME>/wizard_steps/3
So I think I need to do something like this in order to allow for these dynamic URLs.
/projects/?.*$/wizard_steps/1
/projects/?.*$/wizard_steps/2
/projects/?.*$/wizard_steps/3
Does this look correct? Any guidance would be deeply appreciated.
Try this regular expression:
/projects/([^/]+)/wizard_steps/.*
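Since a funnel needs one pattern per step, the expression above can be narrowed so that each step matches only its own URL. The following Python sketch illustrates the idea; step_pattern is a hypothetical helper, and the anchors assume the funnel sees the bare path with no query string.

```python
import re

# The general pattern from the answer: any project name, any wizard step.
any_step = re.compile(r"/projects/([^/]+)/wizard_steps/.*")

def step_pattern(step):
    """Hypothetical helper: pattern for exactly one wizard step."""
    return re.compile(rf"^/projects/[^/]+/wizard_steps/{step}$")

url = "/projects/acme/wizard_steps/2"
print(any_step.search(url).group(1))             # acme
print(step_pattern(2).search(url) is not None)   # True
print(step_pattern(1).search(url) is not None)   # False
```

Using [^/]+ for the project name keeps the match from spilling across path segments, which the .* in the question's attempt would allow.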