What regex in Google Analytics to use for this case? - regex

I'm trying to figure out which landing page regex to use to show only URLs that have exactly two sub-folders. In the image below, I want to show just the green URLs but not the red ones, as they have 3+ subfolders. Any advice on how to do this in GA with regex?
Cheers

If you want to match a path having only two components, e.g.
/component1/component2/
Then you may use the following regex:
/[^/]+/[^/]+/
Demo
If your regex tool requires anchors, then add them:
^/[^/]+/[^/]+/$
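The anchored pattern can be sanity-checked outside GA with a short Python sketch. Python's `re` module only stands in for GA's regex engine here, so behaviour in GA may differ, and the helper name `has_two_components` and the sample paths are made up for illustration:

```python
import re

# Anchored pattern from the answer above: exactly two path components,
# each followed by a slash.
TWO_LEVELS = re.compile(r"^/[^/]+/[^/]+/$")


def has_two_components(path: str) -> bool:
    """Return True if path looks like /component1/component2/."""
    return TWO_LEVELS.search(path) is not None


# Hypothetical landing pages, not taken from the question's image.
for path in ["/blog/post-1/", "/blog/2019/post-1/", "/blog/"]:
    print(path, has_two_components(path))
```

Only the first sample path has exactly two components, so it is the only one reported as a match.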

Is this what you are looking for?
^\/[!#$&-;=?-\[\]_a-z~]+\/[!#$&-;=?-\[\]_a-z~]+\/$
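The two character classes contain all the valid URL characters. We're also forcing the regex to start with a slash and end with a slash, with a slash in between. Note that the range &-; also covers /, so the classes themselves can match extra slashes; drop / from the classes if deeper paths must be excluded.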

Related

How can I check if a string has multiple matching groups that are the same?

Currently, I am filtering out URL paths using Regex (Python). A couple of the URL paths I have come across are irrelevant and I want to detect URLs that are like this.
For example:
/ugrad/honors/index.php/policies/sao/policies/overview/step-1-course-requirements.html
/ugrad/honors/index.php/overview/sao/overview/sao/policies/noodle.html
In the examples above, you can see that policies and overview are repeated both times.
How can I design a Regex function to detect if there are 2+ matching texts anywhere in a URL path?
I have attempted something like this, but I am unsure whether it is possible to detect 2+ matching texts anywhere in the string.
My attempt: \S+(\/.+)\1\S+
Capture a slash, followed by non-slashes, followed by a slash again. Then repeat anything and backreference the capture group:
(\/[^\/]+\/).*\1
https://regex101.com/r/ygqRZc/1
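Since the question is about Python, here is a minimal sketch of that backreference pattern applied to the two example paths. The helper name `find_repeated_segment` and the third path are hypothetical additions for illustration:

```python
import re

# Capture "/segment/" and require the same text again later in the path.
REPEATED_SEGMENT = re.compile(r"(/[^/]+/).*\1")


def find_repeated_segment(path: str):
    """Return the leftmost repeated "/segment/" in path, or None."""
    match = REPEATED_SEGMENT.search(path)
    return match.group(1) if match else None


paths = [
    "/ugrad/honors/index.php/policies/sao/policies/overview/step-1-course-requirements.html",
    "/ugrad/honors/index.php/overview/sao/overview/sao/policies/noodle.html",
    "/ugrad/honors/index.php/policies/overview.html",  # made-up path with no repeat
]
for p in paths:
    print(find_repeated_segment(p))
```

Because the capture includes both surrounding slashes, a segment that repeats immediately (such as /a/a/) shares a slash with its neighbour and would not be detected by this pattern.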

Find multiple '/' forward slashes in string of URLs for sitemap

We are trying to clean up our site map as our Magento store has created duplicate pages. I want to use a regular expression to select, or invert select, all of the pages which are linked to the top level URL.
For example, we want to find the first line-
/site/product<<
/site/category/product/
/site/category/product
Is there any way to find only two instances of a forward slash in the whole string, which are not next to each other?
Thank you for your help in advance.
I've tried something like this
(.*(?<!\/)$)
Your pattern (.*(?<!\/)$) matches any character except a newline up to the end of the string, and then asserts that the character to the left is not a forward slash. That gives you the first and the third lines.
Instead, you could anchor at the start of the string with ^, match a forward slash followed by one or more characters that are neither a forward slash nor a newline ([^/\n]+), repeat that once more, and then assert the end of the string with $:
^/[^/\n]+/[^/\n]+$
Regex demo
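To see the effect on the three sample lines, here is a small Python sketch. It assumes the sitemap entries are available as one newline-separated string, which is why the multiline flag is used; the variable names are illustrative:

```python
import re

# Two slashes, each followed by one or more non-slash characters,
# with nothing else on the line.
TOP_LEVEL = re.compile(r"^/[^/\n]+/[^/\n]+$", re.MULTILINE)

# The three sample lines from the question, one per line.
sitemap = """/site/product
/site/category/product/
/site/category/product"""

top_level_urls = TOP_LEVEL.findall(sitemap)
print(top_level_urls)
```

Only the first line is returned, because the other two contain a third slash.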
I would like to provide a quick answer to this problem in case it helps anyone else in the future. Our sitemap had too many duplicate URLs due to an incorrect setup on our Magento store. Instead of submitting a sitemap with 20,000+ top-level URLs, we decided to manually remove the top-level items ourselves.
Not ideal at all.
We tweaked the sitemap PHP generation code to pull top-level URLs as site/category/id/###. Then we used Notepad++ to bookmark and delete these lines accordingly.

Google Analytics Content Grouping by Extraction - extract 3rd level subdirectories

I've been going round in circles with this one.
I'd like to set up a content grouping in Google Analytics that groups by a 3rd level subdirectory.
I can grab the second level successfully with the following regex
`/destinations/(.*?)/`
where the url is
mydomain.com/destinations/europe
mydomain.com/destinations/alaska
I get content groups of europe and alaska.
However, I also then want a grouping of the next level, for example
mydomain.com/destinations/europe/southampton
mydomain.com/destinations/europe/portugal
mydomain.com/destinations/alaska/somealaskanplace
to give me groupings of southampton, portugal and somealaskanplace
This means I effectively need to ignore whatever is in the second level, and this is what I'm struggling with.
So far I have
`/destinations\/.*\/(.*?)/$`
but that's given me the domain name as a grouping
Can anyone help? It would be very much appreciated.
You need to turn the multiline flag on.
Check this:
/.*?\/(destinations)\/(\w+)\/(\w+)/gm
Demo on Regex101:
https://regex101.com/r/2wvRIx/2
I don't think you need the / delimiters. GA may be interpreting your last /$ as being a slash and then end-of-string. Try making it just /destinations/.*/(.*?)$ (note that GA regex does not require you to escape slashes).
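The delimiter-free pattern from the last suggestion can be tried against the example URLs with a short Python sketch. Python's `re` is only a stand-in for GA's engine, and the helper name `third_level` is made up for illustration:

```python
import re

# Pattern from the last suggestion, written without the / delimiters.
THIRD_LEVEL = re.compile(r"/destinations/.*/(.*?)$")


def third_level(url: str):
    """Return the captured third-level name, or None if there is no match."""
    match = THIRD_LEVEL.search(url)
    return match.group(1) if match else None


urls = [
    "mydomain.com/destinations/europe",
    "mydomain.com/destinations/europe/southampton",
    "mydomain.com/destinations/europe/portugal",
    "mydomain.com/destinations/alaska/somealaskanplace",
]
for url in urls:
    print(url, "->", third_level(url))
```

The second-level URL does not match at all, while the other three yield southampton, portugal and somealaskanplace. One caveat with this pattern: a URL that ends in a trailing slash produces an empty capture, so a stricter group such as ([^/]+) may be worth considering.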

Google Analytics Regex excluding a certain url in a sub folder

Currently on my GA account I have the following URLs from our website tracked:
domain/contact-us/
domain/contact-us/global-contact-list.aspx
domain/contact-us/contactlist.aspx
The first two are from our new website which we want to track, the last one is from our old website (traffic is still being tracked but we do not want to use this)
I tried using a regex filter on this as the following:
(^/contact-us/global-contact-list\.aspx)|(^/contact-us/)
Reading up, I believe this looks for matches of exactly:
/contact-us/global-contact-list or /contact us/ but would disallow /contact-us/contactlist/
For some reason, the last one is still coming through. Can someone explain why this may be happening?
You need to add a negative lookahead or an end-of-string anchor:
(^/contact-us/global-contact-list\.aspx)|(^/contact-us/$)
or
(^/contact-us/global-contact-list\.aspx)|(^/contact-us/(?!contactlist/))
This way, you will exclude /contact-us/contactlist/ from matching.
Have a look at the Demo 1 and Demo 2.
BTW, /contact us/ will not pass since (^/contact-us/) only allows a hyphen. You should add a space, e.g. (^/contact-us/global-contact-list\.aspx)|(^/contact[-\s]us/$).
Also, (^/contact-us/global-contact-list\.aspx) won't match /contact-us/global-contact-list because it needs to match .aspx.
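The difference between the original filter and the end-anchored version can be checked with a small Python sketch. It assumes the filter is applied to the path part of the three URLs from the question, and it uses Python's `re` rather than GA's own engine; the helper name `matching_paths` is hypothetical:

```python
import re

# Original filter: the second alternative is anchored only at the start,
# so it matches every path that begins with /contact-us/.
ORIGINAL = re.compile(r"(^/contact-us/global-contact-list\.aspx)|(^/contact-us/)")

# End-anchored version from the answer above.
ANCHORED = re.compile(r"(^/contact-us/global-contact-list\.aspx)|(^/contact-us/$)")

# Path parts of the three tracked URLs from the question.
paths = [
    "/contact-us/",
    "/contact-us/global-contact-list.aspx",
    "/contact-us/contactlist.aspx",
]


def matching_paths(pattern, candidates):
    """Return the candidates that the pattern matches anywhere."""
    return [p for p in candidates if pattern.search(p)]


print(matching_paths(ORIGINAL, paths))
print(matching_paths(ANCHORED, paths))
```

The original filter lets all three paths through, while the anchored one keeps only the first two.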

Regex for excluding URL

I'm working with an email company that has a feature where they spider your site in order to provide custom content. I have the ability to have the spider ignore URLs based on the regex patterns I provide.
For this system a pattern starts and ends with a "/".
What I'm trying to do is ignore http://www.website.com/2011/10 BUT allow http://www.website.com/2011/10/title-of-page.html
I would have thought the pattern below would work since it does not have a trailing slash but no luck.
Any ideas?
/http:\/\/www\.website\.com\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]/
Your regex matches a part of the URL, so you need to tell it not to allow a slash to follow it:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9](?!\/)/
If you want to also avoid other partial matches like in http://www.website.com/2011/100, then an additional word boundary might help:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9]\b(?!\/)/
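A quick Python sketch shows how the two patterns behave on the URLs mentioned above. The surrounding / delimiters are dropped because Python does not use them, and whether the spider's engine supports lookaheads is an assumption; the helper name `is_ignored` is made up for illustration:

```python
import re

# Lookahead only: the two-digit month must not be followed by a slash.
NO_SLASH_AFTER = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9][0-9](?!/)")

# Lookahead plus word boundary: the month must also end after two digits.
WITH_BOUNDARY = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9][0-9]\b(?!/)")

archive = "http://www.website.com/2011/10"
page = "http://www.website.com/2011/10/title-of-page.html"
odd = "http://www.website.com/2011/100"


def is_ignored(pattern, url: str) -> bool:
    """Return True if the ignore pattern matches somewhere in the URL."""
    return pattern.search(url) is not None


for url in (archive, page, odd):
    print(url, is_ignored(NO_SLASH_AFTER, url), is_ignored(WITH_BOUNDARY, url))
```

Both patterns ignore the archive URL and allow the article page; only the version with \b also leaves the /2011/100 URL alone.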
It depends on the regexp engine, but you can probably use either $ (if the URL is tokenised beforehand) or a match for whitespace and delimiters.