Regex: matching string with 2 specific characters

I'm working in Google Analytics and trying to use the RegEx advanced filter option to display page names that contain two /, but not three /. The text string within the first section will always be products; however, after the second / it is random.
For example,
I want to include these page name strings:
/products/skis
/products/snowboards
/products/skates
I want to exclude these page name strings:
/products/skis/mens
/products/snowboards/womens
/products/skates/red
Again, the products part is consistent...but the second text section is random.
Appreciate any help -- thanks!

One possibility would be this:
^\/products\/[a-zA-Z]+$
This would match the first slash, followed by 'products', followed by a second slash, and then one or more letters (no digits, hyphens, or other special characters). Nothing else may come after.

To match page names starting with /products/ and not containing a third slash, you can use this regex:
^\/products\/[^\/]+$
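
As a quick check of the pattern above, here is a minimal sketch in Python. Python's re module is used only for illustration; Google Analytics has its own regex dialect, and the names PAGE_RE and paths are made up for this example. The forward slashes are left unescaped because Python does not require escaping them.

```python
import re

# Same idea as the answer's pattern: /products/ followed by one or more
# characters that are not a slash, and then the end of the string.
PAGE_RE = re.compile(r"^/products/[^/]+$")

# Sample page names taken from the question.
paths = [
    "/products/skis",
    "/products/snowboards",
    "/products/skates",
    "/products/skis/mens",
    "/products/snowboards/womens",
    "/products/skates/red",
]

included = [p for p in paths if PAGE_RE.search(p)]
print(included)
```

Only the three two-slash page names end up in included; the paths with a third slash are rejected by the [^/]+$ part.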

Related

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get all the contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do I match only whole words and not substrings, so that it ignores words that merely contain those words (the first match, for example)?
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do I match only whole words and not substrings, so that it ignores words that merely contain those words (the first match, for example)?
However, it seems like there are additional issues with your regex.
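
The effect of the word-boundary anchor can be checked in isolation with a small Python sketch. This is a simplified pattern containing only the \b(...)\b part of the regex above, and the sample strings other than "Tax Invoice/Statement" are invented for illustration.

```python
import re

# Only the whole-word part of the answer's regex, without the
# surrounding line-matching logic.
WORD_RE = re.compile(r"\b(Street|St|Drive|Dr)\b")

def find_street_word(text):
    """Return the matched street keyword, or None if there is none."""
    match = WORD_RE.search(text)
    return match.group(1) if match else None

# "Statement" contains "St", but not as a whole word, so it is ignored.
print(find_street_word("Tax Invoice/Statement"))
# A standalone "St" is still found.
print(find_street_word("42 Main St"))
```

Note that \b only rules out substrings of longer words. A standalone token such as DR (as in `163.66 DR` when matching case-insensitively) is still a whole word, which is one of the remaining issues a fuller solution would have to handle.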

regex: substitute character in captured group

EDIT
In a regex, can a matching capturing group be replaced with the same match altered substituting a character with another?
ORIGINAL QUESTION
I'm converting a list of products into a CSV text file. Every line in the list has: number name[ description] price in this format:
1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18
The resulting format must also include a slug (with - instead of spaces) as the second field:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product-name-two-2:second description, maybe:80
3 THIRD PROD:third-prod-3::18
The regex I'm using is this:
(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)
and substitution string is:
\1 \2:\L$2-\1:\3:\4
This way my result is:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product name two-2:second description, maybe:80
3 THIRD PROD:third prod-3::18
What I'm missing is the hyphen separator I need in the second field, that is, group \2 with '-' instead of ' '.
Is it possible with a single regex, or should I go for a second pass?
(for now i'm using Sublime text editor)
Thanx.
I don't think doing this in a single pass is reasonable, and it may not even be possible. To replace the spaces with hyphens, you need either multiple passes or continuous matching (\G), and both lose the context of the capturing groups you need to rearrange your structure. So after your first replace, I would search for (?m)(?:^[^:\n]*:|\G(?!^))[^: \n]*\K followed by a single literal space, and replace with -. I'm not sure whether Sublime enables the multiline modifier by default; if it does, you can drop the (?m).
The answer might be different if you were using a programming language that supports callback functions for regex replace operations, where you could do the space-to-hyphen replacement inside the callback.
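
For illustration, here is a minimal sketch of that callback approach in Python, assuming the input format from the question. It reuses the question's regex (lightly tidied); the names LINE_RE, to_csv and convert are made up for this example, and this is not something Sublime's search-and-replace can do by itself.

```python
import re

# The question's pattern: number, upper-case name, optional lower-case
# description, colon, price.
LINE_RE = re.compile(r"(\d+) ([A-Z ]+?) ?([a-z ,]*):(\d+)")

def to_csv(match):
    """Build the output line; the slug is derived inside the callback."""
    number, name, description, price = match.groups()
    slug = name.lower().replace(" ", "-") + "-" + number
    return f"{number} {name}:{slug}:{description}:{price}"

def convert(text):
    # re.sub accepts a function as the replacement and calls it per match.
    return LINE_RE.sub(to_csv, text)

source = """1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18"""

print(convert(source))
```

Because the callback receives the match object, the name group can be lower-cased and hyphenated while the other groups are still available, so a second pass is not needed.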

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution to retrieve both the delimiter and the value that follows it and, though there seem to be posts that handle this, the code found in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
The idea is that the first group contains the delimiters and the second group the content. I assume the content consists of word characters (letters, digits, and underscore). You then have to grab the content of every capturing group; for each match, only one of the two groups is set.
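
A minimal Python sketch of how the two groups might be read, using the example string from the question. The helper name tokenize and the "delimiter"/"value" labels are made up for illustration.

```python
import re

# Group 1 captures a delimiter, group 2 captures a run of word characters.
TOKEN_RE = re.compile(r"(=/|=>|&,1)|(\w+)")

def tokenize(text):
    """Return (kind, text) pairs in the order they appear in the input."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.group(1) is not None:
            tokens.append(("delimiter", match.group(1)))
        else:
            tokens.append(("value", match.group(2)))
    return tokens

for kind, token in tokenize("=/A9999XYZ=>100T0479&,1Blah"):
    print(kind, token)
```

Each match sets exactly one of the two groups, so checking group 1 for None is enough to tell delimiters and values apart.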
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
Code the delimiters longest first, e.g. "=," before "=", otherwise the alternation, which tries its alternatives left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.
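
The delimiter/value pairing can be tried out with a short Python sketch. It uses the regex above unchanged; the function name split_pairs and the second sample string are invented for illustration.

```python
import re

# Group 1: a delimiter (longest alternatives first).
# Group 2: the value, consumed character by character until the next
# character would start another delimiter.
PAIR_RE = re.compile(r"(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)")

def split_pairs(text):
    """Return a list of (delimiter, value) tuples."""
    return PAIR_RE.findall(text)

print(split_pairs("=/A9999XYZ=>100T0479&,1Blah"))
# "=," is listed before "=", so the comma stays with the delimiter.
print(split_pairs("=,foo=bar"))
```

As written, the value part needs at least two characters (one inside the repeated group and one for the final dot), so single-character values would need a small adjustment.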

Regex parsing using Kimono Labs

I am attempting to use software supplied by Kimonolabs to get a list of articles and their links from a web site. The problem I am having is that a string I have scraped from the web site has a date along with some text that I am unable to separate from the date.
Kimono uses this syntax for a regex:
/^()(.*?)()$/
first bracket => to the left of the required content
second bracket => this is what should get extracted
third bracket => to the right of the required content
Specifically the website I am trying to scrape is:
http://www.yashinquesada.com/futbol-nacional
Here is an example of the line I am trying to parse (I only want the date):
<p class="nspInfo nspInfo1 tleft fnone">Enero 08, 2016 <a href="/futbol-nacional/28-la-primera" >La Primera</a></p>
My attempts to parse this line returned no results. I have tried reading through regex reference materials, but they are pretty complicated for me.
Any suggestions are appreciated!
The regular expressions Kimono expects need to have three groups (a group is a pair of parentheses). That means you always need to keep this structure:
/^()(.*?)()$/
This is Kimono's default, where the first group is empty, the second contains all the text (. matches any character, *? means "any number of times, but as few as possible"), and the third is empty again.
You can adapt that arrangement to cut off unwanted text at the beginning and at the end - the value that ends up in your data will always be whatever the middle group matches.
I suspect the values you currently get look like this:
Enero 07, 2016 La Primera
so what you actually want to do is cut off text at the end.
Let's make the second and third groups more specific. We know the date always contains the year, which is four digits (\d\d\d\d or \d{4}) - and actually the match should end there. That's fairly easy:
/^()(.*?\d{4})(.*)$/
So, in English:
first group stays empty, no cut-off at the beginning
second group matches any character, but stops after matching four digits
third group matches the remainder of the value; Kimono will throw away that substring
Play around with the expression over at regex101: https://regex101.com/r/rM3tX0/1
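
The same three-group expression can also be checked outside Kimono. Below is a minimal Python sketch, assuming the scraped value is plain text like the example above; the names KIMONO_RE and extract_date are made up, and only the middle group is kept, mirroring what the answer describes Kimono doing.

```python
import re

# Three groups: nothing cut at the start, lazy match up to the first
# four-digit run, and the remainder that gets discarded.
KIMONO_RE = re.compile(r"^()(.*?\d{4})(.*)$")

def extract_date(value):
    """Return the middle group, or None if there is no four-digit year."""
    match = KIMONO_RE.match(value)
    return match.group(2) if match else None

print(extract_date("Enero 08, 2016 La Primera"))
```

Because .*? is lazy, the second group stops at the first place where four digits in a row can be matched, which here is the year.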

Matching a word if preceding text is not "class=' "

I'm trying to create a regex for a search that will look at the following code and return only the ids and not the classes:
1 id="contact"
2 class="contact"
3 #contact
4 .contact
I want to return contact from the 1st and 3rd lines and NOT 2nd and 4th lines.
This is for a search across multiple files to avoid going through each one individually and checking whether it needs changing or not.
Is this possible?
Here you go:
/(?:#|id=")(\w+)"?/g
This matches strings beginning with either # or id=" followed by word characters. You'll probably want to enhance it to handle dashes, I'd bet (\w already covers underscores).
In this case, the first group is non-capturing, and the ID text will be your first capture group $1.
UPDATE
this one:
(?:(?<=id=")|(?<=#))(contact)
uses a positive lookbehind to find your prefixes and matches just the string "contact". This will NOT work in JavaScript engines that lack lookbehind support (it was only added to the language in ES2018), so an online tester may reject it, but it will work in a text editor or CLI tool like ack.
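
The lookbehind version can be tried in Python, whose re module supports fixed-width lookbehinds. This is a minimal sketch using the four sample lines from the question (without their line numbers); the names ID_RE, lines and hits are made up for illustration.

```python
import re

# Match "contact" only when it is directly preceded by id=" or by #.
ID_RE = re.compile(r'(?:(?<=id=")|(?<=#))(contact)')

lines = [
    'id="contact"',
    'class="contact"',
    "#contact",
    ".contact",
]

# Keep only the lines where the ID form is found.
hits = [line for line in lines if ID_RE.search(line)]
print(hits)
```

Only the id= and # lines are reported. The class= and .contact lines are skipped because neither lookbehind succeeds at the position just before contact.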