Regex: using alternatives - regex

Let's say I would like to get all the 'href' values from HTML.
I could run a regex like this on the content:
a[\s]+href[\s]*=("|')(.)+("|')
which would match
a href="something"
OR
a href = 'something' // quotes, spaces ...
which is OK; but with ("|') I get too many groups captured which is something I do not want.
How does one use alternative in regex without capturing groups as well?
The question could also be stated as: how do I delimit the alternatives to match (where they start and stop)? I used parentheses since that is all that worked...
(I know that the given regex is not perfect or very good, I'm just trying to figure this alternating with two values thing since it is not perfectly clear to me)
Thanks for any tips

Use non-capturing groups, like this: (?:"|'), the key part being the ?: at the beginning. They act as a group for the alternation but do not create a separate capture.
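As a rough illustration of the difference, here is a minimal Python sketch. The sample HTML string is made up, and the pattern is a slightly adjusted version of the one in the question (an assumed \s* after the = sign and a lazy .+? so that each match stops at the first closing quote); it is not meant as a robust HTML parser.

```python
import re

# Hypothetical sample input, one link per quoting style.
html = """<a href="one.html">1</a> <a href = 'two.html'>2</a>"""

# Capturing groups around the quote alternatives: findall returns a
# tuple per match, including the quote characters themselves.
capturing = re.findall(r"""a\s+href\s*=\s*("|')(.+?)("|')""", html)

# Non-capturing groups (?:...) still delimit the alternation but capture
# nothing, so findall returns only the remaining group: the href value.
non_capturing = re.findall(r"""a\s+href\s*=\s*(?:"|')(.+?)(?:"|')""", html)

print(capturing)
print(non_capturing)
```

With the non-capturing version, the only group left is the one holding the value, which is what the question asks for.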

Related

Regex Conditional Matching in One Capture Group

I have a string that may come in the form:
"filename.first_order.png"
"filename.second_order.png"
"filename.png"
"filename.(jpg|tif|etc)"
I need to match the first part of the string containing the name, and the extension - however, if the string is a first/second order type, I need to match "first_order"/"second_order" as the second group, and not "png", and I can't get those two conditions to co-exist in one capture group. Example matches:
imageondisk.first_order.png -> [imageondisk, first_order]
anotherfile.png -> [anotherfile, png]
meetingminutes.jpeg -> [meetingminutes, jpeg]
I feel like I've used all sorts of combinations of lookaheads, lookbehinds and ?s, which must look like a desperate, uneducated mess, but whatever I do, I can never get a result where they don't conflict when I join them together. The combined pattern would look something like
(.+)\.(tif|jpg|<png when not preceded by first/second_order>|<first/second_order, ignoring the .png on the end>)
Except I just went down a frustrating rabbit hole of non-capture groups and lookarounds that seemed to end in the same place, and I feel like I knew less regex than before.
Help would be hugely appreciated.
You could use this regex, which captures the filename in group 1, any first/second order string in group 2 and the extension in group 3:
^([^.]+)(?:\.(.+))?\.(png|jpg|tif)$
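Since the question wants a two-element result, the three groups still have to be combined after matching. A minimal Python sketch of that step, assuming a hypothetical helper named split_name, might look like this. Note that the answer's extension list does not include jpeg, so the third sample from the question is not matched unless the list is extended.

```python
import re

# Pattern from the answer: name, optional middle part, extension.
pattern = re.compile(r"^([^.]+)(?:\.(.+))?\.(png|jpg|tif)$")

def split_name(filename):
    """Return [name, middle-or-extension], or None if there is no match."""
    m = pattern.match(filename)
    if m is None:
        return None
    name, middle, ext = m.groups()
    # Prefer the middle part (e.g. first_order) when it is present,
    # otherwise fall back to the extension.
    return [name, middle if middle is not None else ext]

print(split_name("imageondisk.first_order.png"))
print(split_name("anotherfile.png"))
print(split_name("meetingminutes.jpeg"))  # jpeg is not in the extension list
```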

Regex capture into group everything from string except part of string

I'm trying to create a regex which will capture everything from a string except for specific parts of it. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a contiguous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is easily solved by replacing the unwanted words with empty strings, like this:
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
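For a language without the s///g syntax, the same substitution can be sketched in Python with re.sub. The helper name strip_words is an assumption for illustration; the pattern is the \b variant from above.

```python
import re

# The \b variant of the substitution pattern from the answer.
pattern = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")

def strip_words(s):
    """Remove 'production' and 'public' together with one adjacent hyphen."""
    return pattern.sub("", s)

print(strip_words("california-public-local-card-production"))
print(strip_words("production-nevada-public"))
print(strip_words("publication-nevada"))  # left untouched thanks to \b
```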
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times and combine the group 1 values yourself to retrieve the result.
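One possible way to do that repeated matching is sketched below in Python. Python's re module only supports atomic groups (?>...) from version 3.11, so this sketch swaps in a plain non-capturing group, which should behave the same for this pattern. The helper names kept_pieces and merge, and the hyphen clean-up step, are assumptions added for illustration.

```python
import re

# Non-capturing group used in place of the atomic group from the answer.
pattern = re.compile(r"([\s\S]*?)(?:production|public)")

def kept_pieces(s):
    """Collect group 1 of every match: the text before each unwanted word."""
    return [m.group(1) for m in pattern.finditer(s)]

def merge(s):
    # Join the pieces, then tidy the hyphens left behind at the seams.
    # Limitation: text after the last unwanted word is not part of any
    # match, so it is lost by this approach.
    joined = "".join(kept_pieces(s))
    return re.sub(r"-+", "-", joined).strip("-")

print(kept_pieces("california-public-local-card-production"))
print(merge("california-public-local-card-production"))
print(merge("production-nevada-public"))
```

Because of the limitation noted in the comment, the replace-based approach is usually the simpler choice.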

RegEx substract text from inside

I have an example string:
*DataFromAdHoc(cbgv)
I would like to extract by RegEx:
DataFromAdHoc
So far I have figured something like that:
^[^#][^\(]+
But unfortunately it does not give the result I want. Do you have any idea why it's not working?
The regex you tried ^[^#][^\(]+ would match:
From the beginning of the string, it should not be a # ^[^#]
Then match until you encounter a parenthesis (I think you don't have to escape the parenthesis in a character class) [^\(]+
So this would match *DataFromAdHoc, including the *, because it is not a #.
What you could do is capture this part [^\(]+ in a group, like ([^(]+)
Then your regex would look like:
^[^#]([^(]+)
And the DataFromAdHoc would be in group 1.
Use ^\*(\w+)\(\w+\)$
It just gets everything between the * and the stuff in brackets.
Your answer may depend on which language you're running your regex in, please include that in your question.
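Assuming Python as the language, the original attempt and both suggested patterns can be compared side by side. This is only a sketch on the single sample string from the question.

```python
import re

s = "*DataFromAdHoc(cbgv)"

# Original attempt: [^#] consumes the '*', so the whole match includes it.
attempt = re.match(r"^[^#][^\(]+", s).group(0)

# First suggestion: same pattern, but the wanted part is in group 1.
first = re.match(r"^[^#]([^(]+)", s).group(1)

# Second suggestion: literal '*', then the name, then the parenthesised part.
second = re.match(r"^\*(\w+)\(\w+\)$", s).group(1)

print(attempt)
print(first)
print(second)
```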

VIM regex - Attempting to fix a JSON file in which the property names aren't quoted

Basically, I'd like to add quotes immediately before and after all property names.
propertyName: {
anotherProperty: "ksdfjslkdfjklsjf",
someOtherProperty: "aklsjdfljsfdkj"
}
I've tried different variations on the following, all to no avail:
:%s/A-Za-z:/"A-Zaz-z"
Any help with this would be greatly appreciated. Also, I'd like to become more savvy with regular expressions. What's the best way to go about this?
Try using captures:
%s/\([A-Za-z]\+\)\ze:/"\1"/g
The result will be:
"propertyName": {
"anotherProperty": "ksdfjslkdfjklsjf",
"someOtherProperty": "aklsjdfljsfdkj"
}
\(regex\) is a capture group. Everything that is matched inside it can be reused via \n, where n is the group index. In my example I have only one capture group, so I use \1 later.
\ze - indicates the end of the match, that way I preserve the colon :.
UPDATE
Actually, @lcd047's solution is better and shorter. \w is preferable for property names, as they may contain digits and underscores:
%s/\w\+\ze:/"&"/
\w is equal to using [0-9A-Za-z_]
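Outside Vim, the same idea can be sketched in Python: a lookahead (?=:) plays the role of \ze, and \g<0> plays the role of &. This is a rough equivalent rather than a full JSON fixer; for instance, it would also quote a word that sits directly before a colon inside a string value.

```python
import re

text = """propertyName: {
anotherProperty: "ksdfjslkdfjklsjf",
someOtherProperty: "aklsjdfljsfdkj"
}"""

# (?=:) is a lookahead: like \ze in Vim, it keeps the colon out of the match.
# \g<0> inserts the whole match, like & in Vim.
quoted = re.sub(r"\w+(?=:)", r'"\g<0>"', text)

print(quoted)
```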

Regex - find all instances of words that begin with # but do not contain 'administrator'

I am having a hard time getting my head around this regex. What I am trying to do is as follows:
Match any occurrence of words that begin with #. So, for example, if the code finds the following tags #jon, #james, #jill, then it should hide the text.
But if the code finds occurrences of the following tag: #ADMINISTRATOR, then it should display the text
In addition, if the code finds no occurrences of any words tagged with #, it should also display the text.
Essentially, I want to hide any comments that are hashed tagged with a user name other than ADMINISTRATOR.
So far, I have the following code:
if (mb_ereg_match(".*(#[^ADMINISTRATOR]){1,}.*", $comment))
{
$hideComment = true;
}else
{
$hideComment = false;
}
The above code works for the most part, except for when the text being searched contains any one of the following:
#A, #AD, #ADM, #ADMI, #ADMIN, etc.
then the code does not hide the comment, which is not what I want. I only want an exact match to '#ADMINISTRATOR' to display the comments. Plus, any comment that contains no tags should also be displayed.
Any idea what I am doing wrong?
This is a negative lookahead based regex that will work for you:
(?i)#(?!ADMINISTRATOR)\w+
I've not used whatever program you're using to write your regex, but the syntax in general isn't doing what you think it is. When you use a set of [], you are saying that what lies within is a class of characters. Your regular expression states I'm looking for something that follows a #, but that something doesn't begin with an A, or any of the following characters.
What you want to use is another grouping. You can use () instead of [] to represent a specific sequence of characters. However, as you may notice, () is also what you use to capture part of your regex. Thus, you'll want a group that does not capture. A non-capturing group looks like this: (?:ADMINISTRATOR), and its negative-lookahead counterpart, which rejects the word instead of matching it, looks like this: (?!ADMINISTRATOR)
All put together, your call might look something like this:
mb_ereg_match(".*#(?!ADMINISTRATOR)\w+", $comment)
A character class (the [] construct) in a regex will always match a single character, whether negated or not. [ADMINISTRATOR] will match either an A, D, M and so forth. [^ADMINISTRATOR] will match anything that is not an A, D, M, etc.
If you want a regex that does not have a given string, I'd suggest using a negative lookahead instead, as anubhava suggested.
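To make the contrast concrete, here is a small Python sketch of the hide/show decision using the lookahead pattern from the first answer. The function name should_hide and the sample comments are made up for illustration. As written, the pattern also treats longer tags that merely start with ADMINISTRATOR as allowed.

```python
import re

# Pattern from the answer; (?i) makes it case-insensitive.
OTHER_TAG = re.compile(r"(?i)#(?!ADMINISTRATOR)\w+")

def should_hide(comment):
    """Hide when the comment has a #tag that does not start with ADMINISTRATOR."""
    return OTHER_TAG.search(comment) is not None

# The negated character class from the question rejects single characters,
# so '#ADMIN' is not recognised as a foreign tag (no match is found).
class_attempt = re.search(r"#[^ADMINISTRATOR]", "#ADMIN please look")

print(should_hide("#jon please look"))
print(should_hide("#ADMIN please look"))
print(should_hide("#ADMINISTRATOR please look"))
print(should_hide("no tags here"))
print(class_attempt)
```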