Regular expression matching beginning of line OR a set of characters - regex

At first I thought this must be very easy and that I was just overlooking something, but so far, with my limited knowledge of regex, I can't figure it out.
I have a regex like [some characters]MYNAME; the actual thing is:
rx = rx + `[ ,\t,,\,,\(,=,#,\s]+(MYNAME)`
I want this regex to also detect a line that starts with MYNAME.
So the question is: is there a way to put ^ inside [] along with other things, or to OR the ^ with [some characters]?
I can't make it work in either JavaScript or golang. If there are differences related to this matter, I am interested in golang-specific solutions.

You can use alternation. Also, there are some unnecessary characters in your character class:
I don't know what those commas are supposed to do. Did you intend them to act as separators? If so, remove them.
Also, you don't need to escape ( in a character class.
Since you have added \s, you don't need to add \t and " " separately.
So, your regex can be simplified to:
"(?:[(=#\s]+|^)(MYNAME)"
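A minimal Python sketch of the alternation above, for illustration only. The re.MULTILINE flag is an assumption added here so that ^ matches at the start of every line rather than only at the start of the input; Go's regexp package offers the same behaviour through a (?m) prefix on the pattern. The names find_names and sample are hypothetical.

```python
import re

# Simplified pattern from the answer: either a run of separator characters
# or the start of a line, followed by the captured name.
# re.MULTILINE (an assumption) lets ^ match at the start of each line.
PATTERN = re.compile(r"(?:[(=#\s]+|^)(MYNAME)", re.MULTILINE)


def find_names(text):
    """Return the start offset of every captured MYNAME in text."""
    return [m.start(1) for m in PATTERN.finditer(text)]


sample = "MYNAME = 1\nfoo(MYNAME)\nxMYNAME\n"
print(find_names(sample))
```

The third line of the sample is not reported, because the x before MYNAME is neither a separator nor a line start.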

How to do a negative lookbehind within a %r<…>-delimited regexp in Ruby?

I like the %r<…> delimiters because they make it really easy to spot the beginning and end of the regex, and I don't have to escape any /. But it seems that they have an insurmountable limitation that other delimiters don't have?
Every other delimiter imaginable works fine:
/(?<!foo)/
%r{(?<!foo)}
%r[(?<!foo)]
%r|(?<!foo)|
%r/(?<!foo)/
But when I try to do this:
%r<(?<!foo)>
it gives this syntax error:
unterminated regexp meets end of file
Okay, it probably doesn't like that it's not a balanced pair, but how do you escape it such that it does like it?
Does something need to be escaped?
According to wikibooks.org:
Any single non-alpha-numeric character can be used as the delimiter,
%[including these], %?or these?, %~or even these things~.
By using this notation, the usual string delimiters " and ' can appear
in the string unescaped, but of course the new delimiter you've chosen
does need to be escaped.
Indeed, escaping is needed in these examples:
%r!(?<\!foo)!
%r?(\?<!foo)?
But if that were the only problem, then I should be able to escape it like this and have it work:
%r<(?\<!foo)>
But that yields this error:
undefined group option: /(?\<!foo)/
So maybe escaping is not needed/allowed? wikibooks.org does list %<pointy brackets> as one of the exceptions:
However, if you use
%(parentheses), %[square brackets], %{curly brackets} or
%<pointy brackets> as delimiters then those same delimiters
can appear unescaped in the string as long as they are in balanced
pairs
Is it a problem with balanced pairs?
Balanced pairs are no problem as long as you are doing something in the Regexp that requires them, like...
%r{(?<!foo{1})} # repetition quantifier
%r[(?<![foo])] # character class
%r<(?<name>foo)> # named capture group
But what if you need to insert a left-side delimiter ({, [, or <) inside the regex? Just escape it, right? Ruby seems to have no problem with escaped unbalanced delimiters most of the time...
%r{(?<!foo\{)}
%r[(?<!\[foo)]
%r<\<foo>
It's just when you try to do it in the middle of the "group options" (which I guess is what the <! characters are classified as here) following a (? that it doesn't like it:
%r<(?\<!foo)>
# undefined group option: /(?\<!foo)/
So how do you do that then and make Ruby happy? (without changing the delimiters)
Conclusion
The workaround is easy. I'll just change this particular regex to use something else, like %r{…}, instead.
But the questions remain...
Is there really no way to escape the < here?
Are there really some regular expressions that are simply impossible to write using certain delimiters like %r<…>?
Is %r<…> the only regular expression delimiter pair that has this problem (where some regular expressions are impossible to write when using it)? If you know of a similar example with %r{…}/%r[…], do share!
Version info
Not that it probably matters since this syntax probably hasn't changed, but I'm using:
⟫ ruby -v
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]
Reference:
https://ruby-doc.org/core-2.6.3/Regexp.html
% Notation
As others have mentioned, this seems like an oversight based on how this character differs from other paired boundaries.
As for "Is there really no way to escape the < here?", there is a way... but you're not going to like it:
%r<(?#{'<'}!foo)> == %r((?<!foo))
Using interpolation to insert the < character seems to work. But given that there are much better options, I would avoid it unless you were planning on splitting the regex into sections anyway...

Regex to get match up through second to last occurrence of a string

If I have the following string:
hello/everyone/good/bye/world
I want to match everything up through the second to last forward slash:
hello/everyone/good/
But the number of slashes may vary.
What would the regex be to do this? Been googling around to no avail.
Instead of a regular expression, use split and take advantage of "slicing".
t = "hello/everyone/good/bye/world"
ts = t.split("/")
# Drop the last two segments, then restore the trailing slash.
print("/".join(ts[:-2]) + "/")
This prints
hello/everyone/good/
This answer uses Python (you did not specify which language), and the print call above works in both Python 2 and 3. There are equivalent commands in some other languages. For example, Java has a String.split() method (to complicate things, it takes a regular expression). Consult the documentation for the language you are using.
Maybe try this https://regex101.com/r/bH2vI0/1:
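((\w+\/)+(?=\w+\/\w+$))
Not sure if sublime allows lookaheads though, but basically just match every word + / that is still followed by the last two segments. The lookahead has to be a positive one: with a negative lookahead such as (?!\w+\/\w+) the greedy group runs all the way to the last slash. [A-Za-z] might be better if you don't want digits and _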
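A short Python sketch comparing two ways to get the prefix, under the assumption that segments may contain any character except a slash. The function names are hypothetical. The regex variant relies on a greedy .*/ that backtracks until exactly one more slash remains to its right.

```python
import re

# Greedy ".*/" backtracks until the lookahead sees exactly one more "/"
# between the current position and the end of the string.
UP_TO_SECOND_LAST = re.compile(r"^.*/(?=[^/]*/[^/]*$)")


def prefix_by_regex(path):
    """Return everything up through the second-to-last slash, or ''."""
    m = UP_TO_SECOND_LAST.search(path)
    return m.group(0) if m else ""


def prefix_by_split(path):
    """Same result without a regex, splitting from the right."""
    parts = path.rsplit("/", 2)
    if len(parts) < 3:  # fewer than two slashes in the input
        return ""
    return parts[0] + "/"


print(prefix_by_regex("hello/everyone/good/bye/world"))
print(prefix_by_split("hello/everyone/good/bye/world"))
```

Both functions return an empty string when the input has fewer than two slashes; that choice is an assumption, since the question does not say what should happen in that case.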

Vim S&R to remove number from end of InstallShield file

I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES&gt
PATH_TO_MY_FILES1&gt
...
PATH_TO_MY_FILES22&gt
PATH_TO_MY_FILES_ELSEWHERE&gt
PATH_TO_MY_FILES_ELSEWHERE1&gt
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s:\(PATH_TO_MY_FILES\w*\)\(\d+\)&gt:\1&gt:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s:\(PATH_TO_MY_FILES\w*\)\(\d\)&gt:\1&gt:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22&gt to get replaced with PATH_TO_MY_FILES2&gt. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt
So basically, it looks like it finds PATH_TO_MY_FILES22&gt, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)&gt
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.
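For comparison outside Vim, here is a rough Python sketch of the same substitution. It is not Vim syntax: in Python's re module + needs no backslash, and & has no special meaning in the replacement string. The lazy \w*? is an addition on my part, since a greedy \w* would keep the leading digits for itself, which is the grouping problem described in the question (in Vim the lazy counterpart of * is \{-}). The function name is hypothetical.

```python
import re

# Lazy \w*? leaves every trailing digit for \d+; a greedy \w* would
# swallow all but the last digit.
TRAILING_NUMBER = re.compile(r"(PATH_TO_MY_FILES\w*?)\d+(&gt)")


def strip_trailing_numbers(text):
    """Remove the number that sits directly before the literal '&gt'."""
    return TRAILING_NUMBER.sub(r"\1\2", text)


print(strip_trailing_numbers("PATH_TO_MY_FILES22&gt"))
print(strip_trailing_numbers("PATH_TO_MY_FILES_ELSEWHERE1&gt"))
```

Entries without a trailing number are left unchanged, because \d+ requires at least one digit.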

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE... if so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*)
Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for a missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-capturing group to test the second half, and if it finds it, returns everything after the pound.
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
More on lookahead and lookbehind
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];  // everything before the first '#'
string secondString = split[1]; // text after the first '#'; throws if the string has no '#'
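The same idea sketched in Python rather than C#, covering the case where the # part is missing. Both helper names are hypothetical, and returning None for a missing fragment is an assumption. The regex is the optional-group pattern from the earlier answer, with anchors added.

```python
import re

# Optional-group pattern: group 2 is None when there is no '#'.
# re.DOTALL lets the fragment span newlines.
OPTIONAL_FRAGMENT = re.compile(r"^([^#]*)(?:#(.*))?$", re.DOTALL)


def split_fragment_regex(s):
    """Return (before, after) around the first '#', after is None if absent."""
    m = OPTIONAL_FRAGMENT.match(s)
    return m.group(1), m.group(2)


def split_fragment_partition(s):
    """Same result using str.partition instead of a regex."""
    before, sep, after = s.partition("#")
    return before, (after if sep else None)


print(split_fragment_regex("topics/install.xml#id_install"))
print(split_fragment_partition("topics/install.xml"))
```

Unlike indexing into the result of a split, neither helper fails when the input has no #.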

Regex match everything after question mark?

I have a feed in Yahoo Pipes and want to match everything after a question mark.
So far I've figured out how to match the question mark using..
\?
Now just to match everything that is after/follows the question mark.
\?(.*)
You want the content of the first capture group.
Try this:
\?(.*)
The parentheses are a capturing group that you can use to extract the part of the string you are interested in.
If the string can contain new lines you may have to use the "dot all" modifier to allow the dot to match the new line character. Whether or not you have to do this, and how to do this, depends on the language you are using. It appears that you forgot to mention the programming language you are using in your question.
Another alternative that you can use if your language supports fixed width lookbehind assertions is:
(?<=\?).*
With the positive lookbehind technique:
(?<=\?).*
(We're searching for a text preceded by a question mark here)
Input: derpderp?mystring blahbeh
Output: mystring blahbeh
Basically, (?<=\?) is a lookbehind group: it requires a literal (escaped) question mark immediately before the match, without making it part of the match.
Lookbehinds perform really well, but not all implementations support them.
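A small Python sketch of the two approaches mentioned so far, the capture group and the lookbehind. Python is only an assumption here, since the question is about Yahoo Pipes, and the function names are hypothetical. re.DOTALL is the "dot all" modifier, so the match also continues across newlines.

```python
import re


def after_question_mark(text):
    """Capture-group variant: return the text after the first '?'."""
    m = re.search(r"\?(.*)", text, re.DOTALL)
    return m.group(1) if m else None


def after_question_mark_lookbehind(text):
    """Lookbehind variant: the '?' is required but not part of the match."""
    m = re.search(r"(?<=\?).*", text, re.DOTALL)
    return m.group(0) if m else None


print(after_question_mark("derpderp?mystring blahbeh"))
print(after_question_mark_lookbehind("derpderp?mystring blahbeh"))
```

Both return None when the input contains no question mark.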
\?(.*)$
If you want to match all chars after "?" you can use a group to match any char, and you'd better use the "$" sign to indicate the end of line.
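\?(.*\n)+
With this you can get everything, even across new lines. Note that the question mark must be escaped, and that every line matched this way has to end with a newline.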
Check out this site: http://rubular.com/ Basically the site allows you to enter some example text (what you would be looking for on your site) and then as you build the regular expression it will highlight what is being matched in real time.
str.replace(/^.+?\"|^.|\".+/, '');
This can be awkward when you want to choose what else to remove between the "", and it only handles one quoted section per string. The intent is to select whatever is not in between the "" and replace it with nothing.
Even for me it is a bit confusing, but I'll try to explain it. ^.+?\" lazily matches from the start of the string up to and including the first ". The | means "or", so ^. is a fallback that matches just the first character, and \".+ matches from a " through the end of the line. Without the g flag, replace only substitutes the first match, so add g if the text after the closing " should be removed too.