Match anything.c but not anything.in.c - regex

I'm trying to write a regex that matches a.c, hello.c, etc.c, but not a.in.c, hello.in.c, etc.in.c.
Here's my regex: https://regex101.com/r/jC8nB0/21
regex101 reports that the negative lookahead won't match what I specified, that is, .in.c. I couldn't work out where to put the part that matches .c; I tried it both inside and outside the parentheses.
I've read "Regex: match everything but specific pattern", but it covers matching everything except something, whereas I need to match one rule except where another rule applies.

This worked for me.
.*(?<!(\.in))\.c
https://www.regular-expressions.info/lookaround.html
*Edited due to good information from zzxyz

This is actually a bit complicated given unknown input. The following isn't perfect, but it avoids .cpp files, and deals with strings that don't contain filenames, or longer strings that do.
\b\S+(?<!\.in)\.c\b
https://regex101.com/r/jC8nB0/286
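As a rough check, the second pattern can be tried in Python, whose re module accepts a fixed-width lookbehind such as (?<!\.in). The helper names, file names and sample sentence below are made up for illustration.

```python
import re

# Pattern from the answer above: a run of non-space characters ending in ".c",
# where the text immediately before ".c" is not ".in".
C_FILE = re.compile(r"\b\S+(?<!\.in)\.c\b")


def is_plain_c_file(name: str) -> bool:
    """Return True if the whole name matches (e.g. "a.c" but not "a.in.c")."""
    return C_FILE.fullmatch(name) is not None


def find_c_files(text: str) -> list[str]:
    """Return every matching file name found inside a longer string."""
    return C_FILE.findall(text)


if __name__ == "__main__":
    for name in ["a.c", "hello.c", "a.in.c", "hello.in.c", "main.cpp"]:
        print(name, is_plain_c_file(name))
    print(find_c_files("compile foo.c and bar.in.c"))
```

The trailing \b is what rejects main.cpp: after ".c" the next character is a word character, so there is no word boundary.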

Related

RegEx to match acronyms

I am trying to write a regular expression that will match values such as U.S., D.C., U.S.A., etc.
Here is what I have so far -
\b([a-zA-Z]\.){2,}+
Note how this expression matches but does not include the last letter in the acronym.
Can anyone help explain what I am missing here?
SOLUTION
I'm posting the solution here in case this helps anyone.
\b(?:[a-zA-Z]\.){2,}
It seems as if a non-capturing group is required here.
Try (?:[a-zA-Z]\.){2,}
?: (non-capturing group) is there because you want to omit capturing the last iteration of the repeated group.
For example, without ?:, 'U.S.A.' will yield a group match 'A.', which you are not interested in.
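The effect of the capturing group is easy to see in Python, where re.findall returns the group's contents when the pattern has one group and the whole match when it has none. This is only a sketch; the sample sentence is invented.

```python
import re

text = "He moved from the U.S.A. to D.C. in May."

# Capturing group: findall reports only the last repetition of the group.
capturing = re.findall(r"\b([a-zA-Z]\.){2,}", text)

# Non-capturing group: findall reports the whole match.
non_capturing = re.findall(r"\b(?:[a-zA-Z]\.){2,}", text)

print(capturing)      # ['A.', 'C.']
print(non_capturing)  # ['U.S.A.', 'D.C.']
```

In both cases the regex engine matches the same text; only what gets reported differs.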
None of these proposed solutions do what yours does - make sure that there are at least 2 letters in the acronym. Also, yours works on http://rubular.com/ . This is probably some issue with the regex implementation - to be fair, all of the matches that you got were valid acronyms. To fix this, you could either:
Make sure there's a space or EOF succeeding your expression ((?=\s|$) in ruby at least)
Surround your regex with ^ and $ to make sure it catches the whole string. You'd have to split the whole string on spaces to get matches with this though.
I prefer the former solution - to do this you'd have:
\b([a-zA-Z]\.){2,}(?=\s|$)
Edit: I've realized this doesn't actually work with other punctuation in the string, and a couple of other edge cases. This is super ugly, but I think it should be good enough:
(?<=\s|^)((?:[a-zA-Z]\.){2,})(?=[[:punct:]]?(?:\s|$))
This assumes that you've got this [[:punct:]] character class, and allows for 0-1 punctuation marks after an acronym that won't be captured. I've also fixed it up so that there's a single capture group that gets the whole acronym. Check out validation at http://rubular.com/r/lmr0qERLDh
Bonus: you now get to make this super confusing to anyone reading it.
This should work:
/([a-zA-Z]\.)+/g
I have slightly modified the solution above:
\b(?:[a-zA-Z]+\.){2,}
to enable capturing acronyms containing more than one letter between the dots, like in 'GHQ.AFP.X.Y'

Vim S&R to remove number from end of InstallShield file

I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES&gt
PATH_TO_MY_FILES1&gt
...
PATH_TO_MY_FILES22&gt
PATH_TO_MY_FILES_ELSEWHERE&gt
PATH_TO_MY_FILES_ELSEWHERE1&gt
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s\(PATH_TO_MY_FILES\w*\)\(\d+\)&gt:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s\(PATH_TO_MY_FILES\w*\)\(\d\)&gt:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22&gt to get replaced with PATH_TO_MY_FILES2&gt. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt
So basically, it looks like it finds PATH_TO_MY_FILES22&gt, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)&gt
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.
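For comparison, here is a minimal Python sketch of the same clean-up. It is not Vim syntax: in Python's re module + needs no escaping and & has no special meaning in the replacement. The lazy \w*? is an assumption added here so that the whole trailing number, not just its last digit, ends up outside group 1; the function name is hypothetical.

```python
import re

# Group 1 is the path name; \w*? is lazy so the trailing digits are all
# consumed by \d+ rather than being swallowed into the group.
TRAILING_NUMBER = re.compile(r"(PATH_TO_MY_FILES\w*?)\d+&gt")


def strip_trailing_number(line: str) -> str:
    """Drop the number that sits directly before the literal "&gt"."""
    # In Python the "&" in the replacement is a plain character.
    return TRAILING_NUMBER.sub(r"\1&gt", line)


if __name__ == "__main__":
    for line in [
        "PATH_TO_MY_FILES&gt",
        "PATH_TO_MY_FILES22&gt",
        "PATH_TO_MY_FILES_ELSEWHERE1&gt",
    ]:
        print(strip_trailing_number(line))
```

With a greedy \w* the group would keep all but the last digit, which is the (PATH_TO_MY_FILES2)(2) split described in the question.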

Parse with Regex without trailing characters

How can I parse the text below so that I extract just
To: User <test#test.com>
and
To: <test#test.com>
When I try to parse the text below with
/To:.*<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>/mi
It grabs
Message-ID <CC2E81A5.6B9%test#test.com>,
which I don't want in my answer.
I have tried using $ and \z and neither work. What am I doing wrong?
Information to parse
To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>
To:
<test#test.com>
This is my parsing information in Rubular http://rubular.com/r/DQMQC4TQLV
Since you haven't specified exactly what your tool/language is, assumptions must be made.
In general, regex quantifiers are greedy: they match the longest possible string. Your pattern starts off with .*, which means that you're going to match the longest possible string that ENDS WITH the remainder of your pattern <[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>, which was matched with <CC2E81A5.6B9%test#test.com> from the Message-ID.
Both Apalala's and nhahtdh's comments give you something to try. Avoid the all-inclusive .* at the start and use something that's a bit more specific: match leading spaces, or match anything EXCEPT the first part of what you're really interested in.
You need to make the wildcard match non greedy by adding a question mark after it:
To:.*?<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>
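The difference between .* and .*? can be sketched in Python. The sample line below is invented (two bracketed addresses on one line) so that the greedy and lazy versions visibly disagree; it is not the poster's data.

```python
import re

ADDRESS = r"<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>"

# Greedy: .* runs to the end of the line and backtracks to the LAST address.
greedy = re.compile(r"To:.*" + ADDRESS, re.IGNORECASE)

# Lazy: .*? grows one character at a time and stops at the FIRST address.
lazy = re.compile(r"To:.*?" + ADDRESS, re.IGNORECASE)

# Hypothetical input with two bracketed addresses on the same line.
line = "To: User <test#test.com> Cc: <other#test.com>"

print(greedy.search(line).group(0))
print(lazy.search(line).group(0))
```

The greedy pattern returns the whole line, while the lazy one returns only "To: User <test#test.com>".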

Regex to include some files but with one exception

I would like a regex that includes all filenames with a certain ending, e.g. ".err", but not if the filename starts with, say, "test". In other words, include "*.err" files but not "test-whatever.err" files.
I have found that
(?!test.*\.err$).*\.err
excludes the test*.err files and that
.*\.err
includes all the *.err files, but I need them both in the same expression.
Also, the fact that the ".err" can be written as ".ERR" or ".Err" must be taken into consideration for this regex to work properly for me.
All thoughts and ideas are appreciated!
Regards
Rickard
Use this
^(?i)(?!test).*\.err$
See it here online on Regexr
The important parts, that are different to yours:
Use anchors. ^ and $ are anchoring your pattern to the start and to the end of the string
(?i) makes it "ignorecase", so that err will also match "ERR" or "ErR" and test will also match "Test" and TEST ...
You didn't give the language, but these features should work in most flavours.
How about this one:
^(?!test).*\.err$
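A quick way to try the anchored pattern is a small Python sketch. It passes re.IGNORECASE as a flag instead of writing (?i) after the ^, because Python expects global inline flags at the very start of a pattern. The function name and file names are hypothetical.

```python
import re

# ^ and $ anchor the pattern; (?!test) rejects names that start with "test".
# IGNORECASE covers ".ERR" / ".Err" and also "Test" / "TEST".
ERR_FILE = re.compile(r"^(?!test).*\.err$", re.IGNORECASE)


def is_wanted_err_file(name: str) -> bool:
    """Return True for *.err files whose name does not start with "test"."""
    return ERR_FILE.match(name) is not None


if __name__ == "__main__":
    for name in ["build.err", "RUN.ERR", "test-whatever.err", "Test1.Err", "errors.txt"]:
        print(name, is_wanted_err_file(name))
```

Because the lookahead sits right after ^, a name such as mytest.err is still accepted; only a leading "test" is excluded.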

regex to find instance of a word or phrase -- except if that word or phrase is in braces

First, a disclaimer. I know a little about regex's but I'm no expert. They seem to be something that I really need twice a year so they just don't stay "on top" of my brain.
The situation: I'd like to write a regex to match a certain word, let's call it "Ostrich". Easy. Except Ostrich can sometimes appear inside of a curly brace. If it's inside of a curly brace it's not a match. The trick here is that there can be spaces inside the curly braces. Also the text is typically inside of a paragraph.
This should match:
I have an Ostrich.
This should not match:
My Emu went to the {Ostrich Race Name}.
This should be a match:
My Ostrich went to the {Ostrich Race Name}.
This should not be a match:
My Emu went to the {Race Ostrich Place}. My Emu went to the {Race Place Ostrich}.
It seems like this is possible with a regex, but I sure don't see it.
I'll offer an alternative solution to doing this, which is a bit more robust (not using regex assertions).
First, remove all the bracketed items, using a regex like {[^}]+} (use replace to change it to an empty string).
Now you can just search for Ostrich (using regex or simple string matching, depending on your needs).
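The two steps above can be sketched in Python as follows. The braces are escaped here for clarity, and the function name is made up; the sketch assumes braces are not nested.

```python
import re

# Step 1 pattern: an opening brace, anything up to the next closing brace.
BRACED = re.compile(r"\{[^}]+\}")


def contains_outside_braces(text: str, word: str = "Ostrich") -> bool:
    """Remove every {...} span, then do a plain substring search."""
    stripped = BRACED.sub("", text)
    return word in stripped


if __name__ == "__main__":
    samples = [
        "I have an Ostrich.",
        "My Emu went to the {Ostrich Race Name}.",
        "My Ostrich went to the {Ostrich Race Name}.",
        "My Emu went to the {Race Ostrich Place}. My Emu went to the {Race Place Ostrich}.",
    ]
    for sample in samples:
        print(contains_outside_braces(sample), sample)
```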
While regular expressions can certainly be written to do what you ask, they're probably not the best tool for this particular type of thing.
One major problem with regular expressions is that they're very good at pattern matching for things that are there, but not so much when you start adding except into the mix.
Regular expressions are not stateful enough to handle this properly without a lot of work, so I would try to find a different path towards a solution.
A character tokenizer that handles the braces would be easy enough to write.
I believe this will work, using lookahead and lookbehind assertions:
(?<!{[^}]*)Ostrich(?![^{]*})
I also tested the case My {Ostrich} went to the Ostrich Race. (where the second "Ostrich" does match)
Note that the lookahead assertion (?![^{]*}) is optional, but without it:
My {Ostrich has a missing bracket won't match
My Ostrich also} has a missing bracket will match
which may or may not be desirable.
This works in the .NET regex engine. However, it is not PCRE-compatible, because it uses a non-fixed-length lookbehind assertion, which PCRE does not support.
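Python's re module likewise rejects the variable-length lookbehind, so the sketch below is a swapped-in variant, not the pattern above: it keeps only the negative lookahead and assumes every brace is balanced and braces are never nested. The function name is hypothetical.

```python
import re

# "Ostrich" is rejected when a "}" is reached before any "{",
# i.e. when the word appears to sit inside an open brace pair.
OUTSIDE_BRACES = re.compile(r"Ostrich(?![^{]*\})")


def count_outside_braces(text: str) -> int:
    """Count occurrences of "Ostrich" that are not inside {...}."""
    return len(OUTSIDE_BRACES.findall(text))


if __name__ == "__main__":
    samples = [
        "I have an Ostrich.",
        "My Emu went to the {Ostrich Race Name}.",
        "My Ostrich went to the {Ostrich Race Name}.",
        "My {Ostrich} went to the Ostrich Race.",
    ]
    for sample in samples:
        print(count_outside_braces(sample), sample)
```

With unbalanced input the lookahead alone behaves differently from the two-assertion version, so this only suits text where the braces are known to be well formed.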
Here's a very large regex that almost works.
It will return each "raw" occurrence of the word in a group.
However, the group for the last one will be empty; I'm not sure why. (Tested with .Net)
Parse without whitespace
^(?:
(?:
[^{]
|
(?:\{.*?\})
)*?
(?:\W(Ostrich)\W)?
)*$
Using a positive lookahead with a negation appears to properly match all the test cases as well as multiple Ostriches:
(?<!{[^}]*)Ostrich(?=[^}]*)