Regex Searching in vim

I'm using vim to do some pattern matching on a text file. I've enabled search highlighting so that I know exactly what is getting matched on each search and am getting confused.
Consider searching for [a-z]* on the following text: 123456789abcdefghijklmnopqrstuvwxyxz987654321ABCDEFGHIJKLMNOPQRSTUVWQXZ
I expected this search to match zero or more consecutive characters that are in the range [a-z]. Instead, I get a match on the entire line.
Should this be the expected behaviour?
Thanks,
Andrew

It's matching the empty strings that occur after every character. It has no way of highlighting empty ranges, so it looks like everything is highlighted.
Try searching for [a-z]\+ instead.

The empty string matches [a-z]*, so the pattern matches at every position. Perhaps you want to cut down some of the cases by using [a-z]\+ (1 or more) or [a-z]\{4,} (4 or more). Note that in Vim's default "magic" mode the + and { have to be escaped with a backslash.

You're not getting a match on the entire line, you're getting a match at every character. Your pattern also matches the empty string, and an empty match is possible at every single position.
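The same effect can be reproduced outside Vim. Here is a minimal sketch in Python, chosen only for illustration (Vim's syntax differs, e.g. \+ instead of +), using a shortened hypothetical sample string rather than the line from the question:

```python
import re

sample = "12ab3"  # shortened, hypothetical sample text

# [a-z]* may match zero characters, so it succeeds at every position,
# including positions where no lowercase letter is present.
star_matches = re.findall(r"[a-z]*", sample)

# [a-z]+ needs at least one letter, so only the real run of letters matches.
plus_matches = re.findall(r"[a-z]+", sample)

print(star_matches)  # contains empty strings alongside "ab"
print(plus_matches)
```

The empty strings in the first result correspond to the positions that Vim highlights even though no letter is there.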

Related

Why doesn't my non-greedy match work in vim?

This is test
There are two tabs (\t) in this line. I want to get rid of the part from the beginning to the first tab key, which is "This ", and I used the following pattern:
:s/.\{-}\t//g
It says it can't find the pattern. If I use the following, both tabs are replaced, which isn't what I want. Why doesn't the first pattern work?
:s/.*\t//g
Your first attempt does not work because you are matching the fewest number of any character followed by a tab. The fewest number of any character is zero (0). So both of your tabs match without any other characters.
Based on the comments, the above explanation was incorrect.
Here is one possible solution.
:s/^[^\t]*\t//
This goes from the beginning ^, capturing any number of non-tab characters [^\t]* until it reaches a tab \t.
Your pattern /.\{-}\t didn't work as intended because of the g flag in the :s command. This flag enables global matching, so after the first match is removed the pattern matches again on the text up to the second tab. Just remove the flag and it will work. In addition, when deleting something you can omit the replacement part in :s:
:s/.\{-}\t
The full :s/.\{-}\t// is fine as well. Note that in either case it should not say "pattern not found" as you described. If you see that message, there is something else different between your example and your actual text.
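The effect of the g flag can be sketched in Python, where .*? plays the role of Vim's .\{-} and count=1 stands in for omitting the g flag. This is an analogy under assumptions, not Vim's own behaviour verbatim; the sample line below is an assumed reconstruction with two tab characters.

```python
import re

line = "This\tis\ttest"  # assumed reconstruction of the sample line

# Without a count, re.sub replaces every match, like :s with the g flag.
global_result = re.sub(r".*?\t", "", line)

# count=1 replaces only the first match, like :s without the g flag.
first_only = re.sub(r".*?\t", "", line, count=1)

# Anchored alternative: non-tab characters from the start up to the first tab.
anchored = re.sub(r"^[^\t]*\t", "", line)

print(repr(global_result), repr(first_only), repr(anchored))
```

Only the global variant strips both tab-terminated parts; the other two leave everything after the first tab intact.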

PowerShell RegEx with multiple options

Given a file name of 22-PLUMB-CLR-RECTANGULAR.0001.rfa I need a RegEx to match it. Basically it's any possible characters, then . and 4 digits and one of four possible file extensions.
I tried ^.?\.\d{4}\.(rvt|rfa|rte|rft)$, which I thought would be correct, but I guess my RegEx understanding has not progressed as much as I hoped. Now, .?\.\d{4}\.(rvt|rfa|rte|rft)$ does work, and the ONLY difference is that I am not specifying the start of the string with ^. In a different situation, where the file name is always in the form journal.####.txt, I used ^journal\.\d{4}\.txt$ and it matched journal.0001.txt like a champ. Obviously the difference is that there I specify a literal string rather than any characters with .?, but I don't understand the WHY.
The first pattern never matches the mentioned string, since ^.? means: match the beginning of the input string, then at most one character. After that the pattern requires a literal dot followed by four digits, but the engine is at most one character into the file name, where no dot follows, so the match fails.
Why does it work without ^? Because without the anchor the engine may try every starting position in the string. It finds a match starting at the final R of RECTANGULAR (consumed by .?) and continues matching up to the end.
That's fine, but to keep the anchor the first approach should use ^.* instead. The Kleene star matches as many characters as possible and then backtracks, whereas ? only makes the preceding pattern optional. That means zero or one character, not many characters.
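The three variants can be compared in a short sketch. It uses Python's re module rather than PowerShell's .NET engine, on the assumption that these particular constructs (^, $, ., ?, *, \d{4}, alternation) behave the same way in both:

```python
import re

name = "22-PLUMB-CLR-RECTANGULAR.0001.rfa"
tail = r"\.\d{4}\.(rvt|rfa|rte|rft)$"

# ^.? allows at most one character before the dot, so this cannot match.
anchored_optional = re.search(r"^.?" + tail, name)

# Without ^ the engine may start anywhere; it succeeds at the final "R".
unanchored = re.search(r".?" + tail, name)

# ^.* lets any number of characters precede the dot.
anchored_star = re.search(r"^.*" + tail, name)

print(anchored_optional)
print(unanchored.group(0))
print(anchored_star.group(0))
```

The unanchored pattern reports success, but it only covers the tail of the file name, which is why it appeared to work.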

Ant regex expression

Quite a simple one in theory but can't quite get it!
I want a regex in ant which matches anything as long as it has a slash on the end.
Below is what I expect to work
<regexp id="slash.end.pattern" pattern="*/"/>
However this throws back
java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*/
^
I have also tried escaping this to \*, but that matches a literal *.
Any help appreciated!
Your original regex pattern didn't work because * is a special character in regex that is only used to quantify other characters.
The pattern (.)*/$, which you mentioned in your comment, will match any string that ends in a slash and contains no newlines; however, it uses a possibly unnecessary capturing group. .*/$ should work just as well.
If you need to match newline characters, the dot . won't be enough. You could try something like [\s\S]*/$
On that note, it should be mentioned that you might not want to use $ in this pattern. Suppose you have the following string:
abc/def/
Should this be evaluated as two matches, abc/ and def/? Or is it a single match containing the whole thing? Your current approach creates a single match. If instead you would like to search for strings of characters and then stop the match as soon as a / is found, you could use something like this: [\s\S]*?/.
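The difference between the two approaches can be tried out in a small sketch. It is written in Python rather than Java, so the error raised for the bare * differs from the PatternSyntaxException Ant reports, but the matching behaviour of these patterns should be comparable:

```python
import re

text = "abc/def/"

# A leading * has nothing to quantify, so the pattern is rejected.
try:
    re.compile("*/")
    bare_star_ok = True
except re.error:
    bare_star_ok = False

# Greedy and anchored: one match covering the whole string.
greedy = re.findall(r"[\s\S]*/$", text)

# Lazy and unanchored: each match stops at the first slash it reaches.
lazy = re.findall(r"[\s\S]*?/", text)

print(bare_star_ok, greedy, lazy)
```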

Notepad++ Regex: Find all 1 and 2 letter words

I'm working with a text file of 200,000+ lines in Notepad++. Each line contains only one word. I need to strip out all words that contain only one letter (e.g. I) and words that contain only two letters (e.g. as).
I thought I could just paste in a regular regex like [a-zA-Z]{1,2}, but it does not recognize anything (I'm trying to Mark them).
I've done a manual search and I know that words of that length do exist, so it can only be my regex that's wrong. Does anyone know how to do this in Notepad++?
Cheers,
- Mestika
If you want to remove only the words but leave the lines empty, this works:
^[a-zA-Z]{1,2}$
Replace this with an empty string. ^ and $ are anchors for the beginning and the end of a line (because Notepad++'s regexes work in multi-line mode).
If you want to remove the lines completely, search for this:
^[a-zA-Z]{1,2}\r\n
And replace with an empty string. However, this won't work before Notepad++ 6, so make sure yours is up-to-date.
Note that you will have to replace \r\n with the specific line-endings of your file!
As Tim Pietzker suggested, a platform independent solution that also removes empty lines would be:
^[a-zA-Z]{1,2}[\r\n]+
A platform-independent solution that does not remove empty lines but only those with one or two letters would be:
^[a-zA-Z]{1,2}(\r\n?|\n)
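To see how the two platform-independent patterns differ, here is a rough sketch in Python. It assumes Python's re.MULTILINE flag is a fair stand-in for Notepad++'s line-based anchors, and the sample text (with a blank line after the first word) is made up for illustration:

```python
import re

# Hypothetical sample with Windows line endings: "I", a blank line, "as", "the".
text = "I\r\n\r\nas\r\nthe\r\n"

# [\r\n]+ swallows every line break after the short word, so the blank
# line that follows "I" disappears as well.
with_blank_lines = re.sub(r"^[a-zA-Z]{1,2}[\r\n]+", "", text, flags=re.MULTILINE)

# (\r\n?|\n) consumes exactly one line ending, so the blank line survives.
keep_blank_lines = re.sub(r"^[a-zA-Z]{1,2}(\r\n?|\n)", "", text, flags=re.MULTILINE)

print(repr(with_blank_lines))
print(repr(keep_blank_lines))
```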
I don't use Notepad++, but my guess is that you have too many matches: without anchors your expression will match every run of one or two letters, even inside longer words. Try including word boundaries:
\b[a-zA-Z]{1,2}\b
The regex you specified should find 1-or-2 characters (even in Notepad++'s Find dialog), but not in the way you'd think. You want the regex to start at the beginning of the line and end at the end of it, using ^ and $ respectively:
^[a-zA-Z]{1,2}$
Notepad++ version 6.0 introduced the PCRE engine, so if this doesn't work in your current version try updating to the most recent.
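A quick way to compare the unanchored, anchored and word-boundary patterns from these answers is the following sketch. It uses Python's re module as an assumed stand-in for the Notepad++ engine, with a made-up four-line sample:

```python
import re

text = "I\nas\nthe\nword"  # hypothetical sample, one word per line

# Unanchored: longer words are chopped into one- and two-letter pieces.
unanchored = re.findall(r"[a-zA-Z]{1,2}", text)

# Anchored to line start and end: only whole lines of one or two letters.
anchored = re.findall(r"^[a-zA-Z]{1,2}$", text, flags=re.MULTILINE)

# Word boundaries: only whole words of one or two letters.
bounded = re.findall(r"\b[a-zA-Z]{1,2}\b", text)

print(unanchored)
print(anchored)
print(bounded)
```

With one word per line the anchored and word-boundary versions agree; they would differ on lines that contain several words.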
You seem to use the version of Notepad++ that doesn't support explicit quantifiers: that's why there's no match at all (as { and } are treated as literals, not special symbols).
The solution is to use their somewhat more lengthy replacement:
\w\w?
... but that's only part of the story, as this regex will match one or two word characters anywhere, even inside longer words, and not just short words. To match only the short words, you need something like this:
^\w\w?$

Assistance with a regular expression

I am not good with regular expressions, and I could use some help with a couple of expressions I am working on. I have a line of text, such as Text here then 999-99 and I'd like to isolate that number sequence at the end. It could be either 999-99 or 999-99-9. The following seems to work:
\d{3}-\d{2}(-\d{1})?
But I notice that it really just seems to be searching anywhere within the text, as I can add text after the number sequence and it still matches. This needs to be more strict, so that the line must end with this exact sequence, and nothing after it. I tried ending with $ instead of ?, but that never seems to create a match (it always returns false).
I could also use some help with character replacement. I am working on a program which deals with OCR scanning, and occasionally the string value that comes back contains undisplayable characters, represented by the ܀ symbol. Is there a regular expression which will replace the ܀ characters with a space?
Try this regular expression.
([\d-]+)$
This should work. Just end your regex with $, which represents the end of the line:
\d{3}-\d{2}(-\d{1})?$
Use the word-boundary metacharacter, \b:
\b\d{3}-\d{2}(-\d)?\b
You can also remove the {1} from the last \d since it's redundant.
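The suggestions can be checked against the sample text from the question. The sketch below uses Python's re module, which is an assumption since the question does not name its regex engine; the string with trailing text is made up for the test.

```python
import re

# End-anchored pattern, with the redundant {1} removed.
pattern = re.compile(r"\d{3}-\d{2}(-\d)?$")

short = pattern.search("Text here then 999-99")
long = pattern.search("Text here then 999-99-9")
trailing = pattern.search("Text here then 999-99 and more")

# Word-boundary version applied to the same line with trailing text.
bounded = re.search(r"\b\d{3}-\d{2}(-\d)?\b", "Text here then 999-99 and more")

print(short.group(0), long.group(0), trailing, bounded.group(0))
```

In this sketch the $ version rejects the line with trailing text, while the \b version still finds the number in it, so \b alone may not give the strictness the question asks for.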