I need to match any string that has certain characteristics, but I think enabling the /m flag is breaking the functionality.
What I know:
The string will start and end with quotation marks.
The string will have the following words. "the", "fox", and "lazy".
The string may have a line break in the middle.
The string will never have a hash sign (#), which is why the regex uses it in the negated character class.
My problem is, if I have the string twice in a single block of text, it returns once, matching everything between the first quote mark and last quote mark with the required words in-between.
Here is my regex:
/^"the[^#]*fox[^#]*lazy[^#]*"$/gim
And a Regex101 example.
Here is my understanding of the statement. Match where the string starts with "the and there is the word fox and lazy (in that order) somewhere before the string ends with ". Also ignore newlines and case-sensitivity.
The most common answer for limiting a match is (.*?), but . doesn't match newlines. And putting the ? inside the brackets, as in [^#?]*, doesn't work because it just adds ? to the set of characters to exclude.
So how can I keep the "match everything until ___" from skipping until the last instance while still being able to ignore newlines?
This is not a duplicate of anything else I can find because this deals with multi-line matching, and those don't.
In your case, all your quantifiers need to be non-greedy, so you can just use the ungreedy flag, U (available in PCRE):
/^"the[^#]*fox[^#]*lazy[^#]*"$/gimU
Example on Regex101.
The answer, which was figured out while typing up this question, may seem ridiculously obvious.
Put the ? after the *, not inside the brackets. Parentheses and brackets are not analogous, and the ? should be relative to the *.
Corrected regex:
/^"the[^#]*?fox[^#]*?lazy[^#]*?"$/gim
Example from Regex101.
The long and the short of this is:
Non-greedy, multi-line matching can be achieved with [^#]*?
(substituting # for something you don't want to match)
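A minimal sketch of the difference, written in Python rather than on Regex101. The sample text is made up for illustration. With the greedy version the two quoted strings collapse into one match, while the lazy version finds both:

```python
import re

# Hypothetical sample: two quoted strings, each with a line break inside.
text = (
    '"The quick brown fox\njumps over the lazy dog"\n'
    '"The red fox\nis not lazy"'
)

flags = re.IGNORECASE | re.MULTILINE

# Greedy: [^#]* runs to the end of the text, then backtracks to the LAST quote.
greedy = re.compile(r'^"the[^#]*fox[^#]*lazy[^#]*"$', flags)

# Lazy: [^#]*? grows one character at a time and stops at the FIRST quote
# that is followed by an end of line.
lazy = re.compile(r'^"the[^#]*?fox[^#]*?lazy[^#]*?"$', flags)

greedy_matches = greedy.findall(text)
lazy_matches = lazy.findall(text)

print(len(greedy_matches), "greedy match(es)")
print(len(lazy_matches), "lazy match(es)")
```

Because [^#] is a negated character class, it matches newlines without any extra flag, which is what (.*?) cannot do by default.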
This is a follow-up question to what was solved yesterday:
Notepad++ Regex Replace Makeshift Footnotes format With Proper Markdown format
I managed to find a regex to remove the offending colons in the main text area, but only by cutting out the text and pasting back the result, which has to be done one by one.
I'm not sure how this can be done, but the expert can tell me.
So I have footnote references in markdown format. Two instances of the same thing:
[^1]:
[^2]:
.
.
.
[^99]:
I might not have 99 in a document but I wanted to show I need to match two digits here again.
As I said, there are two instances of these numbered references in the text. One in the main text pointing to the footnote and the footnote at the end of the document.
What I need is to delete the colons from the main text and leave the
[^3]:
[^15]:
etc.
references at the end intact.
Because the main text references come after a word or at the end of a sentence (usually before the sentence-ending period), there is never a case where a reference would start a line (even if they seem to appear there once or twice because of word wrap).
I provided the exact opposite of my needs here:
Click here for Regex101 website link
I put in the exact opposite of what I want because I already knew of the
^
sign to match anything that is at the front of the line.
Now I would like to negate this, if possible, so that I delete the colons in the main text, not the ones down at the bottom.
Of course, it is likely that my approach is not good and you'll come up with a completely different approach. Especially because there doesn't seem to be a NOT operator in Regex, if I read correctly.
I repeat: the Regex101 example with the match and substitution is exactly the opposite of what I want.
I am not sure if you can play around in the substitution line to get the desired negative effect.
I could probably have asked for removing the first occurrence of each colon, but I thought the important part of tackling the problem is that the items not to be matched are always at the start of the line, and the others are not.
Thanks for any suggestions
In Notepad++ you might use a negative lookbehind asserting that the current position is not the start of the line, and use \K to clear the match buffer so that only the colon is matched and replaced by an empty string.
(?<!^)\[\^\d{1,2}]\K:
Explanation
(?<!^) Negative lookbehind, assert that the current position is not the start of the line
\[\^ Match [^
\d{1,2} Match 1 or 2 digits
] Match literally
\K Forget what is matched so far
: Match a colon
Regex demo
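For engines without \K (Python's built-in re module is one of them), the same idea can be sketched with a capture group that is put back in the replacement. This is only an illustration outside Notepad++, and the sample text is invented:

```python
import re

# Hypothetical document: references in the main text, footnotes at the bottom.
text = (
    "A claim that needs a source[^1]:. Another claim[^15]:.\n"
    "\n"
    "[^1]: First footnote.\n"
    "[^15]: Second footnote.\n"
)

# (?<!^)  not at the start of a line (re.MULTILINE makes ^ match per line)
# (...)   capture the reference itself so it can be re-inserted
# :       the colon to drop
pattern = re.compile(r"(?<!^)(\[\^\d{1,2}\]):", re.MULTILINE)

cleaned = pattern.sub(r"\1", text)
print(cleaned)
```

The footnote definitions keep their colons because each one starts a line, so the lookbehind rejects them.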
I am having trouble understanding negative regex lookahead / lookbehind. I got the impression from reading tutorials that when you set a criterion to look for, the criterion doesn't form part of the search match.
That seems to hold for the positive lookahead examples I tried, but when I tried these negative ones, each matched the entire test string. First, it shouldn't have matched anything, and second, even if it did, it wasn't supposed to include the lookahead criterion.
(?<!^And).*\.txt$
with input
And.txt
See: https://regex101.com/r/vW0aXS/1
and
^A.*(?!\.txt$)
with input:
A.txt
See: https://regex101.com/r/70yeED/1
PS: if you're going to ask me which language. I don't know. we've been told to use regex without any specific reference to any specific languages. I tried clicking various options on regex101.com and they all came up the same.
Lookarounds only try to match at their current position.
You are using a lookbehind at the beginning of the string, (?<!^And).*\.txt$, and a lookahead at the end of the string, ^A.*(?!\.txt$), which won't work (.* will always consume the whole string as its first attempt).
To disallow "And", for example, you can put the lookahead at the beginning of the string with a greedy quantifier .* inside it, so that it scans the whole string:
^(?!.*And).*\.txt$
https://regex101.com/r/1vF50O/1
Your understanding is correct and the issue is not with the lookbehind/lookahead. The issue is with .*, which matches the entire string in both cases. The period . matches any character, and following it with * lets it match a string of any length. Remove it and both your regexes will work:
(?<!^And)\.txt$
^A(?!\.txt$)
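A quick way to see this, sketched in Python (the question is language-agnostic, so the choice of engine is an assumption; the extra inputs Other.txt and A.doc are made up):

```python
import re

# With .* in the pattern, the lookaround is tested at a position where it
# trivially succeeds, so the whole string is matched.
with_star_behind = re.search(r"(?<!^And).*\.txt$", "And.txt")
with_star_ahead = re.search(r"^A.*(?!\.txt$)", "A.txt")

# Without .*, the lookaround is tested right next to the text it guards.
no_star_behind = re.search(r"(?<!^And)\.txt$", "And.txt")
no_star_ahead = re.search(r"^A(?!\.txt$)", "A.txt")

# Inputs that should still be accepted by the corrected patterns.
other_behind = re.search(r"(?<!^And)\.txt$", "Other.txt")
other_ahead = re.search(r"^A(?!\.txt$)", "A.doc")

print(with_star_behind.group(), with_star_ahead.group())
print(no_star_behind, no_star_ahead)
print(other_behind.group(), other_ahead.group())
```

In the first pair the lookbehind is checked at position 0 (nothing precedes it) and the lookahead at the very end (nothing follows it), so neither can ever reject the input.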
My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars as many as possible. (?s) and (?s:.) can be used with regex engines that support these constructs so as to use . to match any chars.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
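Both the general scheme and the shortened pattern can be tried out with a small Python sketch. The helper name last_match is hypothetical and only wraps the scheme shown above:

```python
import re

s = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"


def last_match(pattern, text):
    # General scheme: pattern(?![\s\S]*pattern)
    # i.e. a match that is not followed, anywhere later, by another match.
    scheme = "(?:" + pattern + r")(?![\s\S]*(?:" + pattern + "))"
    return re.search(scheme, text)


generic = last_match(r"\s+", s)

# Shortened form for this case: \S* is enough to skip to the next whitespace.
specific = re.search(r"\s+(?!\S*\s)", s)

print(generic.span(), specific.span())
```

Both searches land on the single space just before .COM.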
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and consumes as much as possible while still leaving a needle to be found right after it. Thus it swallows every needle occurrence except the very last one.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we match it and don't use it any further. We could capture it with parentheses and use it as another group, but that's out of scope here
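The two replacements above can be reproduced with a short Python sketch. Python writes the backreference as \1 in the replacement as well; the surrounding quotes of the sample string are left out here:

```python
import re

s = ("as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd "
     "d.sdfsd. sdfsdf sd ..COM")

# Single needle: a run of whitespace preceded by a non-whitespace character.
whitespace_only = re.sub(r".*(?=((?<=\S)\s+)).*", r">\1<", s)

# Several needles: the word, whitespace runs, or dot runs.
needles = r"findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+"
any_needle = re.sub(r".*(?=(" + needles + r")).*", r">\1<", s)

print(whitespace_only)
print(any_needle)
```

In both calls the leading .* backtracks from the end of the string until the lookahead first succeeds, which is why group 1 holds the last needle.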
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often this is a file extension (e.g. .c, .ts, or .json) or a top-level domain (e.g. .com, .org, or .io), but it could be something as arbitrary as McDonald's Mulan Szechuan Sauce. The point is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. It's a variance because the amount of white-space that needs to be matched varies (it has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often when working with regular expressions it's necessary to work in reverse.
We will start at the end of the problem described above and work backwards, so we are going to start at The Before Variance (or #3).
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespace with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAScript regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine not to be concerned with the case of the letters; in other words, the regular expression /\s+(?=\.cOM)/gi is the same as /\s+(?=\.Com)/gi, and both are the same as /\s+(?=\.com)/gi and /\s+(?=\.COM)/gi. Every one of the regular expressions just listed is equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
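For readers who want to try the same pattern outside the browser, here is a rough Python equivalent of the lookahead regex above. Python has no /.../gi literal, so the i flag becomes re.IGNORECASE; that translation is the only assumption:

```python
import re

s = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"

# \s+       one or more whitespace characters (the "before variance")
# (?=\.com) followed by .com, which is checked but not consumed
insensitive = re.search(r"\s+(?=\.com)", s, re.IGNORECASE)

# Without the flag, .com does not match .COM, so nothing is found.
sensitive = re.search(r"\s+(?=\.com)", s)

print(insensitive.span())
print(sensitive)
```

The match covers only the whitespace; the .COM that follows stays outside the match because a lookahead is zero-length.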
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex is that even though it matches .txt, .org, .json, and .unclepetespurplebeet, it isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) that there are 3 lines in the string. Those lines represent places where the sub-string's lookahead assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Although only one match was returned in the end, and it was the correct one, when the regex is used in a program or website running in production, you can pretty much guarantee that it's not only going to fail, but it's going to fail horribly and you will come to hate it.
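A rough Python check of where the broader pattern succeeds on that string. This is only a sketch in a different engine than Regex101's default, so the exact number of hits may differ from the demo; what it shows is that \s* also produces empty matches wherever a dot is followed by three or more word characters:

```python
import re

s = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"

# Collect group 1 for every position where the lookahead holds.
matches = [m.group(1) for m in re.finditer(r"(\s*)(?=\.\w{3,})", s)]

# Keep only the hits that actually captured whitespace.
non_empty = [m for m in matches if m]

print(len(matches), "hit(s) in total")
print(non_empty)
```

Only one hit contains whitespace, but the empty hits (for example in front of .asd) are the kind of extra candidates the paragraph above warns about.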
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$
In vim I would like to use regex to highlight each line that ends with a letter that is preceded by neither // nor :. I tried the following
syn match systemverilogNoSemi "\(.*\(//\|:\).*\)\@!\&.*[a-zA-Z0-9_]$" oneline
This worked very well on comments, but did not work on lines containing a colon.
Any idea why?
Because with this regex vim can choose any point as the start of the match. Obviously it chooses a point where the first concat matches (i.e. one that does not have // or : after it). These things are normally done by using either
\v^%(%(\/\/|\:)@!.)*\w$
(removed first concat and the branch itself, changed .* to %(%(\/\/|\:)@!.)*; replaced collection with equivalent \w; added anchor pointing to the start of line): if you need to match the whole line. Or negative look-behind if you need to match only the last character. You can also just add an anchor to the first concat of your variant (you should remove the trailing .* from the first concat as it is useless, and the branch symbol for the same reason).
Note: I have no idea why your regex worked for comments. It does not work with comments the way you need it in all cases I checked.
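The whole-line pattern above reads more familiarly in engines that write Vim's @! as a negative lookahead (?!...). A hedged Python translation, with made-up sample lines, shows how the per-character check behaves:

```python
import re

# ^                  start of line
# (?:(?!//|:).)*     any characters, none of which starts a // or is a :
# \w$                the line must end with a word character
pattern = re.compile(r"^(?:(?!//|:).)*\w$")

# Hypothetical lines for illustration.
lines = [
    "assign a = b",        # no terminator, no comment, no colon
    "assign a = b;",       # ends with a semicolon
    "wire x // not done",  # contains a comment
    "case 2: foo",         # contains a colon
    "x = a / b",           # a single slash is fine
]

flagged = [line for line in lines if pattern.search(line)]
print(flagged)
```

Because the lookahead is repeated before every character and the pattern is anchored with ^, the engine can no longer pick a starting point after the // or : to get a match.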
does this work for you?
^\(\(//\|:\)\@<!.\)*[a-zA-Z0-9_]$
I'm looking for a Perl regex that will capitalize any character which is preceded by whitespace (or the first char in the string).
I'm pretty sure there is a simple way to do this, but I don't have my Perl book handy and I don't do this often enough that I've memorized it...
s/(\s\w)/\U$1\E/g;
I originally suggested:
s/\s\w/\U$&\E/g;
but alarm bells were going off at the use of '$&' (even before I read @Manni's comment). It turns out that they're fully justified: using the $&, $` and $' variables causes an overall inefficiency in regexes.
The \E is not critical for this regex; it turns off the 'case-setting' switch \U in this case or \L for lower-case.
As noted in the comments, matching the first character of the string requires:
s/((?:^|\s)\w)/\U$1\E/g;
Corrected position of second close parenthesis - thanks, Blixtor.
Depending on your exact problem, this could be more complicated than you think and a simple regex might not work. Have you thought about capitalization inside the word? What if the word starts with punctuation like '...Word'? Are there any exceptions? What about international characters?
It might be better to use a CPAN module like Text::Autoformat or Text::Capitalize where these problems have already been solved.
use Text::Capitalize 0.2;
print capitalize_title($t), "\n";
use Text::Autoformat;
print autoformat{case => "highlight", right=>length($t)}, $t;
It sounds like Text::Autoformat might be more "standard" and I would try that first. It's written by Damian. But Text::Capitalize does a few things that Text::Autoformat doesn't. Here is a comparison.
You can also check out the Perl Cookbook for recipe 1.14 (page 31) on how to use regexps to properly capitalize a title or headline.
Something like this should do the trick -
s!(^|\s)(\w)!$1\U$2!g
This simply splits up the scanned expression into two matches - $1 for the blank/start of string and $2 for the first character of word. We then substitute both $1 and $2 after making the start of the word upper-case.
I would change the \s to \b which makes more sense since we are checking for word-boundaries here.
This isn't something I'd normally use a regex for, but my solution isn't exactly what you would call "beautiful":
$string = join("", map(ucfirst, split(/(\s+)/, $string)));
That split()s the string by whitespace and captures all the whitespace, then goes through each element of the list and does ucfirst on them (making the first character uppercase), then join()s them back together as a single string. Not awful, but perhaps you'll like a regex more. I personally just don't like \Q or \U or other semi-awkward regex constructs.
EDIT: Someone else mentioned that punctuation might be a potential issue. If, say, you want this:
...string
changed to this:
...String
i.e. you want words capitalized even if there is punctuation before them, try something more like this:
$string = join("", map(ucfirst, split(/(\w+)/, $string)));
Same thing, but it split()s on words (\w+) so that the captured elements of the list are word-only. Same overall effect, but will capitalize words that may not start with a word character. Change \w to [a-zA-Z] to eliminate trying to capitalize numbers. And just generally tweak it however you like.
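The same split-and-rejoin idea can be sketched in Python for comparison (not Perl, so this is a translation rather than the code above). The helper ucfirst is hypothetical and mimics Perl's function of the same name:

```python
import re


def ucfirst(part):
    # Upper-case only the first character and leave the rest untouched.
    return part[:1].upper() + part[1:]


def capitalize_by(separator, text):
    # A capturing group in the split pattern keeps the separators in the list,
    # so joining the pieces restores the original spacing.
    return "".join(ucfirst(p) for p in re.split(separator, text))


by_space = capitalize_by(r"(\s+)", "hello  world ...string")
by_word = capitalize_by(r"(\w+)", "hello  world ...string")

print(by_space)
print(by_word)
```

Splitting on (\s+) leaves ...string alone because its first character is a dot, while splitting on (\w+) isolates the word itself and capitalizes it.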
If you mean character after space, use regular expressions using \s. If you really mean first character in word you should use \b instead of all above attempts with \s which is error prone.
s/\b(\w)/\U$1/g;
You want to match letters behind whitespace, or at the start of a string.
Perl can't do variable length lookbehind. If it did, you could have used this:
s/(?<=\s|^)(\w)/\u$1/g; # this does not work!
Perl complains:
Variable length lookbehind not implemented in regex;
You can use double negative lookbehind to get around that: the thing on the left of it must not be anything that is not whitespace. That means it'll match at the start of the string, but if there is anything in front of it, it must be whitespace.
s/(?<!\S)(\w)/\u$1/g;
The simpler approach in this exact case will probably be to just match the whitespace (or the start of the string) as part of the pattern and include it in the replacement; the variable-length restriction then falls away.
s/(\s|^)(\w)/$1\u$2/g;
Occasionally you can't use this approach in repeated substitutions, because whatever precedes the actual match has already been consumed by the regex, and it's good to have a way around that.
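Both variants translate fairly directly to other engines. A minimal Python sketch (Python has no \u in replacements, so a small function does the upper-casing; that substitution is an assumption of this example):

```python
import re

s = "hello  world\tfoo ...bar"

# Double negative lookbehind: the previous character must not be a
# non-whitespace character, which also holds at the start of the string.
with_lookbehind = re.sub(r"(?<!\S)(\w)", lambda m: m.group(1).upper(), s)

# Capture the preceding whitespace (or start) and put it back unchanged.
with_capture = re.sub(
    r"(\s|^)(\w)",
    lambda m: m.group(1) + m.group(2).upper(),
    s,
)

print(with_lookbehind)
print(with_capture)
```

Neither version touches ...bar, because the character before b is a dot rather than whitespace.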
Capitalize ANY character preceded by whitespace or at beginning of string:
s/((?:^|\s).)/\U$1/g
Maybe a very sloppy way of doing it because it's also uppercasing the whitespace now. :P
The advantage is that it works with letters with all possible accents (and also with special Danish/Swedish/Norwegian letters), which are problematic when you use \w and \b in your regex. Can I expect that all non-letters are untouched by the uppercase modifier?