I need a regular expression to find a specific line in a file that occurs somewhere after another line. for example, I may want to find the string "friend", but only when it occurs on a line after a line containing the string "hello". so for example:
hello there
how are you
my friend
should pass, but
how are you
my friend
hello
or
hello friend
how are you
should not pass.
The only thing I've thought of is something like hello[.\s]*\n[.\s]*friend, which does not work.
EDIT: I'm using a customized program that has a lot of limitations. I don't have access to switches or custom modes. I need a single regular expression that works for the standard python regex mode.
hello[.\s]*\n[.\s]*friend
First note that a dot inside a character class matches for a literal dot, not as a "match all" character, so you really want alternation, not character class for this. But also not that a "match all" dot will also match spaces, so you don't even need alternation.
So overall, you really just need this:
hello.*?friend
Now comes the problem with matching across new-line chars. By default the "match all" dot does not match new-line chars. You can flag/modifier it to match it, but how you do that depends on what language you are using. In php or perl, you can use the s modifier, e.g.
php:
preg_match('~hello.*?friend~s',$content);
edit:
If you are trying to use regex in something like an editor (or otherwise can't add flags/modifiers), most editors have an option to flag it as such. If not, you can try alternation with newline chars like so:
hello(.|\r?\n)*friend
You need to include two newline characters.
hello(?:.*\n)+.*friend
This expects atleast one newline character present inbetween.
I'm by no means a regex expert (particularly not in Python), but my RegexBuddy app thinks this will work:
(?s)hello.*\n+.*friend
The (?s) is apparently an inline way of specifying the "Dot matches newline" option, which seems to be necessary for the \n to work.
Related
I have this text like below:
아니다
bukan
싫다
tidak suka
훌륭하다
bagus
And I am trying to remove the English line(English Alphabets) and attach it to the end of upper line(Korean Alphabets) like this:
아니다bukan
싫다tidak suka
훌륭하다bagus
Now, Finally find almost close regular expression, which is this:
[가-힣]\R
However, It makes the text file like this:
아니bukan
싫tidak suka
훌륭하bagus
The problem is removing the one word of Korean too.
How can I solve this problem?
C++ std::regex does not support Unicode property classes like \p{Hangul}, but you may use the equivalent character class, [\u1100-\u11FF\u302E\u302F\u3131-\u318E\u3200-\u321E\u3260-\u327E\uA960-\uA97C\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uFFA0-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC], see this reference.
Besides, \R is not supported either. You may probably just use \r?\n to match Windows/Linux style line endings, or (?:\r\n?|\n) to also support MacOS line endings.
Next, if you match and consume a Korean char, when replacing, you need to capture it into a capturing group and use a backreference to the group in the replacement pattern.
So, you may use
([\u1100-\u11FF\u302E\u302F\u3131-\u318E\u3200-\u321E\u3260-\u327E\uA960-\uA97C\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uFFA0-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC])(?:\r\n?|\n)
Replace with $1 to put back the Korean char into the resulting string.
See the regex demo online.
The regex for the set of all Korean characters in unicode is this:
\p{Hangul}
There is more information here: https://www.regular-expressions.info/unicode.html
Maybe you also need a + after your group of characters?
Use the [\p{Hangul}]+\R regular expression instead of what you're using now.
I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES>
PATH_TO_MY_FILES1>
...
PATH_TO_MY_FILES22>
PATH_TO_MY_FILES_ELSEWHERE>
PATH_TO_MY_FILES_ELSEWHERE1>
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s\(PATH_TO_MY_FILES\w*\)\(\d+\)>:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s\(PATH_TO_MY_FILES\w*\)\(\d\)>:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22> to get replaced with PATH_TO_MY_FILES2>. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22>gt
So basically, it looks like it finds PATH_TO_MY_FILES22>, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)>
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.
I have a very simple Vim syntax file for personal notes. I would like to highlight people's name and I chose a Twitter-like syntax #jonathan.
I tried:
syntax match notesPerson "\<#\S\+"
To mean: words beginning with # and having at least one non-whitespace character. The problem is that # seems to be a special character in Vim regular expressions.
I tried to escape \# and enclose in brackets [#], the usual tricks, but that didn't work. I could try something like (^|\s) (beginning of line or whitespace) but that's exactly the problem that word-boundary tries to solve.
Highlighting works on simplified regular expressions, so this is more a question of finding the right regex than anything else. What am I missing?
# is a special character only if you have enabled the “very magic”
mode by having \v somewhere in the pattern prior to that #.
You have another problem here: # does not start a new word. \< is
not just “word boundary” like perl/PCRE’s \b, but “left word
boundary” (in help: “beginning of the word”) meaning that \< must be
followed by some keyword character. As # is not normally a keyword
character, pattern \<# will never match. (And even if it was like
\b, it would match constructs like abc#def which is definitely not
what you want for the aforementioned reasons.)
You should use \k\#<!#\k\S* instead: \k\#<! ensures that # is not preceded by any keyword character, \k\S* makes sure that first character of the name is a keyword one (you could probably also use #\<\S\+).
There is another solution: include # into 'iskeyword' option and leave the regex as is:
:setlocal iskeyword+=#-#
See :help 'isfname' for the explanation why #-# is used here.
(The 'iskeyword' option has exactly the same syntax and will,
in fact, redirect you there for the explanation.)
Normally the . doesn't match newline unless I specify the engine to do so with the (?s) flag. I tried this regexp on my editor's (UltraEdit v14.10) regexp engine using Perl style regexp mode:
(?s).*i
The search text contains multiple lines and each line contains many 'i' characters.
I expect the above regexp means: search as many characters (because with the '?s' the . now matches anything including newline) as possible (because of the greediness for *) until reaching the character 'i'.
This should mean "from the first character to the last 'i' in the last sentence" (greediness should reach the last sentence, right?).
But with UltraEdit's test, it turns out to be "from the first character to the last 'i' in the first sentence that contains an i". Is this result correct? Did I make any wrong interpretation of my reg expression?
e.g. given this text
aaa
bbb
aiaiaiaiaa
bbbicicid
it is
aaa
bbb
aiaiaiai
matched. But I expect:
aaa
bbb
aiaiaiaiaa
bbbicici
Your regex is correct, and so are your expectations of its performance.
This is a long-known bug in UltraEdit's regex implementation which I have written repeatedly to support about. As far as I know, it still hasn't been fixed. The problem appears to lie in the fact that UE's regex implementation is essentially line-based, and additional lines are taken into the match only if necessary. So .* will match greedily on the current line, but it will not cross a newline boundary if it doesn't have to in order to achieve a match.
There are some other subtle bugs with line endings. For example, lookbehind doesn't work across newlines, either.
Write to IDM support, or change to an editor with decent regex support. I did both.
Yes you are right this looks like a bug.
Your interpretation is correct. If you are in Perl mode and not Posix.
However it should apply to posix as well.
Altough defining the modifiers like you do is very rare.
Mostly you provide a string with delimiters and the modifier afterwards like /.*i/s
But this doesn't matter because your way is correct too. And if it wouldnt be supported, it wouldn't match the first newline either.
So yes, this is definately a bug in your program.
You're right that that regex should match the entire string (all 4 lines). My guess is that UltraEdit is attempting to do some sort of optimization by working line by line, and only accumulating new lines "when necessary".
I have some text like this:
dagGeneralCodes$_ctl1$_ctl0
Some text
dagGeneralCodes$_ctl2$_ctl0
Some text
dagGeneralCodes$_ctl3$_ctl0
Some text
dagGeneralCodes$_ctl4$_ctl0
Some text
I want to create a regular expression that extracts the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0 from the text above.
the result should be: dagGeneralCodes$_ctl4$_ctl0
Thanks in advance
Wael
This should do it:
.*(dagGeneralCodes\$_ctl\d\$_ctl0)
The .* at the front is greedy so initially it will grab the entire input string. It will then backtrack until it finds the last occurrence of the text you want.
Alternatively you can just find all the matches and keep the last one, which is what I'd suggest.
Also, specific advice will probably need to be given depending on what language you're doing this in. In Java, for example, you will need to use DOTALL mode to . matches newlines because ordinarily it doesn't. Other languages call this multiline mode. Javascript has a slightly different workaround for this and so on.
You can use:
[\d\D]*(dagGeneralCodes\$_ctl\d+\$_ctl0)
I'm using [\d\D] instead of . to make it match new-line as well. The * is used in a greedy way so that it will consume all but the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0.
I really like using this Regular Expression Cheatsheet; it's free, a single page, and printed, fits on my cube wall.