How to match until get specific pattern in Regex - regex

I have a scenario where I want to match a specific word and then match everything until I reach another pattern. For example:
ABC=145865865
Then anything comes in ways
and then
Date=11/11/2001
I have tried (.*?), but it only matches within that one line; in my scenario I have multiple lines of data in between.
How can I do this?

Closest guess to what I think you're looking for:
ABC=(\d+)[\s\S]*?Date=(\d\d/\d\d/\d{4})
This uses [\s\S] which means "either a whitespace character or not a whitespace character", which is equivalent to "any character". The . can also be set to match any character, but I tend to prefer [\s\S] because it does just that without having to set flags. You haven't specified the language you are using so I can't tell you how to set such a flag anyway (it's re.DOTALL in Python).
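As a minimal sketch in Python (the sample text and variable names are made up for illustration), the pattern above can be checked against input that has several lines between the two markers:

```python
import re

# Hypothetical sample input: several unrelated lines sit between the two markers.
text = "ABC=145865865\nThen anything comes in ways\nand then\nDate=11/11/2001\n"

# [\s\S] matches any character, newlines included, so no flag is needed.
pattern = re.compile(r"ABC=(\d+)[\s\S]*?Date=(\d\d/\d\d/\d{4})")

match = pattern.search(text)
abc_value, date_value = match.groups()
print(abc_value, date_value)
```

The lazy quantifier *? makes the match stop at the first Date= that follows, rather than the last one in the input.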

Multiple lines? If you mean you have newline characters (\n) in between, then you need to set the DOTALL flag, as follows:
Pattern p = Pattern.compile(<your-regex-here>, Pattern.DOTALL);
With this flag set, . also matches the newline characters between the two strings.
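The effect of the flag can be seen in a rough Python sketch, where the equivalent flag is re.DOTALL. The sample text and variable names here are assumptions for illustration only:

```python
import re

# Hypothetical sample input with newlines between the two markers.
text = "ABC=145865865\nsome line\nanother line\nDate=11/11/2001"
regex = r"ABC=(\d+).*?Date=(\d\d/\d\d/\d{4})"

# Without the flag, . stops at the first newline, so nothing matches.
without_flag = re.search(regex, text)

# With re.DOTALL, . also matches newlines and the match spans all lines.
with_flag = re.search(regex, text, re.DOTALL)

print(without_flag)
print(with_flag.groups())
```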

Related

Regex not extracting all matching words

I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This is the regex \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups:
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
You are not doing anything wrong, except that to match Unicode word boundaries you have to enable the u modifier, or use (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match hyphens as well, add a hyphen to your character class: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)
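For comparison, here is a rough sketch of the same idea in Python, where \w is Unicode-aware by default for str patterns, so no extra flag is needed. The NFC normalisation step is an added assumption (it guards against input that stores the accents as separate combining marks), and the variable names are illustrative:

```python
import re
import unicodedata

# Normalise to NFC so each accented letter is a single code point
# (an assumption about the input; decomposed text would otherwise not match).
SPECIALS = unicodedata.normalize("NFC", "āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ")
sentence = unicodedata.normalize(
    "NFC",
    "His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn "
    "Al-Daḥāk Al-Sulamī Al-Tirmidhī.",
)

# Optional word characters or hyphens, one required special character,
# then more word characters or hyphens.
pattern = re.compile(r"[\w-]*[" + SPECIALS + r"][\w-]*")

words = pattern.findall(sentence)
print(words)
```

Plain words such as "ibn" or "Sawrah" are skipped because they contain none of the required characters, and the trailing full stop is left out because it is neither a word character nor a hyphen.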

Pattern regex inclusion special characters

I have this line
pattern = "\S*\w+(\s?$|\s{1,}\w+)+"
It works fine in that it blocks leading white space and allows white space between the words, but I cannot include special characters (for example: '+' &%) without changing this behavior. Can someone help me out? Thank you
If all you want is a space split you should replace \w with \S.
And anyway having \S*\w+ is sort of redundant, you could simplify with \S*\w.
But if you want finer control why not write out the whole range and replace \w with [a-zA-Z0-9_+&%]?
Check out regular expressions for javascript
Only \S matches special characters, \w only matches [a-zA-Z0-9_].
So you could simply replace them to
pattern="\S*\S+(\s?$|\s{1,}\S+)+"
but there is so much redundancy then. Simplify it to
pattern="\S+(\s+\S+)*\s?"
or if really the only thing you care about is starting with \S then just do
pattern="\S[\s\S]*" <!-- or -->
pattern="\S.*" <!-- not allowing linebreaks -->
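A quick way to sanity-check the simplified pattern is a small Python sketch. It uses fullmatch on the assumption that the pattern has to cover the whole value, as an HTML pattern attribute would require; the sample strings are made up:

```python
import re

# The simplified pattern from the answer above: words separated by whitespace,
# with at most one trailing whitespace character.
pattern = re.compile(r"\S+(\s+\S+)*\s?")

# Hypothetical inputs chosen to exercise each rule.
samples = ["foo + bar &%", " leading space", "trailing ", "double trailing  "]
results = {s: pattern.fullmatch(s) is not None for s in samples}
print(results)
```

Special characters pass because \S accepts anything that is not whitespace, while a leading space or more than one trailing space is rejected.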
From my understanding, if you want it to not match leading whitespace or special characters, simply remove the \S*; it matches anything other than whitespace, which includes special characters.
\w+(\s?$|\s{1,}\w+)+
This blocks whitespace and special characters at the beginning of the match; however, special characters in between words would still not be allowed. For that I would replace the \s with \W (non-word characters). This allows spaces and special characters in between the words.
\w+(\W?$|\W{1,}\w+)+
A great site for testing regexes, and where I was able to confirm this, is regex101.com. It lets you test a regex as you type it, with detailed information about what the regex will do. You can also include sample text to see what your regex will find in it. The above regex, when given " ! Test", captured only Test and ignored both the ! and the spaces before Test.

Vim RegEx: Match until blank line

I'm trying to write a RegEx that will match any line that contains ".wpd", and then match all lines after that until it reaches a blank line (including the blank line).
This is what I've tried:
/\v^.*.wpd\_.\{-}^\s*$
However, the non-greedy operator \{-} after the "all characters including new lines" character class \_. doesn't seem to work. If I use
/\v^.*.wpd\_.*
that will match the next line containing ".wpd" and then all lines after that. However, as soon as I change the * to \{-}, it doesn't match anything at all.
What am I doing wrong? Thanks!
This one seems to work:
/\v^.*\.wpd\_.{-}\n\s*\n
You cannot use the atom ^ (same for $) inside the regexp, it has its special meaning only at the front (back); elsewhere, it's taken as the literal char. Use \n to match a newline inside the regexp, as shown by perreal's answer.
(?s)[^\n\r]*\.wpd(.*?)\n{2}
(?s) - Turn on 'dot matches line breaks' to search across lines
[^\n\r]* - Starting at the beginning of a line, match anything that's not a line break
\.wpd - Match '.wpd'
(.*?) - Match anything, non-greedily, including line breaks ( because we turned on (?s) previously )
\n{2} - ... until you find two newlines in a row, which would be a blank line
:)
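The PCRE-style pattern above also runs unchanged in Python's re module, so it can be tried outside Vim. This is only a sketch with made-up file contents:

```python
import re

# Hypothetical file contents: a .wpd line, two more lines, then a blank line.
text = "intro\nreport.wpd first\nline two\nline three\n\nafter blank\n"

# Same pattern as above; (?s) lets . match newlines.
pattern = re.compile(r"(?s)[^\n\r]*\.wpd(.*?)\n{2}")

match = pattern.search(text)
print(repr(match.group(0)))
```

Note that \n{2} only recognises a completely empty line; a "blank" line that contains spaces or tabs would need something like \n[ \t]*\n instead.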
The following is a large supporting comment to @perreal's answer above as well as my own version of that answer which I find more intuitive.
Let's dissect the following regexp based on http://vimdoc.sourceforge.net/htmldoc/pattern.html#/magic
/\v^.*\.wpd\_.{-}\n\s*\n
\v (lowercase v): This is the 'very magic' operator, which signifies that in the pattern after it all ASCII characters except '0'-'9', 'a'-'z', 'A'-'Z' and '_' have a special meaning. Therefore, characters like *, ^, $ need not be escaped in the pattern, but for _ to have special meaning (such as modifying the behaviour of . to match newline), it needs to be escaped. Hence with \v set, you need \_ for the latter to have special meaning. To truly appreciate how much very magic simplifies the expression, compare it with the same expression using very nomagic (uppercase \V): /\V\^\.\*.wpd\_\.\{-}\n\s\*\n (very nomagic) vs /\v^.*\.wpd\_.{-}\n\s*\n (very magic)
^.*\.wpd: Greedily match anything (.*) from the beginning of a line (^) till .wpd
\_. : Matches a single character, which can be any character including the newline. Note that with \v set, the pattern must have an escaped underscore as noted above.
{-} : Is the non-greedy equivalent of the * quantifier. So, where .*BLAH matches the most possible characters till BLAH, .{-}BLAH will match the least possible. (In PCRE-style engines the same lazy quantifier is written *? instead of {-}.)
\n\s*\n: Matches a blank line, which may contain zero or more spaces or tabs
\_.{-}\n\s*\n: combines the above two and means Match the least possible number of characters including newline (\_.) until a blank line (\n\s*\n)
\v^.*\.wpd\_.{-}\n\s*\n: Finally, putting it all together: set the very magic operator (to simplify the pattern by not needing to escape anything except an _ for special meaning), search for any line which contains .wpd and match until the closest blank line.
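The greedy versus non-greedy difference described above can be seen in a short Python sketch, where the lazy quantifier is spelled *? rather than Vim's {-}. The input string is invented for the demonstration:

```python
import re

# Hypothetical input containing the terminator twice.
text = "xxBLAHyyBLAHzz"

greedy = re.search(r".*BLAH", text).group(0)   # runs to the last BLAH
lazy = re.search(r".*?BLAH", text).group(0)    # stops at the first BLAH

print(greedy)
print(lazy)
```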
My version using variants of end-of-line start-of-line characters
The only modification is to the expression used to signify a blank line. I find it useful to define a blank line in terms of the start-of-line ('^') and end-of-line ('$') characters, however as-is, they cannot be used anywhere in a regexp except the beginning and the end respectively.
For the above use-case, there are variants which can be used anywhere in a regex, namely \_^ and \_$ respectively. Therefore the blank line expression can be written as \_^\s*\_$ instead of \n\s*\n, thus making the complete expression:
\v^.*\.wpd\_.{-}\_^\s*\_$
This perhaps is closer to answering the OP's question about why they were unable to use the start-of-line character in their expression.
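As a loose analogy outside Vim, Python's re.MULTILINE flag makes ^ and $ usable anywhere in a pattern, much like \_^ and \_$. The sketch below is an approximation under that assumption, not a translation of Vim's semantics, and the sample text is made up:

```python
import re

# Hypothetical input: a .wpd line, one more line, a blank line, then more text.
text = "a.wpd x\nline\n\nafter"

# With re.MULTILINE, ^ and $ match at every line boundary, so they can appear
# mid-pattern. [ \t]* is used instead of \s* so the blank-line check cannot
# itself run across newlines.
pattern = re.compile(r"^.*\.wpd[\s\S]*?^[ \t]*$", re.MULTILINE)

match = pattern.search(text)
print(repr(match.group(0)))
```

Because $ is zero-width, the match ends just before the newline that terminates the blank line rather than including it.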
Phew!

Regex to insert space in vim

I am a regex supernoob (just reading my first articles about them), and at the same time working towards stronger use of vim. I would like to use a regex to search for all instances of a colon : that are not followed by a space and insert one space between those colons and any character after them.
If I start with:
foo:bar
I would like to end with
foo: bar
I got as far as %s/:[a-z] but now I don't know what do for the next part of the %s statement.
Also, how do I change the :[a-z] statement to make sure it catches anything that is not a space?
:%s/:\(\S\)/: \1/g
\S matches any character that is not whitespace, but you need to remember what that non-whitespace character is. This is what the \(\) does. You can then refer to it using \1 in the replacement.
So you match a :, some non-whitespace character and then replace it with a :, a space, and the captured character.
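The same capture-and-reinsert idea can be sketched in Python, where the group is written (\S) without backslashes before the parentheses. The sample lines are assumptions for illustration:

```python
import re

# Hypothetical sample lines; the second already has a space after the colon.
lines = ["foo:bar", "key: value"]

# (\S) captures the non-whitespace character so \1 can put it back.
fixed = [re.sub(r":(\S)", r": \1", line) for line in lines]
print(fixed)
```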
Changing this to only modify the text when there's only one : is fairly straight forward. As others have suggested, using some of the zero-width assertions will be useful.
:%s/:\@<!:[^:[:space:]]\@=/: /g
:\@<! matches any non-:, including the start of the line. This is an important characteristic of the negative lookahead/lookbehind assertions. It's not requiring that there actually be a character, just that there isn't a :.
: matches the required colon.
[^:[:space:]] introduces a couple more regex concepts.
The outer [] is a collection. A collection is used to match any of the characters listed inside. However, a leading ^ negates that match. So, [abc123] will match a, b, c, 1, 2, or 3, but [^abc123] matches anything but those characters.
[:space:] is a character class. Character classes can only be used inside a collection. [:space:] means, unsurprisingly, any whitespace. In most implementations, it relates directly to the result of the C library's isspace function.
Tying that all together, the collection means "match any character that is not a : or whitespace".
\@= is the positive lookahead assertion. It applies to the previous atom (in this case the collection) and means that the collection is required for the pattern to be a successful match, but will not be part of the text that is replaced.
So, whenever the pattern matches, we just replace the : with itself and a space.
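For readers more used to PCRE-style syntax, here is a rough Python equivalent of the lookbehind and lookahead combination. The sample lines are invented, and \s stands in for [:space:]:

```python
import re

# Hypothetical sample lines: a single colon, a double colon, and an
# already-spaced colon.
lines = ["foo:bar", "std::string", "key: value"]

# (?<!:) is a negative lookbehind, (?=[^:\s]) a positive lookahead;
# only the colon itself is consumed and replaced.
pattern = re.compile(r"(?<!:):(?=[^:\s])")

fixed = [pattern.sub(": ", line) for line in lines]
print(fixed)
```

The double colon is left alone because the first colon fails the lookahead and the second fails the lookbehind.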
You want to use a zero-width negative lookahead assertion, which is a fancy way of saying look for a colon that is not followed by a space, without including the next character in the match:
:%s/: \@!/: /g
The \@! is the negative lookahead.
An interesting feature of Vim regex is the presence of \zs and \ze. Other engines might have them too, but they're not very common.
The purpose of \zs is to mark the start of the match, and \ze the end of it. For example:
ab\zsc
matches c, only if before you have ab. Similarly:
a\zebc
matches a only if you have bc after it. You can mix both:
a\zsb\zec
matches b only if in between a and c. You can also create zero-width matches, which are ideal for what you're trying to do:
:%s/:\zs\ze\S/ /
Your search has no size, only a position. And then you substitute that position with " ". By the way, \S means any character except whitespace.
:\zs\ze\S matches the position between a colon and something not a space.
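Python's re module has no \zs or \ze, but a similar zero-width match can be approximated with a lookbehind and a lookahead. This is a sketch of the idea with made-up sample lines, not an exact equivalent:

```python
import re

# Hypothetical sample lines; the second already has a space after the colon.
lines = ["foo:bar", "key: value"]

# (?<=:) and (?=\S) are both zero-width, so the match is only a position
# and the replacement inserts a space there.
fixed = [re.sub(r"(?<=:)(?=\S)", " ", line) for line in lines]
print(fixed)
```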
You probably want to use :[^ ] to match everything except spaces. As mentioned by Matt, this will cause your replace to replace the extra character.
There are several ways to avoid this; here are 2 that I find useful.
1) Surround the last part of the search term with parentheses \(\); this allows you to reference that part of the search in your replacement with \1.
Your final replace string should look like this:
%s/:\([^ ]\)/: \1/g
2) End the search term early with \ze. This means that the entire search term must be met for a match, but only the part before \ze will be highlighted or replaced.
Your final replace string should look like this:
%s/:\ze[^ ]/: /g

Regex help NOT a-z or 0-9

I need a regex to find all chars that are NOT a-z or 0-9
I don't know the syntax for the NOT operator in regex.
I want the regex to be NOT [a-z, A-Z, 0-9].
Thanks in advance!
It's ^. Your regex should use [^a-zA-Z0-9]. Beware: this character class may have unexpected behavior with non-ascii locales. For instance, this would match é.
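A small Python sketch shows both the negation and the non-ASCII caveat. The sample text is an assumption, with é written as the escape \u00e9:

```python
import re

# Hypothetical sample text; \u00e9 is the accented letter é.
text = "Hi, caf\u00e9 42!"

# ^ as the first character inside [...] negates the set.
non_alnum = re.findall(r"[^a-zA-Z0-9]", text)
print(non_alnum)
```

The accented letter shows up in the result alongside the punctuation and spaces, because it lies outside the ASCII ranges listed in the class.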
Edited
If the regexes are Perl-compatible (PCRE), you can use \s to match all whitespace. This expands to include spaces and other whitespace characters. If they're POSIX-compatible, use the [:space:] character class (like so: [^a-zA-Z0-9[:space:]]). I would recommend using [:alnum:] instead of a-zA-Z0-9.
If you want to match the end of a line, you should include a $ at the end. Turn on multiline mode only when your match should extend across multiple lines; it reduces performance for larger files since more must be read into memory.
Why don't you include a copy of sample input, the text you want to match, and the program you are using to do so?
It's pretty simple; you just add ^ at the beginning of a character set to negate that character set.
For example, the following pattern will match everything that's not in that character set, i.e., anything that is not a lowercase ASCII letter or a digit:
[^a-z0-9]
As a side note, some of the more helpful Regular Expression resources I've found have been this site and this cheat sheet (C# specific).
Put a ^ at the beginning of your character class expression: [^a-z0-9]
At start [^a-zA-Z0-9]
for condition;
preg_match();
preg_replace();
eregi();
try this
You can also use \W, which is shorthand for a non-word character (equivalent to [^a-zA-Z0-9_] when matching is ASCII-only)
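Whether \W really equals [^a-zA-Z0-9_] depends on the engine and its flags. As an illustration in Python, where str patterns are Unicode-aware by default (the sample text is made up):

```python
import re

# Hypothetical sample text; \u00e9 is the accented letter é.
text = "a_b caf\u00e9!"

# Default str patterns in Python 3 are Unicode-aware: é counts as a word character.
unicode_mode = re.findall(r"\W", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so \W then also matches é.
ascii_mode = re.findall(r"\W", text, re.ASCII)

print(unicode_mode)
print(ascii_mode)
```

In both modes the underscore counts as a word character, which is the main practical difference from [^a-zA-Z0-9].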