Python 2: Regex to get text anywhere between two strings

I am trying to find a regex to get the text between Explanation One: and Explanation Two:
The trick is that the text may or may not exist, and it could be on the same line as Explanation One: or on the line after it. The current regex in the code below adds an extra line after the text it finds before Explanation Two:
Any pointers appreciated on getting just the text, ignoring additional empty lines.
import re
STRING="""Explanation One:
Blah Blah
Explanation Two: ndnlnlkn
"""
pattern = r'Explanation One:[\r\n ].*(?=Explanation Two:)+'
regex = re.compile(pattern, re.IGNORECASE)
print regex.search(STRING).group()
Output:
Explanation One:
Blah Blah

To match the text between Explanation One: and Explanation Two: you could capture it in a group using the DOTALL flag or use an inline modifier (?s) to make the dot match a newline.
Explanation One:\s*(.*?)\s*Explanation Two
Explanation
Explanation One: Match literally
\s* Match a whitespace character zero or more times
(.*?) Capture in a group any character zero or more times, non-greedy
\s* Match a whitespace character zero or more times
Explanation Two Match literally
Regex demo
Demo Python
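As a minimal sketch of the pattern above in Python, assuming the DOTALL flag is passed at compile time: the helper name text_between and the extra sample inputs are made up for illustration, and print() is used so the snippet also runs on Python 3.

```python
import re

# Lazy capture between the two markers; DOTALL lets "." cross newlines.
BETWEEN = re.compile(r'Explanation One:\s*(.*?)\s*Explanation Two',
                     re.DOTALL | re.IGNORECASE)

def text_between(s):
    """Return the text between the markers, or None if they are not found."""
    m = BETWEEN.search(s)
    return m.group(1) if m else None

sample = """Explanation One:
Blah Blah
Explanation Two: ndnlnlkn
"""
print(repr(text_between(sample)))  # 'Blah Blah'
print(repr(text_between("Explanation One: same line\nExplanation Two: x")))  # 'same line'
print(repr(text_between("Explanation One:\nExplanation Two: x")))  # ''
```

The \s* on either side of the group is what drops the surrounding newlines, so an empty section comes back as an empty string rather than a blank line.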

The problem with your current approach is that the regex is not running in DOTALL mode. This means that .* will not match across lines, which is precisely what you want it to do until it reaches the Explanation Two: marker text. One way around this is to match the following:
[\s\S]*
This will match anything, whitespace or non whitespace, meaning it will match everything even across lines.
pattern = r'Explanation One:([\s\S]*)(?=Explanation Two:)'
searchObj = re.search(pattern, STRING, re.M|re.I)
print searchObj.group(1)
Blah Blah
Demo
By the way, an alternative would be to use .* in the capture group instead and add the re.DOTALL flag to the re.search call. So the following should also work:
pattern = r'Explanation One:(.*)(?=Explanation Two:)'
searchObj = re.search(pattern, STRING, re.M|re.I|re.DOTALL)
print searchObj.group(1)
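A quick check of the two variants above, as a sketch: both capture groups should hold the same text, including the newlines on either side of it, so a .strip() call (an addition here, not part of the answer) is one way to drop them. The variable names are illustrative.

```python
import re

STRING = """Explanation One:
Blah Blah
Explanation Two: ndnlnlkn
"""
# Variant 1: [\s\S] matches any character, newline included.
any_char = re.search(r'Explanation One:([\s\S]*)(?=Explanation Two:)', STRING, re.I)
# Variant 2: plain "." with the DOTALL flag.
dotall = re.search(r'Explanation One:(.*)(?=Explanation Two:)', STRING, re.I | re.DOTALL)

raw = any_char.group(1)
print(repr(raw))               # '\nBlah Blah\n'
print(raw == dotall.group(1))  # True
print(repr(raw.strip()))       # 'Blah Blah'
```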

Related

Regex: How to get all words, special characters and white spaces between quotation marks?

Currently I have a regex ([^\[\][\[^\[\][\n"]+) to match text between "", but this does not capture whitespace, e.g. if I enter " hello ", it will return hello, without the spaces before and after the word.
Is there some expression I can use to just simply catch anything between two quotation marks?
Thank you.

Maybe this will help:
(?<!\\)(\"|')(.+?)(?:(?<!\\)\1)
And to get the text inside the quotes, get the second capture group.
Proof.
Explanation
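(?<!\\) - Negative lookbehind. Asserts that the character before the current position is not a literal backslash (\), so escaped quotes are skipped.
(\"|') - Matches the opening quote of the "string" and captures it in group 1.
(.+?) - . matches any character except a newline.
+? is lazy: it matches one or more characters, but as few as needed for the rest of the pattern to match.
(?:(?<!\\)\1) - Non-capturing group.
Used here so we can apply the (?<!\\) described earlier to the closing quote only, without looking behind the whole expression.
\1 is a backreference to the first capture group ((\"|')), so the closing quote must be the same character as the opening one. ($1 is the replacement-string spelling in some flavors; inside the pattern, use \1.)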
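A small sketch of this pattern in Python, assuming the second capture group is the part of interest. The helper name quoted_texts and the sample string are hypothetical.

```python
import re

# Opening quote not preceded by a backslash, lazy body, then the same
# quote character again (also not preceded by a backslash).
QUOTED = re.compile(r'''(?<!\\)("|')(.+?)(?<!\\)\1''')

def quoted_texts(s):
    """Return the contents of each quoted run, whitespace included."""
    return [m.group(2) for m in QUOTED.finditer(s)]

print(quoted_texts('x = " hello " and y = \'a b\''))  # [' hello ', 'a b']
```

Group 2 keeps the leading and trailing spaces, which is what the question asks for.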

You should use the following regex:
\"\s*([^\"]+?)\s*\"
([^\"]+?) captures the text you want, without the whitespace between it and the quotes.
Demo & Explanation

Cut lines using Notepad++ Regexp replace

I need to cut lines that have 6 or more characters, hyphen, then other characters or symbols. Hyphen and rest of line should be removed. Source text:
0402CS-2
0402CS-3
0402
7812-C
0603CS-1
0603CS-2
0603CS-3
As a result, I need this:
0402CS
0402CS
0402
7812-C
0603CS
0603CS
0603CS
To do that, I use Notepad++ regexp replace feature. Find pattern: ^([^\-]{6,})\-.+$ Replace pattern: \1
But there is no "multiline" option, so the symbols "^" and "$" don't match ONLY the beginning and end of a line, and I actually get this result:
0402CS
0402CS
0402
7812 <-- that's wrong!
0603CS
0603CS
0603CS
Please advise me how to fix the find pattern. Or maybe there is another handy and powerful free text editor that can do this?

^([^\n\-]{6,})\-.+$
    ^^
Just add \n to the character class. Because [^\-] also matches a newline, the regex can run on into the line below and use that line to complete the match.
See demo.
https://regex101.com/r/BHO93c/1
For the input
0402
7812-C
the regex treats both lines as one line and makes a match.
See demo if 0402 is not there.
https://regex101.com/r/BHO93c/2

That happens because the [^-] character class also matches a newline.
Add \n to it:
^([^\n-]{6,})-.+$
See the regex online demo. Note the m (multiline) modifier, which makes ^ match the start of a line and $ the end of a line, and the g modifier, which enables searching for multiple occurrences; both behaviors are ON by default in Notepad++.
Note that escaping the hyphen is not necessary inside a character class when it is at the start/end of the class, and you never need to escape the hyphen outside the character class.
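The effect can be reproduced outside Notepad++. The sketch below uses Python's re module with re.MULTILINE, on the assumption that it behaves like Notepad++'s engine for this pattern; the variable names are illustrative.

```python
import re

source = "0402CS-2\n0402CS-3\n0402\n7812-C\n0603CS-1\n0603CS-2\n0603CS-3"

# [^-] also matches "\n", so "0402\n7812" counts as six or more non-hyphens.
buggy = re.sub(r'^([^-]{6,})-.+$', r'\1', source, flags=re.MULTILINE)

# Excluding "\n" from the class keeps every match on a single line.
fixed = re.sub(r'^([^\n-]{6,})-.+$', r'\1', source, flags=re.MULTILINE)

print(buggy.splitlines()[3])  # 7812
print(fixed.splitlines()[3])  # 7812-C
```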

PCRE Regular expression : only one matching

I want to capture the strings in a subject string that match a pattern.
Patterns examples: ##name##, ##address##, ##bankAccount##, ...
Subject example: This is the template with patterns : ##name##Your bank account is : ##bankAccount##Your address is : ##address##
With the following regex: .*(#{2}[a-zA-Z]*#{2}).*, only the last pattern is matched.
How to capture all the patterns, not just the last or first ?

Now that I've formatted your regex properly, the problem shows. A * in your regex was hidden since markdown took it to make the text italics.
Your opening .* matches greedily as much as it can, only backing up enough to let (#{2}[a-zA-Z]*#{2}) match. This matches the last pattern found in the line, everything before it having been matched by the .*.

You need to remove .* as I mentioned in my comment, and use preg_match_all:
$re = '~#{2}[a-zA-Z]*#{2}~';
preg_match_all($re, "##name##, ##address##, ##bankAccount##", $m);
print_r($m);
See the PHP demo
The .*#{2}[a-zA-Z]*#{2}.* matched 0 or more characters other than a newline at first, grabbing the whole line, and then backtracked until the last occurrence of #{2}[a-zA-Z]*#{2} pattern, and the last .* only grabbed the rest of the line. Removing the .* and using preg_match_all, all substrings matching the #{2}[a-zA-Z]*#{2} pattern can be extracted.
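The same behavior can be seen in Python, where re.findall plays roughly the role of preg_match_all. This is an illustrative sketch rather than a translation of the PHP answer; the subject string is the one from the question.

```python
import re

subject = ("This is the template with patterns : ##name##"
           "Your bank account is : ##bankAccount##"
           "Your address is : ##address##")

# With the greedy .* on both sides, only the last placeholder is captured.
greedy = re.search(r'.*(#{2}[a-zA-Z]*#{2}).*', subject)
print(greedy.group(1))  # ##address##

# Without the .* wrappers, findall returns every placeholder in order.
placeholders = re.findall(r'#{2}[a-zA-Z]*#{2}', subject)
print(placeholders)  # ['##name##', '##bankAccount##', '##address##']
```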

Replace whitespaces between specific strings

I'm trying to replace whitespaces with underscores in certain parts of my html-document with Notepad++.
I can identify the area to search for the whitespaces in the following way:
-Begins with: src="video/
-Ends with: mp4
For example I might have a line like this:
<video class="play" src="video/my file name with empty spaces.mp4">
and I would like to change it to be like this:
<video class="play" src="video/my_file_name_with_empty_spaces.mp4">

Tested in N++
Search: (?:src="video|(?<!^)\G)(?:(?!mp4).)*?\K\s+
Replace: _
On the demo, see the substitutions at the bottom.
Explanation
(?:src="video|(?<!^)\G) matches the delimiter src="video, or \G, the position following the previous match, as long as that position is not the beginning of the string ((?<!^)), where \G can also match
(?:(?!mp4).) matches one character, provided mp4 does not start at that character
*? lazily matches such characters, up to...
\s+ one or more whitespace characters (our match, which we replace with _)
before the whitespace, the \K tells the engine to drop what was matched so far from the final match it returns
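Python's standard re module supports neither \G nor \K, so the sketch below swaps in a different technique for the same result: match the whole src value, then replace the whitespace inside it with a callback. The function name underscore_spaces is hypothetical.

```python
import re

html = '<video class="play" src="video/my file name with empty spaces.mp4">'

def underscore_spaces(match):
    # Keep the src="video/ prefix; replace whitespace runs in the file name.
    return match.group(1) + re.sub(r'\s+', '_', match.group(2))

# Group 1 is the opening delimiter, group 2 the text up to the first "mp4".
result = re.sub(r'(src="video/)(.*?mp4)', underscore_spaces, html)
print(result)
# <video class="play" src="video/my_file_name_with_empty_spaces.mp4">
```

Whitespace outside the src value, such as the space after class="play", is left alone because the callback only touches group 2.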

Is there a way to "recall" a char sequence already matched in the regex itself?

The regex I'm searching has the following constraints:
it starts with "//"
then "[" a non number sequence (called delimiter in this list) and "]"
next line "\n"
"[" 0 or more number separated by the delimiter previously found "]".
For example the following text matches the regex:
//[*#*]
[1*#*34*#*64]
and the following text doesn't match the regex:
//[*#*]
[1#34#64]
because the delimiter is not the same matched in the first row
The regex I currently have is
^//\[(\D)+\]\n\[[(\d)+(\D)+]*(\d)+\]$|^//\[(\D)+\]\n\[\]$|^//\[(\D)+\]\n\[(\d)+\]$
but obviously this regex matches both of the previous examples.
Is there a way to "recall" a char sequence already matched in the regex itself?

You need something called back-reference (a very good tutorial here).
Use this regex in Python:
r'^//\[([^\]]+)\]\n\[\d+(\1\d+)*\]'
Sample run:
>>> string = """//[*#*]
... [1*#*34*#*64]"""
>>> print re.search(r'^//\[([^\]]+)\]\n\[\d+(\1\d+)*\]',string).group(0)
//[*#*]
[1*#*34*#*64]
This matches your string in Python.
Debuggex Demo

You need to use a back-reference. In most languages you can reference a capture group with \n, where n is the group number.
This pattern will work:
//\[([^]]++)]\n\[(?>\d++\1?)+]
To break it down:
// just matches the literal
\[([^]]++)] matches some characters in square brackets
\n matches the newline
\[(?>\d++\1?)+] matches one or more digits, optionally followed by the text captured by the first group, repeated one or more times. (?>...) is an atomic group.
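To check the back-reference against both examples from the question, here is a sketch in Python based on the first answer's pattern. The optional group (so an empty list [] is accepted) and the \Z end anchor are additions for this sketch, and the helper name is_valid is hypothetical.

```python
import re

# Group 1 captures the delimiter; \1 requires the same text between numbers.
# The optional (?:...)? and the \Z anchor are additions for this sketch.
LIST = re.compile(r'//\[([^\]]+)\]\n\[(?:\d+(?:\1\d+)*)?\]\Z')

def is_valid(text):
    """True if the second line uses the delimiter declared on the first."""
    return LIST.match(text) is not None

print(is_valid("//[*#*]\n[1*#*34*#*64]"))  # True
print(is_valid("//[*#*]\n[1#34#64]"))      # False
print(is_valid("//[*#*]\n[]"))             # True
```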
