Multiline Regex not catching the correct mask - regex

I'm trying to catch a tag with a special syntax in a file with this regex :
([a-z0-9 >}\/])(\{(var)\:([a-z0-9\_\/\-\.]+)([\?0-9]+)*\})([a-z0-9 {<\/])
The tag looks like :
{var:contactText}
But as you can see in my regex, I want to catch what's before and after the {var:something}. My expression works fine except when the tag is alone on a line.
I've added the m flag to prevent the problem, but it's still not working.
Live example: https://regex101.com/r/6T6OJm/1/
Am I missing something? It seems to be the last part with ([a-z0-9 {<\/]) which doesn't accept line break, so what's the solution?

Like the cat, the "multiline" modifier is a false friend. The m modifier doesn't mean the pattern will magically run over multiple lines; it only changes the meaning of the ^ and $ anchors (from start/end of the string to start/end of the line).
All you need is to allow for potential whitespace characters using the \s class (which also includes the carriage return \r and the newline \n characters).
~
([a-z0-9 >}/])
\s* ( { (var) : ([a-z0-9_/.-]+) ([?0-9]+)? } ) \s*
([a-z0-9 {</])
~xi
demo
Note that many characters in the pattern don't need to be escaped.
Since the pattern is a bit long, I used the x modifier so that spaces in the pattern are ignored: it's more readable.
I'm not sure that all the capture groups are useful.
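To see the difference concretely, here is a minimal Python sketch. It assumes Python's re module as a stand-in for the PCRE flavour used in the question; the sample text and the variable names are made up for illustration, and the braces are escaped only to be explicit.

```python
import re

# Hypothetical sample: the tag sits alone on its own line.
text = "<p>\n{var:contactText}\n</p>"

# Original pattern: the m flag is set, but the classes before and after
# the tag still cannot match a newline.
original = re.compile(
    r"([a-z0-9 >}/])(\{(var):([a-z0-9_/.-]+)([?0-9]+)?\})([a-z0-9 {</])",
    re.IGNORECASE | re.MULTILINE,
)

# Pattern from the answer: optional whitespace (\s*) on both sides of the tag.
fixed = re.compile(
    r"""
    ([a-z0-9 >}/])
    \s* ( \{ (var) : ([a-z0-9_/.-]+) ([?0-9]+)? \} ) \s*
    ([a-z0-9 {</])
    """,
    re.IGNORECASE | re.VERBOSE,
)

no_match = original.search(text)
match = fixed.search(text)
print(no_match)
print(match.group(2), match.group(4))
```

The m flag makes no difference to the first pattern because it contains no ^ or $ anchors at all.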

Related

Regex adding undesired line break between backreference and literal in Notepad++

I have a text file with a list of elements separated by line-breaks, like this:
alpha
beta
gamma
...
I want to get it into this format:
(alpha),
(beta),
(gamma),
...
So I am using following regular expressions in Notepad++ for replacing those lines:
Find: ([^\n]+)
Replace: \($1\),
but the output now strangely has another line-break for each line into it:
(alpha
),
(beta
),
(gamma
),
...
I have no clue how this is happening. When I solely use $1 or \), apart for replacement it works just fine, but everytime I put a literal after the backreference it puts a line-break in between. I know that I can work around that with another regular expression afterwards, but could anybody explain to me why exactly this is happening?
Instead of [^\n] (= any char but an LF, line feed, \n), which also matches the CR (\r) of a Windows CRLF line ending and so pulls it into the capture group, you should use . which matches any char other than line break chars. Use the following regex to match a non-empty line:
^.+$
Replace with \($0\), where $0 replacement backreference (also called placeholder) stands for the whole match and the parentheses are escaped (since parentheses are special metacharacters inside Boost replacement patterns used to define conditional replacement patterns).
No need to use the m modifier here since ^ and $ anchors match start and end of the line respectively by default in Notepad++.
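The effect can be reproduced outside Notepad++. The following is only a sketch in Python, whose re module is not the Boost engine Notepad++ uses: in Python the dot excludes only \n, so [^\r\n]+ is used here to imitate what . does in Notepad++. The sample text assumes Windows (CRLF) line endings.

```python
import re

text = "alpha\r\nbeta\r\ngamma\r\n"  # Windows (CRLF) line endings

# [^\n]+ also swallows the \r, so the closing parenthesis lands after it.
broken = re.sub(r"([^\n]+)", r"(\1),", text)

# Excluding both line-break characters keeps the \r\n pair intact.
fixed = re.sub(r"[^\r\n]+", lambda m: "(" + m.group(0) + "),", text)

print(repr(broken))
print(repr(fixed))
```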

Cut lines using Notepad++ Regexp replace

I need to cut lines that have 6 or more characters, hyphen, then other characters or symbols. Hyphen and rest of line should be removed. Source text:
0402CS-2
0402CS-3
0402
7812-C
0603CS-1
0603CS-2
0603CS-3
As a result, I need this:
0402CS
0402CS
0402
7812-C
0603CS
0603CS
0603CS
To do that, I use Notepad++ regexp replace feature. Find pattern: ^([^\-]{6,})\-.+$ Replace pattern: \1
But there is no "multiline" option, so the symbols "^" and "$" don't match ONLY the beginning and end of the line, and the actual result is:
0402CS
0402CS
0402
7812 <-- that's wrong!
0603CS
0603CS
0603CS
Please advise me how to fix the find pattern. Or maybe there is another handy and powerful free text editor that can do that?
^([^\n\-]{6,})\-.+$
    ^^
Just add \n to the character class: because [^-] also matches a newline, the regex can run on into the line below and use that line to complete a match.
See demo.
https://regex101.com/r/BHO93c/1
For the input
0402
7812-C
the regex treats both lines as one line and makes a match.
See demo if 0402 is not there.
https://regex101.com/r/BHO93c/2
That happens because the [^-] character class also matches a newline.
Add \n to it:
^([^\n-]{6,})-.+$
See the regex online demo (note the m multiline modifier (making ^ match the start of the line, and $ - the end of the line) and g modifier (enabling search for multiple occurrences) that is ON by default in Notepad++).
Note that escaping the hyphen is not necessary inside a character class when it is at the start/end of the class, and you never need to escape the hyphen outside the character class.
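A small Python sketch of the same behaviour, assuming re.MULTILINE plays the role of Notepad++'s default line anchoring; the variable names are illustrative only.

```python
import re

text = "0402CS-2\n0402CS-3\n0402\n7812-C\n0603CS-1"

# [^-] matches a newline too, so "0402\n7812" counts as 6+ non-hyphen chars.
broken = re.sub(r"^([^-]{6,})-.+$", r"\1", text, flags=re.MULTILINE)

# Adding \n to the negated class keeps each match on a single line.
fixed = re.sub(r"^([^\n-]{6,})-.+$", r"\1", text, flags=re.MULTILINE)

print(broken.splitlines())
print(fixed.splitlines())
```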

Notepad++ Replace all with an exception

I am attempting to edit a csv file, below is a sample line from this file.
|MIGRATE|;|10000|;|2ACC0003|;|30/09/13|;|Positive Adjmt.|;||;|MIGRATE|;|95004U
The beginning of the line |MIGRATE| needs to be modified without changing the second MIGRATE so the line would read
|MIGRATE|;|MIG_IN|;|10000|;|2ACC0003|;|30/09/13|;|Positive Adjmt.|;||;|MIGRATE|;|95004U
There are 7700 or so lines so if I am forced to do this manually I will probably cry a little.
Thanks in advance!
Just replace all the ones you want not changed with another word temporarily, then replace the rest with what you want. I'm not sure what you're asking here, but from what I can guess this might help.
It seems like you could just search for:
^\|MIGRATE\|
And replace with:
|MIGRATE|;|MIG_IN|
Make sure you've checked 'Regular expression' in the 'Search Mode' options.
Explanation: The ^ is a begin anchor; it will match the beginning of the line, ensuring that it does not match the second |MIGRATE|. The \ characters are required to escape the | characters since they normally have special meaning in regular expressions, and you want to match a literal |.
You can use beginning of line anchors:
Find:
^(\|MIGRATE\|)
Replace with:
$1;|MIG_IN|
regex101 demo
Just make sure that you are using the regular expression mode of the Search&Replace.
If you want to be a bit fancier, you can use a positive lookbehind:
Find:
(?<=^\|MIGRATE\|)
Replace with:
;|MIG_IN|
^ Will match only at the beginning of a line.
( ... ) is called a capture group, and will save the contents of the match in variable you can use (in the first regex, I accessed the variable using $1 in the replace. The first capture gets stored to $1, the second to $2, etc.)
| is a special character meaning 'or' in regex (to match one character or group of characters or another, e.g. a|b matches a or b). As such, you need to escape it with a backslash to make a regex match a literal |.
In my second regex, I used (?<= ... ) which is called a positive lookbehind. It makes sure that the part to be matched has what's inside before it. For instance, (?<=a)b matches a b only if it has an a before it. So that the b in ab matches but not in bb.
The website I linked also explains the details of the regex and you can try out some regex yourself!
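For anyone who would rather script the change than use the editor, both replacements can be sketched in Python. This assumes Python's re module behaves like Notepad++ for these two simple patterns; the sample line is a shortened, made-up version of the one in the question.

```python
import re

line = "|MIGRATE|;|10000|;|2ACC0003|;|MIGRATE|;|95004U"

# Capture-group version: keep the leading |MIGRATE| and append the new field.
with_group = re.sub(r"^(\|MIGRATE\|)", r"\1;|MIG_IN|", line, flags=re.MULTILINE)

# Lookbehind version: match the empty position right after the leading |MIGRATE|.
with_lookbehind = re.sub(r"(?<=^\|MIGRATE\|)", ";|MIG_IN|", line, flags=re.MULTILINE)

print(with_group)
print(with_lookbehind)
```

In both cases the second |MIGRATE| is untouched because it is not at the start of a line.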

RegEx to detect if a line doesn't end in a semi colon

I'm trying to run through some code files and find lines that don't end in a semicolon.
I currently have this: ^(?:(?!;).)*$ from a bunch of Googling, and it works just fine. But now I want to expand on it so it ignores all the whitespace at the start or specific keywords like package or opening and closing braces.
The end goal is to take something like this:
package example
{
public class Example
{
var i = 0
var j = 1;
// other functions and stuff
}
}
And for the pattern to show me var i = 0 is missing a semi colon. That's just an example, the missing semi colon could be anywhere in class.
Any ideas? I've been fiddling for over an hour but no luck.
Thanks.
If you want a line that doesn't end in a semicolon, you can ask for any amount of anything .* followed by one character that isn't a semicolon [^;] followed possibly by some whitespace \s* and then the end of the line $. So you have:
.*[^;]\s*$
Now if you don't want whitespace at the beginning you need to ask for the beginning of the line ^ followed by any character that isn't whitespace [^\s] followed by the regex from earlier:
^[^\s].*[^;]\s*$
If you don't want it to start with a keyword like package or, say, class, or whitespace, you can assert that the line does not start with any of those three things. The regex that matches any of those three things is (?:\s|package|class), and the negative lookahead (?!\s|package|class) asserts that none of them comes next, without consuming any characters. Note the !. So you now have:
^(?!\s|package|class).*[^;]\s*$
Try this:
^\s*(?!package|public|class|//|[{}]).*(?<!;\s*)$
When tested in PowerShell:
PS> (gc file.txt) -match '^\s*(?!package|public|class|//|[{}]).*(?<!;\s*)$'
var i = 0
PS>
The key to capturing this complicated concept in a regex is to first understand how your regular expression engine/interpreter handles the following concepts:
positive lookahead
negative lookahead
positive lookbehind
negative lookbehind
Then you can begin to understand how to capture what you want, but only in such cases where what's ahead and what's behind is exactly as you specify.
str.scan(/^\s*(?=\S)(?!package.+\n|public.+\n|\/\/|\{|\})(.+)(?<!;)\s*$/)
This is the regular expression line I'm using to highlight lines of Java code that don't end in semicolon and aren't one of the lines in java that aren't supposed to have a semicolon at the end... using vim's regular expression engine.
\(.\+[^; ]$\)\(^.*public.*\|.*//.*\|.*interface.*\|.*for.*\|.*class.*\|.*try.*\|^\s*if\s\+.*\|.*private.*\|.*new.*\|.*else.*\|.*while.*\|.*protected.*$\)\#<!
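The three parts, from left to right:
1. The first group, \(.\+[^; ]$\): a group of at least one character preceding a missing semicolon
2. The second group, the long alternation of keywords: but not where such matches are preceded by these keywords
3. The trailing \#<!: the negative lookbehind feature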
Mnemonics for deciphering glyphs:
^ beginning of line
.* Any amount of any char
+ at least one
[^ ... ] everything but
$ end of line
\( ... \) group
\| delimiter
\#<! negative lookbehind
Which roughly translates to:
Find me all lines that don't end in a semicolon and don't have any of the above keywords/expressions to the left of it. It's not perfect and probably doesn't hold up to obfuscated java, but for simple java programs it highlights the lines that should have semicolons at the end, but don't.
Helpful link that helped me get the concepts I needed:
https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/
For just lines that don't end in a semicolon, this is simpler:
.*[^;]$
If you don't want lines starting with whitespace and ending with semicolon:
^[^ ].*[^;]$
You are trying to match lines that possibly begin with whitespace ^\s*, then don't have a particular set of words, for example (?!package|class), then have anything .* but then don't end in a semicolon (or a semicolon with whitespace after it) [^;]\s*.
^\s*(?!package|class).*?[^;]\s*$
Note that I added parentheses around a section of the regex.
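The ideas in this thread can be combined into a short Python sketch. It is not any one answer's exact pattern: Python's re module does not support the variable-length lookbehind (?<!;\s*) used in the PowerShell answer, so each line is stripped first and a plain [^;]$ is used instead. The function name and the skip list are illustrative assumptions.

```python
import re

# After stripping, a line is reported if it does not start with one of the
# skipped tokens and its last character is not a semicolon.
CANDIDATE = re.compile(r"(?!package|public|class|//|[{}]).*[^;]$")

def lines_missing_semicolon(source):
    """Return the stripped lines that look like statements without a ';'."""
    result = []
    for line in source.splitlines():
        stripped = line.strip()
        if CANDIDATE.match(stripped):
            result.append(stripped)
    return result

sample = """package example
{
    public class Example
    {
        var i = 0
        var j = 1;
        // other functions and stuff
    }
}"""

print(lines_missing_semicolon(sample))
```

Stripping each line first sidesteps a backtracking pitfall: with a leading \s* in the pattern, the engine can give back one space so that the lookahead sees " public" rather than "public" and lets the line through.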

RegEx: Grabbing values between quotation marks

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
(["']) match a quote; ((?=(\\?))\2.) if a backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, so as not to eat the closing quote); \1 match the same quote that was used for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
Let's see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise or aesthetic, but to be efficient.
These ways use first-character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously to deal with strings that haven't balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can choose too that a quoted part can be an opening quote until the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers, you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
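As an illustration of the unrolled-loop content subpattern, here is a minimal Python sketch. It uses a plain two-branch alternation rather than the first-character trick described in this answer, and re.DOTALL so that an escaped newline is accepted; the names and the sample text are made up.

```python
import re

# Unrolled loop: a run of ordinary characters, then (escape + ordinary run)*.
QUOTED = re.compile(
    r'"[^"\\]*(?:\\.[^"\\]*)*"'
    r"|'[^'\\]*(?:\\.[^'\\]*)*'",
    re.DOTALL,
)

text = r'say "a \"quoted\" word" and ' + r"'it\'s'"
matches = QUOTED.findall(text)
print(matches)
```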
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation instead, but with ["'] factored out at the beginning.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported, you can easily switch this mode on for the whole pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of possible internal engine optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character to check for a quote and, if one is found, the match starts from there; the lookahead then checks the character ahead for a quote and, if one is found, the match stops at that character. The lookbehind group (the ["']) is wrapped in parentheses to create a group for whichever quote was found at the start; this is then used in the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string from $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
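A quick way to try the grouped version is a few lines of Python; this is a sketch that assumes Python's re module, where $2 corresponds to m.group(2). The helper name is hypothetical.

```python
import re

# Group 1 is the opening quote, group 2 the content between the quotes.
QUOTED = re.compile(r"""(['"])((?:(?!\1|\\).|\\.)*)\1""")

def quoted_values(text):
    """Return the content of every quoted string in text, escapes left as-is."""
    return [m.group(2) for m in QUOTED.finditer(text)]

print(quoted_values('"Foo Bar" "Another Value" something else'))
print(quoted_values(r'say "a \"b\" c" or ' + r"'it\'s'"))
```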
A very late answer, but I'd still like to add one:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad but could be better). Mine below is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR:
Replace the word icon with what you're looking for in said quotes and voilà!
The way this works is that it looks for the keyword and doesn't care what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters that's not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
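The steps above can be tried with a short Python sketch; the sample string is made up, and the keyword icon is hard-coded as in the answer.

```python
import re

html = 'id="fb-icon" id="icon-close" id="plain"'

# Quote, lazy run of non-quotes, the keyword, another run of non-quotes, quote.
matches = re.findall(r'"([^"]*?icon[^"]*?)"', html)
print(matches)
```

Because [^"] can never cross a quote, a value without the keyword (here "plain") is simply skipped.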
I liked Axeman's more expansive version, but had some trouble with it. It didn't match, for example,
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
just try this out , works like a charm !!!
\ indicates an escape character
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy approach to continue matching everything zero or more times until it encounters ' or " at end of the string. After encountering such state, regex engine backtrack to previous matching character and here regex is over and will move to next regex.
\1 -> Matches to the character or string that have been matched earlier with the first capture group.
(?![^\s]) -> Negative lookahead to ensure there is no non-space character after the previous match
Unlike Adam's answer, I have a simple but working one:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answers above are good... except that they DO NOT support all Unicode characters in ECMAScript (JavaScript)!
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit
e.g. "test" could not match for "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg % needle, haystack, re.IGNORECASE):
    print("winning...")
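A self-contained Python 3 sketch of the same exact-match idea follows. The function name is hypothetical, and the re.escape call is an addition that is not in the snippet above; it is there so that a needle containing regex metacharacters is treated literally.

```python
import re

def contains_quoted(needle, haystack):
    """True if needle appears as a complete quoted value in haystack."""
    # Same opening and closing quote (\1), nothing else allowed between them.
    pattern = r"""(['"])%s\1""" % re.escape(needle)
    return re.search(pattern, haystack, re.IGNORECASE) is not None

print(contains_quoted("test", 'x = "test"'))
print(contains_quoted("test", 'x = "test2"'))
```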
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for the subset of Microsoft VBA coders only: one uses the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub