Skip quotes in regex - regex

I have the following regex:
val=\"(?<val>.*?)\"
It works fine for val="value".
Now I need a regex that matches both val="value" and val=value.
Could you please help? I don't understand how to build such a regex. I tried the following, without success:
val=[^"](?<val>.*?)[^"]
Update:
This seems to work, but I'm not sure it is correct:
val=(?:[^"])*(?<val>.*?)(?:[^"]|")*

You can capture the optional opening quote, and require it to be present at the end of the match.
val=(\"?)(?<val>.*?)\1
The back-reference \1 recalls the text which matched the first parenthesized expression.
If you have code which depends on the order of the parenthesized groups, note that val is now the second group; but you are probably referring to it by name already (otherwise why use a named group?).
The expression [^"] matches exactly one character which isn't a quote, so it cannot be used to make a quote optional.
When there aren't any quotes, the expression .*? will match the empty string unless some trailing context forces it to match something longer. Perhaps you can use something like
val=(\"?)(?<val>.*?)\1(\s|$)
but this will obviously depend on what exactly you are hoping to match and in what context. If not this then maybe you can constrain the value so that you can use a greedy match instead? For instance,
val=(\"?)(?<val>[^\"]*)\1
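Here is a minimal sketch of the optional-quote idea in Python, assuming Python's re module, which spells named groups (?P<val>...) instead of (?<val>...). The helper name extract_val and the sample strings are made up for illustration; the pattern is the variant with trailing context (whitespace or end of input).

```python
import re

# Group 1 captures the optional opening quote; \1 then requires the same
# text (a quote, or nothing at all) right after the value.
# (?P<val>...) is Python's syntax for the named group (?<val>...).
PATTERN = re.compile(r'val=("?)(?P<val>.*?)\1(?:\s|$)')


def extract_val(text):
    """Return the value after val=, with or without surrounding quotes."""
    match = PATTERN.search(text)
    return match.group("val") if match else None


print(extract_val('val="value"'))
print(extract_val('val=value'))
print(extract_val('val="two words" rest'))
```

The trailing (?:\s|$) is what stops the lazy .*? from settling for an empty value in the unquoted case.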

Related

Regex if then else confusion

I have a problem with the Regex-If-then-else logic:
I am trying to achieve the following:
If the string contains the substring PubDSK then do the Regex Expression
^[\s\S]{24}(?=.{10}([\s\S]*))0*(.*?)(?=\1)[\s\S]*
If it does NOT contain the substring PubDSK then do a different Regex Expression, namely ^[\s\S]{48}(?=.{10}([\s\S]*))0*(.*?)(?=\1)[\s\S]*
I am using this Regex Expression (?(?=^.*PubDSK.*$)^[\s\S]{24}(?=.{10}([\s\S]*))0*(.*?)(?=\1)[\s\S]*|^[\s\S]{48}(?=.{10}([\s\S]*))0*(.*?)(?=\1)[\s\S]*)
The affirmative case works great: https://regex101.com/r/ab9yOv/
But the negative case doesn't do the trick: https://regex101.com/r/azxGvh/1
I assume it doesn't match, so it cannot do the replacement? How can I tell the regex to do the replacement on the complete string in the ELSE case?
I understand, that this problem can be easily solved with any other programming language, but for this use case I can only use pure regex...
The second \1 backreference refers to the first capturing group of the entire regex. So, it does not refer to the right capturing group defined in the else pattern part. In fact, the second \1 must be replaced with \3 as it refers to the third capturing group.
Also, note that the (?=\1) and (?=\3) lookaheads make little sense here, as they are followed by the consuming pattern [\s\S]*. Just remove the lookaheads and use consuming patterns instead.
The fixed pattern looks like
(?(?=^.*PubDSK.*$)^[\s\S]{24}(?=.{10}([\s\S]*))0*(.*?)\1[\s\S]*|^[\s\S]{48}(?=.{10}([\s\S]*))0*(.*?)\3[\s\S]*)
See the regex demo.
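The group-numbering point can be checked in isolation. The sketch below uses Python's re module, which as far as I know does not support lookahead conditionals like (?(?=...)...|...), so it uses a plain alternation with two toy branches instead (the branch contents are invented for illustration). The numbering rule is the same: groups are counted by opening parenthesis across the whole pattern.

```python
import re

# Two branches with two capturing groups each. The second branch owns
# groups 3 and 4, so its backreference must be \3, not \1.
fixed = re.compile(r'(a+)(b+)\1|(c+)(d+)\3')
wrong = re.compile(r'(a+)(b+)\1|(c+)(d+)\1')

print(fixed.groups)                      # total number of capturing groups
print(fixed.fullmatch('cddc').groups())  # groups 1 and 2 are unset here
print(wrong.fullmatch('cddc'))           # \1 refers to a group that never matched
```

In Python a backreference to a group that did not take part in the match simply fails, which mirrors the symptom in the question: the ELSE branch never matches.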

non-greedy search for redundant values in string

Basically I have this string and I want to get only a distinct image filename.
/mPastedImg_Time1469244713469.png&gtxResourceFileName=mPastedImg_Time1469244713469.png&amp
I have this regex code but it does not seem to work.
[^\/]*?_Time[0-9]{13}\.\w{3,4}\&
My expected output is:
mPastedImg_Time1469244713469.png
But the actual output is:
mPastedImg_Time1469244713469.png&gtxResourceFileName=mPastedImg_Time1469244713469.png&
To find the unique filename in a string, you can use this regex,
([^\/&= ]+_Time[0-9]{13}\.\w{3,4})(?!.*\1)
Here, ([^\/&= ]+_Time[0-9]{13}\.\w{3,4}) captures the filename you require, and the negative lookahead (?!.*\1) rejects any occurrence that is repeated later in the string, so only the last occurrence of each filename is matched and duplicates are dropped. Also, because the negated character class accepts anything other than /, &, = and space, it matches filenames containing Chinese characters too, which you also wanted to capture.
Your pattern produces two matches, and the second one is longer than you expect because the negated character class [^\/] matches any character except a forward slash, including & and =.
What you might do is make the first character class more restrictive, specifying what you allow (for example [a-zA-Z]), and take only the first match rather than using a global match:
[a-zA-Z]*_Time[0-9]{13}\.\w{3,4}
Note that you don't have to match the ampersand at the end of the pattern.
I think you were quite close, but you made it more complex than it needs to be.
If you know that the name will always start with mPastedImg_Time, then use that to the fullest.
What about simply doing it like this:
mPastedImg_Time[0-9]{13}\.\w{3,4}
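Two of the patterns suggested above can be tried quickly in Python. This is only a sketch, assuming Python's re module; the variable names are made up, and the input is the string from the question.

```python
import re

url = ('/mPastedImg_Time1469244713469.png'
       '&gtxResourceFileName=mPastedImg_Time1469244713469.png&amp')

# Keep a filename only if the same text does not occur again later on,
# so each distinct name is reported once (at its last occurrence).
distinct = re.findall(r'([^/&= ]+_Time[0-9]{13}\.\w{3,4})(?!.*\1)', url)
print(distinct)

# Simpler alternative: restrict the name to ASCII letters and stop at the
# first hit instead of collecting every match.
first = re.search(r'[a-zA-Z]*_Time[0-9]{13}\.\w{3,4}', url)
print(first.group(0))
```

The second form is easier to read but assumes the part before _Time contains only ASCII letters.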

Regex - Find everything between OR-operators except OR between quotes

I need some help with a regex. I have a query that should be split at every OR operator. But if the OR is inside quotes, it should not be split there.
Example:
This is the query:
"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"
Expression 1: I need everything between the OR-operators or start/end of line... (This is not working)
(^|OR).*?(OR|$)
Expression 2: ...except of the ORs between quotes:
"(.*?)"
The result should be:
"test1"
"test2.1 OR test2.2"
test3
test4:"test4.1 OR test4.2"
How can I make the first expression work and how can I combine these both expressions?
Thank you for help!
It's unclear what the grammar of your expression is, so I just make a bunch of assumptions and come up with this regex to match the tokens between OR:
\G(\w+(?::"[^"]*")?|"[^"]*")(?:(\s+OR\s+)|\s*$)
Demo at regex101
I assume that between OR, it can be an identifier \w+, an identifier with some string \w+:"[^"]*", or a string literal "[^"]*".
Feel free to substitute your own definition of string literal - I'm using the simplest (and broken) specification "[^"]*" as example.
In every match, the regex starts from where the last match left off (or the beginning of the string) and matches one token (as described above), followed by OR or the end of the input string.
The capturing group around \s+OR\s+ is deliberate: you will need it to check whether the last match actually ends at the end of the string, or whether the input is malformed.
Caveat
Do note that while my solution produces the expected result for this case, without a full specification of the grammar of the expression, it's not possible to cater for all possible cases you may want to handle.
(?:^|OR(?=(?:[^"]*"[^"]*")*+[^"]*$))([\s\S]*?)(?=OR(?=(?:[^"]*"[^"]*")*+[^"]*$)|$)
You can use this and read the captured groups. See the demo:
https://regex101.com/r/xC4rJ3/12
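The core of that pattern is the lookahead that accepts an OR only when an even number of quotes follows it. Below is a simplified sketch of the same idea in Python, not the exact regex above: it drops the possessive quantifier (Python only supports those from 3.11 on) and uses re.split instead of capturing the pieces. The name OUTSIDE_QUOTES is made up.

```python
import re

query = '"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"'

# An OR counts as a separator only if the rest of the string consists of
# complete "..." pairs, i.e. an even number of quotes follows it.
OUTSIDE_QUOTES = r'(?=(?:[^"]*"[^"]*")*[^"]*$)'
parts = re.split(r'\s+OR\s+' + OUTSIDE_QUOTES, query)

for part in parts:
    print(part)
```

This assumes quotes are always balanced and never escaped; with unbalanced quotes the lookahead miscounts.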
Try to match everything in quotes or not-OR with:
(?:"[^"]+"|\b(?:(?!\bOR\b)[^"])+)+
This regex works for your example (though it may be subject to improvement with a more detailed specification):
(?<!\S)(?!OR\s)[^\s"]*(?:"[^"]*"[^\s"]*)*
(?<!\S) ensures the match starts at the beginning of the string or after a whitespace character.
(?!OR\s) prevents it from matching OR
[^\s"]*(?:"[^"]*"[^\s"]*)* matches a contiguous series of, in any order:
sequences of non-whitespace, non-quote characters, or
a pair of quotes enclosing anything except quotes.
However, I notice that all the tokens in your example consist of:
a non-quote, non-whitespace sequence (NQ),
a quoted sequence (Q), or
an NQ followed immediately by a Q.
If you expect all tokens to match that pattern, you can change the regex to this:
(?<!\S)(?!OR\s)(?:[^\s"]*"[^"]*"|[^\s"]+)
According to Regex101, it's slightly more efficient (but probably not enough to matter).
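For what it's worth, the stricter pattern can be checked against the query from the question with a few lines of Python. This is a sketch assuming Python's re module; TOKEN is just an illustrative name.

```python
import re

query = '"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"'

# (?<!\S)   : start of string or right after whitespace
# (?!OR\s)  : the token itself must not be the OR operator
# then either an optional prefix followed by a quoted part, or a bare word
TOKEN = re.compile(r'(?<!\S)(?!OR\s)(?:[^\s"]*"[^"]*"|[^\s"]+)')

tokens = TOKEN.findall(query)
for token in tokens:
    print(token)
```

Because the pattern has no capturing groups, findall returns the full text of each token.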

Regular expression to find specific string and add characters when they're not already there in notepad++

Okay, I have zero knowledge of regular expressions so if someone can direct me to a better way to figure this out then by all means please do.
I figured out that a series of files are missing a particular naming convention for the database they will write to. So some might be dbname1, dbname2, dbname3, abcdbname4, abcdbname5 and they all need to have that abc in the beginning. I want to write a regular expression that will find all tags in the file whose name does not already start with abc and add abc to them. Any ideas how I can do this?
Again, forgive me if this is poorly worded/expressed. I really have absolutely zero knowledge of regular expressions. I can't find any questions that are asking this. I know that there are questions asking how to add strings to lines but not how to add only to lines that are missing the string when some already have it.
I thought I had written this in but I'm looking at lines that look like this
<Name>dbname</Name>
or
<Name>abcdbname</Name>
and I need to get them all to have that abc at the beginning
Cameron's answer will work, but so will this. It's called a negative lookbehind.
(?<!abc)(dbname\d+)
This regex looks for dbname followed by one or more digits and not prefixed by abc. So it will capture dbname113 but not abcdbname113.
In other words, it finds any occurrence of dbname that is not immediately preceded by the string "abc". The original name is in capture group \1, so you can replace each match with abc\1 and all your names will be properly prefixed.
Not every program or language that implements regex supports lookbehinds (JavaScript, famously, did not until ES2018), but most do, and Notepad++ certainly does. Lookarounds (lookbehinds and lookaheads) are exceedingly handy once you get the hang of them.
?<! (negative lookbehind), ?<= (positive lookbehind), ?! (negative lookahead), and ?= (positive lookahead) must all be used within parentheses, as I did above, but they are not used for capturing, so they do not create capture groups. That is why the second set of parentheses can be referenced as \1 (or $1, depending on the language).
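The same find-and-replace can be tried outside Notepad++. A minimal sketch in Python, assuming its re module (which also supports fixed-width lookbehind); the sample text is invented for illustration.

```python
import re

text = ("<Name>dbname1</Name>\n"
        "<Name>abcdbname4</Name>\n"
        "dbname2 abcdbname5")

# (?<!abc) rejects any dbname that already has abc right before it.
# The lookbehind creates no group, so (dbname\d+) is group 1.
fixed = re.sub(r'(?<!abc)(dbname\d+)', r'abc\1', text)
print(fixed)
```

Names that already carry the prefix are left untouched, so running the replacement twice changes nothing.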
Edit: Given some better example criteria, this is possibly more what you're looking for.
Find: (<Name>)(.*?(?<!abc)dbname\d+)(</Name>)
Replace: \1abc\2\3
Alternatively, something a bit easier to understand, you can do this or something like this:
Find: (<Name>)(abc)?(dbname\d+)(</Name>)
Replace: \1abc\3\4
What this does is:
Matches <Name> and captures it as backreference 1.
Looks for abc and, if it's there, captures it as backreference 2; otherwise 2 contains nothing. The ? after (abc) means match 0 or 1 times.
Looks for dbname followed by digits and captures it as backreference 3.
Matches </Name> and captures it as backreference 4.
By replacing with \1abc\3\4, you drop the existing abc (if there is one) and write abc in front of dbname in every instance.
You can take this a step further and prefix the abc with ?: to create a non-capturing group, so the backreferences used in the replacement are sequential:
Find: (<Name>)(?:abc)?(dbname\d+)(</Name>)
Replace: \1abc\2\3
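A quick way to sanity-check the non-capturing variant is to run it through Python's re.sub, which uses the same \1-style backreferences in the replacement. This is only a sketch; the sample input is made up.

```python
import re

xml = "<Name>dbname1</Name><Name>abcdbname4</Name>"

# (?:abc)? consumes an existing prefix without creating a group, so the
# remaining groups are numbered 1, 2, 3 and the replacement re-adds abc.
fixed = re.sub(r'(<Name>)(?:abc)?(dbname\d+)(</Name>)', r'\1abc\2\3', xml)
print(fixed)
```

Both entries end up with exactly one abc prefix, whether or not they had one before.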
Replace \bdbname(\d+) with abcdbname\1.
The \b means "word boundary", so it won't match the abc versions, but will match the others. The (...) parentheses represent a capturing group, which captures everything matched in between into a numbered variable that can be referenced later (there's only one here, so it goes in \1). The \d+ matches one or more digit characters.
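A small sketch of the word-boundary approach in Python, assuming its re module treats \b the same way here (a boundary between a word character and a non-word character); the sample names are invented.

```python
import re

names = "dbname1 abcdbname4 <Name>dbname7</Name>"

# There is no word boundary between the c of abc and the d of dbname,
# so names that already have the prefix are skipped.
fixed = re.sub(r'\bdbname(\d+)', r'abcdbname\1', names)
print(fixed)
```

Note that this relies on whatever precedes dbname being a non-word character such as a space or >.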

Why do I get successful but empty regex matches?

I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?
The regex .* can match successfully a string of zero characters, or the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
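So if your regex was /f(.*)\1/, it would match just the "f" in "fab", with the group capturing the empty string between the 'f' and the 'a'. (On "foo" the group can capture 'o' and the backreference then matches the second 'o', so the whole of "foo" matches.)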
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+ you should match the 'oo' in 'foo')
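The empty match is easy to reproduce. The sketch below uses Python's re module rather than POSIX regexec(), so treat it as an illustration of the matching logic only, not of the regmatch_t contents.

```python
import re

# (.*) may capture nothing, and \1 then repeats that nothing:
# the overall match is the empty string at position 0.
empty = re.search(r'(.*)\1', 'blabl')
print(empty.span(), repr(empty.group(1)))

# With .+ the group must contain at least one character, and 'blabl'
# has no immediately repeated substring, so there is no match at all.
print(re.search(r'(.+)\1', 'blabl'))

# 'foo' does contain a repeat: 'o' followed by 'o'.
doubled = re.search(r'(.+)\1', 'foo')
print(doubled.group(0))
```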
\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.
\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/
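The two variants behave like this in Python (again a sketch with Python's re, not regexec(); the sample strings are the ones from the answer above).

```python
import re

# Immediate repeat: the captured text must occur again right away.
immediate = re.search(r'(.+)\1', 'BLABLA')
print(immediate.group(1))

# Repeat later on: .* allows arbitrary text between the two copies.
later = re.search(r'(.+).*\1', 'BLAahemBLA')
print(later.group(1), later.group(0))

# The immediate form finds nothing when other text sits in between.
print(re.search(r'(.+)\1', 'BLAahemBLA'))
```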