Regex match the space between matches as well? - regex

I'm not new to regex (or SO), but I can't seem to find a solid solution for matching the leftover spaces between matches.
For instance, I want to know what is inside quotes, and what is not, and do things to both.
Getting quotes is easy: (\".+?\"|'.+?') = quoteMatch
but making another match group to select everything else is not.
The closest I've gotten is quoteMatch+'|(.)'. This separates my quote groups from my everything-else groups, but it doesn't group the "else" characters together: each one is matched on its own.
Trying quoteMatch+'|(.+)' selects everything together and quoteMatch+'|(.+?)' puts me back a step.
I imagine I need to find a way to make the first match more greedy than the second, but anything I do to make it greedy makes it take over multiple quotes and the things in between (i.e. match = "quote1" things in between "quote2").
I've also looked into using the split function, but it doesn't return what it split on, and it is not quite as elegant a solution as I imagine must exist.
Thank you for any help.

Move the pattern for the other characters inside the capturing group as a third alternative:
(\".+?\"|'.+?'|.+?(?=["']|$))
The positive lookahead (?=["']|$) makes the lazy .+? stop just before the next quote or at the end of the line.
In doing so, an input of:
before quotes "quote1" in between quotes "quote2" after quotes
Would return:
(before quotes ), ("quote1"), ( in between quotes ), ("quote2"), ( after quotes)
As a side note, you can also combine the first two alternations by using a backreference to close the quote:
((['"]).+?\2|.+?(?=["']|$))
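As a rough illustration, here is how the backreference version could be used from Python (an assumption, since the question does not name a language). The helper name split_quoted is hypothetical. Group 2 only participates when the quoted alternative matched, so it can double as an "is this a quote?" flag.

```python
import re

# Pattern from the answer above: a quoted run closed by the same quote
# character, or a lazy run that stops before the next quote or end of line.
QUOTED_OR_REST = re.compile(r"""((['"]).+?\2|.+?(?=["']|$))""")

def split_quoted(text):
    """Return (chunk, is_quoted) pairs in order of appearance."""
    return [(m.group(1), m.group(2) is not None)
            for m in QUOTED_OR_REST.finditer(text)]

line = 'before quotes "quote1" in between quotes "quote2" after quotes'
for chunk, is_quoted in split_quoted(line):
    print(repr(chunk), "quoted" if is_quoted else "other")
```

Because every chunk comes back with a flag, the quoted and unquoted parts can be processed differently in a single pass.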

RegEx for matching different types of quotes

I am writing a small scripting 'language' for my game.
I would like to allow every kind of JS string literal quote (`, " and ').
I figured out how to check everything inside those using:
(?<e1>""|'|`)(?:\$\k<e1>|(?!\k<e1>).)*\k<e1>)
It works.
But now I have a different problem: I need to remove all tabs that are not inside those types of quotes.
I looked up here how to match everything that is not inside quotes:
\t(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
I am having trouble combining the two so that the middle tab in "a`\t`" is not removed, which is what this pattern does:
\t(?=([^"'`$]*(\$.|['`"]([^"'`$]*\$.)*[^"'`$]*["`']))*[^"`']*$)
I know I have to check for the last unescaped quote (escaped with $, not \), but how do I do that?
You could match what you don't want and keep what you want using a capturing group.
In this case you could wrap your first pattern in a capturing group and add an alternative after it, using the pipe |, that matches one or more tabs.
In the replacement, use the first capturing group, so that quoted strings are put back unchanged and tabs outside them are dropped:
((?<e1>""|'|`)(?:\$\k<e1>|(?!\k<e1>).)*\k<e1>)|\t+
^                                            ^^^^^
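A minimal sketch of this "match what you want to keep, or what you want to drop" idea in Python, under a few assumptions: Python's re module spells named groups as (?P<q>...) and (?P=q) rather than (?<e1>...) and \k<e1>, the opening quote is taken to be a single ", ' or ` character, and the function name strip_tabs_outside_strings is hypothetical. A replacement function is used so that a match on tabs, where group 1 did not participate, becomes an empty string.

```python
import re

# Group 1: a whole string literal, where "$" escapes the quote character.
# Otherwise: one or more tabs outside any string literal.
STRING_OR_TABS = re.compile(
    r"""((?P<q>["'`])(?:\$(?P=q)|(?!(?P=q)).)*(?P=q))|\t+"""
)

def strip_tabs_outside_strings(source):
    """Keep string literals untouched and delete tabs everywhere else."""
    return STRING_OR_TABS.sub(lambda m: m.group(1) or "", source)

print(repr(strip_tabs_outside_strings('a\t"b\tc"\t\td')))
print(repr(strip_tabs_outside_strings('\t"a`\t`"')))
```

The string literal is consumed as one match, so any tab inside it is never seen by the \t+ alternative.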

Regex Expression to allow comma only inside a string (within quotes)and not outside it

I am fairly new to regex. I am looking for a regular expression to use as a constraint that disallows commas outside a quoted string.
My input looks like:
"1,212121,121212","Extra_data"
Here the expression should not check for commas inside the first quoted value, "1,212121,121212", but should check after the quotes, including ,"Extra_data". In short, the expression should allow a comma only inside quotes and not outside.
Kindly help me with the expression.
I think this is what you're looking for: essentially a group of digits or commas surrounded by quotes, followed by a comma and another phrase (not necessarily digits) in quotes. Capturing group #1 gives you "1,212121,121212" and capturing group #2 gives you ,"Extra_data"
("[\d,]+")(,"[^"]+")
It would be helpful to see more of how your input might come in. I think the biggest remaining question is whether that first group always contains only digits and commas, or whether it sometimes contains other characters such as letters or underscores. If that first group contains only digits and commas, as I've assumed, then this should work. If it doesn't, then this will not work.
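To make the two capturing groups concrete, here is a small Python check of the pattern above against the sample input from the question. Python is an assumption here; the question does not say which regex engine is in use.

```python
import re

sample = '"1,212121,121212","Extra_data"'

# Group 1: quoted digits and commas. Group 2: the comma plus the next quoted value.
match = re.search(r'("[\d,]+")(,"[^"]+")', sample)
print(match.group(1))
print(match.group(2))

# A first value containing letters is not matched, as the answer warns.
print(re.search(r'("[\d,]+")(,"[^"]+")', '"1,abc","Extra_data"'))
```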
Edit:
"\s*(,\s*"[^"]+")
Try this:
".*?(?=,).*?"
It only matches a quoted string that contains a comma.
Try the following regex:
"[^"]*"(,)[^"]*"[^"]*"
It will capture the commas you need. But note that PHP does not support repeated captures of the same group. For example, in your case:
If the input is: "1,212121,121212","Extra_data","hel,lo","a,bc"
It will capture commas before "Extra_data" and "a,bc" but will exclude the comma before "hel,lo". For that you'll have to use recursion.
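The skipped comma can be reproduced with a short scan. This sketch uses Python rather than PHP (an assumption made only for illustration); the behaviour comes from each match consuming two quoted values before the scan resumes.

```python
import re

text = '"1,212121,121212","Extra_data","hel,lo","a,bc"'
pattern = re.compile(r'"[^"]*"(,)[^"]*"[^"]*"')

# Each match consumes a pair of quoted values, so the separator between
# the second and third values is never examined.
captured = [m.start(1) for m in pattern.finditer(text)]
for pos in captured:
    print(pos, text[pos:pos + 8])
```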
You can try using this regex.
(^,)|("\s*,\s*")|(,$)
If you find any match for this regex, then the string will be invalid.

Regex to match one or two quotes but not three in a row

For the life of me I can't figure this one out.
I need to search the following text, matching only the quotes in bold:
Don't match: """This is a python docstring"""
Match: " This is a regular string "
Match: "" ← That is an empty string
How can I do this with a regular expression?
Here's what I've tried:
Doesn't work:
(?!"")"(?<!"")
Close, but doesn't match double quotes.
Doesn't work:
"(?<!""")|(?!"")"(?<!"")|(?!""")"
I naively thought that I could add the alternates that I don't want but the logic ends up reversed. This one matches everything because all quotes match at least one of the alternates.
(Please note: I'm not running the code, so solutions around using __doc__ won't help, I'm just trying to find and replace in my code editor.)
You can use /(?<!")"{1,2}(?!")/
Autopsy:
(?<!") a negative look-behind for the literal ". The match cannot have this character in front
"{1,2} the literal " matched once or twice
(?!") a negative look-ahead for the literal ". The match cannot have this character after
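Your first try probably failed because of where the assertions sit: (?!"") is a negative look-ahead placed before the quote, and (?<!"") is a negative look-behind placed after it, so each one includes the quote being matched in what it tests. Put the look-behind before the match and the look-ahead after it, so that they test the neighbouring characters instead.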
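A quick way to check this pattern against the three lines from the question is a short Python script. Python is used only for illustration, since the question is about a code editor's find and replace; the pattern itself is unchanged apart from dropping the / delimiters.

```python
import re

ONE_OR_TWO_QUOTES = re.compile(r'(?<!")"{1,2}(?!")')

samples = [
    'Don\'t match: """This is a python docstring"""',
    'Match: " This is a regular string "',
    'Match: "" ← That is an empty string',
]

# Collect the matched quote runs for each sample line.
results = [ONE_OR_TWO_QUOTES.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(found, "<-", sample)
```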
I realized that my original problem description was actually slightly wrong. That is, I actually need to match only a single quote character at a time, unless it's part of a group of three quote characters.
The difference is that this is desirable for editing so that I can find and replace with '. If I match "one or two quotes" then I can't automatically replace with a single character.
I came up with this modification to h20000000's answer that satisfies that case:
(?<!"")(?<=(?!""").)"(?!"")
With this pattern, the two quotes in "" are matched individually instead of as a group.
This works very similarly to the other answer, except that it only matches a single ".
That leaves us matching everything we want, except that it still matches the middle quote of a """.
Finally, adding (?<=(?!""").) excludes that case specifically by saying: look back one character, then fail the match if the next three characters are """.
I decided not to change the question because I don't want to hijack the answer, but I think this can be a useful addition.
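For anyone who wants to try the single-quote variant outside an editor, here is a hedged Python sketch. The function name to_single_quotes is hypothetical, and Python's re module is assumed to behave like the editor's engine for this pattern. One caveat that follows from the "look back one character" step: a quote that is the very first character of the text has nothing to look back at, so it is not matched.

```python
import re

# One double quote that is not part of a run of three.
LONE_QUOTE = re.compile(r'(?<!"")(?<=(?!""").)"(?!"")')

def to_single_quotes(line):
    """Replace each matched double quote with a single quote."""
    return LONE_QUOTE.sub("'", line)

print(to_single_quotes('Don\'t match: """This is a python docstring"""'))
print(to_single_quotes('Match: " This is a regular string "'))
print(to_single_quotes('Match: "" is an empty string'))
```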

regex - Removing text from around numbers in Notepad++

I have a large subset of data that looks like this:
MyApp.Whatever\app.config(115): More stuff here, but possibly with numbers or parenthesis...
I'd like to create a replace filter using Notepad++ that would find the line number "(115):" and replace it with a tab character followed by the same number.
I've been trying filters such as (\(\d+\):) and (\(\[0-9]+\):), but they keep returning the entire value in the \1 output.
How would I create a filter using Notepad++ that would successfully replace (115): with tab character + 115?
Use a lazy quantifier: (\(\d+?\):), where the ? prevents it from being greedy. Also, since everything is inside the outer (), it is all captured together as \1.
In Perl I'd write \((\d+?)\): which should capture only the inner part.
Edit:
Just talked with my colleague - he said s/\((\d+)\)/\t\1/ and if you needed app config in front you could just put that in the front.
This should work for your needs.
Replace
\((\d+)\):
with
\t$1
Replacing (\(\d+\):) with \t\1 will keep the parentheses and the colon since you've included them in the group (the outer parentheses), and I think that's what you mean by "they keep returning the entire value."
Instead of escaping those inner parentheses, escape the outer ones like the other answers have suggested: \((\d+)\): - this says to match a left paren, then match and capture a group of digits, then match a right paren and a colon. Replacing that with \t\1 will get rid of the parens and colon that were not in the captured group.
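The difference between the two groupings is easy to see outside Notepad++ as well. The following Python sketch (an illustration only; the sample line is shortened from the question) applies both patterns with the same \t\1 replacement.

```python
import re

line = r"MyApp.Whatever\app.config(115): More stuff here"

# Group wraps the parentheses and colon, so they come back in \1.
kept = re.sub(r"(\(\d+\):)", r"\t\1", line)

# Group wraps only the digits, so the parentheses and colon are dropped.
stripped = re.sub(r"\((\d+)\):", r"\t\1", line)

print(repr(kept))
print(repr(stripped))
```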

Regex to validate company names

I have this RegEx to validate a few things. Unfortunately it won't validate P.C., only P.C. I tried adding {0,1} to each period, but it still will not validate. Any ideas?
(new-line characters for readability)
/(^|\s)Corporation\.{0,1}(^|$)|
(^|\s)Corp\.{0,1}(^|$)|
(^|\s)Inc\.{0,1}(^|$)|
(^|\s)Incorporated\.{0,1}(^|$)|
(^|\s)Company\.{0,1}(^|$)|
(^|\s)(^|$)|
(^|\s)LTD\.{0,1}(^|$)|
(^|\s)PLLC\.{0,1}(^|$)|
(^|\s)P\.{0,1}C\.{0,1}(^|$)/ig;
Here's a simplified version of your regex:
/(?:^|\s)(?:Corporation|Corp|Inc|Incorporated|Company|LTD|PLLC|P\.C)\.?$/ig;
{0,1} can be replaced by ?
Repetition can be eliminated with some grouping.
This doesn't make much sense: (^|$). You are requiring either a beginning of a line or an end of a line to occur right after a match. This is functionally the same as requiring the match to be at the end of the line, so I just replaced it with $.
When you need to group things, use non-capturing groups (?:...) unless you need to grab that part of the match. They are more efficient.
All that being said, your original pattern should have matched P.C. at the end of a line. The problem may be something with your input data or the way you are using the regex.
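As a sanity check of the simplified pattern, here is a Python sketch. It is an approximation: the original is a JavaScript literal with the i and g flags, Python has no g flag, and the company names below are made-up test data. The helper name has_company_suffix is hypothetical.

```python
import re

SUFFIX = re.compile(
    r"(?:^|\s)(?:Corporation|Corp|Inc|Incorporated|Company|LTD|PLLC|P\.C)\.?$",
    re.IGNORECASE,
)

def has_company_suffix(name):
    """True if the name ends with one of the listed suffixes."""
    return SUFFIX.search(name) is not None

for name in ["Smith & Jones P.C.", "Smith & Jones P.C", "Acme Inc.",
             "acme incorporated", "Acme Widgets", "Acme PC"]:
    print(name, "->", has_company_suffix(name))
```

Note that the simplified pattern requires the period between P and C, whereas the original P\.{0,1}C\.{0,1} made both periods optional.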