Regex to match text between single, double and triple quotes - regex

I have a text file that I want to parse strings from. The thing is that there are strings enclosed in either single ('), double (") or 3x single (''') quotes within the exact same file. The best result I was able to get so far is to use this:
((?<=["])(.*?)(?=["]))|((?<=['])(.*?)(?=[']))
to match only single-line strings between single and double quotes. Please note that the strings in the file are enclosed in each type of quotes can be either single- or multi-line and that each type of string repeats several times within the file.
Here's a sample string:
<thisisthefirststring
'''- This is the first line of text
- This is the second line of text
- This is the third line of text
'''
>
<thisisanotheroption
"Just a string between quotes"
>
<thisisalsopossible
'Single quotes
Multiple lines.
With blank lines in between
'
>
<lineBreaksDoubleQoutes
"This is the first sentence here
After the first sentence, comes the blank line, and then the second one."
>

Use this:
((?:'|"){1,3})([^'"]+)\1
Test it online
Using the group reference \1, you can simplify your work
Also, to get only what is inside of the quotes, use the 2nd group of the match

This regex: ('{3}|["']{1})([^'"][\s\S]+?)\1
does what you want.
Some results:

Using Notepad++, you can use: ('''|'|")((?:(?!\1).)+)\1
Explanation:
('''|'|") : group 1, all types of quote
( : group 2
(?:(?!\1).)+ : any thing that is not the quote in group 1
) : end group 2
\1 : back reference to group 1 (i.e. same quote as the beginning)
Here is a screen capture of the result.

Here's something that may work for you.
^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$
Replace the triple double quotes with triple single quotes. See it in action at regex101.com.

Named Group Version
Avoids problems when used in larger expressions by explicitly referring to the name of the group storing the last found quote.
Should work for most systems:
(?<Qt>'''|'|")(.*?)\k<Qt>
.NET version:
(?<Qt>'''|'|"")(.*?)\k<Qt>
Works as follows:
'''|'|": Check first for ''', then ', and finally ". Done in this order so ''' has priority over '.
(?<Qt>'''|'|""): When matched, place the match in <Qt> for later use.
(.*?): Capture the results of a lazy search for 0 or more of anything .*? - will return empty strings. To prevent empty strings from being returned, change to a lazy search for 1 or more of anything .+?.
\k<Qt>: Search for the value last stored in <Qt>.

Related

find value for single quoted or double quoted id attribute regex

I am trying to write a script which would crawl through the file tree and collect all the id attribute values (id="value"). I want to collect and list such values using regex. Here's what I came up with:
id=\'(.*?)\' - for singe quotes
and
id=\"(.*?)\" - for double quotes
What I'm trying to figure out is how could I merge these two into one, so it would find values wrapped in either single or double quotes.
Use a backreference:
id=(["'])(.*?)\1
Now captured group 2 ((.*?)) i.e. \2 would have your desired value.
(["']) matches any of " or ' and put in captured group 1, \1 at the end make sure we are looking for the same token as the first captured group
Demo
This is a very simple solution, example here.
id=\'([^']*)\'|id=\"([^"]*)\"
Instead of (.*) looking for anything, the ([^']*) looks for anything apart from an apostrophe.

Regex - Match first occurrence within string, do not return anything before it

I'm doing a search and replace in Notepad++ and am looking for a regex that will literally give me the first ( in a given string, so I can replace it.
I am not interested in any preceding or succeeding characters, literally just the first (.
An example string is:
"starLan(11), -- Deprecated via RFC3635 ethernetCsmacd (6) should be used instead
I'd like to find the first ( (near starLan(11) in this case) so I can replace that character with something else.
It should not match any other ( in the same line, so in this case it should not match the second ( near (6).
All of the examples I've come across seem to be returning everything up to and including the given character, which is not what I'm after in this case.
I would match the following pattern :
^([^(]*)\((.*)$
And replace it with this :
\1X\2
Where X is the text you want to replace your ( with.
It uses back-references to refer to the parts before and after the first (.
Edit : as mentioned by OP, matching ^([^(]*)\( and replacing with \1X is enough.
you can use this
^(.*?)\(
the text captured inside () will be available in back reference $1. so you can replace it like:
$1someText
where someText is the text you want to put in place of removed '('
if you want the text after removed '(' to remain intact as well, you can use:
^(.*?)\((.*)
and replacement as:
$1someText$2

Regex Group not starting with

I'm having trouble to compute 2 regex in one (used to deal with .ini files)
I've got this one (I suggest you to use rubular with theses examples to understand)
^(?<key>[^=;\r\n]+)=((?<value>\"*.*;*.*\"[^;\r\n]*);?(?<comment>.*)[^\r\n]*)
to match :
This="isnot;acomment"
This="isa";comment
This="isa;special";case
And I've got this one :
^(?<key>[^=;\r\n]+)=(?<value>[^;\r\n]*);?(?<comment>[^\r\n]*)
to match
This=isasimplecase
This=isasimple;comment
And I'm trying to merge the 2 regex, sadly I do not manage to say "If my value group is not starting with \" use the second one if not use the first one".
Right now i've got this :
^(?<key>[^=;\r\n]+)=(((?<value>\"*.*;*.*\"[^;\r\n]*);?(?<comment>.*)[^\r\n]*)|(?<value>[^;\r\n]*);?(?<comment>[^\r\n]*))
But it's creating 2 more sections unnamed for the simple case without quoted. I was thinking that maybe by adding "the first item of the value group for the simple case must not start with \". But I didn't manage to do it.
PS : I suggest you to use rubular to understand better my problem. Sorry if I wasn't clear enough
How about this?
^(?<key>[^=;\r\n]+)=(?<value>"[^"]*"|[^;\n\r]*);?(?<comment>.*)
DEMO
(?<key>[^=;\r\n]+) Matches the part before the = symbol.
"[^"]*" Matches the string within the double quotes , ex strings like "foobar". If there is no " then the regex engine move on to the next pattern that is [^;\n\r]* and it matches upto the first ; or newline or \r character. These matched characters are stored into a named group called value.
;? Optional semicolon.
(?<comment>.*) Remaining characters are stored into the comment group.

find a single quote at the end of a line starting with "mySqlQueryToArray"

I'm trying to use regex to find single quotes (so I can turn them all into double quotes) anywhere in a line that starts with mySqlQueryToArray (a function that makes a query to a SQL DB). I'm doing the regex in Sublime Text 3 which I'm pretty sure uses Perl Regex. I would like to have my regex match with every single quote in a line so for example I might have the line:
mySqlQueryToArray($con, "SELECT * FROM Template WHERE Name='$name'");
I want the regex to match in that line both of the quotes around $name but no other characters in that line. I've been trying to use (?<=mySqlQueryToArray.*)' but it tells me that the look behind assertion is invalid. I also tried (?<=mySqlQueryToArray)(?<=.*)' but that's also invalid. Can someone guide me to a regex that will accomplish what I need?
To find any number of single quotes in a line starting with your keyword you can use the \G anchor ("end of last match") by replacing:
(^\h*mySqlQueryToArray|(?!^)\G)([^\n\r']*)'
With \1\2<replacement>: see demo here.
Explanation
( ^\h*mySqlQueryToArray # beginning of line: check the keyword is here
| (?!^)\G ) # if not at the BOL, check we did match sth on this line
( [^\n\r']* ) ' # capture everything until the next single quote
The general idea is to match everything until the next single quote with ([^\n\r']*)' in order to replace it with \2<replacement>, but do so only if this everything is:
right after the beginning keyword (^mySqlQueryToArray), or
after the end of the last match ((?!^)\G): in that case we know we have the keyword and are on a relevant line.
\h* accounts for any started indent, as suggested by Xælias (\h being shortcut for any kind of horizontal whitespace).
https://stackoverflow.com/a/25331428/3933728 is a better answer.
I'm not good enough with RegEx nor ST to do this in one step. But I can do it in two:
1/ Search for all mySqlQueryToArray strings
Open the search panel: ⌘F or Find->Find...
Make sure you have the Regex (.* ) button selected (bottom left) and the wrap selector (all other should be off)
Search for: ^\s*mySqlQueryToArray.*$
^ beginning of line
\s* any indentation
mySqlQueryToArray your call
.* whatever is behind
$ end of line
Click on Find All
This will select every occurrence of what you want to modify.
2/ Enter the replace mode
⌥⌘F or Find->Replace...
This time, make sure that wrap, Regex AND In selection are active .
Them search for '([^']*)' and replace with "\1".
' are your single quotes
(...) si the capturing block, referenced by \1 in the replace field
[^']* is for any character that is not a single quote, repeated
Then hit Replace All
I know this is a little more complex that the other answer, but this one tackles cases where your line would contain several single-quoted string. Like this:
mySqlQueryToArray($con, "SELECT * FROM Template WHERE Name='$name' and Value='1234'");
If this is too much, I guess something like find: (?<=mySqlQueryToArray)(.*?)'([^']*)'(.*?) and replace it with \1"\2"\3 will be enough.
You can use a regex like this:
(mySqlQueryToArray.*?)'(.*?)'(.*)
Working demo
Check the substitution section.
You can use \K, see this regex:
mySqlQueryToArray[^']*\K'(.*?)'
Here is a regex demo.

PCRE regex replace a text pattern within double quotes

In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.
It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turn your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Now naturally the search an replace will fail if the input varies from that pattern described as the search is looking for those four groups of a's inside the two groups of quoted strings.
Also I tested that regexp using Zeus which can create a regexp with more than 9 groups.
As you can see the regexp requires 10 groups.
I'm not familar with Notpad++ so I don't know if it supports that many groups.
If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.
Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working:) Basically it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This will leave a trailing \t at the end of each row, which effectively sets these EOL's apart from ones that might exist with a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)