I'm trying to use a regex to replace each character after a given position (say, 3) with a placeholder character, for an arbitrary-length string (the output length should be the same as that of the input). I think a lookahead (lookbehind?) can do it, but I can't get it to work.
What I have right now is:
regex: /.(?=.{0,2}$)/
input string: 'hello there'
replace string: '_'
current output: 'hello th___' (last 3 substituted)
The output I'm looking for would be 'hel________' (everything but the first 3 substituted).
I'm doing this in TypeScript, to replace some old JavaScript that is using ugly split/concatenate logic. However, I know how to make the regex calls, so the answer should be pretty language agnostic.
If you know the string is longer than given position n, the start-part can be optionally captured
(^.{3})?.
and replaced with e.g. $1_ (the capture of the first group followed by _). This won't work if the string length is <= n.
See this demo at regex101
Another option is to use a lookbehind, where supported, to check that the character is preceded by n characters.
(?<=.{3}).
See other demo at regex101 (replace just with underscore) - String length does not matter here.
Also worth mentioning: in PHP/PCRE the start part can simply be skipped like this: ^.{1,3}(*SKIP)(*F)|.
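A minimal JavaScript sketch of the optional-capture and lookbehind approaches above, assuming an engine with lookbehind support (for example a recent Node.js). The helper names maskWithCapture and maskWithLookbehind and the parameters n and placeholder are illustrative and not part of the original question.

```javascript
// Hypothetical helpers: keep the first n characters, mask the rest.
function maskWithCapture(input, n = 3, placeholder = '_') {
  // (^.{n})?. : optionally capture the first n characters, then match one more.
  const re = new RegExp(`(^.{${n}})?.`, 'g');
  // When the group did not participate, head is undefined and defaults to ''.
  return input.replace(re, (_match, head = '') => head + placeholder);
}

function maskWithLookbehind(input, n = 3, placeholder = '_') {
  // (?<=.{n}). : match any character that has n characters directly before it.
  const re = new RegExp(`(?<=.{${n}}).`, 'g');
  return input.replace(re, () => placeholder);
}

console.log(maskWithCapture('hello there'));
console.log(maskWithLookbehind('hello there'));
console.log(maskWithCapture('hi'));     // short input: every character gets masked
console.log(maskWithLookbehind('hi'));  // short input: left unchanged
```

The last two calls show the difference for inputs no longer than n: the capture variant masks everything, while the lookbehind variant leaves the string alone.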
There are tons of examples to do the conversion from C-style line comment to 1-line block comment. But I need to do the opposite: find a regex to replace multi-line block comment with line comments.
From:
This text must not be touched
/*
This
is
random
text
*/
This text must not be touched
To
This text must not be touched
// This
// is
// random
// text
This text must not be touched
I was thinking if there's a way to represent "each line" concept in regex, then just add // in front of each line. Something like
\/\*\n(?:(.+)\n)+\*\/ -> // $1
But the greedy nature of the regex engine makes $1 match only the last line before */. I know Perl and other languages have some advanced regex features like recursion, but I need to do this in a standard engine. Is there any trick to accomplish this?
EDIT: To clarify, I'm looking for pure regex solution, not involving any programming language. Should be testable on sites like https://regex101.com/.
If you are interested in a single regex pass in the modern JavaScript engine (and other regex engines supporting infinite length patterns in lookbehinds), you can use
/(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n)(?=[\s\S]*?^\*\/)|(?:\r?\n)?(?:^\/\*|^\*\/)/gm
Replace with $1$1, see the regex demo.
Details
(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n) - a positive lookbehind that matches a location that is immediately preceded with
^(\/)\* - /* substring at the start of a line (with / captured into Group 1)
(?:(?!^\/\*)[\s\S])*? - any char, zero or more occurrences, as few as possible, not starting a /* char sequence that appears at the start of a line
\r?\n - a CRLF or LF ending
(?=[\s\S]*?^\*\/) - a positive lookahead that requires any 0 or more chars as few as possible followed with */ at the start of a line, immediately to the right of the current location
| - or
(?:\r?\n)? - an optional CRLF or LF linebreak
(?:^\/\*|^\*\/) - and then either /* or */ at the start of a line.
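A short sketch of running that single-pass pattern in JavaScript, assuming a runtime with lookbehind support. The pattern and the $1$1 replacement are copied from the answer; the variable names are illustrative.

```javascript
// Pattern copied from the answer; requires lookbehind support.
const blockToLine =
  /(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n)(?=[\s\S]*?^\*\/)|(?:\r?\n)?(?:^\/\*|^\*\/)/gm;

const sample = [
  'This text must not be touched',
  '/*',
  'This',
  'is',
  'random',
  'text',
  '*/',
  'This text must not be touched',
].join('\n');

// First alternative: empty match at the start of each comment line,
// Group 1 holds "/", so "$1$1" inserts "//".
// Second alternative: Group 1 is unset, so "$1$1" is empty and the
// delimiter (with its preceding line break) is removed.
const lineCommented = sample.replace(blockToLine, '$1$1');
console.log(lineCommented);
```

Because the replacement is exactly $1$1, the inserted prefix is // with no following space.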
As usual in such cases, two regular expressions—the second applied to the matches of the first—can do what one cannot achieve.
const txt = `This text must not be touched
/*
This
is
random
text
*/
This text must not be touched`;
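const to1line = str => str.replace(
  // First regex: find each block comment and capture its trimmed body (the s flag lets . match newlines).
  /\/\*\s*(.*?)\s*\*\//gs,
  // Second regex: prefix every line of the captured body with //.
  (_, comment) => comment.replace(/^/mg, '//')
);
console.log(to1line(txt));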
My OCD has gotten the better of me and I'm going through my Groovy codebase converting simple double-quoted strings into single-quoted strings.
However, I want to avoid GStrings that actually contain dollar symbols and variables.
I'm using IntelliJ to do the substitution, and the following almost works:
From: "([^$\"\n\r]+)"
To: '$1'
It captures strings without any dollars in, but only partially skips any strings that contain them.
For example it matches the quotes between two double quoted strings in this case:
foo("${var}": "bar")
^^^^
Is it possible to create a regex that would skip a whole string that contained dollars, so in the above case it skips "${var}" and selects "bar", instead of erroneously selecting ": "?
EDIT: Here's a section of code to try against
table.columns.elements.each{ columnName, column ->
def columnText = "${columnName} : ${column.dataType}"
cols += "${columnText}\n"
if (columnText.length() > width) {
width = columnText.length()
}
height++
}
builder."node"("id": table.elementName) {
builder."data"("key": "d0") {
builder."y:ShapeNode"()
}
}
def foo() {
def string = """
a multiline quote using triple quotes with ${var} gstring vars in.
"""
}
Handle strings delimited by one double quote on each side and triple-quoted strings separately.
Single quotes:
Use a lookahead to require an even number of quotes after the match. A negative lookbehind stops it from matching the inner quotes of triple-quoted strings.
Find: (?<!")"([^"$]*)"(?=(?:(?:[^"\r\n]*"){2})*[^"]*$)
Replace: '$1'
See live demo.
Triple quotes:
Use a simpler match for triple quoted strings, since they are on their own lines.
Find: """([^"$]*?)"""
Replace: '''$1'''
See live demo, which includes a triple-quoted string that contains a variable.
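A rough way to try both patterns outside IntelliJ is to run them in JavaScript, assuming a runtime with lookbehind support. IntelliJ uses Java-flavored regex, so this is only an approximation for illustration; the function names are hypothetical, and the single-quote pattern is exercised on single lines taken from the question.

```javascript
// Strings delimited by one double quote on each side, containing no dollar sign.
const singleRe = /(?<!")"([^"$]*)"(?=(?:(?:[^"\r\n]*"){2})*[^"]*$)/g;
// Triple-quoted strings containing no dollar sign.
const tripleRe = /"""([^"$]*?)"""/g;

// Hypothetical helpers; replacer functions avoid any "$" handling in replacement strings.
function convertSingle(line) {
  return line.replace(singleRe, (_match, body) => `'${body}'`);
}

function convertTriple(code) {
  return code.replace(tripleRe, (_match, body) => `'''${body}'''`);
}

console.log(convertSingle('foo("${var}": "bar")'));
console.log(convertSingle('builder."node"("id": table.elementName) {'));
console.log(convertTriple('def s = """\nplain text\n"""'));
console.log(convertTriple('def s = """\na ${var} b\n"""'));
```

In the first call only "bar" is rewritten and the GString "${var}" stays double-quoted; in the last call the triple-quoted string is left alone because it contains a dollar sign.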
You need to make sure the first quote comes after even number of quotes:
^[^\n\r"]*(?:(?:"[^"\n\r]*){2})*"([^$\"\n\r]+)"
Here you can play with it.
Explanation:
^[^"\n\r]* - some non-quotes at the beginning
"[^"\n\r]* - a quote, then some more non-quotes
(?:"[^"\n\r]*){2} - let's have two of this
(?:(?:...){2})* - actually, let's have 0, 2, 4, 6, ... whatever amount of this
Then your regex comes to match the right string: "([^$\"\n\r]+)"
If IntelliJ supports that, then you can make it faster by replacing the non-capturing groups (?:...) with atomic groups (?>...).
This regex finds the last string in the line so you'll have to run the replace several times.
Update
Updated the negated character classes with the newline characters. Now it works well for multi-line texts too. Still, you'll have to run it several times because it finds only one string per line.
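The "run it several times" step can be sketched as a loop in JavaScript. This is an illustration under assumptions: the prefix of the answer's pattern is wrapped in an extra capture group (an addition made here so the replacement can put it back), and the function name convertUntilStable is hypothetical.

```javascript
// The answer's pattern, with the part before the target string captured as Group 1
// (the extra group is an assumption of this sketch, not part of the answer).
const lastStringRe = /^([^\n\r"]*(?:(?:"[^"\n\r]*){2})*)"([^$"\n\r]+)"/gm;

function convertUntilStable(code) {
  let previous;
  let current = code;
  do {
    previous = current;
    // Each pass rewrites at most one string per line, so repeat until nothing changes.
    current = current.replace(
      lastStringRe,
      (_match, prefix, body) => `${prefix}'${body}'`
    );
  } while (current !== previous);
  return current;
}

console.log(convertUntilStable('builder."node"("id": x)'));
console.log(convertUntilStable('foo("${var}": "bar")'));
```

The loop terminates because every successful pass removes one pair of double quotes.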
The Input:
Let's consider this string below
* key : foo bar *
* big key : bar*bar
* healthy : cereal bar *
sadly : without star *
The Output:
I would like to retrieve the key:value pairs for each match.
'key', 'foo bar'
'big key', 'bar*bar'
'healthy', 'cereal bar'
'sadly', 'without star'
The Regex:
My first success was achieved with this Regex (PCRE/Perl):
/(\n?)([^\* ].*[^ *])\s+:\s+([^\* ].*[^ *])[\s\*]+(?|\n)/g
Here the DEMO.
My question
I find my regex pretty ugly. The main reason is that I couldn't use /^ and $/ in a global regex, so I had to play with /(\n?)...(?|\n)/g.
Is there any possibility to shorten the above regex?
The optional challenge
Actually this was the easy part. My string is supposed to be embedded in a C comment and I have to make sure I am not trying to match something outside a comment block.
(I don't really need an answer to this second, trickier question, because in a script I can first match all the comment blocks and then find all the key:value patterns.)
/********************************
* key : foo bar *
* big key : bar*bar
* healthy : /*cereal bar *
sadly : without star *
********************************/
not a key : this key
You can add the m flag to the regexp to make the anchors ^ and $ match at the beginning and end of each line within the string, i.e.:
/^\s*\*?\s*([^:]+?)\s*:\s*(.*?)\s*\*?\s*$/gm
Note the use of non-greedy quantifiers (+? and *?) to not eat up characters that can be matched after the quantifier, i.e. the first capture group will not include the optional trailing whitespace before the colon, and the second capture group will not include trailing whitespace and an optional asterisk at the end of a line.
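A small sketch of that pattern applied to the four sample lines from the question, assuming JavaScript with String.prototype.matchAll available. The variable names are illustrative.

```javascript
// Pattern from the answer: g to find every pair, m so ^ and $ anchor at each line.
const kvRe = /^\s*\*?\s*([^:]+?)\s*:\s*(.*?)\s*\*?\s*$/gm;

const block = [
  '* key : foo bar *',
  '* big key : bar*bar',
  '* healthy : cereal bar *',
  'sadly : without star *',
].join('\n');

// Collect [key, value] from the two capture groups of each match.
const pairs = Array.from(block.matchAll(kvRe), (m) => [m[1], m[2]]);
console.log(pairs);
```

The lazy groups are what keep the surrounding spaces and the trailing asterisk out of the captured key and value.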
http://regex101.com/r/oJ8uW4/1
The regex I used is: /^\s*[*]*\s+(.*)\s+:\s+(.*?)\s+[*]*\s*$/gm
It works for your example because the line not a key : this key has no space after it; however, it would miss comment lines that do not close with *, and it would also capture values with trailing spaces.
The point you're looking for is the modifiers after the last /:
m says the regex is multiline, so ^ and $ are usable, and g rematches on each line.
The drawback is you can't rely on having /* and */ on lines around when using ^ and $
But Avinash will prove me wrong I bet :) (he's far better than me with regexes)
How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings.
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
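The double-quote pattern above repeats a single path (a run of ordinary characters, then any number of escape-plus-run pairs) instead of an alternation. A minimal sketch, assuming plain JavaScript, that checks it finds the same string as the alternation form /"(?:[^"\\]|\\.)*"/ on the sample input used earlier. The variable names are illustrative, capture groups are written as non-capturing here, and the sketch says nothing about performance.

```javascript
// Alternation form: each iteration picks either an ordinary character or an escape.
const alternationRe = /"(?:[^"\\]|\\.)*"/;
// Unrolled form: a run of ordinary characters, then any number of (escape, run) pairs.
const unrolledRe = /"[^"\\]*(?:\\.[^"\\]*)*"/;

// The string value contains real backslashes before the inner quotes and before "n".
const codeSample = ' function(){ return " Is big \\"problem\\", \\no? "; }';

const viaAlternation = codeSample.match(alternationRe)[0];
const viaUnrolled = codeSample.match(unrolledRe)[0];

console.log(viaAlternation);
console.log(viaAlternation === viaUnrolled);
```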
Most of the solutions provided here use alternative repetition paths, i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, since it is known beforehand that a string without a closing quote cannot match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
1. Every quoted string starts with the character ".
2. It may contain any number of any characters: .*? {lazy match}; ending with a non-escape character [^\\].
3. Statement (2) is lazy(!) optional because the string can be empty (""). So: (.*?[^\\])??
4. Finally, every quoted string ends with the character ", but it can be preceded by an even number of escape signs, i.e. pairs (\\\\)+; this is greedy(!) optional: ((\\\\)+)?+ {greedy matching}, because the string can be empty or have no trailing pairs.
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
const matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
This regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
Here is one that works with both " and ', and you can easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or ').
http://www.regular-expressions.info/backref.html
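One caveat, at least in JavaScript without the u flag: inside a character class \1 is not a backreference, so [^\1] does not mean "anything but the opening quote", and the pattern relies on the lazy *? to stop at the closing quote. A sketch of a variant that expresses the intent with a negative lookahead instead; this is a swapped-in technique, not the answer's pattern, and the names are illustrative.

```javascript
// (["']) captures the opening quote.
// \\.           consumes any escaped character, including an escaped quote.
// (?!\1)[^\\]   consumes one character that is neither the opening quote nor a backslash.
const quotedRe = /(["'])(?:\\.|(?!\1)[^\\])*\1/g;

// String.raw keeps the backslashes in the sample text as written.
const mixedText = String.raw`a 'it\'s' b "x\"y" c`;

const quotedParts = mixedText.match(quotedRe);
console.log(quotedParts);
```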
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
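The same two-step idea can be sketched in JavaScript, where the g flag stands in for the replace-all behavior of the Java calls above. The function name blankOutStrings and the sample line are illustrative.

```javascript
// Hypothetical helper mirroring the two Java calls above.
function blankOutStrings(line) {
  return line
    // Step 1: turn every escaped double quote (\") into a single quote.
    .replace(/\\"/g, "'")
    // Step 2: with no escaped quotes left, a quoted string is just "..." with no quote inside.
    .replace(/"[^"]*"/g, '"x"');
}

console.log(blankOutStrings(String.raw`log("he said \"hi\"", 1)`));
```

One limitation of this shortcut: a string that ends in an escaped backslash, such as "a\\", would be misread in step 1, because the second backslash and the closing quote look like an escaped quote.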
If your IDE is IntelliJ IDEA, you can forget all these headaches: store your regex in a String variable, and when you paste it inside the double quotes it is automatically converted to an acceptable escaped format.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"