A common thing I want to do, when doing a search-replace in an IDE (in this case: PyCharm), is to avoid cutting expressions or statements in half.
For example, suppose I want to fix the fact that my code is using Python 2-style print statements. I might write:
Search: print (.+), replace: print($1)
But this will do the wrong thing for multi-line statements:
print 'one' \
'two'
In general, recognizing multi-line statements is complicated. You need to check for trailing backslashes (\) and also do bracket-matching for multiple types of brackets. Is there built-in functionality for doing this? Some kind of end-of-statement / end-of-expression escape sequence?
You could probably do it this way.
Find print((?:.+?(?:\\\r?\n)?)+)
Replace print($1)
Expanded
print
( # (1 start)
(?:
.+?
(?: \\ \r? \n )? # Possible line-continuation
)+
) # (1 end)
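A quick way to try the pattern outside the IDE is Python's re module. The following is only a sketch under a few assumptions: to_print_call is a hypothetical helper name, a space is added after print so that it is not captured, and only backslash continuations are handled (not statements that continue inside open brackets).

```python
import re

# The pattern from above, with a space after "print" (an assumption, so the
# space does not end up inside the parentheses). Each repetition takes one
# character and, if a backslash-newline follows, swallows that continuation.
PRINT_STATEMENT = re.compile(r"print ((?:.+?(?:\\\r?\n)?)+)")


def to_print_call(source: str) -> str:
    """Wrap the argument of each Python 2 print statement in parentheses."""
    return PRINT_STATEMENT.sub(r"print(\1)", source)


if __name__ == "__main__":
    sample = "print 'one' \\\n'two'\nx = 1\n"
    print(to_print_call(sample))
```

The closing parenthesis lands after the last continued line, so the two-line statement stays in one piece.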
Related
Problem:
I have thousands of documents which contain a specific character I don't want, e.g. the character a. These documents contain a variety of characters, but the a's I want to replace are inside double quotes or single quotes.
I would like to find and replace them, and I thought using Regex would be needed. I am using VSCode, but I'm open to any suggestions.
My attempt:
I was able to find the following regex to match for a specific string containing the values inside the ().
".*?(r).*?"
However, this only highlights the entire quote. I want to highlight the character only.
Any solution, perhaps outside of regex, is welcome.
Example outcomes:
Given, the character is a, find replace to b
Somebody once told me "apples" are good for you => Somebody once told me "bpples" are good for you
"Aardvarks" make good kebabs => "Abrdvbrks" make good kebabs
The boy said "aaah!" when his mom told him he was eating aardvark => The boy said "bbbh!" when his mom told him he was eating aardvark
Visual Studio Code
VS Code uses the JavaScript regex engine for its find / replace functionality. This means you are quite limited in what you can do with regex in comparison to other flavors like .NET or PCRE.
Luckily, this flavor supports lookaheads, and with a lookahead you can look for characters without consuming them. So one way to ensure that we are within a quoted string is to require that the number of quotes between the matched a and the end of the file / subject string is odd:
a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)
This looks for a's in a double-quoted string. To handle single-quoted strings, substitute every " with '. You can't have both at the same time.
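The same lookahead also works in Python's re module, which makes it easy to check against the example outcomes from the question. This is a minimal sketch; replace_a_in_quotes is a hypothetical helper name, and the whole text is treated as one subject string.

```python
import re

# An "a" is replaced only when an odd number of double quotes follows it up
# to the end of the text, i.e. when it sits inside a quoted string.
# Note: in Python, "$" also matches just before a trailing newline.
A_INSIDE_DOUBLE_QUOTES = re.compile(r'a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)')


def replace_a_in_quotes(text: str, replacement: str = "b") -> str:
    """Replace every lowercase a that lies inside double quotes."""
    return A_INSIDE_DOUBLE_QUOTES.sub(replacement, text)


if __name__ == "__main__":
    print(replace_a_in_quotes('"Aardvarks" make good kebabs'))
```

Because the lookahead rescans the rest of the text for every candidate a, the running time grows quadratically with the input size.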
There is a problem with the regex above, however: it conflicts with escaped double quotes within double-quoted strings. Matching those too, if it matters, takes a considerably longer pattern:
a(?=[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*)*$)
Applying these approaches to large files will probably result in a stack overflow, so let's see a better approach.
I am using VSCode, but I'm open to any suggestions.
That's great. Then I'd suggest using awk, sed, or something more programmatic to achieve what you are after. Alternatively, if you are able to use Sublime Text, there is a chance to work around this problem in a more elegant way.
Sublime Text
This is supposed to work on large files with hundreds of thousands of lines. Note that it works for a single character (here a); with some modifications it may work for a word or substring too:
Search for:
(?:"|\G(?<!")(?!\A))(?<r>[^a"\\]*+(?>\\.[^a"\\]*)*+)\K(a|"(*SKIP)(*F))(?(?=((?&r)"))\3)
^ ^ ^
Replace it with: WHATEVER\3
RegEx Breakdown:
(?: # Beginning of non-capturing group #1
" # Match a `"`
| # Or
\G(?<!")(?!\A) # Continue matching from last successful match
# It shouldn't start right after a `"`
) # End of NCG #1
(?<r> # Start of capturing group `r`
[^a"\\]*+ # Match anything except `a`, `"` or a backslash (possessively)
(?>\\.[^a"\\]*)*+ # Match an escaped character or
# repeat last pattern as much as possible
)\K # End of CG `r`, reset all consumed characters
( # Start of CG #2
a # Match literal `a`
| # Or
"(*SKIP)(*F) # Match a `"` and skip over current match
)
(?(?= # Start a conditional cluster, assuming a positive lookahead
((?&r)") # Start of CG #3, recurs CG `r` and match `"`
) # End of condition
\3 # If conditional passed match CG #3
) # End of conditional
Three-step approach
Last but not least...
Matching a character inside quotation marks is tricky since delimiters are exactly the same so opening and closing marks can not be distinguished from each other without taking a look at adjacent strings. What you can do is change a delimiter to something else so that you can look for it later.
Step 1:
Search for: "[^"\\]*(?:\\.[^"\\]*)*"
Replace with: $0Я
Step 2:
Search for: a(?=[^"\\]*(?:\\.[^"\\]*)*"Я)
Replace with whatever you expect.
Step 3:
Search for: "Я
Replace with: " (a lone double quote, so that only the marker is removed and everything is reverted).
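The three steps can also be scripted. The sketch below is one possible Python rendering, not a definitive implementation: replace_in_quotes and MARK are hypothetical names, and the marker character is assumed not to occur in the input.

```python
import re

MARK = "\u042f"  # the marker character "Я"; assumed absent from the input

# A double-quoted string that may contain backslash-escaped characters.
QUOTED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')


def replace_in_quotes(text: str, old: str, new: str) -> str:
    """Replace old with new, but only inside double-quoted strings."""
    if MARK in text:
        raise ValueError("marker already occurs in the text; pick another one")
    # Step 1: tag the closing quote of every double-quoted string.
    marked = QUOTED.sub(lambda m: m.group(0) + MARK, text)
    # Step 2: replace `old` only when the next unescaped quote is tagged.
    inside = re.compile(
        re.escape(old) + r'(?=[^"\\]*(?:\\.[^"\\]*)*"' + MARK + ")"
    )
    replaced = inside.sub(lambda m: new, marked)
    # Step 3: drop the marker again, keeping the quote itself.
    return replaced.replace('"' + MARK, '"')


if __name__ == "__main__":
    print(replace_in_quotes('an "apple" a day, "banana"', "a", "b"))
```

Each lookahead now stops at the nearest quote instead of scanning to the end of the text, which is what keeps this variant usable on large inputs.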
/(["'])(.*?)(a)(.*?\1)/g
With the replace pattern:
$1$2$4
As far as I'm aware, VS Code uses the same regex engine as JavaScript, which is why I've written my example in JS.
The problem with this is that if you have multiple a's in one set of quotes, a single pass removes only one of them. So there needs to be some code behind it (or you, hammering the replace button until no more matches are found) to reapply the pattern until all the a's between quotes are gone:
let regex = /(["'])(.*?)(a)(.*?\1)/g,
subst = `$1$2$4`,
str = `"a"
"helapke"
Not matched - aaaaaaa
"This is the way the world ends"
"Not with fire"
"ABBA"
"abba",
'I can haz cheezburger'
"This is not a match'
`;
// Loop to get rid of multiple a's in quotes
while(str.match(regex)){
str = str.replace(regex, subst);
}
const result = str;
console.log(result);
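When a script is an option anyway, the loop can be avoided by matching each quoted string once and doing the replacement in a callback. This is a different technique from the repeated replace above, sketched here in Python; replace_inside_quotes is a hypothetical name, and quoted strings are assumed to stay on one line and to contain no escaped quotes.

```python
import re

# An opening quote (single or double), the shortest possible body, and the
# same quote character again.
QUOTED = re.compile(r"""(["'])(.*?)\1""")


def replace_inside_quotes(text: str, old: str, new: str) -> str:
    """Replace every occurrence of old inside each quoted string in one pass."""
    def fix(match: "re.Match[str]") -> str:
        quote, body = match.group(1), match.group(2)
        return quote + body.replace(old, new) + quote

    return QUOTED.sub(fix, text)


if __name__ == "__main__":
    print(replace_inside_quotes('"abba" and abba', "a", ""))
```

Mismatched pairs such as "This is not a match' are left untouched, just as with the original pattern.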
First, a few considerations:
1. There could be multiple a characters within a single quote.
2. Each quote (using single or double quotation marks) consists of an opening quote character, some text and the same closing quote character. A simple approach is to assume that when the quote characters are counted sequentially, the odd ones are opening quotes and the even ones are closing quotes.
3. Following point 2, it could be worth some further thought on whether single-quoted strings should be allowed. See the following example: It's a shame 'this quoted text' isn't quoted. Here, the simple approach would think there were two quoted strings: s a shame and isn. Another: This isn't a quote ...'this is' and 'it's unclear where this quote ends'. I've avoided attempting to tackle these complexities and gone with the simple approach below.
The bad news is that point 1 presents a bit of a problem, as a capturing group with a wildcard repeat character after it (e.g. (.*)*) will only capture the last captured "thing". But the good news is there's a way of getting around this within certain limits. Many regex engines will allow up to 99 capturing groups (*). So if we can make the assumption that there will be no more than 99 a's in each quote (UPDATE ...or even if we can't - see step 3), we can do the following...
(*) Unfortunately my first port of call, Notepad++ doesn't - it only allows up to 9. Not sure about VS Code. But regex101 (used for the online demos below) does.
TL;DR - What to do?
Search for: "([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*"
Replace with: "\1\2\3\4\5\6\7\8\9\10\11\12\13\14\15\16\17\18\19\20\21\22\23\24\25\26\27\28\29\30\31\32\33\34\35\36\37\38\39\40\41\42\43\44\45\46\47\48\49\50\51\52\53\54\55\56\57\58\59\60\61\62\63\64\65\66\67\68\69\70\71\72\73\74\75\76\77\78\79\80\81\82\83\84\85\86\87\88\89\90\91\92\93\94\95\96\97\98\99"
(Optionally, keep repeating the previous two steps until every such character has been replaced, if there's a possibility of more than 99 of them in a single quote.)
Repeat step 1 but replacing all " with ' in the regular expression, i.e: '([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*'
Repeat steps 2-3.
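The long search and replace strings above do not have to be typed by hand. As a rough sketch, they can be generated in Python; build_pattern, build_replacement and strip_char_in_quotes are hypothetical names, the character and the quote are assumed to be plain characters with no special regex meaning, and Python's \g<n> group syntax stands in for \1 to \99. As in the strings above, the character is removed rather than swapped for another one.

```python
import re

GROUPS = 99  # the same number of capturing groups as in the pattern above


def build_pattern(char: str, quote: str) -> str:
    """Build: quote, then 99 times ([^<char><quote>]*)<char>*, then quote."""
    piece = f"([^{char}{quote}]*){char}*"
    return quote + piece * GROUPS + quote


def build_replacement(quote: str) -> str:
    """Build: quote, then references to groups 1..99, then quote."""
    refs = "".join(rf"\g<{i}>" for i in range(1, GROUPS + 1))
    return quote + refs + quote


def strip_char_in_quotes(text: str, char: str = "a", quote: str = '"') -> str:
    return re.sub(build_pattern(char, quote), build_replacement(quote), text)


if __name__ == "__main__":
    print(strip_char_in_quotes('"banana" and "abba"'))
```

Calling strip_char_in_quotes a second time with quote="'" corresponds to the single-quote pass.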
Online demos
Please see the following regex101 demos, which could actually be used to perform the replacements if you're able to copy the whole text into the contents of "TEST STRING":
Demo for double quotes
Demo for single quotes.
If you can use Visual Studio (instead of Visual Studio Code), it is written in C++ and C# and uses the .NET Framework regular expressions, which means you can use variable length lookbehinds to accomplish this.
(?<="[^"\n]*)a(?=[^"\n]*")
Adding some more logic to the above regular expression, we can tell it to ignore any location that is preceded by an even number of ". This prevents matches for a outside of quotes. Take, for example, the string "a" a "a". Only the first and last a in this string will be matched, but the one in the middle will be ignored.
(?<!^[^"\n]*(?:(?:"[^"\n]*){2})+)(?<="[^"\n]*)a(?=[^"\n]*")
Now the only problem is this will break if we have escaped " within two double quotes such as "a\"" a "a". We need to add more logic to prevent this behaviour. Luckily, this beautiful answer exists for properly matching escaped ". Adding this logic to the regex above, we get the following:
(?<!^[^"\n]*(?:(?:"(?:[^"\\\n]|\\.)*){2})+)(?<="[^"\n]*)a(?=[^"\n]*")
I'm not sure which method works best with your strings, but I'll explain this last regex in detail as it also explains the two previous ones.
(?<!^[^"\n]*(?:(?:"(?:[^"\\\n]|\\.)*){2})+) Negative lookbehind ensuring what precedes doesn't match the following
    ^ Assert position at the start of the line
    [^"\n]* Match anything except " or \n any number of times
    (?:(?:"(?:[^"\\\n]|\\.)*){2})+ Match the following one or more times. This ensures if there are any " preceding the match that they are balanced in the sense that there is an opening and closing double quote.
        (?:"(?:[^"\\\n]|\\.)*){2} Match the following exactly twice
            " Match this literally
            (?:[^"\\\n]|\\.)* Match either of the following any number of times
                [^"\\\n] Match anything except ", \ and \n
                \\. Matches \ followed by any character
(?<="[^"\n]*) Positive lookbehind ensuring what precedes matches the following
    " Match this literally
    [^"\n]* Match anything except " or \n any number of times
a Match this literally
(?=[^"\n]*") Positive lookahead ensuring what follows matches the following
    [^"\n]* Match anything except " or \n any number of times
    " Match this literally
You can drop the \n from the above pattern as the following suggests. I added it just in case there's some sort of special cases I'm not considering (i.e. comments) that could break this regex within your text. The \A also forces the regex to match from the start of the string (or file) instead of the start of the line.
(?<!\A[^"]*(?:(?:"(?:[^"\\]|\\.)*){2})+)(?<="[^"]*)a(?=[^"]*")
You can test this regex here
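Python's built-in re module has no variable-length lookbehind, so these patterns cannot be tried there as written. For experimenting outside Visual Studio, the same inside/outside-quotes logic can be approximated with a small hand-written scanner instead of a regex. This is a sketch under stated assumptions: replace_in_double_quotes is a hypothetical name, old is a single character, input is processed one line at a time, and backslash escapes are only honored inside quotes.

```python
def replace_in_double_quotes(line: str, old: str, new: str) -> str:
    """Replace the character old with new inside closed double-quoted strings."""
    out = []          # pieces of the rewritten line
    inside = False    # True while between an opening and a closing quote
    open_in = 0       # index in `line` of the current opening quote
    open_out = 0      # length of `out` when that quote was seen
    i = 0
    while i < len(line):
        ch = line[i]
        if inside and ch == "\\" and i + 1 < len(line):
            out.append(line[i:i + 2])  # keep an escaped character untouched
            i += 2
            continue
        if ch == '"':
            if not inside:
                open_in, open_out = i, len(out)
            inside = not inside
            out.append(ch)
        elif inside and ch == old:
            out.append(new)
        else:
            out.append(ch)
        i += 1
    if inside:
        # No closing quote: undo replacements made after the last opening quote.
        out = out[:open_out] + [line[open_in:]]
    return "".join(out)


if __name__ == "__main__":
    print(replace_in_double_quotes('"a" a "a"', "a", "b"))
```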
I am using VSCode, but I'm open to any suggestions.
If you want to stay in an editor environment, you could use
Visual Studio (>= 2012) or even Notepad++ for a quick fix-up.
This avoids having to use a spurious script environment.
Both of these engines (.NET and Boost, respectively) support the \G construct,
which starts the next match at the position where the last one left off.
Again, this is just a suggestion.
This regex doesn't check the validity of balanced quotes within the entire
string ahead of time (but it could with the addition of a single line).
It is all about knowing where the inside and outside of quotes are.
I've commented the regex, but if you need more info let me know.
Again this is just a suggestion (I know your editor uses ECMAScript).
Find (?s)(?:^([^"]*(?:"[^"a]*(?=")"[^"]*(?="))*"[^"a]*)|(?!^)\G)a([^"a]*(?:(?=a.*?")|(?:"[^"]*$|"[^"]*(?=")(?:"[^"a]*(?=")"[^"]*(?="))*"[^"a]*)))
Replace $1b$2
That's all there is to it.
https://regex101.com/r/loLFYH/1
Comments
(?s) # Dot-all inline modifier
(?:
^ # BOS
( # (1 start), Find first quote from BOS (written back)
[^"]*
(?: # --- Cluster
" [^"a]* # Inside quotes with no 'a'
(?= " )
" [^"]* # Between quotes, get up to next quote
(?= " )
)* # --- End cluster, 0 to many times
" [^"a]* # Inside quotes, will be an 'a' ahead of here
# to be sucked up by this match
) # (1 end)
| # OR,
(?! ^ ) # Not-BOS
\G # Continue where left off from last match.
# Must be an 'a' at this point
)
a # The 'a' to be replaced
( # (2 start), Up to the next 'a' (to be written back)
[^"a]*
(?: # --------------------
(?= a .*? " ) # If stopped before 'a', must be a quote ahead
| # or,
(?: # --------------------
" [^"]* $ # If stopped at a quote, check for EOS
| # or,
" [^"]* # Between quotes, get up to next quote
(?= " )
(?: # --- Cluster
" [^"a]* # Inside quotes with no 'a'
(?= " )
" [^"]* # Between quotes
(?= " )
)* # --- End cluster, 0 to many times
" [^"a]* # Inside quotes, will be an 'a' ahead of here
# to be sucked up on the next match
) # --------------------
) # --------------------
) # (2 end)
"Inside double quotes" is rather tricky, because there are many complicating scenarios to consider to fully automate this.
What are your precise rules for "enclosed by quotes"? Do you need to consider multi-line quotes? Do you have quoted strings containing escaped quotes or quotes used other than starting/ending string quotation?
However there may be a fairly simple expression to do much of what you want.
Search expression: ("[^a"]*)a
Replacement expression: $1b
This doesn't consider inside or outside of quotes; you have to do that visually. But it highlights text from the quote to the matching character, so you can quickly decide if this is inside or not.
If you can live with the visual inspection, then we can build up this pattern to include different quote types and upper and lower case.
Let's say I am using Visual Studio's Find and Replace tool and I want to find and comment out every instance of Console.WriteLine(...). However, I can wind up with situations where Console.WriteLine(...) goes across multiple lines like so:
Console.WriteLine("Adding drive to VM with ID: {0}. Drive HostVMID is {1}",
vm.ID, drive.HostVmId);
These can go on for 2, 3, 4, etc lines and finally end with ); to close the statement. Then I can have other lines that are immediately followed by important blocks of code:
Console.WriteLine("Creating snapshot for VM: {0} {1}", dbVm.ID, dbVm.VmName);
dbContext.Add(new RTVirtualMachineSnapshot(dbVm));
So what I want to do is come up with a regex statement that will find both the first type of instances of Console.WriteLine as well as simple single-line instances of it.
The Regex that I got from another user was:
Console\.writeline(?>.+)(?<!;)
Which will match any line that contains Console.WriteLine but does not end with a semicolon. However I need it to continue on until it finally does reach a closing parenthesis followed by a semicolon.
I've tried the following regex:
(Console\.writeline(?>.+)(?<!\);)
However, I think that's incorrect because it still only matches the first line and doesn't capture the following lines when the WriteLine call spans multiple lines.
At the end of the day I want to be able to capture a full Console.writeline statement regardless of how many lines it spans using Visual Studio's find and replace feature and I am a little confused on the regex I would need to use to do this.
I guess you could try this which just looks for a pseudo termination,
but does not take into account string quotes.
(?s)\b(Console\s*\.\s*WriteLine\s*\((?:(?!\)\s*;).)*\)\s*;)
Formatted:
(?s)
\b
( # (1 start)
Console \s* \. \s* WriteLine \s* \(
(?:
(?! \) \s* ; )
.
)*
\) \s* ;
) # (1 end)
add:
If you cannot use dot-all modifier (?s) and Dot-all options are not available (should be),
substitute [\S\s] for the dot.
Then just substitute "/** $1 **/" to comment it out.
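To check the pattern outside Visual Studio, it can be run through Python's re module, which accepts the same (?s) modifier and lookahead. A minimal sketch, assuming the hypothetical helper name comment_out_writeline:

```python
import re

# Match from Console.WriteLine( up to the first ")" that is followed by ";".
# The (?s) modifier lets the dot cross line breaks.
WRITELINE_CALL = re.compile(
    r"(?s)\b(Console\s*\.\s*WriteLine\s*\((?:(?!\)\s*;).)*\)\s*;)"
)


def comment_out_writeline(source: str) -> str:
    """Wrap every Console.WriteLine(...); statement in a block comment."""
    return WRITELINE_CALL.sub(r"/** \1 **/", source)


if __name__ == "__main__":
    code = 'Console.WriteLine("a",\n    b);\nkeep();\n'
    print(comment_out_writeline(code))
```

As noted above, string contents are not taken into account, so a ); inside a string literal ends the match early.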
I have some obfuscated code which call functions, like this:
getAny([["text with symbols \"()[],.;\" and maybe 'ImVerySeriousFn'"], ...]);
setAny([["other text with \"()[],.;\""], ...]);...
Arguments contain random text. Functions follow each other without a new line.
How can I get arguments of getAny, setAny and other functions, using set of regular expressions?
I need this result:
regex1 result: [["text with symbols \"()[],.;\" and maybe 'ImVerySeriousFn'"], ...]
regex2 result: [["other text with \"()[],.;\""], ...]
...
I tried writing regex1:
getAny\((.*)\)
but the match also contains the setAny call:
[["text with symbols \"()[],.;\" and maybe 'ImVerySeriousFn'"], ...]);setAny([["other text with \"()[],.;\""], ...]
When I tried:
getAny\((.*?)\)
the match cuts the argument string short:
[["text with symbols \"(
I can't split by ; or ); because the text in the arguments can contain ; or );
Maybe it's impossible to do this using regex?
Your regex needs to be \(.*?\); since your code is obfuscated (and assumedly on one line).
Note that this will fail if one of your arguments contains ); inside of it.
Explanation (From Regex101.com):
/\((.*?)\);/g
\( matches the character ( literally
1st Capturing group (.*?)
.*? matches any character (except newline)
Quantifier: Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\) matches the character ) literally
; matches the character ; literally
g modifier: global. All matches (don't return on first match)
The main problem with your regex is that you never specified ; to end a match, so it went ahead and grabbed up until the last ) it saw because you used .*, which is greedy (grabs everything) unless followed by ?.
I don't know if I understand your question, but if I do, you could maybe use a group and collect the allowed characters in it.
Your regex could be: \( ( ) " [ ],\.; a-zA-Z \)
outer brackets enclose the group
If I understand your pattern correctly, your function argument will always start with [[" and end with "]].
Regex:
/getAny\((\[\[".*?[^\\]"\]\])\);/
Demo: http://regex101.com/r/jC3vX5/2
Note the lazy .*?, and [^\\] to make sure the matching quote is not escaped.
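The pattern can be generalized to any function name. The following Python sketch is only an illustration: call_argument is a hypothetical helper, and the sample input drops the elided ", ..." part of the question so that each argument really does end with "]].

```python
import re
from typing import Optional


def call_argument(source: str, function_name: str) -> Optional[str]:
    """Return the [["..."]] argument of the first call to function_name."""
    pattern = re.compile(
        # Lazy .*? plus [^\\] so the closing quote is not an escaped one.
        re.escape(function_name) + r'\((\[\[".*?[^\\]"\]\])\);'
    )
    match = pattern.search(source)
    return match.group(1) if match else None


if __name__ == "__main__":
    code = r'''getAny([["text with symbols \"()[],.;\" and maybe 'ImVerySeriousFn'"]]);setAny([["other text with \"()[],.;\""]]);'''
    print(call_argument(code, "getAny"))
    print(call_argument(code, "setAny"))
```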
I have a regular expression of the form
def parse(self, format_string):
for m in re.finditer(
r"""(?: \$ \( ( [^)]+ ) \) ) # the field access specifier
| (
(?:
\n | . (?= \$ \( ) # any one single character before the '$('
)
| (?:
\n | . (?! \$ \( ) # any one single character, except the one before the '$('
)*
)""",
format_string,
re.VERBOSE):
...
I would like to replace all the repeating sequences (\$ \() with some custom shorthand "constant", which would look like this:
def parse(self, format_string):
re.<something>('\BEGIN = \$\(')
for m in re.finditer(
r"""(?: \BEGIN ( [^)]+ ) \) ) # the field access specifier
| (
(?:
\n | . (?= \BEGIN ) # any one single character before the '$('
)
| (?:
\n | . (?! \BEGIN ) # any one single character, except the one before the '$('
)*
)""",
format_string,
re.VERBOSE):
...
Is there a way to do this with regular expressions themselves (i.e. not using Python's string formatting to substitute \BEGIN with \$\()?
Clarification: the Python source is purely for context and illustration. I'm looking for RE solution, which would be available in some RE dialect (maybe not Python's one), not the solution specifically for Python.
I don't think this is possible in Python's regex flavor. You would need recursion (or rather pattern reuse), which Python's built-in re module does not support; PCRE does (as do a few other engines, such as Perl's). In fact, PCRE even mentions how defining shorthands works in its man page (search for "Defining subpatterns").
In PCRE, you can use the recursion syntax in a similar way to backreferences - except that the pattern is applied again, instead of trying to get the same literal text as from a backreference. Example:
/(\d\d)-(?1)-(?1)/
Matches something like a date (where (?1) will be replaced with \d\d and evaluated again). This is really powerful, because if you use this construct within the referenced group itself you get recursion - but we don't even need that here. The above also works with named groups:
/(?<my>\d\d)-(?&my)-(?&my)/
Now we're already really close, but the definition is also the first use of the pattern, which somewhat clutters up the expression. The trick is to use the pattern first in a position that is never evaluated. The man pages suggest a conditional that is dependent on a (non-existent) group DEFINE:
/
(?(DEFINE)
(?<my>\d\d)
)
(?&my)-(?&my)-(?&my)
/x
The construct (?(group)true|false) applies pattern true if the group group was used before, and (the optional) pattern false otherwise. Since there is no group DEFINE, the condition will always be false, and the true pattern will be skipped. Hence, we can put all kinds of definitions there, without being afraid that they will ever be applied and mess up our results. This way we get them into the pattern, without really using them.
An alternative is a negative lookahead that never reaches the point where the expression is defined:
/
(?!
(?!) # fail - this makes the surrounding lookahead pass unconditionally
# the engine never gets here; now we can write down our definitions
(?<my>\d\d)
)
(?&my)-(?&my)-(?&my)
/x
However, you only really need this form, if you don't have conditionals, but do have named pattern reuse (and I don't think a flavor like this exists). The other variant has the advantage, that the use of DEFINE makes it obvious what the group is for, while the lookahead approach is a bit obfuscated.
So back to your original example:
/
# Definitions
(?(DEFINE)
(?<BEGIN>[$][(])
)
# And now your pattern
(?: (?&BEGIN) ( [^)]+ ) \) ) # the field access specifier
|
(
(?: # any one single character before the '$('
\n | . (?= (?&BEGIN) )
)
|
(?: # any one single character, except the one before the '$('
\n | . (?! (?&BEGIN) )
)*
)
/x
There are two major caveats to this approach:
Recursive references are atomic (this holds for PCRE1 and for PCRE2 before release 10.30). That is, once the reference has matched something it will never be backtracked into. For certain cases this can mean that you have to be a bit clever in crafting your expression, so that the first match will always be the one you want.
You cannot use capturing inside the defined patterns. If you use something like (?<myPattern>a(b)c) and reuse it, the b will never be captured - when reusing a pattern, all groups are non-capturing.
The most important advantage over any kind of interpolation or concatenation however is, that you can never produce invalid patterns with this, and you cannot mess up your capturing group counts either.
New to regex and I need to pattern match on some dates to change the format.
I'm going from mm/dd/yy to yyyy-mm-dd where there are no entries prior to 2000.
What I'm unfamiliar with is how to group things to use their respective references of \1, \2, etc.
Would I first want to match on mm/dd/yy with something like ( \d{2} ) ( \/\d{2} ) ( \/\d{2} ) or is it as easy as \d\d/\d\d/\d\d ?
Assuming my first grouping is partially the right idea, I'm looking to do something like:
:%s/old/new/g
:%s/ ( \d{2} ) ( \/\d{2} ) ( \/\d{2} ) / ( 20+\3) - (\3) - (\1) /g
EDIT: Sorry, the replace is going to a yyyy-mm-dd format with hyphens, not the slash.
I was going to comment on another answer but it got complicated.
Mind the magic setting. If you want unescaped parens to do grouping, you need to include \v somewhere in your pattern. (See :help magic).
You can avoid escaping the slashes if you use something other than slashes in the :s command.
You are close. :) You don't want all of those spaces though as they'll require spaces in the same places to match.
My solution, where I use \v so I don't need to escape the parens, and exclamation points as delimiters so I can use slashes in my pattern without escaping them:
:%s!\v(\d{2})/(\d{2})/(\d{2})!20\3-\1-\2!g
This will also match "inside" items that start or end with three or more digits, though. If you can give begin/end criteria, that would be helpful. Assuming that simple "word boundary" conditions work, you can use <>:
:%s!\v<(\d{2})/(\d{2})/(\d{2})>!20\3-\1-\2!g
To critique yours specifically (for learning!):
:%s/ ( \d{2} ) ( \/\d{2} ) ( \/\d{2} ) / ( 20+\3) - (\3) - (\1) /g
Get rid of the spaces since presumably you don't want them!
Your grouping needs either \( \) or \v to work
You also need \{2} unless you use \v
You are putting the slashes in groups two and three which means they'll show up in the replacement too
You don't want the parentheses in the output!
You're substituting text directly; you don't want the + after the 20 in the output
Try this:
:%s/\(\d\{2}\)\/\(\d\{2}\)\/\(\d\{2}\)/20\3-\1-\2/g
The bits you're interested in are: \(...\) - capture; \d - a digit; \{N} - N occurrences; and \/ - a literal forward slash.
So that's capturing two digits, skipping a slash, capturing two more, skipping another slash, and capturing two more, then replacing it with "20" + the third couplet + "-" + the first couplet + "-" + the second couplet. That should turn "mm/dd/yy" into "20yy-mm-dd".
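For comparison, here is the same capture-and-reorder idea outside Vim, as a small Python sketch for the mm/dd/yy to yyyy-mm-dd conversion from the question. The helper name to_iso_dates is hypothetical, and every year is assumed to be 2000 or later, as the question states.

```python
import re

# Group 1 = month, group 2 = day, group 3 = two-digit year. The \b anchors
# keep the pattern from matching inside longer runs of digits.
US_SHORT_DATE = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")


def to_iso_dates(text: str) -> str:
    """Rewrite mm/dd/yy dates as 20yy-mm-dd."""
    return US_SHORT_DATE.sub(r"20\3-\1-\2", text)


if __name__ == "__main__":
    print(to_iso_dates("Due 03/15/21 and 12/01/05"))
```

Python's re needs no escaping for grouping parentheses or braces, which is roughly what \v buys you in Vim.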
OK, try this one:
:0,$s#\(\d\{1,2\}\)/\(\d\{1,2\}\)/\(\d\{1,2\}\)#20\3-\1-\2#g
I've removed a lot of the spaces, both in the matching section and the replacement section, and most of the parens, because the format you were asking for didn't have them.
Some things of note:
With vi you can change the '/' to any other character, which helps when you're trying to match a string with slashes in it. I usually use '#' but it doesn't have to be.
You've got to escape the parens and the curly braces.
I use :0,$ instead of %, but I think it has the same meaning: apply the following command to every row between row 0 and the end.
For the match: (\d{2})\/(\d{2})\/(\d{2})
For the replace: 20\3\/\1\/\2