Replace a character by another, unless it is located in between braces - regex

What I would like to do with the following string is replace all commas "," with tabs, unless the comma is between braces { }.
Say I have:
goldRigged,1,0,0,0,1,0,0,0,1,"{"LootItemID": "goldOre", "Amount": 1}"
The result should be:
goldRigged\t1\t0\t0\t0\t1\t0\t0\t0\t1\t"{"LootItemID": "goldOre"**,** "Amount": 1}"
I already have: \"(\\{((.*?))\\})\" which allows me to match what's in between { }.
The idea would be to exclude that content and match any commas with something like \",^(\\{((.*?))\\})\"
But I guess that doing so would exclude the comma itself.

What you would need is called a negative lookahead and a negative lookbehind. However, this would make for quite a complex statement:
Match all commas that are not preceded by an opening brace, unless that brace was already followed by a closing brace (plus the mirrored logic for the right side of the comma). This results in an expression that is expensive to process, because the regex engine constantly needs to run back and forth over your string from its current position, which is rather inefficient.
Instead, iterate over all characters of your string. If you find an opening brace, set an escape hint. Remove it when you find a closing brace. When you find a comma, replace it only if your escape hint is not set. Write your result to some sort of string buffer and your solution will be significantly more efficient than the regex.
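The character scan described above can be sketched in Python as follows. This is only a minimal illustration: the function name commas_to_tabs and the flag name inside_braces are made up here, and the sketch assumes braces are never nested.

```python
def commas_to_tabs(line: str) -> str:
    """Replace commas with tabs, except between '{' and '}' (no nesting assumed)."""
    out = []
    inside_braces = False  # the "escape hint" from the description above
    for ch in line:
        if ch == "{":
            inside_braces = True
        elif ch == "}":
            inside_braces = False
        # Only commas outside braces become tabs; everything else is copied.
        if ch == "," and not inside_braces:
            out.append("\t")
        else:
            out.append(ch)
    return "".join(out)


sample = 'goldRigged,1,0,0,0,1,0,0,0,1,"{"LootItemID": "goldOre", "Amount": 1}"'
converted = commas_to_tabs(sample)
print(converted)
```

The string is walked exactly once and the result is collected in a list that is joined at the end, which plays the role of the string buffer.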

You want to use a negative lookaround to achieve this:
(?<![\{\}]),*(?![\{\}]) should work, try here: http://regex101.com/r/gG3oU1

Use negative lookahead (?!expr) and negative lookbehind (?<!expr) in your regex expression
for example you can code like this:
System.Text.RegularExpressions.Regex.Replace(
    "goldRigged,1,0,0,0,1,0,0,0,1, {\"LootItemID\": \"goldOre\", \"Amount\": 1}",
    @"(?<!\{[^\}].*)[,](?![^\{]*\})", "\t");

Does your input line contain the { only in the last token?
If yes then you can try this brute force approach
echo 'goldRigged,1,0,0,0,1,0,0,0,1,"{"LootItemID": "goldOre", "Amount": 1}"' | awk -F'{' '{one=$1; gsub(",","\t",one); printf("%s{%s\n",one,$2);}'

The regex below is an expensive way of doing it. As suggested by @Sniffer, a parser would be nicer here :)
(?=,.*?"{),|(?!,.*?\}),
First alternation
(?=,.*?"{), - make sure comma is outside the sequence "{
Second alternation
(?!,.*?\}), - make sure comma isn't inside the sequence }"
There will be edge cases that haven't been accounted for; that's where the parser comes in.

I think you actually need only one lookahead:
,(?=[^{}]*({|$))
reads: a comma, followed by some non-braces and then either an open brace or the end.
Example in JS:
> x = 'goldRigged,1,0,0,0,1,0,0,0,1,"{"LootItemID": "goldOre", "Amount": 1}",some,more{stuff,ff}end'
> x.replace(/,(?=[^{}]*({|$))/g, "#")
"goldRigged#1#0#0#0#1#0#0#0#1#"{"LootItemID": "goldOre", "Amount": 1}"#some#more{stuff,ff}end"
Note this doesn't work if braces can be nested, in this case you need either a regex engine with recursion (?R) or a proper parser.
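For the nested case, the "proper parser" can be as small as a depth counter. The following Python sketch is an assumption-laden illustration, not part of the answer above: the function name replace_outside_braces and the sample string are invented, and unbalanced closing braces are simply ignored. It also runs the single-lookahead regex on the same input to show where that one goes wrong.

```python
import re


def replace_outside_braces(text: str, old: str = ",", new: str = "\t") -> str:
    """Replace `old` with `new` wherever the brace nesting depth is zero."""
    out = []
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1  # stray closing braces are ignored (assumption)
        out.append(new if ch == old and depth == 0 else ch)
    return "".join(out)


nested = "a,b{c,{d,e},f},g"

# Depth counter: only the two top-level commas are replaced.
by_counter = replace_outside_braces(nested, ",", "#")

# Single lookahead from the answer: also replaces the comma right before the inner '{'.
by_regex = re.sub(r",(?=[^{}]*(\{|$))", "#", nested)

print(by_counter)  # a#b{c,{d,e},f}#g
print(by_regex)    # a#b{c#{d,e},f}#g
```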

Related

Regex: Exact match string ending with specific character

I'm using Java. So I have a comma separated list of strings in this form:
aa,aab,aac
aab,aa,aac
aab,aac,aa
I want to use regex to remove aa and the trailing ',' if it is not the last string in the list. I need to end up with the following result in all 3 cases:
aab,aac
Currently I am using the following pattern:
"aa[,]?"
However it is returning:
b,c
If lookarounds are available, you can write:
,aa(?![^,])|(?<![^,])aa,
with an empty string as replacement.
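As a quick check, here is the lookaround pattern applied to the three sample lines. The sketch uses Python's re module rather than Java purely for illustration; the variable names are made up.

```python
import re

# ",aa" not followed by a non-comma, or "aa," not preceded by a non-comma.
pattern = re.compile(r",aa(?![^,])|(?<![^,])aa,")

inputs = ["aa,aab,aac", "aab,aa,aac", "aab,aac,aa"]
cleaned = [pattern.sub("", s) for s in inputs]
print(cleaned)  # ['aab,aac', 'aab,aac', 'aab,aac']
```

Note that a list consisting of aa alone is left untouched by this pattern, since neither alternative can match without a comma.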
Otherwise, with a POSIX ERE syntax you can do it with a capture:
^(aa(,|$))+|(,aa)+(,|$)
with the 4th group as replacement (so $4 or \4)
Without knowing your flavor, I propose this solution, which assumes your engine supports \b.
I use perl as demo environment and do a replace with "_" for demonstration.
perl -pe "s/\baa,|,aa\b/_/"
\b is the "word boundary" anchor, i.e. it matches at the start or end of anything that looks like a word. That lets it handle line start, line end, blanks and commas.
Using it, two alternatives suffice to cover all the cases in your sample input.
Output (interleaved with the input, once with lines ending in a newline and once with lines ending in a blank):
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
If your regex engine does not know \b, please state which tool you are using (e.g. perl, awk, notepad++, sed, ...). In that case it might also be necessary to replace rather than delete, i.e. to fine-tune "," or "" as the replacement. To help with that, please show the context of your regex, i.e. the replacing mechanism you are using. If you are deleting, please switch to replacing first.
(I picked up a suggestion from a comment by gisek that the capturing groups are not needed. I usually use () generously, in other syntaxes too; in my opinion, not having to think about or look up evaluation order saves time and reduces risk overall. But after testing, I use this terser, more elegant form.)
If your regex engine supports positive lookaheads and positive lookbehinds, this should work:
,aa(?=,)|(?<=,)aa,|(,|^)aa(,|$)
You could probably use the following and replace it by nothing :
(aa,|,aa$)
This matches either aa, at the beginning or in the middle of the string,
or ,aa$ at the end of the string.
As you want to delete aa followed by a comma or the end of the line, this should do the trick: ,aa(?=,|$)|^aa,

Vim any character replaced by itself

I'm looking for a search and replace command in Vim which, applied to the segment below:
func {print("touching curly braces")}
transforms it into this:
func { print("touching curly braces") }
So far I have:
:%s/{.\( \)\@!/{./g
However, it does this to the first segment:
func {.rint("touching curly braces")}
I believe I need something like this:
:%s/{.\( \)\@!/{ ./g
:%s/}.\( \)\@!/} ./g
How do I replace the wildcard '.' with the character it matched?
You need to put the . into a group \(.\) so you can reuse it in the substitution string:
:%s/{\(.\)\( \)\@!/{ \1/g
:%s/}\(.\)\( \)\@!/} \1/g
This is called grouping and backreferencing.
If you want to do both spaces at once here is the command:
:%s/{\(\S.*\S\)}/{ \1 }/g
I'd replace
:%s/{\($\| \)\@!/{ /g
and
:%s/\(^\| \)\@<!}/ }/g
where {\($\| \)\@! is { followed by a negative lookahead for either end-of-line or a space; the other expression is analogous, with a negative lookbehind and }.
Note that replacing in source code like that is a dangerous endeavor. You can break things very easily. Think of curly braces inside strings, or in regular expressions, or other situations you did not quite think of. Use /gc instead of /g to manually confirm each change.
do you mean this? I hope I understood your problem right:
s/\zs{\ze./& /g
s/.\zs}\ze/ &/g
You can do it in this way too:
s/{\ze./& /g
s/.\zs}/ &/g

Is this possible with one regex?

I have a string like
{! texthere }
I want to capture everything from {! up to either the first } or the end of the string. So if I had
{!text here} {!text here again} {!more text here. Oh boy!
I would want ["{!text here}", "{!text here again}", "{!more text here. Oh boy!"]
I thought this would work
{!.*}??
but the above string would come out to be ["{!text here} {!text here again} {!more text here. Oh boy!"]
I'm still very inexperienced with regexes so I don't understand why this doesn't work. I would think it would match '{!' followed by any number of characters until you get to a bracket (non greedy) which may not be there.
Using positive lookbehind (?<={!)[^}]+:
In [8]: import re
In [9]: str="{!text here} {!text here again} {!more text here. Oh boy!"
In [10]: re.findall('(?<={!)[^}]+',str)
Out[10]: ['text here', 'text here again', 'more text here. Oh boy!']
This is a positive lookbehind: any run of non-} characters is matched if it directly follows {!.
You can do it this way :
({![^}]+}?)
Then retrieve capture group $1, which corresponds to the first set of parentheses.
With this approach you have to use a "match all" type of function, because the regex itself matches only a single occurrence.
This approach doesn't use any lookaround. Also, using [^}] should limit the number of regex engine steps, since it scans straight to the next } as a terminator instead of matching the whole expression and then backtracking.
I believe you want to use a reluctant quantifier:
{!.*?}?
This will cause the . to stop matching as soon as the first following } is found, instead of the last.
I had a question about greedy and reluctant quantifiers that has a good answer here.
Another option would be to specify the characters that are allowed to come between the two curly braces like so:
{![^}]*}?
This specifies that there cannot be a closing curly brace matched within your pattern.
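To see how the last two patterns behave on the sample string, here is a small check with Python's re module (a sketch only; the braces are escaped for safety and the variable names are invented). One thing it shows: when a lazy .*? is followed by an optional }, both parts are allowed to match nothing, so that pattern stops right after {!.

```python
import re

text = "{!text here} {!text here again} {!more text here. Oh boy!"

# Negated character class: everything up to and including the first '}', if there is one.
negated = re.findall(r"\{![^}]*\}?", text)

# Lazy dot followed by an optional brace: both parts may match the empty string.
lazy = re.findall(r"\{!.*?\}?", text)

print(negated)  # ['{!text here}', '{!text here again}', '{!more text here. Oh boy!']
print(lazy)     # ['{!', '{!', '{!']
```

So for this input the negated character class is the variant that produces the three segments asked for.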
if your tool/language supports perl regex, try this:
(?<={!)[^}]*

Regex for \begin{?} and \end{?}

I need to match \begin{?} and \end{?} in a string, where ? is any number of alphanumeric or * characters, so it must match for example \begin{align} and \end{align*}.
I tried to do it but I'm not sure what's wrong
^\\begin{[^}]*}$
Start with \begin{, following anything that's not } multiple times and close with }.
The same goes for \end{?}, but I would like to do it in a single regex if possible.
I think below regex is what you need.
\\(begin|end){[a-zA-Z0-9*]+}
Your regex:
\\(begin|end){.*?}
the .* will grab anything between the { }, and the ? makes it stop at the first } it reaches.
{} are special characters used for expressing repetitions so you need to escape those as well.
^\\begin\{[^}]*\}$
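A short check of the character-class pattern with the braces escaped, using Python's re module as an example engine. The test string and variable names are made up for illustration, and the ^ and $ anchors are left out so that tags can be found anywhere in the text.

```python
import re

# \begin or \end, then one or more alphanumerics or '*' inside literal braces.
env_tag = re.compile(r"\\(begin|end)\{[a-zA-Z0-9*]+\}")

text = r"\begin{align} x &= 1 \end{align} \begin{align*} y \end{align*} \begin{} \begin{a b}"

# group(0) is the whole tag; group(1) would be just "begin" or "end".
tags = [m.group(0) for m in env_tag.finditer(text)]
print(tags)
```

The empty \begin{} and the \begin{a b} with a space are skipped, because the character class requires at least one allowed character and nothing else between the braces.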

Extracting some data items in a string using regular expression

<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>
In the above string, I would like a regex that matches every <![FruitName]!>. Between these <![FruitName]!> tags there may be some garbage text. My first attempt looks like this:
<!\[[^\]!>]+\]!>
It works, but as you can see I've used this part:
[^\]!>]+
This kills some innocents. If the fruit name contains any one of these characters: ] ! > It'd be discarded and we love eating fruit so much that this should not happen.
How do we construct a regex that disallows exactly this string ]!> in the FruitName while all these can still be obtained?
The above example is just made up by me, I just want to know what the regex would look like if it has to be done in regex.
The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.
(Also, no need to escape the ])
About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
To match a fruit name, you could use:
<!\[(.*?)]!>
After the opening <![, this matches the least amount of text that's followed by ]!>. By using .*? instead of .*, the least possible amount of text is matched.
Here's a full regex to match each fruit with the following text:
<!\[(.*?)]!>(.*?)(?=(<!\[)|$)
This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
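Both patterns can be tried on the sample string with Python's re module. This is a sketch under a few assumptions: the variable names are invented, the closing ] is escaped (which is harmless), the inner capture group of the lookahead is dropped since it is not needed, and the "tricky" fruit name is a made-up test case.

```python
import re

data = ("<![Apple]!>some garbage text may be here<![Banana]!>some garbage text "
        "may be here<![Orange]!><![Pear]!><![Pineapple]!>")

# Fruit name only: the lazy quantifier stops at the first "]!>".
name_only = re.compile(r"<!\[(.*?)\]!>")

# Fruit name plus the text that follows it, up to the next tag or end of string.
with_text = re.compile(r"<!\[(.*?)\]!>(.*?)(?=<!\[|$)")

fruits = name_only.findall(data)
pairs = [(m.group(1), m.group(2)) for m in with_text.finditer(data)]

# A name containing ], ! and > individually still survives.
tricky = name_only.findall("<![Ap]p!le>]!>")

print(fruits)
print(pairs)
print(tricky)
```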
Depending on what language you are using, you can use the string methods your language provides and do simple splitting (with a simple regex that is easier to understand). Split your string using "!>" as the separator. Go through each field and check for <!. If found, remove all characters from the front up to and including <!. This will give you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language.
eg gawk
# set field separator as !>
awk -F'!>' '
{
    # for each field
    for(i=1;i<=NF;i++){
        # check if there is <!
        if($i ~ /<!/){
            # if <! is found, substitute from front till <!
            gsub(/.*<!/,"",$i)
        }
        # print result
        print $i
    }
}
' file
output
# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]
No complicated regex needed.