How to replace specific strings between <> using regex - regex

Can anyone tell me how to do the following task using regex?
replace all the ABC with DEF only when ABC is inside both <> and ""
original string:
<tagA nameABC1="attr1ABCx xyzABC" name2="attABCa"> outside"ABC"xyz</tagA>
<tagB nameABC2="attr2ABCx cccABC" name3="testABCb"> outside_"ABC"</tagB>
desired string after replacing:
<tagA nameABC1="attr1DEFx xyzDEF" name2="attDEFa"> outside"ABC"xyz</tagA>
<tagB nameABC2="attr2DEFx cccDEF" name3="testDEFb"> outside_"ABC"</tagB>
Edited:
Thank you guys.
I've decided to use HTML parser library jsoup to handle all html text properly.

Assuming well formed input (no dangling quotes or brackets):
Search: ABC(?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)(?=[^<>]*>)
Replace: DEF
This works by applying two look aheads:
the first look ahead (?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$) requires there to be an odd number of quote characters in the remaining input, which in turn means the match is inside quotes
the other look ahead (?=[^<>]*>) requires the next angle bracket to be a closing bracket, which in turn means the match is inside an angle bracket pair
This is not bulletproof. For example, it doesn't cater for closing angle brackets being inside quotes, but even this could be handled with an even more complicated look ahead that applies logic similar to the first look ahead when matching angle brackets... an exercise left for the reader.
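As a rough check of the two look aheads described above, here is a minimal Python sketch that applies the same pattern to the sample input from the question. The variable names are illustrative only, and the input is assumed to be well formed, as the answer requires.

```python
import re

# Pattern from the answer: ABC followed by an odd number of quote characters
# (so it sits inside a quoted value) and whose next angle bracket is ">".
PATTERN = re.compile(r'ABC(?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)(?=[^<>]*>)')

original = (
    '<tagA nameABC1="attr1ABCx xyzABC" name2="attABCa"> outside"ABC"xyz</tagA>\n'
    '<tagB nameABC2="attr2ABCx cccABC" name3="testABCb"> outside_"ABC"</tagB>'
)

# Each line holds an even number of quotes, so counting to the end of the
# whole string gives the same parity as counting to the end of the line.
replaced = PATTERN.sub("DEF", original)
print(replaced)
```

The attribute names (nameABC1, nameABC2) and the quoted ABC outside the tags are left alone, because each of them fails one of the two look aheads.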

Related

How to remove everything outside of round brackets in notepad++

I am trying to extract image urls from css file using a notepad++.
Since all the image urls are inside of the round brackets I am thinking about regex to remove everything before ( and everything after ).
Here is a text example :
text-transform:uppercase;
font-size:17px;
}
.conTxt .btnGiveAccess{
width:532px;
height:145px;
background:url(http://www.website.com/css/images/btn-give-me-access2.jpg) no-repeat center top;
margin:0 auto;
display:block;
}
.conTxt .btnGiveAccess:hover{
background:url(http://www.website.com/css/images/btn-give-me-access2.jpg) no-repeat center -145px;
}
/*#########################################*/
popup window
/*#########################################*/
/*a {
As a result I would like to get the following:
http://www.website.com/css/images/btn-give-me-access2.jpg
http://www.website.com/css/images/btn-give-me-access2.jpg
I also tried the following regexes to delete everything before http://
^[^http]* and also .*((.*)).*
but they didn't work. Could anybody please help?
For the given text the following works. Use a find text of (\A|\))([^()]*)(\(|\Z) and replace that with \r\n. This will leave the required text plus a few empty lines that can easily be removed, eg by menu => Edit => Line operations => Remove empty lines.
A minor variation is to use a replacement string of \1\3, which will remove everything outside the round brackets, leaving the brackets themselves and the text between them. It is then a simple job to remove all round brackets, perhaps replacing them with new lines. This could be done with a find text of [()]+ and a replace string of \r\n.
Explanation of the first regular expression. The captures are:
(\A|\)) which looks for either the start of the buffer, the \A or a close bracket.
([^()]*) which looks for a sequence of characters that do not include round brackets.
(\(|\Z) which looks for an open bracket or the end of the buffer, the \Z.
The effect is to look for three types of text.
From start of buffer to first opening round bracket. This matches (\A)([^()]*)(\(|\Z).
From close bracket to open bracket. This matches (\))([^()]*)(\().
From close bracket to end of buffer. This matches (\))([^()]*)(\Z).
This may not do the desired job if there are nested round brackets, but the question does not specify what should happen in such cases.
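The find-and-replace above can be tried outside Notepad++ as well. The following is only a sketch in Python, whose re module also supports \A and \Z; it uses a shortened copy of the CSS from the question and \n instead of \r\n, and it is not a statement about how Notepad++ itself processes the buffer.

```python
import re

css = (
    "width:532px;\n"
    "background:url(http://www.website.com/css/images/btn-give-me-access2.jpg) no-repeat center top;\n"
    "margin:0 auto;\n"
    ".conTxt .btnGiveAccess:hover{\n"
    "background:url(http://www.website.com/css/images/btn-give-me-access2.jpg) no-repeat center -145px;\n"
    "}\n"
)

# Replace every stretch of text that lies outside round brackets
# (together with the bracket on either side of it) by a newline.
stripped = re.sub(r'(\A|\))([^()]*)(\(|\Z)', "\n", css)

# Drop the empty lines, which is the "Remove empty lines" step of the answer.
urls = [line for line in stripped.splitlines() if line]
print(urls)
```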
If none of the URLs have parens in them, you could use \((.*)\)
The difference between this and what you show above is that the outer () (the literal ones, not the ones that make the regex capture) are escaped using \
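If a script is an option instead of the editor, the escaped-bracket idea can be sketched in Python as follows. This variant uses [^()]* rather than .* so that a match can never run past a closing bracket; that change, like the variable names, is an assumption of this sketch and not part of the answer above.

```python
import re

css = (
    "background:url(http://www.website.com/css/images/btn-give-me-access2.jpg) no-repeat center top;\n"
    "margin:0 auto;\n"
    "background:url(http://www.website.com/css/images/btn-give-me-access2.jpg) no-repeat center -145px;\n"
)

# \( and \) match literal brackets; the unescaped pair is the capture group.
urls = re.findall(r'\(([^()]*)\)', css)
print(urls)
```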

how to write link parser with regex

I have a line: "a herf = sdfsjkdhfks http://www.google.com 134"
I want to get the "http://www.google.com" part only if there is a "<" at the beginning and a ">" in the end
For now my regex is "(?i)(http)(s:| :).+\.[A-Za-z]{2,}/?"
What can I do to check that the angle brackets exist without taking them as part of my regular expression? I mean, I do not want the angle brackets to be in the output of the match.
In this case, the output should be null because there are no angle brackets, but if there are, I want the output to be just "www.google.com"
Thanks in advance
Include the bracket as part of your regex, then as a second step after you've found the match, strip it out of that result string before you return the result.
If you're anchoring the angled brackets to the start and end of the regex, this could be as simple as something like .substring(1,matchedString.length()-1).
This will get the link part skipping any thing at the start and end.
import re
content = "<ahref = 123 http://googl 235>"
# raw string so that \s reaches the regex engine unchanged; the group captures the URL
re.findall(r"<a\s*href\s*=.*(http://[^> ]*)\s*.*>", content)
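An alternative to stripping the brackets afterwards is to match them but capture only the URL. The sketch below assumes the whole line must start with < and end with >; the helper name enclosed_link is made up for this example.

```python
import re

# The angle brackets are matched but sit outside the capture group,
# so they never appear in the returned value.
ENCLOSED_LINK = re.compile(r'^<.*?(https?://[^\s>]+).*>$')

def enclosed_link(line):
    """Return the URL if the line is wrapped in < >, otherwise None."""
    match = ENCLOSED_LINK.search(line.strip())
    return match.group(1) if match else None

print(enclosed_link("a herf = sdfsjkdhfks http://www.google.com 134"))
print(enclosed_link("<a herf = sdfsjkdhfks http://www.google.com 134>"))
```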

How to do regular Expression in AutoIt Script

In an AutoIt script I am unable to write a regular expression for the string below. The numbers in it always change.
Actual String = _WinWaitActivate("RX_IST2_AM [PID:942564 NPID:10991 SID:498702881] sbivvrwm060.dev.ib.tor.Test.com:30000","")
Here the PID, NPID & SID : will be changing and rest of the things are always constant.
What I have tried is below:
_WinWaitActivate("RX_IST2_AM [PID:'([0-9]{1,6})' NPID:'([0-9]{1,5})' SID:'([0-9]{1,9})' sbivvrwm060.dev.ib.tor.Test.com:30000","")
Can someone please help me
As stated in the documentation, you should write the prefix REGEXPTITLE: and surround everything with square brackets, but "escape" the square brackets inside the title, as well as the dots (.) and spaces ( ), with a backslash (\). Instead of [0-9] you might use \d, like "[REGEXPTITLE:RX_IST2_AM\ \[PID:(\d{1,6})\ NPID:(\d{1,5})\ SID:(\d{1,9})\] sbivvrwm060\.dev\.ib\.tor\.Test\.com:30000]" as your parameter for the Win...(...) functions.
You can even omit the round brackets ((...)) but keep their content if you don't want to capture the content to process it further, as you would with StringRegExp(...) or StringRegExpReplace(...). With the _WinWaitActivate(...) function capturing won't make sense anyway, as it only matches and does not replace or return anything from your regular expression.
According to regex101 both work, with the round brackets and without - you should always use a tool like this site to confirm that your expression is actually working for your input string.
Not familiar with AutoIt, but remember that the whole pattern has to match before any capture group yields a result. For example, (goat)s will NOT capture the word goat if your string is goat or goater.
You have forgotten to add a ] in your regex, so your pattern doesn't match the string and the capture groups will not be extracted. Also, I'm not completely sold on the usage of '. Based on this page, you can do something like StringRegExp(yourstring, 'RX_IST2_AM \[PID:([0-9]{1,6}) NPID:([0-9]{1,5}) SID:([0-9]{1,9})\]', $STR_REGEXPARRAYGLOBALMATCH) and $1, $2 and $3 would be your results respectively. Note that the literal square brackets have to be escaped with a backslash, otherwise they start a character class. But maybe your approach works too.
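The escaping rules discussed in these answers can be checked in any regex tester. The following sketch does so in Python rather than AutoIt, which is an assumption of convenience: it only shows that the escaped pattern matches the window title from the question and yields the three numbers, not how the Win...(...) functions behave.

```python
import re

title = ("RX_IST2_AM [PID:942564 NPID:10991 SID:498702881] "
         "sbivvrwm060.dev.ib.tor.Test.com:30000")

# Literal square brackets and dots are escaped; the three groups
# capture the values that change between runs.
pattern = re.compile(
    r'RX_IST2_AM \[PID:(\d{1,6}) NPID:(\d{1,5}) SID:(\d{1,9})\] '
    r'sbivvrwm060\.dev\.ib\.tor\.Test\.com:30000'
)

match = pattern.fullmatch(title)
pid, npid, sid = match.groups()
print(pid, npid, sid)

# Without the escapes, "[PID:...]" is read as a character class and
# the same title no longer matches.
unescaped = re.compile(r'RX_IST2_AM [PID:(\d{1,6}) NPID:(\d{1,5}) SID:(\d{1,9})]')
print(unescaped.search(title))
```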

Search text with a regular expression to match outside specific characters

I have text that looks like:
My name is (Richard) and I cannot do
[whatever (Jack) can't do] and
(Robert) is the same way [unlike
(Betty)] thanks (Jill)
The goal is to search using a regular expression to find all parenthesized names that occur anywhere in the text EXCEPT in between square brackets.
So in the text above, the result I am looking for is:
Richard
Robert
Jill
You can do it in two steps:
step1: match all bracket contents using:
\[[^\]]*\]
and replace it with ''
step2: match all the remaining parenthesized names(globally) using:
\([^)]*\)
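The two steps above could be sketched in Python like this. A capture group is added to the second pattern so that only the names, without their parentheses, are returned; that addition and the variable names are choices made for this sketch.

```python
import re

text = ("My name is (Richard) and I cannot do "
        "[whatever (Jack) can't do] and "
        "(Robert) is the same way [unlike "
        "(Betty)] thanks (Jill)")

# Step 1: remove every square-bracketed section.
without_brackets = re.sub(r'\[[^\]]*\]', '', text)

# Step 2: collect the contents of the remaining parentheses.
names = re.findall(r'\(([^)]*)\)', without_brackets)
print(names)
```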
You didn't say what language you're using, so here's some Python:
>>> import re
>>> REGEX = re.compile(r'(?:[^[(]+|\(([^)]*)\)|\[[^]]*])')
>>> s="""My name is (Richard) and I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"""
>>> list(filter(None, REGEX.findall(s)))  # list() is needed on Python 3, where filter returns an iterator
The output is:
['Richard', 'Robert', 'Jill']
One caveat is that this does not work with arbitrary nesting. The only nesting it's really designed to work with is one level of parens in square brackets as mentioned in the question. Arbitrary nesting can't be done with just regular expressions. (This is a consequence of the pumping lemma for regular languages.)
The regex looks for chunks of text without brackets or parens, chunks of text enclosed in parens, and chunks of text enclosed in brackets. Only text in parens (not in square brackets) is captured. Python's findall finds all matches of the regex in sequence. In some languages you may need to write a loop to repeatedly match. For non-paren matches, findall inserts an empty string in the result list, so the call to filter removes those.
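For input where square brackets may be nested to arbitrary depth, one option is to leave regular expressions aside and count depth by hand. The function below is a minimal sketch of that idea; its name and behaviour on unbalanced input (extra closing brackets are ignored) are assumptions of this example.

```python
def names_outside_brackets(text):
    """Collect parenthesized names that are not inside any square brackets."""
    names = []
    depth = 0      # current nesting depth of square brackets
    start = None   # index just past a "(" seen at depth 0
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
            start = None
        elif ch == "]":
            depth = max(depth - 1, 0)
        elif ch == "(" and depth == 0:
            start = i + 1
        elif ch == ")" and depth == 0 and start is not None:
            names.append(text[start:i])
            start = None
    return names

sample = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] "
          "and (Robert) is the same way [unlike (Betty)] thanks (Jill)")
print(names_outside_brackets(sample))
print(names_outside_brackets("(A) [x [y (B)] (C)] (D)"))
```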
If you are using .NET you can do something like:
"(?<!\[.*?)(?<name>\(\w+\))(?!.*\])"
It's not really the best job for a single regexp - have you considered, for example, making a copy of the string and then deleting everything in between the square brackets instead? It would then be fairly straight forward to extract things from inside the parenthesis. Alternatively, you could write a very basic parser that tokenises the line (into normal text, square bracket text, and parenthesised text, I imagine) and then parses the tree that produces; it'd be more work initially but would make life much simpler if you later want to make the behaviour any more complicated.
Having said that, /(?:(?:^|\])[^\[]*)\((.*?)\)/ does the trick for your test case (but it will almost certainly have some weird behaviour if your [ and ] aren't matched properly, and I'm not convinced it's that efficient).
A quick (PHP) test case:
preg_match_all('/(?:(?:^|\])[^\[]*)\((.*?)\)/', "My name is ... (Jill)", $m);
print(implode(", ", $m[1]));
Outputs:
Richard, Robert, Jill
>>> s = "My name is (Richard) and I cannot do [whatever (Jack) can't do (Jill) can] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"
>>> for item in s.split("]"):
...     st = item.split("[")[0]  # keep only the text before any opening bracket
...     if ")" in st:
...         for i in st.split(")"):
...             if "(" in i:
...                 print(i.split("(")[-1])
...
Richard
Robert
Jill
So you want the regex to match the name, but not the enclosing parentheses? This should do it:
[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)
As with the other answers, I'm making certain assumptions about your target string, like expecting parentheses and square brackets to be correctly balanced and not nested.
I say it should work because, although I've tested it, I don't know what language/tool you're using to do the regex matching with. We could provide higher-quality answers if we had that info; all regex flavors are not created equal.
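Since the flavor was not specified, here is one way to try the look-ahead pattern above, sketched in Python and run against the text from the question. It relies on the same assumptions as the answer: brackets and parentheses are balanced and not nested.

```python
import re

# A name is a run of non-parenthesis characters followed by ")" and then,
# up to the end of the text, only complete [ ... ] sections.
NAME = re.compile(r'[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)')

text = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] "
        "and (Robert) is the same way [unlike (Betty)] thanks (Jill)")

found = NAME.findall(text)
print(found)
```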

Regex: How to get all contents inside a tag #[SOME TEXT HERE]

I am working on a simple token replacement feature of our product. I have resolved almost all the issues, but I missed one thing. A token must support attributes, and an attribute can also be a token. This is part of a bigger project. Hope you can help.
The beginning tag is "**#[**" and the ending tag is "**]**". Say, #[FirstName], #[LastName], #[Age, WhenZero="Undisclosed"].
Right now i am using this expression "\#\[[^\]]+\]". I have this working but it failed on this input:
blah blah text here...
**#[IsFreeShipping, WhenTrue="<img src='/images/fw_freeshipping.gif'/>
<a href='http://www.hellowebsite.net/freeshipping.aspx'>$[FreeShipping]</a>"]**
blah blah text here also...
It fails because when it encounters the first ], it stops there. It returns:
*#[IsFreeShipping, WhenTrue="<img src='/images/fw_freeshipping.gif'/>
<a href='http://www.hellowebsite.net/freeshipping.aspx'>$[Product_FreeShipping]*
My desired result should be
*#[IsFreeShipping, WhenTrue="<img src='/images/fw_freeshipping.gif'/>
<a href='http://www.hellowebsite.net/freeshipping.aspx'>$[FreeShipping]</a>"]*
Your regex matches exactly what your stated condition indicates: start with an opening square bracket and match everything up to the first closing square bracket.
If you want to match nested square brackets, you need to specify exactly what is valid when nested. For instance, you could say that square brackets can be nested when enclosed within quotes.
This is a little border-line for a regexp, since it depends on a context, but still...
#\[(\](?=")|[^\]])+\]
should do it.
The idea is to mention a closing square bracket can be part of the parsed content if followed by a double quotes, as part of the end of an attribute.
If that same square bracket were anywhere within the attribute, that would be a lot harder...
The advantage with lookahead expression is that you can specify a regexp with a non-fixed match length.
So if the attribute closing square bracket is not followed by a double quote, but rather by another known expression, you just update the lookahead part:
#\[(\](?=</a>")|[^\]])+\]
will match only the second closing square bracket, since the first is followed by </a>".
Of course, any kind of greedy expression (.*]) would not work, since it would match not the second closing square bracket but the last one. (Meaning that if there is more than one intermediate ], it will be parsed as well.)
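To see the second look-ahead pattern at work, the sketch below runs it in Python against the input from the question. The group is written as non-capturing, which is a small change made here so that the whole match is easy to read; everything else follows the pattern given above.

```python
import re

text = (
    "blah blah text here...\n"
    "#[IsFreeShipping, WhenTrue=\"<img src='/images/fw_freeshipping.gif'/>\n"
    "<a href='http://www.hellowebsite.net/freeshipping.aspx'>$[FreeShipping]</a>\"]\n"
    "blah blah text here also..."
)

# A "]" counts as content only when it is immediately followed by </a>";
# any other "]" ends the token.
TOKEN = re.compile(r'#\[(?:\](?=</a>")|[^\]])+\]')

token = TOKEN.search(text).group(0)
print(token)
```

The match runs from #[ to the final "], including the inner $[FreeShipping] token.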
When I've done stuff like this before I've evaluated from the inner most matchable expression before stepping out to larger strings.
In this case your regex should probably try to replace $[FreeShipping] with its value before evaluating the larger token containing the if clause.
Perhaps you can figure out a way to replace the value tokens like $[FreeShipping] before the ones without a $ in front of the token.
This is roughly but not exactly
http://en.wikipedia.org/wiki/Multi-pass_compiler versus http://en.wikipedia.org/wiki/One-pass_compiler
Writing this in one regex won't necessarily be any faster than looping over a few simple regexes. All regexes do is abstract string parsing.
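The inner-first idea could look roughly like the following Python sketch. The values dictionary and the function name expand_value_tokens are hypothetical; the real product presumably looks values up elsewhere. Once the $[...] tokens are expanded, the original simple expression from the question is enough for the outer token.

```python
import re

# Hypothetical lookup table for value tokens.
values = {"FreeShipping": "Free shipping"}

def expand_value_tokens(text, values):
    """First pass: replace $[Name] with its value, leaving unknown names as-is."""
    return re.sub(
        r'\$\[([^\[\]]+)\]',
        lambda m: values.get(m.group(1), m.group(0)),
        text,
    )

text = (
    "#[IsFreeShipping, WhenTrue=\"<img src='/images/fw_freeshipping.gif'/>\n"
    "<a href='http://www.hellowebsite.net/freeshipping.aspx'>$[FreeShipping]</a>\"]"
)

first_pass = expand_value_tokens(text, values)

# Second pass: no inner "]" is left, so the simple pattern finds the whole token.
outer_tokens = re.findall(r'#\[[^\]]+\]', first_pass)
print(outer_tokens)
```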
If you're only expecting a single match in any given input you could simply allow for a greedy match:
/#\[.*\]/
If you're expecting multiples you have a problem because you no longer have regular text. You'll need to escape the inner brackets in some way.
(Regex is a deep subject - it's quite possible that someone has a better solution)
I'd be interested to learn if I'm wrong, but if I recall correctly, you cannot do this using regular expressions. This looks like a Dyck language to me, and you would need a pushdown automaton to accept the expressions. But I must admit I'm not quite sure if this holds true for the extended form of regexps like those provided by Perl.
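A counter is enough to play the role of the pushdown automaton's stack here, since there is only one kind of bracket. The following Python sketch finds #[ ... ] tokens with arbitrarily nested square brackets; the function name is made up, and brackets inside quoted attribute values are assumed to be balanced.

```python
def find_tokens(text):
    """Return every #[ ... ] token, allowing nested square brackets."""
    tokens = []
    i = 0
    while True:
        start = text.find("#[", i)
        if start == -1:
            break
        depth = 0
        # Begin at the "[" right after "#" and track nesting depth.
        for j in range(start + 1, len(text)):
            if text[j] == "[":
                depth += 1
            elif text[j] == "]":
                depth -= 1
                if depth == 0:
                    tokens.append(text[start:j + 1])
                    i = j + 1
                    break
        else:
            break  # unbalanced input: no closing bracket for this token
    return tokens

print(find_tokens('#[A] and #[B, X="$[C]"] end'))
```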
It is possible to write a regex for the example you have given, but in general it fails. A single regex can't work for arbitrarily nested expressions.
Your example shows that your DSL has 'if' conditions already. Before long it could evolve into a Turing-complete language.
Why don't you use an existing template language, such as the Django template language?
Your example:
blah blah text here... #[IsFreeShipping,
WhenTrue="<img src='/images/fw_freeshipping.gif'/>
<a href='http://www.hellowebsite.net/freeshipping.aspx'>$[FreeShipping]</a>"]
blah blah text here also...
Using Django template language:
blah blah text here... {% if IsFreeShipping %}
<img src='/images/fw_freeshipping.gif'/>
<a href='http://www.hellowebsite.net/freeshipping.aspx'>{{ FreeShipping }}</a>
{% endif %} blah blah text here also...
This works for your sample:
#\[(?:[^\]$]+|\$(?!\[)|\$\[[^\[\]]*\])*\]
It assumes that the inner square brackets can't themselves contain square brackets. If the inner tokens can also contain tokens, you're probably out of luck. Some regex flavors can handle recursive structures, but the resulting regexes are hideous even by regex standards. :D
This regex also treats the '$' as special only if it's followed by an opening square bracket. If you want to disallow its use otherwise, remove the second alternative: |\$(?!\[)
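As a quick check, this pattern can be run in Python against the sample from the question and against a token containing a lone '$'. The second input is invented for this sketch to exercise the |\$(?!\[) alternative.

```python
import re

TOKEN = re.compile(r'#\[(?:[^\]$]+|\$(?!\[)|\$\[[^\[\]]*\])*\]')

sample = (
    "blah blah text here...\n"
    "#[IsFreeShipping, WhenTrue=\"<img src='/images/fw_freeshipping.gif'/>\n"
    "<a href='http://www.hellowebsite.net/freeshipping.aspx'>$[FreeShipping]</a>\"]\n"
    "blah blah text here also..."
)

# Whole outer token, with the inner $[FreeShipping] consumed as one unit.
outer = TOKEN.search(sample).group(0)
print(outer)

# A "$" that does not start an inner token is treated as ordinary text.
lone_dollar = TOKEN.search('price: #[Price, Prefix="$"] each').group(0)
print(lone_dollar)
```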