Regex: how to ignore nested brackets in [Y](X) pattern?

I'm using Regular Expressions to find and replace parts of the text that look like this pattern:
[Y](X)
where X and Y are different phrases,
and the search is based on X only.
The following pattern does the job, but it matches too much text
when there are nested brackets:
\[.*\]\(X\)
e.g.
[ZZZ [Y](X)
Text Sample:
The [epidermis](Epidermis (skin)) (the outermost layer of skin) and the outer dermis (the layer beneath the epidermis).Contact dermatitis results in large, burning, and itchy rashes. These can take anywhere from several days to weeks to heal. This differentiates it from contact urticaria –
How can I fix my Regular Expression to avoid this?

If I understood correctly, you want to use a positive lookahead.
For this sample:
[epidermis](Epidermis (skin))
you could do something like:
\[[^\[\]]*?\](?=\(Epidermis \(skin\)\))
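As a rough illustration of the lookahead idea, here is a minimal Python sketch. The helper names label_pattern and replace_label are hypothetical, and Python's re module stands in for whatever editor or engine the question actually uses. The negated class [^\[\]] keeps the match from starting at an outer bracket, and the lookahead checks for (X) without consuming it.

```python
import re

def label_pattern(target):
    # Match a bracketed label with no nested brackets, only when it is
    # immediately followed by "(target)". The lookahead does not consume
    # the parenthesised part, so only the label itself is matched.
    return re.compile(r"\[[^\[\]]*?\](?=\(" + re.escape(target) + r"\))")

def replace_label(text, target, new_label):
    # A function replacement avoids backslash processing in new_label.
    return label_pattern(target).sub(lambda m: "[" + new_label + "]", text)

sample = "The [epidermis](Epidermis (skin)) (the outermost layer of skin)"
replaced = replace_label(sample, "Epidermis (skin)", "outer skin")

# The problematic nested case from the question: only [Y] is matched.
nested = label_pattern("X").findall("[ZZZ [Y](X)")
```

Here re.escape takes care of the literal parentheses inside X, so the target phrase can be passed in unescaped.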

Related

Find and replace with regular expression in Notepad++

At the moment, I have a PHP function that gets the contents of a CSV file and puts it into a multi-dimensional array, which contains text that I print out in various places, using the indexes.
an example of use would be:
$localText[index][pageText][conceptQualityText][$lang];
The first index, [index], would be the name of the page. The second index [pageText] would indicate what it is (text for the page). The third index, [conceptQualityText] indicates what the actual text is. The last index, [$lang] gets the text in the desired language.
so:
->page location
->what is it
->the content
->what language it should be displayed in.
This all worked fine in previous PHP versions. However, after upgrading to 7.2, PHP seems to be a bit stricter. I was a bit more green ~2 years ago when I first made this solution, and I now know that since these indexes aren't written as strings, i.e. enclosed in single quotes like so: ['index'], they fit the notation of a constant (as created with define()). I didn't give it much thought back then, but PHP now interprets them as constants, and so I get a warning that the word is an undefined constant.
My initial thought is to make a search and replace on my example string:
$localText[index][pageText][conceptQualityText][$lang];
using the regular expression functionality in Notepad++.
However, the example is just one of many, the notation of the array indexing is basically:
$localText[index][index2][index3][$lang];
So my question is:
How can I use the Notepad++ search and replace, with a regular expression, so that my index pointers become strings instead of being treated as constants?
e.g. make:
$localText[index][index2][index3][$lang];
into:
$localText['index']['index2']['index3'][$lang];
I will need some sort of logic that checks for whatever is inside the brackets and encapsulates them with single quotes, except for the last index, [$lang].
I tried to give as much information as possible, let me know if anything needs to be elaborated.
I tried to refer to these docs without much luck.
I found a solution using this:
find: \b(localText\[)([a-zA-Z0-9_\-]+)(\]\[)([a-zA-Z0-9_\-]+)(\]\[)([a-zA-Z0-9_\-]+)
replace: $1'$2'$3'$4'$5'$6'
and it works like a charm. Thanks to everyone who took the time to help.
You can use the following regex to match:
\[(\w+)\]
The regex matches a word between square brackets unless it is quoted.
Replace with:
['$1']
The regex will not match the last pair of brackets because it contains a '$' sign, which is not a word character.
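To check the behaviour outside Notepad++, the same substitution can be sketched in Python. This is only an illustration: it assumes a pattern of the form \[(\w+)\], and the names bare_index and quote_indexes are made up for the example. Note that purely numeric indexes such as [0] would be quoted as well, which may or may not be wanted.

```python
import re

# A bare index is a run of word characters directly between brackets.
# ['index'] is skipped because a quote is not a word character, and
# [$lang] is skipped because "$" is not a word character either.
bare_index = re.compile(r"\[(\w+)\]")

def quote_indexes(line):
    return bare_index.sub(r"['\1']", line)

before = "$localText[index][pageText][conceptQualityText][$lang];"
after = quote_indexes(before)
```

Running quote_indexes a second time on its own output leaves the line unchanged, so the replacement is safe to apply to files that are already partly fixed.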

Regex to highlight Strings in code

I'm developing a small code editor and I would like to match Strings so I can highlight them a different color.
Example:
myvar = array('VOLVO', 'TOYOTA')
Using the regular expression \'.*\' I get one match: 'VOLVO', 'TOYOTA'
However, what I want are two matches: 'VOLVO' and 'TOYOTA'
Is this possible to achieve with a single regular expression?
Someone suggested the following expression:
\'[^\']*\'
Which in fact solves my problem: I get two matches, one for 'VOLVO' and one for 'TOYOTA'.
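The difference between the two expressions can be seen in a small Python sketch (Python is used here only for illustration; the editor in the question may use another engine). The greedy .* runs to the last quote on the line, while the negated class [^'] cannot cross a quote, so each string ends at its own closing quote.

```python
import re

code = "myvar = array('VOLVO', 'TOYOTA')"

# Greedy: one match spanning from the first quote to the last quote.
greedy = re.findall(r"'.*'", code)

# Negated class: one match per quoted string.
strings = re.findall(r"'[^']*'", code)

# For highlighting, the start/end offsets of each string are usually needed.
spans = [m.span() for m in re.finditer(r"'[^']*'", code)]
highlighted = [code[start:end] for start, end in spans]
```

This simple form does not handle escaped quotes inside a string; that would need a more elaborate pattern.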

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two matching groups: the overall match and the capturing group. The bigger one matches everything inside the tags, while the other matches a single "key:value" occurrence. The regex engine cannot give you each individual occurrence, because a repeated capturing group keeps only its last repetition. So it just gives you the last one.
If you think in Python, on the match object obtained after applying your regex, you will have access to matcher.group(0) (the whole match) and matcher.group(1) (the last "key:value" pair), because the regex contains a single capturing ( ) group.
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
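The two-step approach described above might look like this in Python. It is a minimal sketch: the function name key_value_pairs is hypothetical, and the tag is assumed to be plain, without attributes or nesting.

```python
import re

def key_value_pairs(text, tag="aaa"):
    # Step 1: isolate the text between <tag> and </tag>.
    t = re.escape(tag)
    inner = re.search("<" + t + ">(.*?)</" + t + ">", text, re.DOTALL)
    if inner is None:
        return []
    # Step 2: collect every key:value occurrence inside it.
    return re.findall(r"\w+:\w+", inner.group(1))

xml = "<aaa>key0:val0 key1:val1 key2:val2</aaa>"
pairs = key_value_pairs(xml)

# For comparison: a repeated capturing group keeps only its last repetition.
m = re.search(r"(?<=<aaa>)(\w+:\w+\s*)+(?=</aaa>)", xml)
whole, last = m.group(0), m.group(1)
```

For real XML documents a parser is the more robust choice; the regex version only covers the simple shape shown in the question.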
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.

Regular Expression Replace String

I have a rather complicated data file with many rows of many different types. For the particular column I'm interested in I have a pattern that looks like this:
12.6 \pm 0.8
^^ The number of digits before and after the decimal in each of those pieces of the entry may vary.
I'm hoping I can use regular expressions to replace that column entry to:
[12.6,-0.8,+0.8]
What I need help with is how to do the replacement once I've found entries like the one above. All of the examples I've found so far are for replacing static strings with other static strings, but each line will necessarily have different numbers (and perhaps a different number of digits). The regular expression I've attempted so far to find entries like "12.6 \pm 0.8" is the following:
\d*\.\d*\s\\\w{2})\s\d*\.\d*
I would also appreciate if I could get a check on that, too. At the moment I'm just manipulating the datafile in my text editor, but I'm also open to Python solutions, too.
Thanks!
Your expression is close. Are there any conditions where this won't work?
(\d*\.\d*)\s\\\w{2}\s(\d*\.\d*)
with the replace pattern being (for JS)
[$1,-$2,+$2]
or for emacs (according to http://www.emacswiki.org/emacs/RegularExpression)
[\1,-\2,+\2]
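Since the question also asks about Python, here is a minimal sketch using re.sub with backreferences. It makes two assumptions that differ from the expression above: it matches the literal text \pm rather than \\\w{2}, and it requires at least one digit on each side of the decimal point (\d+ instead of \d*). The name to_interval is made up for the example.

```python
import re

# Group 1 is the value, group 2 is the uncertainty after "\pm".
pm = re.compile(r"(\d+\.\d+)\s*\\pm\s*(\d+\.\d+)")

def to_interval(line):
    # \1 and \2 in the replacement refer back to the captured numbers.
    return pm.sub(r"[\1,-\2,+\2]", line)

converted = to_interval(r"12.6 \pm 0.8")
row = to_interval(r"col1 3.25 \pm 0.125 col3")
```

To process a whole file, the same function can be applied line by line; lines without a matching entry pass through unchanged.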

How to enclose text patterns within xml elements, except when it is already inside a certain xml element?

I have several thousand xml files generated from java properties files prepared for translation in the TTX format. They contain quite a few variables that I need to protect from the translators, as they often break such things. The variables are in the form of numbers or occasionally text between a pair of curly braces, e.g. {0}, {this}.
I need to surround these variables with an xml element if they are not already an attribute value and not already part of the inner text of a ut element, like so:
<ut DisplayText="{0}"><{0}></ut>
My input looks like this:
<ut Type="start"DisplayText="string"><string></ut> text string {0}
<ut DisplayText="{1}"><{1}></ut> in:
<ut DisplayText="\n"><\n/></ut> {2}.
<ut Type="end" DisplayText="resource"></resource></ut>
The correct output should be this:
<ut Type="start"DisplayText="string"><string></ut> text string <ut DisplayText="{0}">{0}</ut>
<ut DisplayText="{1}"><{1}></ut> in:
<ut DisplayText="\n"><\n/></ut> <ut DisplayText="{2}">{2}</ut>.
<ut Type="end" DisplayText="resource"></resource></ut>
My initial approach was to use a regular expression to match the term in the braces and just build the xml elements around it with pattern substitution. This approach fails when the pattern is already present inside a ut element, as in the first code block above.
Previous find and replace patterns (in notepad++):
Find
({[A-Za-z0-9]*})
Replace
<ut DisplayText="\1">\1</ut>
It is beginning to look like regex is not the right tool for the job, so I would like some suggestions on better approaches to take, different tools, or even just a more complete regex that may allow me to solve this quickly and repeatably.
Update: The problem turned out to be a little more complex than previously envisioned. It seems there are also a couple more things that needed protecting, involving some rather obscure syntax, mixing variables with text in what appears to be some kind of conditional statement. From memory:
{o,choice|1#1 error|1<{0,number,integer} errors}
Where "error" and "errors" are translatable and should not be protected. The simplest solution we have at present is to run the above regex, fix the odd few errors it creates and then run a couple more normal find & replace passes for the more complex items. It could be abstracted out as regex, but right now there is not much point in doing that.
I appreciate the pointers to xslt and other editors with better regex support, in addition to the improved expressions offered. I will have a play with some of the options when time allows.
Let me know if my assumption is wrong, but from your example it seems you want to change text that is in {} and not in a <ut> element. To me this seems like an easy use of XSLT. Simply output UT elements as they are and process any text in between.
Why not try using the expression
(?<=.){[A-Za-z0-9]+}(?=.$)
This would find the { with 1 or more letters or numbers and the } when this pattern follows the tag and any number of spaces AND is followed by any number of spaces and a line break.
I ended up using a combination of the Regex in the question and manually fixing the odd error that caused. It wasn't ideal but it was quicker than trying to find the perfect solution.
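One approach not mentioned in the thread, sketched here in Python under stated assumptions, is to match either a whole existing ut element or a bare variable, and rewrite only the latter. The function name protect is hypothetical, and the sketch assumes that ut elements are not nested and that attribute values never contain a ">" character. The choice-style syntax from the update is left untouched because it does not fit the simple {letters or digits} shape.

```python
import re

# Alternative 1 consumes a complete <ut ...>...</ut> element so that any
# braces inside it are skipped; alternative 2 is a bare {variable}.
token = re.compile(r"(<ut\b[^>]*>.*?</ut>)|(\{[A-Za-z0-9]+\})", re.DOTALL)

def protect(text):
    def repl(m):
        if m.group(1) is not None:
            return m.group(1)  # existing ut element: keep as it is
        var = m.group(2)
        return '<ut DisplayText="' + var + '">' + var + "</ut>"
    return token.sub(repl, text)

line1 = protect(r'<ut Type="start"DisplayText="string"><string></ut> text string {0}')
line2 = protect(r'<ut DisplayText="{1}"><{1}></ut> in:')
line3 = protect(r'<ut DisplayText="\n"><\n/></ut> {2}.')
```

Because already wrapped variables are consumed by the first alternative, running protect again on its own output changes nothing.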