How to enclose text patterns within xml elements, except when it is already inside a certain xml element? - regex

I have several thousand xml files generated from java properties files prepared for translation in the TTX format. They contain quite a few variables, that I need to protect from the translators, as they often break such things. The variables are in the form of numbers or occasionally text between a pair of curly braces eg. {0}, {this}.
I need to surround these variables with an xml element if they are not already an attribute and if they are not already part of the inner text of a ut element, like so:
<ut DisplayText="{0}"><{0}></ut>
My input looks like this:
<ut Type="start"DisplayText="string"><string></ut> text string {0}
<ut DisplayText="{1}"><{1}></ut> in:
<ut DisplayText="\n"><\n/></ut> {2}.
<ut Type="end" DisplayText="resource"></resource></ut>
The correct output should be this:
<ut Type="start"DisplayText="string"><string></ut> text string <ut DisplayText="{0}">{0}</ut>
<ut DisplayText="{1}"><{1}></ut> in:
<ut DisplayText="\n"><\n/></ut> <ut DisplayText="{2}">{2}</ut>.
<ut Type="end" DisplayText="resource"></resource></ut>
My initial approach was to use a regular expression to match the term in the braces and just build the xml elements around it with pattern substitution. This approach fails when the pattern is already present inside a ut element, as in the first code block above.
Previous find and replace patterns (in Notepad++):
Find
({[A-Za-z0-9]*})
Replace
<ut DisplayText="\1">\1</ut>
It is beginning to look like regex is not the right tool for the job, so I would like some suggestions on better approaches to take, different tools, or even just a more complete regex that may allow me to solve this quickly and repeatably.
Update: The problem turned out to be a little more complex than previously envisioned. It seems there are also a couple more things that needed protecting, involving some rather obscure syntax, mixing variables with text in what appears to be some kind of conditional statement. From memory:
{o,choice|1#1 error|1<{0,number,integer} errors}
Where "error" and "errors" are translatable and should not be protected. The simplest solution we have at present is to run the above regex, fix the few errors it creates, and then run a couple more normal find & replace passes for the more complex items. It could be abstracted out as regex, but right now there is not much point in doing that.
I appreciate the pointers to xslt and other editors with better regex support, in addition to the improved expressions offered. I will have a play with some of the options when time allows.

Let me know if my assumption is wrong, but from your example it seems you want to change text that is in {} and not in a <ut> element. To me this seems like an easy use of XSLT. Simply output UT elements as they are and process any text in between.

Why not try using the expression
(?<=.){[A-Za-z0-9]+}(?=.$)
This would find the { with 1 or more letters or numbers and the } when this pattern follows the tag and any number of spaces AND is followed by any number of spaces and a line break.

I ended up using a combination of the Regex in the question and manually fixing the odd error that caused. It wasn't ideal but it was quicker than trying to find the perfect solution.
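For anyone who can script the pass instead of running it in an editor, one way to skip existing ut elements is to match them first in an alternation and copy them through unchanged, wrapping only the bare placeholders that remain. The following is a minimal Python sketch, not a tested solution for real TTX files: the function name protect is made up for illustration, it assumes ut elements are never nested, and it does not handle the choice-format syntax mentioned in the update.

```python
import re

# Group 1: a whole <ut ...>...</ut> element, which is copied through untouched.
# Group 2: a bare {placeholder} found outside any ut element.
PATTERN = re.compile(r'(<ut\b[^>]*>.*?</ut>)|(\{[A-Za-z0-9]+\})', re.DOTALL)


def protect(text):
    """Wrap bare {placeholders} in ut elements, leaving existing ut elements alone."""
    def repl(match):
        if match.group(1) is not None:
            # Already protected: return the element exactly as it was.
            return match.group(1)
        var = match.group(2)
        return f'<ut DisplayText="{var}">{var}</ut>'
    return PATTERN.sub(repl, text)


if __name__ == "__main__":
    line = r'<ut DisplayText="{1}"><{1}></ut> in: {2}.'
    print(protect(line))
```

Because the regex engine consumes each ut element as a single match, the placeholder alternative never gets a chance to fire inside an attribute or inside the element's inner text.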

Related

regex finding elements in xml which contain attributes whose values contain two periods

I'm searching some xml and my tool is regex. (My only tools in this case are editors, so I'm using either Eclipse or Notepad++.) I need to find all elements which contain attributes that have values containing two periods not adjacent.
so it would find attr1 and attr3 in this:
<myelement attr1 = "ab.cd.ef", attr2="ab", attr3="zy.sa.xa"/>
I've tried this and variations in notepad++
^(([^\"\.])*(\")[^\"\.]*[\.][^\"\.]*[\.][^\"\.]*[\"])+$
but it isn't picking up second attributes with values containing two periods.
I'm going to keep trying but if someone can point me to an answer I'd appreciate it.
I think you can't do this with regex.
Unless you create a monster regex that will create a blackhole swallowing all the life in the Earth (politely saying of course).
Bear in mind that regex has no logic, only pattern matching. A number is just a number; there is no simple way to say "if I match 1, then also match 3".
You can use if then else in regex like:
(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
But what you want would mean nesting conditionals with several conditions for each case (if 1 then 3 | if 2 then 4 | if 3 then 5), creating an enormous nested pattern.
Another regex approach would be to use multiple lookarounds (lookaheads in this case), which would make your regex impossible to read.
I think you would find XPath or XQuery expressions more useful for this. They are a better way to match XML than regex.
I'm searching some xml and my tool is regex.
That's a bit like saying that you are cutting down trees and your tool is a screwdriver. Get the right tool for the job: an XML parser and an XPath engine.
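As a rough illustration of the parser route, here is a short Python sketch using the standard library's xml.etree.ElementTree. It rests on two assumptions that are not in the question: the input is well-formed XML (the commas between attributes in the sample would have to go), and "two periods not adjacent" means exactly two periods with at least one character between them. The function name attrs_with_two_periods is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Exactly two periods, separated by at least one non-period character.
TWO_PERIODS = re.compile(r'[^.]*\.[^.]+\.[^.]*')


def attrs_with_two_periods(xml_text):
    """Return (tag, attribute name, value) for every matching attribute."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        for name, value in elem.attrib.items():
            # fullmatch ensures the whole value fits the pattern.
            if TWO_PERIODS.fullmatch(value):
                hits.append((elem.tag, name, value))
    return hits


if __name__ == "__main__":
    doc = '<myelement attr1="ab.cd.ef" attr2="ab" attr3="zy.sa.xa"/>'
    for hit in attrs_with_two_periods(doc):
        print(hit)
```

The parser takes care of finding attribute boundaries, so the regex only has to describe a single value, which is what made the all-in-one editor pattern hard to get right.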

Regular Expression Replace String

I have a rather complicated data file with many rows of many different types. For the particular column I'm interested in I have a pattern that looks like this:
12.6 \pm 0.8
^^ The number of digits before and after the decimal in each of those pieces of the entry may vary.
I'm hoping I can use regular expressions to replace that column entry to:
[12.6,-0.8,+0.8]
What I am requesting help on is how I should go about replacing once I've found entries like what I had earlier. All of the examples I've found so far are for when you want to replace static strings with other static strings, but for each line I'm necessarily going to have different numbers (and different digits perhaps). The regular expression I've attempted so far to find entries like "12.6 \pm 0.8" is the following:
\d*\.\d*\s\\\w{2})\s\d*\.\d*
I would also appreciate if I could get a check on that, too. At the moment I'm just manipulating the datafile in my text editor, but I'm also open to Python solutions, too.
Thanks!
Your expression is close. Are there any conditions where this won't work?
(\d*\.\d*)\s\\\w{2}\s(\d*\.\d*)
with the replace pattern being (for JS)
[$1, -$2, $2]
or for emacs (according to http://www.emacswiki.org/emacs/RegularExpression)
[\1, -\2, \2]
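Since the question is also open to Python, the same capture-and-reuse idea can be sketched with re.sub. This is only a sketch: it matches the literal text \pm rather than any two word characters, requires at least one digit before an optional decimal part, and the function name convert is made up for illustration.

```python
import re

# Group 1: the central value; group 2: the uncertainty after the literal \pm.
PM = re.compile(r'(\d+(?:\.\d+)?)\s*\\pm\s*(\d+(?:\.\d+)?)')


def convert(line):
    """Rewrite every 'value \\pm error' entry as '[value,-error,+error]'."""
    # \1 and \2 in the replacement refer back to the captured numbers.
    return PM.sub(r'[\1,-\2,+\2]', line)


if __name__ == "__main__":
    print(convert(r'12.6 \pm 0.8'))
```

The replacement string is not static: the backreferences pull in whatever digits were captured on each line, so rows with different numbers need no special handling.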

Exclude a certain String from variable in regex

Hi I have a Stylesheet where i use xsl:analyze-string with the following regex:
(&journal_abbrevs;)[\s ]*([0-9]{{4}})[,][\s ][S]?[\.]?[\s ]?([0-9]{{1,4}})([\s ][(][0-9]{{1,4}}[)])?
You don't need to look at the whole thing :)
&journal_abbrevs; looks like this:
"example-String1|example-String2|example-String3|..."
What I need to do now is exclude one of the strings in &journal_abbrevs; from this regex. E.g. I don't want example-String1 to be matched.
Any ideas on how to do that?
It seems XSLT regex does not support look-around. So I don't think you'll be able to get a solution for this that does not involve writing out all strings from journal_abbrevs in your regex. Related question.
To minimize the amount of writing out, you could split journal_abbrevs into say journal_abbrevs1, journal_abbrevs2 and journal_abbrevs3 (or how many you decide to use) and only write out whichever one that contains the string you wish to exclude. If journal_abbrevs1 contains the string, you'd then end up with something like:
((&journal_abbrevs2;)|(&journal_abbrevs3;)|example-String2|example-String3|...)...
If it supported look-around, you could've used a very simple:
(?!example-String1)(&journal_abbrevs;)...
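If writing out the remaining strings by hand is the obstacle, the reduced alternation could be generated outside the stylesheet. The following Python sketch is an assumption about workflow, not part of the XSLT answer above: it assumes the entity value is a flat list separated by | with no nested groups, and the function name alternation_without is hypothetical.

```python
def alternation_without(abbrevs, excluded):
    """Return the '|'-separated alternation with one literal entry removed."""
    kept = [entry for entry in abbrevs.split('|') if entry != excluded]
    return '|'.join(kept)


if __name__ == "__main__":
    journal_abbrevs = 'example-String1|example-String2|example-String3'
    # The result could be pasted into a second entity used only by this regex.
    print(alternation_without(journal_abbrevs, 'example-String1'))
```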

Parsing comments and strings in source code

I'm designing and implementing a scripting language, and for the "reading" stage I'm taking the time-tested and straightforward approach of splitting the code up into tokens (lexical analysis) followed by using a stack-based AST generator to squeeze syntactic structure out of the token stream (parsing). However, I'm facing an issue with strings, comments, and how they interact.
(for reference, code in my language uses ~ to start comments)
This might be the wrong approach, but I'm performing the lexical analysis step using regular expressions. For n kinds of tokens, I run n tokenization passes on my code, with each pass finding substrings that match the given regex and "tagging" them, until eventually every character is tagged. Each regex ignores matches that lie within already-tagged sections of source, only tagging unclaimed land. This is useful, because you wouldn't want, for example, a number token infiltrating a token like translate3d.
The issue I'm running into is with comments embedded in strings and strings embedded in comments. I don't know how to simultaneously make this
"The ~ is my favorite character! It's so happy-looking!"
be tagged as a string, and have this
~ "handles" the Exception (just logs it to a file nobody ever reads and moves on)
be tagged as a comment. It seems that either way, you have to impose some ordering on the passes of lexical analysis, and that either the comment or the string pass is going to "win" and tag a substring it has no business tokenizing. For example, either the string is tagged like so: (I'm using XML notation because it's a good way to represent tagged regions of text. XML is not actually used in my program at any point)
"The <comment>~ is my favorite character! It's so happy-looking!"</comment>
or the comment is tagged like this:
<comment>~ </comment><string>"handles"</string>the Exception (just logs it to a file and moves on)
Either it's assumed a string starts in the middle of a comment or a comment starts in the middle of a string.
What's odd is that it seems that this system of regex passes tagging substrings is exactly what the syntax highlighting on a text editor does, and comments and strings work fine there. I've already developed the TextMate/Sublime Text 2 syntax definition for my language, and all I had to do was (in a simplified version of the actual format used)
<syntax>
<color>
string_color
</color>
<pattern>
"[^"]*"
</pattern>
</syntax>
<syntax>
<color>
comment_color
</color>
<pattern>
~.*
</pattern>
</syntax>
Everything works fine when I'm writing sample code. When I tried to emulate what I imagine the behavior of the text editor is, however, I ran into the problems mentioned above. How can this be fixed, preferably in the most elegant way possible? Obviously, special handling could be added, stripping all the comments off the source code before any lexical analysis is done, except for comments inside strings (which requires the reader (reader in this case being the machine, not the human) to detect what sections of code are strings twice), but I'm sure there must be a better way, simply because sublime text only has knowledge of the regexes used to specify the two kinds of regions of code, and with only that information it behaves exactly as expected.
Rather than first tagging the source code before tokenizing it, and using several passes to do so, I recommend that you abandon the tagging and just tokenize the code in one pass using one regular expression.
If you construct an all-encompassing regex that contains sub-patterns to match and capture each token, you can then match globally and determine the token type by examining the capture group contents.
As a simple example, if you had a regex such as
"([^"]*)"|~([^\n]*)|(\d+(?:\.\d+)?)
to match either strings, comments, or numbers, then if a string was matched the first capture group () would contain it, and all the other capture groups would be empty.
So, in your for each loop (D Language Regular expressions) you would use conditional statements and the match object's capture group contents to determine the next token to be added.
And you wouldn't necessarily have to use just one large regex, you could match several token types in one capture group and then within the for each block apply a further regex (or indexOf etc.) on the capture group contents to determine the token.
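The single-pass idea can be sketched in Python (the question's own implementation language is different, so treat this purely as an illustration). The token names, the extra word and other alternatives, and the function name tokenize are assumptions added for the example.

```python
import re

# One combined pattern; alternatives are tried left to right at each position,
# so whichever construct starts first in the source claims its whole extent.
TOKEN = re.compile(r'''
    (?P<string>"[^"]*")          # double-quoted string, may contain ~
  | (?P<comment>~[^\n]*)         # comment runs to end of line, may contain "
  | (?P<word>[A-Za-z_]\w*)       # identifier such as translate3d
  | (?P<number>\d+(?:\.\d+)?)    # integer or decimal literal
  | (?P<other>\S)                # any other single non-space character
''', re.VERBOSE)


def tokenize(source):
    """Return (token type, text) pairs found in a single left-to-right scan."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(source)]


if __name__ == "__main__":
    for token in tokenize('"The ~ is happy" translate3d 42 ~ "handles" it'):
        print(token)
```

No ordering between a string pass and a comment pass is needed here: a ~ inside a string is consumed as part of the string match before the comment alternative is ever tried at that position, and a quote inside a comment is consumed by the comment match in the same way.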

RegExp get string inside string

Let's presume we have something like this:
<div1>
<h1>text1</h1>
<h1>text2</h1>
</div1>
<div2>
<h1>text3</h1>
</div2>
Using RegExp we need to get text1 and text2 but not text3.
How to do this?
Thanks in advance.
EDIT:
This is just an example.
The text I'm parsing could be just plain text.
The main thing I want to accomplish is list all strings from a specific section of a document.
I gave this HTML code for example as it perfectly resembles the thing I need to get.
(?siU)<h1>(.*)</h1> would parse all three strings, but how to get only first two?
EDIT2:
Here is another rather dumb example. :)
Section1
This is a "very" nice sentence.
It has "just" a few words.
Section2
This is "only" an example.
The End
I need quoted words from first but not from second section.
Yet again, (?siU)"(.*)" returns quoted words from whole text,
and I need only those between words Section1 and Section2.
This is for the "Rainmeter" application, which apparently uses Perl regex syntax.
I'm sorry, but I can't explain it better. :)
For the general case of the two examples provided -- for use in Rainmeter regex -- you can use:
(?siU)<h1>(.*)</h1>(?=.+<div2>) for the first sample and
(?siU)"(.*)"(?=.+Section2) for the second.
Note that Rainmeter seems to escape things for you, but you might need to change " to \", above.
These both use positive lookahead, but beware: both solutions will fail in the case of nested tags/structures or if there are multiple Section1's and Section2's. Regex is not the best tool for this kind of parsing.
But maybe this is good enough for your current needs?
Use a DOM library and getElementsByTagName('div') and you'll get a nodeList back. You can reference the first item with ->item(0) and then getElementsByTagName('h1') using the div as a context node, grab the text with ->nodeValue property.
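Outside Rainmeter, the same goal is often easier in two steps: first isolate the section, then extract from it. Below is a minimal Python sketch for the plain-text example. The function name quoted_in_section and its default marker arguments are made up for illustration, and it only looks at the first Section1 ... Section2 span.

```python
import re


def quoted_in_section(text, start='Section1', end='Section2'):
    """Return the quoted words found between the start and end markers."""
    # Step 1: narrow the text down to the first start...end span.
    section = re.search(re.escape(start) + r'(.*?)' + re.escape(end), text, re.DOTALL)
    if section is None:
        return []
    # Step 2: pull the quoted words out of that span only.
    return re.findall(r'"(.*?)"', section.group(1))


if __name__ == "__main__":
    sample = (
        'Section1\n'
        'This is a "very" nice sentence.\n'
        'It has "just" a few words.\n'
        'Section2\n'
        'This is "only" an example.\n'
        'The End\n'
    )
    print(quoted_in_section(sample))
```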