Regular Expression to match parent and sub node - regex

I want to develop a regular expression to match this tag:
<claim-text>aaaaaaa
<claim-text>bbbbbbb</claim-text>
<claim-text>ccccccc</claim-text>
</claim-text>
I tried
<claim-text>(.*)</claim-text>
But, only bbbbbbb and ccccccc can be matched. Can I get some help to cover aaaaaaa also?
Thanks

For a generic solution with any depth, you will at least need a stack, which is not available in most regular expression implementations. However, if you know the structure will only have the depth you specified, you could use something like this:
<claim-text>([^<\r\n]*)
You can see a working example here: https://regex101.com/r/kbDbwF/1
It will search for your opening tag, and then find anything up to the next opening or closing tag [^<], or to the next line break [^\r\n]. I have combined both character classes to one definition [^<\r\n]. However, this is not a general solution!
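As a rough illustration of how that pattern behaves, here is a minimal Python sketch. The sample text is copied from the question; the variable names are just for this example, and Python's `re` module is an assumption on my part since the question does not name a language.

```python
import re

# Sample input copied from the question.
text = """<claim-text>aaaaaaa
<claim-text>bbbbbbb</claim-text>
<claim-text>ccccccc</claim-text>
</claim-text>"""

# Capture whatever directly follows each opening tag, stopping at the
# next "<" or at a line break.
pattern = re.compile(r"<claim-text>([^<\r\n]*)")

matches = pattern.findall(text)
print(matches)
```

This only picks up the text that sits on the same line as an opening tag, so text that follows a nested closing tag would be missed.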

Do not under any circumstances try to parse HTML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library; see this page for some ways to do it.
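For what it's worth, here is a minimal sketch of the parser route in Python. It assumes the snippet from the question is well-formed XML and uses the standard library's `xml.etree.ElementTree` rather than an HTML-specific library; real HTML would probably need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the question (assumed to be well-formed XML).
xml_text = """<claim-text>aaaaaaa
<claim-text>bbbbbbb</claim-text>
<claim-text>ccccccc</claim-text>
</claim-text>"""

root = ET.fromstring(xml_text)

# iter() visits the root element and every nested claim-text element in
# document order; .text is the text that directly follows the opening tag.
texts = [(el.text or "").strip() for el in root.iter("claim-text")]
print(texts)
```

Because the parser tracks nesting itself, this should keep working if the claims are nested more deeply.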

Regular expression to extract argument in a LaTeX command

I have a large LaTeX document where I have defined a macro like this:
\newcommand{\abs}[1]{\left|#1\right|}
I want to get rid of it by replacing \abs{...} with \left|...\right| throughout the document, so I am thinking of using a regular expression. I am familiar with the basics, but I do not know how to find the brace that closes the expression, given that the following situations are possible:
\abs{(2+x)^3}
\abs{\frac{2}{3}}
\abs{\frac{2}{\sin(2\abs{x})}}
What I have been able to do so far is \\abs\{([^\}]*)\} with the replacement \left|\1\right|, but it only handles code of the first kind.
By the way, I am using the TeXstudio regular expression engine.
Well, I did a little more research and managed to solve it. According to this response in a similar question, it suffices to use recursive regular expressions and a text editor that supports them, for example Sublime Text 2 (I could not do it with TeXstudio). This does the trick:
Find: \\abs\{(([^\{\}]|(?R))*)\}
Replace: \\left|\1\\right|
EDIT 1: Actually this solves only the first two cases but fails on the third, so any idea on how to improve the regular expression would be appreciated.
EDIT 2: See the comment from @CasimiretHippolyte for a full answer using \\abs\{((?>[^{}]+|\{(?1)\})*)\}
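If no engine with recursive patterns is at hand, the same replacement can be sketched without a regex by counting braces. This is a different technique from the recursive pattern above, not a translation of it. The function name `replace_abs` is made up for this example, and the sketch assumes braces are balanced and that escaped braces such as \{ do not occur inside the argument.

```python
def replace_abs(text: str) -> str:
    r"""Replace every \abs{...} with \left|...\right|, including nested ones."""
    marker = "\\abs{"
    out = []
    i = 0
    while i < len(text):
        if text.startswith(marker, i):
            # Walk forward, tracking brace depth, until the brace that
            # closes this \abs{ is found.
            depth = 1
            j = i + len(marker)
            while j < len(text) and depth:
                if text[j] == "{":
                    depth += 1
                elif text[j] == "}":
                    depth -= 1
                j += 1
            if depth:
                # Unbalanced braces: leave the remainder untouched.
                out.append(text[i:])
                break
            inner = text[i + len(marker):j - 1]
            # Recurse so that \abs{...} inside the argument is replaced too.
            out.append("\\left|" + replace_abs(inner) + "\\right|")
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(replace_abs(r"\abs{(2+x)^3}"))
print(replace_abs(r"\abs{\frac{2}{3}}"))
print(replace_abs(r"\abs{\frac{2}{\sin(2\abs{x})}}"))
```

The three calls correspond to the three cases listed in the question.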

Need to create a gmail like search syntax; maybe using regular expressions?

I need to enhance the search functionality on a page listing user accounts. Rather than have multiple search boxes for each possible field, or a drop-down menu where the user can only search against one field, I'd like a single search box with a Gmail-like syntax. By that I mean being able to type the following into the input box:
username:bbaggins type:admin "made up plc"
When the form is submitted, the search string should be split into its separate parts, which will allow me to construct a SQL query. So for example, type:admin would form part of the WHERE clause so that it would find any record where the field type is equal to admin, and the same for username. The text in quotes may be a free text search, but I'm not sure on that yet.
I'm thinking that a regular expression or two would be the best way to do this, but that's something I'm really not good at. Can anyone help to construct a regular expression which could be used for this purpose? I've searched around for some pointers but either I don't know what to search for or it's not out there as I couldn't find anything obvious. Maybe if I understood regular expressions better it would be easier :-)
Cheers,
Adam
No, you would not use regular expressions for this. Just split the string on spaces in whatever language you're using.
You don't necessarily have to use a regex. Regexes are powerful, but in many cases also slow. Regex also does not handle nested parameters very well. It would be easier for you to write a script that uses string manipulation to split the string and extract the keywords and the field names.
If you want to experiment with Regex, try the online REGex tester. Find a tutorial and play around, it's fun, and you should quickly be able to produce useful regexes that find any words before or after a : character, or any sentences between " quotation marks.
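To make the string-manipulation approach concrete, here is a minimal Python sketch. The question does not say which language is in use, so Python and the helper name `parse_search` are assumptions; `shlex.split` is used so that the quoted phrase stays together as one token.

```python
import shlex


def parse_search(query: str):
    """Split a search string into field:value pairs and free-text terms."""
    fields = {}
    free_text = []
    # shlex.split keeps "made up plc" as a single token and drops the quotes.
    for token in shlex.split(query):
        name, sep, value = token.partition(":")
        if sep and name and value:
            fields[name] = value
        else:
            free_text.append(token)
    return fields, free_text


fields, free_text = parse_search('username:bbaggins type:admin "made up plc"')
print(fields)
print(free_text)
```

The field names should still be checked against a whitelist of real columns, and the values passed as bound parameters, before any of this goes near the SQL query. Note that `shlex.split` raises `ValueError` on an unclosed quote, and a quoted phrase that contains a colon would be treated as a field here.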
Thanks for the answers. I did start doing it without regex and just wondered if a regex would be simpler. Sounds like it wouldn't though, so I'll go back to the way I was doing it and test it again.
Good old Mr Bilbo is my go-to guy for any naming needs :-)
Cheers,
Adam

Regular Expression matching anything after a word

I am looking to find anything that matches this pattern, the beginning word will be:
organism aogikgoi egopetkgeopt foprkgeroptk 13
So anything that starts with organism needs to be found using regex.
^organism will match anything starting with "organism".
^organism(.*) will also capture everything that follows, into the variable that contains the first match (which varies according to language -- in Perl it's $1).
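As a small illustration in Python (the answer mentions Perl; Python is only an assumption here), the captured text is available through `group(1)`. The second sample line is made up to show a non-match.

```python
import re

lines = [
    "organism aogikgoi egopetkgeopt foprkgeroptk 13",  # from the question
    "not an organism line",                            # hypothetical non-match
]

pattern = re.compile(r"^organism(.*)")

# Keep the captured remainder of every line that starts with "organism".
captured = [m.group(1) for m in map(pattern.match, lines) if m]
print(captured)
```

To scan a multi-line string in one go instead of a list of lines, the `re.MULTILINE` flag makes `^` match at the start of each line.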
I'd also like to add, for other newbies like me in different circumstances, that you can do this in various ways depending on your text and what you are trying to do.
Here's an example where I want to delete everything after ?spam, so I could use .?spm.+ or .?spm.+ or any other approach, as long as you are creative about it.
This might come in handy: here's a Link | Link where you can find some basic, necessary regex constructs and their meanings.

What is a better way to write this regular expression?

I am converting XML children into the element parameters and have a dirty regex script I used in Textmate. I know that dot (.) doesn't search for newlines, so this is how I got it to resolve.
Search
language="(.*)"
(.*)<education>(.*)(\n)?(.*)?(\n)?(.*)?(\n)?(.*)?</education>
(.*)<years>(.*)</years>
(.*)<grade>(.*)</grade>
Replace
grade="$13" language="$1" years="$11">
<education>$3$4$5$6$7$8$9</education>
I know there's a better way to do this. Please help me build my regex skills further.
Use an XML parser; don't use regex to parse XML.
If there are no other tags inside the <education> element, I would change that part to:
<education>([^<>]*)</education>
If possible, I would use the same technique everywhere else you're using .*. In the case of the language attribute, it would take this form:
language="([^"]*)"
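Here is a short Python sketch of why the negated character classes help. The sample XML is made up for illustration (the question does not show its input), and Python's `re` module stands in for the TextMate engine. Unlike the dot, `[^<>]` also matches newlines, so the multi-line element needs no `(\n)?` workarounds.

```python
import re

# Hypothetical sample; the real document layout is not shown in the question.
xml = """<person language="English">
  <education>BSc Physics
MSc Astronomy
PhD Cosmology</education>
</person>"""

# [^<>]* runs until the next tag, across line breaks.
education = re.search(r"<education>([^<>]*)</education>", xml).group(1)

# [^"]* cannot run past the closing quote, so no over-matching as with .*
language = re.search(r'language="([^"]*)"', xml).group(1)

print(repr(education))
print(language)
```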

Regex not returning 2 groups

I'm having a bit of trouble with my regex and was wondering if anyone could please shed some light on what to do.
Basically, I have this Regex:
\[(link='\d+') (type='\w+')](.*|)\[/link]
For example, when I pass it the string:
[link='8' type='gig']Blur[/link] are playing [link='19' type='venue']Hyde Park[/link]
It only returns a single match from the opening [link] tag to the last [/link] tag.
I'm just wondering if anyone could please help me with what to put in my (.*|) section to only select one [link][/link] section at a time.
Thanks!
You need to make the wildcard selection ungreedy with the "?" operator. I make it:
/\[(link='\d+')\s+(type='\w+')\](.*?)\[\/link\]/
Of course, this all falls down for any kind of nesting, in which case the language is no longer regular and regexes aren't suitable; find a parser.
Regular Expressions Info is a fantastic site. This page gives an example of dealing with HTML tags. There's also an Eclipse plugin that lets you develop expressions and see the matching in real time.
You need to make the .* in the middle of your regex non-greedy. Look up the syntax and/or flag for non-greedy mode in your flavor of regular expressions.
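To see the difference between the greedy and non-greedy versions, here is a minimal Python sketch run on the string from the question. The question does not name its regex flavor, so Python's `re` module is an assumption, and the variable names are just for this example.

```python
import re

text = ("[link='8' type='gig']Blur[/link] are playing "
        "[link='19' type='venue']Hyde Park[/link]")

# Greedy: .* runs to the last [/link], so everything becomes one match.
greedy = re.findall(r"\[(link='\d+')\s+(type='\w+')\](.*)\[/link\]", text)

# Non-greedy: .*? stops at the first [/link] after each opening tag.
lazy = re.findall(r"\[(link='\d+')\s+(type='\w+')\](.*?)\[/link\]", text)

print(len(greedy))
for link, kind, label in lazy:
    print(link, kind, label)
```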