JavaScript Regular expression text parsing - regex

I have a string like the following:
~~<b>A<i>C</i></b>~~/~~<u>D</u><b>B</b>~~has done this.
I am trying to get the text inside each <b> tag. I am using:
<b>(.+)</b>
But this returns <b>A<i>C</i></b>~~/~~<u>D</u><b>B</b> as a single match, whereas I need <b>A<i>C</i></b> as the first match and <b>B</b> as the second.
Can anyone please help?

You need to use a non-greedy quantifier:
<b>(.+?)</b>
This will ensure that the match stops at the first </b> it finds.
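However, I would generally recommend using a proper XML or HTML parser for this sort of thing. Regular expressions are simply not powerful enough to handle the recursive structure of XML. For example, if a <b> element is nested inside another <b> element, the lazy pattern stops at the inner </b>.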
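The difference between the two quantifiers can be seen in a minimal sketch. Python's re module is used here only for illustration (an assumption, since the question is about JavaScript); the same lazy quantifier should behave the same way in a JavaScript pattern such as /<b>(.+?)<\/b>/g.

```python
import re

text = "~~<b>A<i>C</i></b>~~/~~<u>D</u><b>B</b>~~has done this."

# Greedy: .+ runs to the last </b> in the string, so there is one long match.
greedy = re.findall(r"<b>(.+)</b>", text)

# Lazy: .+? stops at the first </b> that lets the pattern succeed.
lazy = re.findall(r"<b>(.+?)</b>", text)

print(greedy)  # ['A<i>C</i></b>~~/~~<u>D</u><b>B']
print(lazy)    # ['A<i>C</i>', 'B']
```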

Related

Finding text between two tags with variable namespace

I have to parse a lot of text files, each of which contains one or more XML documents. I know every XML document is wrapped in an Envelope root tag, but the namespaces vary.
I tried to create a regular expression to grab these XML documents from a text file. It works for most of them, but for some I get a catastrophic backtracking error. I think it's because the text is too large and my expression is not very efficient. I'm not great at regex, so I'm struggling to fix this.
The pattern I'm looking for is:
<namespace:envelope attributes>XML</namespace:envelope>
What I've come up with so far is:
(?i)<[^:]*?:envelope[^>]*?>.*?<\/[^:]*?:envelope>
Any help would be greatly appreciated.
Try this regular expression:
#<([^/].*?):envelope\s.+?</\1:envelope>#s
RegEx101 Demo 1
Or a shorter one, if you don't need the namespace captured separately:
#<([^/].*?:envelope)\s.+?</\1>#s
RegEx101 Demo 2
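As a rough sketch of the backreference idea in Python (an assumption; the patterns above use PHP-style delimiters), the version below captures the namespace prefix and reuses it in the closing tag. It differs from the patterns above in two ways that are my own choices: the prefix is restricted to name characters and the attributes to [^>]*, which should leave the engine less room to backtrack, and the tag is allowed to have no attributes at all. The sample text and variable names are made up for illustration.

```python
import re

# Group 1 is the namespace prefix; \1 requires the same prefix in the closing tag.
ENVELOPE = re.compile(
    r"<([A-Za-z_][\w.-]*):envelope\b[^>]*>.*?</\1:envelope>",
    re.IGNORECASE | re.DOTALL,  # case-insensitive, and . also matches newlines
)

# Hypothetical input: two envelopes with different prefixes, surrounded by noise.
text = (
    'junk <soap:Envelope xmlns:soap="x"><soap:Body>1</soap:Body></soap:Envelope> '
    "more <ns2:envelope><a/></ns2:envelope> tail"
)

matches = list(ENVELOPE.finditer(text))
prefixes = [m.group(1) for m in matches]
documents = [m.group(0) for m in matches]

print(prefixes)  # ['soap', 'ns2']
```

Each entry in documents is one complete envelope, which could then be handed to an XML parser.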

Problems with finding and replacing

Hey Stack Overflow community. I need help with a huge data file. Is it possible, with a regular expression, to find this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
and replace all the other data, leaving only 'Adler' or 'Lygintuvai'? I'm using Altova to edit XML files, so I can't find any way other than find-and-replace. I'm new to regex, so I thought maybe you could help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to change the set of accepted characters, change [\w] to something else:
[a-z] - only letters
[0-9] - only digits
etc.
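A minimal sketch of how this pattern could drive a replacement, written in Python rather than PHP (an assumption made for illustration). It assumes the file contains the literal text &gt; inside the CDATA section, as the gt; in the pattern implies; the variable names are hypothetical.

```python
import re

xml_line = (
    "<category_name><![CDATA[Prekiniai ženklai&gt;Adler|"
    "Kita buitinė technika&gt;Buičiai naudingi prietaisai|"
    "Kita buitinė technika&gt;Lygintuvai]]></category_name>"
)

# Same structure as the pattern above, without the PHP delimiters.
pattern = re.compile(
    r"<category_name>.+?gt;(\w+?)\|.+?gt;(\w+?)\]\]></category_name>",
    re.IGNORECASE,
)

match = pattern.search(xml_line)
first, last = match.group(1), match.group(2)

# Keep only the first captured name inside the CDATA section.
replaced = pattern.sub(r"<category_name><![CDATA[\1]]></category_name>", xml_line)

print(first, last)  # Adler Lygintuvai
print(replaced)     # <category_name><![CDATA[Adler]]></category_name>
```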
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section, but XML escape sequences are not recognized in a CDATA section.
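The parser-based route can be sketched as follows. This uses Python's xml.etree.ElementTree instead of the XQuery or XSLT recommended above, purely as an illustration of the same idea: let a parser locate the element, then process its text as an ordinary string. It assumes the CDATA section really contains the literal characters &gt;, which the parser leaves untouched.

```python
import xml.etree.ElementTree as ET

xml_doc = (
    "<category_name><![CDATA[Prekiniai ženklai&gt;Adler|"
    "Kita buitinė technika&gt;Buičiai naudingi prietaisai|"
    "Kita buitinė technika&gt;Lygintuvai]]></category_name>"
)

element = ET.fromstring(xml_doc)

# Inside CDATA, "&gt;" is plain text, so it is split on literally.
paths = element.text.split("|")
leaves = [path.split("&gt;")[-1] for path in paths]

print(leaves)  # ['Adler', 'Buičiai naudingi prietaisai', 'Lygintuvai']
```

Here leaves[0] and leaves[-1] are the two values the question asks for.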

How to write a regular expression pattern for this scenario

I am trying to find occurrences of special characters in the sample XML below.
<?xml version="1.0"?>
<PayLoad>
<requestRows>****</requestRows>
<requestRowLength>1272</requestRowLength>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
</PayLoad>
I have to find entire tags that contain the characters $, (, =, or -. For this I have written the regular expression pattern below:
(<[\w\d]*>\w*(?<value>[^\w]+)\w*\d*</[\w\d]*>)
and it returns the following output (running in the Expresso tool):
<requestRows>****</requestRows>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
but it should also return the two entries below:
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
These entries are omitted because they contain more than one special character (including spaces). Can anyone please give me a correct regular expression for the above scenario?
Thanks.
I would use a lookahead for the middle part, so instead of
(<[\w\d]*>\w*(?<value>[^\w]+)\w*\d*</[\w\d]*>)
I would use
(<[\w\d]*>(?=[^<]*[^<\w])(?<value>.*)</[\w\d]*>)
Without the ?<value> part that I don't really recognise the syntax of, this becomes
(<[\w\d]*>(?=[^<]*[^<\w]).*</[\w\d]*>)
Just add capturing groups where you like if you want to save anything in particular.
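The lookahead version can be tried out with a short sketch in Python (an assumption; the question uses a .NET tool). Python spells the named group (?P<value>...) rather than (?<value>...), and I use \w+ in place of [\w\d]*. Note that the lookahead accepts any non-word character, so spaces count as special characters too.

```python
import re

sample = """<?xml version="1.0"?>
<PayLoad>
<requestRows>****</requestRows>
<requestRowLength>1272</requestRowLength>
<exceptionTimestamp>2012070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>201$2070202281068-0700</exceptionTimestamp>
<exceptionTimestamp>20120(702022810680700</exceptionTimestamp>
<exceptionDetail>NO DATA AVAILABLE FOR TIME PERIOD SPECIFIED =</exceptionDetail>
</PayLoad>"""

# The lookahead requires at least one non-word character before the next "<".
pattern = re.compile(r"<\w+>(?=[^<]*[^<\w])(?P<value>.*)</\w+>")

values = [m.group("value") for m in pattern.finditer(sample)]

for value in values:
    print(value)
```

On this sample the pattern matches five elements and skips <requestRowLength>, whose content consists only of word characters.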

matching table tag by regular expression in php

I need to match a substring in PHP. The substring looks like:
<table class="tdicerik" id="dgVeriler"
I wrote a regular expression for it, <table\s*\sid=\"dgVeriler\", but it did not work. Where is my problem?
You forgot a dot:
<table\s.*\sid="dgVeriler"
would have worked.
<table\s+.*?\s+id="dgVeriler"
would have been better (making the repetition lazy, matching as little as possible).
<table\s+[^>]*?\s+id="dgVeriler"
would have been better still (making sure that we don't accidentally match outside of the <table> tag).
And not trying to parse HTML with regular expressions, using a parser instead, would probably have been best.
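The three variants can be compared side by side in a small sketch, written in Python for illustration (an assumption; the question is about PHP, but these patterns should behave the same way under PCRE). The second input string is a made-up example in which the id belongs to a different tag.

```python
import re

sample = '<table class="tdicerik" id="dgVeriler"'
# Hypothetical markup: the id sits on a <div> that follows the table.
other = '<table class="a"><tr></tr></table><div id="dgVeriler">'

original = re.compile(r'<table\s*\sid="dgVeriler"')        # the question's pattern
lazy_dot = re.compile(r'<table\s+.*?\s+id="dgVeriler"')    # lazy, but . can cross ">"
bounded = re.compile(r'<table\s+[^>]*?\s+id="dgVeriler"')  # stays inside the tag

original_hit = original.search(sample) is not None      # False: nothing consumes class="..."
bounded_hit = bounded.search(sample) is not None        # True
lazy_false_hit = lazy_dot.search(other) is not None     # True: runs on into the <div>
bounded_false_hit = bounded.search(other) is not None   # False

print(original_hit, bounded_hit, lazy_false_hit, bounded_false_hit)
```

One limitation worth knowing: because both \s+ parts need their own whitespace, the bounded pattern does not match when id is the first attribute, as in <table id="dgVeriler">.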
I don't know what you want to get, but try this:
<table\s*.*id=\"dgVeriler\"

What is a better way to write this regular expression?

I am converting XML child elements into element attributes and have a dirty regex script I used in TextMate. I know that dot (.) doesn't match newlines, so this is how I got it to work.
Search
language="(.*)"
(.*)<education>(.*)(\n)?(.*)?(\n)?(.*)?(\n)?(.*)?</education>
(.*)<years>(.*)</years>
(.*)<grade>(.*)</grade>
Replace
grade="$13" language="$1" years="$11">
<education>$3$4$5$6$7$8$9</education>
I know there's a better way to do this. Please help me build my regex skills further.
Use an XML parser; don't use regex to parse XML.
If there are no other tags inside the <education> element, I would change that part to:
<education>([^<>]*)</education>
If possible, I would use the same technique everywhere else you're using .*. In the case of the language attribute, it would take this form:
language="([^"]*)"
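The reason this helps is that a negated character class such as [^<>] matches newlines, so the repeated (\n)?(.*)? groups are no longer needed. A minimal sketch in Python (an assumption; the question uses TextMate), with a made-up snippet because the original XML is not shown:

```python
import re

# Hypothetical input: the <education> content spans several lines.
snippet = 'language="English"\n  <education>BSc\nMSc\nPhD</education>'

# . does not match newlines by default, so this fails on multi-line content.
dot_match = re.search(r"<education>(.*)</education>", snippet)

# [^<>] matches any character except angle brackets, newlines included.
education = re.search(r"<education>([^<>]*)</education>", snippet).group(1)
language = re.search(r'language="([^"]*)"', snippet).group(1)

print(dot_match)        # None
print(repr(education))  # 'BSc\nMSc\nPhD'
print(language)         # English
```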