Regex Full match - regex

I'm trying to understand regular expressions:
I need to only match on text_01 and text_02 and filter out the tags.
<span>text_01<b>text_02</b>
I've tried to do it like:
(?<=<span>)(([^>]+)<b>)(.+?)(?=</b>)
But it captures 3 groups and the Full Match includes a tag.
text_01<b>text_02
Could you give me advice on how I need to build a regex whose Full match contains only text and no tags?

Parsing HTML with regular expressions can get very complicated. In general it is not advised, and it is better to use a parser for this (a library in whatever language you are using).
But for cases where you are sure the text content does not have < nor >, and these < and > are not nested, you could use this one:
[^<>]*(?=<[^<>]*>)
This only matches text that is followed by a pair of < and >.
If it is enough to test that text is followed by <, it can be simply:
[^<>]*(?=<)
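As a rough illustration, the first pattern can be tried on the sample input from the question. The answers do not name a language, so Python's re module is an assumption here. Because of the * quantifier the pattern also produces empty matches right before each tag, so this sketch filters them out:

```python
import re

# Sample input from the question.
html = "<span>text_01<b>text_02</b>"

# Non-bracket characters that are followed by a <...> pair.
pattern = re.compile(r"[^<>]*(?=<[^<>]*>)")

# The * quantifier also allows empty matches just before each tag; drop them.
texts = [m for m in pattern.findall(html) if m]
print(texts)
```

Replacing * with + in the pattern would avoid the empty matches in the first place.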

By using a non-capturing group you are able to exclude the middle <b> tag as a capture group, but you will never be able to get a full match without the tag included. That is not possible: a regular expression cannot skip part of the input within a single match, because a match must be one contiguous substring.
(?<=<span>)(.+?)(?:<b>)(.+?)(?=<\/b>)
Full match text_01<b>text_02
Group 1. text_01
Group 2. text_02
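A minimal sketch of how this looks in code, assuming Python's re module (the question does not say which language is used). The full match still contains the tag, so the text is taken from the two groups instead:

```python
import re

html = "<span>text_01<b>text_02</b>"
pattern = re.compile(r"(?<=<span>)(.+?)(?:<b>)(.+?)(?=<\/b>)")

m = pattern.search(html)
full_match = m.group(0)   # contiguous, so it still includes <b>
text_parts = m.groups()   # only the two capturing groups

print(full_match)
print(text_parts)
```

Joining or iterating over the groups is the usual way to get tag-free text when a single contiguous match cannot provide it.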

Related

regex match expression except specific string (no negative lookahead)

I'm trying to write a regex that matches most cases of HTML elements, like for example:
<script></script>
I would like to make an exception for the following HTML tag specifically:
<b>
Which I don't want to capture. Is there a way to do it without using negative lookahead/lookbehind?
At the moment I have something like this:
((\%3C)|<)[^<b]((\%2F)|\/)*[^<\/b][a-z0-9\%\=\'\(\)\ ]+((\%3E)|>)
https://regex101.com/r/ZxkVMJ/2
It does work, but besides
<b>
it also doesn't capture all 1 character tags
(like <a> for example)
as well as longer tags that start with b, like for example
<balloon>
Thank you for any help
As a disclaimer, if you have any kind of XML/HTML parser available, you should really use that for your current problem. If you are forced to use regex here, then consider this pattern:
<([^b][^>]*|b[^>]+)>.*?<\/\1>
This matches an HTML tag which either starts with a character other than b, or a tag which does start with b, but then is followed by one or more other characters (thus ruling out <b>). The backreference \1 then requires a closing tag with the same name. Here is a working demo:
Demo
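The pattern can also be checked with a short script. This is only a sketch in Python (an assumed language, since the question names none), and the sample strings other than <script></script> are made up for illustration:

```python
import re

# Tag name is either "not b, then anything" or "b plus at least one more character".
tag_pair = re.compile(r"<([^b][^>]*|b[^>]+)>.*?<\/\1>")

# Illustrative samples; only <script></script> appears in the question.
samples = [
    "<script></script>",
    "<a>link</a>",
    "<balloon>up</balloon>",
    "<b>bold</b>",
]

results = {s: bool(tag_pair.search(s)) for s in samples}
for sample, matched in results.items():
    print(sample, matched)
```

Note that an opening tag with attributes would put the attributes into group 1 as well, so the closing tag would no longer match the backreference.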

Regular Expression - Starting and ending with, and contains specific string in the middle

I would like to generate a regex with the following condition:
The string "EVENT" is contained within a xml tag called "SHEM-HAKOVETZ".
For example, the following string should be a match:
<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>
I think you want something like this ^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$
Regular expression
^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$
Parts of the regular expression
^ From the beginning of the line
<SHEM-HAKOVETZ> Starting tag
.* Any character, zero or more times (used both before and after EVENT)
EVENT Middle part
<\/SHEM-HAKOVETZ>$ Closing tag, which must end the line
Here is the working regex.
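A quick way to test the anchored pattern, sketched in Python (an assumption, as the question names no language). The second line is a made-up counterexample without EVENT:

```python
import re

pattern = re.compile(r"^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$")

line = "<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>"
# Hypothetical line that lacks the EVENT string.
other = "<SHEM-HAKOVETZ>104000514813450.DAT</SHEM-HAKOVETZ>"

line_matches = pattern.search(line) is not None
other_matches = pattern.search(other) is not None
print(line_matches, other_matches)
```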
If you want to match this line, you could use this regex:
<SHEM-HAKOVETZ>.*EVENTS.*(?=<\/SHEM-HAKOVETZ>)
However, I would not recommend using regex on XML-based data, because there may be problems with whitespace handling in XML (see this article for more information). I would suggest using an actual XML parser (and then applying the regex to the extracted values to be sure about your results).
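The parser-based approach might look like the following sketch. It assumes Python and its standard xml.etree.ElementTree module; the <root> wrapper and the second element are hypothetical, only the SHEM-HAKOVETZ tag and the first value come from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; <root> and the second entry are invented for the example.
xml_text = """<root>
  <SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>
  <SHEM-HAKOVETZ>104000514813450OTHER0001dfd0.DAT</SHEM-HAKOVETZ>
</root>"""

root = ET.fromstring(xml_text)

# Let the parser find the elements, then do a plain substring test on the text.
event_files = [
    el.text
    for el in root.iter("SHEM-HAKOVETZ")
    if el.text and "EVENT" in el.text
]
print(event_files)
```

Here the parser deals with the markup, so the string test only ever sees the element content.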
Here is a solution to only match the "value" part ignoring the XML tags:
(?<=<SHEM-HAKOVETZ>)(?:.*EVENTS.*)(?=<\/SHEM-HAKOVETZ>)
You can check it out in action at: https://regex101.com/r/4XiRch/1
It uses a lookbehind and a lookahead so that it only matches when the surrounding tags are present, while the match itself contains only the content.
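For illustration, the same pattern in Python (an assumed language; the lookbehind is fixed-width, which Python's re module requires):

```python
import re

pattern = re.compile(r"(?<=<SHEM-HAKOVETZ>)(?:.*EVENTS.*)(?=<\/SHEM-HAKOVETZ>)")

line = "<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>"

m = pattern.search(line)
# group(0) is the full match; the tags are checked but not consumed.
value = m.group(0) if m else None
print(value)
```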

Regular Expression in Notepad++ to replace < and > inside CDATA

I'm using Notepad++ to fix a huge XML export file and one of the challenges here is to replace all < and > characters with &lt; and &gt;. The thing is, I can't simply use the Replace All action since the XML file is full of < and > that cannot be changed.
Luckily, all the < and > that I need to change are wrapped in CDATA tags, like this:
<![CDATA[Text here... <span class="vSpecial"><p>Special Offer - more text here!</p></span>]]>
I was wondering if there'd be a Regular Expression to identify < and > wrapped in CDATA content, so I could easily use the Replace All to change only them.
UPDATE
The content of CDATA can contain line breaks.
Code
See regex in use here
<!\[CDATA\[(?:(?!\]\]>).)*?\K(?:(<)|(>))
Replacement: (?{1}&lt;)(?{2}&gt;)
Note: For display purposes the link above uses \G(?!\A). This is not supported in Notepad++, thus it's been dropped in the actual answer. I added it to the link to show what it basically does.
See the Notepad++ documentation for more information. It mentions the following:
For those readers familiar with Perl, \G is not supported.
Explanation
Click Replace All repeatedly until the message at the bottom shows Replace All: 0 occurrences were replaced. It will replace the first occurrence, then the second occurrence, then third, etc. for each CDATA that is found until there are no more matches.
Pattern
<!\[CDATA\[ Matches <![CDATA[ literally
(?:(?!\]\]>).)*? Tempered lazy token matching any character any number of times, but as few as possible ensuring what follows doesn't match ]]>
\K Resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match
(?:(<)|(>)) Match either of the following
(<) Capture < literally into capture group 1
(>) Capture > literally into capture group 2
Replacement
Notepad++ allows conditional replacements, so (?{1}&lt;) inserts &lt; when capture group 1 matched and (?{2}&gt;) inserts &gt; when capture group 2 matched.
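Outside Notepad++, the same result can be reached in a single pass with a script. The sketch below is a different technique from the answer's pattern: Python's re module has neither \K nor conditional replacements, so it matches each whole CDATA section and rewrites its body in a callback. It assumes the intended replacements are the entities &lt; and &gt;, and the function name and sample string are made up for the example:

```python
import re

# Match one whole CDATA section; DOTALL lets the body span line breaks.
CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def escape_cdata_brackets(xml_text):
    """Escape < and > inside CDATA sections only (hypothetical helper)."""

    def repl(m):
        body = m.group(1).replace("<", "&lt;").replace(">", "&gt;")
        return "<![CDATA[" + body + "]]>"

    return CDATA.sub(repl, xml_text)


# Illustrative input with a line break inside the CDATA section.
sample = '<item><![CDATA[Text <span class="vSpecial"><p>Special\nOffer</p></span>]]></item>'
result = escape_cdata_brackets(sample)
print(result)
```

The surrounding <item> tags are left untouched because only the captured body is rewritten.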

regex expression for selecting a value

I want to write a regexp formula for the below sip message that takes number:
< sip:callpark#as1sip1.com:5060;user=callpark;service=callpark;preason=park;paction=park;ptoken=150009;pautortrv=180;nt_server_host=47.168.105.100:5060 >
(Actually there are "<" and ">" signs in the message, but the site does not let me write)
For this case, I want to select ptoken value.. I wrote an expression such as: ptoken=(.*);p but it returns me ptoken=150009;p, I just need the number:150009
How do I write a regexp for this case?
PS: I write this for XML script..
Thanks,
I solved the problem by using two regexes:
ereg assign_to="token" check_it="true" header="Refer-To:" regexp="(ptoken=([\d]*))" search_in="hdr"/
ereg assign_to="callParkToken" search_in="var" variable="token" check_it="true" regexp="([\d].*)" /
You could use the following regex:
ptoken=(\d+)
# searches for ptoken= literally
# captures every digit found in the first group
Your wanted numbers are in the first group then. Take a look at this demo on regex101.com. Depending on your actual needs, there could be better approaches (Xpath? as tagged as XML) though.
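A minimal sketch of reading the first group, assuming Python (the question itself uses an XML scenario file, so this is only for trying the pattern):

```python
import re

# Header value from the question, with the angle brackets restored.
header = (
    "<sip:callpark#as1sip1.com:5060;user=callpark;service=callpark;"
    "preason=park;paction=park;ptoken=150009;pautortrv=180;"
    "nt_server_host=47.168.105.100:5060>"
)

m = re.search(r"ptoken=(\d+)", header)
# group(0) would be "ptoken=150009"; group(1) holds only the digits.
token = m.group(1) if m else None
print(token)
```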
You should use lookahead and lookbehind:
(?<=ptoken=)(.+?)(?=;)
It matches one or more characters, as few as possible (.+?), that are preceded by ptoken= and followed by ;
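With the lookaround version the full match is already the bare value, as this Python sketch suggests (Python is an assumption; the shortened header string is made up from the question's example):

```python
import re

# Shortened version of the header from the question.
header = "paction=park;ptoken=150009;pautortrv=180"

m = re.search(r"(?<=ptoken=)(.+?)(?=;)", header)
# The lookbehind and lookahead are not part of the match itself.
full_match = m.group(0) if m else None
print(full_match)
```

This only helps in tools that support lookbehind and that let you use the whole match.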
The <ereg ... > action has the assign_to parameter. In your case assign_to="token". In fact, the parameter can receive several variable names. The first is assigned the whole string matching the regular expression, and the following are assigned the "capture groups" of the regular expression.
If your regexp is ptoken=([\d]*), the whole match includes ptoken which is bad. The first capture group is ([\d]*) which is the required value. Thus, use <ereg regexp="ptoken=([\d]*)" assign_to="dummyvar,token" ..other parameters here.. >.
Is it working?

Changing some XML tags names but leaving unchanged values between them

In one of my XML file I need to find and replace some opening tags names using regex and Notepad++. Also I need to leave unchanged every text between them.
Example:
<uri>http://domain-name.com/41874_01_home_big.jpg</image_big>
I need to change into:
<image_big>http://domain-name.com/41874_01_home_big.jpg</image_big>
For some reason I can't just change the uri tag, because other closing tags such as /image_small appear in the document (also opened with uri, of course).
I tried to change it like:
<uri>.*?</image_big>
But I don't know with what I should replace it.
I tried with:
<image_big>\1</image_big>
but result is:
<image_big></image_big>
without any text inside.
I need your help. I'm not good with regex.
Just put .*? inside a capturing group.
<uri>(.*?)<\/image_big>
Then replace the match with <image_big>\1</image_big> or <image_big>$1</image_big>
Your regex <uri>.*?</image_big> matches correctly, but in order to retrieve the characters matched by .*? you need to put that pattern inside a capturing group, so that it can be back-referenced in the replacement.
DEMO
Find:<uri>(.*?)</image_big>
Replace:<image_big>\1</image_big> or <image_big>$1</image_big>
See demo.
https://www.regex101.com/r/rK5lU1/19
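The same find-and-replace can be tried outside Notepad++. This is a sketch in Python, which is an assumption on my part; note that Python's re.sub accepts \1 in the replacement but not $1:

```python
import re

line = "<uri>http://domain-name.com/41874_01_home_big.jpg</image_big>"

# Group 1 keeps the text between the tags; \1 puts it back in the replacement.
fixed = re.sub(r"<uri>(.*?)</image_big>", r"<image_big>\1</image_big>", line)
print(fixed)
```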