How to match fuzzy empty div with a regular expression? - regex

I have the following HTML code:
<div id="page126-div" style="position:relative;width:918px;height:1188px;">
</div>
<div id="page127-div" style="position:relative;width:918px;height:1188px;">
sometext for example
</div>
<div id="page128-div" style="position:relative;width:918px;height:1188px;">
</div>
My task is to match empty divs. Empty means in this context that they have no content at all (no characters between the opening > and the closing <), or contain just a newline, just a space, or fewer than 5 characters. So emptiness is pretty fuzzy.
If I wanted to match all divs, not only the empty ones, I would use the following regex:
\<div id="page.*?"\>.*?\<\/div\>
Naturally, it has to be used with the dotall modifier.
To match only the empty divs, I tried this expression:
\<div id="page.*?"\>.{0,5}?\<\/div\>
I expect to get the first and the last (third) divs, because each consists of an opening div tag with attributes, then content of 0 to 5 characters, then a closing div tag.
The first match is right, but the second match is the second and third divs stacked together instead of the third div only.
I do not understand why.

This regex is pretty straight-forward:
<div id=\"[^"]+?\" style=[^>]+?>(\s|\n|[^\n]{0,5})<\/div>
Note that it does not require the id and style values to be exactly the same in every div.
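As for why the original attempt stacks the second and third divs: with dotall on, the lazy .*? in id="page.*?"> is free to keep growing past the first "> when the rest of the pattern fails, so it runs through the long second div and stops at the "> of the third one. Here is a minimal Python sketch of that behaviour, with a variant that uses [^>]* so the match cannot leave the opening tag. The names lazy, bounded and matched_ids are only for illustration.

```python
import re

html = '''<div id="page126-div" style="position:relative;width:918px;height:1188px;">
</div>
<div id="page127-div" style="position:relative;width:918px;height:1188px;">
sometext for example
</div>
<div id="page128-div" style="position:relative;width:918px;height:1188px;">
</div>
'''

# Original attempt: with DOTALL the lazy .*? may expand past the first ">".
lazy = re.compile(r'<div id="page.*?">.{0,5}?</div>', re.DOTALL)

# Variant: [^>]* cannot cross the end of the opening tag.
bounded = re.compile(r'<div id="page[^">]*"[^>]*>[\s\S]{0,5}?</div>')


def matched_ids(pattern, text):
    """For each match, list the id attributes that ended up inside it."""
    return [re.findall(r'id="([^"]+)"', m.group(0)) for m in pattern.finditer(text)]


lazy_ids = matched_ids(lazy, html)
bounded_ids = matched_ids(bounded, html)
print(lazy_ids)     # [['page126-div'], ['page127-div', 'page128-div']]
print(bounded_ids)  # [['page126-div'], ['page128-div']]
```

The second list is what the question asks for: only the first and third divs.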

You can give this a try.
Scraper Series
/(?><div(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sid\s*=\s*(?:(['"])\s*page(?:(?!\1)[\S\s])*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+>)\s*[\S\s]{0,5}\s*<\/div\s*>/
https://regex101.com/r/x8jf8D/1
Formatted
(?>
< div # div tag
(?= # Assertion (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s id \s* = \s*
(?:
( ['"] ) # (1), Quote
\s* page # With 'id = "page XXX"
(?:
(?! \1 )
[\S\s]
)*
\1
)
)
\s+
(?:
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
>
)
\s* # Optional whitespaces (remove if necessary)
[\S\s]{0,5} # Optional 0-5 of anything (including wsp)
\s* # Optional whitespaces (remove if necessary)
</div \s* >

Related

Using regular expression extractor to extract a value?

I am trying to extract the value from the following code. Even though my regex expression is fine it is still not extracting the value.
token" value="(.+?)"
It does give me the exact match when I check it on regex101.com against this input:
<input type="hidden" name="token" value="GSYGEP2UUWOTMZ2SFV1G5D2M8L247KIG">
What should the regular expression be?
Your original regular expression is just fine:
value="(.+?)"
The problem might be additional spaces, or something else in the code around it. Try removing the token" prefix, or escaping the " characters if that is necessary in your tool.
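To check the pattern outside the extractor, here is a small Python sketch run against the input from the question. The second pattern is my own stricter variant (an assumption, not something from the thread): it ties the capture to the attribute named token and forbids quotes inside the value.

```python
import re

html = '<input type="hidden" name="token" value="GSYGEP2UUWOTMZ2SFV1G5D2M8L247KIG">'

# The pattern from the answer: capture whatever sits between the quotes.
loose = re.search(r'value="(.+?)"', html)

# Stricter variant (assumption): require name="token" right before value.
strict = re.search(r'name="token"\s+value="([^"]+)"', html)

token_loose = loose.group(1) if loose else None
token_strict = strict.group(1) if strict else None
print(token_loose)   # GSYGEP2UUWOTMZ2SFV1G5D2M8L247KIG
print(token_strict)  # GSYGEP2UUWOTMZ2SFV1G5D2M8L247KIG
```

If both patterns work here but not in your tool, the pattern itself is probably not the problem.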
Try this
<input(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sname\s*=\s*(['"])\s*token\s*\1)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\svalue\s*=\s*(['"])((?:(?!\2)[\S\s])*)\2)\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>
The Value content you're after is in Capture Group 3
https://regex101.com/r/HJhStT/1
https://regex101.com/r/8BWONb/1
Explained
< input # Input tag
(?= # Name attribute: Assert (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s name \s* = \s* # name =
( ['"] ) # (1), Quote
\s* token \s* # token
\1
)
(?= # Value attribute
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s value \s* = \s* # value =
( ['"] ) # (2), Quote
( # (3 start), value content
(?:
(?! \2 )
[\S\s]
)*
) # (3 end)
\2
)
# Just get rest of tag
\s+
(?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
>
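The point of the two lookaheads is that name and value can appear in any order and with either quote style. Below is a simplified Python sketch of the same idea. It is not the exact pattern above: it assumes no > appears inside attribute values, and the sample tags other than the first are made up for the demonstration.

```python
import re

TOKEN_INPUT = re.compile(
    r'''<input\b
        (?=[^>]*\sname\s*=\s*(["'])token\1)      # name = token, anywhere in the tag
        (?=[^>]*\svalue\s*=\s*(["'])(.*?)\2)     # value = ..., anywhere in the tag
        [^>]*>                                   # consume the rest of the tag
    ''',
    re.VERBOSE | re.IGNORECASE,
)

html = (
    '<input type="hidden" name="token" value="GSYGEP2UUWOTMZ2SFV1G5D2M8L247KIG">'
    "<input value='abc123' name='token' type='hidden'>"   # value before name
    '<input type="hidden" name="other" value="nope">'     # wrong name, skipped
)

# Group 3 holds the value content, as in the pattern above.
values = [m.group(3) for m in TOKEN_INPUT.finditer(html)]
print(values)  # ['GSYGEP2UUWOTMZ2SFV1G5D2M8L247KIG', 'abc123']
```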

Is there a regex expression to remove text between tags containing a specific word

I have made a regex to remove the text between <FormattingRule and </FormattingRule>.
Now I also want to include an extra condition: it must contain EdtJobEmpId.
Can someone assist me with this?
I have tried the following regex:
<FormattingRule(.|\n)*?<\/FormattingRule>
It can be found here: https://regex101.com/r/ttUMON/1
I want to remove the following text based on the extra condition:
<FormattingRule Action="OnChange">
<Triggers>
<Trigger PropertyName="${EdtJobEmpId}" />
</Triggers>
<Choose>
<When Condition="${EdtJobSkcId}==Empty">
<Assign PropertyName="${EdtJobSkcId.Value}" Value="=${EdtEmpSkcId.Value}" />
</When>
</Choose>
</FormattingRule>
This regex matches <FormattingRule> nodes only if they contain EdtJobEmpId:
(?s)<FormattingRule((?!/FormattingRule).)*EdtJobEmpId((?!/FormattingRule).)*\/FormattingRule>
See live demo.
It works by using the "dotall" flag (?s), so that . also matches line breaks, and the negative lookahead (?!/FormattingRule), so that the match cannot run past the end of the current tag.
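Here is a minimal Python sketch of the same technique, where re.DOTALL lets . cross line breaks and the lookahead keeps each match inside one rule. The second property name, EdtOtherId, is made up so there is a rule that must survive.

```python
import re

xml = """<FormattingRule Action="OnChange">
<Triggers>
<Trigger PropertyName="${EdtOtherId}" />
</Triggers>
</FormattingRule>
<FormattingRule Action="OnChange">
<Triggers>
<Trigger PropertyName="${EdtJobEmpId}" />
</Triggers>
</FormattingRule>
"""

RULE = re.compile(
    r'<FormattingRule\b'
    r'(?:(?!</FormattingRule>).)*?'   # anything, but never past the closing tag
    r'EdtJobEmpId'                    # the required word
    r'(?:(?!</FormattingRule>).)*?'
    r'</FormattingRule>\n?',          # closing tag plus its line break
    re.DOTALL,
)

cleaned = RULE.sub('', xml)
print(cleaned)
```

Only the rule that mentions EdtJobEmpId is removed. The first rule cannot match, because the lookahead stops the scan at its own closing tag before EdtJobEmpId is reached.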
There is no regular expression that will get this 100% right every time. For example, most attempts will be defeated by such things as comments, CDATA sections, and entity or character references in the source.
The right tool for this job is XSLT.
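In the same spirit, here is a sketch using an XML parser instead of a regex. Note this swaps XSLT for Python's xml.etree.ElementTree, so it is an illustration of the parser-based approach rather than the XSLT solution itself. The <Rules> wrapper, the EdtOtherId name and the mentions helper are assumptions for the example.

```python
import xml.etree.ElementTree as ET

doc = """<Rules>
<FormattingRule Action="OnChange">
<Triggers><Trigger PropertyName="${EdtJobEmpId}" /></Triggers>
</FormattingRule>
<FormattingRule Action="OnChange">
<Triggers><Trigger PropertyName="${EdtOtherId}" /></Triggers>
</FormattingRule>
</Rules>"""


def mentions(element, needle):
    """True if any attribute value or text inside element contains needle."""
    for node in element.iter():
        if needle in (node.text or ''):
            return True
        if any(needle in value for value in node.attrib.values()):
            return True
    return False


root = ET.fromstring(doc)
# Snapshot the tree first so removing children does not disturb the walk.
for parent in list(root.iter()):
    for rule in parent.findall('FormattingRule'):
        if mentions(rule, 'EdtJobEmpId'):
            parent.remove(rule)

remaining = ET.tostring(root, encoding='unicode')
print(remaining)
```

Comments and CDATA sections are handled by the parser here, which is exactly where the regex approaches are fragile.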
This is the way it is done.
If you think you will run into problems where your html/xml has
constructs that could hide markup like Comments or CDATA (or anything else)
and you are worried about it, let me know and I'll patch up this
regex with a couple of functions to consume those bad boys.
(?:<(?:(FormattingRule)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)(?:(?!</\1\s*>)[\S\s])*?EdtJobEmpId(?:[\S\s]*?</\1\s*>|(*SKIP)(*FAIL)))
https://regex101.com/r/Plih3R/1
Readable version
(?:
<
(?:
( # (1 start), End tag req'd
FormattingRule
) # (1 end)
(?:
\s+
(?>
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
)
(?:
(?! </ \1 \s* > )
[\S\s]
)*?
EdtJobEmpId
(?:
[\S\s]*? </ \1 \s* >
|
(*SKIP)(*FAIL)
)
)

how to match the iframe text, then skip and match another string in wordpress

I have this iframe code. I want to match the text right at the beginning of the string and then continue through the code to find the "soundcloud" text:
<iframe src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/297769462&color=%23ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&show_teaser=true" width="100%" height="166" frameborder="no" scrolling="no"></iframe>
My regex is (<iframe.*?><\/iframe>), which matches the iframe and anything in between.
What I want is to match the opening <iframe, then skip everything in between until it finds soundcloud. If both conditions are fulfilled, then it's a match.
Any help would be great thank you.
Try this
(?i)<iframe(?=((?:[^>"']|"[^"]*"|'[^']*')*?\s(src\s*=\s*(['"])(?:(?!\3)[\S\s])*?soundcloud(?:(?!\3)[\S\s])*\3)(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\1\s*</iframe\s*>
https://regex101.com/r/KkJH6x/1
Formatted
(?i) # Case insensitive modifier
< iframe # The iframe tag
(?= # Assertion (a pseudo atomic group)
( # (1 start)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s
( # (2 start), src attribute with 'soundcloud' in value
src \s* = \s*
( ['"] ) # (3), Quote
(?:
(?! \3 )
[\S\s]
)*?
soundcloud # 'Soundcloud'
(?:
(?! \3 )
[\S\s]
)*
\3 # Close quote
) # (2 end)
# The remainder of the tag parts
(?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
>
) # (1 end)
)
\1
\s*
</iframe \s* >
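For a quick test outside WordPress, here is a simplified Python sketch of the same idea: a lookahead checks that the src value contains soundcloud, then the tag is consumed. It assumes no > inside attribute values. The soundcloud URL is shortened from the question, and the second iframe is made up to show a non-match.

```python
import re

SOUNDCLOUD_IFRAME = re.compile(
    r'''<iframe\b
        (?=[^>]*\ssrc\s*=\s*(["'])       # src attribute, remember the quote
            (?:(?!\1).)*?soundcloud      # the word, before the closing quote
            (?:(?!\1).)*\1)
        [^>]*>                           # rest of the opening tag
        \s*</iframe\s*>                  # closing tag
    ''',
    re.VERBOSE | re.IGNORECASE,
)

html = (
    '<iframe src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/297769462" '
    'width="100%" height="166" frameborder="no" scrolling="no"></iframe>\n'
    '<iframe src="https://example.com/embed/123" width="560" height="315"></iframe>\n'
)

matches = [m.group(0) for m in SOUNDCLOUD_IFRAME.finditer(html)]
print(len(matches))  # 1
```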

How to remove a line break between two strings (RegEx)

I am trying to develop a quick hack in SublimeText2 (not ideal, I know):
I have this (frequent) code in my markup:
{% url '
someURL ' %}
How can I use regex to remove the line breaks such that I have {% url 'someURL ' %}?
I have succeeded in selecting everything between the brackets:
\{\%[\s\S]*?\%\}
However, I can't figure out how to select only the linebreaks \n and double spaces within it.
Use the below regex and then replace the match with a single space.
(?s)\s+(?=(?:(?!%}|\{%).)*%\})
DEMO
Explanation:
(?s) set flags for this block: . matches \n, case-sensitive, ^ and $ match normally, whitespace and # match normally
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or more times)
(?= look ahead to see if there is:
(?: group, but do not capture (0 or more times):
(?! look ahead to see if there is not:
%} '%}'
| OR
\{% '{%'
) end of look-ahead
. any character
)* end of grouping
%\} '%}'
) end of look-ahead
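This pattern also runs unchanged in Python's re module, so here is a small sketch to see what it does. The surrounding <p> markup is made up to show that whitespace outside the tag is left alone.

```python
import re

# Whitespace that is followed by "%}" with no "{%" or "%}" in between,
# i.e. whitespace that sits inside a {% ... %} tag.
WS_IN_TAG = re.compile(r'(?s)\s+(?=(?:(?!%}|\{%).)*%\})')

text = "<p>first line\nsecond line</p>\n{% url '\nsomeURL ' %}\n"
result = WS_IN_TAG.sub(' ', text)
print(result)
```

Since every whitespace run inside the tag becomes one space, the result contains {% url ' someURL ' %}, with a space where the line break was. If you want the break removed with nothing left behind, you need a pattern that matches only the line breaks.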
You can use this pattern:
(?:\G(?!\A)|\{%)[^%\r\n]*\K(?:\r?\n)+(?=[^%]*%\})
The replacement is an empty string.
This pattern ensures that you are always between the tags {% and %} by using the \G anchor, which matches the position at the end of the previous match.
The \K discards everything matched to its left from the match result, so only the CRLF or LF is removed.
This pattern can be improved if you want to allow % characters between tags:
(?:\G(?!\A)|\{%)(?:[^%\r\n]|%(?!\}))*\K(?:\r?\n)+(?=(?:[^%]|%(?!\}))*%\})
or more efficient (if it is possible with sublimetext):
(?:\G(?!\A)|\{%)(?>[^%\r\n]+|%+(?!\}))*\K(?:\r?\n)+(?=(?>[^%]+|%+(?!\}))*%\})
a little shorter (if sublimetext regex engine is smart enough):
(?:\G(?!\A)|{%)(?>[^%\r\n]+|%+(?!}))*\K\R+(?=(?>[^%]+|%+(?!}))*%})
Note: if you are sure that tags are always balanced, you can remove the last lookahead (but this way is less safe):
(?:\G(?!\A)|{%)(?>[^%\r\n]+|%+(?!}))*\K\R+
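Outside Sublime Text, note that Python's built-in re module supports neither \G nor \K. A rough equivalent there is to match each whole tag and clean it in a callback. This is a sketch of that alternative, not a translation of the patterns above; TAG and join_lines are illustrative names.

```python
import re

# Match one complete {% ... %} tag, line breaks included.
TAG = re.compile(r'\{%[\s\S]*?%\}')


def join_lines(match):
    """Drop line breaks inside a tag, with any spaces or tabs around them."""
    return re.sub(r'[ \t]*\r?\n[ \t]*', '', match.group(0))


text = "<a href=\"{% url '\nsomeURL ' %}\">link</a>\nnext line\n"
result = TAG.sub(join_lines, text)
print(result)
```

Line breaks outside the tags are untouched, because the callback only ever sees the text of a tag.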
(\{%.*)\n\s*(.*%\})
Using \1\2 as the replacement string, this pattern changes
{% url '
someURL ' %}
to {% url 'someURL ' %}

sublime text regex multiple parameters

I want to extract Parameter1 using a regex in Sublime Text. The other parameters will not be used.
Initial tags:
<description><![CDATA[<b>Parameter1</b></br></br>
This not to be copied and can be long]]></description>
This expression in Regex Sublime Text...
<description><!\[CDATA\[<b>(\w+)</b></br></br>(\w*)\]\]</description>
cannot find what I need (when I reach it stops finding)
Your regex doesn't match the test string.
There is whitespace between the words.
It also won't match non-word characters like punctuation.
Below are two regexes.
1. This is just to match your test string.
# <description>\s*<!\[CDATA\[\s*<b>([\s\w]+)</b>\s*</br>\s*</br>([\s\w]*)\]\]>\s*</description>
<description>
\s*
<!\[CDATA\[
\s*
<b>
( # (1)
[\s\w]+
)
</b> \s* </br> \s* </br>
( # (2)
[\s\w]*
)
\]\]>
\s*
</description>
2. This is how it should be done if your engine supports lookahead assertions.
# (?s)<description>\s*<!\[CDATA\[\s*<b>((?:(?!\]\]|\s*</b>).)+?)\s*</b>\s*</br>\s*</br>\s*((?:(?!\s*\]\]).)*)\s*\]\]>\s*</description>
(?s)
<description>
\s*
<!\[CDATA\[
\s*
<b>
( # (1)
(?:
(?! \]\] | \s* </b> )
.
)+?
)
\s* </b> \s* </br> \s* </br> \s*
( # (2)
(?:
(?! \s* \]\] )
.
)*
)
\s*
\]\]>
\s*
</description>
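If you want to check the idea outside Sublime Text, here is a minimal Python sketch run against the sample from the question. It is a simplified pattern of my own, not the exact one above, and it assumes the CDATA section closes with ]]> as in the sample.

```python
import re

xml = (
    "<description><![CDATA[<b>Parameter1</b></br></br>\n"
    "This not to be copied and can be long]]></description>"
)

DESCRIPTION = re.compile(
    r'<description>\s*<!\[CDATA\[\s*'
    r'<b>(.+?)\s*</b>'                 # group 1: the bold parameter
    r'\s*</br>\s*</br>\s*'
    r'(.*?)\s*\]\]>\s*</description>',  # group 2: the rest of the CDATA text
    re.DOTALL,
)

m = DESCRIPTION.search(xml)
first = m.group(1) if m else None
rest = m.group(2) if m else None
print(first)  # Parameter1
print(rest)   # This not to be copied and can be long
```

Group 1 is the part you are after, and group 2 can simply be ignored.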