Regex Pattern Matching at Beginning of String with BeautifulSoup - regex

I'm currently looking for a way to perform pattern matching via regex at the beginning of an HTML class name. The pattern I'm trying to match is:
"col-xs-.*"
Two examples of classes in the HTML page are:
<div class="col-xs-12 col-sm-12 col-lg-12">
<div class="mod-tiles__sizer col-xs-6 col-sm-4 col-lg-3">
The goal is to match only the first of these, since its class attribute actually starts with "col-xs-", which is what I am after. With my current regex I can't seem to single that element out. Currently I'm trying to match using the following regex pattern:
regex = re.compile('^col-xs-.*$')
soup.find_all("div", class_ = regex)
Unfortunately this pattern also matches the second div (where "col-xs-.*" appears in the middle of the class attribute and not at the start). Hopefully someone has a solution to this issue.

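I think you want a CSS attribute selector with the starts-with operator (^=), which specifies the prefix that the class attribute's value must begin with. Your regex matches the second div because BeautifulSoup treats class as a multi-valued attribute and tests the pattern against each class token separately, so "col-xs-6" satisfies ^col-xs- on its own.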
soup.select('[class^="col-xs-"]')
Example:
from bs4 import BeautifulSoup as bs
html = '''
<div class="col-xs-12 col-sm-12 col-lg-12">
<div class="mod-tiles__sizer col-xs-6 col-sm-4 col-lg-3">
'''
soup = bs(html, 'lxml')
classes = [' '.join(item['class']) for item in soup.select('[class^="col-xs-"]')]
print(classes)

I'm guessing that this expression might extract the desired classes:
import re
regex = r"[\"']\s*(\bcol-xs-[0-9]+\b[^\"']+?)\s*[\"']"
test_str = """
<div class="col-xs-12 col-sm-12 col-lg-12"><div class=" col-xs-12 col-sm-12 col-lg-12 ">
<div class="mod-tiles__sizer col-xs-6 col-sm-4 col-lg-3"><div class="col-xs-12 col-sm-12 col-lg-12">
<div class="mod-tiles__sizer col-xs-6 col-sm-4 col-lg-3">
"""
print(re.findall(regex, test_str, re.MULTILINE | re.IGNORECASE))
Output
['col-xs-12 col-sm-12 col-lg-12', 'col-xs-12 col-sm-12 col-lg-12', 'col-xs-12 col-sm-12 col-lg-12']
The expression is explained in the top-right panel of regex101.com if you wish to explore, simplify, or modify it; there you can also watch how it matches against some sample inputs.

If you want to find them without Beautiful Soup, this is the way to do it.
All div tags with a class attribute where col-xs- is at the beginning of the value:
Includes whitespace trimming.
r"(?i)<div(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\s)class\s*=\s*(?:(['\"])\s*(col-xs-(?:(?!\1)[\S\s])*?)\s*\1))\s+(?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]*?)+>"
https://regex101.com/r/rsXqI9/1
Formatted:
Class value is in group 2.
(?i)
< div
(?=
     (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
     (?<= \s )
     class \s* = \s*
     (?:
          ( ['"] )                      # (1)
          \s*
          (                             # (2 start)
               col-xs-
               (?:
                    (?! \1 )
                    [\S\s]
               )*?
          )                             # (2 end)
          \s*
          \1
     )
)
\s+
(?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
>
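As a rough check of the pattern above, here is a minimal Python sketch that runs it over a few sample tags and reads the class value from group 2. It assumes the trailing alternation is meant to read "[\S\s]*?" inside each pair of quotes; the sample markup, including the third div with padded single-quoted classes, is made up for illustration.

```python
import re

# Assumption: the quoted alternatives at the end use [\S\s]*? between the quotes.
DIV_RE = re.compile(
    r"""(?i)<div(?=(?:[^>"']|"[^"]*"|'[^']*')*?(?<=\s)class\s*=\s*"""
    r"""(?:(['"])\s*(col-xs-(?:(?!\1)[\S\s])*?)\s*\1))"""
    r"""\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>"""
)

# Hypothetical sample input: only the first and third tags have a class
# value that begins with col-xs- (ignoring surrounding whitespace).
sample_html = """
<div class="col-xs-12 col-sm-12 col-lg-12">
<div class="mod-tiles__sizer col-xs-6 col-sm-4 col-lg-3">
<div id="x" class=' col-xs-4 col-sm-2 '>
"""

# Group 2 holds the trimmed class value of each matching tag.
values = [m.group(2) for m in DIV_RE.finditer(sample_html)]
print(values)
```

For anything beyond quick checks, an HTML parser is usually the more robust choice; the regex is only reasonable when the markup is known to be simple.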

Related

Regex match multiple items

I have the following string:
<button {{ $attributes->class([
'bg-blue-600 hover:bg-blue-700 text-white px-3 py-2 rounded',
'bg-blue-600 px-3 py-2 hover:bg-blue-700 text-white rounded',
])->merge([
'wire:click' => $click,
]) }}>
{{ $label }}
</button>
I'm trying to get a VS Code extension (headwind) to match the stuff inside the class method single quotes via custom regex setting.
I have this regex which works a bit:
class\(([^)]*)\)
However, the problem is that it's matching everything inside of the brackets, which makes headwind mess up.
I need it to match each occurrence of the text inside the single quotes. How do I do this?
You can use
(?<=\\bclass\\(\\[\\s*'(?:[^']*'\\s*,\\s*')*)[^']+(?=')
I.e., the (?<=\bclass\(\[\s*'(?:[^']*'\s*,\s*')*)[^']+(?=') escaped version.
See the regex demo. Details:
(?<=\bclass\(\[\s*'(?:[^']*'\s*,\s*')*) - a position that is immediately preceded by the whole word class followed by ([, zero or more whitespaces, a ' char, and then zero or more repetitions of: zero or more chars other than ', a ' char, a comma enclosed in zero or more whitespaces, and another ' char
[^']+ - one or more chars other than '
(?=') - a location that is immediately followed with a ' char.
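The lookbehind above is variable-length, which Python's built-in re module does not accept. Purely as an illustration of the same extraction outside the editor, the sketch below swaps in a different technique: first locate the class([...]) call, then pull each single-quoted string out of it. The names CLASS_CALL_RE and QUOTED_RE are made up for this example, and it assumes the class lists contain no "])" and no escaped quotes.

```python
import re

blade = """<button {{ $attributes->class([
'bg-blue-600 hover:bg-blue-700 text-white px-3 py-2 rounded',
'bg-blue-600 px-3 py-2 hover:bg-blue-700 text-white rounded',
])->merge([
'wire:click' => $click,
]) }}>
{{ $label }}
</button>"""

# Step 1: capture everything between class([ and the first following ]).
CLASS_CALL_RE = re.compile(r"\bclass\(\[(.*?)\]\)", re.DOTALL)
# Step 2: capture the contents of each single-quoted string.
QUOTED_RE = re.compile(r"'([^']*)'")

class_strings = []
for call in CLASS_CALL_RE.finditer(blade):
    class_strings.extend(QUOTED_RE.findall(call.group(1)))

for value in class_strings:
    print(value)
```

The 'wire:click' key is not returned because it sits inside merge([...]), not class([...]).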

Regex find and replace between <div class="customclass"> and </div> tag

I can't find a working regex anywhere to find and replace the text between the div tags.
So there is this HTML where I want to select everything between the <div class="info"> and </div> tags and replace it with some other text:
<div class="extraUserInfo">
<p>Hello World! This is a sample text</p>
<javascript>.......blah blah blah etc etc
</div>
and replace it with
My custom text with some codes
<tags> asdasd asdasdasdasdasd</tags>
so it would look like
<div class="extraUserInfo">
My custom text with some codes
<tags> asdasd asdasdasdasdasd</tags>
</div>
Here is a refiddle with all my code; as you can see, I want to replace the whole block of code between the <div> and </div> tags.
http://refiddle.com/1h6j
Hope you get what I mean :)
If there's no nesting, I would just do a plain non-greedy (lazy) match:
(?s)<div class="extraUserInfo">.*?</div>
.*? matches any amount of any character (as few as possible) to meet </div>
Used s modifier for making the dot match newlines too.
Edit: Here is a JavaScript version without the s modifier:
/<div class="extraUserInfo">[\s\S]*?<\/div>/g
And replace with new content:
<div class="extraUserInfo">My custom...</div>
See example at regex101; Regex FAQ
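For comparison, a minimal Python sketch of the same lazy match and replace, assuming (as the answer does) that the div contains no nested div. re.DOTALL plays the role of the s modifier; the variable names and the shortened sample markup are illustrative only.

```python
import re

html = """<div class="extraUserInfo">
<p>Hello World! This is a sample text</p>
</div>"""

new_content = "My custom text with some codes\n<tags> asdasd asdasdasdasdasd</tags>"

# Lazy .*? stops at the first </div>; DOTALL lets . cross newlines.
DIV_RE = re.compile(r'(<div class="extraUserInfo">).*?(</div>)', re.DOTALL)

# A function replacement avoids backslash/group-reference surprises
# if the new content ever contains characters like \1 or \g.
result = DIV_RE.sub(
    lambda m: m.group(1) + "\n" + new_content + "\n" + m.group(2),
    html,
)
print(result)
```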

RegEx - Match only when a certain word occurs

It's probably super simple.
I want to match only where a certain word exists in between full <headers>
This is what i have so far.
(<h[d{1-6}](.itemprop.)(.*?)</h[d{1-6}]>)
I want it to match
<h1 class="test" itemprop="name">Test</h1>
AND
<h2 itemprop="name" class="test">Test</h2>
AND
<h6 class="test"><strong itemprop="Price">9,99</strong>Test</h6>
As it is now it only matches <h{1-6} itemprop etc
How about:
<h([1-6]).*?\bitemprop\b.*?</h\1>
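A quick way to try this out is the Python sketch below, which runs the pattern (without the stray closing parenthesis) against the three headers from the question plus one made-up header that has no itemprop attribute.

```python
import re

# \1 forces the closing tag to use the same heading level as the opening tag.
HEADER_RE = re.compile(r"<h([1-6]).*?\bitemprop\b.*?</h\1>")

samples = [
    '<h1 class="test" itemprop="name">Test</h1>',
    '<h2 itemprop="name" class="test">Test</h2>',
    '<h6 class="test"><strong itemprop="Price">9,99</strong>Test</h6>',
    '<h3 class="test">No matching attribute here</h3>',  # hypothetical negative case
]

matched = [s for s in samples if HEADER_RE.search(s)]
for s in matched:
    print(s)
```

Note that . does not match newlines by default, so a header spread over several lines would need the dot-all flag (re.DOTALL in Python).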

Find text between key phrases

I have a var that have some text in:
<cfsavecontent variable="foo">
element.password_input=
<div class="holder">
<label for="$${input_id}" > $${label_text}</label>
<input name="$${input_name}" id="$${input_id}" value="$${input_value}" type="password" />
</div>
# END element.password_input
element.text_input=
<div class="ctrlHolder">
<label for="$${element_id}" > $${element_label_text}</label>
<input name="$${element_name}" id="$${element_id}"
value="$${element_value}" type="text"
class="textInput" />
</div>
# END element.text_input
</cfsavecontent>
I am trying to parse through the var to get all of the different element types. Here is what I have so far:
ar = REMatch( "element\.+(.*=)(.*?)*", foo )
but it is only giving me this part:
element.text_input=
element.password_input=
any help will be appreciated.
Your immediate problem is that by default . doesn't include newlines - you would need to use the flag (?s) in your regex for it to do this.
However, simply enabling that flag still won't result in your present regex doing what you're expecting it to do.
A better regex would be:
(element\.\w+)=(?:[^##]+|##(?! END \1))+(?=## END \1)
You would then do ListFirst(match[i],'=') and ListRest(match[i],'=') to get the name and value. (rematch doesn't return captured groups).
(Obviously the #s above are doubled to escape them for CF.)
The above regex dissected is:
(element\.\w+)=
Match element. followed by one or more word characters, place it into capture group 1, then match the = character.
(?:
[^##]+
|
##(?! END \1)
)+
Match any number of non-hash characters, or a hash not followed by the ending token (using negative lookahead (?!...)) and referencing capture group 1 (\1), repeat as many times as possible (+), using a non-capturing group ((?:...)).
(?=## END \1)
Lookahead (?=...) to confirm the variable's ending token is present.
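The same idea can be tried outside ColdFusion. Below is a minimal Python sketch, with the hashes written singly because Python needs no # escaping; the shortened sample text and the names ELEMENT_RE and elements are illustrative. Splitting the full match on the first = mirrors the ListFirst/ListRest step described above.

```python
import re

foo = """element.password_input=
<div class="holder">
<label for="$${input_id}" > $${label_text}</label>
</div>
# END element.password_input
element.text_input=
<div class="ctrlHolder">
<label for="$${element_id}" > $${element_label_text}</label>
</div>
# END element.text_input
"""

# Same structure as the CF regex, with ## written as a single #.
ELEMENT_RE = re.compile(r"(element\.\w+)=(?:[^#]+|#(?! END \1))+(?=# END \1)")

elements = {}
for m in ELEMENT_RE.finditer(foo):
    # The name cannot contain "=", so the first "=" separates name and value.
    name, value = m.group(0).split("=", 1)
    elements[name] = value.strip()

for name in elements:
    print(name)
```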

preg_match backreference to find ending tag

I'm trying to use regex to parse a template I have. I want to find the _stop tag for the _start tag. I need to find the specific ones since there can be nested _stop and _start tags.
The regex I'm using is
/{(.*?)_start}.*{(\1_stop)}/s
and throwing that into a preg_match
And the template
<div data-role="collapsible-set" class="mfe_collapsibles" data-theme="c" data-inset="false">
{MakeAppointment_start}
<div id="appointmentHeading" data-action-id="appointmentNext" data-action-text="Next" data-a data-role="collapsible" data-collapsed="true" data-collapsed-icon="arrow-r" data-expanded-icon="arrow-d" data-iconpos="right">
<h3 class="collapsibleMainHeading">New {AppointmentTerm}</h3>
<p>
{AppointmentForm}
</p>
</div>
{MakeAppointment_stop}
{RegisterSection_start}
<div id="registerHeading" class="preRegistration" data-action-id="register" data-action-text="Register" data-role="collapsible" data-collapsed="true" data-collapsed-icon="arrow-r" data-expanded-icon="arrow-d" data-iconpos="right">
<h3 class="collapsibleMainHeading">Register</h3>
<p>
{RegisterForm}
</p>
</div>
{RegisterSection_stop}
<div data-role="collapsible" class="preRegistration" data-collapsed="true" data-collapsed-icon="arrow-r" data-expanded-icon="arrow-d" data-iconpos="right">
<h3 class="collapsibleMainHeading">Login</h3>
<p>
{LoginForm}
</p>
</div>
</div>
</div>
The results are
Array
(
[0] => {MakeAppointment_start}
<div id="appointmentHeading" data-action-id="appointmentNext" data-action-text="Next" data-a data-role="collapsible" data-collapsed="true" data-collapsed-icon="arrow-r" data-expanded-icon="arrow-d" data-iconpos="right">
<h3 class="collapsibleMainHeading">New {AppointmentTerm}</h3>
<p>
{AppointmentForm}
</p>
</div>
{MakeAppointment_stop}
[1] => MakeAppointment
[2] => MakeAppointment_stop
)
Index 0 is correct; however, 1 and 2 are not. 1 should have the register tags and content, and 2 should not exist.
What am I doing wrong here?
Firstly, preg_match only returns one match. Use preg_match_all instead. Secondly, the indices 1 and 2 you get are your capturing groups. You can simply ignore them, although your second capturing group is quite redundant; you could just remove the second pair of parentheses in your regex. Using preg_match_all will yield the full match and all capturing groups for all matches.
I also think you should escape your { and }, since they are regex metacharacters. The engine does not choke on them here because PCRE treats a { that does not begin a valid quantifier as a literal character, but I think it is better practice to just escape them anyway.
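To illustrate the two suggestions (collect every match, escape the braces), here is a small Python sketch in which re.finditer stands in for preg_match_all. The template is a shortened, made-up version of the one in the question. Using \w+ for the tag name and a lazy .*? for the body are assumptions of this sketch, not part of the original regex.

```python
import re

template = """<div data-role="collapsible-set">
{MakeAppointment_start}
<h3>New {AppointmentTerm}</h3>
{MakeAppointment_stop}
{RegisterSection_start}
<h3>Register</h3>
{RegisterSection_stop}
</div>"""

# Braces escaped; \1 ties each _stop tag to the name captured at _start.
SECTION_RE = re.compile(r"\{(\w+)_start\}(.*?)\{\1_stop\}", re.DOTALL)

# Map each section name to its body, one entry per match.
sections = {m.group(1): m.group(2).strip() for m in SECTION_RE.finditer(template)}

for name, body in sections.items():
    print(name, "->", body)
```

Because of the backreference, a differently named section nested inside another stays part of the outer section's body and is not cut off at the inner _stop tag.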