Regex Pattern for document indexed

I have a long document whose titles are numbered hierarchically in ascending order, for example 8.1, 8.1.1, ... 8.1.1.1.1.1.1, such as:
<h1 class="topicTitle-h1">8.12.1.1.12.1.1 title03</h1>
<h1 class="topicTitle-h1">8.1 title01</h1>
<h1 class="topicTitle-h1">8.1.1.1.1.1.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1 title02</h1>
<h1 class="topicTitle-h1">8.1.1.1.1.2.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1.3.2.3.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1.1.1 title05</h1>
<h1 class="topicTitle-h1">8.1.4.2.5.9.3 title03</h1>
<h1 class="topicTitle-h1">8.1.1.1.1.1 title06</h1>
<h1 class="topicTitle-h1">8.1.11.12.14.3.1 title03</h1>
I tried to get only the title03 entries with the regular expression re.search(r'\">\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3} (.*)</h1>',x), but it matches all of the titles without exception instead of matching only those numbered d.d.d.d.d.d.d (seven levels).
Thanks in advance.

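Use:
r'">\d{1,3}(?:\.\d{1,3}){6} (.*)</h1>'
The escaped \. matches only a literal dot, and (?:\.\d{1,3}){6} requires exactly six further number groups after the first, for seven levels in total.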
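As a minimal sketch of how the pattern above behaves, the snippet below runs it over a few of the sample lines from the question. The names `SEVEN_LEVELS`, `lines`, and `seven_level_titles` are made up for this illustration and are not part of the original answer.

```python
import re

# A few of the sample lines from the question.
lines = [
    '<h1 class="topicTitle-h1">8.12.1.1.12.1.1 title03</h1>',
    '<h1 class="topicTitle-h1">8.1 title01</h1>',
    '<h1 class="topicTitle-h1">8.1.1.1.1.1.1 title03</h1>',
    '<h1 class="topicTitle-h1">8.1.1 title02</h1>',
    '<h1 class="topicTitle-h1">8.1.1.1.1 title05</h1>',
    '<h1 class="topicTitle-h1">8.1.1.1.1.1 title06</h1>',
]

# One number, then exactly six more ".number" groups: seven levels in total.
SEVEN_LEVELS = re.compile(r'">\d{1,3}(?:\.\d{1,3}){6} (.*)</h1>')


def seven_level_titles(html_lines):
    """Return the title text of every line whose index has seven levels."""
    titles = []
    for line in html_lines:
        match = SEVEN_LEVELS.search(line)
        if match:
            titles.append(match.group(1))
    return titles


print(seven_level_titles(lines))
```

Only the two lines with a seven-level index contribute a title; the shorter indexes are skipped because the sixth `\.\d{1,3}` group cannot be satisfied.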

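Try it:
re.search(r'\">\d{1,3}(\.\d{1,3})* (.*)</h1>',x)
Note that * accepts any number of levels, so this matches every title rather than only the seven-level ones; the title text is in group 2.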

Related

Why doesn't this regexp work for this HTML?

<div class="_1zGQT _2ugFP message-in">
<div class="-N6Gq">
<div class="copyable-text" data-pre-plain-text="[18:09, 3.6.2019] Лера сестра: ">
<div class="_12pGw">
<div class="_3X58t selectable-text invisible-space copyable-text">
<span class="_2ZDCk">
<img crossorigin="anonymous" src="URL" alt="😆" draggable="false" class="_298rb _2FANH selectable-text invisible-space copyable-text" data-plain-text="😆" style="visibility: visible;">
</span>
</div>
</div>
</div>
</div>
</div>
I tried to get it with this code:
soup.find('div', class_=re.compile('^selectable-text invisible-space copyable-text'))
All I got: None.
The problem is that part of the class (_3X58t) keeps changing.

This is likely due to the ^ anchor, which only allows a match at the very start of the string. We could drop it:
soup.find('div', class_=re.compile('selectable-text invisible-space copyable-text'))
or we might try this expression for the divs:
(.+?selectable-text invisible-space copyable-text)
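To see the effect of the anchor in isolation, the sketch below applies both patterns to the div's full class string using only the `re` module. It does not reproduce how BeautifulSoup passes class values to the pattern; the variable names are illustrative only.

```python
import re

# The full class attribute of the target div, as it appears in the question.
class_attr = "_3X58t selectable-text invisible-space copyable-text"

anchored = re.compile(r"^selectable-text invisible-space copyable-text")
unanchored = re.compile(r"selectable-text invisible-space copyable-text")

# With ^ the match must begin at index 0, where the changing "_3X58t" prefix sits.
anchored_match = anchored.search(class_attr)

# Without ^ the pattern may match anywhere in the string.
unanchored_match = unanchored.search(class_attr)

print(anchored_match)
print(unanchored_match.group(0))
```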

I would first see whether a single class from the compound class list could be used, e.g.
soup.select_one('.selectable-text')
Otherwise, combine the classes:
soup.select_one('[class$="selectable-text invisible-space copyable-text"]')
rather than resorting to regex.

Non-breaking space with Django template code and Bootstrap 4 badges

I am trying to keep text generated with Django template language which is contained within a Bootstrap 4 badge together with some additional text that is not contained in the badge.
Here is my code:
<span>Submitted by: <span class="badge badge-primary">{{ user.username }}</span></span>
I want all the words in the phrase "Submitted by USER" to always be on the same line, but the code above does not achieve that. Any idea what is wrong?

Add the class text-nowrap to the outer <span> element and remove the unnecessary &nbsp;.
As the name suggests, text-nowrap in Bootstrap 4 prevents wrapping.
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css" integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
<div class="container">
<div class="row">
<div class="col-4 bg-success">
<span class="text-nowrap">Submitted by: <span class="badge badge-primary">Usernameverylongusernameevenlongerthanthat</span></span>
</div>
</div>
</div>

phpQuery returning wrong results with regex

$html = '<html>
<body>
<div id="dupe_1">1
<div class="dupe_1.1">1.1</div>
<div id="dupe_1.2">1.2</div>
</div>
<div id="dupe_2">2
<div class="dupe_2.1">2.1</div>
<div id="dupe_2.2">2.2</div>
<div>extra</div>
</div>
</body>
</html>';
$html = phpQuery::newDocumentHTML($html);
$node = pq('div:regex(id,^dupe_\d+$)',$html);
echo count($node);
This returns 7, that is, all the divs. It should return only 2 divs (dupe_1 and dupe_2).

I would avoid doing things like that; you should be able to get those with a CSS selector:
[id*=dupe_]:not([id*="."])
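As a rough check of the intent, the sketch below mimics both approaches in plain Python over the id values of the seven divs from the question. It does not run phpQuery or a CSS engine; `div_ids`, `by_regex`, and `by_selector` are names invented for this illustration, and the selector is approximated as "id contains dupe_ and contains no dot".

```python
import re

# id attribute of each of the 7 divs in the question's HTML (None = no id).
div_ids = ["dupe_1", None, "dupe_1.2", "dupe_2", None, "dupe_2.2", None]

# The pattern from the question: "dupe_" followed only by digits.
ID_PATTERN = re.compile(r"^dupe_\d+$")

# What the regex filter is expected to keep.
by_regex = [i for i in div_ids if i is not None and ID_PATTERN.search(i)]

# Approximation of [id*=dupe_]:not([id*="."]) using substring checks.
by_selector = [
    i for i in div_ids
    if i is not None and "dupe_" in i and "." not in i
]

print(by_regex)
print(by_selector)
```

Both filters keep the same two ids, which is the result the question expects.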

Nested floats do not work in CFDOCUMENT css

The HTML below was placed inside a <cfdocumentitem type="header"> block, but the output is empty.
<div class="grid">
<div class="span5">
<div class="span5">
Label1
</div>
<div class="span5">
Data1
</div>
</div>
<div class="span5">
<div class="span5">
Label2
</div>
<div class="span5">
Data2
</div>
</div>
<div style="clear:both"></div>
</div>
But when I remove the nested 'class="span5"' divs and put some content there, it works fine. Is there a problem with nested floats in cfdocument?

Unfortunately, CSS support in CFDOCUMENT is kind of hit or miss.
Two rules to follow that might help:
1. Make sure your HTML validates as XHTML 1.0 Transitional.
2. Import your style sheets using
<style type="text/css" media="screen">@import "style.css";</style>
This same information can be found here: http://rip747.wordpress.com/2007/09/10/cfdocument-it-works-if-you-know-how/

UnicodeDecodeError in template

I get the following error code when trying to load the template.
'utf8' codec can't decode byte 0x94 in position 720: invalid start byte
Here is the template:
{% extends "base.html" %}
{% block site_wrapper %}
<div id="main">
Skip to main content
<div id="banner">
<div class="bannerIEPadder">
<div class="cart_box">
[link to cart here]
</div>
Modern Musician
</div>
</div>
<div id="navigation">
<div class="navIEPadder">
[navigation here]
</div>
</div>
<div id="middle">
<div id="sidebar">
<div class="sidebarIEPadder">
[search box here]
<br/>
[category listing here]
</div>
</div>
<div id="content">
<a name=”content”></a>
<div class="contentIEPadder">
{% block content %}{% endblock %}
</div>
</div>
</div>
<div id="footer">
<div class="footerIEPadder">
[footer here]
</div>
</div>
</div>
{% endblock %}

In UTF-8, 0x94 is not a valid start byte; in Windows-1252 (CP1252), however, it is a right double quote (”). Generally speaking, the plain quote (") is much safer.
Make sure you're not copying and pasting this out of some blog that has curly quotes, accented characters, or something like that.
If you're using a text editor, save the file as ASCII and see what goes missing.
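The byte behaviour described above can be reproduced with a short sketch. The variable names are illustrative; `cp1252` is Python's codec name for Windows-1252.

```python
raw = b"\x94"

# 0x94 is a continuation byte in UTF-8, so it cannot start a character.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# The same byte is a right double quotation mark in Windows-1252.
as_cp1252 = raw.decode("cp1252")

print(utf8_error)
print(repr(as_cp1252), hex(ord(as_cp1252)))
```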

You have curly double quotes in the <a name=”content”> tag inside div#content; try replacing them with ASCII quotes.
Maybe your template is encoded with something other than UTF-8? That depends on your terminal, editor, or OS settings.

I had some strange characters in my code because I copied it out of a PDF file.

I had this same error, and it turned out that the problem was a "©" in my source, copied in as part of a template.
You have to check the code for strange characters.
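One way to hunt for such characters is to scan the template's raw bytes for anything outside ASCII. The helper below is a hypothetical sketch (the name `find_non_ascii` and the sample bytes are assumptions for illustration); columns are counted in bytes, not characters.

```python
def find_non_ascii(data: bytes):
    """Return (line_number, column, byte_value) for every byte >= 0x80."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte >= 0x80:
                hits.append((line_no, col, byte))
    return hits


# Sample bytes imitating a template saved as Windows-1252 with curly quotes.
sample = b'<div id="content">\n<a name=\x94content\x94></a>\n'

for line_no, col, byte in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: byte 0x{byte:02x}")
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function.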
