'TypeError: expected string or buffer' when performing re.sub on beautiful soup result set iteration - regex

content_a is a Beautiful Soup result set (i.e. its type is <class 'bs4.element.ResultSet'>) made up of values whose type is <class 'bs4.element.Tag'>.
If I print content_a I get:
[<div class="class1 class2">Here is the first sentence.
<br/> <br/> Here is some text "and some more text."
<br/> <br/> Here is another sentence.
<br/> Text<br/><span class="class3">Text</span></div>, <div class="class1 class2">Here is the first sentence.
<br/> <br/> Here is some text "and some more text."
<br/> <br/> Here is another sentence.
<br/> Text<br/><span class="class3">Text</span></div>, etc
So it seems to me it should be a simple iterable list of divs.
I want to replace <div class="class1 class2"> with <div class="class1 class2"><p> (my eventual goal being to replace all <br/> tags with paragraph tags).
In my test, where the source content is a list of strings, I have:
import re
blablabla = ['<div class="class1 class2">', '<div class="class1 class2">']
for _ in blablabla:
    _ = re.sub('(<div class=\"class1 class2\">)', r"\1<p>", _)
    print _
which returns, as required:
<div class="class1 class2"><p>
<div class="class1 class2"><p>
I am trying to perform the same process on each item in content_a with:
import re
for _ in content_a:
    _ = re.sub('(<div class=\"class1 class2\">)', r"\1<p>", _)
    print _
but am getting the error:
...in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
So the only difference that I can tell between the two examples is that one is a Beautiful Soup result set and the other is just a plain list.
Can anyone see why this error could be occurring?
Edit:
Someone has pointed out here that sub requires a string as its third argument, whereas the third argument I am passing is the iterated value, which is of type <class 'bs4.element.Tag'>. So perhaps this is the problem. But I need to retain the nature of these values for later modification, so I am not sure how to proceed at the moment.
Update/Workaround:
Just to save someone spending time on an answer: I figured out a workaround. I realised I could adjust the content later in the process, after converting it to a string with read(), and could then perform all the re.sub changes on the required elements in that string.
The little regex I came up with was:
string = re.sub('([^\r]*)\r', r'\1</p>\n<p>', string)

As suggested, I am posting the workaround I used as the solution.
I realised I could adjust the content later in the process: I converted it to a string with read() and then performed all the re.sub changes on the required elements in that string.
The regex I came up with was:
string = re.sub('([^\r]*)\r', r'\1</p>\n<p>', string)
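The cause noted in the question's edit, that re.sub needs a string, can be reproduced without Beautiful Soup. Below is a minimal sketch for Python 3 (where the error message is worded differently from the Python 2 traceback above). FakeTag is a hypothetical stand-in for bs4.element.Tag, used only so the example has no third-party dependency; it assumes, as with a real Tag, that str() returns the element's markup.

```python
import re


class FakeTag:
    """Hypothetical stand-in for bs4.element.Tag: not a str, but str() yields markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


PATTERN = r'(<div class="class1 class2">)'
content_a = [FakeTag('<div class="class1 class2">text</div>')]


def add_paragraph(item):
    # re.sub only accepts str (or bytes), so convert the object first.
    return re.sub(PATTERN, r"\1<p>", str(item))


try:
    re.sub(PATTERN, r"\1<p>", content_a[0])  # passing the object itself
    raised = False
except TypeError:
    raised = True

converted = [add_paragraph(item) for item in content_a]
print(raised)
print(converted[0])
```

The result is a list of plain strings. If the Tag objects themselves are needed later, the strings would have to be re-parsed, or the change made through the parser's own API instead of a regex.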

Yesod Hamlet breaks HTML by replacing single quotes with double quotes

I have some HTML code that I'm using in Hamlet:
<div .modal-card .card data-options='{"valueNames": ["name"]}' data-toggle="lists">
Notice that the single quotes for data-options allows the use of double quotes inside the string.
The problem is that when Hamlet renders the page, Hamlet puts " around the ' and so the HTML is broken:
<div class="modal-card card" data-options="'{" valuenames":"="" ["name"]}'="" data-toggle="lists">
Some external JS library plugin code runs, it tries to parse the JSON inside data-options and fails.
How can I tell Hamlet to include a literal string?
I've tried various combinations of:
let theString = "{\"valueNames\": [\"name\"]}"
let theString2 = "data-options='{\"valueNames\": [\"name\"]}'"
etc
And in the hamlet file:
<div .modal-card .card data-options='#{ preEscapedText theString }' data-toggle="lists">
or
<div .modal-card .card #{ preEscapedText theString2 } data-toggle="lists">
But all attempts produce invalid HTML or invalid JSON inside the string.
How can I instruct Hamlet to simply include a literal string in the output HTML?
Update:
Tried more things, no result.
The theString2 example doesn't work because Hamlet seems to think that I'm trying to set id="{" as per https://www.yesodweb.com/book/shakespearean-templates#shakespearean-templates_attributes
Why not render the JSON escaped (" becomes &quot;) and “handle” the quotes later when parsing?
Interpolate in Hamlet:
<div #the-modal .modal-card .card data-options='#{theString}' data-toggle="lists">
Parse the data attribute as JSON:
let json = document.getElementById("the-modal").getAttribute("data-options");
let opts = JSON.parse(json); // At least in Chrome, it works!
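The escape-then-parse round trip can also be checked outside the browser. The sketch below uses Python's standard library as a stand-in for both sides: html.escape plays the role of the template's escaping and html.parser plays the role of the browser. It is only an illustration of why &quot; inside the attribute is harmless, not a description of what Hamlet does internally; the markup string and the AttrReader class are made up for the example.

```python
import html
import json
from html.parser import HTMLParser

options = {"valueNames": ["name"]}

# Escape the JSON for use inside a double-quoted attribute (" becomes &quot;).
escaped = html.escape(json.dumps(options), quote=True)
markup = '<div id="the-modal" data-options="%s" data-toggle="lists"></div>' % escaped


class AttrReader(HTMLParser):
    """Collects the data-options value as an HTML parser reports it."""

    def __init__(self):
        super().__init__()
        self.data_options = None

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-options":
                # Character references in attribute values are already decoded here.
                self.data_options = value


reader = AttrReader()
reader.feed(markup)
reader.close()

decoded = json.loads(reader.data_options)
print(escaped)
print(decoded == options)
```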
As for theString2 alternative, you can also interpolate attributes in Hamlet using a tuple or list of tuples and the star symbol:
let dataOptions = ("data-options", "{\"valueNames\": [\"name\"]}") :: (Text, Text)
...
<div #the-modal .modal-card .card *{dataOptions} data-toggle="lists">

RegExp replace all but selected

So I'm trying to erase everything except the matched text in this 1900-line document with Notepad++ regex Find/Replace, so that only the file names remain, which should shorten it to roughly 1000 lines or fewer. I know the pattern that selects the text, (?<=/images/item/)(.*)(?=" a), but I don't know how to make it erase everything that doesn't match. Here's a portion of the document.
Using Notepad++, it would find and select abyssal-scepter.gif, aegis-of-the-legion.gif, etc.
<img src="/images/item/abyssal-scepter.gif" alt="LoL Item: Abyssal Scepter"><br> <div id="id_77" class="tier-wrapper drag-items health magic-resist health-regen champ-box float-left ajax-tooltip {t:'Item',i:'77'} classic-and-dominion filter-is-dominion filter-is-classic filter-tier-advanced filter-bonus-aura filter-category-health filter-category-magic-resist filter-category-health-regen ui-draggable ui-draggable-handle">
<img src="/images/item/aegis-of-the-legion.gif" alt="LoL Item: Aegis of the Legion"><br> <div id="id_235" class="tier-wrapper drag-items ability-power movement champ-box float-left ajax-tooltip {t:'Item',i:'235'} filter-tier-advanced filter-bonus-unique-passive filter-category-ability-power filter-category-movement ui-draggable ui-draggable-handle">
<img src="/images/item/aether-wisp.gif" alt="LoL Item: Aether Wisp"><br>
<div class="info">
<div class="champ-name">Aether Wisp</div>
<div class="champ-sub">
<img src="/images/gold.png" alt="Item Cost" style="width:16px; vertical-align:middle;"> 850 / 415
</div>
</div>
</div>
<div id="id_21" class="tier-wrapper drag-items ability-power champ-box float-left ajax-tooltip {t:'Item',i:'21'} classic-and-dominion filter-is-dominion filter-is-classic filter-tier-basic filter-category-ability-power ui-draggable ui-draggable-handle">
<img src="/images/item/amplifying-tome.gif" alt="LoL Item: Amplifying Tome"><br>
<div class="info">
<div class="champ-name">Amplifying Tome</div>
<div class="champ-sub">
I'm not familiar with regular expressions, so to summarize, I need the result to look like this:
abyssal-scepter.gif
aegis-of-the-legion.gif
aether-wisp.gif
amplifying-tome.gif
Thank you for your time
A Notepad++ solution:
Find what : .*?/images/item/(.*?)"|.*
Replace with : $1\n
Search mode : Regular expression (with ". matches newline" checked)
The result will have an extra linefeed at the end.
But that shouldn't pose a problem I suppose.
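If the file can be processed outside Notepad++, the same capture group does the job on its own, with no need to match and discard the rest. A short sketch in Python, assuming the document has been read into a string; the html_text sample here is abbreviated from the excerpt in the question.

```python
import re

html_text = (
    '<img src="/images/item/abyssal-scepter.gif" alt="LoL Item: Abyssal Scepter"><br>\n'
    '<img src="/images/gold.png" alt="Item Cost">\n'
    '<img src="/images/item/aether-wisp.gif" alt="LoL Item: Aether Wisp"><br>\n'
)

# Capture everything between "/images/item/" and the next double quote.
file_names = re.findall(r'/images/item/(.*?)"', html_text)

# One file name per line, as in the desired output.
print("\n".join(file_names))
```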
Maybe this can help, or maybe not, since you dropped the JavaScript tag from your original post:
<script type="text/javascript">
    var thestring = "<img src=\"/images/item/aegis-of-the-legion.gif\" alt=\"LoL Item: Aegis of the Legion\"><br>";
    var thestring2 = "<img src=\"/images/otherstuff/aegis-of-the-legion.gif\" alt=\"LoL Item: Aegis of the Legion\"><br>";
    function ParseIt(incomingstring) {
        var pattern = /"\/images\/item\/(.*)" /;
        if (pattern.test(incomingstring)) {
            return pattern.exec(incomingstring)[1];
        }
        else {
            return "";
        }
        //return pattern.test(incomingstring) ? pattern.exec(incomingstring)[1] : "";
    }
</script>
Calling ParseIt(thestring) returns "aegis-of-the-legion.gif"
Calling ParseIt(thestring2) returns ""
Since you are doing this in NP++, this works for me. In cases like this, where speed and results are more important than specific technique, I'll usually run several regexes. First, I'll search for > and replace it with >\n, which puts each tag on its own line for simpler processing. Then replacing ^>*<.*?".*?/?([\w\d\-_]+\.\w{2,4})?".*>.*$ with $1 will extract all the filenames from the tags, removing the unneeded text. Next, to clear all the tags that didn't have a filename in them, just replace <.*> with an empty string. Finally, use Edit>Line Operations>Remove empty lines, and you'll have the result you're looking for. It's not a 100% regex solution, but this is a one-time action that you just need a simple result from.

Need regex extraction help in vb.net

I have an HTML source and need to retrieve the values between <h1> tags, e.g. the 15 in <h1>15</h1>.
The <h1> tag appears many times in the full HTML code.
Below is a sample portion of the HTML code:
<div class="rs_text_11_may">
<p>Rs</p>
<h1>15</h1>
</div>
I have tried a lot but haven't managed it. Please help.
You need the following regex: (?<=<h1>).+?(?=</h1>) This works fine in C#.
Regex regex = new Regex(@"(?<=<h1>).+?(?=</h1>)");
MatchCollection mc = regex.Matches(@"<h1>15</h1><h1>14</h1>", 0);
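The lookbehind and lookahead in that pattern match the surrounding tags without including them in the result. A quick way to see the effect is the Python sketch below, which uses the same pattern; the html_text value is a made-up sample combining the question's snippet with a second <h1>.

```python
import re

html_text = '<div class="rs_text_11_may"><p>Rs</p><h1>15</h1></div><h1>14</h1>'

# (?<=<h1>) requires "<h1>" just before the match, (?=</h1>) requires "</h1>" just after;
# neither is part of the returned text.
values = re.findall(r"(?<=<h1>).+?(?=</h1>)", html_text)
print(values)
```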
Try this (a far better way is to use LINQ for this not RegEx)
Public Function test() As Boolean
    Try
        Dim xe As XElement = <div class="rs_text_11_may">
                                 <p>Rs</p>
                                 <h1>15</h1>
                             </div>
        Dim h1s = xe...<h1>
        For Each ele As XElement In h1s
            MsgBox(ele.Value)
        Next
        Return True
    Catch ex As Exception
        Return False
    End Try
End Function

Add a newline after each closing html tag in web2py

Original
I want to parse a string of html code and add newlines after closing tags + after the initial form tag. Here's the code so far. It's giving me an error in the "re.sub" line. I don't understand why the regex fails.
def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    result = re.sub("(</.*?>)", "\1\n", tags)
    return dict(form_code=result)
PS. I have a feeling this might not be the best way... but I still want to learn how to do this.
EDIT
I was missing "import re" from my default.py. Thanks ruakh for this.
import re
Now my page source code shows up like this (inspected in client browser). The actual page shows the form code as text, not as UI elements.
<form><label for="email_field">Email:</label>
<input type="email" name="email_field"/><label
for="password_field">Password:</label>
<input type="password" name="password_field"/><input
type="submit" value="Login"/></form>
EDIT 2
The form code is rendered as UI elements after adding XML() helper into default.py. Thanks Anthony for helping. Corrected line below:
return dict(form_code=XML(result))
FINAL EDIT
I figured out the regex fix myself. This is not an optimal solution, but at least it works. The final code:
import re
def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    tags = re.sub(r"(<form>)", r"<form>\n ", tags)
    tags = re.sub(r"(</.*?>)", r"\1\n ", tags)
    tags = re.sub(r"(/>)", r"/>\n ", tags)
    tags = re.sub(r"( </form>)", r"</form>\n", tags)
    return dict(form_code=XML(tags))
The only issue I see is that you need to change "\1\n" to r"\1\n" (using the "raw" string notation); otherwise \1 is interpreted as an octal escape (meaning the character U+0001). But that shouldn't give you an error, per se. What error-message are you getting?
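The difference between the two spellings is easy to see side by side. A small sketch (Python 3, with a shortened tags string made up for the example):

```python
import re

tags = '<label for="email_field">Email:</label><input type="email"/>'

# Non-raw: Python turns "\1" into the control character U+0001 before re sees it.
wrong = re.sub("(</.*?>)", "\1\n", tags)

# Raw: re receives a backslash and a 1, i.e. a reference to group 1.
right = re.sub(r"(</.*?>)", r"\1\n", tags)

print(repr(wrong))
print(repr(right))
```

In the non-raw case the closing tag is silently replaced by the control character, so no exception is raised; the output is just wrong.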
By default, web2py escapes all text inserted in the view for security reasons. To avoid that, simply use the XML() helper, either in the controller:
return dict(form_code=XML(result))
or in the view:
{{=XML(form_code)}}
Don't do this unless the code is coming from a trusted source; otherwise it could contain malicious JavaScript.

Pythonic way to find a regular expression match

Is there a more succinct/correct/pythonic way to do the following:
url = "http://0.0.0.0:3000/authenticate/login"
re_token = re.compile("<[^>]*authenticity_token[^>]*value=\"([^\"]*)")
for line in urllib2.urlopen(url):
    if re_token.match(line):
        token = re_token.findall(line)[0]
        break
I want to get the value of the input tag named "authenticity_token" from an HTML page:
<input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />
Could you use Beautiful Soup for this? The code would essentially look something like so:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://0.0.0.0:3000/authenticate/login"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
# find() returns the <input> tag; its value attribute holds the token
token = soup.find("input", {'name': 'authenticity_token'})['value']
Something like that should work. I didn't test this but you can read the documentation to get it exact.
You don't need the findall call. Instead use:
m = re_token.match(line)
if m:
    token = m.group(1)
    ....
I second the recommendation of BeautifulSoup over regular expressions though.
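Put together, the loop from the question might look like the sketch below (Python 3). Two things here are assumptions rather than part of the answers above: it uses search instead of match, because match only succeeds when the tag starts at the very beginning of the line, and it reads from a hypothetical list of lines (page) instead of a live URL so that it runs on its own.

```python
import re

RE_TOKEN = re.compile(r'<[^>]*authenticity_token[^>]*value="([^"]*)')


def find_token(lines):
    """Return the first captured token value, or None if no line matches."""
    for line in lines:
        m = RE_TOKEN.search(line)  # search() also finds indented tags
        if m:
            return m.group(1)
    return None


# Hypothetical page content standing in for urlopen(url).
page = [
    "<html>",
    '  <input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />',
    "</html>",
]

token = find_token(page)
print(token)
```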
There's nothing "pythonic" about using a regex. If you don't want to use BeautifulSoup (which you ideally should), just use Python's excellent string manipulation capabilities:
for line in open("file"):
    line = line.strip()
    if "<input name" in line and "value=" in line:
        item = line.split()
        for i in item:
            if "value" in i:
                print i
output
$ more file
<input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />
$ python script.py
value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4="
As to why you shouldn't use regular expressions to search HTML, there are two main reasons.
The first is that HTML is defined recursively, and regular expressions, which compile into stackless state machines, don't do recursion. You can't write a regular expression that can tell, when it encounters an end tag, which of the start tags it passed on the way that end tag belongs to; there's nowhere to save that information.
The second is that parsing HTML (which BeautifulSoup does) normalizes all kinds of things that are allowable in HTML and that you're probably not going to ever consider in your regular expressions. To pick a trivial example, what you're trying to parse:
<input name="authenticity_token" type="hidden" value="xxx"/>
could just as easily be:
<input name='authenticity_token' type="hidden" value="xxx"/>
or
<input type = "hidden" value = "xxx" name = 'authenticity_token' />
or any one of a hundred other permutations that I'm not thinking about right now.
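To illustrate the normalization point without installing anything, here is a sketch using Python 3's built-in html.parser in place of BeautifulSoup (a substitution made for this example, not what the answers above use). TokenFinder and extract_token are hypothetical names. The same few lines handle all three spellings of the tag shown above.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Remembers the value of the input named authenticity_token."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as (name, value) pairs with quotes already stripped,
        # regardless of attribute order, quote style, or spacing around "=".
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == "authenticity_token":
            self.token = attrs.get("value")


def extract_token(markup):
    finder = TokenFinder()
    finder.feed(markup)
    finder.close()
    return finder.token


variants = [
    '<input name="authenticity_token" type="hidden" value="xxx"/>',
    '<input name=\'authenticity_token\' type="hidden" value="xxx"/>',
    '<input type = "hidden" value = "xxx" name = \'authenticity_token\' />',
]

tokens = [extract_token(v) for v in variants]
print(tokens)
```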