Pythonic way to find a regular expression match - regex

Is there a more succinct/correct/pythonic way to do the following:
import re
import urllib2

url = "http://0.0.0.0:3000/authenticate/login"
re_token = re.compile("<[^>]*authenticity_token[^>]*value=\"([^\"]*)")
for line in urllib2.urlopen(url):
    if re_token.match(line):
        token = re_token.findall(line)[0]
        break
I want to get the value of the input tag named "authenticity_token" from an HTML page:
<input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />

Could you use Beautiful Soup for this? The code would essentially look something like so:
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://0.0.0.0:3000/authenticate/login"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
# find() returns the matching tag; index it to read the attribute value
token = soup.find("input", {'name': 'authenticity_token'})['value']
Something like that should work. I didn't test this but you can read the documentation to get it exact.

You don't need the findall call. Instead use:
m = re_token.match(line)
if m:
    token = m.group(1)
    ....
I second the recommendation of BeautifulSoup over regular expressions though.

There's nothing "pythonic" about using a regex. If you don't want to use BeautifulSoup (which ideally you should), just use Python's excellent string manipulation capabilities:
for line in open("file"):
    line = line.strip()
    if "<input name" in line and "value=" in line:
        item = line.split()
        for i in item:
            if "value" in i:
                print i
Output:
$ more file
<input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />
$ python script.py
value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4="

As to why you shouldn't use regular expressions to search HTML, there are two main reasons.
The first is that HTML is defined recursively, and regular expressions, which compile into stackless state machines, don't do recursion. You can't write a regular expression that can tell, when it encounters an end tag, which start tag that end tag belongs to; there's nowhere to save that information.
The second is that parsing HTML (which BeautifulSoup does) normalizes all kinds of things that are allowable in HTML and that you're probably not going to ever consider in your regular expressions. To pick a trivial example, what you're trying to parse:
<input name="authenticity_token" type="hidden" value="xxx"/>
could just as easily be:
<input name='authenticity_token' type="hidden" value="xxx"/>
or
<input type = "hidden" value = "xxx" name = 'authenticity_token' />
or any one of a hundred other permutations that I'm not thinking about right now.
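To make the normalization point concrete, here is a minimal sketch using Python 3's standard-library html.parser rather than BeautifulSoup. The names TokenFinder and find_token are made up for this illustration. The parser hands over attributes with quotes already stripped, so all three variants above yield the same value:

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Hypothetical helper: remembers the value of the authenticity_token input."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes removed,
        # so quote style and attribute order no longer matter.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "authenticity_token":
            self.token = attributes.get("value")


def find_token(html):
    """Return the token value found in html, or None if there is none."""
    finder = TokenFinder()
    finder.feed(html)
    return finder.token


variants = [
    '<input name="authenticity_token" type="hidden" value="xxx"/>',
    "<input name='authenticity_token' type=\"hidden\" value=\"xxx\"/>",
    "<input type = \"hidden\" value = \"xxx\" name = 'authenticity_token' />",
]
for variant in variants:
    print(find_token(variant))
```

The single regular expression from the question would need a separate alternative for each of these spellings.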

Related

HTML - Using pattern attribute

In an HTML form, I need a textarea that allows any type of text: numbers, symbols, newlines, or letters, including Hebrew letters. There are only two rules:
The input must include the string: "{ser}"
The input should prohibit any use of "{" or "}" except for the above string
I tried this:
<form action="#">
...
<textarea pattern="[^\{\}]*\{ser\}[^\{\}]*" required>
האם אתה נמצא בשבת הקרובה? אם כן נא השב {ser} + שם מלא
</textarea>
...
<input type="submit" />
...
</form>
But for some reason it also allows sending texts that do not meet the rules. I would appreciate your help.
You cannot use the pattern attribute on textareas; see the documentation:
maxlength specifies a maximum number of characters that the <textarea>
is allowed to contain. You can also set a minimum length that is
considered valid using the minlength attribute, and specify that the
<textarea> will not submit (and is invalid) if it is empty, using the
required attribute. This provides the <textarea> with simple
validation, which is more basic than the other form elements (for
example, you can't provide specific regexs to validate the value
against using the pattern attribute, like you can with the input
element).
Perhaps implement a regex match with JavaScript?
function validateTextarea(text) {
    var re = /ser/g;
    var result = text.match(re);
    if (result != null && result.length > 0) {
        // Do something
    }
}
Then probably the best approach is to call this function from the form's onsubmit attribute.
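Note that /ser/ only checks that "ser" appears somewhere; it does not enforce the brace rule. As a rough sketch of the two rules themselves, here is the pattern from the question applied to the whole text in Python (for instance as a server-side check). The function name is_valid_message is made up for this example, and reading the rules as "exactly one {ser}" is an assumption:

```python
import re

# Any run of non-brace characters, the literal "{ser}", then any run of
# non-brace characters. [^{}] also matches newlines, so multi-line text is fine.
SER_PATTERN = re.compile(r"[^{}]*\{ser\}[^{}]*")


def is_valid_message(text):
    """Return True if text contains {ser} once and no other braces."""
    # fullmatch anchors the pattern to the entire string, which is also how
    # the pattern attribute behaves on input elements.
    return SER_PATTERN.fullmatch(text) is not None


print(is_valid_message("Please reply {ser} + full name"))
print(is_valid_message("Missing the placeholder"))
print(is_valid_message("{ser} and a stray {"))
```

The same expression could be used inside the JavaScript function, as long as it is anchored to the start and end of the text.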

Regex for HTML RESPONSE BODY present under div tag

I need to build a regex to extract the contents of the value attribute,
i.e. "f70a8c3d0a6cbe2e235c7fd1dd27d052df7412ea"
HTML response body:
Note: I have pasted just a small part of the response, but the formToken key is unique.
<div class="hidden">
<input name="formToken" type="hidden"
value="f70a8c3d0a6cbe2e235c7fd1dd27d052df7412ea"
/>
</div>
I wrote the below regex but it returned nothing:
regex("formToken" type="hidden" value="([^"]*)"/>).find(0).exists, found nothing
Can you try this?
regex("type="hidden".*value="(.*?)[ \t]*"/>).find(0).exists
Instead of a regex, you could use a css selector check which is probably way easier once you have ids or css classes to search for.
Thank you all. I was able to get formToken using a css check:
.check(css("input[name='formToken']", "value").saveAs("formTokex"))
Works like this for me:
.exec(http("request_1")
  .get("<<<<YOUR_URL>>>>>")
  .check(css("form[name='signInForm']", "action").saveAs("urlPath")))
and later printing it:
println(session("urlPath").as[String])

RegExing a viewstate

First of all, what is a viewstate?
In test automation I probably need to correlate this value, since it is unique for every user logging in?
How can I get the 'value' / token below using regex?
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTUxOTg3NDM2NGQYCAUkY3RsMDAkTWFpbk1lbnUkTWVudSRjdGwwMSRjdGwwMCRNZW51DxQrAA5kZGRkZGRkPCsABQACBWRkZGYC/////w9kBSRjdGwwMCRNYWluTWVudSRNZW51JGN0bDAzJGN0bDAwJE1lbnUPFCsADmRkZGRkZGQ8KwAEAAIEZGRkZgL/////D2QFJGN0bDAwJE1haW5NZW51JE1lbnUkY3RsMDQkY3RsMDAkTWVudQ8UKwAOZGRkZGRkZDwrAAcAAgdkZGRmAv////8PZAUkY3RsMDAkTWFpbk1lbnUkTWVudSRjdGwwNiRjdGwwMCRNZW51DxQrAA5kZGQCAmRkZDwrAAkAAglkZGRmAv////8PZAUkY3RsMDAkTWFpbk1lbnUkTWVudSRjdGwwMiRjdGwwMCRNZW51DxQrAA5kZGQCAmRkZDwrAAsAAgtkZGRmAv////8PZAUpY3RsMDAkRm9vdGVyUmVnaW9uJGN0bDAwJEZvb3RlckxpbmtzJExpc3QPD2ZkZAUTY3RsMDAkY3RsMDMkUnNzTGlzdA8PZmRkBSRjdGwwMCRNYWluTWVudSRNZW51JGN0bDA1JGN0bDAwJE1lbnUPFCsADmRkZGRkZGQ8KwAJAAIJZGRkZgL/////D2R3kjxauWd2eu+C/bmZz+/bI7YRkg==" />
Read this: RegEx match open tags except XHTML self-contained tags
then if you still want to have a go, use this:
(?<=input )(?:.*)(value\=\".*\")
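The pattern above captures the whole value="..." attribute, name included. If only the token is wanted, a narrower pattern is one option. The sketch below is in Python and rests on assumptions about the page: the name attribute comes before value, and both use double quotes. The function name extract_viewstate and the shortened sample value are made up for this example:

```python
import re

# Assumes name="__VIEWSTATE" appears before value="..." inside the same tag.
VIEWSTATE_RE = re.compile(
    r'<input\b[^>]*\bname="__VIEWSTATE"[^>]*\bvalue="([^"]*)"'
)


def extract_viewstate(html):
    """Return the __VIEWSTATE value, or None if the pattern does not match."""
    match = VIEWSTATE_RE.search(html)
    return match.group(1) if match else None


# Shortened placeholder value, not the real one from the question.
sample = (
    '<div>\n'
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" '
    'value="/wEPDwUKMTUx+C/bmZz==" />\n'
    '</div>'
)
print(extract_viewstate(sample))
```

If the attribute order or quoting ever changes, this returns None, which is the fragility the linked question warns about.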

'TypeError: expected string or buffer' when performing re.sub on beautiful soup result set iteration

content_a is a Beautiful Soup result set (i.e. the type is <class 'bs4.element.ResultSet'>) that is made up of values whose type is <class 'bs4.element.Tag'>.
If I print content_a I get:
[<div class="class1 class2">Here is the first sentence.
<br/> <br/> Here is some text "and some more text."
<br/> <br/> Here is another sentence.
<br/> Text<br/><span class="class3">Text</span></div>, <div class="class1 class2">Here is the first sentence.
<br/> <br/> Here is some text "and some more text."
<br/> <br/> Here is another sentence.
<br/> Text<br/><span class="class3">Text</span></div>, etc
So it seems to me it should be a simple iterable list of divs.
I want to replace <div class="class1 class2"> with <div class="class1 class2"><p> (my eventual goal being to replace all <br />'s with paragraph tags).
In my test, where the source content is a string, I have:
import re
blablabla = ['<div class="class1 class2">', '<div class="class1 class2">']
for _ in blablabla:
    _ = re.sub('(<div class=\"class1 class2\">)', r"\1<p>", _)
    print _
which returns, as required:
<div class="class1 class2"><p>
<div class="class1 class2"><p>
I am trying to perform the same process on each item in content_a with:
import re
for _ in content_a:
    _ = re.sub('(<div class=\"class1 class2\">)', r"\1<p>", _)
    print _
but am getting the error:
...in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
So the only difference that I can tell between the two examples is that one is a Beautiful Soup result set and the other is just a plain list.
Can anyone see why this error could be occurring?
Edit:
Someone has pointed out here that sub requires a string as its third argument, and the third argument I am passing is the iterated value, which is of type <class 'bs4.element.Tag'>. So perhaps this is the problem. But I need to retain the nature of these values for later modification, so I am not sure how to proceed at the moment.
Update/Workaround:
Just to save someone spending time on an answer: I figured out a workaround. I realised I could adjust the content later in the process, so I converted it to a string with read() and could then perform all the re.sub changes on the required elements in the string.
And the little regex I came up with was:
string = re.sub('([^\r]*)\r', r'\1</p>\n<p>', string)
As suggested, I am posting the workaround I used as the solution:
I realised I could adjust the content later in the process, so I converted it to a string with read() and could then perform all the re.sub changes on the required elements in the string.
The little regex I came up with was:
string = re.sub('([^\r]*)\r', r'\1</p>\n<p>', string)
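The error itself can be reproduced without Beautiful Soup. In the sketch below, FakeTag is a hypothetical stand-in for any non-string object such as a bs4 Tag, and add_paragraph is a name invented for this example. re.sub rejects the object, while converting it with str() first works; calling str() on a bs4 Tag likewise returns its markup.

```python
import re


class FakeTag:
    """Hypothetical stand-in for a non-string object such as a bs4 Tag."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


PATTERN = r'(<div class="class1 class2">)'
items = [FakeTag('<div class="class1 class2">text</div>')]


def add_paragraph(item):
    # Convert to a string first; re.sub only accepts str or bytes-like input.
    return re.sub(PATTERN, r"\1<p>", str(item))


try:
    re.sub(PATTERN, r"\1<p>", items[0])  # passing the object itself
    raised = False
except TypeError:
    raised = True

converted = [add_paragraph(item) for item in items]
print(raised, converted)
```

The results are plain strings, so anything that later needs Tag objects would have to parse them again.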

Add a newline after each closing html tag in web2py

Original
I want to parse a string of html code and add newlines after closing tags + after the initial form tag. Here's the code so far. It's giving me an error in the "re.sub" line. I don't understand why the regex fails.
def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    result = re.sub("(</.*?>)", "\1\n", tags)
    return dict(form_code=result)
PS. I have a feeling this might not be the best way... but I still want to learn how to do this.
EDIT
I was missing "import re" from my default.py. Thanks ruakh for this.
import re
Now my page source code shows up like this (inspected in client browser). The actual page shows the form code as text, not as UI elements.
<form><label for="email_field">Email:</label>
<input type="email" name="email_field"/><label
for="password_field">Password:</label>
<input type="password" name="password_field"/><input
type="submit" value="Login"/></form>
EDIT 2
The form code is rendered as UI elements after adding XML() helper into default.py. Thanks Anthony for helping. Corrected line below:
return dict(form_code=XML(result))
FINAL EDIT
I figured out the regex fix myself. This is not an optimal solution, but at least it works. The final code:
import re
def user():
    tags = "<form><label for=\"email_field\">Email:</label><input type=\"email\" name=\"email_field\"/><label for=\"password_field\">Password:</label><input type=\"password\" name=\"password_field\"/><input type=\"submit\" value=\"Login\"/></form>"
    tags = re.sub(r"(<form>)", r"<form>\n ", tags)
    tags = re.sub(r"(</.*?>)", r"\1\n ", tags)
    tags = re.sub(r"(/>)", r"/>\n ", tags)
    tags = re.sub(r"( </form>)", r"</form>\n", tags)
    return dict(form_code=XML(tags))
The only issue I see is that you need to change "\1\n" to r"\1\n" (using the "raw" string notation); otherwise \1 is interpreted as an octal escape (meaning the character U+0001). But that shouldn't give you an error, per se. What error-message are you getting?
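A short sketch of the difference, using a shortened stand-in for the form markup (the tags string here is made up for the example):

```python
import re

tags = "<label>Email:</label><label>Password:</label>"

# In a normal string literal, "\1" is the single character U+0001.
# In a raw string, r"\1" is a backslash followed by "1", which re.sub
# reads as a reference to group 1.
print(repr("\1"), repr(r"\1"))

broken = re.sub("(</.*?>)", "\1\n", tags)   # closing tags are replaced by U+0001
fixed = re.sub(r"(</.*?>)", r"\1\n", tags)  # closing tags are kept, newline added

print(repr(broken))
print(repr(fixed))
```

With the non-raw replacement no exception is raised; the closing tags are silently replaced by the control character.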
By default, web2py escapes all text inserted in the view for security reasons. To avoid that, simply use the XML() helper, either in the controller:
return dict(form_code=XML(result))
or in the view:
{{=XML(form_code)}}
Don't do this unless the code is coming from a trusted source -- otherwise it could contain malicious Javascript.