Unwanted characters in regular expressions python - regex

So, I have a site that has an XML string, and I'd like my program to return a list of strings that appear between two strings. Here's my code:
response = requests.get(url)
artists=re.findall(re.escape('<name>')+'(.*?)'+re.escape('</name>'),str(response.content))
print(artists)
This returns a list of strings. The problem is, some strings have unwanted characters in them. For example, one of the strings in the list is "Somethin\\' \\'Bout A Truck" and I'd like it to be 'Somethin' 'Bout A Truck'.
Thanks in advance.

I think Beautiful Soup (bs4) will solve this problem, and it also supports newer Python versions such as 3.4.

Those escapes (single backslashes, each displayed as \\) may be "unwanted" from your viewpoint, but they are no doubt present in the response you received. If characters are present but unwanted, you can remove them, e.g. by using, in lieu of str(response.content),
str(response.content).replace('\\', '')
if what you actually want is to remove all such escapes (if you want something different, you'd better explain what it is :-).
BeautifulSoup4 as recommended in the accepted answer, though a nice package indeed, does not wantonly remove characters present in the input -- it can't read your mind, so it can't know what's "unwanted" to you. E.g:
>>> import bs4
>>> s = '<name>Somethin\\\' \\\'Bout A Truck</name>'
>>> soup = bs4.BeautifulSoup(s)
>>> print(soup)
<name>Somethin\' \'Bout A Truck</name>
>>>
As you see, the escapes (backslashes) are still there before the single-quotes.
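To make the replace() suggestion concrete, here is a minimal sketch. The `raw` string is a made-up stand-in for str(response.content), and `extract_names` is a hypothetical helper name; it assumes you really do want every backslash removed before matching.

```python
import re

# Hypothetical stand-in for str(response.content); each "\\" below is ONE
# literal backslash character in the text, as in the question.
raw = "<name>Somethin\\' \\'Bout A Truck</name><name>Plain Title</name>"

# Same pattern as in the question.
pattern = re.escape('<name>') + '(.*?)' + re.escape('</name>')

def extract_names(text):
    """Strip all backslashes, then collect the text between <name> tags."""
    cleaned = text.replace('\\', '')
    return re.findall(pattern, cleaned)

artists = extract_names(raw)
print(artists)
```

Note that this removes every backslash in the response, including any that are legitimately part of a name.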

Related

MicroPython Regex not matching although it does online

I have a strange problem. My regex works fine when I test it online, but it doesn't match in MicroPython.
regex:
()*<div>(.*?)<\/div>()* or <div>(.*?)<\/div> or <div>(.*?)</div>
toMatch:
<Storage {}>86400<div>Uhrzeit in Sekunden: 65567</div><div>Timer: 20833</div>
None of these match in MicroPython, but they do online (http://regexr.com/ or https://pythex.org/).
This is just a short part of what I want to get; what I need is the data inside the divs.
EDIT:
I am using MicroPython on an ESP8266. I am limited and can't use an HTML parser.
I suspect your problem is that you are not passing a raw string to re.compile(). If I do this I get what I think you want:
>>> rx = re.compile(r"<div>(.*?)<\/div>")
>>> rx.findall("<Storage {}>86400<div>Uhrzeit in Sekunden: 65567</div><div>Timer: 20833</div>")
['Uhrzeit in Sekunden: 65567', 'Timer: 20833']
You need a raw string because \ is both the Python string escape character and the regex escape character. Without it you have to put \\ in your regex when you mean \ and that very quickly becomes confusing.
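A small sketch of why the raw string matters. It uses \b rather than \/ because \b shows the difference most clearly: in a normal string it is a backspace character, in a raw string it reaches the regex engine as a word boundary. This was written for CPython; MicroPython's re module is a reduced implementation, so some of these calls may not be available on an ESP8266.

```python
import re

text = "<Storage {}>86400<div>Uhrzeit in Sekunden: 65567</div><div>Timer: 20833</div>"

# Raw string: the regex engine sees exactly what is typed.
div_rx = re.compile(r"<div>(.*?)</div>")
div_contents = div_rx.findall(text)

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees the pattern.
plain_pattern = "\bTimer\b"   # backspace + "Timer" + backspace (7 characters)
raw_pattern = r"\bTimer\b"    # word boundary + "Timer" + word boundary (9 characters)

plain_hits = re.findall(plain_pattern, text)
raw_hits = re.findall(raw_pattern, text)

print(div_contents)
print(len(plain_pattern), plain_hits)
print(len(raw_pattern), raw_hits)
```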

Finding a group of words using Regular Expressions

I am using python to get user input and then by using regular expressions I want to check for certain words. In this case I want to check how the user is feeling and then store it in a list. The problem is that when I print the list it is empty.
import re
phrase = raw_input("How are you feeling ")
phrase = phrase.lower()
feel=(re.findall(r'^(?=.*\bsad\b)(?=.*\bhappy\b)(?=.*\bjoyful\b)(?=.*\bmad\b)(?=.*\bsad\b)', phrase))
print feel
I'm not a Python expert, but I'm fairly decent with regex. Why not just use something like:
\b(happy|sad|joyful|mad)\b
If you keep the lookaheads, add characters for the pattern to match after them, for example a trailing .*:
...(?=.*\bsad\b).*
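A short sketch of the difference between the two patterns, written for Python 3 with a fixed sample phrase instead of raw_input. The names `FEELING_RX`, `ALL_RX` and `find_feelings` are made up for illustration. The chained lookaheads require every word to be present at once, and even then they match only an empty string, which is why the list in the question comes back empty.

```python
import re

# Alternation: matches any ONE of the words, wherever it occurs.
FEELING_RX = re.compile(r"\b(happy|sad|joyful|mad)\b")

# Chained lookaheads: succeed only if ALL of the words occur in the phrase,
# and consume no characters when they do.
ALL_RX = re.compile(r"^(?=.*\bsad\b)(?=.*\bhappy\b)(?=.*\bjoyful\b)(?=.*\bmad\b)")

def find_feelings(phrase):
    """Return every feeling word found in the phrase, in order of appearance."""
    return FEELING_RX.findall(phrase.lower())

feel = find_feelings("I am Happy but also a bit sad, not madly so")
print(feel)

# The lookahead version finds nothing unless every word is present...
print(ALL_RX.findall("i am sad"))
# ...and when they are all present it matches only the empty string.
print(ALL_RX.findall("sad happy joyful mad"))
```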

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
I am hoping this will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to Python and regex, so my understanding of relatively complicated regex suggestions will be somewhat limited.
update:
After suggestions, I tried to test the following regex, which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print i
solution:
I think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy; there's a great tool called Expresso that is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, because it comes in handy.
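The "get all the URLs, then remove duplicates yourself" idea could be sketched as below, in Python 3. The sample HTML and the names `HREF_RX` and `unique_page_links` are assumptions for illustration; the pattern combines the pieces discussed above (any number of digits, any hyphenated words, page-2 and upwards) and is not claimed to be the asker's final version.

```python
import re

# Digits, a hyphen, then anything up to the next slash, then page-2 or higher.
HREF_RX = re.compile(
    r'href="(/dir/Sub_Dir/dir/[0-9]+-[^"/]+/page-(?:[2-9]|[1-9][0-9]+))"',
    re.I,
)

def unique_page_links(html):
    """Return matching hrefs in first-seen order, with duplicates dropped."""
    # dict.fromkeys keeps insertion order, so it works as an ordered "set".
    return list(dict.fromkeys(HREF_RX.findall(html)))

# Made-up page content for illustration.
sample = (
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>\n'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>\n'
    '<a href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-12">12</a>\n'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-1">1</a>\n'
)

links = unique_page_links(sample)
prefix = 'http://www.test.com'
for link in links:
    print(prefix + link)
```

Because findall returns the contents of the single capturing group, each result is already the path to append to the prefix.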

regular expression: how to ignore rest of the line

I have an input like this (a JSON format)
{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}
Now, I want to pick up the first "id" value, which is 1BCDEFGHIJKLM, using regular expressions. I have managed up to this point using
[^({"location":[?{"id":")].{0,12} but this is incomplete. Could someone help me ignore the rest of the line after the value 1BCDEFGHIJKLM?
Regex isn't the way to do this. Whatever platform you are using, it must have a JSON parser.
That will be your best error-free solution.
Assuming you must use regex, you can grab all the id's using "id":"(.*?)", and take the first match.
I found the following article, which might help you.
While messy, how is your regex incomplete?
It could be shortened to ("id":"([^"]+)"), which is more readable and doesn't limit the ID to twelve characters, if that is beneficial.
If your problem is getting more than one result, most languages have a "g" global switch.
In javascript, the following would return "1BCDEFGHIJKLM":
var firstID = str.match(/"id":"([^"]+)"/)[1]
This works because match() returns an array in which [0] is the entire matched string and [1] is the first parenthesized group.
You don't have to use a regex. In your favourite language, split on commas. Then go through each item, check for "id" and split on the colon (:). Get the last element. E.g. in Python:
>>> s
'{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}'
>>> for i in s.split(","):
...     if '"id"' in i:
...         print i.split(":")[-1]
...         break
...
"1BCDEFGHIJKLM"
Of course, ideally, you should use a dedicated JSON parser.
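For comparison, a minimal Python 3 sketch of the parser route next to the regex fallback. The input is a shortened version of the JSON from the question (two entries, two keys each), and the variable names are made up for illustration.

```python
import json
import re

# Shortened version of the input from the question.
s = ('{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd"},'
     '{"id":"7823XYZHMOPRE","somename":"abcd Junction"}]}')

# Parser route: navigate the structure instead of pattern matching.
data = json.loads(s)
first_id = data["location"][0]["id"]

# Regex fallback: re.search stops at the first match, so the rest of the
# line is ignored automatically.
match = re.search(r'"id":"([^"]+)"', s)
first_id_regex = match.group(1) if match else None

print(first_id, first_id_regex)
```

The parser version keeps working if the key order or whitespace changes, which the regex version does not guarantee.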

Regex: Match URLs for specific domain EXCEPT when a certain querystring parameter has a certain value

In short, I need to match all URLs in a block of text that are for a certain domain and don't contain a specific querystring parameter and value (refer=twitter)
I have the following regex to match all URLs for the domain.
\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?
I just can't get the last part to work
(?![&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?
So the following SHOULD match
example.com
http://example.com/
https://www.example.com#link
www.example.com?somevalue=foo
But these should NOT
https://www.anotherexample.com#link
www.example.com?refer=twitter
EDIT:
And if you can get it to match the
http://example.com?foo=foo.bar
out of a sentence like
For examples go to http://example.com?foo=foo.bar.
without picking up the period, that would be great!
EDIT2:
Fixed the trailing period issue with this
\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?
EDIT3:
This seems to work, or at least 99% of the tests I've thrown at it
(?!\b.*[&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?
EDIT4:
Settled on
\b(?!.*[&?]refer=twitter)(https?://)?([a-z0-9-]+\.)*nygard\.com(?!\.)[^\s]*\b+
(?!\b.*[&?]refer=twitter)
is what you're looking for.
To be honest, at first the thought of using a regex didn't even cross my mind (which is a good sign: using a regex must, IMO, always be a secondary option, not the primary one). Here is how I'd do it in my language of choice:
>>> from urlparse import urlparse, parse_qs
>>> p = urlparse(r'http://foo.bar.com/baz?refer=twitter&rock=paper')
>>> parse_qs(p.query)
{'rock': ['paper'], 'refer': ['twitter']}
You can do anything from here.
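Building on the parsing idea, here is one possible sketch of the whole filter in Python 3, where the same functions live in urllib.parse. The helper names `is_wanted` and `wanted_urls` and their parameters are made up for illustration, and splitting the text on whitespace is an assumption about how the URLs appear in the block of text.

```python
from urllib.parse import urlparse, parse_qs

def is_wanted(url, domain="example.com", param="refer", banned="twitter"):
    """True if url is on domain (or a subdomain) and lacks param=banned."""
    # urlparse only recognises the host when "//" is present, so add it
    # for bare forms such as "www.example.com?x=1".
    if "//" not in url:
        url = "//" + url
    try:
        parts = urlparse(url)
        host = (parts.hostname or "").lower()
    except ValueError:
        # Raised for malformed hosts, e.g. an unbalanced "[".
        return False
    if host != domain and not host.endswith("." + domain):
        return False
    return banned not in parse_qs(parts.query).get(param, [])

def wanted_urls(text):
    """Check each whitespace-separated token, ignoring a trailing period."""
    tokens = (tok.rstrip(".") for tok in text.split())
    return [tok for tok in tokens if is_wanted(tok)]

print(wanted_urls("For examples go to http://example.com?foo=foo.bar."))
```

Comparing the parsed hostname, rather than searching the raw text, is what keeps anotherexample.com from slipping through.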