I have a strange Problem. When I parse my Regex online it works fine, but in MicroPython doesn't match it.
regex:
()*<div>(.*?)<\/div>()*or<div>(.*?)<\/div>or<div>(.*?)</div>
toMatch:
<Storage {}>86400<div>Uhrzeit in Sekunden: 65567</div><div>Timer: 20833</div>
none of these match with python but do online (http://regexr.com/ or https://pythex.org/)
This is just a short part of what i want to get. But what i want is the data inside the div.
EDIT:
I am using micropython on a esp8266. I am limited and cant use a html parser.
I suspect your problem is that you are not passing a raw string to re.compile(). If I do this I get what I think you want:
>>> rx = re.compile(r"<div>(.*?)<\/div>")
>>> rx.findall("<Storage {}>86400<div>Uhrzeit in Sekunden: 65567</div><div>Timer: 20833</div>")
>>> ['Uhrzeit in Sekunden: 65567', 'Timer: 20833']
You need a raw string because \ is both the Python string escape character and the regex escape character. Without it you have to put \\ in your regex when you mean \ and that very quickly becomes confusing.
I am using python to get user input and then by using regular expressions I want to check for certain words. In this case I want to check how the user is feeling and then store it in a list. The problem is that when I print the list it is empty.
import re
phrase = raw_input("How are you feeling ")
phrase = phrase.lower()
feel=(re.findall(r^(?=.*\bsad\b)(?=.*\bhappy\b)(?=.*\bjoyful\b)(?=.*\bmad\b)(?=.*\bsad\b), phrase))
print feel
I'm not a python expert, but am fairly decent with regex. Why wouldn't you just use something like:
\b(happy|sad|joyful|mad)\b
Add chars to match
...(?=.*\bsad\b).*
As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
headers = {'User-Agent' : 'Mozilla 5.0'}
request = urllib2.Request(url, None, headers)
url_for_re = urllib2.urlopen(request).read()
another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
file.write(url)
file.write('\n')
file.write(another_url)
file.write('\n')
Which i am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to python and regex so this will limit my understanding somewhat in regards to relatively complicated regex suggestions etc.
update:
after suggestions i tried to test the following regex which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
i = prefix + match + '\n'
print i
solution:
i think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9])|([1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.
I have an input like this (a JSON format)
{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}
Now, I want to pick up the first "id" value which is 1BCDEFGHIJKLM using regular expressions. I have managed upto the point using
[^({"location":[?{"id":")].{0,12} but this is incomplete. Could some one help how do I ignore the rest of the line after the value 1BCDEFGHIJKLM
Regex isn't the way to do this. Whatever platform you are using, it must have a JSON parser.
That will be your best error-free solution.
Assuming you must use regex, you can grab all the id's using "id":"(.*?)", and take the first match.
I found the following article, which might help you.
While messy, how is your regex incomplete?
It could be shortened to ("id":"([^"]+)") which is more readable, and doesn't limit the ID to twelve characters. If that is beneficial.
If you problem is getting more than one result, most languages have a "g" global switch.
In javascript, the following would return "1BCDEFGHIJKLM":
var firstID = str.match(/"id":"([^"]+)"/)[1]
As match()returns an array, in which [0] is the entire returned string, and [1] the first parenthasis.
Don't have to use regex. In your favourite language, split on commas. Then go through each item, check for "id" and split on colon (:). Get the last element. Eg Python
>>> s
'{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}'
>>> for i in s.split(","):
... if '"id"' in i:
... print i.split(":")[-1]
... break
...
"1BCDEFGHIJKLM"
Of course, ideally, you should use a dedicated JSON parser.
In short, I need to match all URLs in a block of text that are for a certain domain and don't contain a specific querystring parameter and value (refer=twitter)
I have the following regex to match all URLs for the domain.
\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?
I just can't get the last part to work
(?![&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?
So the following SHOULD match
example.com
http://example.com/
https://www.example.com#link
www.example.com?somevalue=foo
But these should NOT
https://www.anotherexample.com#link
www.example.com?refer=twitter
EDIT:
And if you can get it to match the
http://example.com?foo=foo.bar
out of a sentence like
For examples go to http://example.com?foo=foo.bar.
without picking up the period, that would be great!
EDIT2:
Fixed the trailing period issue with this
\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?
EDIT3:
This seems to work, or at least 99% of the tests I've thrown at it
(?!\b.*[&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?
EDIT4:
Settled on
\b(?!.*[&?]refer=twitter)(https?://)?([a-z0-9-]+\.)*nygard\.com(?!\.)[^\s]*\b+
(?!\b.*[&?]refer=twitter)
Is what you're looking for.
To be honest, at first the thought of using a regex didn't even cross my mind (which is a good sign - using a regex must, IMO, always be a secondary option, not primary). Here is how I'd do it in my language of choice
>>> from urlparse import urlparse, parse_qs
>>> p = urlparse(r'http://foo.bar.com/baz?refer=twitter&rock=paper')
>>> parse_qs(p.query)
{'rock': ['paper'], 'refer': ['twitter']}
You can do anything from here.