MicroPython Regex not matching although it does online - regex

I have a strange problem. When I test my regex online it works fine, but MicroPython doesn't match it.
regex:
()*<div>(.*?)<\/div>()* or <div>(.*?)<\/div> or <div>(.*?)</div>
toMatch:
<Storage {}>86400<div>Uhrzeit in Sekunden: 65567</div><div>Timer: 20833</div>
None of these match in Python, but they do online (http://regexr.com/ or https://pythex.org/).
This is just a short part of what I want to parse. What I want is the data inside the divs.
EDIT:
I am using MicroPython on an ESP8266. I am limited and can't use an HTML parser.

I suspect your problem is that you are not passing a raw string to re.compile(). If I do this I get what I think you want:
>>> import re
>>> rx = re.compile(r"<div>(.*?)<\/div>")
>>> rx.findall("<Storage {}>86400<div>Uhrzeit in Sekunden: 65567</div><div>Timer: 20833</div>")
['Uhrzeit in Sekunden: 65567', 'Timer: 20833']
You need a raw string because \ is both the Python string escape character and the regex escape character. Without it you have to put \\ in your regex when you mean \ and that very quickly becomes confusing.
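To make the raw-string point concrete, here is a minimal sketch in standard Python. The \b word-boundary token is my own choice for illustration (it is not in the original pattern), picked because Python itself interprets \b as a backspace when the string is not raw.

```python
import re

text = "<div>Timer: 20833</div>"

raw = r"\bTimer\b"        # backslash-b reaches the regex engine: word boundary
doubled = "\\bTimer\\b"   # the same pattern, written without the r prefix
plain = "\bTimer\b"       # Python turns \b into a backspace character first

print(raw == doubled)                      # True
print(re.search(raw, text) is not None)    # True
print(re.search(plain, text) is not None)  # False
```

The third pattern silently looks for backspace characters around "Timer", which is the kind of confusion a raw string avoids.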

Related

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim the trailing characters ) and . from your output.
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems to work for extracting the URLs from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(Edited from https?:[\S]*\/)
A Python script might look something like this:
import re

ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
# Print every URL found in the sample text
for m in regx.findall(ss):
    print(m)
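As an alternative sketch (a different technique from the regex above, not a replacement for it), you can match greedily up to the next whitespace and then strip trailing punctuation with str.rstrip. The helper name extract_urls and the choice of characters to strip are my own assumptions; a URL that legitimately ends in ")" or "." would be cut short by this.

```python
import re

def extract_urls(text):
    """Return URLs found in text, with trailing '.', ')' and ',' stripped."""
    candidates = re.findall(r"https?://\S+", text)
    return [url.rstrip(".),") for url in candidates]

sample = ("this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want "
          "this to be copied over http://rda.ucar.edu/datasets/ds111.1/. "
          "http://www.discover-earth.org/index.html). "
          "http://community.eosdis.nasa.gov/measures/).")

for url in extract_urls(sample):
    print(url)
```

On the sample text from the question this prints the four URLs without the trailing "." and ")." characters.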
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring [\-;:&=\+\$,\w]+ groups, each optionally followed by a dot (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
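The point above about $-_ being a range is easy to check. This is a small sketch in standard Python; the variable names are mine, and the second class is just one of the "hyphen at the end" variants reduced to three characters.

```python
import re

range_class = re.compile(r"[$-_]")    # range from '$' (0x24) to '_' (0x5F)
literal_class = re.compile(r"[$_-]")  # hyphen last: only '$', '_' and '-'

# Characters the range matches even though nobody asked for them
for ch in "/:A9":
    print(ch, bool(range_class.fullmatch(ch)), bool(literal_class.fullmatch(ch)))

# Count every ASCII character accepted by the accidental range
matched = [chr(c) for c in range(128) if range_class.fullmatch(chr(c))]
print(len(matched))  # 60
```

The slash and the digits are inside that range, which is why the original regex matched / without ever listing it.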
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Drop everything after the last slash
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)

Unwanted characters in regular expressions python

So, I have a site that has an XML string, and I'd like my program to return a list of strings that appear between two strings. Here's my code:
response = requests.get(url)
artists=re.findall(re.escape('<name>')+'(.*?)'+re.escape('</name>'),str(response.content))
print(artists)
This returns a list of strings. The problem is, some strings have unwanted characters in them. For example, one of the strings in the list is "Somethin\\' \\'Bout A Truck" and I'd like it to be 'Somethin' 'Bout A Truck'.
Thanks in advance.
I think Beautiful Soup (bs4) will solve this problem, and it also supports Python 3.4 and later.
Those escapes (single backslashes, each displayed as \\) may be "unwanted" from your viewpoint but they're no doubt "present" in the response you received. So if characters are present but unwanted, you can remove them, e.g. by using, in lieu of str(response.content),
str(response.content).replace('\\', '')
if what you actually want to do is remove all such escapes (if you want to do something different than that you'd better explain what it is:-).
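One likely source of those backslashes, offered as an assumption since the actual response isn't shown: in Python 3, response.content is a bytes object, and str() on bytes returns its repr (b'...'), which escapes single quotes whenever the data also contains double quotes. The sketch below uses a made-up byte string in place of the real response and compares that with decoding the bytes first.

```python
import re

# Hypothetical stand-in for response.content (bytes, containing both quote types)
content = b'<artist id="1"><name>Somethin\' \'Bout A Truck</name></artist>'
pattern = r"<name>(.*?)</name>"

from_repr = re.findall(pattern, str(content))             # regex over the bytes repr
from_text = re.findall(pattern, content.decode("utf-8"))  # regex over decoded text

print(from_repr)  # backslashes before the quotes survive
print(from_text)  # clean text
print(from_repr[0].replace("\\", "") == from_text[0])  # True
```

If that is what is happening, decoding the bytes avoids the escapes in the first place, and the replace() call above becomes a workaround rather than a necessity.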
BeautifulSoup4 as recommended in the accepted answer, though a nice package indeed, does not wantonly remove characters present in the input -- it can't read your mind, so it can't know what's "unwanted" to you. E.g:
>>> import bs4
>>> s = '<name>Somethin\\\' \\\'Bout A Truck</name>'
>>> soup = bs4.BeautifulSoup(s)
>>> print(soup)
<name>Somethin\' \'Bout A Truck</name>
>>>
As you see, the escapes (backslashes) are still there before the single-quotes.

golang regex to find urls in a string

I am trying to find all links in a string and then hyperlink them,
like this JS library does: https://github.com/bryanwoods/autolink-js
I tried a lot of regexes but I always got too many errors:
http://play.golang.org/p/iQiccXvFiB
I don't know if Go has a different regex syntax.
So, what regex that works in Go is good for matching URLs in strings?
Thanks
You can use xurls:
package main

import "mvdan.cc/xurls"

func main() {
    xurls.Relaxed().FindString("Do gophers live in golang.org?")
    // "golang.org"
    xurls.Relaxed().FindAllString("foo.com is http://foo.com/.", -1)
    // ["foo.com", "http://foo.com/"]
    xurls.Strict().FindAllString("foo.com is http://foo.com/.", -1)
    // ["http://foo.com/"]
}
Use back-ticks instead of double-quotes for your string literals. Back-slashes inside double-quotes start escape sequences, which you don't need/want for this use case.
Additionally, how did you expect this to work?
"$0"

Selecting URLs using RegExp but ignoring them when surrounded by double quotes

I've searched around quite a bit now, but I can't get any suggestions to work in my situation. I've seen success with negative lookahead or lookaround, but I really don't understand it.
I wish to use RegExp to find URLs in blocks of text but ignore them when quoted. While not perfect yet I have the following to find URLs:
(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?
I want it to match the following:
www.test.com:50/stuff
http://player.vimeo.com/video/63317960
odd.name.amazone.com/pizza
But not match:
"www.test.com:50/stuff
http://plAyerz.vimeo.com/video/63317960"
"odd.name.amazone.com/pizza"
Edit:
To clarify, I could be passing a full paragraph of text through the expression. Sample paragraph of what I'd like below:
I would like the following link to be found www.example.com. However this link should be ignored "www.example.com". It would be nice, but not required, to have "www.example.com and www.example.com" ignored as well.
A sample of a different one I have working below. language is php:
$articleEntry = "Hey guys! Check out this cool video on Vimeo: player.vimeo.com/video/63317960";
$pattern = array('/\n+/', '/(https?\:\/\/)?(player\.vimeo\.com\/video\/[0-9]+)/');
$replace = array('<br/><br/>',
'<iframe src="http://$2?color=40cc20" width="500" height="281" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>');
$articleEntry = preg_replace($pattern,$replace,$articleEntry);
The result of the above will replace any new lines "\n" with a double break "<br/><br/>" and will embed the Vimeo video by replacing the Vimeo address with an iframe and link.
I've found a solution!
(?=(([^"]+"){2})*[^"]*$)((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)
The first part, from (? to *$), is what makes it work for me. I found this as an answer in java Regex - split but ignore text inside quotes? by https://stackoverflow.com/users/548225/anubhava
While I had read that question before, I had overlooked his answer because it wasn't the one that "solved" the question. I just changed the single quote to double quote and it works out for me.
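Here is a minimal sketch of how that lookahead behaves, written in Python rather than PHP. The lookahead is a slightly simplified variant of the one above, and SIMPLE_URL is a deliberately reduced stand-in for the full URL pattern (both are my assumptions for illustration).

```python
import re

# Lookahead: from here to the end of the string the double quotes pair up,
# which means the current position is outside any quoted span.
OUTSIDE_QUOTES = r'(?=(?:[^"]*"[^"]*")*[^"]*$)'
# Simplified URL pattern, only good enough for this demonstration
SIMPLE_URL = r'\bwww\.\w+\.\w+'

unquoted_url = re.compile(OUTSIDE_QUOTES + SIMPLE_URL)

text = 'Find www.example.com but ignore "www.quoted.org" here.'
print(unquoted_url.findall(text))  # ['www.example.com']
```

The quoted URL is skipped because only one double quote remains after it, so the "pairs of quotes up to the end" check fails at that position. This relies on the quotes in the text being balanced.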
Add ^ and $ to your regex:
^(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?$
Please note that you might need to escape the slashes after http (meaning https?\:\/\/).
Update:
If you want it to be case sensitive, you shouldn't use \w but [a-z]. \w matches all letters (upper and lower case), digits, and the underscore, so you should be careful when using it.

Regex String Search Problem

I am trying to fetch specific URLs from a large string.
http://www.rubular.com/r/OYxQHVTWfF
I can extract these URLs in that web interface,
but the same regex pattern is not working in the iPhone SDK.
I have tried many times.
I am using NSRegularExpression:
NSRegularExpression *regex =
[NSRegularExpression regularExpressionWithPattern:
@"\"low_resolution\"[:][\\s]{\"url\"[:][\\s]\"([^\"]*)"
Please help out.
If you want just to extract URL from that data blob, then this regex pattern will do just fine on most regex engines:
low_resolution": {"url": "([^"]+)"
You then just retrieve the first capture group.
Note: If you use double quotes inside a string, use single quotes around it, so you won't need to escape the double quotes.
You forgot to escape the {:
@"\"low_resolution\":\\s\\{\"url\":\\s\"([^\"]*)"
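For comparison, here is the same pattern with the escaped { as a Python sketch. The blob value is a made-up sample, since the real data isn't shown in the question. In a raw string each backslash is written once, whereas the Objective-C literal above has to double them.

```python
import re

# Hypothetical sample of the data blob
blob = '{"low_resolution": {"url": "http://example.com/low.jpg", "width": 306}}'

# \{ matches a literal brace; the group captures everything up to the next quote
pattern = re.compile(r'"low_resolution":\s\{"url":\s"([^"]*)"')

match = pattern.search(blob)
print(match.group(1))  # http://example.com/low.jpg
```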