How to search using backslash - regex

I'm scanning websites for specific words, using re.compile with bs4 to run the search. I run into problems when the word contains a backslash ('\'), and I was hoping to get some help with that. My code usually looks like this:
results = self.soup.find_all(string=re.compile('.*{0}.*'.format(searched_word), re.IGNORECASE), recursive=True)
This code raises re.error: bad escape \M at position 13 when I set searched_word = Software\Microsoft\Windows\CurrentVersion\Run
I read somewhere that to escape the backslashes I should write Software\\Microsoft\\Windows\\CurrentVersion\\Run, which also throws an error, or Software\\\\Microsoft\\\\Windows\\\\CurrentVersion\\\\Run, which doesn't throw an error but does not return the text.

It seems you are not escaping the string before passing it to re.compile(). To do that, use re.escape():
results = self.soup.find_all(string=re.compile('.*{0}.*'.format(re.escape(searched_word)), re.IGNORECASE), recursive=True)
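As a minimal sketch of why the escaping matters, the snippet below uses only the re module (no bs4) and a made-up text sample. The names text, raw_error and pattern are illustrative and not taken from the question.

```python
import re

searched_word = r"Software\Microsoft\Windows\CurrentVersion\Run"
# Hypothetical page text, standing in for a string found by BeautifulSoup.
text = r"Persistence key: HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run is set"

# Unescaped, "\M" is read as an (unknown) regex escape and compilation fails.
try:
    re.compile(".*{0}.*".format(searched_word), re.IGNORECASE)
    raw_error = None
except re.error as exc:
    raw_error = str(exc)

# re.escape() turns every backslash into a literal backslash for the regex engine.
pattern = re.compile(".*{0}.*".format(re.escape(searched_word)), re.IGNORECASE)
match = pattern.search(text)

print(raw_error)
print(match is not None)
```

The same compiled pattern can then be handed to find_all(string=...) as in the answer above.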

Related

Python regex to ignore what is between quotes

I have a string, part of which is surrounded within quotes. Like the one at the third line of the code snippet below. I want the string to be formatted into a dict literal. Meaning wherever the quotes are missing, they should be added. But the part which is within the quotes has to be ignored. I came up with the code below to handle this:
from ast import literal_eval
from re import sub
str = "key1:[val1,val2,val3],key2:'val4A,val4B'"
str = sub(r"([\w\-\.]+|[\"'].*[\"'])", r"'\1'", f"{{{str}}}")
str = sub(r"[\"']{2,}(.*)[\"']{2,}", r"'\1'", str)
fin = literal_eval(str)
print(fin)
This code does the job, but I would like to know whether it can be done with a single call to sub. Before you mark this as a duplicate: I tried a large number of the solutions available on the web, including positive and negative lookahead and lookbehind, exclusion, and simple negative matching, and could not find one that works. If there is a solution I have missed, or if anyone has one, I would highly appreciate hearing about it.
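Try this pattern, keeping r"'\1'" as the replacement: ([\w\-\.]+(?=(?:[^']*'[^']*')*[^']*$))
The lookahead only lets a token match when an even number of single quotes follows it, that is, when the token sits outside a quoted part.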
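A short sketch of that pattern applied with a single sub call to the string from the question. The variable names (source, outside_quotes, quoted, result) are illustrative, and the pattern assumes the input only uses single quotes, as in the example.

```python
import re
from ast import literal_eval

source = "key1:[val1,val2,val3],key2:'val4A,val4B'"

# Match a bare token only if an even number of single quotes follows it,
# so tokens inside '...' are left alone.
outside_quotes = r"([\w\-\.]+(?=(?:[^']*'[^']*')*[^']*$))"

quoted = re.sub(outside_quotes, r"'\1'", "{" + source + "}")
result = literal_eval(quoted)

print(quoted)
print(result)
```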

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This trims the trailing characters ')' and '.' from your output:
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
This regex seems to work for extracting the URLs from your original sample text:
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(Edited from https?:[\S]*\/)
A Python script might look something like this:
import re
ss=""" this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx= re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
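A rough sketch of the full pattern in Python, assuming each URL sits at the start of its own line (the pattern begins with ^, so re.MULTILINE is used here). The names full_pattern, lines and cleaned are illustrative.

```python
import re

# The full pattern from above, split over two string literals for readability.
full_pattern = re.compile(
    r"^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+"
    r"(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).",
    re.MULTILINE,
)

# The four lines of output shown in the question.
lines = "\n".join([
    "http://rda.ucar.edu/datasets/ds117.0/.",
    "http://rda.ucar.edu/datasets/ds111.1/.",
    "http://www.discover-earth.org/index.html).",
    "http://community.eosdis.nasa.gov/measures/).",
])

cleaned = full_pattern.findall(lines)
for url in cleaned:
    print(url)
```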
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
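The accidental range described above can be checked with a small Python sketch. The two character classes are reduced to just the characters in question, and the names accidental, literal and chars are illustrative.

```python
import re

accidental = re.compile(r"[$-_]")  # "-" between "$" and "_" forms a range
literal = re.compile(r"[$_-]")     # "-" at the end is a plain hyphen

chars = "$_-/:A9"

# Collect which sample characters each class accepts.
accidental_hits = [c for c in chars if accidental.fullmatch(c)]
literal_hits = [c for c in chars if literal.fullmatch(c)]

print(accidental_hits)
print(literal_hits)
```

The range version also accepts "/", ":", letters and digits, which is why the original regex matched slashes without listing them.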
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
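Until a better pattern is available, one possible stopgap is to keep the simplified regex and strip trailing '.' and ')' characters afterwards. This is only a sketch: the rstrip step is an assumption added here, not part of the regex above, and it would also strip a ')' or '.' that legitimately ends a URL.

```python
import re

# Simplified regex from above, with "/" added to the character class.
url_pattern = re.compile(r"https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+")

sample = (
    "this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this "
    "to be copied over http://rda.ucar.edu/datasets/ds111.1/. "
    "http://www.discover-earth.org/index.html). "
    "http://community.eosdis.nasa.gov/measures/)."
)

# Assumption: a trailing "." or ")" is sentence punctuation, never part of the URL.
urls = [m.rstrip(".)") for m in url_pattern.findall(sample)]
for url in urls:
    print(url)
```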
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
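des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Keep everything before the last '/', which drops the trailing '.' or ').'
    # (and anything else that follows that slash).
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)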

Matching multiple python regexes in a line in tarfile opened tar file

Please how do I overcome the problem of
TypeError: cannot use a string pattern on a bytes-like object
when trying to run multiple regexes match against a line from the file?
The multiple match I am trying is:
re.match('|'.join('(?:{0})'.format(x) for x in (regex1, regex2, regex3)), line):
which works in plain text file matches and which I attribute to StackOverflow assistance.
I have compiled the regexes like so:
regex1 = re.compile(b'http\:\/\/ipaddress\:port\/service\?')
regex2 = re.compile(b'\_event\=new?')
regex3 = re.compile(b'askment\:')
but this TypeError still appears.
Earlier in my script I can get away with this:
match = re.search(b'something-string:\s+111+\d{2,5}', line)
So I thought prefixing the regexes with 'b' in the multiple match was sufficient.
Please what am I doing wrong?
I had to decode the line, since it's coming in as a binary stream.
re.match('|'.join('(?:{0})'.format(x) for x in (regex1, regex2, regex3)), line.decode("ascii or something else")):
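A minimal sketch of the decode approach, assuming the patterns are kept as plain source strings rather than compiled objects (formatting a compiled pattern into a string inserts its repr, not its source). The names sources, combined, combined_bytes and line_matches are illustrative, and the "ascii" default is an assumption about the file's encoding.

```python
import re

# Pattern sources as plain strings so they can be joined into one alternation.
sources = (
    r"http://ipaddress:port/service\?",
    r"_event=new?",
    r"askment:",
)
combined = re.compile("|".join("(?:{0})".format(s) for s in sources))

# Alternative: build a bytes pattern and skip decoding altogether.
combined_bytes = re.compile(
    b"|".join(b"(?:" + s.encode("ascii") + b")" for s in sources)
)

def line_matches(raw_line: bytes, encoding: str = "ascii") -> bool:
    # Lines read from a tarfile member are bytes; decode before using a str pattern.
    return combined.match(raw_line.decode(encoding, errors="replace")) is not None

print(line_matches(b"askment: pending\n"))
print(line_matches(b"nothing to see here\n"))
print(combined_bytes.match(b"askment: pending\n") is not None)
```

Either way, the pattern and the line have to be the same type: both str or both bytes.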

Setting regular expression to validate URL format in Adobe CQ5

I want to validate a URL inside a textfield using Adobe CQ5, so I set up the regex and regexText properties as usual, but for some reason it is not working:
<facebook
jcr:primaryType="cq:Widget"
emptyText="http://www.facebook.com/account-name"
fieldDescription="Set the Facebook URL"
fieldLabel="Facebook"
name="./facebookUrl"
regex="/^(http://www.|https://www.|http://|https://)[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?(/.*)?$/"
regexText="Invalid URL format"
xtype="textfield"/>
So when I type inside the component I can see an error message at the console:
Uncaught TypeError: this.regex.test is not a function
To be more accurate the error comes from this line:
if (this.regex && !this.regex.test(value)) {
I tried several regular expressions and none of them worked. I guess the problem is the regular expression itself, because on the other hand I have this other regex for validating email addresses, and it works perfectly fine:
/^[A-za-z0-9]+[\\._]*[A-za-z0-9]*@[A-za-z.-]+[\\.]+[A-Za-z]{2,4}$/
Any suggestions? Thanks in advance.
The syntax of your regex seems to treat the forward slashes (/) as special characters. Since you want to parse a URL containing slashes, my guess is you should escape them twice like this: '\\/' instead of '/'. The result would be:
/^(http:\\/\\/www.|https:\\/\\/www.|http:\\/\\/|https:\\/\\/)[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?(\\/.*)?$/
You need to escape them twice because the string to be compiled as a regex must contain '\/' to escape the slashes, but to introduce a backslash in a string you have to escape the backslash itself too.

golang regex to find urls in a string

I am trying to find all links in a string and then hyperlink them,
like this JS lib does: https://github.com/bryanwoods/autolink-js
I tried a lot of regexes but always got too many errors:
http://play.golang.org/p/iQiccXvFiB
I don't know whether Go has a different regex syntax.
So, what regex works in Go and is good at matching URLs in strings?
Thanks
You can use xurls:
import "mvdan.cc/xurls"
func main() {
	xurls.Relaxed().FindString("Do gophers live in golang.org?")
	// "golang.org"
	xurls.Relaxed().FindAllString("foo.com is http://foo.com/.", -1)
	// ["foo.com", "http://foo.com/"]
	xurls.Strict().FindAllString("foo.com is http://foo.com/.", -1)
	// ["http://foo.com/"]
}
Use back-ticks instead of double-quotes for your string literals. Back-slashes inside double-quotes start escape sequences, which you don't need/want for this use case.
Additionally, how did you expect this to work?
"$0"