Regex to match a string containing two (or more) words in any order, case-insensitively - regex

Hope you guys can help me.
I need to make a regex string that alerts me when the following conditions are met:
1. Two (or more) words are identified in a message.
2. It should look not only at whole words but also at words that contain the text I am searching for. For instance, if I search for "error", it should also alert me when it finds "errors" or "errorless".
3. It should not be case sensitive.
4. It has to find word1 and word2 but also vice versa; in other words, it has to find them irrespective of their order.
I have played with regex101 for a while but have not been able to meet all the conditions (condition #4 is still missing).
You can find what I have been able to make so far at the following link:
https://regex101.com/r/Z4cE9A/5
Please note that the expression must work with the following settings:
Flavor: golang / Flag: single line
Important note: I cannot use the character "|" as it does not work properly on the system where I am going to use this string.
Any help would be more than appreciated. Thanks in advance for your support.
EDIT: I mixed things up. The character that does not work is "|". However, if possible, it is better to also avoid "/", as I am not sure whether it works. If you like, you can provide me with two strings, one with the symbol "/" and one without, in case it does not work.

This should do what you want:
(?i:(http)|(error))
You can replace http and error with any other keywords that you are searching for.
To do that in Golang:
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	keywords := []string{
		"error",
		"http",
	}
	// Build a case-insensitive alternation of all keywords
	p := "(?i:(" + strings.Join(keywords, ")|(") + "))"
	text := `
Gran Turismo Sport
Shipment Error
Attempt
https://
`
	re := regexp.MustCompile(p)
	fmt.Println(re.MatchString(text))
}
You can test that in Golang Playground:
https://play.golang.org/p/XOhNVBCh8Pt
EDIT:
Based on the new limitation of not being able to use the | char, I would suggest that you search using this:
(?i:(error)?(http)?)
This will always return true (or, with find all, a list that includes empty strings), but the good thing is that you can filter out all the empty strings and you will end up with the result that you want.
This is an example of this working in Golang Playground:
https://play.golang.org/p/miVC0hdLtQc
EDIT 2:
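In case you want to make sure ALL the keywords are in the text, change the ? in the regex to {1,}. Note that this pattern only matches when the keywords appear one directly after the other, in the order written (for example "errorhttp"); it does not match keywords separated by other text or given in the reverse order. Also, you don't need the loop any more.
(?i:(error){1,}(http){1,})
This is an example working in Golang Playground: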
https://play.golang.org/p/f9eFcvObDsA
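As a rough sketch of another way to meet the any-order requirement without the | character: test each keyword with its own case-insensitive pattern and require that all of them match. The example is written in Python purely for illustration, and the helper name contains_all is an assumption rather than something from the posts above. The same idea should carry over to Go by calling regexp.MatchString once per keyword, but it only helps if the target system can evaluate more than one pattern.

```python
import re


def contains_all(text, keywords):
    """Return True if every keyword occurs somewhere in text, ignoring case."""
    # One search per keyword: no alternation and no lookahead is needed,
    # and the order of the keywords in the text does not matter.
    return all(
        re.search(re.escape(word), text, re.IGNORECASE) is not None
        for word in keywords
    )


message = "Gran Turismo Sport\nShipment Error\nAttempt\nhttps://"
print(contains_all(message, ["error", "http"]))
print(contains_all(message, ["http", "error"]))
print(contains_all("Shipment Errors only", ["error", "http"]))
```

Because each keyword is searched as a substring, "errors" and "errorless" also count as hits for "error".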

Related

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim the trailing characters ) and . from your output.
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems workable to extract URL from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo (edited from https?:[\S]*\/)
A Python script may look something like this:
import re
ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
# Print every URL found in the sample text
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This causes unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
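The effect of the accidental $-_ range can be checked with a small experiment. This is only an illustrative sketch in Python; the variable names are made up for the example.

```python
import re

# "$-_" inside a character class is a range from "$" (ASCII 36) to "_" (ASCII 95).
accidental_range = re.compile(r"[$-_]")
# With "-" placed last, the class holds only the three literal characters.
literal_chars = re.compile(r"[$_-]")

probe = "/:@A9-"
matched_by_range = [c for c in probe if accidental_range.fullmatch(c)]
matched_by_literals = [c for c in probe if literal_chars.fullmatch(c)]

print(matched_by_range)
print(matched_by_literals)
```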
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
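One pragmatic workaround, sketched below in Python, is to match with the simplified regex and then strip the unwanted trailing characters afterwards. This assumes that a trailing . or ) is never a genuine part of the URL, which will not hold for every URL; the function name extract_urls is hypothetical.

```python
import re

# Simplified pattern from the explanation above, with "/" added to the class.
URL_RE = re.compile(r"https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+")


def extract_urls(text):
    """Find URLs and drop any trailing '.' or ')' picked up from the prose."""
    # Assumption: sentence punctuation at the very end is not part of the URL.
    return [match.rstrip(".)") for match in URL_RE.findall(text)]


sample = (
    "this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this "
    "to be copied over http://rda.ucar.edu/datasets/ds111.1/. "
    "http://www.discover-earth.org/index.html). "
    "http://community.eosdis.nasa.gov/measures/)."
)
for url in extract_urls(sample):
    print(url)
```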
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
des_links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Drop everything after the last "/" (the trailing "." or ")." part)
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)

Regex: matching unknown groups that repeat?

I'm trying to create a generic regex pattern for a crawler, to avoid so-called "crawler traps" (links that just add URL parameters and refer to the exact same page, which results in tons of useless data). A lot of the time, those links just add the same part to a URL over and over again. Here is an example from a log file:
http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/...
I can use regular expressions to narrow the scope of the crawler, and I would love to have a pattern that tells the crawler to ignore everything that has repeating parts. Is that possible with a regex?
Thanks in advance for some tips!
JUST TO CLARIFY:
The crawler traps are not designed to prevent crawling; they are a result of poor web design. All the pages we are crawling have explicitly allowed us to do so!
If you are already looping through a list of URLs, you could add matching as a condition to skip the current iteration:
import re
array = ["/abcd/abcd/abcd/abcd/", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/", "http://examplepage/apple/cake/banana/"]
pattern1 = re.compile(r'.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*')
for url in array:
    # Skip URLs in which the same part keeps recurring
    if pattern1.match(url):
        print("It matches; skipping this URL")
        continue
    print(url)
Example regex:
.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*
([^\/\&?]{4,}) matches and captures a run of 4 or more characters that are not /, & or ?.
(?:[\/\&\?]) matches a single /, & or ?.
(.*?\1){3,} lazily matches anything, followed by what we captured, and does all of this 3 or more times.
You can use a backreference in Python/Perl regexes (and possibly others) to catch a pattern which is repeated:
>>> re.search(r"(/.+)\1", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/").group(1)
'/cssms/chrome'
\1 references the first captured group, so (/.+)\1 means the same sequence repeated twice in a row. The leading / is just to keep the regex from matching the first single repeated letter (which is the t in http) and to catch repetitions in the path instead.
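If a regex turns out to be too blunt, the same check can be approximated by counting path segments. The following Python 3 sketch is an alternative to the backreference approach, not a restatement of it; the function name has_repeated_segment and the threshold of 3 are assumptions chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def has_repeated_segment(url, min_count=3):
    """Return True if any path segment occurs at least min_count times."""
    # Split the path on "/" and ignore the empty strings around the slashes.
    segments = [part for part in urlsplit(url).path.split("/") if part]
    counts = Counter(segments)
    return any(n >= min_count for n in counts.values())


urls = [
    "/abcd/abcd/abcd/abcd/",
    "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/",
    "http://examplepage/apple/cake/banana/",
]
for url in urls:
    print(url, has_repeated_segment(url))
```

Some legitimate URLs do repeat a segment, so the threshold would need tuning against real crawl logs.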

Case Insensitive Regex expression for getting file

I have a scenario where I am taking files from a folder for data loading, and the files follow the naming convention .Customer_..txt. I would also like to make this expression case insensitive, so that if a file named CUSTOMER_1234 arrives, it will also be accepted and processed accordingly.
Try the below regex:
(?i)customer(?-i).*\.txt
in the wildcard section of the "get files" steps or any other regex step you are using. This will filter out files starting with either "customer" or "CUSTOMER".
Attached a sample code here.
Hope this helps :)
Modifying my previous answer based on the comment below:
If you are looking to match the pattern "customer_" irrespective of case sensitivity, first of all you can easily do it using a Javascript "match" function. You just need to pass the file names in upper case and match with the uppercase pattern. This will easily fetch you the result. Check the JS snip below:
var pattern="customer_"; // pattern is the word pattern you want to match
var match_files= upper(files).match(upper(pattern)); // files is the list of files you are getting from the directory
if(upper(match_files)==upper(pattern)){
    // set one flag as 'match'
}
else{
    // set the flag as 'not match'
}
But in case you need to use a regex only, you can try the regex below:
.*(?i)(customer|CUSTOMER).*(?-i)\.txt
This would work for "_123_Customer_1vasd.txt" patterns too.
Hope this helps :)
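For comparison, here is a minimal Python sketch of the same case-insensitive file-name check. It uses Python's re.IGNORECASE flag rather than the inline (?i) ... (?-i) switches shown above, so it illustrates the idea and is not a drop-in for the tool discussed in this thread; the names CUSTOMER_FILE and is_customer_file are assumptions.

```python
import re

# The whole pattern is case-insensitive, so ".TXT" would be accepted as well.
CUSTOMER_FILE = re.compile(r".*customer_.*\.txt", re.IGNORECASE)


def is_customer_file(name):
    """Return True if the file name contains 'customer_' and ends in '.txt'."""
    return CUSTOMER_FILE.fullmatch(name) is not None


for name in ["CUSTOMER_1234.txt", "_123_Customer_1vasd.txt", "orders_1234.txt"]:
    print(name, is_customer_file(name))
```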

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
Which I am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to python and regex so this will limit my understanding somewhat in regards to relatively complicated regex suggestions etc.
update:
After the suggestions, I tried to test the following regex, which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print(i)
solution:
I think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.
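To make the "get all the URLs, then remove duplicates yourself" suggestion concrete, here is a hedged Python sketch. The pattern allows any number of hyphen-separated words and any page number from 2 upwards, which is one reading of the requirements in Part A; the names HREF_RE and unique_page_links, and the sample HTML, are assumptions made for the example.

```python
import re

# Digits, then one or more "-word" parts, then /page-N with N >= 2.
HREF_RE = re.compile(
    r'href="(/dir/Sub_Dir/dir/[0-9]+(?:-[a-zA-Z]+)+/page-(?:[2-9]|[1-9][0-9]+))"',
    re.I,
)


def unique_page_links(html):
    """Return matching hrefs in first-seen order with duplicates removed."""
    # dict.fromkeys keeps insertion order, so it works as an ordered set.
    return list(dict.fromkeys(HREF_RE.findall(html)))


html = (
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4 again</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-12">12</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-1">1</a>'
)
for link in unique_page_links(html):
    print("http://www.test.com" + link)
```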

Ruby Puppet Regex string matching

I'm somewhat new to Ruby and have done a ton of Google searching but just can't seem to figure out how to match this particular pattern. I have used rubular.com and can't seem to find a simple way to match. Here is what I'm trying to do:
I have several types of hosts, they take this form:
Sample hostgroups
host-brd0000.localdomain
host-cat0000.localdomain
host-dog0000.localdomain
host-bug0000.localdomain
Next I have a case statement, I want to keep out the bugs (who doesn't right?). I want to do something like this to match the series of characters. However, it starts matching at host-b, host-c, host-d, and matches only a single character as if I did a [brdcatdog].
case $hostgroups { #variable takes the host string up to where the numbers begin
  # animals to keep
  /host-[["brd"],["cat"],["dog"]]/: {
    file {"/usr/bin/petstore-friends.sh":
      owner => petstore,
      group => petstore,
      mode => 755,
      source => "puppet:///modules/petstore-friends.sh.$hostgroups",
    }
  }
I could do something like [bcd][rao][dtg], but it's not very clean looking and will match nonsense like "bad", "cot", "dat" and "crt", which I don't want.
Is there a slick way to use \A and [] that I'm missing?
Thanks for your help.
-wootini
How about using negative lookahead?
host-(?!bug).*
Here is the RUBULAR permalink matching everything except those pesky bugs!
Is this what you're looking for?
host-(brd|cat|dog)
(Following gtgaxiola's example, here's the Rubular permalink)
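The difference between a character class and a group with alternation can be seen in a quick experiment. The sketch below uses Python only for illustration, on the assumption that these two constructs behave the same way in the Ruby-style regexes Puppet uses; the variable names are made up.

```python
import re

hosts = [
    "host-brd0000.localdomain",
    "host-cat0000.localdomain",
    "host-dog0000.localdomain",
    "host-bug0000.localdomain",
]

# Alternation: the text after "host-" must be one of the three whole words.
keep = re.compile(r"host-(brd|cat|dog)")
# Character class: only ONE character after "host-" is checked.
one_char = re.compile(r"host-[brdcatdog]")

kept = [h for h in hosts if keep.match(h)]
loosely_kept = [h for h in hosts if one_char.match(h)]

print(kept)
print(loosely_kept)
```

The character class lets host-bug through because "b" is one of the listed characters, which is the behaviour described in the question.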