Regular expression for validating literals and alphanumeric character in string - regex

We would like to validate the following with a regex:
the string may contain 0 or more letters, digits, or underscores, OR
the string may contain the literals %sample1% or %sample2% (0 or more times, in any order)
For example:
%sample1%_%sample2% is valid
%sample2%%sample1% is valid
1_abc is valid
%sampleee1% is not valid
%sample2%%sample1%_%sample1%_%sample1% is valid
We tried this:
^(%sample1%)*[a-zA-Z0-9_]*(%sample2%)*$
but it does not match the following:
%sample2%%sample1%
What should our regex be in this case?

This regex does what you want:
^(%sample1%|%sample2%|[a-zA-Z0-9_])*$
See live demo
Note that this may be shortened to:
^(%sample[12]%|\w)*$
You may not want to combine the "sample" terms, though. Also, \w is the same as [a-zA-Z0-9_] only if you are expecting Latin characters: in Unicode-aware regex flavors, \w also includes letters and digits from many other languages.
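As a quick check, the shortened pattern can be run against the examples from the question. This is a minimal Python sketch, not part of the original answer: the name `is_valid` is made up for illustration, and `re.ASCII` is added on the assumption that `\w` should accept only Latin letters, digits, and the underscore.

```python
import re

# Either one of the two literals, or a single word character,
# repeated zero or more times. re.ASCII limits \w to [a-zA-Z0-9_]
# (an assumption about the requirement).
PATTERN = re.compile(r'(?:%sample[12]%|\w)*', re.ASCII)

def is_valid(s):
    """Return True if the whole string consists of allowed tokens."""
    return PATTERN.fullmatch(s) is not None

for s in ['%sample1%_%sample2%',
          '%sample2%%sample1%',
          '1_abc',
          '%sampleee1%',
          '%sample2%%sample1%_%sample1%_%sample1%']:
    print(s, is_valid(s))
```

Only `%sampleee1%` is reported as invalid here. Using `fullmatch` makes the `^` and `$` anchors unnecessary.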

Just break it down in code, the way you described it:
import re

txt = '''\
%sample1%_%sample2% is valid
%sample2%%sample1% is valid
1_abc is valid
%sampleee1% is not valid
%sample2%%sample1%%sample1%%sample1% is valid'''

for line in txt.splitlines():
    # Show the test string and its expected verdict
    print(line.split(' ', 1))
    # Rule 1: the line contains an underscore, a digit, and a letter
    if re.search(r'_', line) and re.search(r'\d', line) and re.search(r'[a-zA-Z]', line):
        print('valid #1')
    # Rule 2: the line contains a %sampleN% literal
    elif re.search(r'%sample\d+%', line):
        print('valid #2')
    else:
        print('not valid')
Prints:
['%sample1%_%sample2%', 'is valid']
valid #1
['%sample2%%sample1%', 'is valid']
valid #2
['1_abc', 'is valid']
valid #1
['%sampleee1%', 'is not valid']
not valid
['%sample2%%sample1%%sample1%%sample1%', 'is valid']
valid #2

This will also solve your problem:
^((%sample1%)|(%sample2%)|[a-zA-Z0-9_])*$

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only the letters 'a', 'b', 'c' are allowed) separated by commas (the last group is not followed by a comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written the following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However, it doesn't work correctly, because for the above example:
>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df")))  # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df")))  # 'x' in first group
False
It seems to check only the first group of three letters and to ignore the rest. How do I write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start to the end of the string
[...] one of the given characters
...{3} exactly three repetitions of the preceding element
(...)* zero or more repetitions of the group in parentheses
What you're asking it to find with your regex is "at least one triple of the letters a, b, c followed by a comma"; that's what "+" gives you. Whatever follows after that doesn't matter to the regex. You might want to include "$", which means "end of the line", to make sure that the whole line consists of allowed triples. However, in its current form your regex would also demand that the last triple end in a comma, so you should explicitly code that it doesn't.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', data)
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
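The difference between the unanchored and the anchored pattern can be seen in a short Python 3 sketch. The variable names are illustrative only; `re.fullmatch` is shown as an alternative that needs no explicit anchors.

```python
import re

# The question's pattern: satisfied as soon as one "triple + comma" is found.
unanchored = re.compile(r'([abc]{3},)+')
# Triples followed by commas, then a final triple, then the end of the line.
anchored = re.compile(r'([abc]{3},)*[abc]{3}$')

print(bool(unanchored.match('abc,defx,df')))  # True: the prefix 'abc,' is enough
print(bool(anchored.match('abc,defx,df')))    # False
print(bool(anchored.match('abc,bca,cbb')))    # True

# re.fullmatch requires the whole string to match, so no anchors are needed.
print(bool(re.fullmatch(r'(?:[abc]{3},)*[abc]{3}', 'ccc,abc,aab,baa')))  # True
```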
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So the regex to check whether the string matches the pattern will be
data_string = "abc,bca,df"
# The trailing $ forces the match to reach the end of the string;
# without it, the valid prefix "abc,bca," alone would satisfy the pattern.
match = re.match(r'^([abc]{3}(,|$))+$', data_string)
if match:
    print("data string is correct")
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters from [abc] followed by a comma, one or more times, at the start of the string, and ignores whatever comes after. So the most important part is to make sure that there is nothing more in the string, as scessor suggests, by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
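>>> import itertools
>>> def matcher(x):
...     # All 27 three-letter combinations of 'a', 'b', 'c'
...     total = ["".join(p) for p in itertools.product(('a','b','c'), repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...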
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate that a string is composed of triplets of the letters a, b, and c:
import re

for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, bool(re.match(r'([abc]{3},?)+\Z', ss)))
Result:
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means the end of the string. Its presence forces the match to extend to the very end of the string.
By the way, I like the form of Sonya too; in a way it is clearer:
bool(re.match(r'([abc]{3},)*[abc]{3}\Z', ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group.
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See, for example, the classic issue described in the "re.findall behaves weird" post, where re.findall and all other regex methods using this function behind the scenes return only the captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
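The re.findall effect mentioned above is easy to reproduce with the standard library alone. A small sketch follows; the sample string is taken from the question and the variable names are illustrative.

```python
import re

text = 'abc,bca,cbb'

# With a capturing group, findall returns only what the group captured
# (here, its last repetition), not the whole match.
with_capture = re.findall(r'[abc]{3}(,[abc]{3})*', text)
print(with_capture)     # [',cbb']

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r'[abc]{3}(?:,[abc]{3})*', text)
print(without_capture)  # ['abc,bca,cbb']
```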

Regex absolute beginner: filter alphanumeric

I'm playing codewars in Ruby and I'm stuck on a Kata. The goal is to validate if a user input string is alphanumeric. (yes, this is quite advanced Regex)
The instructions:
At least one character ("" is not valid)
Allowed characters are uppercase / lowercase latin letters and digits from 0 to 9
No whitespaces/underscore
What I've tried :
^[a-zA-Z0-9]+$
^(?! !)[a-zA-Z0-9]+$
^((?! !)[a-zA-Z0-9]+)$
It passes all the tests except one; here's the error message:
Value is not what was expected
I thought the regex I'm using would satisfy all the conditions. What am I missing?
SOLUTION: \A[a-zA-Z0-9]+\z (and better Ruby :^) )
$ => end of a line
\z => end of a string
(same for beginning: ^ (line) and \A (string), but wasn't needed for the test)
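Favourite answer from another player:
/\A[A-z\d]+\z/
(Note that [A-z] also covers the ASCII characters between Z and a, namely [ \ ] ^ _ and `, so this pattern accepts underscores.)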
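The same line-versus-string distinction exists in Python, with two differences worth stating as assumptions of this sketch: Python spells the absolute end-of-string anchor `\Z` (Ruby's `\z`), and `^`/`$` are line-based only when `re.MULTILINE` is passed, whereas in Ruby they always are. Even without that flag, Python's `$` still matches just before a trailing newline.

```python
import re

line_anchored = re.compile(r'^[a-zA-Z0-9]+$')
string_anchored = re.compile(r'\A[a-zA-Z0-9]+\Z')

s = 'abc123\n'
print(bool(line_anchored.search(s)))    # True: $ matches before the final newline
print(bool(string_anchored.search(s)))  # False: \Z only matches at the very end

# With re.MULTILINE, ^ and $ work per line, as they always do in Ruby,
# so one valid line is enough even if another line is invalid.
multi = 'abc\n!!'
print(bool(re.search(r'^[a-zA-Z0-9]+$', multi, re.MULTILINE)))  # True
print(bool(string_anchored.search(multi)))                      # False
```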
My guess is that we could start with an expression similar to:
^(?=[A-Za-z0-9])[A-Za-z0-9]+$
and test whether it covers our desired rules.
The expression is explained in this demo, if you are interested.
Test
re = /^(?=[A-Za-z0-9])[A-Za-z0-9]+$/m
str = '
ab
c
def
abc*
def^
'
# Print the match result
str.scan(re) do |match|
  puts match.to_s
end
str !~ /[^A-Za-z\d]/
The string consists only of alphanumeric characters if and only if it contains no character other than an alphanumeric one.
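One caveat of the negated-class test is that an empty string contains no forbidden character either, so it passes, while the kata requires at least one character. A minimal Python sketch of the same idea with that case handled; the function name `is_alphanumeric` is hypothetical.

```python
import re

def is_alphanumeric(s):
    """True if s is non-empty and has no character outside A-Z, a-z, 0-9."""
    # bool(s) rejects the empty string, which the negated search alone accepts.
    return bool(s) and re.search(r'[^A-Za-z0-9]', s) is None

print(is_alphanumeric('abc123'))  # True
print(is_alphanumeric(''))        # False
print(is_alphanumeric('ab_c'))    # False
```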

RegEx not recognized although it should be

I'm trying to split texts like these:
§1Hello§fman, §0this §8is §2a §blittle §dtest :)
by delimiter "§[a-z|A-Z
My first approach was the following:
^[§]{1}[a-fA-F]|[0-9]$
But pythex.org won't find any occurrences in my example text by using this regex.
Do you know why?
The ^[§]{1}[a-fA-F]|[0-9]$ pattern matches a string starting with § and then having a letter from a-f and A-F ranges, or a digit at the end of the string.
Note the ^ matches the start of the string, and $ matches the end of the string positions.
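The way the alternation splits the pattern can be checked directly. In this sketch the sample text is the one from the question, and the second pattern is only one possible correction (no anchors, digits moved inside the character class).

```python
import re

s = '§1Hello§fman, §0this §8is §2a §blittle §dtest :)'

# The | splits the whole pattern into two alternatives:
#   ^[§]{1}[a-fA-F]   (§ plus a hex letter at the start of the string)
#   [0-9]$            (a digit at the end of the string)
# The text starts with "§1" and ends with ")", so neither alternative matches.
broken = re.findall(r'^[§]{1}[a-fA-F]|[0-9]$', s)
print(broken)  # []

# Without anchors, and with the digits inside the class:
fixed = re.findall(r'§[a-fA-F0-9]', s)
print(fixed)   # ['§1', '§f', '§0', '§8', '§2', '§b', '§d']
```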
To extract the words that follow § and a hex character, you may use
re.findall(r'§[A-Fa-f0-9]([^\W\d_]+)', s)
# => ['Hello', 'man', 'this', 'is', 'a', 'little', 'test']
To remove the delimiters, you may use re.sub:
re.sub(r'\s*§[A-Fa-f0-9]', ' ', s).strip()
# => Hello man, this is a little test :)
To just get a string of those delimiters, you may use
"".join(re.findall(r'§[A-Fa-f0-9]', s))
# => §1§f§0§8§2§b§d
See this Python demo.
Details
§ - a § symbol
[A-Fa-f0-9] - one hex character: a digit or an ASCII letter from the a-f and A-F ranges
([^\W\d_]+) - Group 1 (this value will be extracted by re.findall): one or more letters (to include digits, remove \d)
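The [^\W\d_] class reads as "a word character that is neither a digit nor an underscore", which leaves letters only. A tiny sketch with a made-up input string illustrates it:

```python
import re

# \w minus digits minus underscore = letters (Unicode-aware in Python 3).
letters_only = re.findall(r'[^\W\d_]+', 'abc_123 déf')
print(letters_only)  # ['abc', 'déf']
```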
Your regex uses the anchors ^ and $ to assert the start and the end of the string, so it cannot match delimiters in the middle of the text.
You could update your regex to §[a-fA-F0-9]
Example using split:
import re
s = "§1Hello§fman, §0this §8is §2a §blittle §dtest :)"
result = [r.strip() for r in re.split('[§]+[a-fA-F0-9]', s) if r.strip()]
print(result)

Comparing strings with regex

I basically want to match strings like: "something", "some,thing", "some,one,thing", but I want to not match expressions like: ',thing', '_thing,' , 'some_thing'.
The pattern I want to match is: A string beginning with only letters and the rest of the body can be a comma, space or letters.
Here's what I did:
import re
x=re.compile('^[a-zA-z][a-zA-z, ]*') #there's space in the 2nd expression here
stri='some_thing'
x.match(stri)
It gives me:
<_sre.SRE_Match object; span=(0, 4), match='some'>
The thing is, my regex somewhat works, but it extracts only the part of the string that matches. I want to compare the entire string with the regular expression pattern and get False if it does not match the pattern. How do I do this?
Your pattern uses [A-z], which matches more than you think.
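To see what the A-z range actually covers, the extra characters can be listed from their code points. This is an illustrative sketch; the variable name `extra` is made up.

```python
import re

# Characters in the range A..z that are not letters.
extra = [chr(c) for c in range(ord('A'), ord('z') + 1) if not chr(c).isalpha()]
print(extra)  # ['[', '\\', ']', '^', '_', '`']

# Because the underscore is in that range, [A-z] accepts 'some_thing'.
print(bool(re.fullmatch(r'[A-z]+', 'some_thing')))     # True
print(bool(re.fullmatch(r'[a-zA-Z]+', 'some_thing')))  # False
```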
If you want to match [a-zA-Z] for both you might use the case insensitive flag:
import re
x=re.compile('^[a-z][a-z, ]*$', re.IGNORECASE)
stri='some,thing'
if x.match(stri):
    print("Match")
else:
    print("No match")
The easiest way would be to just compare the result to the original string.
import re
x = re.compile('^[a-zA-Z][a-zA-Z, ]*')
# Note: match() returns None when even the first character fails,
# so check for that before calling .group() in real code.
s = 'some_thing'
x.match(s).group(0) == s  #-> False
s = 'some thing'
x.match(s).group(0) == s  #-> True
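In Python 3.4 and later, `re.fullmatch` performs this whole-string comparison directly and returns None when any part of the string fails. A minimal sketch using the strings from the question; the helper name `matches_entirely` is hypothetical.

```python
import re

# First character a letter, then letters, commas, or spaces.
pattern = re.compile(r'[a-zA-Z][a-zA-Z, ]*')

def matches_entirely(s):
    """Return True only if the whole string fits the pattern."""
    return pattern.fullmatch(s) is not None

for s in ['something', 'some,thing', 'some,one,thing',
          ',thing', '_thing,', 'some_thing']:
    print(repr(s), matches_entirely(s))
```

The first three strings are accepted and the last three are rejected.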

Wordnet how to know if string is valid query string

I'm having trouble calling functions from Wordnet::SenseRelate because some of the "words" in the text are not valid queries. I've tried wrapping the calls in try and catch so that the program skips the bad word instead of quitting, but no luck. I wanted to check whether a word is valid by using Wordnet::QueryData, but it quits when I use an invalid word like:
$wn->querySense("#44");
I get:
(querySense) Bad query string: #44
The regex which is used can be found in the statement:
my ($word, $pos, $sense) = $string =~ /^([^\#]+)(?:\#([^\#]+)(?:\#(\d+))?)?$/;
If in doubt whether a token will be accepted, test it against this regex.
Commenting on the specific question: there cannot be any leading or trailing # characters (the problem experienced). If # characters are present, there can be 1 or 2, but not more than 2, in the query string. The # characters, if present, act as delimiters that determine what is the word, what is the pos, and what is the sense.
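For readers who want to experiment with the pattern outside Perl, here is a Python port of the quoted regex. It is only a sketch: the helper name `parse_query` is made up, and it mirrors just the regex shown above, so the module may still reject a token for other reasons.

```python
import re

# Same structure as the Perl pattern: word, optional #pos, optional #sense.
QUERY_RE = re.compile(r'^([^#]+)(?:#([^#]+)(?:#(\d+))?)?$')

def parse_query(token):
    """Return (word, pos, sense), or None if the pattern rejects the token."""
    m = QUERY_RE.match(token)
    return m.groups() if m else None

print(parse_query('#44'))      # None: a leading # leaves no word part
print(parse_query('dog#n#1'))  # ('dog', 'n', '1')
print(parse_query('dog'))      # ('dog', None, None)
```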