Regex to match a string based on particular character count [duplicate] - regex

This is not a duplicate: the other questions are about repeating a regex, whereas mine is about limiting how many times a particular character occurs, for validation. I am looking for a regex that matches a string only when it contains the character ' exactly once.
Example:
patt = #IDontKnow
s = "Shubham's"
if re.match(patt, s):
    print ("The string has only ONE '")
else:
    print ("The String has either less or more than ONE ' count")

I guess what you are looking for is this:
import re
pat = "^[^\']*\'[^\']*$"
print (re.match(pat, "aeh'3q4'bl;5hkj5l;ebj3'"))
print (re.match(pat, "aeh3q4'bl;5hkj5l;ebj3'"))
print (re.match(pat, "aeh3q4bl;5hkj5l;ebj3'"))
print (re.match(pat, "'"))
print (re.match(pat, ""))
Which gives output:
None
None
<_sre.SRE_Match object; span=(0, 21), match="aeh3q4bl;5hkj5l;ebj3'">
<_sre.SRE_Match object; span=(0, 1), match="'">
None
What does "^[^\']*\'[^\']*$" do?
^ - matches the beginning of the string.
[^\']* - [] defines a set of characters, and a leading ^ negates it. Here the set holds a single character, ', written escaped as \'. The trailing * repeats it 0 or more times, so this part matches any run of characters other than '.
\' - matches exactly one ' character.
$ - matches the end of the string. Without it, partial matches would be possible, and the unmatched remainder could contain more ' characters. Compare with the output above:
print (re.match("^[^\']*\'[^\']*", "aeh'3q4'bl;5hkj5l;ebj3'"))
<_sre.SRE_Match object; span=(0, 7), match="aeh'3q4">
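As a minimal sketch, the same pattern can be wrapped in a small helper. The name has_exactly_one is hypothetical, and re.fullmatch is swapped in for the explicit ^ and $ anchors; the sketch assumes the thing being counted is a single character, which re.escape makes safe to place inside the character class.

```python
import re

def has_exactly_one(text, char="'"):
    """Return True if `char` occurs exactly once in `text` (hypothetical helper)."""
    c = re.escape(char)
    # any run without char, then char once, then any run without char
    pattern = "[^{0}]*{0}[^{0}]*".format(c)
    # fullmatch requires the whole string to match, like ^...$ in the answer
    return re.fullmatch(pattern, text) is not None

print(has_exactly_one("Shubham's"))
print(has_exactly_one("aeh'3q4'bl;5hkj5l;ebj3'"))
```

For a single fixed character, the .count() approach is simpler; the regex form is mainly useful when the check has to be combined with other pattern constraints.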

Why not just use .count()?
s = "Shubham's"
if s.count("\'") == 1:
    print ("The string has only ONE '")
else:
    print ("The String has either less or more than ONE ' count")

Related

Regex match string where symbol is not repeated

I have strings like these:
group items % together into% FALSE
characters % that can match any single TRUE
How can I match sentences in which the % symbol is not repeated?
I tried this pattern, but it matches any sentence containing %, including the first one:
[%]{1}
You may use this regex in python to return failure for lines that have more than one % in them:
^(?!([^%]*%){2}).+
(?!([^%]*%){2}) is a negative lookahead that fails the match if % is found twice after line start.
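A minimal sketch of how the lookahead pattern behaves in Python, using the two sentences from the question. The names NOT_REPEATED and is_not_repeated are made up for illustration. Note that, as written, the pattern also accepts a non-empty line with no % at all.

```python
import re

# Pattern from the answer: reject a line if '%' can be found twice from its start.
NOT_REPEATED = re.compile(r'^(?!([^%]*%){2}).+')

def is_not_repeated(line):
    """True when the line is non-empty and holds at most one '%' (hypothetical helper)."""
    return NOT_REPEATED.match(line) is not None

for line in ['group items % together into%',
             'characters % that can match any single']:
    print(line, is_not_repeated(line))
```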
You could use re.search as follows:
import re

items = ['group items % together into%', 'characters % that can match any single']
for item in items:
    output = item
    # two or more % anywhere in the string means the symbol is repeated
    if re.search(r'^.*%.*%.*$', item):
        output = output + ' FALSE'
    else:
        output = output + ' TRUE'
    print(output)
This prints:
group items % together into% FALSE
characters % that can match any single TRUE
Just count them (Python):
>>> s = 'blah % blah %'
>>> s.count('%') == 1
False
>>> s = 'blah % blah'
>>> s.count('%') == 1
True
With regex:
>>> re.match('[^%]*%[^%]*$','gfdg%fdgfgfd%')
>>> re.match('[^%]*%[^%]*$','blah % blah % blah')
>>> re.match('[^%]*%[^%]*$','blah % blah blah')
<re.Match object; span=(0, 16), match='blah % blah blah'>
re.match only matches at the start of the string. re.search can match anywhere in the string, so add ^ (start of string) to the pattern when using it:
>>> re.search('^[^%]*%[^%]*$','gfdg%fdgfgfd%')
>>> re.search('^[^%]*%[^%]*$','gfdg%fdgfgfd')
<re.Match object; span=(0, 12), match='gfdg%fdgfgfd'>
I am assuming that "sentence" in your question is the same as a line in the input text. With that assumption, you can use the following:
^[^%\r\n]*(%[^%\r\n]*)?$
This, along with the multi-line and global flags, will match all lines in the input string that contain 0 or 1 '%' symbols.
^ matches the start of a line
[^%\r\n]* matches 0 or more characters that are not '%' or a new line
(...)? matches 0 or 1 instance of the contents in parentheses
% matches '%' literally
$ matches the end of a line
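A rough Python sketch of the same idea, assuming the input is one multi-line string. re.MULTILINE plays the role of the multi-line flag, and finditer stands in for the global flag; the names AT_MOST_ONE and lines_with_at_most_one_percent are hypothetical.

```python
import re

# Pattern from the answer: a whole line containing zero or one '%'.
AT_MOST_ONE = re.compile(r'^[^%\r\n]*(%[^%\r\n]*)?$', re.MULTILINE)

def lines_with_at_most_one_percent(text):
    """Return every line of `text` that contains 0 or 1 '%' symbols."""
    # group(0) is the full matched line; the inner group is only a helper
    return [m.group(0) for m in AT_MOST_ONE.finditer(text)]

sample = ("group items % together into%\n"
          "characters % that can match any single\n"
          "no percent sign here")
print(lines_with_at_most_one_percent(sample))
```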

matching two or more characters that are not the same

Is it possible to write a regex pattern to match abc where each letter is not literal but means that text like xyz (but not xxy) would be matched? I am able to get as far as (.)(?!\1) to match a in ab but then I am stumped.
After getting the answer below, I was able to write a routine to generate this pattern. Using raw re patterns is much faster than converting both the pattern and a text to canonical form and then comparing them.
def pat2re(p, know=None, wild=None):
    r"""return a compiled re pattern that will find pattern `p`
    in which each different character should find a different
    character in a string. Characters to be taken literally
    or that can represent any character should be given as
    `know` and `wild`, respectively.

    EXAMPLES
    ========

    Characters in the pattern denote different characters to
    be matched; characters that are the same in the pattern
    must be the same in the text:

    >>> pat = pat2re('abba')
    >>> assert pat.search('maccaw')
    >>> assert not pat.search('busses')

    The underlying pattern of the re object can be seen
    with the pattern property:

    >>> pat.pattern
    '(.)(?!\\1)(.)\\2\\1'

    If some characters are to be taken literally, list them
    as known; do the same if some characters can stand for
    any character (i.e. are wildcards):

    >>> a_ = pat2re('ab', know='a')
    >>> assert a_.search('ad') and not a_.search('bc')
    >>> ab_ = pat2re('ab*', know='ab', wild='*')
    >>> assert ab_.search('abc') and ab_.search('abd')
    >>> assert not ab_.search('bad')
    """
    import re
    # make a canonical "hash" of the pattern
    # with ints representing pattern elements that
    # must be unique and strings for wild or known
    # values
    m = {}
    j = 1
    know = know or ''
    wild = wild or ''
    for c in p:
        if c in know:
            m[c] = r'\.' if c == '.' else c
        elif c in wild:
            m[c] = '.'
        elif c not in m:
            m[c] = j
            j += 1
    # backreferences are limited to groups 1..99
    assert j < 100
    h = tuple(m[i] for i in p)
    # build pattern
    out = []
    last = 0
    for i in h:
        if type(i) is int:
            if i <= last:
                # already-seen element: refer back to its group
                out.append(r'\%s' % i)
            else:
                if last:
                    # new element: must differ from every earlier group
                    ors = '|'.join(r'\%s' % i for i in range(1, last + 1))
                    out.append('(?!%s)(.)' % ors)
                else:
                    out.append('(.)')
                last = i
        else:
            out.append(i)
    return re.compile(''.join(out))
You may try:
^(.)(?!\1)(.)(?!\1|\2).$
Here is an explanation of the regex pattern:
^ from the start of the string
(.) match and capture any first character (no restrictions so far)
(?!\1) then assert that the second character is different from the first
(.) match and capture any (legitimate) second character
(?!\1|\2) then assert that the third character does not match first or second
. match any valid third character
$ end of string
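The pattern explained above can be tried out in Python with a short sketch like the following. The names THREE_DISTINCT and all_three_differ are hypothetical and only serve the illustration.

```python
import re

# Three characters, each different from the ones captured before it.
THREE_DISTINCT = re.compile(r'^(.)(?!\1)(.)(?!\1|\2).$')

def all_three_differ(text):
    """True if `text` is three characters long and all of them differ."""
    return THREE_DISTINCT.match(text) is not None

for candidate in ['xyz', 'xxy', 'xyx', 'xyy']:
    print(candidate, all_three_differ(candidate))
```

Only 'xyz' is accepted here: 'xxy' fails at the first lookahead, while 'xyx' and 'xyy' fail at the second.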

Get the word before & after '_-_' with REGEX PowerShell

I am trying to get the Word before and decimal string following a non guaranteed string that looks like ' - '.
Consider this string
"some str (targetWord - 12434 trailing string)"
this string is not guaranteed to have spaces before or after the '-'
so it could look like one of the following
"some str (targetWord-12434 trailing string)"
"some str (targetWord- 12434 trailing string)"
"some str (targetWord -12434 trailing string)"
"some str (targetWord- 12434 trailing string)"
So far I have the following
$allServices = (Get-Service "Known Service Prefix*").DisplayName
foreach ($service in $allServices){
    $service = $service.split('\((.*?)\)')[1] #esc( 'Match any non greedy' esc)
    if($service.split()[0] -Match '-'){
        $arr_services += $service.split('( - )')[0..1]
    }else{
        $arr_services += ($service -replace '-','').split()[0..1]
    }
}
This handles the simple cases of ' - ' and '-', but can't handle anything else. I feel like this is the kind of problem that could be handled by one line of REGEX, or at most two.
What I want to end up with is an array of strings, where the evens (including zero) are the targetWord, and the odd values are the decimal strings.
My issue isn't that I can't make this happen, it's that it looks like crap...
what I mean is my goal is to try and use REGEX to get each word, ignore the '-', and push out to a growing array the targetWord & decimalString.
I see this as more of a puzzle than anything and am trying to use this to improve my REGEX skills. Any help is appreciated!
A single regex passed to the -match operator should suffice:
$arr_services = $allServices | ForEach-Object {
    if ($_ -match '\((?<word>\w+) *- *(?<number>\d+)') {
        # Output the word and number consecutively.
        $Matches.word, $Matches.number
    }
}
# Output the resulting array.
$arr_services
Note how the pipeline output can be directly collected in a variable as an array ($arr_services = ...) - no need to iteratively "add" to an array. If you need to ensure that $arr_services is always an array - even if the pipeline outputs only one object, use [array] $arr_services = ...
With your sample strings, the above yields (a flat array of consecutive word-number pairs):
targetWord
12434
targetWord
12434
targetWord
12434
targetWord
12434
As for the regex:
\( matches a literal (
\w+ matches a nonempty run (+) of word characters (\w - letters, digits, _), captured in named capture group word ((?<word>...)).
 *- * matches a literal - surrounded by any number of spaces - including none (*).
\d+ matches a nonempty run of digits (\d), captured in named group number.
If the -match operator finds a match, the results are reflected in the automatic $Matches variable, a hashtable that enables accessing named capture groups directly by name.
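For readers who want to experiment with the same regex outside PowerShell, here is a rough Python port. It is a sketch, not part of the original answer: Python spells named groups (?P<name>...) instead of (?<name>...), and the names PAIR, samples, and flat are made up for the illustration.

```python
import re

# Same regex as in the answer, with Python's named-group syntax.
PAIR = re.compile(r'\((?P<word>\w+) *- *(?P<number>\d+)')

samples = [
    "some str (targetWord - 12434 trailing string)",
    "some str (targetWord-12434 trailing string)",
    "some str (targetWord- 12434 trailing string)",
    "some str (targetWord -12434 trailing string)",
]

flat = []
for s in samples:
    m = PAIR.search(s)
    if m:
        # append word and number consecutively, as the PowerShell pipeline does
        flat.extend([m.group('word'), m.group('number')])
print(flat)
```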
here's one way to handle the data set you posted. it presumes all the strings will have the same general format that you posted. that means it WILL FAIL if your sample data set is not realistic. [grin]
$InStuff = @(
    'some str (targetWord - 12434 trailing string)'
    'some str (targetWord-12434 trailing string)'
    'some str (targetWord- 12434 trailing string)'
    'some str (targetWord -12434 trailing string)'
    'some str (targetWord- 12434 trailing string)'
)
$Results = foreach ($IS_Item in $InStuff)
{
    $Null = $IS_Item -match '.+\((?<Word>.+) *- *(?<Number>\d{1,}) .+\)'
    [PSCustomObject]@{
        Word = $Matches.Word.Trim()
        Number = $Matches.Number
    }
}
$Results
output ...
Word       Number
----       ------
targetWord 12434
targetWord 12434
targetWord 12434
targetWord 12434
targetWord 12434

Python Replacement of Shortcodes using Regular Expressions

I have a string that looks like this:
my_str = "This sentence has a [b|bolded] word, and [b|another] one too!"
And I need it to be converted into this:
new_str = "This sentence has a <b>bolded</b> word, and <b>another</b> one too!"
Is it possible to use Python's string.replace or re.sub method to do this intelligently?
Capture the characters before the | inside [] into one group and the part after the | into another group. Then refer to the captured groups through backreferences in the replacement string to get the desired output.
Regex:
\[([^\[\]|]*)\|([^\[\]]*)\]
Replacement string:
<\1>\2</\1>
>>> import re
>>> s = "This sentence has a [b|bolded] word, and [b|another] one too!"
>>> m = re.sub(r'\[([^\[\]|]*)\|([^\[\]]*)\]', r'<\1>\2</\1>', s)
>>> m
'This sentence has a <b>bolded</b> word, and <b>another</b> one too!'
Try this expression: [[]b[|](\w+)[]]. A shorter equivalent is \[b\|(\w+)\]
The expression looks for text that starts with [b| and captures what lies between it and the closing ] using \w+, which means [a-zA-Z0-9_]. To allow a wider range of characters, you can use .*? instead of \w+, which gives \[b\|(.*?)\]
Sample Demo:
import re
p = re.compile(r'[[]b[|](\w+)[]]')
test_str = "This sentence has a [b|bolded] word, and [b|another] one too!"
# Python's re.sub refers to groups with \1 (not $1) in the replacement
subst = r"<bold>\1</bold>"
result = re.sub(p, subst, test_str)
print(result)
Output:
This sentence has a <bold>bolded</bold> word, and <bold>another</bold> one too!
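To illustrate the difference between \w+ and .*? mentioned in this answer, here is a small sketch with a shortcode that contains a space. The sample string and the names text, strict, and relaxed are made up for the illustration.

```python
import re

text = "A [b|bolded] word and [b|two words] here"

# \w+ only accepts letters, digits and underscore, so "two words" is skipped
strict = re.sub(r'\[b\|(\w+)\]', r'<b>\1</b>', text)

# .*? lazily accepts anything up to the nearest closing bracket
relaxed = re.sub(r'\[b\|(.*?)\]', r'<b>\1</b>', text)

print(strict)
print(relaxed)
```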
Just for reference, in case you don't want two problems:
Quick answer to your particular problem:
my_str = "This sentence has a [b|bolded] word, and [b|another] one too!"
print(my_str.replace("[b|", "<b>").replace("]", "</b>"))
# output:
# This sentence has a <b>bolded</b> word, and <b>another</b> one too!
This has the flaw that it will replace all ] to </b> regardless whether it is appropriate or not. So you might want to consider the following:
Generalize and wrap it in a function
def replace_stuff(s, char):
    begin = s.find("[{}|".format(char))
    while begin != -1:
        end = s.find("]", begin)
        s = s[:begin] + s[begin:end+1].replace("[{}|".format(char),
            "<{}>".format(char)).replace("]", "</{}>".format(char)) + s[end+1:]
        begin = s.find("[{}|".format(char))
    return s
For example
s = "Don't forget to [b|initialize] [code|void toUpper(char const *s)]."
print(replace_stuff(s, "code"))
# output:
# "Don't forget to [b|initialize] <code>void toUpper(char const *s)</code>."

Python regexp: my small program cannot distinguish letters from numbers

I am given the coordinates of a polygon and have to validate the input string that contains them.
Here is my code
import re
t = "(0,0),(0,2),(2,2),(2,0),(0,1)"
#tt = "(0,0),(0,2),(2,2),(2,0),(0,'a')"
p = r'((\([0-9]+.?[0-9]*(\s)*,(\s)*[0-9]+.?[0-9]*(\s)*\)(\s)*,?(\s)*)+)'
b = re.search(p, t)
if b:
    print("found")
else:
    print("not found")
In both cases (t and tt), the search succeeds. Why is that?
Just add anchors. Without them, the regex can match anywhere in your string:
p='^((\([0-9]+.?[0-9]*(\s)*,(\s)*[0-9]+.?[0-9]*(\s)*\)(\s)*,?(\s)*)+)$'
>>> import re
>>> t = "(0,0),(0,2),(2,2),(2,0),(0,1)"
>>> p='((\([0-9]+.?[0-9]*(\s)*,(\s)*[0-9]+.?[0-9]*(\s)*\)(\s)*,?(\s)*)+)'
>>> re.search(p, t)
<_sre.SRE_Match object at 0x01AE7E20>
>>> tt = "(0,0),(0,2),(2,2),(2,0),(0,'a')"
>>> re.search(p, tt)
<_sre.SRE_Match object at 0x01AE7E90>
>>> p='^((\([0-9]+.?[0-9]*(\s)*,(\s)*[0-9]+.?[0-9]*(\s)*\)(\s)*,?(\s)*)+)$'
>>> re.search(p, tt)
>>> #no matching!
The ^ matches the start of the string and the $ matches the end. With both in place, the match succeeds only if the string consists of number pairs from beginning to end.
That big subexpression is supposed to match an ordered pair (with an optional comma at the end), and I think it does. The + just means "one or more"; tt has four of them, and four is more than one, so the expression gets matched (with those four points as the match). If you want your pattern to match the whole string, then you need begin and end anchors in there, i.e. ^ and $.
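A short sketch of what the answers describe, using the question's pattern with only a raw-string prefix added. re.fullmatch is used here as a stand-in for wrapping the pattern in ^ and $; the names PAIRS and partial are hypothetical.

```python
import re

# The question's pattern, unchanged apart from the r'' prefix.
PAIRS = r'((\([0-9]+.?[0-9]*(\s)*,(\s)*[0-9]+.?[0-9]*(\s)*\)(\s)*,?(\s)*)+)'

t = "(0,0),(0,2),(2,2),(2,0),(0,1)"
tt = "(0,0),(0,2),(2,2),(2,0),(0,'a')"

# Unanchored: search() is satisfied by the four valid pairs at the front of tt.
partial = re.search(PAIRS, tt)
print(partial.group(0))

# Anchored: the whole string has to consist of valid pairs.
print(re.fullmatch(PAIRS, t) is not None)
print(re.fullmatch(PAIRS, tt) is not None)
```

Printing the matched text makes it visible that the unanchored search stops just before the invalid pair rather than rejecting the string.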