Regex not matching repeated letters

myString = "THIS THING CAN KISS MY BUTT. HERE ARE MORE SSS";
myNewString = reReplace(myString, "[^0-9|^S{2}]", "|", "All");
myNewString is "|||S||||||||||||SS||||||||||||||||||||||||SSS"
What I want is "||||||||||||||||SS|||||||||||||||||||||||||||" which is what I thought ^S{2} would do (exclude exactly 2 S). Why is it matching any S? Can someone tell me how to fix it? TIA.
actual goal
I'm trying to validate a list of values. Acceptable values are 6-digit numbers, or 5-digit numbers preceded by SS, so 123456,SS12345 is a valid list.
What I'm trying to do is turn everything that isn't SS or a number into a new delimiter, because I have no control over the input.
For example, 123456 AND SS12345 should be changed to 123456|||||SS12345. After changing the | delimiter to a comma, the result is 123456,SS12345. If a user enters 123456 PLUS SS12345, it ends up as 123456||||S|SS12345, which becomes 123456,S,SS12345. That is invalid and the user gets an error, but it would be valid if the single S had been replaced as well.

The [^0-9|^S{2}] actually means:
[^ # any character except
0-9 # 0 to 9
| # a vertical bar
^ # a caret
S # an S <-----
{ # an open brace
2 # a 2, and
} # a close brace
]
Therefore it is not matching any S.
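To see the breakdown above in action outside ColdFusion, here is a small Python sketch. It assumes Python's re module reads this particular character class the same way (which I believe it does); the helper name replace_with_bars is made up for the illustration.

```python
import re

# Inside [...] the characters |, ^ (when not first), {, 2 and } are literals,
# so this class simply means "anything except a digit, |, ^, S, { or }".
pattern = re.compile(r"[^0-9|^S{2}]")

def replace_with_bars(text):
    """Replace every character matched by the class with a vertical bar."""
    return pattern.sub("|", text)

print(replace_with_bars("123456 PLUS SS12345"))  # 123456||||S|SS12345
print(replace_with_bars("a^b{2}c"))              # |^|{2}|
```

The single S in PLUS survives because S is just one of the excluded characters; the {2} never acts as a quantifier.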
Since ColdFusion doesn't support lookbehind or a callback in the replacement, I don't think this can be solved simply with just REReplace.
I don't know CF but I'll try something like:
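resultString = ""
sCount = 0
# Append a sentinel so a trailing run of S is flushed by the else branch.
for character in myString + "$":
    if character == 'S':
        sCount += 1
    else:
        # Keep a run of exactly two S; turn any other run length into bars.
        if sCount == 2:
            resultString += "SS"
        else:
            resultString += "|" * sCount
        sCount = 0
        if character.isdigit():
            resultString += character
        else:
            resultString += "|"
# Drop the bar produced by the sentinel.
resultString = resultString[:-1]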

Am I reading this correctly in that you want to replace everything except exactly two consecutive S characters?
Is this restricted to a single replace call or can you run it through multiple regex operations? If multiple operations are allowed, it might be easier to run the string through one regex that matches against S{3,} (to pick up instances of three or more S characters) and then through a second one that uses ([^S])S([^S]) (to pick up single S characters). A third run could match against the rest of your rule ([^0-9]).
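A rough Python sketch of the multi-pass idea, in case it helps to see it run. The patterns are my own adaptation rather than the exact ones quoted above: the single-S pass uses a captured leading character plus a lookahead (no lookbehind), and the function names are hypothetical.

```python
import re

def to_delimited(text):
    """Three-pass replacement that avoids lookbehind (lookahead only)."""
    # Pass 1: a run of three or more S becomes a single delimiter.
    text = re.sub(r"S{3,}", "|", text)
    # Pass 2: a lone S (not preceded and not followed by another S).
    # The leading context is captured and put back; the trailing context
    # is only peeked at, so two lone S close together are both caught.
    text = re.sub(r"(^|[^S])S(?=[^S]|$)", r"\1|", text)
    # Pass 3: everything that is neither a digit nor an S.
    return re.sub(r"[^0-9S]", "|", text)

def to_list(text):
    """Split on the delimiter and drop the empty pieces."""
    return [piece for piece in to_delimited(text).split("|") if piece]

print(to_delimited("123456 PLUS SS12345"))  # 123456||||||SS12345
print(to_list("123456 PLUS SS12345"))       # ['123456', 'SS12345']
```

Only the SS pairs are left standing after pass 2, which is why pass 3 can safely keep every remaining S.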

You're using a negated character class with [^...]: any character NOT in the set 0-9|^S{2} is replaced, so the digits and the literal characters |, ^, S, { and } all survive. Negative matching of actual strings instead of characters is quite hard. Matching only an exact 'S{2}' would be (?<!S)SS(?!S); matching anything BUT 'SS' is hardly doable. My best effort would be (?<=SS)S|S(?=SS)|(?<=S)S(?=S)|(?<!S)S(?!S)|[^S0-9], but I cannot guarantee it.
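For anyone who wants to try the long lookaround pattern, here is a minimal Python check. It assumes an engine with fixed-width lookbehind (Python's re has it; as noted earlier in the thread, ColdFusion's REReplace may not), and the names NOT_DOUBLE_S and mask are invented for the example.

```python
import re

# Alternatives, left to right: an S preceded by two S, an S followed by
# two S, an S between two S, a lone S, and anything that is not S or a digit.
NOT_DOUBLE_S = re.compile(
    r"(?<=SS)S|S(?=SS)|(?<=S)S(?=S)|(?<!S)S(?!S)|[^S0-9]"
)

def mask(text):
    """Replace everything except digits and runs of exactly two S."""
    return NOT_DOUBLE_S.sub("|", text)

print(mask("123456 PLUS SS12345"))  # 123456||||||SS12345
print(mask("KISS SSS"))             # ||SS||||
```

On these inputs the pattern behaves as intended: runs of one, three, or more S are replaced and only the exact pair is kept.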

Related

Regex: not all BLANKS but allow certain characters, with limit

Trying to come up with a regex, or combination of regexes, that returns false if a) the user has entered only BLANK(s), or b) the user has entered "non-legal" characters. Lastly, the number of characters has a set limit.
The closest I have thus far is below. Where it fails is that it does not count any leading spaces; only the non-BLANKs are counted, and so it fails. Using js.
const reg = /^([ ]*[!-~\u2018-\u201d\u2013\u2014]){1,10}$/;
EDIT: I think the above is incorrect, and I meant to post this:
const re4 = /^(?!\s*$)[!-~\u2018-\u201d\u2013\u2014]{1,10}$/;
EDIT 2: this has less clutter; allow space and all other 'standard' keyboard chars:
const re5 = /^(?!\s*$)[!-~]{1,10}$/;
So, this says you can enter a bunch of spaces, and must include at least 1 other character from the list following; but the {1,10} only counts the non-spaces and so I can end up with too many in total.
EDIT:
So, using re5 above --
s = ' '; // should fail
s = ' blah blah'; // should pass
s = '  blah blah'; // should fail, as there are 11 characters
Try ^(?:\s*\S){1,10}\s*$
This allows 1 to 10 non-whitespace characters; change \S to a class of the characters you want to allow.
Update 2: After learning that you cannot invert the match result in code, here's one last suggestion using negative lookahead (like you already tried yourself).
This regex matches only strings of 1-10 non-banned characters that are not all whitespace:
const re4 = /^(?!\s+$)[^\!-\~\u2018-\u201d\u2013\u2014]{1,10}$/
Update 1: Use this regex to match all-whitespace string OR strings longer than 10 chars OR strings containing bad characters:
const re4 = /(^\s+$|^.{11,}$|[\!-\~\u2018-\u201d\u2013\u2014])/
I understand that you want to impose a length restriction via regex. I would suggest against that and recommend using str.length instead.
This regex will match whitespace-only strings and strings containing one or more bad characters:
const re4 = /(^\s+$|[\!-\~\u2018-\u201d\u2013\u2014])/;
Regarding prohibition of all-whitespace strings: instead of packing it into a regex, you might consider using something more explicit like if (s.trim().length == 0). IMO this makes your intention clearer and your code probably more readable, leaving you with this easy-to-read regex:
// matches any string containing a *bad* character
const re4 = /[\!-\~\u2018-\u201d\u2013\u2014]/;
If you use trim for the all-whitespace check, you might convert your regex into a positive assertion, even with length restriction:
// matches any string consisting of 1-10 characters not considered *bad*
const re4 = /^[^\!-\~\u2018-\u201d\u2013\u2014]{1,10}$/;
To match the input when it’s from 1 to 10 chars long and can't be all blanks, use a negative look ahead to assert not all blanks:
^(?! *$).{1,10}$
If you want to restrict allowable chars, change the dot to a suitable character class of allowable chars.
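A quick way to sanity-check the lookahead idea is a small Python sketch like the one below. The allowed class [ -~] (printable ASCII from space to tilde) is my assumption about what "allowable" means here, and is_valid is a made-up helper name.

```python
import re

# Assumption: "allowed" means printable ASCII, space through tilde.
# The lookahead rejects input that is nothing but spaces; the class plus
# {1,10} then limits the total length, spaces included.
VALID = re.compile(r"(?! *$)[ -~]{1,10}")

def is_valid(text):
    # fullmatch anchors both ends, so no explicit ^ or $ is needed here.
    return VALID.fullmatch(text) is not None

print(is_valid("   "))          # False: blanks only
print(is_valid(" blah blah"))   # True: 10 characters in total
print(is_valid("  blah blah"))  # False: 11 characters in total
```

Because the spaces are part of the counted class, leading blanks use up the same length budget as any other character.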

Regular expression for checking every substring

I had this question in my end semester exam, unfortunately I couldn't solve it and tried it for several days and no luck.
Condition- For every substring of length 4 in a string of length n, write a RE to force the rule- there must be exactly three 1's.
My solution kind of looked like
(1+11+111+€)(0111)*(0+€).
But this is obviously wrong, string 11011 is a valid solution too.
Update- my new solution is (1+11+111+€)(0111)*(0+01+011+€).
Update- the plus operator is actually 'OR'
Update- € is empty string
Update - the string length has no requirements. A string of length 5 will have 2 substrings of length 4, the first 4 chars and the last 4 chars
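One way to test a candidate expression against the rule is brute force. The sketch below is only an illustration in Python: it writes the updated solution with | for + and an optional group for the empty string, and compares it with a direct check of every length-4 window. The names windows_ok, CANDIDATE and mismatches are invented, and it only looks at lengths 4 to 10, since shorter strings have no length-4 substring at all.

```python
import re
from itertools import product

def windows_ok(s):
    """True if every length-4 substring of s contains exactly three 1's."""
    return all(s[i:i + 4].count("1") == 3 for i in range(len(s) - 3))

# The updated solution, with + written as | and the empty string written
# as an optional group.
CANDIDATE = re.compile(r"(?:1|11|111)?(?:0111)*(?:0|01|011)?")

def mismatches(max_len=10):
    """Strings of length 4..max_len where the regex and the rule disagree."""
    bad = []
    for n in range(4, max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if bool(CANDIDATE.fullmatch(s)) != windows_ok(s):
                bad.append(s)
    return bad

print(windows_ok("11011"))  # True
print(mismatches())         # []
```

An empty list means the candidate agrees with the rule on every binary string of those lengths; it says nothing about strings shorter than 4.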
I think the professor is looking for the realization that in order for the condition to hold, a 1 may never be bracketed by zeros on both sides.
(0?11)*
To complete the picture, we also need to include the case for n=1 where either a 0 or a 1 is allowed (I presume).
^[01]$|(0?11)*
I'm using traditional regex here, where [...] denotes a character class, | is "or", parentheses do grouping, and * specifies zero or more repetitions (Kleene star).
Edit: Consider the following regular expression where € refers to the empty string and assuming the alphabet consists of only {0,1,€}:
€ + (0111)* + (1110)* + (1101)* + (1011)*
Python 2
import re

# Regular expression:
# '^' at the start anchors the match at the very start of the string
# "[^1.]*" matches any run of characters other than '1' and '.'
# '$' at the end anchors the match at the very end of the string
# Summary:
# from the start of the string to the end, check that '1' occurs
# exactly 3 times; before, between, or after them there may be any
# characters other than '1' (and '.')
expression = '^[^1.]*1[^1.]*1[^1.]*1[^1.]*$'
# Test strings
tests = ["01110", "11100", "00111", "010101", "00100100111", "00100"]
prog = re.compile(expression)
for test in tests:
    # print(...) with a single argument behaves the same in Python 2 and 3
    print('matched' if prog.match(test) else 'not matched')

Excel VBA Regex Check For Repeated Strings

I have some user input that I want to validate for correctness. The user should input 1 or more sets of characters, separated by commas.
So these are valid input
COM1
COM1,COM2,1234
these are invalid
COM -- only 3 characters
COM1,123 -- one set is only 3 characters
COM1.1234,abcd -- a dot separator not comma
I googled for a regex pattern to this and found a possible pattern that tested for a recurring instance of any 3 characters, and I modified like so
/^(.{4,}).*\1$/
but this is not finding matches.
I can manage the last comma that may or may not be there before passing to the test so that it is always there.
Preferably, I would like to test for letters (any case) and numbers only, but I can live with any characters.
I know I could easily do this in straight VBA, splitting the input on a comma delimiter and looping through each character of each array element, but regex seems more efficient, and I will have more cases that have slightly different patterns, so parameterising the regex for that would be a better design.
TIA
I believe this does what you want:
^([A-Za-z0-9]{4},)*[A-Za-z0-9]{4}$
It's a line beginning, followed by zero or more groups of four letters or numbers each ending with a comma, followed by one group of four letters or numbers, followed by an end-of-line.
You can play around with it here: https://regex101.com/r/Hdv65h/1
The regular expression
"^[\w]{4}(,[\w]{4})*$"
should work.
You can try this to see whether it works for all your cases using the following function. Assuming your test strings are in cells A1 thru A5 on the spreadsheet:
Sub findPattern()
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim regEx As New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "^[\w]{4}(,[\w]{4})*$"
    Dim i As Integer
    Dim val As String
    Dim mat As MatchCollection
    For i = 1 To 5
        val = Trim(Cells(i, 1).Value)
        Set mat = regEx.Execute(val)
        If mat.Count = 0 Then
            MsgBox ("No match found for " & val)
        Else
            MsgBox ("Match found for " & val)
        End If
    Next
End Sub
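Since the question mentions parameterising the pattern for other cases, here is a hedged sketch of that idea in Python rather than VBA. The function make_validator and its parameters are hypothetical; the character set defaults to letters and digits only, which is stricter than \w because \w also accepts the underscore.

```python
import re

def make_validator(group_len=4, chars="A-Za-z0-9"):
    """Build a checker for comma-separated groups of exactly group_len chars."""
    group = "[%s]{%d}" % (chars, group_len)
    pattern = re.compile("%s(?:,%s)*" % (group, group))
    # fullmatch requires the whole input to fit the pattern.
    return lambda text: pattern.fullmatch(text) is not None

is_valid = make_validator()

for sample in ["COM1", "COM1,COM2,1234", "COM", "COM1,123", "COM1.1234,abcd"]:
    print(sample, is_valid(sample))
```

The first two samples are accepted and the last three rejected, matching the valid and invalid inputs listed in the question.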

Regex to grab formulas

I am trying to parse a file that contains parameter attributes. The attributes are setup like this:
w=(nf*40e-9)*ng
but also like this:
par_nf=(1) * (ng)
The issue is, all of these parameter definitions are on a single line in the source file, and they are separated by spaces. So you might have a situation like this:
pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0
The current algorithm just splits the line on spaces and then for each token, the name is extracted from the LHS of the = and the value from the RHS. My thought is if I can create a Regex match based on spaces within parameter declarations, I can then remove just those spaces before feeding the line to the splitter/parser. I am having a tough time coming up with the appropriate Regex, however. Is it possible to create a regex that matches only spaces within parameter declarations, but ignores the spaces between parameter declarations?
Try this RegEx:
(?<=^|\s) # Start of each formula (start of line OR [space])
(?:.*?) # Attribute Name
= # =
(?: # Formula
(?!\s\w+=) # DO NOT Match [space] Word Characters = (Attr. Name)
[^=] # Any Character except =
)* # Formula Characters repeated any number of times
When checking formula characters, it uses a negative lookahead to check for a Space, followed by Word Characters (Attribute Name) and an =. If this is found, it will stop the match. The fact that the negative lookahead checks for a space means that it will stop without a trailing space at the end of the formula.
Live Demo on Regex101
Thanks to @Andy for the tip:
In this case I'll probably just match on the parameter name and equals, but replace the preceding whitespace with some other "parse-able" character to split on, like so:
(\s*)\w+[a-zA-Z_]=
Now my first capturing group can be used to insert something like a colon, semicolon, or line-break.
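The same idea (treat whitespace as a separator only when a parameter name and = follow it) can be sketched in Python with a lookahead split. This is an illustration, not the code used in the thread; parse_params is a made-up name, and it assumes a formula never contains whitespace directly followed by word characters and an equals sign.

```python
import re

line = "pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0"

def parse_params(text):
    """Split only on whitespace that is directly followed by name=."""
    params = {}
    for token in re.split(r"\s+(?=\w+=)", text.strip()):
        name, _, value = token.partition("=")
        # Spaces left inside a value belong to the formula; drop them.
        params[name] = value.replace(" ", "")
    return params

for name, value in parse_params(line).items():
    print(name, "=", value)
```

With the sample line, par_nf comes out as (1)*(ng) while the other parameters are split apart as before.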
You need to add Perl tag. :-( Maybe this will help:
I ended up using this in C#. The idea was to break it into name value pairs, using a negative lookahead specified as the key to stop a match and start a new one. If this helps
var data = @"pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0";
var pattern = @"
(?<Key>[a-zA-Z_\s\d]+) # Key is any alpha, digit and _
= # = is a hard anchor
(?<Value>[.*+\-\\\/()\w\s]+) # Value is any combinations of text with space(s)
(\s|$) # Soft anchor of either a \s or EOB
((?!\s[a-zA-Z_\d\s]+\=)|$) # Negative lookahead to stop matching if a space then key then equal found or EOB
";
Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
    .OfType<Match>()
    .Select(mt => new
    {
        LHS = mt.Groups["Key"].Value,
        RHS = mt.Groups["Value"].Value
    });

Regex to create url friendly string

I want to create a url friendly string (one that will only contain letters, numbers and hyphens) from a user input to :
remove all characters which are not a-z, 0-9, space or hyphens
replace all spaces with hyphens
replace multiple hyphens with a single hyphen
Expected outputs :
my project -> my-project
test project -> test-project
this is # long str!ng with spaces and symbo!s -> this-is-long-strng-with-spaces-and-symbos
Currently i'm doing this in 3 steps :
$identifier = preg_replace('/[^a-zA-Z0-9\-\s]+/','',strtolower($project_name)); // remove all characters which are not a-z, 0-9, space or hyphens
$identifier = preg_replace('/(\s)+/','-',strtolower($identifier)); // replace all spaces with hyphens
$identifier = preg_replace('/(\-)+/','-',strtolower($identifier)); // replace all hyphens with single hyphen
Is there a way to do this with one single regex ?
Yeah, @Jerry is correct in saying that you can't do this in one replacement as you are trying to replace a particular string with two different items (a space or dash, depending on context). I think Jerry's answer is the best way to go about this, but something else you can do is use preg_replace_callback. This allows you to evaluate an expression and act on it according to what the match was.
$string = 'my project
test project
this is # long str!ng with spaces and symbo!s';
$string = preg_replace_callback('/([^A-Z0-9]+|\s+|-+)/i', function($m){$a = '';if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';}return $a;}, $string);
print $string;
Here is what this means:
/([^A-Z0-9]+|\s+|-+)/i This looks for any one of your three alternatives (anything that is not a number or letter, one or more whitespace characters, one or more hyphens) and if it matches any of them, it passes it along to the function for evaluation.
function($m){ ... } This is the function that will evaluate the matches. $m will hold the matches that it found.
$a = ''; Set a default of an empty string for the replacement.
if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';} If our match (the value stored in $m[1]) contains whitespace or a hyphen, then set $a to a dash instead of an empty string.
return $a; Since this is a function, we will return the value and that value will be plopped into the string wherever it found a match.
Here is a working demo
I don't think there's one way of doing that, but you could reduce the number of replaces and in an extreme case, use a one liner like that:
$text=preg_replace("/[\s-]+/",'-',preg_replace("/[^a-zA-Z0-9\s-]+/",'',$text));
It first replaces everything that is not alphanumeric, a space, or a dash with nothing, then replaces each run of spaces and dashes with a single dash.
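For comparison, the same two-step approach reads like this in Python. It is only a sketch of the technique: slugify is a hypothetical name, and it adds the lowercasing that the three-step version in the question does with strtolower.

```python
import re

def slugify(text):
    """Lowercase, drop disallowed characters, collapse spaces and hyphens."""
    # Step 1: remove everything that is not a-z, 0-9, whitespace or a hyphen.
    text = re.sub(r"[^a-z0-9\s-]+", "", text.lower())
    # Step 2: turn each run of whitespace and/or hyphens into one hyphen.
    return re.sub(r"[\s-]+", "-", text)

print(slugify("my project"))
# my-project
print(slugify("this is # long str!ng with spaces and symbo!s"))
# this-is-long-strng-with-spaces-and-symbos
```

Leading or trailing spaces in the input would still turn into a hyphen at the ends; stripping the input first would avoid that if it matters.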
Since you want to replace each thing with something different, you will have to do this in multiple iterations.
Sorry D: