Regular Expressions for string Contains pattern - regex

I am trying to write a regex for strings that follow the pattern below:
The string may contain alphanumeric characters.
The string may also contain the following special characters: space ( ) - : /
If the string contains anything apart from these, my regex should return false.
I have tried the regex [0-9a-zA-Z\s():-]+ but it is not working: match returns true even if the string contains characters like , or ;.
I am able to achieve this by blacklisting, but I want to specify which characters are allowed and return false if anything other than those is found.
Could someone who is good at writing regular expressions help me out?
Thanks

Make sure to use start/end anchors to avoid matching unwanted input data:
^[0-9a-zA-Z \/():-]+$
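A minimal Python sketch of why the anchors matter. The function name is_allowed and the sample strings are made up for illustration; the two patterns are the ones from the question and the answer above.

```python
import re

# Pattern from the answer: anchored at both ends, so the whole string
# must consist of allowed characters.
ANCHORED = re.compile(r"^[0-9a-zA-Z /():-]+$")
# Pattern from the question: no anchors, so one allowed substring
# anywhere in the input is enough for a match.
UNANCHORED = re.compile(r"[0-9a-zA-Z\s():-]+")


def is_allowed(text: str) -> bool:
    """Return True only if every character of text is in the allowed set."""
    return ANCHORED.search(text) is not None


print(is_allowed("Room 12 (A-1): 3/4"))        # True
print(is_allowed("a,b;c"))                     # False
print(UNANCHORED.search("a,b;c") is not None)  # True: it matches just "a"
```

Without the anchors the engine is free to match only the leading "a" and ignore the rest, which is why the original pattern appeared to accept , and ;.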


How to apply correct regex?

I have a special task which requires lots of regex and javascript parsing.
My head is almost exploding, so maybe I'm tired and have missed some small thing. I'm not a newbie to regex, so perhaps someone can point me in the right direction and show me where I made a mistake.
So I have this regex code:
((?<=\ffmpg=).+(?=////u0026cs=nt))
to get the value of the substring between two strings. The first string is:
ffmpg= The match should start after this string and end just before the second string, //u0026cs=nt, begins.
The problem is that it only works as long as the HTML page contains just one parameter with that name; the source HTML contains tens of ffmpg parameters, each followed by the same end string cs=nt.
I cannot make the regex count characters either, because the number of characters is different every time you visit the HTML page, sometimes +3, sometimes +10. So the only way is to get this string from the start of param1 to the end of param2.
This is the string I need to get: 1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012
This is the source html example:
\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\
I have pasted the same block three times for this purpose, because the real HTML source is very big and I doubt I can upload it here.
Thanks for your help.
In your question, you use (?<=\ffmpg=), where \f matches a form feed character, which is not present in the example data. If you meant to use \\f, it matches a literal \f, which is also not present in the example data.
You could get the match using a capturing group instead of lookarounds, as lookbehinds are not supported by all browsers.
If you just want to get a single match, you can omit the /g global flag.
If you use .+ you will match too much, as .+ first matches until the end of the string and then backtracks only as far as the last occurrence of \\u0026cs=nt.
What you could do instead is be specific about what you allow to match, which for the current string is a character class with the following characters: [AC0-9%]+
You could broaden the character class with a range to match the characters A-Z instead of just A and C, for example, and add more characters or ranges as required.
ffmpg=([AC0-9%]+)\\\\u0026cs=nt
Regex demo
For example
const regex = /ffmpg=([AC0-9%]+)\\\\u0026cs=nt/;
const str = `\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\`;
console.log(str.match(regex)[1]);
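The difference between the greedy .+ and the specific character class can also be sketched in Python. The sample string below is a shortened, made-up stand-in with the same shape as the source HTML (single backslashes, two repetitions), not the real page.

```python
import re

# Shortened, hypothetical sample: two repetitions of the same parameters.
sample = (
    r"\u0026pen=abc\u0026ffmpg=17%2C23%2C42\u0026cs=nt\u0026token=x"
    r"\u0026pen=abc\u0026ffmpg=17%2C23%2C42\u0026cs=nt\u0026token=x"
)

# Greedy: .+ runs to the end and backtracks to the LAST \u0026cs=nt.
greedy = re.search(r"ffmpg=(.+)\\u0026cs=nt", sample)
# Specific: the class cannot cross the backslash, so it stops at the
# first \u0026cs=nt after ffmpg=.
specific = re.search(r"ffmpg=([AC0-9%]+)\\u0026cs=nt", sample)

print(specific.group(1))     # 17%2C23%2C42
print(len(greedy.group(1)))  # much longer: it spans both repetitions
```

Because the character class excludes the backslash, no lookarounds or lazy quantifiers are needed to stop at the right place.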
Try this:
(?<=ffmpg=)([A-F0-9%]+)
Explanation
Since your string consists only of URL-encoded characters, you can use the [A-F0-9%]+ character class to capture it. It stops before the next parameter starts, because there is a backslash there.
See online demo here.

Interesting easy looking Regex

I am re-phrasing my question to clear up any confusion!
I want to match if a string contains certain letters. For this I use the character class:
[ACD]
and it works perfectly!
But I want to match only if the string contains those letter(s) 2 or more times, either the same letter repeated or 2 different letters from the set.
For example:
[AKL] should match:
ABCVL
AAGHF
KKUI
AKL
But the above should not match the following:
ABCD
KHID
LOVE
because a letter from the set is there, but only once!
That's why I was trying to use:
[ACD]{2,}
But it's not working; it's probably not the right regex. Can a regex guru help me solve this puzzle?
Thanks
PS: I will use it in MySQL. A different approach is also welcome, but I would like to use a regex for a smarter and shorter query!
To ensure that a string contains at least two occurrences of letters from a set (let's say A, K, L as in your example), you can write something like this:
[AKL].*[AKL]
Since the MySQL regex engine is a DFA, there is no need to use a negated character class like [^AKL] in place of the dot to avoid backtracking, or a lazy quantifier that is not supported at all.
example:
SELECT 'KKUI' REGEXP '[AKL].*[AKL]';
will return 1
You can follow this link that speaks on the particular subject of the LIKE and the REGEXP features in MySQL.
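The same pattern can be checked against the examples from the question outside MySQL. This is only a sketch in Python, whose regex engine differs from MySQL's; the helper name has_two_from_set is made up.

```python
import re

# At least two letters from the set, with anything in between.
AT_LEAST_TWO = re.compile(r"[AKL].*[AKL]")


def has_two_from_set(text: str) -> bool:
    return AT_LEAST_TWO.search(text) is not None


should_match = ["ABCVL", "AAGHF", "KKUI", "AKL"]
should_not_match = ["ABCD", "KHID", "LOVE"]

for word in should_match + should_not_match:
    print(word, has_two_from_set(word))
```

The first four words print True and the last three print False, which agrees with the lists in the question.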
If I understood you correctly, this is quite simple:
[A-Z].*?[A-Z]
This looks for something in your set, [A-Z], and then lazily matches characters until it (potentially) comes across the set, [A-Z], again.
As @Enigmadan pointed out, a lazy match is not necessary here: [A-Z].*[A-Z]
The expression you are using searches for two or more consecutive characters from the set ACDFGHIJKMNOPQRSTUVWXZ.
However, your regex excludes Y (UVWXZ]), so Z cannot be found because it is not adjacent to another character from your set; the same principle applies to A, since B is also excluded ([ACD). For example, Z and A would match in a string like ZABCDEFGHIJKLMNOPQRSTUVWXYZA.
If those letters were not excluded on purpose, it is probably better to use a range like [A-Z].
If you want 2 or more matches of [AKL], you can use just [AKL] and check that the number of matches is >= 2.
I am not good at SQL regex, but maybe something like this?
check (dbo.RegexMatch( ['ABCVL'], '[AKL]' ) >= 2)
To put it in plain English: use [AKL] as your regex, and check that the number of matches in the string is at least 2. Here's how I would do it in Java:
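import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns true if the string contains at least two characters from the set.
private boolean search2orMore(String string) {
    Matcher matcher = Pattern.compile("[ACD]").matcher(string);
    int counter = 0;
    // Each successful find() is one more occurrence of a character in the set.
    while (matcher.find()) {
        counter++;
    }
    return counter >= 2;
}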
You can't use [ACD]{2,} because it requires two or more consecutive characters from the set, so it fails when the matching characters are separated by other characters.
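A small Python sketch contrasting the {2,} quantifier with the counting approach, using the [AKL] set from the question. The helper count_from_set is a made-up name for illustration.

```python
import re

# {2,} only matches a run of ADJACENT characters from the set.
CONSECUTIVE = re.compile(r"[AKL]{2,}")


def count_from_set(text: str, letters: str = "AKL") -> int:
    """Count every character of text that belongs to the set."""
    return len(re.findall(f"[{re.escape(letters)}]", text))


print(CONSECUTIVE.search("AAGHF") is not None)  # True: "AA" is adjacent
print(CONSECUTIVE.search("ABCVL") is not None)  # False: A and L are apart
print(count_from_set("ABCVL") >= 2)             # True: counting finds both
```

Counting treats adjacent and separated occurrences the same way, which is the behaviour the question asks for.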
your question is not very clear, but here is my trial pattern
\b(\S*[AKL]\S*[AKL]\S*)\b
Demo
Pretty sure this should work in any case:
(?<l>[^AKL\n]*[AKL]+[^AKL\n]*[AKL]+[^AKL\n]*)[\n\r]
Replacing AKL with the letters you need can easily be done dynamically; tell me if you need that.
Is this what you are looking for?
".*(.*[AKL].*){2,}.*" (without quotes)
It matches if there are at least two occurrences of your characters surrounded by anything.
It is a .NET regex, but it should be the same for anything else.
Edit
Overall, MySQL regular expression support is pretty weak.
If you only need to match your capture group a minimum of two times, then you can simply use:
select * from ... where ... regexp('([ACD].*){2,}') #could be `2,` or just `2`
If you need to match your capture group more than two times, then just change the number:
select * from ... where ... regexp('([ACD].*){3}')
#This number should match the number of matches you need
If you needed a minimum of 7 matches and you were using your previous capture group [ACDF-KM-XZ]
e.g.
select * from ... where ... regexp('([ACDF-KM-XZ].*){7,}')
Response before edit:
Your regex is trying to find at least two characters from the set [ACDFGHIJKMNOPQRSTUVWXZ].
([ACDFGHIJKMNOPQRSTUVWXZ]){2,}
The reason A and Z are not being matched in your example string (ABCDEFGHIJKLMNOPQRSTUVWXYZ) is because you are looking for two or more characters that are together that match your set. A is a single character followed by a character that does not match your set. Thus, A is not matched.
Similarly, Z is a single character preceded by a character that does not match your set. Thus, Z is not matched.
The following characters in ABCDEFGHIJKLMNOPQRSTUVWXYZ do not match your set: B, E, L and Y.
If you were to do a global search in the string, only the following runs would be matched: CD, FGHIJK and MNOPQRSTUVWX.

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every pair of adjacent characters in your regex. Granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long, but you could of course build the string dynamically.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
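Building the pattern dynamically could look roughly like the following Python sketch. The function name whitespace_tolerant and the sample text are assumptions for illustration; the search term is treated as literal text, not as a regex.

```python
import re


def whitespace_tolerant(term: str) -> re.Pattern:
    """Join the non-whitespace characters of term with optional whitespace."""
    chars = [re.escape(ch) for ch in term if not ch.isspace()]
    return re.compile(r"\s*".join(chars))


text = "my c ats are here"
match = whitespace_tolerant("cats").search(text)

print(match.span())         # (3, 8)
print(repr(match.group()))  # 'c ats'
```

Since the search runs on the original string, the span already includes the embedded whitespace, so it can be used directly for highlighting.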
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
The approach of stripping whitespace can also be automated
(the following example solution is in Python, although it can obviously be ported to any language):
you can strip the whitespace beforehand AND save the positions of the non-whitespace characters, so that you can use them later to find the boundary positions of the matched string in the original string, like the following:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # upper \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    if match.start() == match.end():
        return ''  # empty match: nothing to map back
    # match.start() and match.end() are indices of start and end
    # of the found string in the spaceless string
    # (as we have searched in it).
    start = char_positions[match.start()]  # in the original string
    # match.end() is exclusive, so map the last matched character
    # and add 1; this keeps trailing whitespace out of the result
    # and avoids an IndexError when the match ends the string.
    end = char_positions[match.end() - 1] + 1  # in the original string
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases, as it strips whitespace from both the regex and the test string. The test is conducted as though neither has any whitespace.

Can I shorten this regular expression?

I have the need to check whether strings adhere to a particular ID format.
The format of the ID is as follows:
aBcDe-fghIj-KLmno-pQRsT-uVWxy
A sequence of five blocks of five letters upper case or lower case, separated by one dash.
I have the following regular expression that works:
string idFormat = "[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}";
Note that there is no trailing dash, but all of the blocks within the ID follow the same format. Therefore, I would like to be able to represent this sequence of four blocks with a trailing dash inside the regular expression and avoid the duplication.
I tried the following, but it doesn't work:
string idFormat = "[[a-zA-Z]{5}[-]{1}]{4}[a-zA-Z]{5}";
How do I shorten this regular expression and get rid of the duplicated parts?
What is the best way to ensure that each block does also not contain any numbers?
Edit:
Thanks for the replies, I now understand the grouping in regular expressions.
I'm running a few tests against the regular expression, the following are relevant:
Test 1: aBcDe-fghIj-KLmno-pQRsT-uVWxy
Test 2: abcde-fghij-klmno-pqrst-uvwxy
With the following regular expression, both tests pass:
^([a-zA-Z]{5}-){4}[a-zA-Z]{5}$
With the next regular expression, test 1 fails:
^([a-z]{5}-){4}[a-z]{5}$
Several answers have said that it is OK to omit the A-Z when using a-z, but in this case it doesn't seem to be working.
You can try:
([a-z]{5}-){4}[a-z]{5}
and make it case insensitive.
If you can set regex options to be case insensitive, you could replace all [a-zA-Z] with just plain [a-z]. Furthermore, [-]{1} can be written as -.
Your grouping should be done with (, ), not with [, ] (although you're correctly using the latter in specifying character sets).
Depending on context, you probably want to throw in ^...$ which matches start and end of string, respectively, to verify that the entire string is a match (i.e. that there are no extra characters).
In javascript, something like this:
/^([a-z]{5}-){4}[a-z]{5}$/i
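The failing test 1 from the question's edit can be reproduced in Python: [a-z] alone only covers upper case when the case-insensitive option is actually set. This is a sketch using Python's re.IGNORECASE flag; the question itself appears to use another regex API, so the exact option name there will differ.

```python
import re

ID_FORMAT = r"^([a-z]{5}-){4}[a-z]{5}$"

case_sensitive = re.compile(ID_FORMAT)
case_insensitive = re.compile(ID_FORMAT, re.IGNORECASE)

test_1 = "aBcDe-fghIj-KLmno-pQRsT-uVWxy"
test_2 = "abcde-fghij-klmno-pqrst-uvwxy"

print(case_sensitive.search(test_1) is not None)    # False
print(case_sensitive.search(test_2) is not None)    # True
print(case_insensitive.search(test_1) is not None)  # True
print(case_insensitive.search(test_2) is not None)  # True
```

Digits are rejected in both variants, because [a-z] never matches 0-9.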
This works for me, though you might want to check it:
[a-zA-Z]{5}(-[a-zA-Z]{5}){4}
(One group of five letters, followed by [dash+group of five letters] four times)
([a-zA-Z]{5}[-]{1}){4}[a-zA-Z]{5}
Try
string idFormat = "([a-zA-Z]{5}[-]{1}){4}[a-zA-Z]{5}";
I.e. you basically replace your brackets by parentheses. Brackets are not meant for grouping but for defining a class of accepted characters.
However, be aware that with shortened versions, you can use the expression for validating the string, but not for analyzing it. If you want to process the 5 groups of characters, you will want to put them in 5 groups:
string idFormat =
"([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})";
so you can address each group and process it.
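The difference between validating and analyzing can be seen in a short Python sketch: a repeated group only keeps its last repetition, while five explicit groups keep every block. Python's re module is used here as an assumption; other engines may expose repeated captures differently.

```python
import re

sample_id = "aBcDe-fghIj-KLmno-pQRsT-uVWxy"

# Shortened form: group 1 is overwritten on each of its four repetitions.
repeated = re.fullmatch(r"([a-zA-Z]{5}-){4}([a-zA-Z]{5})", sample_id)
print(repeated.groups())  # ('pQRsT-', 'uVWxy')

# Explicit form: one capturing group per block, joined by dashes.
explicit = re.fullmatch("-".join([r"([a-zA-Z]{5})"] * 5), sample_id)
print(explicit.groups())  # ('aBcDe', 'fghIj', 'KLmno', 'pQRsT', 'uVWxy')
```

Building the explicit pattern with a join keeps the source short while still producing five separate groups.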

How to match a string that does not end in a certain substring?

How can I write a regular expression that matches names that do not end with a certain string?
In my project, all classes whose names don't end with certain strings, such as "controller" and "map", should inherit from a base class. How can I do this using a regular expression?
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Do a search for all filenames matching this:
(?<!controller|map|anythingelse)$
(Remove the |anythingelse if no other keywords, or append other keywords similarly.)
If you can't use negative lookbehinds (the (?<!..) bit), do a search for filenames that do not match this:
(?:controller|map)$
And if that still doesn't work (might not in some IDEs), remove the ?: part and it probably will - that just makes it a non-capturing group, but the difference here is fairly insignificant.
If you're using something where the full string must match, then you can just prefix either of the above with ^.* to do that.
Update:
In response to this:
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Not quite sure what you're attempting with the public/class stuff there, so try this:
public.*class.*(?<!controller|map)$
The . is a regex char that means "anything except newline", and the * means zero or more times.
If this isn't what you're after, edit the question with more details.
Depending on your regex implementation, you might be able to use a lookbehind for this task. This would look like
(?<!SomeText)$
This matches any lines NOT having "SomeText" at their end. If you cannot use that, the expression
^(?!.*SomeText$).*$
matches any line (including an empty one) not ending with "SomeText" as well.
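Both forms can be tried in Python as a sketch. One assumption to note: Python's re module only accepts fixed-width lookbehinds, so an alternation of different lengths such as (?<!controller|map) is rejected there; stacking one lookbehind per suffix is a common workaround. The sample names are made up.

```python
import re

# Lookahead form: reject the line if it ends with one of the suffixes.
LOOKAHEAD = re.compile(r"^(?!.*(?:controller|map)$).*$")
# Lookbehind form, one fixed-width lookbehind per suffix.
LOOKBEHIND = re.compile(r"(?<!controller)(?<!map)$")

names = ["usercontroller", "sitemap", "userservice", "mapper"]

for name in names:
    print(name, bool(LOOKAHEAD.search(name)), bool(LOOKBEHIND.search(name)))
```

Both patterns accept userservice and mapper and reject usercontroller and sitemap; mapper passes because it contains map but does not end with it.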
You could write a regex that contains two groups: the first consists of one or more characters before controller or map, and the second contains controller or map and is optional. The first group must be lazy (.+?), otherwise the greedy .+ consumes the whole string and the optional second group never captures anything.
^(.+?)(controller|map)?$
With that you can match your string, and if the regex API you use has a group() method, then if group(2) is empty the string does not end with controller or map.
Check if the name does not match [a-zA-Z]*controller or [a-zA-Z]*map.
finally I did it in this way
public.*class.*[^(controller|map|spec)]$
it worked
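A caution on the pattern above, sketched in Python with made-up class declarations: inside square brackets the text (controller|map|spec) is not a group of words but a set of single characters, so [^...]$ only tests the last character of the line. It may appear to work on some inputs while rejecting unrelated names.

```python
import re

# Character class: rejects any line whose LAST CHARACTER is one of
# ( c o n t r l e | m a p s )
CHAR_CLASS = re.compile(r"public.*class.*[^(controller|map|spec)]$")
# Lookahead alternative that tests the whole suffix instead.
SUFFIX_CHECK = re.compile(r"^(?!.*(?:controller|map|spec)$)public.*class.*$")

lines = [
    "public class usercontroller",  # should be skipped
    "public class foo",             # should be found
    "public class bird",            # should be found
]

for line in lines:
    print(line, bool(CHAR_CLASS.search(line)), bool(SUFFIX_CHECK.search(line)))
```

The character class skips "public class foo" only because it ends in o, one of the listed characters, while the lookahead version finds it.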