Regex for comparing strings with spaces

I'm trying to check whether a string is present in a list of strings using regex.
I tried the following:
(?!MyDisk1$|MyDisk2$)
But this doesn't work for cases like:
(?!My disk1$|My Disk2$)
Can you suggest a better approach for such situations?
I get the list of strings from an SQL query, so I am not sure where the spaces are. The strings vary, e.g. My Disk1, MyDisk2, My_Disk3, ABCD123, XYZ_123, MNP 123, or any other string made of [a-zA-Z0-9_ ].

You can make the spaces optional using a zero-or-one quantifier (?):
(?!My ?disk1$|My ?Disk2$)
This assertion will reject substrings like MyDisk2 or My Disk2. Or to handle potentially many spaces, use a zero-or-more quantifier (*):
(?!My *disk1$|My *Disk2$)
Note that if you're running this in an engine which ignores whitespace in the pattern you may need to use a character class, like this:
(?!My[ ]*disk1$|My[ ]*Disk2$)
Or to handle spaces or underscores:
(?!My[ _]*disk1$|My[ _]*Disk2$)
Unfortunately if the spaces can be anywhere in the string, (but you still care about matching the other letters in order), you'd have to do something like this:
(?! *M *y *d *i *s *k *1$| *M *y *D *i *s *k *2$)
Or to handle spaces or underscores:
(?![ _]*M[ _]*y[ _]*d[ _]*i[ _]*s[ _]*k[ _]*1$|[ _]*M[ _]*y[ _]*D[ _]*i[ _]*s[ _]*k[ _]*2$)
But to be honest, at that point, you may be better off preprocessing your data before you try to use your regex with it.
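As a rough illustration of the preprocessing idea, here is a minimal Python sketch. It assumes that spaces and underscores carry no meaning and that case can be ignored (the original patterns are case-sensitive, so drop the lower() call if case matters). The names normalize, excluded and is_excluded are made up for this example, and the hard-coded list stands in for the rows returned by the SQL query.

```python
import re

def normalize(name: str) -> str:
    """Remove spaces and underscores, then lowercase (lowercasing is an assumption)."""
    return re.sub(r"[ _]+", "", name).lower()

# Hypothetical exclusion list; in practice this would come from the SQL query.
excluded = {normalize(n) for n in ["MyDisk1", "My Disk2", "My_Disk3"]}

def is_excluded(candidate: str) -> bool:
    """True if the candidate equals a listed name once separators are ignored."""
    return normalize(candidate) in excluded

print(is_excluded("My disk1"))    # True
print(is_excluded("M y_Disk 2"))  # True
print(is_excluded("ABCD123"))     # False
```

With the data normalized on both sides, a plain set lookup replaces the long lookahead pattern entirely.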

Use this regex; the i appended at the end makes it case-insensitive:
/my\s?disk[12]$/i
This matches both disk names, with or without a single whitespace character between "my" and "disk".

You can do this:
/(?[^\s_-]+(\s|_|-)?[^\s_-]*?$)/i
The ? quantifier means 0 or 1 of the preceding pattern.
/i makes the match case-insensitive. The separator can be a space, an underscore or a dash. I have replaced My and disk with a string of length 1 or more that does not contain a space, underscore or dash. Now it will match "Shikhar Subedi", "dprpradeep" or "MyDisk 54".
The + quantifier means 1 or more. ^ inside a character class means "not". * means 0 or more, so the string after the separator is optional.

Related

Regex: not all BLANKS but allow certain characters, with limit

I'm trying to come up with a regex, or a combination of regexes, that returns false if the user a) entered only blanks, or b) entered "non-legal" characters. In addition, the number of characters has a set limit.
The closest I have so far is below. It fails because it does not count leading spaces; only the non-blanks are counted. I'm using JavaScript.
const reg = /^([ ]*[!-~\u2018-\u201d\u2013\u2014]){1,10}$/;
EDIT: I think the above is incorrect, and I meant to post this:
const re4 = /^(?!\s*$)[!-~\u2018-\u201d\u2013\u2014]{1,10}$/;
EDIT 2: this has less clutter; allow space and all other 'standard' keyboard chars:
const re5 = /^(?!\s*$)[!-~]{1,10}$/;
So, this says you can enter a bunch of spaces, and must include at least 1 other character from the list following; but the {1,10} only counts the non-spaces and so I can end up with too many in total.
EDIT:
So, using re5 above --
s = ' '; // should fail
s = ' blah blah'; // should pass
s = '  blah blah'; // should fail, as there are 11 characters
Try ^(?:\s*\S){1,10}\s*$
This allows 1-10 non-whitespace characters; change \S to a character class to restrict the allowed characters.
Update 2: After learning that you cannot invert the match result in code, here's one last suggestion using negative lookahead (like you already tried yourself).
This regex matches only strings of 1-10 non-banned characters that are not all whitespace:
const re4 = /^(?!\s+$)[^\!-\~\u2018-\u201d\u2013\u2014]{1,10}$/
Update 1: Use this regex to match all-whitespace string OR strings longer than 10 chars OR strings containing bad characters:
const re4 = /(^\s+$|^.{11,}$|[\!-\~\u2018-\u201d\u2013\u2014])/
I understand that you want to impose a length restriction via regex. I would suggest against that and recommend using str.length instead.
This regex will match whitespace-only strings and strings containing one or more bad characters:
const re4 = /(^\s+$|[\!-\~\u2018-\u201d\u2013\u2014])/;
Regarding prohibition of all-whitespace strings: instead of packing it into a regex, you might consider using something more explicit like if (s.trim().length == 0). IMO this makes your intention clearer and your code probably more readable, leaving you with this easy-to-read regex:
// matches any string containing a *bad* character
const re4 = /[\!-\~\u2018-\u201d\u2013\u2014]/;
If you use trim for the all-whitespace check, you might convert your regex into a positive assertion, even with length restriction:
// matches any string consisting of 1-10 characters not considered *bad*
const re4 = /^[^\!-\~\u2018-\u201d\u2013\u2014]{1,10}$/;
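The "explicit checks instead of one big regex" idea can be sketched in Python as follows. This is only an illustration: it follows the question's reading that printable ASCII (space through ~) is the allowed set, and the names MAX_LEN, ALLOWED and is_valid are invented for the example.

```python
import re

MAX_LEN = 10                     # limit taken from the {1,10} quantifier in the question
ALLOWED = re.compile(r"[ -~]*")  # assumption: space and printable ASCII are allowed

def is_valid(s: str) -> bool:
    """Reject blank input, over-long input, and input with disallowed characters."""
    if not s.strip():      # empty or whitespace only
        return False
    if len(s) > MAX_LEN:   # length counts every character, spaces included
        return False
    return ALLOWED.fullmatch(s) is not None

print(is_valid(" "))            # False
print(is_valid(" blah blah"))   # True  (10 characters)
print(is_valid("  blah blah"))  # False (11 characters)
```

Keeping the three rules separate makes it easy to report which one failed, which a single combined pattern cannot do.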
To match the input when it’s from 1 to 10 chars long and can't be all blanks, use a negative look ahead to assert not all blanks:
^(?! *$).{1,10}$
If you want to restrict allowable chars, change the dot to a suitable character class of allowable chars.
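A quick way to try the lookahead approach is the Python sketch below. It assumes the allowed set is space plus printable ASCII ([ -~]) in place of the dot, and it anchors the end with $ so that the 10-character limit applies to the whole input; LIMITED is a name chosen for this example.

```python
import re

# (?! *$) rejects input made only of spaces; [ -~]{1,10} then counts every
# character, spaces included, against the limit.
LIMITED = re.compile(r"^(?! *$)[ -~]{1,10}$")

for s in [" ", " blah blah", "  blah blah"]:
    print(repr(s), bool(LIMITED.match(s)))
```

Because the space is part of the counted character class, leading spaces are no longer free, which is the behavior the question asks for.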

split text into words and exclude hyphens

I want to split a text into its individual words using regular expressions. The obvious solution would be to use the regex \\b, but unfortunately it also splits words on the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String[] b = s.split("\\b+");
for (int i = 0; i < b.length; i++) {
    System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but not a 100% fit. It gives me:
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
Also, not for you but for anybody who needs this for another purpose (e.g. inserting something), this is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
You can use this:
(?<!-)\\b(?!-)
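For comparison, here is a small Python sketch of the character-class approach ([^\w\-]+) applied to the sample sentence from the question. Note that, unlike splitting on \b, this drops punctuation such as ! and . instead of returning it as separate tokens; whether that is acceptable is an assumption about the use case.

```python
import re

text = ("This is my text! It uses some odd words like user-generated "
        "and need therefore a special regex.")

# Split on runs of characters that are neither word characters nor hyphens.
# The trailing "." leaves an empty string at the end, so empty pieces are dropped.
words = [w for w in re.split(r"[^\w\-]+", text) if w]

print(words)
```

The hyphenated word survives as a single item because the hyphen is excluded from the set of separator characters.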

Regular expression to find position of the last alpha character that is followed by a space?

I am using ColdFusion 10. I rarely need to use regular expressions and really need some help.
I have some lengthy content (up to 8,000 characters) and want to create a teaser. After a certain length (which I will define elsewhere), I want to find the last alpha character that is followed by a space. I will remove everything after that character and then add an ellipsis (...).
MyString = "The lazy brown fox is not a dog."
In this case, I would delete everything after the "a" that precedes "dog".
MyString = "There are 123 boxes on up the hill, says that 612 guy."
In this case, I would delete everything after the "that" that precedes "612 ".
MyString = "I fell down the stairs on June 30th, 1962."
In this case, I would delete everything after the "June" that precedes "30th".
What regular expression would I use to find the position of the last alpha [a-zA-Z] character that is followed by a space?
MyReg = "";
LastPosition = reFindNoCase(MyReg, MyString);
I'm not sure about REFindNoCase, but I think you can try with REReplaceNoCase. I hope that CF can take back references like most regex engines do:
REReplaceNoCase(MyString, "(.*\b[a-zA-Z]+\b)\s.*", "$1", "ALL");
EDIT: for the backreference, it appears that you use the backslash instead of the dollar sign:
REReplaceNoCase(MyString, "(.*\b[a-zA-Z]+\b)\s.*", "\1", "ALL");
And if it goes well, you should have something like this.
.* matches anything besides a newline character, \b matches word boundaries, [a-zA-Z]+ are for alphabet characters and \s is for the space just after it.
The greediness of the first .*'s is being exploited here to capture as much as possible until you get the last word followed by a space.
And I guess you can add the ellipsis after the \1 like so:
REReplaceNoCase(MyString, "(.*\b[a-zA-Z]+\b)\s.*", "\1 (...)", "ALL");
If you only want to use REFind(), you could maybe use this:
REFindNoCase("[A-Za-z](?:\s\d+|\w+,)*\s[^\s]+\.$", MyString);
Note that I haven't tested this against every possible scenario, but I tried a few that don't work with the pattern above and do work with this one:
REFindNoCase("[A-Za-z](?:\s\d+|\s?\w+[,.-]+)*\s[^\s]+[.\s]*$", MyString);
And those are the few test subjects: link.
REFind will give you the position of the last alpha character. You can add 1 to get the position of the space in the original string.
If you're dealing with long strings, a regex would need to scan the whole string to get to the end, and it's likely more efficient to instead start at the end and work backwards.
Like this:
LastPos = len(String);
while( LastPos > 1 )
{
    LastPos = String.lastIndexOf(' ',LastPos-1);
    if ( mid(String,LastPos,1).matches('[a-zA-Z]') )
        break;
}
NewString = left(String,LastPos);
The idea is to keep stepping backwards finding spaces, and break the loop when the previous character is a letter (or the start of the string is reached).
If you really want a regex solution, just do:
NewString = rematch('.*[a-zA-Z] ',MyString)[1];
To get the position, you do len(NewString).
(If newlines are involved, you'd need to put (?s) at the start of the expression so that the dot matches them.)
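The greedy-match idea is easy to try outside ColdFusion. The Python sketch below is only an illustration of the same pattern: it ignores the "after a certain length" cutoff from the question, and teaser and LAST_ALPHA_BEFORE_SPACE are names invented for the example.

```python
import re

# Greedy .* runs to the end of the text and backtracks until it finds a letter
# that is directly followed by a space. DOTALL lets the dot cross newlines.
LAST_ALPHA_BEFORE_SPACE = re.compile(r".*[a-zA-Z](?= )", re.DOTALL)

def teaser(text: str) -> str:
    """Cut the text after the last letter that precedes a space and add an ellipsis."""
    m = LAST_ALPHA_BEFORE_SPACE.match(text)
    if m is None:
        return text  # no letter followed by a space, so leave the text alone
    return m.group(0) + "..."

print(teaser("The lazy brown fox is not a dog."))
print(teaser("There are 123 boxes on up the hill, says that 612 guy."))
print(teaser("I fell down the stairs on June 30th, 1962."))
```

The length of m.group(0) is the position the question asks for, so the same match can feed either a truncation or a position lookup.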

Why doesn't this regex pattern work?

I'm trying to select commas without numbers of 4 digits or the word "id" before, I tried with this:
( ? < ! [ \ d { 5 } | id ] ) ,
The problem
for example, if input string is "1999," that comma is not selected, I don't understand why.
Try this pattern:
(?<!\d{5}|id),
Your pattern, (?<![\d{5}|id]), is looking for a comma that is not after a digit, {, }, |, i, or d. They should not be in a character class ([]). If anything, (?<![\d]{5}|id), will also work, but the brackets are redundant.
First of all, unless you're using the /x flag, each space will attempt to match a space. So take those out.
Second, you're using [...] presumably to group an alternation (|), but square brackets actually indicate a character class, i.e. [\d{5}|id] is equivalent to [\d{}|id] and matches any one of those characters, but not more. What you mean is this:
(?<!\d{5}|id),
The final problem might be that many implementations of regex (you haven't specified which you're using) don't support variable-width lookbehind assertions. So, you may need to do something like:
(?<!\d{5}|...id),
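Python's re module is one engine that rejects the alternation of different widths inside a lookbehind. An alternative to padding, sketched below, is to chain two separate lookbehinds, each of fixed width. This is an illustration only; COMMA and comma_positions are names chosen for the example.

```python
import re

# Each lookbehind has a fixed width on its own (5 and 2), so the engine accepts
# them; both must succeed for the comma to match.
COMMA = re.compile(r"(?<!\d{5})(?<!id),")

def comma_positions(text: str) -> list:
    """Return the indexes of commas not preceded by five digits or by 'id'."""
    return [m.start() for m in COMMA.finditer(text)]

print(comma_positions("1999, 12345, id, abc,"))  # [4, 20]
```

The comma after 1999 is selected because only four digits precede it, while the commas after 12345 and id are skipped.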

Regular expression help - comma delimited string

I don't write many regular expressions so I'm going to need some help on this one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
POSIX allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 0 or more whitespace characters after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match 123 123 abc (no commas), which might be a problem.
This will also match 123, (ends with a comma), which might be a problem.
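To see the difference between the stricter and the more lenient patterns discussed so far, here is a small Python comparison. It is a sketch, not a recommendation: STRICT is the item(,item)* form with optional whitespace around the comma, LENIENT is the ^([a-zA-Z0-9]+,?\s*)+$ pattern, and the names ITEM, STRICT, LENIENT and samples are made up for the example.

```python
import re

ITEM = r"[0-9a-zA-Z]+"
# One item, then zero or more ", item" groups; whitespace around the comma is allowed.
STRICT = re.compile(rf"^{ITEM}(?:\s*,\s*{ITEM})*$")
# Comma is optional after every item, so missing and trailing commas slip through.
LENIENT = re.compile(r"^([a-zA-Z0-9]+,?\s*)+$")

samples = [
    "123, 4A67, GGG, 767",
    "12333, 78787&*, GH778",
    "fghkjhfdg8797<",
    "123 123 abc",
    "123,",
]

for s in samples:
    print(repr(s), bool(STRICT.match(s)), bool(LENIENT.match(s)))
```

Only the last two samples separate the patterns: the lenient one accepts them, the strict one does not.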
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x = [ "123, 4A67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
...     print(re.match( r, s ))
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : match one or more (but not zero) characters from the listed ranges, or spaces.
(?:[...]+,)* : in non-capturing parentheses, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Matching zero times allows for input with no comma.
[...]+ : match at least one of these. This does not include a comma, which ensures that a trailing comma is not accepted. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+$
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
import re
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
Try ^(?!,)((, *)?([a-zA-Z0-9]+)\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
Matching a word boundary makes sure that a comma is required whenever more items follow in the string.
Please use: ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set the maximum item length to 45, as the longest word in English is 45 characters; this can be changed as required.