Matlab regexp: is there an or statement?

Hey guys, I am trying to find a way to display the letter I by itself, but I keep having trouble. This is what I have so far.
This is the text file that I open, tolls.txt:
Join Microsoft employees supporting I Inspire Youth Project and other youth causes #GivingHero: http://msft.it/6013jboz
Waze for #WindowsPhone is here: http://msft.it/6016jbp2 I
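fid=fopen('tolls.txt');
counter=0;               % initialize the running total before the loop
getLine=fgetl(fid);
while ischar(getLine)    % fgetl returns -1 (not a char) at end of file
ct='I\s';
How=regexp(getLine,ct,'match');
counter=counter+length(How);
getLine=fgetl(fid);
end
fclose(fid);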
My problem is that I have to count every stand-alone capital I, including ones that have no space after them (such as at the end of a sentence) and ones at the start of a sentence. So in my ct variable I have ct='I\s', but I don't know if there is an or statement I can use to also incorporate \sI.
Hope I was clear about the question. Thank you for the help in advance.

What you'd need is something like:
ct = '(?<!\w)(I)(?!\w)';
Here (?<!\w) and (?!\w) denote a negative look-behind and a negative look-ahead respectively for a character from the word character class.
More information about the same may be found here.
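For anyone who wants to try the lookaround idea outside of MATLAB, here is a minimal Python sketch of the same pattern run against the two lines from the question. The variable names (lines, standalone_i, counts) are made up for illustration, and Python's re module stands in for MATLAB's regexp.

```python
import re

# The two sample lines from tolls.txt in the question.
lines = [
    "Join Microsoft employees supporting I Inspire Youth Project and other youth causes #GivingHero: http://msft.it/6013jboz",
    "Waze for #WindowsPhone is here: http://msft.it/6016jbp2 I",
]

# "I" with no word character directly before or after it.
standalone_i = re.compile(r"(?<!\w)I(?!\w)")

# Count the stand-alone I's per line, then in total.
counts = [len(standalone_i.findall(line)) for line in lines]
total = sum(counts)
print(counts, total)
```

The I in "Inspire" is skipped because it is followed by a word character, while the I at the very end of the second line still counts because the lookahead only requires that no word character follows.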

@RoneyMichael's solution is fine (though possibly overkill), but there is an or statement. Here is how you could look for three distinct patterns – ' I ' or 'I ' or ' I':
ct='(^I[\W]*\s)|(\sI[\W]*\s)|(\sI[\W]*$)';
How=regexp(getLine,ct,'match')
which returns:
How =
' I ' ' I'
The first and last patterns specifically match the letter 'I' if it occurs at the beginning or the end of the string, respectively. The '[\W]*' matches zero or more occurrences of non-word characters, e.g., punctuation. It's zero or more because of things like '...', '?!', etc. Alternatively, you could explicitly list allowed punctuation by using something like '[\.\?\!]*' instead (just remember that things such as quotes, parentheses, brackets, etc. can also come at the end of a line). Also, you may want to match '"I' or ''I'. In that case you can simply use
ct='(^[\W]*I[\W]*\s)|(\s[\W]*I[\W]*\s)|(\s[\W]*I[\W]*$)';
There are other logical and conditional operators that you can use in regular expressions.
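As a rough illustration of the | (or) operator, here is the first three-way pattern from this answer tried in Python rather than MATLAB. The names pattern, lines and matches are hypothetical; finditer is used so that the whole matched text is collected rather than the individual groups.

```python
import re

# The three alternatives joined with | : I at the start of the string,
# I between whitespace, or I at the end of the string.
pattern = re.compile(r"(^I\W*\s)|(\sI\W*\s)|(\sI\W*$)")

lines = [
    "Join Microsoft employees supporting I Inspire Youth Project and other youth causes #GivingHero: http://msft.it/6013jboz",
    "Waze for #WindowsPhone is here: http://msft.it/6016jbp2 I",
]

# group(0) is the full match, whichever alternative produced it.
matches = [m.group(0) for line in lines for m in pattern.finditer(line)]
print(matches)  # [' I ', ' I']
```

Note that the matches include the surrounding whitespace, which is fine for counting but would need trimming if only the letter itself is wanted.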

Related

Regex: match string unless it contains a word [duplicate]

I know that I can negate a group of chars, as in [^bar], but I need a regular expression where the negation applies to a specific word. In my example, how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
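A quick way to see the lookahead in action is a small Python check like the one below. The sample strings and the names no_bar, candidates and kept are invented for the example.

```python
import re

# Matches only strings that do not contain "bar" anywhere.
no_bar = re.compile(r"^(?!.*bar).*$")

candidates = ["foo", "barrier", "disbar", "bazaar", ""]

# Keep the strings the pattern accepts.
kept = [s for s in candidates if no_bar.match(s)]
print(kept)  # ['foo', 'bazaar', '']
```

"bazaar" survives because it contains the letters b, a and r but never the substring bar, which is exactly the difference from [^bar].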
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (if that is what is desired)
xxxSTRING2xxx KO (if that is what is desired)
xxxSTRING3xxx KO (if that is what is desired)
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
These all match anything that does not contain bar.
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and has at least one other character, or if the string does not start with 'bar'".
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exists. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
If it's truly a word, bar, that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar on a word boundary, that is to say, bar delimited by non-word characters (or by the start or end of the string). However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used (placed at the very start of the pattern, where most engines accept it):
(?s)^(?!.*\bbar\b).*$
Alternatively:
^(?![\s\S]*\bbar\b)[\s\S]*$
Instead of using any special flag, we are looking for any character that is either white space or non-white space. That should cover every character.
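To check the newline behaviour, here is a small Python sketch comparing the flag-based form with a flag-free form. In the flag-free form the lookahead also uses [\s\S] so that it scans past newlines; if only the tail uses [\s\S] and the lookahead keeps the plain dot, a bar on a later line goes unnoticed. The test strings and variable names are made up.

```python
import re

text = "first line\nbar on the second line"
clean = "first line\nsecond line"

# Dot matches newlines thanks to the inline (?s) flag.
dotall = re.compile(r"(?s)^(?!.*\bbar\b).*$")

# No flag: [\s\S] matches any character, newlines included.
any_char = re.compile(r"^(?![\s\S]*\bbar\b)[\s\S]*$")

# Pitfall: the lookahead's plain dot stops at the first newline.
first_line_only = re.compile(r"^(?!.*\bbar\b)[\s\S]*$")

print(dotall.match(text), any_char.match(text))            # both None
print(bool(dotall.match(clean)), bool(any_char.match(clean)))  # True True
print(bool(first_line_only.match(text)))                   # True, "bar" was missed
```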
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
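Here is a short Python sketch of that multi-word exclusion, assuming each input is a single word with no trailing newline. The word list and the names not_exact and kept are only for illustration.

```python
import re

# Reject a string only when it is exactly one of the listed words.
not_exact = re.compile(r"^(?!bar$|foo$|banana$).*")

words = ["bar", "barrier", "disbar", "foo", "food", "banana", "apple"]

kept = [w for w in words if not_exact.match(w)]
print(kept)  # ['barrier', 'disbar', 'food', 'apple']
```

Because each alternative carries its own $, "barrier" and "food" pass even though they start with an excluded word.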
I wish to complement the accepted answer and contribute to the discussion with my late answer.
@ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy vs. lazy'],
'anchor' : ['^', r'\b', '$'],  # raw strings keep the backslash literal
'non_printable' : [r'\n', r'\t', r'\r', r'\f', r'\v'],
'shorthand' : [r'\d', r'\w', r'\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
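Continuing the same example, dropping the empty strings and checking whether "bar" occurred at all could look like this. The names pieces, non_empty and contains_bar are just placeholders.

```python
text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'

# Split on the word to negate; empty strings mark adjacent or leading "bar"s.
pieces = text.split('bar')

# Remove the empty strings if only the text between the "bar"s matters.
non_empty = [p for p in pieces if p]

# More than one piece means "bar" appeared at least once.
contains_bar = len(pieces) > 1

print(non_empty)     # ['asd', ' 1234egb ar ', '32 sdf', 'aadf']
print(contains_bar)  # True
```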
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.

Regular Expression replace special characters, only if not part of word

I have the following string:
'United Breaks Guitars': Did It Really Cost The Airline $180 Million? http://ow.ly/htPVk
Currently, my regex pattern looks like this: [^A-Za-z-0-9- - / -$]
I'm not an expert on regex and I've been playing around with this tool to figure things out, but I am stuck.
I'd like to remove characters such as ', ", :, etc. So far with the above pattern the highlighted characters are being removed from my example string:
'United Breaks Guitars' : Did It Really Cost The Airline $180 Million? http://ow.ly/htPVk
The issue above is that I don't want to remove the : and . from the URL. But if the string ends with a period I would like to remove it. Also, the apostrophe ' character should be kept in case it's used to omit characters or as a possession.
Thanks in advance.
It depends on how you define "part of a word"; a URL isn't much of a word.
If you define "part of a word" as surrounded by non-space characters, then you could use something like:
(?<!\S)[^\w $-]+|[^\w $-]+(?!\S)
(?!\S) is a shorter way of saying (?=\s|$), and the same applies for the lookbehind.
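To see what this does to the string from the question, here is a minimal Python sketch using re.sub. The variable names are made up, and other regex flavours may treat the character class slightly differently.

```python
import re

title = "'United Breaks Guitars': Did It Really Cost The Airline $180 Million? http://ow.ly/htPVk"

# Strip runs of special characters only when they touch whitespace
# (or the start/end of the string) on at least one side.
edge_punct = re.compile(r"(?<!\S)[^\w $-]+|[^\w $-]+(?!\S)")

cleaned = edge_punct.sub("", title)
print(cleaned)
# United Breaks Guitars Did It Really Cost The Airline $180 Million http://ow.ly/htPVk
```

The : and . inside the URL stay because they have non-space characters on both sides, and an apostrophe in the middle of a word such as Jack's would be kept for the same reason.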

Regular expression to select entire word except first letter, including words such as "Jack's" and "merry-go-round"

I'm trying to use a regular expression to select all of each word except the first character, much as @mahdaeng wanted to do here. The solution offered to his question was to use \B[a-z]. This works fine, except when a word contains some form of punctuation, such as "Jack's" and "merry-go-round". Is there a way to select the entire word including any contained punctuation? (Not including outside punctuation such as "? , ." etc.)
If you can enumerate the acceptable in-word punctuation, you could just expand upon the answer you linked:
\B[a-zA-Z'-]+
A regex really isn't necessary here, since you can just split your word on spaces and deal with each word accordingly. Since you don't mention an underlying language, here's an implementation in Perl:
use strict;
use warnings;
$_="Jack's merry-go-round revolves way too fast!";
my @words=split /\s+/;
foreach my $word(@words)
{
my $stripped_word=substr($word,1);
$stripped_word=~s/[^a-z]$//i; #stripping out end punctuation
print "$stripped_word\n";
}
The output is:
ack's
erry-go-round
evolves
ay
oo
ast
\B[^\s]+
(where ^\s means "not whitespace") should get you what you want assuming the words are whitespace-delimited. If they're also punctuation-delimited, you might need to enumerate the punctuation:
\B[^\s,.?!]+
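For comparison, here is how two of the patterns from this thread behave in Python on the sentence used in the Perl answer. This is only a sketch; the names enumerated and non_space are invented.

```python
import re

sentence = "Jack's merry-go-round revolves way too fast!"

# Enumerate the allowed in-word punctuation (apostrophe and hyphen).
enumerated = re.findall(r"\B[a-zA-Z'-]+", sentence)

# Or: anything that is not whitespace or listed outside punctuation.
non_space = re.findall(r"\B[^\s,.?!]+", sentence)

print(enumerated)
# ["ack's", 'erry-go-round', 'evolves', 'ay', 'oo', 'ast']
print(enumerated == non_space)  # True for this sentence
```

One thing to keep in mind is that a one-letter word such as "a" produces no match at all, since nothing is left after its first character.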

Regular expression for a list of items separated by comma or by comma and a space

Hey,
I can't figure out how to write a regular expression for my website. I would like to let the user input a list of items (tags) separated by a comma or by a comma and a space, for example "apple, pie,applepie". Would it be possible to have such a regexp?
Thanks!
EDIT:
I would like a regexp for javascript in order to check the input before the user submits a form.
What you're looking for is deceptively easy:
[^,]+
This will give you every comma-separated token, and will exclude empty tokens (if the user enters "a,,b" you will only get 'a' and 'b'), BUT it will break if they enter "a, ,b".
If you want to strip the spaces from either side properly (and exclude whitespace only elements), then it gets a tiny bit more complicated:
[^,\s][^\,]*[^,\s]*
However, as has been mentioned in some of the comments, why do you need a regex where a simple split and trim will do the trick?
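The split-and-trim idea mentioned above could look roughly like this. The question asks about JavaScript, so treat this Python version as a sketch of the approach only; parse_tags is a hypothetical helper name.

```python
def parse_tags(raw):
    """Split on commas, trim each piece, and drop empty pieces."""
    return [piece.strip() for piece in raw.split(",") if piece.strip()]

print(parse_tags("apple, pie,applepie"))  # ['apple', 'pie', 'applepie']
print(parse_tags("a, ,b"))                # ['a', 'b']
```

Unlike the plain [^,]+ pattern, this does not trip over "a, ,b", because whitespace-only pieces are discarded after trimming.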
Assuming the words in your list may be letters from a to z and you allow, but do not require, a space after the comma separators, your reg exp would be
[a-z]+(,\s*[a-z]+)*
This will match "ab" or "ab, de", but not "ab ,dc"
Here's a simpler solution:
console.log("test, , test".match(/[^,(?! )]+/g));
It doesn't break on empty properties and strips spaces before and after properties.
This thread is almost 7 years old and was last active 5 months ago, but I wanted to achieve the same results as OP and after reading this thread, came across a nifty solution that seems to work well
.match(/[^,\s?]+/g)
Regarding the regular expression... I suppose a more accurate statement would be to say "target anything that IS NOT a comma followed by any (optional) amount of white space" ?
I often work with comma-separated patterns, and for me, this works:
((^|[,])pattern)+
where "pattern" is the single element regexp
This might work:
([^,]*)(, ?([^,]*))*
([^,]*)
This looks for the commas within a given string and then separates the items at them. As for the whitespace: can't you just split on the commas and remove the whitespace afterwards?
I needed a strict validation for comma-separated input of alphabetic characters with no spaces. I ended up using this one, in case anyone needs it:
/^[a-z]+(,[a-z]+)*$/
Or, to support lower- and uppercase words:
/^[A-Za-z]+(?:,[A-Za-z]+)*$/
In case one needs to allow whitespace between words:
/^[A-Za-z]+(?:\s*,\s*[A-Za-z]+)*$/
/^[A-Za-z]+(?:,\s*[A-Za-z]+)*$/
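As a rough check of the whitespace-tolerant validator, here is a Python sketch. It uses re.fullmatch in place of the ^ and $ anchors, which is an adaptation for Python; the name is_valid_tag_list is hypothetical.

```python
import re

# Letters only, separated by commas with optional whitespace around them.
TAG_LIST = re.compile(r"[A-Za-z]+(?:\s*,\s*[A-Za-z]+)*")

def is_valid_tag_list(raw):
    """Return True when the whole input is a well-formed tag list."""
    return TAG_LIST.fullmatch(raw) is not None

print(is_valid_tag_list("apple, pie,applepie"))  # True
print(is_valid_tag_list("apple,,pie"))           # False
print(is_valid_tag_list("apple, "))              # False
```

A validator like this rejects empty items and trailing separators outright, whereas the matching patterns earlier in the thread silently skip them.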
You can try this, it worked for me:
/.+?[\|$]/g
or
/[^\|?]+/g
but replace '|' with the separator you need. Also, don't forget about escaping.
something like this should work: ((apple|pie|applepie),\s?)*
