get a line with regex - regex

I'm having trouble doing simple things with regex in dot net.
Suppose I want to find all lines that contain the word "pizza". I would think I would do the following:
^ .* pizza .* $
The idea is the first character indicates the start of a line, the dollar sign indicates the end of the line, and the dot-star indicates any number of characters.
This doesn't seem to work.
Then I tried something else that doesn't work either. I thought I would find all routines in my visual basic project that start with "Sub Page_Load" and end with "End Sub". I did a search for:
Sub Page_Load .* End Sub
But this found pretty much EVERY subroutine in the project.
In other words, it didn't limit itself to the Page_Load sub.
So I thought I'd be smart and notice that every End Sub is at the end of a line, so all I have to do is put a $ after it like this:
Sub Page_Load .* End Sub$
But that finds absolutely zero strings.
So what am I doing wrong? (One note: I put extra blanks around .* here so you can see it, but normally the blanks would not be there.)

You may need a non-greedy approach. Try this:
^.*?pizza.*$

So, now complete new answer.
Search for the word "pizza" (not "pizzas")
If you have a multiline string and want to find a single row, you need to use the Multiline option. That changes the behaviour of the anchors ^ and $ so that they match the start and the end of each row.
To make sure you match only the complete word "pizza" and not a partial match, use word boundaries (\b).
If you don't use the Singleline option, you don't need to worry about greediness
So your regex would be:
Regex optionRegex = new Regex(@"^.*\bpizza\b.*$", RegexOptions.Multiline);
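For readers who want to try the idea outside .NET, here is a rough Python sketch of the same pattern. It assumes Python's re.MULTILINE behaves like RegexOptions.Multiline for this case; the sample text is made up for illustration.

```python
import re

# Made-up sample text: two lines contain the whole word "pizza".
text = "I like pizza.\nTwo pizzas please\nno match here\npizza"

# re.MULTILINE makes ^ and $ match at the start and end of each line.
pattern = re.compile(r"^.*\bpizza\b.*$", re.MULTILINE)

lines = pattern.findall(text)
print(lines)  # ['I like pizza.', 'pizza']
```

The "pizzas" line is skipped because \b requires a word boundary right after "pizza".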
For the Sub Page_Load.*End Sub thing, you need to match more than one line:
Use the Singleline option to allow the . to match newline characters as well.
You need non-greedy (lazy) matching behaviour for the quantifier, i.e. .*? instead of .*
So your regex would be:
Regex optionRegex = new Regex(@"Sub Page_Load.*?End Sub", RegexOptions.Singleline);
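The difference between the greedy and the lazy quantifier can be seen in a small Python sketch. It assumes re.DOTALL plays the role of RegexOptions.Singleline; the Visual Basic snippet in `source` is invented for the example.

```python
import re

# Invented VB-like source with two subroutines.
source = (
    "Sub Page_Load(sender, e)\n"
    "    x = 1\n"
    "End Sub\n"
    "Sub Other()\n"
    "    y = 2\n"
    "End Sub\n"
)

# re.DOTALL lets . match newlines, so the pattern can span several lines.
lazy = re.compile(r"Sub Page_Load.*?End Sub", re.DOTALL)
greedy = re.compile(r"Sub Page_Load.*End Sub", re.DOTALL)

lazy_match = lazy.search(source).group(0)
greedy_match = greedy.search(source).group(0)

print(lazy_match.count("End Sub"))    # 1: stops at the first End Sub
print(greedy_match.count("End Sub"))  # 2: runs on to the last End Sub
```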

Related

regular expression how do I make it stop looking

I'm trying to write a regex which matches the first 3 lines below (the rest are tests cases which I do NOT want to catch)
Sample text for testing:
10:00:00+10:00/mon,thu
10:00:00+10:00/mon-thu
10:00:00+10:00/mon
10:00:00+10:00/monday-thu
10:00:00+10:00/mon-thursday
10:00:00+10:00/mon,,,thu
10:00:00+10:00/mon,
10:00:00+10:00/mon+thu
10:00:00+10:00/monthu
10:00:00+10:00/
21:00:00+10:00\sat-sun
So far I have come up with
[0-9]{2}[:][0-9]{2}[:][0-9]{2}[+][0-9]{2}[:][0-9]{2}([/][a-z]{3}){1}([,-][a-z]{3})?
As you can see, it makes the matches I want, but it also matches lines that have trailing characters. When there are trailing characters, the line should not match at all.
Add $ to the end of the regexp. This matches the end of the line, so it will prevent matches if there's anything after it.
You should also put ^ at the beginning so it doesn't match if there's anything before the time.
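To check the anchored version against the sample lines from the question, something like the following Python sketch could be used. The pattern is the one from the question with ^ and $ added; nothing else is changed.

```python
import re

samples = [
    r"10:00:00+10:00/mon,thu",
    r"10:00:00+10:00/mon-thu",
    r"10:00:00+10:00/mon",
    r"10:00:00+10:00/monday-thu",
    r"10:00:00+10:00/mon-thursday",
    r"10:00:00+10:00/mon,,,thu",
    r"10:00:00+10:00/mon,",
    r"10:00:00+10:00/mon+thu",
    r"10:00:00+10:00/monthu",
    r"10:00:00+10:00/",
    r"21:00:00+10:00\sat-sun",
]

# Original pattern from the question, anchored at both ends.
pattern = re.compile(
    r"^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[+][0-9]{2}[:][0-9]{2}"
    r"([/][a-z]{3}){1}([,-][a-z]{3})?$"
)

matched = [s for s in samples if pattern.search(s)]
print(matched)  # only the first three samples
```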

Regex match till end of text

I'm using Regex to match whole sentences in a text containing a certain string. This is working fine as long as the sentence ends with any kind of punctuation. It does not work however when the sentence is at the end of the text without any punctuation.
This is my current expression:
[^.?!]*(?<=[.?\s!])string(?=[\s.?!])[^.?!]*[.?!]
Works for:
This is a sentence with string. More text.
Does not work for:
More text. This is a sentence with string
Is there any way to make this work as intended? I can't find any character class for "end of text".
End of text is matched by the anchor $, not a character class.
You have two separate issues you need to address: (1) the sentence ending directly after string, and (2) the sentence ending sometime after string but with no end-of-sentence punctuation.
To do this, you need to make the match after string optional, but anchor that match to the end of the string. This also means that, after you recognize an (optional) end-of-sentence punctuation mark, you need to match everything that follows, so the end-of-string anchor will match.
My changes: Take everything after string in your original regex and surround it in (?:...)? - the (?:...) being a "non-remembered" group, and the ? making the entire group optional. Follow that with $ to anchor the end of the string.
Within that optional group, you also need to make the end-of-sentence itself optional, by replacing the simple [.?!] with (?:[.?!].*)? - again, the (?:...) is to make a "non-remembered" group, the ? makes the group optional - and the .* allows this to match as much as you want after the end-of-sentence has been found.
[^.?!]*(?<=[.?\s!])string(?:(?=[\s.?!])[^.?!]*(?:[.?!].*)?)?$
The symbol for end-of-text is $ (and, the symbol for beginning-of-text, if you ever need it, is ^).
You probably won't get what you're looking for by just adding the $ to your punctuation list though (e.g., [.?!$]); you'll find it works better as an alternative choice: ([.?!]|$).
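A minimal Python sketch of the "punctuation or end of text" alternative is shown here. Note one assumption: it uses \bstring\b in place of the lookarounds from the question, which is a simplification and not the asker's exact expression.

```python
import re

# Assumption: \bstring\b stands in for the original lookbehind/lookahead.
# The sentence may end with punctuation OR with the end of the text.
sentence = re.compile(r"[^.?!]*\bstring\b[^.?!]*(?:[.?!]|$)")

with_punct = sentence.search("This is a sentence with string. More text.")
at_end = sentence.search("More text. This is a sentence with string")

first = with_punct.group(0).strip()
second = at_end.group(0).strip()

print(first)   # This is a sentence with string.
print(second)  # This is a sentence with string
```

The strip() call only removes the space that follows the previous sentence's full stop.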
Your regex is way too complex for what you want to achieve.
To match only a word just use
"\bstring\b"
It will match start, end and any non-alphanum delimiters.
It works with the following:
string is at the start
this is the end string
this is a string.
stringing won't match (you don't want a match here)
You should mention the language in the question, since regex usage differs between languages.
Here is my example using javascript:
var reg = /^([\w\s\.]*)string([\w\s\.]*)$/;
console.log(reg.test('This is a sentence with string. More text.'));
console.log(reg.test('More text. This is a sentence with string'));
console.log(reg.test('string'))
Note:
* : Match zero or more times.
? : Match zero or one time.
+ : Match one or more times.
You can replace * with ? or + if you want to be more specific about how many characters are allowed.

Meaning of caret (^) in a Regular Expression [duplicate]

I have read recently about JavaScript regular expressions, but I am confused.
The author says that it is necessary to include the caret (^) and dollar symbol ($) at the beginning and end of all regular expression declarations.
Why are they needed?
Javascript RegExp() allows you to specify a multi-line mode (m) which changes the behavior of ^ and $.
^ represents the start of the current line in multi-line mode, otherwise the start of the string
$ represents the end of the current line in multi-line mode, otherwise the end of the string
For example: this allows you to match something like semicolons at the end of a line where the next line starts with "var" /;$\n\s*var/m
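The effect of multi-line mode can also be tried in Python, whose re.MULTILINE flag corresponds to the m flag described above. This is only a sketch with made-up input strings.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ and $ refer to the whole string.
starts_default = re.findall(r"^\w+", text)
ends_default = re.findall(r"\w+$", text)

# With MULTILINE, ^ and $ also match at every line break.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

print(starts_default, starts_multiline)  # ['first'] ['first', 'second']
print(ends_default, ends_multiline)      # ['line'] ['line', 'line']

# The semicolon example: a line ending in ; followed by a line starting with var.
semicolon_hit = re.search(r";$\n\s*var", "a = 1;\n  var b", re.MULTILINE)
print(semicolon_hit is not None)  # True
```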
Fast regexen also need an "anchor" point, somewhere in the string to start the search. These characters tell the regex engine where to start looking and generally reduce the amount of backtracking, making your regex much, much faster in many cases.
NOTE: This knowledge came from Nicholas Zakas's High Performance JavaScript
Conclusion: You should use them!
^ represents the start of the input string.
$ represents the end.
You don't actually have to use them at the start and end. You can use em anywhere =) Regex is fun (and confusing). They don't represent a character. They represent the start and end.
This is a very good website
They match the start of the string (^) and end of the string ('$').
You should use them when matching strings at the start or end of the string. I wouldn't say you have to use them, however.
I have tested these.
1. /^a/ matches abb, ab but not ba, bab, bba.
2. /a/ matches abb, ab and ba, bab, bba.
So /^a/ matches strings that start with a.
/a/ matches strings that contain a.
Similar to /^a/, /a$/ matches ba, a, but not ab, bab.
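These checks can be reproduced with a short Python sketch (Python's re.search is used here instead of JavaScript's test, but ^ and $ behave the same way for these single-line strings).

```python
import re

words = ["abb", "ab", "ba", "bab", "bba", "a"]

starts_with_a = [w for w in words if re.search(r"^a", w)]
contains_a = [w for w in words if re.search(r"a", w)]
ends_with_a = [w for w in words if re.search(r"a$", w)]

print(starts_with_a)  # ['abb', 'ab', 'a']
print(contains_a)     # every word in the list
print(ends_with_a)    # ['ba', 'bba', 'a']
```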
Refer http://www.regular-expressions.info/anchors.html .
If you notice a wrong (or strange) sentence above, please let me know; I would be grateful.
^ anchors the beginning of the RE at the start of the test string, and $ anchors the end of the RE at the end of the test string. If that's what you want, go for it! However, if you're using REs of the form ^.*theRealRE.*$ then you might want to consider dropping the anchors and just using the core of the RE on its own.
Some languages force REs to be anchored at both ends by default.

Using RegEx to match the beginning of string if end of string is not

I am trying to match lines in a configuration that start with the word "deny" but do not end with the word "log". This seems terribly elementary but I can not find my solution in any of the numerous forums I have looked. My beginners mindset led me to try "^deny.* (?!log$)" Why wouldn't this work? My understanding is that it would find any strings that begin with "deny" followed by any character for 0 or more digits where the end of line is something other than log.
When given a line like deny this log, your ^deny.*(?!log$) regex (I'm omitting the space that was in your sample question) is evaluated as follows:
^deny matches "deny".
.* means "match 0 or more of any character", so it can match " this log".
(?!log$) means "make sure that the next characters aren't 'log' then the end of the line." In this case, they're not - they're just the end of the line - so the regex matches.
Try this regex instead:
^deny.*$(?<!log)
"Match deny at the beginning of the string, then match to the end of the line, then use a zero-width negative look-behind assertion to check that whatever we just matched at the end of the line is not 'log'."
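Here is a small Python sketch comparing the original attempt with the look-behind version. The config lines are invented, and re.MULTILINE is an assumption added so that each line of a multi-line config is checked separately.

```python
import re

# Invented configuration lines for illustration.
config = "deny this log\ndeny that\npermit other\ndeny log entry"

# The original attempt: .* swallows the line, so the look-ahead
# only ever sees the end of the line and always succeeds.
original_hit = re.search(r"^deny.*(?!log$)", "deny this log")
print(original_hit is not None)  # True, even though the line ends in "log"

# Look-behind version: match to the end of the line, then check
# that the last three characters were not "log".
pattern = re.compile(r"^deny.*$(?<!log)", re.MULTILINE)
matches = pattern.findall(config)
print(matches)  # ['deny that', 'deny log entry']
```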
With all of that said...
Regexes aren't necessarily the best tool for the job. In this case, a simple Boolean operator like
if (/^deny/ and not /log$/)
is probably clearer than a more advanced regex like
if (/^deny.*$(?<!log)/)
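The two-test approach might look like this in Python. The function name is_unlogged_deny is hypothetical, chosen just for this sketch.

```python
import re

def is_unlogged_deny(line: str) -> bool:
    """True if the line starts with 'deny' and does not end with 'log'."""
    # Two simple tests combined with boolean logic instead of one look-around.
    return bool(re.search(r"^deny", line)) and not re.search(r"log$", line)

results = {
    line: is_unlogged_deny(line)
    for line in ["deny this log", "deny that", "permit log"]
}
print(results)
# {'deny this log': False, 'deny that': True, 'permit log': False}
```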
(?!log$) is a zero-width negative look-ahead assertion that means don't match if immediately ahead at this point in the string is log and the end of the string, but the .* in your regex has already greedily consumed all the characters right up to the end of the string so there is no way the log could then match.
If your regular expression implementation supports look-behinds you could use a regex such as in Josh Kelley's answer, if you were using javascript you could use
/^deny(?:.{0,2}|.*(?!log)...)$/m
The m flag means multiline mode, which makes ^ and $ match the start and end of every line rather than just the start and end of the string.
Note that three . are positioned after the negative look-ahead so that it has space to match log if it is there. Including these three dots meant it was also necessary to add the .{0,2} option so that strings with from zero to two characters after deny would also match. The (?:a|b) means a non-capturing group where a or b has to match.
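The look-ahead-only pattern also runs unchanged in Python, which makes it easy to check the edge cases described above. The test lines are made up for this sketch.

```python
import re

# Same pattern as the JavaScript version; re.MULTILINE plays the role of /m.
pattern = re.compile(r"^deny(?:.{0,2}|.*(?!log)...)$", re.MULTILINE)

lines = ["deny", "deny x", "denylog", "deny this log", "deny this", "deny logs"]
accepted = [line for line in lines if pattern.search(line)]

print(accepted)  # ['deny', 'deny x', 'deny this', 'deny logs']
```

"deny" and "deny x" are accepted through the .{0,2} branch, since there are fewer than three characters after deny.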

Matching a line without either of two words

I was wondering how to match a line without either of two words?
For example, I would like to match a line with neither Chapter nor Part. So neither of these two lines is a match:
("Chapter 2 The Economic Problem 31" "#74")
("Part 2 How Markets Work 51" "#94")
while this is a match
("Scatter Diagrams 21" "#64")
My python-style regex will be like (?<!(Chapter|Part)).*?\n. I know it is not right and will appreciate your help.
Try this:
^(?!.*(Chapter|Part)).*
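A quick Python check of this pattern against the lines from the question could look like the sketch below. re.MULTILINE is an assumption added here so that ^ is tried at the start of every line, and finditer is used because findall would return the (empty) capture group instead of the whole line.

```python
import re

toc = (
    '("Chapter 2 The Economic Problem 31" "#74")\n'
    '("Part 2 How Markets Work 51" "#94")\n'
    '("Scatter Diagrams 21" "#64")'
)

# The negative look-ahead rejects a line if Chapter or Part occurs anywhere in it.
pattern = re.compile(r"^(?!.*(Chapter|Part)).*", re.MULTILINE)

kept = [m.group(0) for m in pattern.finditer(toc)]
print(kept)  # ['("Scatter Diagrams 21" "#64")']
```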
@MRAB's solution will work, but here's another option:
(?m)^(?:(?!\b(?:Chapter|Part)\b).)*$
The . matches one character at a time, after the lookahead checks that it's not the first character of Chapter or Part. The word boundaries (\b) make sure it doesn't incorrectly match part of a longer word, like Partition.
The ^ and $ are start- and end anchors; they ensure that you match a whole line. $ is better than \n because it also matches the end of the last line, which won't necessarily have a linefeed at the end. The (?m) at the beginning modifies the meaning of the anchors; without that, they only match at the beginning and end of the whole input, not of individual lines.
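The word-boundary behaviour can be checked with a short Python sketch. The input lines are simplified and partly invented ("Partition Tables 9" is not from the question) to show that a longer word is not rejected.

```python
import re

text = (
    "Chapter 2 The Economic Problem 31\n"
    "Part 2 How Markets Work 51\n"
    "Scatter Diagrams 21\n"
    "Partition Tables 9"
)

# Each . is consumed only if Chapter or Part (as whole words) does not start there.
pattern = re.compile(r"(?m)^(?:(?!\b(?:Chapter|Part)\b).)*$")

kept = [m.group(0) for m in pattern.finditer(text)]
print(kept)  # ['Scatter Diagrams 21', 'Partition Tables 9']
```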