Regex to match when a string is present twice - regex

I am horrible at regular expressions, and I don't use them often enough to remember the syntax between uses.
I am using grepWin to search my files. I need to do a search that will return the files that have a given string twice.
So, for example, if I was searching on the word "how", then file one would not match:
Hello
how are you today?
but file two would:
Hello
how are you today?
I am fine, how are you?
Does anyone know how to write a regex that will match that?

Something like this (it depends on the language and your specific task):
/(how.*){2}/
Edit:
According to @CodeJockey:
/^(([^h]|h[^o]|ho[^w])*how([^h]|h[^o]|ho[^w])*){2,2}$/
(it becomes more complicated)
@CodeJockey: Thanks for the comments.

I don't know what grepWin supports, but here's what I came up with to make something match exactly twice.
/^((?!how).)*how((?!how).)*how((?!how).)*$/
Explanation:
/^ # start of subject
((?!how).)* # any text that does not contain "how"
how # the word "how"
((?!how).)* # any text that does not contain "how"
how # the word "how"
((?!how).)* # any text that does not contain "how"
$/ # end of subject
This ensures that you find two "how"s, but the texts between the "how"s and to either side of them do not contain "how".
Of course, you can substitute any string for "how" in the expression.
If you want to "simplify" by only writing the search expression twice, you can use backreferences thus:
/^(?:(?!how).)*(how)(?:(?!\1).)*\1(?:(?!\1).)*$/
Refiddle with this expression
Explanation:
I added ?: to make the negative lookaheads' text non-capturing. Then I added parentheses around the regular how to make that a capturing subpattern (the first and only one).
I had to include "how" again in the first lookahead because it's a negative lookahead (meaning any capture would not contain "how") and the captured "how" is not captured yet at that point.
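As a rough check of the pattern above, here is a minimal Python sketch. It uses Python's re module rather than grepWin, so flavour differences may apply. The helper name exactly_twice_pattern and the sample strings are made up for illustration, and re.DOTALL is an assumption so that . can cross line breaks when a whole file is read as one string.

```python
import re

def exactly_twice_pattern(word):
    """Build a regex matching text that contains `word` exactly twice (hypothetical helper)."""
    w = re.escape(word)
    # Any run of characters in which `word` never starts.
    not_word = rf"(?:(?!{w}).)*"
    return re.compile(rf"\A{not_word}{w}{not_word}{w}{not_word}\Z", re.DOTALL)

pattern = exactly_twice_pattern("how")

file_one = "Hello\nhow are you today?"                             # one "how"
file_two = "Hello\nhow are you today?\nI am fine, how are you?"    # two
file_three = file_two + "\nAnd how is work?"                       # three

for name, text in [("one", file_one), ("two", file_two), ("three", file_three)]:
    print(name, bool(pattern.search(text)))
```

Only the second sample should be reported as a match.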

This is significantly harder than I originally thought it would be, and requires variable-length lookbehind, which grepWin does not support...
this expression:
(?<!blah.{0,99999})blah(?=.*?blah)(?!.*blah.*blah)
was successfully used in Eclipse, using the "Search > File" dialog to exclude files with one and three instances of blah and to include files with exactly two instances of blah.
Eclipse does not permit a .* in lookbehind, so I used .{0,99999} instead.
It is possible with the right tool, but it isn't pretty to get it to work with grepWin (see the answer above). Can you use other tools (such as Eclipse), and what did you want to do with the files afterwards?

This works for grep or Python; it returns a match only if "how" occurs at least twice on the same line of your_file:
grep "how.*how" your_file
in python (re imported):
re.search(r"how.*how","your_text")
The match spans everything from the first "how" to the last one on the line (the dot means any character except a newline, and the star means any number of repetitions), and you can customize your own script.
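One thing to watch for with the question's sample file is that the two "how"s sit on different lines. A small Python sketch of the difference follows. The sample text is taken from the question; using re.DOTALL and str.count here is my own assumption about what you might want, not part of the answer above.

```python
import re

text = "Hello\nhow are you today?\nI am fine, how are you?"

# Without DOTALL, "." stops at newlines, so both "how"s must be on one line.
same_line = re.search(r"how.*how", text)

# With DOTALL, "." also matches newlines, so the two "how"s may be on different lines.
across_lines = re.search(r"how.*how", text, re.DOTALL)

# Neither regex checks for *exactly* two; a plain count does.
occurrences = text.count("how")

print(same_line)
print(across_lines.group(0))
print(occurrences)
```

If you need exactly two occurrences rather than at least two, comparing the count to 2 is probably the simplest route.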

Related

Select word with regex when previous words are specific (and sometimes variable)

I am trying to highlight (or find) any word that is preceded by another specific word (define), and then another specific word (as) only when define is present, and so on. Basically, I need to find words that are matched because of other regex searches, while targeting each word independently.
For example, having the following string:
define MyFile as File
In that case, define is searched using the regex statement \b-?define\b. I also need to find MyFile if it is preceded directly by define. Plus, as needs to be found as well only if it is preceded directly by a word, in this case MyFile, which is preceded by define, and this goes on and on.
How can this be done? I have messed around quite a bit to find how to highlight MyFile correctly, without any success. As for the specific recursive search of as and File, I am clueless.
Keep in mind that all the regex expressions must be separate, since I will use this as a Sublime Text custom syntax highlight match finder.
define\s([\w]+)\sas\s([\w]+)$
This regex captures the word that follows define (separated by one whitespace character) and the word that follows as (again separated by one whitespace character).
Check this regex: https://regex101.com/r/aQ0yO0/2
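To see what the two groups hold, here is a minimal sketch in Python (not Sublime Text's engine, so treat it only as an illustration of the pattern). The variable names are mine.

```python
import re

pattern = re.compile(r"define\s([\w]+)\sas\s([\w]+)$")

match = pattern.search("define MyFile as File")

# Group 1 is the word after "define", group 2 is the word after "as".
name, type_name = match.groups()

# span() gives the character range of each group, which is what a highlighter needs.
name_span = match.span(1)
type_span = match.span(2)

print(name, name_span)
print(type_name, type_span)
```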
Without knowing what the data looks like, this is a naive way of doing it, but it is pretty intuitive. It doesn't use regex, though; the other answers show good ways to use regex.
seq = "word1 defined as blah blahh blahhh word2 defined as hello helloo"
words_of_interest = []
list_of_words = seq.split(" ")
for i, word in enumerate(list_of_words):
    # Record the word immediately before each "defined" (index 0 has no predecessor).
    if word == "defined" and i > 0:
        words_of_interest.append(list_of_words[i - 1])
print(words_of_interest)
# ['word1', 'word2']
The regular expression is always going to encompass the "define" as well. The trick is to use capture groups and refer to them afterwards. The specific way to do this depends on the "flavor" of your regex.
As I'm not familiar with Sublime's regex, I'm just going to present an example in sed:
$ sed -e 's/define \([A-Za-z]*\)/include \1/g' <<< "define MyFile as File"
include MyFile as File
This example replaces all "define"s with "include"s - and adds whatever was captured by what's inside the group (the regex [A-Za-z]* in this case). Not too useful, but hopefully explanatory :)
The capture group is denoted by the escaped parentheses, and (in sed) referenced by the escaped number (representing the index) of the group.
I believe it's capture groups as a concept that you're looking for, rather than any specific regex.
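For comparison, the same substitution can be sketched in Python, where groups use unescaped parentheses and the replacement still refers to the group as \1. This is only a translation of the sed example, not Sublime-specific syntax.

```python
import re

# ([A-Za-z]*) captures the word after "define"; \1 re-inserts it in the replacement.
result = re.sub(r"define ([A-Za-z]*)", r"include \1", "define MyFile as File")

print(result)
```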

Perl regexp partial match if the string were longer?

I have a tree of nested hashes, each of which contains a name, like nested directories with files. If I get externally supplied regexps at runtime (which I don't want to analyze), how can I find which subtrees to search for matches? The path to match might be of the form
"$x{name}/$x{subdir}{name}/$x{subdir}{subdir}{name}"
but, because there can be thousands of hashes, I want to try it only if both of these partially match:
"$x{name}"
"$x{name}/$x{subdir}{name}"
Or even better, if the 1st part matches then try to continue directly with the 2nd and then with the 3rd, sort of like /\G.../g, except the regexp comes from elsewhere in one piece. And I'd need backtracking to also look in all other partially matching subdirs.
PCRE g_match_info_is_partial_match sounds like just what I'm looking for, but despite the "Perl" in that name even the 5.18 source doesn't seem to contain this. And I actually want something backward compatible to 5.8.0.
Background to this question is introducing regexp syntax to makepp. We essentially do that for patterns, but due to their trivial syntax, that is easy. Note that we cache what files we find and can cope with more files as they appear. This enables makepp to match files which might be built later, because it puts the rules' outputs into the tree as well.
Perl regexes and PCRE inspire each other, but are not really compatible and are totally not the same. Perl uses a custom regex engine.
Either a regex matches, or it doesn't. If a regex fails, it is impossible to tell where the match failed except when the regex was written in such a way to report the position.
The only viable solution would be to require a list of regexes, one for each level.
Otherwise you could require users to write regexes in such a way that partial matches work as well. In this case, the regex qr|foo/bar\.txt$| would have to be rewritten as:
qr|\A / # anchor at start
(?: [^/]*/ )* # match as many directories as necessary
(?: foo/bar\.txt )? # maybe match an ending foo/bar.txt
\z|x # anchor at end
Example:
for ("/a/", "/a/b/", "/a/b/foo/", "/a/b/foo/", "/a/b/foo/bar.txt", "/a/b/foo/baz.txt", "/a/bar.txt") {
    say qq("$_" -- ), /$regex/ ? "matches" : "doesn't match";
}
Output:
"/a/" -- matches
"/a/b/" -- matches
"/a/b/foo/" -- matches
"/a/b/foo/bar.txt" -- matches
"/a/b/foo/baz.txt" -- doesn't match
"/a/bar.txt" -- doesn't match
Obviously, this doesn't reduce the search space in any way for this regex.
You may be able to spin this in a way that works for your application. Depending on the guarantees your app provides, you can transform the original regex automatically to something that “always” matches.
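If it helps to experiment with the idea outside Perl, here is a rough Python translation of the rewritten regex. It is only a sketch: Python's re.VERBOSE stands in for Perl's /x, \Z stands in for \z, and the name partial_ok is made up.

```python
import re

# Hand-rewritten form of foo/bar\.txt$ that also accepts every directory prefix.
partial_ok = re.compile(
    r"""
    \A /                   # anchor at start
    (?: [^/]*/ )*          # as many directories as necessary
    (?: foo/bar\.txt )?    # maybe a trailing foo/bar.txt
    \Z                     # anchor at end
    """,
    re.VERBOSE,
)

paths = ["/a/", "/a/b/", "/a/b/foo/", "/a/b/foo/bar.txt", "/a/b/foo/baz.txt", "/a/bar.txt"]

results = {path: bool(partial_ok.search(path)) for path in paths}

for path, ok in results.items():
    verdict = "matches" if ok else "doesn't match"
    print(f'"{path}" -- {verdict}')
```

Directory paths all match, so a tree walk could keep descending, while a file path only matches when it is the wanted foo/bar.txt.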

find all text before using regex

How can I use regex to find all text before the text "All text before this line will be included"?
I have included some sample text below as an example:
This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.
Starting with an explanation... skip to end for quick answers
To match up to a specific piece of text, and confirm it's there but not include it in the match, you can use a positive lookahead, using the notation (?=regex)
This confirms that 'regex' exists at that position, but matches the start position only, not the contents of it.
So, this gives us the expression:
.*?(?=All text before this line will be included)
Where . is any character, and *? is a lazy match (consumes least amount possible, compared to regular * which consumes most amount possible).
However, in almost all regex flavours . will exclude newline, so we need to explicitly use a flag to include newlines.
The flag to use is s, (which stands for "Single-line mode", although it is also referred to as "DOTALL" mode in some flavours).
And this can be implemented in various ways, including...
Globally, for /-based regexes:
/regex/s
Inline, global for the regex:
(?s)regex
Inline, applies only to bracketed part:
(?s:reg)ex
And as a function argument (depends on which language you're doing the regex with).
So, probably the regex you want is this:
(?s).*?(?=All text before this line will be included)
However, there are some caveats:
Firstly, not all regex flavours support lazy quantifiers - you might have to use just .*, (or potentially use more complex logic depending on precise requirements if "All text before..." can appear multiple times).
Secondly, not all regex flavours support lookaheads, so you will instead need to use captured groups to get the text you want to match.
Finally, you can't always specify flags, such as the s above, so may need to either match "anything or newline" (.|\n) or maybe [\s\S] (whitespace and not whitespace) to get the equivalent matching.
If you're limited by all of these (I think the XML implementation is), then you'll have to do:
([\s\S]*)All text before this line will be included
And then extract the first sub-group from the match result.
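Both variants can be tried quickly in Python. This is just a sketch: the short sample text is made up (not the question's text), and the marker is passed through re.escape as a precaution.

```python
import re

marker = "All text before this line will be included"
text = "First paragraph.\nSecond line.\n" + marker + "\nTrailing text."

# Variant 1: lazy match plus lookahead, with the inline (?s) flag so "." matches newlines.
lookahead = re.search(r"(?s).*?(?=" + re.escape(marker) + ")", text)

# Variant 2: no flags and no lookahead; the wanted text is in capture group 1.
grouped = re.search(r"([\s\S]*)" + re.escape(marker), text)

print(repr(lookahead.group(0)))
print(repr(grouped.group(1)))
```

With a single occurrence of the marker, both print the same two lines of text that precede it.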
(.*?)All text before this line will be included
Depending on what particular regular expression framework you're using, you may need to include a flag to indicate that . can match newline characters as well.
The first (and only) subgroup will include the matched text. How you extract that will again depend on what language and regular expression framework you're using.
If you want to include the "All text before this line..." text, then the entire match is what you want.
This should do it:
<?php
$str = "This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.";
echo preg_filter("/(.*?)All text before this line will be included.*/s","\\1",$str);
?>
Returns:
This can include deleting, updating, or adding records to your database, which would then be reflex.

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files that a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3-letter uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
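A quick way to confirm this, and to get the wanted output with a second step, is sketched below in Python. Your renaming app may behave differently; the variable names are mine.

```python
import re

filename = "Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"

# Second attempt from the question: [^.] forbids dots, so the title cannot be spanned.
no_dots = re.search(r".*x(\d+)\.([^.]*?)\.[A-Z]{3}.*", filename)

# First attempt from the question: matches, but the title keeps its dots.
with_dots = re.search(r".*x(\d+)\.(.*?)\.[A-Z]{3}.*", filename)

# The dots are replaced in a separate step after matching.
new_name = f"{with_dots.group(1)} {with_dots.group(2).replace('.', ' ')}"

print(no_dots)
print(new_name)
```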
Since you are on a Mac, you could use the shell:
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo $filename | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
    newfile=${file#*x}
    newfile=${newfile%.[A-Z][A-Z][A-Z].*}
    # or
    # newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
    mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
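Here is a small Python sketch of both points, using a made-up input string with two candidate positions. Python is just the example tool here; re.fullmatch is Python's own both-ends variant and is not mentioned above.

```python
import re

sample = "ax1.bx2."

# A greedy leading .* pushes the "real" match to the last candidate.
padded = re.search(r".*x(\d+)\.", sample)
# Without the padding, the first candidate wins.
bare = re.search(r"x(\d+)\.", sample)

# search() matches anywhere, match() is anchored at the start,
# and fullmatch() is anchored at both ends.
anywhere = re.search(r"x\d", sample)
at_start = re.match(r"x\d", sample)
whole = re.fullmatch(r"ax\d", sample)

print(padded.group(1), bare.group(1))
print(anywhere is not None, at_start is not None, whole is not None)
```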
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
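A minimal Python sketch of that pattern applied line by line (the sample lines are invented; Python's re is used here as one of the "similar variants"):

```python
import re

# Each "." is only consumed if "Andrea" does not start at that position.
no_andrea = re.compile(r"^(?:(?!Andrea).)*$")

lines = ["Andrea was here", "nobody home", "ask Andrea", "fine"]

kept = [line for line in lines if no_andrea.match(line)]

print(kept)
```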
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile(r'(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character": letters, digits, and the underscore. For ASCII text this is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
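Both routes can be sketched side by side. Note the assumptions: I read "a string of six letters" as a whole six-letter word, so the pattern below adds \b word boundaries and uses [A-Za-z] instead of \w, and it is case-sensitive. The names six_not_andrea, six_letters, and wanted are hypothetical.

```python
import re

# Regex route: a six-letter word that is not exactly "Andrea".
six_not_andrea = re.compile(r"\b(?!Andrea\b)[A-Za-z]{6}\b")

# Program-logic route: find six-letter words first, then filter in code.
six_letters = re.compile(r"\b[A-Za-z]{6}\b")

def wanted(line):
    """True if the line has at least one six-letter word other than 'Andrea'."""
    return any(word != "Andrea" for word in six_letters.findall(line))

lines = ["Andrea", "Robert", "hello Andrea", "Andrew and Andrea"]

by_regex = [line for line in lines if six_not_andrea.search(line)]
by_logic = [line for line in lines if wanted(line)]

print(by_regex)
print(by_logic)
```

The second route is usually easier to read and to change later, at the cost of a little extra code.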
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
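A tiny self-contained version of the one-liner, with a made-up regex and input string so it can be run as is:

```python
import re

regex = r"\d+"            # stand-in for the "very complex" regex
string = "abc123def456"   # stand-in input

# Everything the regex matches is removed; what remains is the unmatched text.
not_matched = re.sub(regex, "", string)

print(not_matched)
```

Note that the unmatched pieces are joined together, so you lose the information about where each piece came from.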
(?! is useful in practice, although strictly speaking lookahead is not part of regular expressions as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
In Perl you can do:
process($line) if $line !~ /Andrea/;