Select all files that do not have string between 2 other strings - regex

I have a set of files that I need to loop through to find all the files that do not have a specific string between two other specific strings. How can I do that?
I tried this but it didn't work:
grep -lri "\(stringA\).*\(?<!stringB\).*\(stringC\)" ./*.sql
EDIT:
the file could have structure as following:
StringA
StringB
StringA
StringC
All I want is to know whether there are any occurrences where stringA and stringC have no stringB in between.

You can use the -L option of grep to print all files which don't match and look for the specific combination of strings:
grep -Lri "\(stringA\).*\(stringB\).*\(stringC\)" ./*.sql

The short answer is along the lines of:
grep "abc[^(?:def)]*ghi" ./testregex
That's based on a testregex file like so:
abcghiabc
abcdefghi
abcghi
The output will be:
$ grep "abc[^(?:def)]*ghi" ./testregex
abcghiabc
abcghi
Mapped to your use-case, I'd wager this translates roughly to:
grep -lri "stringA[^(?:stringB)]*stringC" ./*.sql
Note that I've removed the ".*" between each string, since that will match the very string that you're attempting to exclude.
Update: The original question now calls out line breaks, so use grep's -z flag:
-z
Suppress the newline at the end of each line, substituting a null character for it. That is, grep knows where each line ends, but sees the input as one big line.
Thus:
grep -lriz "stringA[^(?:stringB)]*stringC" ./*.sql
When I first had to use this approach myself, I wrote up the following explanation...
Specifically: I wanted to match "any character, any number of times,
non-greedy (so defer to subsequent explicit patterns), and NOT
MATCHING THE SEQUENCE />".
The last part is what I'm writing to share: "not matching the sequence
/>". This is the first time I've used character sequences combined
with "any character" logic.
My target string:
<img class="photo" src="http://d3gqasl9vmjfd8.cloudfront.net/49c7a10a-4a45-4530-9564-d058f70b9e5e.png" alt="Iron or Gold" />
My first attempt:
<img.*?class="photo".*?src=".*?".*?/>
This worked in online regex testers, but failed for some reason within
my actual Java code. Through trial and error, I found that replacing
every ".*?" with "[^<>]*?" was successful. That is, instead of
"non-greedy matching of any character", I could use "non-greedy
matching of any character except < or >".
But, I didn't want to use this, since I've seen alt text which
includes these characters. In my particular case, I wanted to use the
character sequence "/>" as the exclusion sequence -- once that
sequence was encountered, stop the "any character" matching.
This brings me to my lesson:
Part 1: Character sequences can be achieved using (?:regex). That is,
use the () parenthesis as normal for a character sequence, but prepend
with "?:" in order to prevent the sequence from being matched as a
target group. Ergo, "(?:/>)" would match "/>", while "(?:/>)*" would
match "/>/>/>/>".
Part 2: Such character sequences can be used in the same manner as
single characters. That is, "[^(?:/>)]*?" will match any character
EXCEPT the sequence "/>", any number of times, non-greedy.
That's pretty much it. The keywords for searching are "non-capturing
groups" and "negative lookahead|lookbehind", and the latter feature
goes much deeper than I've gone so far, with additional flags that I
don't yet grok. But the initial understanding gave me the tool I
needed for my immediate task, and it's a feature that I've wondered
about for awhile -- thus, I figured I'd share the basic introduction
in case any of you were curious about tucking it away in your toolset.

After playing around with the statement provided by DreadPirateShawn:
stringA[^(?:stringB)]*stringC
I figured out that it does not do what was intended. The bracket expression excludes each individual character in the given set, not the full string. So I continued digging.
After some googling and testing the pattern, I came up with the following statement, that seems to fit my needs:
stringA\s*\t*(?:(?!stringB).)*\s*\t*stringC
This pattern matches any text except the provided string between 2 specified strings. It also takes into consideration whitespace characters.
There is more testing to be done, but it seems that this pattern perfectly fits my requirements
UPDATE: Here is a final version of the statement that seems to work for me:
grep -lriz "(set feedback on){0,}[ \t]*(?:(?!set feedback off).)*[ \t]*select sysdate from dual" ./*.sql
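To make the difference between the two patterns above concrete, here is a minimal sketch in Python (chosen only because its re module supports lookahead; with grep, lookahead needs a PCRE-capable mode such as GNU grep's -P). The sample texts are hypothetical, apart from the structure taken from the question, and re.S is used so that . also matches newlines.

```python
import re

# Character-class attempt: [^(?:stringB)] is a set of single characters, so it
# rejects each of ( ? : s t r i n g B ) individually, not the word "stringB".
char_class = re.compile(r"stringA[^(?:stringB)]*stringC", re.S)

# Tempered version: consume one character at a time, but only at positions
# where "stringB" does not start.
tempered = re.compile(r"stringA(?:(?!stringB).)*?stringC", re.S)

no_b = "stringA\nselect 1\nstringC"            # hypothetical content, no stringB
only_with_b = "stringA\nstringB\nstringC"      # every A..C pair has a B inside
mixed = "stringA\nstringB\nstringA\nstringC"   # structure from the question

for name, text in [("no_b", no_b), ("only_with_b", only_with_b), ("mixed", mixed)]:
    print(name, bool(char_class.search(text)), bool(tempered.search(text)))
```

The character-class pattern misses no_b simply because the text in between contains the letter s, while the lookahead pattern reports it. For mixed, the lookahead pattern matches from the second stringA, which is the "stringA and stringC with no stringB in between" case the question asks about.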

Related

How do I regex search in x and y for a, and only include the replacement of y if a was found in x?

I need to search through a larger text file.
This is an example of what I'm searching through.
https://pastebin.com/JFVy2TEt
recipes.addShaped("basemetals:adamantine_arrow", <basemetals:adamantine_arrow> * 4, [[<ore:nuggetAdamantine>], [<basemetals:adamantine_rod>], [<minecraft:feather>]]);
I need to look for lines that match a specific part in the first argument.
For example the "_arrow" part in the above line.
And erase everything that doesn't match on the "_arrow" in the first argument.
And the arguments differ across all of them.
And also with different names in the place where "basemetals:adamantine" is in the above line.
And since the remaining arguments are all different, I can't wrap my head around how to include the rest of the line only when the first argument matches.
Edit: The end goal is to sort my 3k+ line text file more easily.
basic, blacksmith, carpenter, chef, chemist, engineer, farmer, jeweler, mage, mason, scribe, tailor
I think what you're trying to do is filter your text file by removing lines that don't fit a set criteria. I've chosen the Atom text editor for this solution (because I'm running Windows OS and can't install gedit, and I want to ensure you have a working example).
To remove only lines that don't have a first argument ending in _arrow, one could do (?!recipes\.addShaped\("[^"]+_arrow")recipes.+\r?\n? and replace with nothing.
As a note: this task is made more difficult by Atom's limited regex support. In a better-supported environment, my answer would probably be ^recipes\.addShaped\("[^"]+(?<!_arrow)".+\r?\n? (with multiline mode).
Also, please read "What should I do when someone answers my question?".
Regex explained:
(?! ) is a negative lookahead, which peeks at the succeeding text to ensure it doesn't contain "_arrow" at end of the first argument.
\. is an escaped literal period
[^"] is a character class that signifies a character that is not a ".
+ is a quantifier which tells the regex to match the preceding character or subexpression as many times as possible, with a minimum of one time.
. is a wildcard, representing any character
\r?\n? is used to match any kind of newline, with the ? quantifier making each character optional.
Everything else is literal characters, which match exactly themselves.
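Outside an editor, the same filtering can be sketched by keeping the lines that match instead of deleting the ones that don't. This is only an illustration in Python: the function name keep_arrow_recipes is made up, and the second sample line is hypothetical (only the first comes from the question).

```python
import re

# The first argument must be a quoted string ending in "_arrow".
ARROW_FIRST_ARG = re.compile(r'^recipes\.addShaped\("[^"]+_arrow"')

def keep_arrow_recipes(text):
    """Return only the lines whose first argument ends in _arrow."""
    kept = [line for line in text.splitlines() if ARROW_FIRST_ARG.match(line)]
    return "\n".join(kept)

sample = "\n".join([
    'recipes.addShaped("basemetals:adamantine_arrow", <basemetals:adamantine_arrow> * 4, '
    '[[<ore:nuggetAdamantine>], [<basemetals:adamantine_rod>], [<minecraft:feather>]]);',
    # Hypothetical non-arrow recipe, added for illustration only.
    'recipes.addShaped("basemetals:adamantine_axe", <basemetals:adamantine_axe>, '
    '[[<ore:ingotAdamantine>]]);',
])

print(keep_arrow_recipes(sample))
```

Anchoring on the opening quote and using [^"]+ keeps the check inside the first argument, so an _arrow that appears in a later argument does not count.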

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It matches from the second line of my example to the end of the quoting environment; I guess a greedy quantifier would match up to the last \end{quoting} in the file. Is it possible to do this with search and replace, or should I write a macro for it?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% prior to end{quoting}, up to the end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\@!\_.\)\{-} instead of \_.\{-}; this means that the matched text cannot contain a further \\}% sequence, thus achieving your aims (as far as I can determine them).
This uses a negative zero-width look-ahead pattern \@! to ensure that each "any character" match cannot begin the specific text we want to avoid (but other than that, anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\@!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
You’re on the right track using the non-greedy multi. The Vim help files
state that,
"\{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a\{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.
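The "starts earlier is preferred" rule quoted above is not specific to Vim, and the look-ahead idea from the other answer can be tried out in any engine that supports it. Below is a rough sketch in Python rather than Vim script, with the question's sample as input; the variable names are illustrative, and re.S lets . cross line breaks in the way \_. does in Vim.

```python
import re

# The leftmost match wins even with a lazy quantifier.
print(re.search(r"a*?b", "xaaab").group())

text = r"""\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
"""

# Naive lazy version: starts at the FIRST \\}% and runs to \end{quoting}.
naive = re.compile(r"\\\\\}%(.*?)\\end\{quoting\}", re.S)

# Tempered version: the gap may not contain another \\}%, so the match can
# only start at the last \\}% before \end{quoting}.
tempered = re.compile(r"\\\\\}%((?:(?!\\\\\}%).)*?)\\end\{quoting\}", re.S)

def drop_break(match):
    # Rebuild the text without the two backslashes before }%.
    return "}%" + match.group(1) + r"\end{quoting}"

naive_result = naive.sub(drop_break, text)
fixed = tempered.sub(drop_break, text)
print(fixed)
```

With the naive pattern, the \\ is removed after "some more text"; with the tempered one, it is removed after "this is my last line" and the earlier lines are left alone.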

Find results with grep and write to file

I would like to get all the results with grep or egrep from a file on my computer.
I just worked out that strings of the form
'+33. ... ... ..' should be found by the following regex:
\+33.[0-9].[0-9].[0-9].[0-9].' Or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file is only giving me as following:
"Binary file Samsung GT-i9400 .xry matches"
..... and no results were given.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to explain what you want the dot to match. I have replaced it with [ ._-] which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what punctuation you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax would change, and you would need to backslash the plus; but in the absence of this option, the backslash is superfluous (and actually wrong in this context in some dialects, including GNU grep, where a backslashed plus selects the extended meaning, which is of course a syntax error at beginning of string, where there is no preceding expression to repeat one or more times; but GNU grep will just silently ignore it, rather than report an error).
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension which enables + to match one or more repetitions of the previous character. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' \
'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
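For comparison, the stricter template can also be applied to raw bytes without grep. The following is a small sketch in Python, assuming the [ ._-] separator class described above; the sample bytes and the function name extract_numbers are invented for illustration and are not taken from any real .xry file.

```python
import re

# Bytes pattern: +33, one digit, two groups of three digits and one group of
# two, each group preceded by exactly one separator from [ ._-].
PHONE = re.compile(rb"\+33[0-9](?:[ ._-][0-9]{3}){2}[ ._-][0-9]{2}")

def extract_numbers(data):
    """Return every matching substring, like grep -a -o would print them."""
    return [m.group().decode("ascii") for m in PHONE.finditer(data)]

# Hypothetical binary content with two well-formed numbers and one fragment.
sample = b"\x00\x12junk+336 123 456 78\x00\xff+331.234.567.89\x07tail+331234"

for number in extract_numbers(sample):
    print(number)
```

Searching the bytes directly plays the role of -a, and returning only m.group() plays the role of -o. To read a real file, the data would come from open(path, "rb").read().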
It means that grep found a match, but the file you're grepping isn't a text file; it's a binary file containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0-9].[0-9].[0-9]' >> resultaten.txt

Regular Expression Using the Dot-Matches-All Mode

Normally the . doesn't match newline unless I specify the engine to do so with the (?s) flag. I tried this regexp on my editor's (UltraEdit v14.10) regexp engine using Perl style regexp mode:
(?s).*i
The search text contains multiple lines and each line contains many 'i' characters.
I expect the above regexp means: search as many characters (because with the '?s' the . now matches anything including newline) as possible (because of the greediness for *) until reaching the character 'i'.
This should mean "from the first character to the last 'i' in the last sentence" (greediness should reach the last sentence, right?).
But with UltraEdit's test, it turns out to be "from the first character to the last 'i' in the first sentence that contains an i". Is this result correct? Did I make any wrong interpretation of my reg expression?
e.g. given this text
aaa
bbb
aiaiaiaiaa
bbbicicid
it is
aaa
bbb
aiaiaiai
matched. But I expect:
aaa
bbb
aiaiaiaiaa
bbbicici
Your regex is correct, and so are your expectations of its performance.
This is a long-known bug in UltraEdit's regex implementation which I have written repeatedly to support about. As far as I know, it still hasn't been fixed. The problem appears to lie in the fact that UE's regex implementation is essentially line-based, and additional lines are taken into the match only if necessary. So .* will match greedily on the current line, but it will not cross a newline boundary if it doesn't have to in order to achieve a match.
There are some other subtle bugs with line endings. For example, lookbehind doesn't work across newlines, either.
Write to IDM support, or change to an editor with decent regex support. I did both.
Yes, you are right; this looks like a bug.
Your interpretation is correct if you are in Perl mode rather than POSIX mode.
However, it should apply to POSIX as well.
Defining the modifier inline like you do is rare, though.
Usually you provide the pattern between delimiters with the modifier afterwards, like /.*i/s
But this doesn't matter, because your way is correct too. And if it weren't supported, it wouldn't match the first newline either.
So yes, this is definitely a bug in your program.
You're right that the regex should match across all 4 lines, up to the last 'i'. My guess is that UltraEdit is attempting to do some sort of optimization by working line by line, and only accumulating new lines "when necessary".
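The expected behaviour is easy to check in another engine. Here is a quick sketch in Python, using the sample text from the question; it shows what Python's re module does and says nothing about UltraEdit's internals.

```python
import re

text = "aaa\nbbb\naiaiaiaiaa\nbbbicicid"

# With (?s), "." also matches newlines, so the greedy .* runs to the end of
# the text and then backtracks to the last "i" overall.
dotall = re.search(r"(?s).*i", text).group()

# Without (?s), "." stops at each newline, so the match stays within the
# first line that contains an "i".
per_line = re.search(r".*i", text).group()

print(repr(dotall))
print(repr(per_line))
```

The first result ends at the last i of the fourth line, which is the match the question expected.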

Explain this Regular Expression please

Regular Expressions are a complete void for me.
I'm dealing with one right now in TextMate that does what I want it to do...but I don't know WHY it does what I want it to do.
/[[:alpha:]]+|( )/(?1::$0)/g
This is used in a TextMate snippet and what it does is takes a Label and outputs it as an id name. So if I type "First Name" in the first spot, this outputs "FirstName".
Previously it looked like this:
/[[:alpha:]]+|( )/(?1:_:/L$0)/g (it might have been \L instead)
This would turn "First Name" into "first_name".
So I get that the underscore adds an underscore for a space, and that the /L lowercases everything...but I can't figure out what the rest of it does or why.
Someone care to explain it piece by piece?
EDIT
Here is the actual snippet in question:
<column header="$1"><xmod:field name="${2:${1/[[:alpha:]]+|( )/(?1::$0)/g}}"/></column>
This regular expression (regex) format is basically:
/matchthis/replacewiththis/settings
The "g" setting at the end means do a global replace: every match in the input is replaced, rather than only the first one.
Breaking it down further...
[[:alpha:]]+|( )
That matches a run of one or more alphabetic characters (the whole match is held in $0), or alternatively a single space (captured in group $1).
(?1::$0)
As Roger says, the ? indicates this part is a conditional. If group $1 matched, the match is replaced with the stuff between the colons :: - in this case nothing. If nothing is in $1, the match is replaced with the contents of $0, i.e. any run of alphabetic characters is output unchanged.
This explains why the spaces are removed in the first example, and the spaces get replaced with underscores in your second example.
In the second expression the \L is used to lowercase the text.
The extra question in the comment was how to run this expression outside of TextMate. Using vi as an example, I would break it into multiple steps:
:0,$s/ //g
:0,$s/\u/\L\0/g
The first part of the above commands tells vi to run a substitution starting on line 0 and ending at the end of the file (that's what $ means).
The rest of the expression uses the same sorts of rules as explained above, although some of the notation in vi is a bit custom - see this reference webpage.
I find RegexBuddy a good tool for me in dealing with regexs. I pasted your 1st regex in to Buddy and I got the explanation shown in the bottom frame:
I use it for helping to understand existing regexes, building my own, testing regexes against strings, etc. I've become better at regexes because of it. FYI I'm running it under Wine on Ubuntu.
It searches for one or more consecutive alphabetic characters [[:alpha:]]+ or a single space ( ).
/[[:alpha:]]+|( )/(?1::$0)/g
The (?1 is a conditional and used to strip the match if group 1 (a single space) was matched, or replace the match with $0 if group 1 wasn't matched. As $0 is the entire match, it gets replaced with itself in that case. This regex is the same as:
/ //g
I.e. remove all spaces.
/[[:alpha:]]+|( )/(?1:_:\L$0)/g
This regex still uses the same condition, except that now, if group 1 was matched, it's replaced with an underscore; otherwise the full match ($0) is used, modified by \L. \L lowercases all text that comes after it, so \LABC would result in abc; think of it as a special control code.
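The conditional replacement can be imitated outside TextMate with a replacement function. What follows is a minimal sketch in Python, not TextMate's own mechanism: [A-Za-z] stands in for [[:alpha:]] (which Python's re module does not support), and the function names to_id and to_snake are made up for illustration.

```python
import re

# Same alternation as the snippet: a run of letters, or a captured space.
LABEL = re.compile(r"[A-Za-z]+|( )")

def to_id(label):
    # Like (?1::$0): if group 1 (the space) matched, emit nothing;
    # otherwise emit the whole match unchanged.
    return LABEL.sub(lambda m: "" if m.group(1) else m.group(0), label)

def to_snake(label):
    # Like (?1:_:\L$0): a space becomes "_", anything else is lowercased.
    return LABEL.sub(lambda m: "_" if m.group(1) else m.group(0).lower(), label)

print(to_id("First Name"))
print(to_snake("First Name"))
```

Checking m.group(1) plays the role of the ?1 test: it is None whenever the letters branch matched, and a single space otherwise.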