Awk indicating the first character not to be # - regex

Is there a way to specify the first character not to be something?
There are many ways to limit what it can be but I don't recall a way to say what it can't be.
for example if ! meant not to be
root 4
awk {/[!#][Rr][Oo][Oo][Tt]/{ }}

The symbol for "not" in a bracket expression is the caret (or "circumflex") ^, but it must be the first character inside the brackets in order to have this meaning. The example given in the comments above is [^#], which means one character that is not #. So the regular expression /[^#]/ would match any string that contains at least one character other than #, wherever that character appears. This is not all of what you asked for:
Is there a way to specify the first character not to be something?
One thing that makes regular expressions hard for some people to read is that many symbols have different meanings based on context. The caret ^ is also used to indicate the beginning of a line. With a regex in awk, you can specify that the first character on the line (the first thing after the beginning of the line ^) is not a # with:
awk '/^[^#]/{ ... }'
This would execute the block of code { ... } for every line of input whose first character is not #. Note that this would, however, match a line that starts with another character and then has a # somewhere later in it. /^[^#]/ would also not match an empty line, since there is no character for [^#] to consume. As you can see, there are many nuances and subtleties to consider as you tailor your regex for your needs. For more, look up awk regex, POSIX regex, or just type man -s7 regex in your terminal.
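To make the empty-line nuance concrete, here is a small sketch on made-up input (the four sample lines are an assumption for illustration, not from the question). It compares /^[^#]/ with the negated match !/^#/, which is standard awk but was not part of the answer above:

```shell
# Made-up sample: a comment, an entry, a blank line, an entry with a later "#".
input='# comment
root 4

user 7 # trailing'

# /^[^#]/ needs a first character that is not "#", so the blank line is skipped.
first_not_hash=$(printf '%s\n' "$input" | awk '/^[^#]/ { print }')

# !/^#/ only rejects lines that start with "#", so the blank line is kept.
not_comment=$(printf '%s\n' "$input" | awk '!/^#/ { print }')

printf '%s\n---\n%s\n' "$first_not_hash" "$not_comment"
```

Which of the two you want depends on whether blank lines should reach your { ... } block.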

Related

Can someone break down this regular expression?

While looking for a way to format 'ifconfig' output and display only the network interface names, I found a regular expression that worked like a charm for OS X.
ifconfig -a | sed -E 's/[[:space:]:].*//;/^$/d'
How can I break down this regular expression so I can understand it?
Here is the sed command
s/[[:space:]:].*//;/^$/d
There is a semicolon in the middle, so it's actually two commands:
s/[[:space:]:].*//
/^$/d
First command is a substitution. What to substitute? It's between the 1st 2 slashes.
[[:space:]:].*
A bracket expression [] matching any kind of whitespace or a colon, followed by zero or more (*) of any character (.). This matches everything in a line from the first whitespace or colon onward.
Substitute with what? Between the 2nd two slashes: s/...//: Nothing. The matched strings are deleted from each line.
This leaves the interface names, which start their lines. The other lines remain too, but they are empty, since they start with whitespace.
How to remove these empty lines? That's the second command:
/^$/d
Find empty lines: the regex matches when there is nothing between the start of line ^ and the end of line $. Then delete them with the command d.
All that's left are the interface names.
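The two steps can be tried without a real ifconfig. This is only a sketch: sample_ifconfig is a hypothetical helper that prints abbreviated, made-up output, and real ifconfig output differs from system to system.

```shell
# Hypothetical stand-in for "ifconfig -a": interface lines start at column 0,
# detail lines start with a tab.
sample_ifconfig() {
  printf 'lo0: flags=8049<UP,LOOPBACK,RUNNING> mtu 16384\n'
  printf '\tinet 127.0.0.1 netmask 0xff000000\n'
  printf 'en0: flags=8863<UP,BROADCAST,RUNNING> mtu 1500\n'
  printf '\tether 00:00:00:00:00:00\n'
}

# Step 1 only: cut each line at the first whitespace or colon.
# Detail lines begin with whitespace, so they become empty.
after_s=$(sample_ifconfig | sed -E 's/[[:space:]:].*//')

# Steps 1 and 2: additionally delete the lines that ended up empty.
names=$(sample_ifconfig | sed -E 's/[[:space:]:].*//;/^$/d')

printf '%s\n' "$names"
```

Printing after_s instead of names shows the intermediate state, with an empty line where each detail line used to be.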
This is more a sequence of commands than it is a regular expression, but I suppose breaking the sequence down may be instructive.
Read the manpage on ifconfig to find this
Optionally, the -a flag may be used instead of an interface name. This
flag instructs ifconfig to display information about all interfaces in
the system. The -d flag limits this to interfaces that are down, and
-u limits this to interfaces that are up. When no arguments are given,
-a is implied.
That's one part done. The pipe (|) sends what ifconfig would normally print to the standard output to the standard input of sed instead.
You're passing sed the option -E. Again, man sed is your friend and tells you that this option means
Interpret regular expressions as extended (modern) regular
expressions rather than basic regular expressions (BRE's). The
re_format(7) manual page fully describes both formats.
This isn't all you need though... The first string that you're giving sed lets it know which operation to perform.
Search the same manual for the word "substitute" to reach this
paragraph:
[2addr]s/regular expression/replacement/flags
Substitute the replacement string for the first instance of
the regular expression in the pattern space. Any character other than
backslash or newline can be used instead of a slash to delimit the RE
and the replacement. Within the RE and the replacement, the RE
delimiter itself can be used as a literal character if it is preceded
by a backslash.
Now we can run man 7 re_format to decode the first command s/[[:space:]:].*// which means "for each line passed to standard input, substitute the part matching the extended regular expression [[:space:]:].* with the empty string"
[[:space:]:] = match either a : or any character in the character class [:space:]
.* = match any character (.), zero or more times (*)
To understand the second command look for the [2addr]d part of the sed manual page.
[2addr]d
Delete the pattern space and start the next cycle.
Let's then look at the next command /^$/d which says "for each line passed to standard input, delete it if it corresponds to the extended regex ^$"
^$ = a line that contains no characters between its start (^) and its end ($)
We've discussed how to start with man pages and follow the clues to "decode" commands you see in everyday life.
Thanks Benjamin and Xufox for the resources. After taking a look, this is my conclusion:
s/[[:space:]:].*//;
[[:space:]:] this will search for spaces and/or : and begin the execution of the command, and this and anything that comes afterwards(hence the '.*') will be substituted by nothing (because the next thing is //, which in between should be what we would want to substitute for, which in this case is nothing.).
;
marks the end of the first command
and then we have
/^$/d
where ^$ means search for all empty spaces and d to delete them.
This is half wrong. Take a look at the other answer which gives you the complete and correct response! Thanks guys.

reuse last matched character of regex in sed

Many of you with a certain leaning towards proper formatting will know the pain of having a lot of space characters instead of a tab character in the beginning of indented lines after another person edited a file and added lines. I seem to be unable to teach my colleagues how to use vim's integrated line pasting function, so I'm searching for some simple ways to automatically correct lines beginning with a certain pattern. ;)
I'm using a regex to find the corresponding lines, but I can't work out how to "reuse" the last matched character in sed when using "find and replace". The regex matching the lines is
'^\ *[A-Z]'
I would like to replace those space characters, but keep the uppercase letter. My idea would be something like
sed 's|^\ *[A-Z]|\t$|g'
or so, but I guess that would replace the whole line with a single tab character since $ usually matches the line ending?
Is there a simple way to reuse parts of the matched regex in sed?
How about simply not including the first non-space character in the match in the first place?
This matches all spaces at the beginning of a line:
^ *
Edit (quote from the comments):
obviously I don't want to replace spaces in front of other characters than uppercase letters
A look-ahead could do that, but unfortunately sed does not support them. But you can use the next best thing, an address expression that determines which lines sed operates on:
sed '/^ *[A-Z]/ s|^ *|\t|'
Of course a back-reference would do it as well:
sed 's|^ *\([A-Z]\)|\t\1|'
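A quick way to check the back-reference variant on made-up input is sketched below. Two things here are my own assumptions rather than part of the answer: the tab is produced with printf, since \t in the replacement is not understood by every sed, and the pattern requires at least one leading space (two spaces before the *) so that unindented lines are left alone.

```shell
# A literal tab character, kept in a variable for portability.
tab=$(printf '\t')

# Made-up sample: an indented uppercase line, an indented lowercase line,
# and an unindented uppercase line.
input='    Indented
    lower
Flush'

# \( \) captures the uppercase letter; \1 puts it back after the tab.
result=$(printf '%s\n' "$input" | sed "s|^  *\([A-Z]\)|${tab}\1|")

printf '%s\n' "$result"
```

Only the first line changes: its leading spaces are replaced by one tab and the captured letter is kept.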

Find results with grep and write to file

I would like to get all the results with grep or egrep from a file on my computer.
Just discovered that the regex of finding the string
'+33. ... ... ..' is by the following regex
\+33.[0-9].[0-9].[0-9].[0-9].' Or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file is only giving me as following:
"Binary file Samsung GT-i9400 .xry matches"
..... and no results were given.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to explain what you want the dot to match. I have replaced it with [ ._-] which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what punctuation you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax would change, and you would need to backslash the plus; but in the absence of this option, the backslash is superfluous (and actually wrong in this context in some dialects, including GNU grep, where a backslashed plus selects the extended meaning, which is of course a syntax error at beginning of string, where there is no preceding expression to repeat one or more times; but GNU grep will just silently ignore it, rather than report an error).
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension which enables + to match one or more repetitions of the preceding expression. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' \
'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
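To see the effect of -a and -o together, here is a sketch against a small made-up "binary" file. The file contents and both phone numbers are invented for the test; the real .xry file will of course look different.

```shell
# Build a throwaway file with NUL and control bytes around two made-up numbers.
tmp=$(mktemp)
printf 'junk\000\001+336 12 34 56\000more\002+33123.456.789\000tail\n' > "$tmp"

# -a: treat the binary file as text; -o: print only the matched substrings;
# -E: extended syntax, so + and ? are operators and \+ is a literal plus.
numbers=$(grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' "$tmp")

printf '%s\n' "$numbers"
rm -f "$tmp"
```

Each match is printed on its own line, without the surrounding non-text bytes.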
It means that you're finding a match, but the file you're grepping isn't a text file; it's a binary containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0-9].[0-9].[0-9]' >> resultaten.txt

Select all files that do not have string between 2 other strings

I have a set of files that I need to loop through to find all the files that do not have a specific string between 2 other specific strings. How can I do that?
I tried this but it didn't work:
grep -lri "\(stringA\).*\(?<!stringB\).*\(stringC\)" ./*.sql
EDIT:
the file could have structure as following:
StringA
StringB
StringA
StringC
All I want is to know whether there are any occurrences where stringA and stringC have no stringB in between.
You can use the -L option of grep to print all files which don't match and look for the specific combination of strings:
grep -Lri "\(stringA\).*\(stringB\).*\(stringC\)" ./*.sql
The short answer is along the lines of:
grep "abc[^(?:def)]*ghi" ./testregex
That's based on a testregex file like so:
abcghiabc
abcdefghi
abcghi
The output will be:
$ grep "abc[^(?:def)]*ghi" ./testregex
abcghiabc
abcghi
Mapped to your use-case, I'd wager this translates roughly to:
grep -lri "stringA[^(?:stringB)]*stringC" ./*.sql
Note that I've removed the ".*" between each string, since that will match the very string that you're attempting to exclude.
Update: The original question now calls out line breaks, so use grep's -z flag:
-z
treat input lines as terminated by a NUL byte instead of a newline. That is, newlines become ordinary characters, so grep sees a file that contains no NUL bytes as one big line.
Thus:
grep -lriz "stringA[^(?:stringB)]*stringC" ./*.sql
When I first had to use this approach myself, I wrote up the following explanation...
Specifically: I wanted to match "any character, any number of times,
non-greedy (so defer to subsequent explicit patterns), and NOT
MATCHING THE SEQUENCE />".
The last part is what I'm writing to share: "not matching the sequence
/>". This is the first time I've used character sequences combined
with "any character" logic.
My target string:
<img class="photo" src="http://d3gqasl9vmjfd8.cloudfront.net/49c7a10a-4a45-4530-9564-d058f70b9e5e.png" alt="Iron or Gold" />
My first attempt:
<img.*?class="photo".*?src=".*?".*?/>
This worked in online regex testers, but failed for some reason within
my actual Java code. Through trial and error, I found that replacing
every ".*?" with "[^<>]*?" was successful. That is, instead of
"non-greedy matching of any character", I could use "non-greedy
matching of any character except < or >".
But, I didn't want to use this, since I've seen alt text which
includes these characters. In my particular case, I wanted to use the
character sequence "/>" as the exclusion sequence -- once that
sequence was encountered, stop the "any character" matching.
This brings me to my lesson:
Part 1: Character sequences can be achieved using (?:regex). That is,
use the () parenthesis as normal for a character sequence, but prepend
with "?:" in order to prevent the sequence from being matched as a
target group. Ergo, "(?:/>)" would match "/>", while "(?:/>)*" would
match "/>/>/>/>".
Part 2: Such character sequences can be used in the same manner as
single characters. That is, "[^(?:/>)]*?" will match any character
EXCEPT the sequence "/>", any number of times, non-greedy.
That's pretty much it. The keywords for searching are "non-capturing
groups" and "negative lookahead|lookbehind", and the latter feature
goes much deeper than I've gone so far, with additional flags that I
don't yet grok. But the initial understanding gave me the tool I
needed for my immediate task, and it's a feature that I've wondered
about for awhile -- thus, I figured I'd share the basic introduction
in case any of you were curious about tucking it away in your toolset.
After playing around with the statement provided by the DreadPirateShawn:
stringA[^(?:stringB)]*stringC
I figured out that it is not a truly valid regex. This statement was excluding every character in the given set and not the full string. So I continued digging.
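A small sketch makes the difference visible. The test lines are made up: the last two contain no "def" at all, yet they are still rejected, because the bracket expression excludes each of the single characters ( ? : d e f ) rather than the sequence.

```shell
# Four made-up lines; only the second one actually contains "def".
matched=$(printf '%s\n' 'abcghi' 'abcdefghi' 'abc-d-ghi' 'abc(x)ghi' |
  grep 'abc[^(?:def)]*ghi')

# 'abc-d-ghi' fails on the lone "d" and 'abc(x)ghi' fails on "(".
printf '%s\n' "$matched"
```

Only abcghi survives, so the pattern is stricter than "anything except def".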
After some googling and testing the pattern, I came up with the following statement, that seems to fit my needs:
stringA\s*\t*(?:(?!stringB).)*\s*\t*stringC
This pattern matches any text except the provided string between 2 specified strings. It also takes into consideration whitespace characters.
There is more testing to be done, but it seems that this pattern perfectly fits my requirements
UPDATE: Here is a final version of the statement that seems to work for me:
grep -lrizP "(set feedback on){0,}[ \t]*(?:(?!set feedback off).)*[ \t]*select sysdate from dual" ./*.sql
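For the multi-line layout shown in the question, a line-by-line state machine in awk is one alternative to a single regex. This is a different technique from the lookahead above, not a translation of it, and it is only a sketch under stated assumptions: each marker sits on its own line, matching is case-sensitive, and the two .sql files are made up for the demonstration.

```shell
# Two throwaway files: in the first, the second stringA reaches stringC with
# no stringB in between; in the second, stringB always intervenes.
dir=$(mktemp -d)
printf 'stringA\nstringB\nstringA\nstringC\n' > "$dir/has_gap.sql"
printf 'stringA\nstringB\nstringC\n' > "$dir/no_gap.sql"

hits=$(
  for f in "$dir"/*.sql; do
    # armed is set by stringA and cleared by stringB; a stringC seen while
    # armed means there was no stringB since the last stringA.
    awk '
      /stringA/ { armed = 1; next }
      /stringB/ { armed = 0; next }
      /stringC/ && armed { found = 1; exit }
      END { exit !found }
    ' "$f" && printf '%s\n' "${f##*/}"
  done
)

printf '%s\n' "$hits"
rm -rf "$dir"
```

The loop prints the name of every file that has at least one stringA ... stringC stretch without stringB, which is the "any occurrences" reading of the question.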

How to read this command to remove all blanks at the end of a line

I happened across this page full of super useful and rather cryptic vim tips at http://rayninfo.co.uk/vimtips.html. I've tried a few of these and I understand what is happening enough to be able to parse it correctly in my head so that I can possibly recreate it later. One I'm having a hard time getting my head wrapped around though are the following two commands to remove all spaces from the end of every line
:%s= *$== : delete end of line blanks
:%s= \+$== : Same thing
I'm interpreting %s as string replacement on every line in the file, but after that I am getting lost in what looks like some gnarly variation of :s and regex. I'm used to seeing and using :s/regex/replacement. But the above is super confusing.
What do those above commands mean in english, step by step?
The regex delimiters don't have to be slashes, they can be other characters as well. This is handy if your search or replacement strings contain slashes. In this case I don't know why they use equal signs instead of slashes, but you can pretend that the equals are slashes:
:%s/ *$//
:%s/ \+$//
Does that make sense? The first one searches for zero or more spaces, and the second one searches for one or more spaces. Each one is anchored at the end of the line with $. And then the replacement string is empty, so the spaces are deleted.
I understand your confusion, actually. If you look at :help :s you have to scroll down a few pages before you find this note:
*E146*
Instead of the '/' which surrounds the pattern and replacement string, you
can use any other character, but not an alphanumeric character, '\', '"' or
'|'. This is useful if you want to include a '/' in the search pattern or
replacement string. Example:
:s+/+//+
I do not know vim syntax, but it looks to me like these are sed-style substitution operators. In sed, the / (in s/REGEX/REPLACEMENT/) can be uniformly replaced with any other single character. Here it appears to be =. So if you mentally replace = with /, you'll get
:%s/ *$//
:%s/ \+$//
which should make more sense to you.
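Since the comparison to sed came up, the delimiter swap can be checked there directly. This is a sketch with made-up input and uses sed rather than vim; only the " *$" form is shown, because \+ is not available in every sed.

```shell
# Made-up lines, two of them with trailing spaces.
with_equals=$(printf 'foo   \nbar\nbaz \n' | sed 's= *$==')
with_slashes=$(printf 'foo   \nbar\nbaz \n' | sed 's/ *$//')

# Both spellings are the same command, so both strip the trailing spaces.
printf '%s\n' "$with_equals"
```

The two variables end up identical, which is the whole point: the delimiter is just whatever character follows the s.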