regex mixed case excluding specific case - regex

I need a regex able to match:
a) All combinations of lower-/upper-cases of a certain word
b) Except a couple of certain case-combinations.
I must search from bash through thousands of source-code files for occurrences of misspelled variables.
Specifically, the word I'm searching for is FrontEnd which in our coding-style guide can be written exactly in 2 ways depending on the context:
FrontEnd (F and E upper)
frontend (all lower)
So I need to "catch" any occurrences that do not follow our coding standards, such as:
frontEnd
FRONTEND
fRonTenD
I have been reading many tutorials of regex for this specific example and I cannot find a way to say "match this pattern BUT do not match if it is exactly this one or this other one".
I guess it would be similar to trying to match "any number between 000000 and 999999, except exactly the number 555555 or the number 123456"; I suppose the logic is similar (of course I don't know how to do this either :) )
Thanks
Additional comment:
I cannot use grep piped to grep -v because I could miss lines; for example if I do:
grep -i frontend | grep -v FrontEnd | grep -v frontend
would miss a line like this:
if( frontEnd.name == 'hello' || FrontEnd.value == 3 )
because the second occurrence would hide the whole line. Therefore I'm searching for a regex to use with egrep capable of doing the exact match I need.

You won't be able to do this easily with egrep because it doesn't support lookaheads. It's probably easiest to do this with perl.
perl -ne 'print if /(?!frontend|FrontEnd)(?i)frontend/;'
To use it, just pipe the text in through stdin.
How this works:
perl -ne 'print if /(?!frontend|FrontEnd)(?i)frontend/;'
perl                 Perl command, love it or hate it.
-n                   Wrap the code in while (<>) { }, which takes each line from stdin and puts it in $_
-e                   Evaluate perl code from the command line
print                Print the content of $_, which is filled in by the -n flag
if                   Begin the if statement; this expression uses perl's reverse if semantics (expression1 if expression2;)
/                    Begin the regular expression match, returns true if it matches
(?!                  Negative lookahead, ensures that the good stuff won't be matched
frontend|FrontEnd    The pattern that matches the correct versions.
(?i)                 This switch turns on case-insensitive matching for the rest of the regular expression (use (?-i) to turn it off) (perl specific)
frontend             The pattern that matches both the correct and incorrect versions.

This really should be a comment, but is there any reason you cannot use sed? I'm thinking something like
sed 's/frontend/FrontEnd/ig' input.txt
That is, of course, assuming you want to correct the deviant versions...
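If the goal is only to find the deviant spellings rather than rewrite them, a lookahead-free sketch is also possible with plain grep, assuming a grep that supports -o (GNU and BSD grep do). The sample lines and the variable name deviants are made up for illustration:

```shell
# -o prints every match on its own line, so a correct spelling elsewhere on
# the same source line can no longer hide an incorrect one.
# -x makes the second grep compare whole lines, i.e. whole matches.
deviants=$(printf '%s\n' \
  "if( frontEnd.name == 'hello' || FrontEnd.value == 3 )" \
  'x = FRONTEND + frontend;' |
  grep -oi 'frontend' | grep -vx -e 'frontend' -e 'FrontEnd')

printf '%s\n' "$deviants"
```

Adding -n (and -H when searching several files) to the first grep would prefix each match with its location, in which case the second grep has to filter on the text after the last colon instead of using -x.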

Related

How to shorten a regular expression?

First off, I'm relatively new to regular expressions: I've built a regex that I'm using with sed that works fine for me, it looks like:
sed 's/^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9][0-9][0-9] | info | tst.33.12.carmen | !: //g'
but I'm pretty sure all the repetitive character occurrences could be simplified. How would I do this?
I want to replace:
20180630 180212.407107 | info | tst.33.12.carmen | !:
from a line of text (the timestamp at the front could be any numbers; the strings after the first '|' are constant)
EDIT: Since the OP has now posted sample input, I'm adding this solution.
sed -E 's/^[0-9]{8} [0-9]{6}\.[0-9]{6} \| info \| tst\.[0-9]{2}\.[0-9]{2}\.carmen \| \!:$//' Input_file
Testing the code:
Let's say the following is the Input_file:
cat Input_file
20180630 180212.407107 | info | tst.33.12.carmen | !:
fdfjwhfwifrwvf
vwkdnvkwkvwnvwv
20180630 180212.407107 | info | tst.33.12.carmen | !:
dwbvwbvwvbb
Now, after running the above code, the output will be the following:
sed -E 's/^[0-9]{8} [0-9]{6}\.[0-9]{6} \| info \| tst\.[0-9]{2}\.[0-9]{2}\.carmen \| \!:$//' Input_file
fdfjwhfwifrwvf
vwkdnvkwkvwnvwv
dwbvwbvwvbb
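With sed's -E option you could use something like the following, but fair warning: it is adapted from your command and was never tested, since no samples were provided in your post. Note that with -E the literal | and . characters have to be escaped, otherwise | means alternation and . matches any character.
sed -E 's/^[0-9]{8} [0-9]{6}\.[0-9]{6} \| info \| tst\.33\.12\.carmen \| !: //g'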
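A small sketch of the interval form applied to one line. The trailing text "actual message" and the variable names are made up for illustration; the prefix is the one from the question:

```shell
# Hypothetical log line: the prefix from the question plus some payload text.
line='20180630 180212.407107 | info | tst.33.12.carmen | !: actual message'

# {8} and {6} replace the repeated [0-9] groups; | and . are escaped
# because -E switches sed to extended regular expressions.
stripped=$(printf '%s\n' "$line" |
  sed -E 's/^[0-9]{8} [0-9]{6}\.[0-9]{6} \| info \| tst\.33\.12\.carmen \| !: //')

printf '%s\n' "$stripped"
```

Only the payload after the prefix should remain.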
If you don't care about matching the exact format of your prefix, but just want to accept some combination of digits, dots and spaces, you can simplify the first part to:
[ .0-9]*
The complete sed expression then looks like:
sed 's/^[ .0-9]*| info | tst\.[0-9]*\.[0-9]*\.carmen | !:$//' file

awk sed perl, replacing specific pattern within a range of lines

I'm working in verilog and need to edit a specific line within a unique block, but am unsure of how to proceed
file.v
...
block1 block1(
.port1(port1),
.port2(port2),
);
block2 block2(
.(port2)(port2),
.(port3)(port3)
);
....
I need to somehow remove the "," after port2 in block1 without modifying block2. There are also multiple blocks elsewhere that contain port2.
block1 block1(
.port1(port1),
.port2(port2)
);
I've been trying ranges of awk and sed lines, but haven't managed to modify the file successfully. Any suggestions or solutions are much appreciated.
This will remove any comma that occurs just before the end of a block (whitespace then );):
perl -0777 -pe 's/,(?=\s*\);)//g'
Notes:
-0777 causes perl to slurp all the input in as a single string. This is required because we know there are newlines in between, so we don't want to read line-by-line, and there might be empty lines between the comma and the parentheses, so reading by "paragraph" won't work either.
-p causes perl to print the input after modifications.
The regex is the trickiest part. It finds a comma and then looks ahead to match zero or more whitespace characters (includes spaces, tabs, newlines, etc) followed by a close parenthesis and a semicolon. The lookahead text is not part of the matched text (lookaheads are known as "zero width assertions"), so the matched text will be just the comma. If there's a match, the comma is replaced with an empty string.
The g flag says do this globally in the string.
This might do the job for you
sed '/block1 block1/,/);/{s/\((port2)\),/\1/}' file.v
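To illustrate how the address range keeps the edit inside block1, here is a sketch on a made-up two-block input. It assumes GNU sed; other seds may want a ";" or newline before the closing brace:

```shell
# Two blocks that both contain ".port2(port2),"; only block1 should change.
edited=$(printf '%s\n' \
  'block1 block1(' \
  '.port1(port1),' \
  '.port2(port2),' \
  ');' \
  'block2 block2(' \
  '.port2(port2),' \
  '.port3(port3)' \
  ');' |
  sed '/block1 block1/,/);/{s/\((port2)\),/\1/}')

printf '%s\n' "$edited"
```

The range starts at the line containing "block1 block1" and ends at the first following line containing ");", so the identical port2 line in block2 is left alone.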
how about:
awk -v RS="" '/block1/{sub("port2),","port2)")}7' file
I guess you want to remove commas that follow a closing parenthesis, ")", and are themselves followed by a newline and then a closing parenthesis and a semicolon, ");"?
In this case this might work for you:
sed -r ':a;N;s/\),\n\s*\);/)\n);/;P;D;ba'
:a             branch label "a"
N              read next line into pattern space (append)
s              substitute
\),\n\s*\);    search pattern
)\n);          replace pattern
P              print up to first newline of pattern space
D              delete up to first newline of pattern space
ba             branch to label "a"

Regex for lines containing an odd number of pipe characters

I'm cleaning up a LaTeX file, and I'm in a situation where I need to distinguish absolute value |x| from the set "such that" symbol i.e. {x | x < 0}.
The first step for me is to find all lines containing an odd number of | characters (i.e. the pipe symbol).
In principle, I know how to do this, but I've tried the following regex command with no luck.
egrep '^[^\|]*\|([^\|]*\|[^\|]*\|)*[^\|]*$'
The idea is that a matching line contains, in order:
The line start
0 or more non-pipe characters
Exactly one pipe character
0 or more copies of text containing exactly 2 pipes
The line end
However, for some reason this isn't working.
I run the command on the following file:
\[
S = \{ x | x < 0}
y = |x|
\]
and none of the lines match.
I suspect I'm making a silly mistake somewhere, possibly to do with escaping the pipe characters,
but I'm stumped as to what's wrong.
Can anybody tell me either how to fix this, or provide an alternate expression which matches lines containing an odd number of pipe characters?
Inside the [], | is not a special character so should not be escaped by \. Try:
egrep '^[^|]*\|([^|]*\|[^|]*\|)*[^|]*$'
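A quick sketch of the corrected pattern on the sample file from the question. It uses grep -E, which is equivalent to egrep; the variable name odd_lines is an assumption:

```shell
# The four lines of the LaTeX sample from the question.
odd_lines=$(printf '%s\n' \
  '\[' \
  'S = \{ x | x < 0}' \
  'y = |x|' \
  '\]' |
  grep -E '^[^|]*\|([^|]*\|[^|]*\|)*[^|]*$')

# Only the set-builder line has an odd number of pipes.
printf '%s\n' "$odd_lines"
```

With the original [^\|] the bracket expression also excluded backslashes, which is why a line containing \{ could never match.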
Better to use awk for this purpose:
awk -F '|' '!(NF%2)'
TESTING:
echo "a|bc|d|erg" | awk -F '|' '!(NF%2)'
OUTPUT:
a|bc|d|erg
echo "abc|d|ergxy" | awk -F '|' '!(NF%2)'
OUTPUT:
how about:
awk -F'|' 'NF&&(NF-1)%2' file
example:
kent$ cat file
|foo|bar
| | | | |
||||||
|||||||
kent$ awk -F'|' 'NF&&(NF-1)%2' file
| | | | |
|||||||
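The field-count trick relies on NF being one more than the number of separators on a non-empty line. A sketch on the LaTeX sample from the question, with an extra empty line added for illustration:

```shell
# NF is 0 for the empty line, so the NF && guard skips it; for every other
# line NF-1 is the number of pipe characters.
odd_lines=$(printf '%s\n' \
  '\[' \
  'S = \{ x | x < 0}' \
  'y = |x|' \
  '' \
  '\]' |
  awk -F'|' 'NF && (NF-1) % 2')

printf '%s\n' "$odd_lines"
```

Without the NF && guard, as in the shorter '!(NF%2)' form, the empty line would also be printed because 0 is even.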
Perl, which is cross platform (Windows too) and generally installed everywhere these days, is my axe of choice:
perl -ne 'print if (s/\|/\|/g) %2 == 1' file
script.sed
#!/bin/sed -nf
# Save to hold
h
# Delete all non | chars
s#[^|]##g
# Odd match
/^\(||\)*|$/ {
# Fetch hold
g
s#^#odd\t:#
}
# Even match
/^\(||\)\+$/ {
# Fetch hold
g
s#^#even\t:#
}
# No match
/^$/ {
# Fetch hold
g
s#^#none\t:#
}
# Print
p
data.txt
do|odd
do|odd|match|me
|even match|me
do|even match|me
do|even match|also|me|please
no-match
shell
sed -nf script.sed data.txt
stdout
odd :do|odd
odd :do|odd|match|me
even :|even match|me
even :do|even match|me
even :do|even match|also|me|please
none :
none :no-match

Regex: replacing a string with prefix capture except for a given prefix

I want to replace a string, keeping the prefix, except when it contains a specific prefix.
For instance, any string like "(*)-bar" must be replaced with "(*)-blah" except when "(*)" matches "baz":
foo-bar => should return foo-blah
baz-bar => should remain baz-bar
The best I have so far trims the last letter of the prefix when replacing:
echo "foo-bar" | sed s/"[^(baz)]-bar"/$1-blah/
Use negative lookbehind:
s/(?<!baz)-bar/-blah/
Most sed implementations don't have this advanced regexp feature, but it should work in more modern languages, such as perl.
With sed :
$ echo "foo-bar" | sed '/^foo-baz/!s/^foo-.*$/foo-blah/'
foo-blah
$ echo "foo-baz" | sed '/^foo-baz/!s/^foo-.*$/foo-blah/'
foo-baz
If I decompose:
echo "foo-baz" | sed '/^foo-baz/!s/^foo-.*$/foo-blah/'
/^foo-baz/              regex
!                       negation of regex
s/^foo-.*$/foo-blah/    substitution part
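The same address-negation idea can be sketched for the inputs from the question, where baz is the prefix to exclude. This is an adaptation, not the command above, and it works per line: a line that contains baz-bar anywhere is skipped entirely.

```shell
# Skip lines containing "baz-bar"; on all other lines replace -bar with -blah.
result=$(printf '%s\n' 'foo-bar' 'baz-bar' |
  sed '/baz-bar/!s/-bar/-blah/')

printf '%s\n' "$result"
```

If a single line can hold both an excluded and a non-excluded prefix, this guard is too coarse and a lookbehind-capable tool is the safer choice.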

Extract multiple occurrences on the same line using sed/regex

I am trying to loop through each line in a file and find and extract the text that starts with ${ and ends with }. So as the final output I am expecting only SOLDIR and TEMP (from inputfile.sh).
I have tried using the following script, but it seems to match and extract only the second occurrence of the pattern, TEMP. I also tried adding g at the end, but it doesn't help. Could anybody please let me know how to match and extract both/multiple occurrences on the same line?
inputfile.sh:
.
.
SOLPORT=\`grep -A 4 '\[LocalDB\]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`
.
.
script.sh:
infile='inputfile.sh'
while read line ; do
echo $line | sed 's%.*${\([^}]*\)}.*%\1%g'
done < "$infile"
May I propose a grep solution?
grep -oP '(?<=\${).*?(?=})'
It uses Perl-style lookaround assertions and lazily matches anything between '${' and '}'.
Feeding your line to it, I get
$ echo "SOLPORT=\`grep -A 4 '[LocalDB]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`" | grep -oP '(?<=\${).*?(?=})'
SOLDIR
TEMP
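Where grep has no -P support, a rough equivalent can be sketched with grep -o followed by sed to strip the delimiters. The sample line is made up for illustration; only the two variable names come from the question:

```shell
# Hypothetical line with two ${...} references on it.
line='cp ${SOLDIR}/solidhac.ini ${TEMP}/backup'

# grep -o prints each ${...} on its own line; sed then removes the leading
# "${" and the trailing "}".
names=$(printf '%s\n' "$line" |
  grep -o '\${[^}]*}' |
  sed 's/^\${//; s/}$//')

printf '%s\n' "$names"
```

The bracket expression [^}]* plays the role of the lazy .*? here, since it cannot run past the first closing brace.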
This might work for you (but maybe only for your specific input line):
sed 's/[^$]*\(${[^}]\+}\)[^$]*/\1\t/g;s/$[^{$]\+//g'
Extracting multiple matches from a single line using sed isn't as bad as I thought it'd be, but it's still fairly esoteric and difficult to read:
$ echo 'Hello ${var1}, how is your ${var2}' | sed -En '
# Replace ${PREFIX}${TARGET}${SUFFIX} with ${PREFIX}\a${TARGET}\n${SUFFIX}
s#\$\{([^}]+)\}#\a\1\n#
# Continue to next line if no matches.
/\n/!b
# Remove the prefix.
s#.*\a##
# Print up to the first newline.
P
# Delete up to the first newline and reprocess what's left of the line.
D
'
var1
var2
And all on one line:
sed -En 's#\$\{([^}]+)\}#\a\1\n#;/\n/!b;s#.*\a##;P;D'
Since POSIX extended regexes don't support non-greedy quantifiers or putting a newline escape in a bracket expression I've used a BEL character (\a) as a sentinel at the end of the prefix instead of a newline. A newline could be used, but then the second substitution would have to be the questionable s#.*\n(.*\n.*)##, which might involve a pathological amount of backtracking by the regex engine.