RegEx, Substituting a variable number of replacements

Hopefully I'm missing something obvious.
I've got a file that contains some lines like:
| A | B | C |
|-----------|
Ignore this line
| And | Ignore | This |
| D | E | F | G |
|---------------|
I want to find the |----| lines, remove those... and replace all of the | characters with a ^ in the preceding line. e.g.
^ A ^ B ^ C ^
Ignore this line
| And | Ignore | This |
^ D ^ E ^ F ^ G ^
So far I've got:
perl -0pe 's/^(\|.*\|)\n\|-+\|/$1/mg'
This takes input from stdin (some other modifications have already happened with sed)... and it's using -0 and /m to support multiline replacements.
The match seems to be correct, and it removes the |----| lines, but I can't see how I can do the | to ^ substitution with the $1 (or \1) backreference.
I can't remember where I did it before, but another language allowed me to use ${1/A/B} to substitute A to B, but that's upsetting perl.
And I've been wondering if this is where /e or /ee could be used, but I'm not familiar enough with perl on how to do that.

You can use
perl -0pe 's{^(.*)\R\|-+\|$\R?}{$1 =~ s,\|,^,gr}gme' t
Details:
^(.*)\R\|-+\|$\R? - matches all occurrences (see the g flag at the end)
^ - start of a line (note the m flag that makes ^ match start of a line and $ match end of a line)
(.*) - Group 1: whole line
\R - a line break sequence
\| - | char
-+ - one or more - chars
\| - a | char
$ - end of line
\R? - an optional line break sequence.
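Once a match is found, every | in the Group 1 value is replaced with ^ by $1 =~ s,\|,^,gr. The r flag makes the inner substitution return the modified copy instead of changing $1 in place, and the e flag on the outer substitution makes the replacement part be evaluated as Perl code.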
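For readers more at home in Python than Perl, the same idea (compute the replacement from the captured row instead of using a fixed string) can be sketched with a callback passed to re.sub. This is only an illustration, not the answer's own code: the name convert_header_rows is made up for the example, and this pattern deliberately leaves the line break after the separator line in place.

```python
import re

# Hypothetical helper for this sketch: a row followed by a |----| separator
# becomes a ^-delimited row, and the separator line is dropped.
SEPARATOR_AFTER_ROW = re.compile(r"^(.*)\n\|-+\|$", re.MULTILINE)

def convert_header_rows(text: str) -> str:
    # The callback plays the role of Perl's /e flag: the replacement is
    # computed from the captured row (group 1) at match time.
    return SEPARATOR_AFTER_ROW.sub(lambda m: m.group(1).replace("|", "^"), text)

sample = (
    "| A | B | C |\n"
    "|-----------|\n"
    "Ignore this line\n"
    "| And | Ignore | This |\n"
    "| D | E | F | G |\n"
    "|---------------|\n"
)
print(convert_header_rows(sample), end="")
```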

I could see this being done using 2 substitutions:
\|(?=.*[\r\n]+\|-+\|$)
https://regex101.com/r/x7d15d/1/
And then:
^\|-+\|(?:[\r\n]+|$)
https://regex101.com/r/ZdEzuM/1/
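A minimal sketch of how the two patterns above could be chained, written in Python purely for illustration (the function name two_pass is an assumption of this example; the patterns are the two shown above, used with multiline matching):

```python
import re

def two_pass(text: str) -> str:
    # Pass 1: a | becomes ^ only when the following line is a |----| separator.
    marked = re.sub(r"\|(?=.*[\r\n]+\|-+\|$)", "^", text, flags=re.MULTILINE)
    # Pass 2: delete each separator line together with its line break.
    return re.sub(r"^\|-+\|(?:[\r\n]+|$)", "", marked, flags=re.MULTILINE)

table = "| A | B | C |\n|-----------|\nIgnore this line\n| And | Ignore | This |\n"
print(two_pass(table), end="")
```

The order matters: the first pass needs the separator lines to still be present, because they are what the lookahead checks for.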

With one pattern that checks the next line in a lookahead assertion:
perl -0pe 's/\|(?=.*\R\|-+\|$)(?:\R.*)?/^/gm' file
If you absolutely want to use an evaluation, you can put a transliteration in the replacement part with this pattern:
perl -0pe 's#^(.*)\R\|-+\|$#$1=~y/|/^/r#gme' file

Regex to check the exact number of occurrences of multiple characters

I wish to write a regex that matches a string only when it contains an exact number of occurrences of '3', '2', and '1'.
For instance, having a string "(((3x2x2+1)x2x2+1)x2+1)", I wish to have a regex expression to match exactly one occurrence of '3', five occurrences of '2', and three occurrences of '1'. If there would be more or less '3's or '2's or '1's, the regex shouldn't match.
I have a solution that uses positive and negative lookaheads to do it in one regex. However, I must warn that it can become very messy very quickly. For this problem I would advise simply counting the occurrences of each digit in your string without regexes.
That being said, let's consider the following regex, which requires exactly two occurrences of the digit 1:
^(?=.*(?:1[^1]*){2})(?!.*(?:1[^1]*){3}).*$
^ and .*$ are there to say we want to match the whole string
?: just means we don't want to capture the group
?= is a positive lookahead, which means that our match must satisfy the condition .*(?:1[^1]*){2}
?! is a negative lookahead, which means that our match must NOT satisfy the condition .*(?:1[^1]*){3}
To summarize, you want to ALWAYS match the whole string if the positive lookahead condition is respected (digit 1 is present 2 times) and the negative lookahead condition is not (digit 1 present 3 times)
So in the example above 1*5*1 is matched, 11 is matched but 1*1*1 is not
So now, let's say you want your string to have exactly 3 '1', 1 '2', and 2 '3',
it will look like this
^(?=.*(?:1[^1]*){3})(?!.*(1[^1]*){4})(?=.*(?:2[^2]*){1})(?!.*(2[^2]*){2})(?=.*(?:3[^3]*){2})(?!.*(3[^3]*){3}).*$
It will match (1*2*1*3*1*3) but not (1*2*1*3*1*3*1) or (1*2*1*3*1)
Note that it also matches (1*2*1*3*1*37), because the 3 in 37 counts as an occurrence of the digit 3.
Then again, I would advise against using this solution if you have many digits to check, as you need to write a positive and a negative lookahead for each of them.
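The regex-free counting suggested above might look like the following sketch. It is written in Python only as an example, and has_exact_counts is a name invented here, not something from the question.

```python
from collections import Counter

def has_exact_counts(text: str, required: dict) -> bool:
    # Count every character once, then compare only the characters we care about.
    # A Counter returns 0 for characters that never occur.
    counts = Counter(text)
    return all(counts[ch] == wanted for ch, wanted in required.items())

wanted = {"3": 1, "2": 5, "1": 3}
print(has_exact_counts("(((3x2x2+1)x2x2+1)x2+1)", wanted))
```

Adding another digit to check is then a one-entry change to the dictionary rather than another pair of lookaheads.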
I'd do it in 3 steps, as follows:
Mac_3.2.57$echo "(((3x2x2+1)x2x2+1)x2+1)" | egrep '^([^3]*3[^3]*){1}$' | egrep '^([^2]*2[^2]*){5}$' | egrep '^([^1]*1[^1]*){3}$'
(((3x2x2+1)x2x2+1)x2+1)
Mac_3.2.57$echo "(((3x2x2+1)x2x2+1)x2+0)" | egrep '^([^3]*3[^3]*){1}$' | egrep '^([^2]*2[^2]*){5}$' | egrep '^([^1]*1[^1]*){3}$'
Mac_3.2.57$echo "(((3x2x2+1)x2x2+1)x2+1+1)" | egrep '^([^3]*3[^3]*){1}$' | egrep '^([^2]*2[^2]*){5}$' | egrep '^([^1]*1[^1]*){3}$'
Mac_3.2.57$echo "(((3x2x2+1)x2x2+1)x0+1)" | egrep '^([^3]*3[^3]*){1}$' | egrep '^([^2]*2[^2]*){5}$' | egrep '^([^1]*1[^1]*){3}$'
Mac_3.2.57$echo "(((3x2x2+1)x2x2+1)x2+2+1)" | egrep '^([^3]*3[^3]*){1}$' | egrep '^([^2]*2[^2]*){5}$' | egrep '^([^1]*1[^1]*){3}$'
Mac_3.2.57$echo "(((0x2x2+1)x2x2+1)x2+1)" | egrep '^([^3]*3[^3]*){1}$' | egrep '^([^2]*2[^2]*){5}$' | egrep '^([^1]*1[^1]*){3}$'
Mac_3.2.57$echo "(((3X3x2x2+1)x2x2+1)x2+1)" | egrep '^([^3]*3[^3]*){1}$' | egrep '^([^2]*2[^2]*){5}$' | egrep '^([^1]*1[^1]*){3}$'
Mac_3.2.57$echo "(((2x2+1)x2x2+1)x2+1+3)" | egrep '^([^3]*3[^3]*){1}$' | egrep '^([^2]*2[^2]*){5}$' | egrep '^([^1]*1[^1]*){3}$'
(((2x2+1)x2x2+1)x2+1+3)
Mac_3.2.57$
You can check for each one separately and then combine the results.
\A[^3]*(3[^3]*){1}\z
\A[^2]*(2[^2]*){5}\z
\A[^1]*(1[^1]*){3}\z
Depending on your regular expressions engine, you may need to use ^ and $ instead of \A and \z.
I initially thought that it may be possible to combine them, but that would still match if there is one '3' and one '2', for example. You'll probably need code similar to this:
# true only if at least one of the digits occurs and every digit that occurs has the exact count
match = input.match?(/[123]/)
match = match && input.match?(/\A[^3]*(3[^3]*){1}\z/) if input.include?('3')
match = match && input.match?(/\A[^2]*(2[^2]*){5}\z/) if input.include?('2')
match = match && input.match?(/\A[^1]*(1[^1]*){3}\z/) if input.include?('1')
Note that the code above assumes that it's ok to have some but not all of these characters as long as the existing ones match the requirements.
It just needs a series of lookaheads to verify an exact number of specific characters.
This is a short version.
^(?=[^3]*3[^3]*$)(?=[^2]*(?:2[^2]*){5}$)(?=[^1]*(?:1[^1]*){3}$).+
https://regex101.com/r/2CJ2U6/1
^
(?= [^3]* 3 [^3]* $ ) # 1 of three
(?= # 5 of two
[^2]*
(?: 2 [^2]* ){5}
$
)
(?= # 3 of one
[^1]*
(?: 1 [^1]* ){3}
$
)
.+
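As a rough illustration of using the commented form directly, here is the same pattern compiled in Python with re.VERBOSE, which ignores the layout whitespace and the # comments. The name EXACT_COUNTS is an assumption of this sketch.

```python
import re

EXACT_COUNTS = re.compile(
    r"""
    ^
    (?= [^3]* 3 [^3]* $ )             # exactly one 3
    (?= [^2]* (?: 2 [^2]* ){5} $ )    # exactly five 2s
    (?= [^1]* (?: 1 [^1]* ){3} $ )    # exactly three 1s
    .+
    """,
    re.VERBOSE,
)

def has_required_digits(text: str) -> bool:
    # Each lookahead scans the whole string from position 0, so the three
    # conditions are checked independently of one another.
    return EXACT_COUNTS.match(text) is not None

print(has_required_digits("(((3x2x2+1)x2x2+1)x2+1)"))
```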

Regex for matching all queries that end with W (no space) (uppercase), and another regex that matches all queries that end with one digit (no space)

I want it to match
https://www.google.com/search?q=abcW&oq=abcW&ie=UTF-8
The other regex should match
https://www.google.com/search?q=abc4&oq=abc4&ie=UTF-8
Mac_3.2.57$echo "https://www.google.com/search?q=abcW&oq=abcW&ie=UTF-8" | grep 'search?q=.*W&'
https://www.google.com/search?q=abcW&oq=abcW&ie=UTF-8
Mac_3.2.57$echo "https://www.google.com/search?q=abc4&oq=abc4&ie=UTF-8" | grep 'search?q=.*[0-9]&'
https://www.google.com/search?q=abc4&oq=abc4&ie=UTF-8
Mac_3.2.57$
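The grep patterns above accept a W (or a digit) in front of any later &, not only at the end of the q value. If the check has to apply to the q parameter specifically, one option is to parse the URL first and test only that value. The sketch below is in Python and makes two assumptions: that "ends with" refers to the last character of the decoded q value, and that the helper names are free to choose (they are invented here).

```python
import re
from urllib.parse import parse_qs, urlsplit

def query_value(url: str) -> str:
    # Hypothetical helper: decoded value of the q parameter, or "" if it is absent.
    return parse_qs(urlsplit(url).query).get("q", [""])[0]

def ends_with_upper_w(url: str) -> bool:
    return re.search(r"W\Z", query_value(url)) is not None

def ends_with_digit(url: str) -> bool:
    return re.search(r"[0-9]\Z", query_value(url)) is not None

print(ends_with_upper_w("https://www.google.com/search?q=abcW&oq=abcW&ie=UTF-8"))
```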

PowerShell -replace with multiple occurrences next to each other in the line

I have a | delimited file and I have some data where for null values it has a space. So, in my data file I'll have something like this:
2080| | | | | | | | | | | | | |2000225
I tried this:
-replace '\| \|', '||'
but it matches pairs of | and still leaves the space when it's done between |. I'm just not really good with regex and totally new to Powershell.
2080|| || || ....|2000225
I'm not sure if recursion would solve this or if I'm going to need to write a short Java program to do it.
You can use the regex-based -replace operator as follows:
PS> ' |2080| | | | | | | | | | | | | |2000225| ' -replace ' (\||$)', '$1'
|2080||||||||||||||2000225|
This assumes that no non-empty fields have trailing spaces - if they do, their (last) trailing space will be removed; to avoid this, use the appropriate solution from Wiktor Stribiżew's helpful answer.
Regex ' (\||$)' matches a single space char. followed by either a literal | (escaped as \|) or (|) the end of the string ($); $1 in the replacement string then refers to whatever the 1st capture group ((...)) matched; that is, if the space char. was followed by literal |, it is effectively replaced with just |; if it was followed by the end of the string, it is effectively removed.
A slight simplification is to use a positive lookahead assertion ((?=...)), as also used in Wiktor's answer, so that the match consists of the space character only; this allows omission of the substitution-text -replace operand, which defaults to the empty string and therefore effectively removes the spaces:
PS> ' |2080| | | | | | | | | | | | | |2000225| ' -replace ' (?=\||$)'
|2080||||||||||||||2000225|
Using -replace with a regex based search, you may....
Remove all whitespace between two | chars:
$text -replace '(?<=\|)\s+(?=\|)'
To only remove spaces in between | and start/end of string
$text -replace '(?<=\||^)\s+(?=\||$)'
$text -replace '(?<![^|])\s+(?![^|])'
Remove all whitespace characters that are either followed with | or end of string
$text -replace '\s+(?=\||$)'
$text -replace '\s+(?![^|])'
Output: 2080||||||||||||||2000225.
Details
\s+ - 1 or more whitespace characters
(?=\||$) - a positive lookahead that requires a | char (\|) or (|) end of string ($) immediately to the right of the current location.
(?![^|]) - a negative lookahead that fails the match if there is a char other than | immediately to the right of the current location.
You don't need a recursive function for this; just run the replacement twice. The problem is that matches cannot overlap: once | | has matched, the search resumes after the second |, so that | cannot also serve as the start of the next match. In | | |, for example, the first pass matches the leading | | and leaves the trailing space and | alone, so every other space in a run of adjacent empty fields survives the first pass. The second pass then matches exactly the occurrences that the first pass skipped.
Just do:
PS> '2080| | | | | | | | | | | | | |2000225' -replace '\| \|', '||' -replace '\| \|', '||'
2080||||||||||||||2000225
You won't need more.
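The non-overlapping behaviour is not specific to PowerShell. As a small demonstration in Python (used here only for illustration, with variable names made up for the example), a consuming pattern needs two passes while a lookaround pattern needs one:

```python
import re

# 14 pipes separated by single spaces, as in the sample row from the question.
row = "2080" + " ".join("|" * 14) + "2000225"

# The consuming pattern eats both pipes, so adjacent candidates cannot overlap.
one_pass = re.sub(r"\| \|", "||", row)
two_passes = re.sub(r"\| \|", "||", one_pass)

# Lookarounds only inspect the pipes without consuming them: one pass is enough.
lookaround = re.sub(r"(?<=\|)\s+(?=\|)", "", row)

print(one_pass)
print(two_passes)
print(lookaround)
```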

awk sed perl, replacing specific pattern within a range of lines

I'm working in verilog and need to edit a specific line within a unique block, but am unsure of how to proceed
file.v
...
block1 block1(
.port1(port1),
.port2(port2),
);
block2 block2(
.(port2)(port2),
.(port3)(port3)
);
....
I need to somehow remove the "," after port2 in block1 without modifying block2. There are also multiple blocks elsewhere that contain port2.
block1 block1(
.port1(port1),
.port2(port2)
);
I've been trying ranges of awk and sed lines, but haven't managed to modify the file successfully. Any suggestions or solutions are much appreciated.
This will remove any comma that occurs just before the end of a block (whitespace then );):
perl -0777 -pe 's/,(?=\s*\);)//g'
Notes:
-0777 causes perl to slurp all the input in as a single string. This is required because
we know there's newlines in between so we don't want to read line-by-line
there might be empty lines between the comma and the parentheses so reading by "paragraph" won't work either.
-p causes perl to print the input after modifications.
the regex is the trickiest part
it finds a comma and then looks ahead to match zero or more whitespace characters (includes spaces, tabs, newlines, etc) followed by a close parenthesis and a semicolon.
the lookahead text is not part of the matched text (lookaheads are known as "zero width assertions") -- the matched text will be just the comma
if there's a match, replace the comma with an empty string.
the g flag says do this globally in the string.
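Note that the one-liner above strips the trailing comma in every block. If only block1 may be touched, the same lookahead can be applied inside the text of that one instance. The following is a sketch in Python rather than Perl; it assumes the instance starts with the literal text "block1 block1(" and ends at the first ");", and the names BLOCK1, TRAILING_COMMA and fix_block1 are invented for the example.

```python
import re

# Assumption: the instance begins with "block1 block1(" and ends at the first ");".
BLOCK1 = re.compile(r"block1 block1\(.*?\);", re.DOTALL)
# Same idea as the Perl regex: a comma followed only by whitespace and ");".
TRAILING_COMMA = re.compile(r",(?=\s*\);)")

def fix_block1(source: str) -> str:
    # Apply the comma removal to the matched block text only; everything
    # outside the block1 instance is copied through unchanged.
    return BLOCK1.sub(lambda m: TRAILING_COMMA.sub("", m.group(0)), source)

verilog = (
    "block1 block1(\n.port1(port1),\n.port2(port2),\n);\n"
    "block2 block2(\n.port2(port2),\n.port3(port3),\n);\n"
)
print(fix_block1(verilog), end="")
```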
This might do the job for you
sed '/block1 block1/,/);/{s/\((port2)\),/\1/}' file.v
how about:
awk -v RS="" '/block1/{sub("port2),","port2)")}7' file
I guess you want to remove commas located after a closing paren ()) followed by a newline and a closing paren and a semicolon ();)?
In this case this might work for you:
sed -r ':a;N;s/\),\n\s*\);/)\n);/;P;D;ba'
        |  | | |---------| |---|  | | |
        |  | | |           |      | | -- branch to label "a"
        |  | | |           |      | -- delete up to first newline of pattern space
        |  | | |           |      -- print up to first newline of pattern space
        |  | | |           -- replace pattern
        |  | | -- search pattern
        |  | -- substitute
        |  -- read next line into pattern space (append)
        -- branch label "a"

regex mixed case excluding specific case

I need a regex able to match:
a) All combinations of lower-/upper-cases of a certain word
b) Except a couple of certain case-combinations.
I must search in bash, through thousands of source-code files, for occurrences of misspelled variables.
Specifically, the word I'm searching for is FrontEnd which in our coding-style guide can be written exactly in 2 ways depending on the context:
FrontEnd (F and E upper)
frontend (all lower)
So I need to "catch" any occurrences that do not follow our coding standards, such as:
frontEnd
FRONTEND
fRonTenD
I have been reading many tutorials of regex for this specific example and I cannot find a way to say "match this pattern BUT do not match if it is exactly this one or this other one".
I guess it would be similar to trying to match "any number between 000000 to 999999, except exactly the number 555555 or the number 123456", I suppose the logic is similar (of course I don't know how to do this either :) )
Thnx
Additional comment:
I cannot use grep piped to grep -v because I could miss lines; for example if I do:
grep -i frontend | grep -v FrontEnd | grep -v frontend
would miss a line like this:
if( frontEnd.name == 'hello' || FrontEnd.value == 3 )
because the second occurrence would hide the whole line. Therefore I'm searching for a regex to use with egrep capable of doing the exact match I need.
You won't be able to do this easily with egrep because it doesn't support lookaheads. It's probably easiest to do this with perl.
perl -ne 'print if /(?!frontend|FrontEnd)(?i)frontend/;'
To use it, just pipe the text in through stdin.
How this works:
perl -ne 'print if /(?!frontend|FrontEnd)(?i)frontend/;'
^     ^^  ^     ^  ^ ^ ^                  ^  ^ The pattern that matches both the correct and incorrect versions.
|     ||  |     |  | | |                  | This switch turns on case insensitive matching for the rest of the regular expression (use (?-i) to turn it off) (perl specific)
|     ||  |     |  | | | The pattern that match the correct versions.
|     ||  |     |  | | Negative forward look ahead, ensures that the good stuff won't be matched
|     ||  |     |  | Begin regular expression match, returns true if match
|     ||  |     | Begin if statement, this expression uses perl's reverse if semantics (expression1 if expression2;)
|     ||  | Print content of $_, which is piped in by -n flag
|     || Evaluate perl code from command line
|     | Wrap code in while (<>) { } takes each line from stdin and puts it in $_
| Perl command, love it or hate it.
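The same lookahead trick carries over to other engines. Here is a sketch in Python (3.6 or later), where the case-insensitive part is written as a scoped group (?i:...) because recent Python versions reject a bare (?i) in the middle of a pattern. BAD_SPELLING and bad_spellings are names made up for this example.

```python
import re

# The lookahead is case-sensitive and vetoes only the two approved spellings;
# (?i:...) limits case-insensitive matching to the final "frontend".
BAD_SPELLING = re.compile(r"(?!frontend|FrontEnd)(?i:frontend)")

def bad_spellings(line: str) -> list:
    # Return every non-conforming spelling found on the line, in order.
    return BAD_SPELLING.findall(line)

print(bad_spellings("if( frontEnd.name == 'hello' || FrontEnd.value == 3 )"))
```

Because each occurrence is tested on its own, a correct spelling elsewhere on the line does not hide an incorrect one, which was the problem with the grep -v pipeline.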
This really should be a comment, but is there any reason you cannot use sed? I'm thinking something like
sed 's/frontend/FrontEnd/ig' input.txt
That is, of course, assuming you want to correct the deviant versions...