Why is this regex missing quotes?

I'm trying to use sed to increment a version number in a conf file. The version number is of this form:
MENDER_ARTIFACT_NAME = "release-6".
Using the following:
sed -r 's/(.*)(release\-)([0-9]*)(.*)/echo "\1\2$((\3+1))\4"/ge'
The result is this:
MENDER_ARTIFACT_NAME = release-7
I.e. it works, but it drops the quotes. I've checked the regex docs, and (.*) should match any run of non-newline characters, so the first (.*) should match everything before release-6, including the quote, and the second (.*) should match everything after release-6, including the quote. Instead, the quotes are dropped completely. What am I doing wrong?

As per the documentation:
the e flag executes the substitution result as a shell command...
which means the quotation marks are parsed by the shell as quoting syntax and removed before echo ever sees them. To see this, try running echo MENDER_ARTIFACT_NAME = "release-6" yourself. You need to add escaped quotation marks to the echo statement manually:
sed -r 's/^(.*)(release\-)([0-9]+)/echo "\1\\"\2$((\3+1))\\""/ge'

Related

How to locate a mismatched text delimiter

I'm trying to remove double quotes that appear within a string coming from a DB because they cause a stream error in another application. I can't clean up the DB to remove these, so I need to replace the character on the fly.
I've tried using sed, ssed, and perl all without success. This regular expression is locating the problem quotes, but when I plug it into sed to replace them with a single quote my output still contains the double quote.
sed "s/(\?<\!\t|^)\"(\?\!\t|$)/'/g" test.txt
I'm on Mac, if this looks a bit odd.
The regex is valid, but when I test on a tab-delimited file containing this:
"foo" "rea"son" "text's"
My output is identical to the above. Any idea what I'm doing wrong?
Thanks
I assume you want to replace every " that is not on a field boundary (a field boundary being a position either preceded or followed by a tab or the beginning/end of the string) with '.
This can be done using perl and the following substitution:
s/(?<=[^\t])"(?=[^\t\n])/'/g;
(With sed this is not directly possible as it does not support look-behind / look-ahead assertions.)
To use this code on the command line, it needs to be escaped for whatever shell you're using. Assuming bash or a similar sh-like shell:
perl -pe 's/(?<=[^\t])"(?=[^\t\n])/'\''/g' test.txt
Here I use '...' to quote most of the code. To get a single ' into the quoted string, I leave the quoted area ...', add an escaped single quote \', and switch back into a single-quoted string '.... That's why a literal ' turns into '\'' on the command line.

Perl: regular expression: capturing group

In a code file, I want to remove any (one or more) consecutive blank lines (lines that contain only zero or more spaces/tabs and then a newline) that sit between the code and the concluding } of a block. This concluding } may have spaces for indentation before it, so I want to keep them.
Here is what I try to do:
perl -i -0777 -pe 's/\s+\n([ ]*)\}/\n($1)\}/g' file
For example, if my code file looks like (□ is the space character):
□□□□while (true) {\n
□□□□□□□□print("Yay!");□□□□□□\n
□□□□□□□□□□□□□□□□\n
□□□□}\n
Then I want it to become:
□□□□while (true) {\n
□□□□□□□□print("Yay!");\n
□□□□}\n
However it does not do the change I expected. Any idea what I am doing wrong here?
The only issues I can see with your regex are that you don't need the parentheses around $1 in the replacement (they end up in the output as literal characters), and that the character class around the single space is redundant (unless you want to match tabs as well as spaces).
So, you could try
s/\s+\n( *)\}/\n$1\}/g
instead.
This works as expected when run on your test input.
To tidy it up even more, you could try the following.
s/\s+(\n *\})/$1/g
If there might be tabs as well as spaces, you can use a character class. (You do not need to include '|' inside the character class).
s/\s+(\n[ \t]*\})/$1/g
perl -pi -0777 -e's/^\s*\n(?=\s*})//mg' yourfile
(Remove whitespace from the beginning of a line through a newline that precedes a line with } as the first non-whitespace.)
Try using this regex instead, which uses a positive look-ahead assertion. This way you only capture the part that you want to remove, and then replace it with nothing:
s/\s+(?=\n[ ]*\})//g
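You can also try the following one-liner (note that it strips trailing whitespace and blank lines everywhere in the file, not only before a closing }):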
perl -0777 -pe 's/\s*\n*(\s*\n)/$1/g' test

sed is returning more than I need

Every line of the input file will match one of the patterns:
"SCnnnn"
"SC-nnnn"
"SC_nnnn"
( n=[0-9], SC is literal but may be upper or lowercase and will be followed immediately by 1-4 digits delimited at the end by an alphanumeric, space or other non-numeric character)
Somewhere in the line there will also be a file extension (matching ".abc") where abc = upper|lower alphanumeric in any position.
I want to extract the first pattern and print this together with the extracted file extension for each line. This is what I have so far:
sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
Here's a sample input line:
SCSCSCSCSCSCSCSCSC1867SCBrSCSCSCSC&SCBlSCkSCSCBSCrSCbSCckSC.xyz
with required output being:
SC1867.xyz
but what I am getting is:
SCSCSCSCSCSCSCSCSC1867.xyz
Can someone please tell me why this is returning the "SC"s before the part I want? I know it's something to do with greediness, but I can't get my head around it.
(Everything works fine where my "SCnnnn" match is at the beginning of the line.)
I am open to other tools - e.g. awk - if they offer a more straightforward solution.
EDIT: I think I found a solution - at least it appears to work:
sed -E -n 's/.*([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p'
It's actually not necessarily the greediness that is at play here. This happens because sed replaces only the part of the line that the pattern matches and then prints the whole line (the p suffix on your s// command does this).
To more clearly see what's happening, make infile contain a more obvious string like 0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz and run your first command. The following is the result
[user@localhost ~]$ sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
0o0o0o0o0o0o0o0oSC1867.xyz
As a slow-mo: sed finds your [Ss][Cc] characters beginning after the 0o0o0s and dutifully replaces the string you have described with the desired substitution; namely, it maintains the SC_-like part and four digits, then deletes everything after the numbers until the suffix. The problem is seen when the p command prints out the partially-changed line, including all of the unwanted 0oze.
Alternately
As an alternate solution, not involving printing partially changed lines but instead matching an entire line and altering it to your purpose, the following command extracted the correct answer to stdout for a file containing your example string:
[user@localhost ~]$ sed -e 's/^.*\([Ss][Cc][-_]\?[0-9]\{4\}\).*\(\.[a-zA-Z0-9]\{3\}\)$/\1\2/' infile
SC1867.xyz
To break that regex down a bit: the regex begins at the beginning of the line (^), consumes all characters (.*) until it sees an SC (upper or lower, [Ss][Cc]), then it checks for an optional hyphen or underscore ([-_]\?), followed by exactly four digits ([0-9]\{4\}). Then, all characters are consumed until a dot (\.) is seen, followed by exactly three alphanumeric characters ([a-zA-Z0-9]\{3\}) and the end of the line ($). The two groups not consumed by a wildcard are captured and concatenated (\1\2).
... sed -E 's/^.*([Ss][Cc][-_]?[0-9]{4}).*(\.[a-zA-Z0-9]{3})$/\1\2/' infile works too, if you don't enjoy backslashes as much as I do.
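The difference between the two commands can be reproduced side by side. This is a small sketch, assuming a sed that accepts -E (GNU and BSD sed both do); the input is piped in rather than read from infile, and the variable names are only for illustration.

```shell
line='0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz'

# Unanchored: only the matched part is rewritten, so the prefix survives.
partial=$(printf '%s\n' "$line" \
  | sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p')

# Anchored with ^.* and $: the whole line is matched and replaced.
whole=$(printf '%s\n' "$line" \
  | sed -E -n 's/^.*([Ss][Cc][-_]?[0-9]{4}).*(\.[a-zA-Z0-9]{3})$/\1\2/p')

printf '%s\n%s\n' "$partial" "$whole"
# 0o0o0o0o0o0o0o0oSC1867.xyz
# SC1867.xyz
```

Because the leading .* is greedy, the anchored form picks up the last SC-plus-digits occurrence on a line; for lines with a single occurrence, as in the sample, that makes no difference.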

sed: remove all non-alphanumeric characters inside quotations only

Say I have a string like this:
Output:
I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"
I want to only remove non-alphanumeric characters inside the quotations except commas, periods, or spaces:
Desired Output:
I have some-non-alphanumeric % characters remain here, I "also, have some .here"
I have tried the following sed command matching a string and deleting inside the quotes, but it deletes everything that is inside the quotes including the quotes:
sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g'
Any help is appreciated, preferably using sed, to get the desired output. Thanks in advance!
You need to repeat your substitution multiple times to remove all non-alphanumeric characters. Doing such a loop in sed requires a label and use of the b and t commands:
sed '
# If the line contains /characters/, jump to label repremove
/characters/ b repremove
# else, jump to end of script
b
# labels are introduced with colons
:repremove
# This s command says: find a quote mark and some stuff we do not want
# to remove, then some stuff we do want to remove, then the rest until
# a quote mark again. Replace it with the two things we did not want to
# remove
s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
# The t command repeats the loop until we have gotten everything
t repremove
'
(This will work even without the [^"a-zA-Z0-9,. ]*, but it'll be slower on lines that contain many non-alphanumeric characters in a row)
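The same loop can be written compactly with -e fragments. This is a sketch run on the sample line, assuming a POSIX shell; the /characters/ address is dropped for brevity, and the label name again is just an illustrative choice.

```shell
input='I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"'

# :again / t again repeat the substitution until nothing is left to remove.
cleaned=$(printf '%s\n' "$input" | sed -e ':again' \
  -e 's/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/' \
  -e 't again')
printf '%s\n' "$cleaned"
```

Note that the spaces on both sides of the removed & are kept, so two spaces remain between some and .here; text outside the quotes is not touched.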
Though the other answer is right in that doing this in perl is much easier.
sed is not the right tool for this. Here is one way to do it with Perl.
perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g' file
Example:
$ echo 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' | perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g'
I have some-non-alphanumeric % characters remain here, I "also, have some .here"

How to use sed to find and surround all words after an equals sign with quotation marks

For example
Input
<url wiki=https://wiki.archlinux.org/index.php/Main_Page forums=https://bbs.archlinux.org/>
Desired output
<url wiki="https://wiki.archlinux.org/index.php/Main_Page" forums="https://bbs.archlinux.org/">
Here's a sample command to start with, that replaces = with "=". In my simple mind, there would be some way of making it search for the next word after equals, and enclosing that. I'm not sure if that exists, but any help is appreciated.
echo "<url wiki=https://wiki.archlinux.org/index.php/Main_Page forums=https://bbs.archlinux.org/>" | sed 's/=/"&"/'
Your suggested regex (sed 's/=/"&"/') would replace the = with "=", which is not what you need. This seems to work correctly on your sample data:
sed -e 's/=\([^" >][^ >]*\)/="\1"/g'
Replace an equals sign followed by a non-blank, non-quote, non-greater-than and a string of other non-blank, non-greater-than characters with an equals sign, a double quote, the remembered string and another double quote, globally on each line.
sed 's/\(=\)\([^ >]*\)/\1"\2"/g'
Capture the run of non-space characters after the = sign and enclose it in quotation marks.
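Both commands give the desired output on the sample, but they differ once a value is already quoted. Here is a small sketch, assuming a POSIX shell; the second input line is a made-up example, not from the question.

```shell
input='<url wiki=https://wiki.archlinux.org/index.php/Main_Page forums=https://bbs.archlinux.org/>'
first=$(printf '%s\n' "$input" | sed -e 's/=\([^" >][^ >]*\)/="\1"/g')
second=$(printf '%s\n' "$input" | sed 's/\(=\)\([^ >]*\)/\1"\2"/g')

# Hypothetical input where one attribute already has quotes.
mixed='<a x="1" y=2>'
first_mixed=$(printf '%s\n' "$mixed" | sed -e 's/=\([^" >][^ >]*\)/="\1"/g')
second_mixed=$(printf '%s\n' "$mixed" | sed 's/\(=\)\([^ >]*\)/\1"\2"/g')

printf '%s\n%s\n' "$first_mixed" "$second_mixed"
# <a x="1" y="2">
# <a x=""1"" y="2">
```

The first command skips values that already start with a double quote, so it is safe to run more than once; the second wraps them again.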