sed: remove all non-alphanumeric characters inside quotations only - regex

Say I have a string like this:
Input:
I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"
I want to only remove non-alphanumeric characters inside the quotations except commas, periods, or spaces:
Desired Output:
I have some-non-alphanumeric % characters remain here, I "also, have some .here"
I have tried the following sed command matching a string and deleting inside the quotes, but it deletes everything that is inside the quotes including the quotes:
sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g'
Any help is appreciated, preferably using sed, to get the desired output. Thanks in advance!

You need to repeat your substitution multiple times to remove all non-alphanumeric characters. Doing such a loop in sed requires a label and use of the b and t commands:
sed '
# If the line contains /characters/, jump to label repremove
/characters/ b repremove
# else, jump to end of script
b
# labels are introduced with colons
:repremove
# This s command says: find a quote mark and some stuff we do not want
# to remove, then some stuff we do want to remove, then the rest until
# a quote mark again. Replace it with the two things we did not want to
# remove
s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
# The t command repeats the loop until we have gotten everything
t repremove
'
(This will work even without the [^"a-zA-Z0-9,. ]*, but it'll be slower on lines that contain many non-alphanumeric characters in a row)
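As a quick check, the same loop can be condensed into a one-liner and run against the sample line from the question. This is only a sketch: the /characters/ address is dropped for brevity, the label is shortened to a, and the variable names input and cleaned are just illustrative.

```shell
# Sample line from the question.
input='I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"'

# :a defines the label, s/// removes one run of unwanted characters
# inside the quotes, and ta jumps back to :a for as long as s/// keeps matching.
cleaned=$(printf '%s\n' "$input" |
  sed -e ':a' \
      -e 's/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/' \
      -e 'ta')

printf '%s\n' "$cleaned"
```

Two spaces remain between some and .here, because only the & between them is removed. Note also that the pattern assumes one quoted string per line; with several, a match can start at a closing quote and reach into text outside the quotes.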
Though the other answer is right in that doing this in perl is much easier.

Sed is not the right tool for this. Here is one way to do it with Perl.
perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g' file
Example:
$ echo 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' | perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g'
I have some-non-alphanumeric % characters remain here, I "also, have some .here"

Related

How to locate a mismatched text delimiter

I'm trying to remove double quotes that appear within a string coming from a DB because they are causing a stream error in another application. I can't clean up the DB to remove these, so I need to replace the character on the fly.
I've tried using sed, ssed, and perl all without success. This regular expression is locating the problem quotes, but when I plug it into sed to replace them with a single quote my output still contains the double quote.
sed "s/(\?<\!\t|^)\"(\?\!\t|$)/'/g" test.txt
I'm on Mac, if this looks a bit odd.
The regex is valid, but when I test on a tab-delimited file containing this:
"foo" "rea"son" "text's"
My output is identical to the above. Any idea what I'm doing wrong?
Thanks
I assume you want to replace every " that is not on a field boundary with ' (a quote on a field boundary being one that is either preceded or followed by a tab or the beginning/end of the string).
This can be done using perl and the following substitution:
s/(?<=[^\t])"(?=[^\t\n])/'/g;
(With sed this is not directly possible as it does not support look-behind / look-ahead assertions.)
To use this code on the command line, it needs to be escaped for whatever shell you're using. Assuming bash or a similar sh-like shell:
perl -pe 's/(?<=[^\t])"(?=[^\t\n])/'\''/g' test.txt
Here I use '...' to quote most of the code. To get a single ' into the quoted string, I leave the quoted area ...', add an escaped single quote \', and switch back into a single-quoted string '.... That's why a literal ' turns into '\'' on the command line.

How to replace spaces after a certain pattern with commas?

I am new to coding and I'm trying to format some bioinformatics data. I am trying to replace all the spaces after GT:GL:GOF:GQ:NR:NV with commas, but not anything outside of the format xx:xx:xx:xx:xx (like the example). I know I need to use sed with a regex, but I'm not very familiar with how to use it. I've never actually used sed before and got confused trying, so any help would be appreciated. Sorry if I formatted this poorly (this is my first post).
EDIT 2: I got actual data from the file this time which may help solve the problem. Removed the bad example.
New Example: I pulled this data from my actual file (this is just two samples), and it is surrounded by other data. Essentially the line has a bunch of data followed by "GT:GL:GOF:GQ:NR:NV ", after this there is more data in the format shown below, and finally there is some more random data. Unfortunately I can't post a full line of the data because it is extremely long and will not fit.
Input
0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0
Output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
With Basic Regular Expressions, you can use character classes and backreferences to accomplish your task, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\)[ ]\([0-9][0-9]*:[0-9][0-9]*\)/\1,\2/g' file
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT BB
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 10:13:12,41:41:1:13,13:131:1:1 AB GT RT
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT
Which basically says:
find and capture any [0-9][0-9]* one or more digits,
separated by a :, and
followed by [0-9][0-9]* one or more digits -- as capture group 1,
match a space following capture group 1 followed by capture group 2 (which is the same as capture group 1),
then replace the space separating the capture groups with a comma reinserting the capture group text using backreference 1 and 2 (e.g. \1 and \2), finally
make the replacement global (e.g. g) to replace all matching occurrences.
Edit Based On New Input Posted
If you still need all of the original commas added, and you now also want a comma between ,0 0/ (where a comma precedes a single digit, followed by the space to be replaced, followed by a single digit and a forward slash), then all you need to do is make your capture groups conditional (on either capturing the original data as above or capturing this new segment). You do that by including an OR (e.g. \| in basic regex terms) between the conditions.
For instance by adding \|,[0-9] at the end of the first capture group and \|[0-9][/] at the end of the second, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\|,[0-9]\)[ ]\([0-9][0-9]*:[0-9][0-9]*\|[0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
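The \| alternation may not be available in every sed's basic-regex mode. As a hedged alternative, the same idea can be sketched with extended regular expressions via -E, which both GNU and BSD sed accept. The variable names sample and joined are only illustrative.

```shell
# Sample data from the question.
sample='0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0'

# With -E, grouping is ( ), alternation is |, and + means "one or more".
joined=$(printf '%s\n' "$sample" |
  sed -E 's/([0-9]+:[0-9]+|,[0-9]) ([0-9]+:[0-9]+|[0-9]\/)/\1,\2/g')

printf '%s\n' "$joined"
```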
If you have other caveats in your file, I suggest you post several complete lines of input, and if they are too long, then create a zip, gzip, bzip or xz file and post it to a site like pastebin and add the link to your question.
If all you really care about now is the space in ,0 0/, then you can shorten the sed command to:
$ sed 's/\(,[0-9]\)[[:space:]]\([0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
(note: I've included [[:space:]] to handle any whitespace (space, tab, ...) instead of just the literal [ ] (space) in the new example)
Let me know if this fixes the issue.
I'm assuming that the xx:xx:xx or xx:xx:xx:xx can have any number of parts, since some have 3, and some have 4.
This is quite difficult to do reliably with sed, as it does not support lookarounds, which seem like they might be needed for this example.
You can try something like:
perl -pe 's/(?<=\d) (?=\d+(:\d+){2,})/,/g' input.txt
If you've got your heart set on sed, you can try this, but it may miss some cases:
sed -r 's/(:[0-9]+) ([0-9]+:)/\1,\2/g' input.txt
Could you please try the following? It also prints the lines that do not match the regex. The regex passed to match could be made a bit shorter with an interval expression (something like [0-9]+\.{4}), but I only had an old awk to test on, so I couldn't try that.
awk '
BEGIN{
  OFS=","
}
match($0,/GT:GL:GOF:GQ:NR:NV [0-9]+:[0-9]+:[0-9]+:[0-9]+:[0-9]+/){
  value=substr($0,RSTART!=1?1:RSTART,RSTART+RLENGTH-1)
  value1=substr($0,RSTART+RLENGTH+1)
  gsub(/[[:space:]]+/,",",value1)
  print value,value1
  next
}
1
' Input_file
You may also achieve your desired result without regex, using awk:
awk '{printf "%s", $1FS$2FS$3FS$4FS$5","$6","$7; for (i=8;i<=NF;i++) printf "%s", FS$i; print ""}' input.txt
Basically, it outputs from field 1 to 5 with the default field separator ("space"), then from field 5 to 7 with the comma separator, then from field 8 onwards with default separator again.
perl myscript.pl '0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0'
myscript.pl,
#!/usr/local/ActivePerl-5.20/bin/env perl
my $input = $ARGV[0];
$input =~ s/ /\,/g;
print $input, "\n";
__DATA__
output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
Note that this replaces every space with a comma, not just the space in question.

Why is this regex missing quotes?

I'm trying to use sed to increment a version number in a conf file. The version number is of this form:
MENDER_ARTIFACT_NAME = "release-6".
Using the following:
sed -r 's/(.*)(release\-)([0-9]*)(.*)/echo "\1\2$((\3+1))\4"/ge'
The result is this:
MENDER_ARTIFACT_NAME = release-7
I.e. it works, but it drops the quotes. I've checked the regex docs, and (.*) should match all non-newline characters, any number of times, so the first should match everything, including the quote, before release-6, and the second should match everything, including the quote, after release-6. Instead, it seems to drop the quotes completely. What am I doing wrong?
As per documentation:
the e flag executes the substitution result as a shell command...
which means the shell treats the quotation marks as quoting syntax and strips them, rather than printing them. For example, try echo MENDER_ARTIFACT_NAME = "release-6". You should add escaped quotation marks to the echo statement manually:
sed -r 's/^(.*)(release\-)([0-9]+)/echo "\1\\"\2$((\3+1))\\""/ge'
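The quote stripping can be seen without sed at all. The sketch below runs the two echo commands through sh -c, roughly the way the e flag hands the substitution result to a shell; the variable names stripped and kept are only illustrative.

```shell
# Unescaped quotes are shell syntax: the shell removes them before echo runs.
stripped=$(sh -c 'echo MENDER_ARTIFACT_NAME = "release-6"')

# Backslash-escaped quotes are passed through to echo as literal characters.
kept=$(sh -c 'echo MENDER_ARTIFACT_NAME = \"release-6\"')

printf '%s\n' "$stripped" "$kept"
```

The first line prints without quotes and the second with them, which is why the corrected sed command writes \\" in its replacement.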

Perl: regular expression: capturing group

In a code file, I want to remove any (one or more) consecutive blank lines (lines that contain only zero or more spaces/tabs and then a newline) that sit between the code text and the concluding } of a block. This concluding } may have spaces for indentation before it, so I want to keep them.
Here is what I try to do:
perl -i -0777 -pe 's/\s+\n([ ]*)\}/\n($1)\}/g' file
For example, if my code file looks like (□ is the space character):
□□□□while (true) {\n
□□□□□□□□print("Yay!");□□□□□□\n
□□□□□□□□□□□□□□□□\n
□□□□}\n
Then I want it to become:
□□□□while (true) {\n
□□□□□□□□print("Yay!");\n
□□□□}\n
However it does not do the change I expected. Any idea what I am doing wrong here?
The only issues I can see with your regex are that you don't need the parentheses around the matching variable, and that the character class used when capturing the match is redundant (unless you want to match tabs as well as spaces).
So, you could try
s/\s+\n( *)\}/\n$1\}/g
instead.
This works as expected when run on your test input.
To tidy it up even more, you could try the following.
s/\s+(\n *\})/$1/g
If there might be tabs as well as spaces, you can use a character class. (You do not need to include '|' inside the character class).
s/\s+(\n[ \t]*\})/$1/g
perl -pi -0777 -e's/^\s*\n(?=\s*})//mg' yourfile
(Remove whitespace from the beginning of a line through a newline that precedes a line with } as the first non-whitespace.)
Try using this regex instead, which uses a positive look-ahead assertion. This way you only capture the part that you want to remove, and then replace it with nothing:
s/\s+(?=\n[ ]*\})//g
You can try the following one liner
perl -0777 -pe 's/\s*\n*(\s*\n)/$1/g' test

How to substitute a string even if it contains regex meta characters using Shell or Perl?

I want to substitute a word that may contain regex meta characters with another word, for example, substitute .Precill for .Precilla123. I tried the solution below:
sed 's/.Precilla123/.Precill/g'
but it will change below line
"Precilla123";"aaaa aaa";"bbb bbb"
to
.Precill";"aaaa aaa";"bbb bbb"
This side effect is not what I wanted. So I tried:
perl -pe 's/\Q.Precilla123\E/.Precill/g'
The above solution stops regex meta characters from being interpreted, so it does not have the side effect.
Unfortunately, if the word contains $ or #, this solution still does not work, because Perl treats $ and # as variable prefixes.
Can anybody help with this? Many thanks.
Please note that the word I want to substitute is NOT hard-coded; it comes from an input file, so you can consider it a variable.
"Unfortunately, if the word contains $ or #, this solution still does not work, because Perl treats $ and # as variable prefixes."
This is not true.
If the value that you want to replace is in a Perl variable, then quotemeta will work on the variable's contents just fine, including the characters $ and #:
echo 'pre$foo to .$foobar' | perl -pe 'my $from = q{.$foo}; s/\Q$from\E/.to/g'
Outputs:
pre$foo to .tobar
If the words that you want to replace are in an external file, then simply load that data in a BEGIN block before composing your regular expressions for replacement.
sed 's/\.Precilla123/.Precill/g'
Escape the meta character with \.
Be careful: the meta characters are not the same on both sides. In the search pattern they are mainly the regex characters []{}()\.*+^$, whereas in the replacement only & and \ are special (plus, in both parts, the separator, which is whichever character follows the s).