How to locate a mismatched text delimiter - regex

I'm trying to remove double quotes that appear within a string coming from a DB because they cause a stream error in another application. I can't clean up the DB to remove these, so I need to replace the character on the fly.
I've tried using sed, ssed, and perl all without success. This regular expression is locating the problem quotes, but when I plug it into sed to replace them with a single quote my output still contains the double quote.
sed "s/(\?<\!\t|^)\"(\?\!\t|$)/'/g" test.txt
I'm on Mac, if this looks a bit odd.
The regex is valid, but when I test on a tab-delimited file containing this:
"foo" "rea"son" "text's"
My output is identical to the above. Any idea what I'm doing wrong?
Thanks

I assume you want to replace every " that is not on a field boundary with '. A quote is on a field boundary when it is directly preceded or followed by a tab or by the beginning/end of the line.
This can be done using perl and the following substitution:
s/(?<=[^\t])"(?=[^\t\n])/'/g;
(With sed this is not directly possible as it does not support look-behind / look-ahead assertions.)
To use this code on the command line, it needs to be escaped for whatever shell you're using. Assuming bash or a similar sh-like shell:
perl -pe 's/(?<=[^\t])"(?=[^\t\n])/'\''/g' test.txt
Here I use '...' to quote most of the code. To get a single ' into the quoted string, I leave the quoted area ...', add an escaped single quote \', and switch back into a single-quoted string '.... That's why a literal ' turns into '\'' on the command line.
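For anyone who wants to check the lookaround logic outside the shell, here is a minimal sketch of the same substitution in Python. The names INTERIOR_QUOTE and fix_interior_quotes are made up for this illustration; only the regex itself comes from the answer above.

```python
import re

# Same pattern as the perl substitution: a double quote with a non-tab
# character before it and a non-tab, non-newline character after it.
# (INTERIOR_QUOTE and fix_interior_quotes are hypothetical names.)
INTERIOR_QUOTE = re.compile(r'(?<=[^\t])"(?=[^\t\n])')


def fix_interior_quotes(line: str) -> str:
    """Replace double quotes inside a tab-delimited field with single quotes."""
    return INTERIOR_QUOTE.sub("'", line)


if __name__ == "__main__":
    sample = '"foo"\t"rea"son"\t"text\'s"\n'
    print(fix_interior_quotes(sample), end="")
```

Quotes at the start or end of a line are left alone because the lookbehind and lookahead each require an actual character to be present.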

Related

Why is this regex missing quotes?

I'm trying to use sed to increment a version number in a conf file. The version number is of this form:
MENDER_ARTIFACT_NAME = "release-6".
Using the following:
sed -r 's/(.*)(release\-)([0-9]*)(.*)/echo "\1\2$((\3+1))\4"/ge'
The result is this:
MENDER_ARTIFACT_NAME = release-7
I.e. it works, but it drops the quotes. I've checked the regex docs, and (.*) should match any non-newline character, any number of times. So the first group should match everything before release-6, including the quote, and the last group should match everything after release-6, including the quote. Instead, the quotes disappear completely. What am I doing wrong?
As per documentation:
the e flag executes the substitution result as a shell command...
which means the shell treats the quotation marks as syntax for grouping characters and strips them from the output. For example, try echo MENDER_ARTIFACT_NAME = "release-6". You need to add escaped quotation marks to the echo statement yourself:
sed -r 's/^(.*)(release\-)([0-9]+)/echo "\1\\"\2$((\3+1))\\""/ge'
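If passing the line through a shell is not a requirement, the quoting problem can be avoided entirely. The following is a rough sketch in Python, not the sed approach from the answer; VERSION and bump_release are hypothetical names, and it assumes the version always has the form release-N.

```python
import re

# Assumption: the version looks like "release-<digits>".
# VERSION and bump_release are illustrative names only.
VERSION = re.compile(r'(release-)([0-9]+)')


def bump_release(line: str) -> str:
    """Increment the number after 'release-' and leave the rest untouched."""
    return VERSION.sub(
        lambda m: m.group(1) + str(int(m.group(2)) + 1),
        line,
    )


if __name__ == "__main__":
    print(bump_release('MENDER_ARTIFACT_NAME = "release-6"'))
```

Because nothing is handed to a shell, the surrounding quotes are never interpreted and stay in the line as they are.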

Perl: regular expression: capturing group

In a code file, I want to remove any run of one or more consecutive blank lines (lines that contain only zero or more spaces/tabs and then a newline) that sits between code text and the closing } of a block. The closing } may be preceded by spaces for indentation, and I want to keep those.
Here is what I try to do:
perl -i -0777 -pe 's/\s+\n([ ]*)\}/\n($1)\}/g' file
For example, if my code file looks like (□ is the space character):
□□□□while (true) {\n
□□□□□□□□print("Yay!");□□□□□□\n
□□□□□□□□□□□□□□□□\n
□□□□}\n
Then I want it to become:
□□□□while (true) {\n
□□□□□□□□print("Yay!");\n
□□□□}\n
However it does not do the change I expected. Any idea what I am doing wrong here?
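The only issues I can see with your regex are that you don't need the parentheses around the matching variable in the replacement (there they are inserted literally, so ($1) adds a pair of parentheses to the output), and that the character class around the space in the capture group is redundant (unless you want to match tabs as well as spaces).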
So, you could try
s/\s+\n( *)\}/\n$1\}/g
instead.
This works as expected when run on your test input.
To tidy it up even more, you could try the following.
s/\s+(\n *\})/$1/g
If there might be tabs as well as spaces, you can use a character class. (You do not need to include '|' inside the character class).
s/\s+(\n[ \t]*\})/$1/g
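To sanity-check the last pattern against the sample from the question, here is a small Python sketch of the same substitution. TRAILING_BLANKS and strip_blank_lines_before_brace are names invented for this example.

```python
import re

# Same idea as s/\s+(\n[ \t]*\})/$1/g: swallow all whitespace before the
# final newline-plus-indentation that leads to a closing brace, and keep
# only that last newline and the indentation.
# (TRAILING_BLANKS and strip_blank_lines_before_brace are hypothetical names.)
TRAILING_BLANKS = re.compile(r'\s+(\n[ \t]*\})')


def strip_blank_lines_before_brace(source: str) -> str:
    """Drop blank lines (and trailing spaces) right before a closing brace."""
    return TRAILING_BLANKS.sub(r'\1', source)


if __name__ == "__main__":
    code = (
        '    while (true) {\n'
        '        print("Yay!");      \n'
        '                \n'
        '    }\n'
    )
    print(strip_blank_lines_before_brace(code), end="")
```

Note that the pattern also removes trailing spaces on the last code line, which matches the desired output in the question.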
perl -pi -0777 -e's/^\s*\n(?=\s*})//mg' yourfile
(Remove whitespace from the beginning of a line through a newline that precedes a line with } as the first non-whitespace.)
Try using this regex instead, which uses a positive look-ahead assertion. This way you only capture the part that you want to remove, and then replace it with nothing:
s/\s+(?=\n[ ]*\})//g
You can try the following one liner
perl -0777 -pe 's/\s*\n*(\s*\n)/$1/g' test

sed: remove all non-alphanumeric characters inside quotations only

Say I have a string like this:
Output:
I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"
I want to only remove non-alphanumeric characters inside the quotations except commas, periods, or spaces:
Desired Output:
I have some-non-alphanumeric % characters remain here, I "also, have some .here"
I have tried the following sed command matching a string and deleting inside the quotes, but it deletes everything that is inside the quotes including the quotes:
sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g'
Any help is appreciated, preferably using sed, to get the desired output. Thanks in advance!
You need to repeat your substitution multiple times to remove all non-alphanumeric characters. Doing such a loop in sed requires a label and use of the b and t commands:
sed '
# If the line contains /characters/, jump to label repremove
/characters/ b repremove
# else, jump to end of script
b
# labels are introduced with colons
:repremove
# This s command says: find a quote mark and some stuff we do not want
# to remove, then some stuff we do want to remove, then the rest until
# a quote mark again. Replace it with the two things we did not want to
# remove
s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
# The t command repeats the loop until we have gotten everything
t repremove
'
(This will work even without the [^"a-zA-Z0-9,. ]*, but it'll be slower on lines that contain many non-alphanumeric characters in a row)
Though the other answer is right in that doing this in perl is much easier.
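As a comparison with the sed loop, here is a hedged sketch of the same clean-up in Python, where a callback handles each quoted span in one pass. QUOTED, UNWANTED, and clean_quoted are illustrative names, and unlike the sed script this version does not first check that the line contains "characters".

```python
import re

# A quoted span: a double quote, anything but double quotes, a double quote.
QUOTED = re.compile(r'"[^"]*"')
# Anything that is not alphanumeric, a comma, a period, or a space.
UNWANTED = re.compile(r'[^a-zA-Z0-9,. ]')


def clean_quoted(line: str) -> str:
    """Remove unwanted characters, but only inside double-quoted spans."""
    def clean(match: "re.Match") -> str:
        inner = match.group(0)[1:-1]  # text between the two quotes
        return '"' + UNWANTED.sub('', inner) + '"'

    return QUOTED.sub(clean, line)


if __name__ == "__main__":
    text = ('I have some-non-alphanumeric % characters remain here, '
            'I "also, have_+ some & .here"')
    print(clean_quoted(text))
```

Only the offending characters are deleted, so the spaces on either side of a removed & both stay in place.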
Sed is not the right tool for this. Here is a one-liner in Perl.
perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g' file
Example:
$ echo 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' | perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g'
I have some-non-alphanumeric % characters remain here, I "also, have some .here"

How to substitute a string even it contains regex meta characters using Shell or Perl?

I want to replace a word that may contain regex metacharacters with another word, for example replace .Precilla123 with .Precill. I tried the solution below:
sed 's/.Precilla123/.Precill/g'
but it will change below line
"Precilla123";"aaaa aaa";"bbb bbb"
to
.Precill";"aaaa aaa";"bbb bbb"
This side effect is not what I wanted. So I tried:
perl -pe 's/\Q.Precilla123\E/.Precill/g'
The above solution stops the regex metacharacters from being interpreted, so it does not have the side effect.
Unfortunately, if the word contains $ or #, this solution still does not work, because Perl treats $ and # as variable prefixes.
Can anybody help with this? Many thanks.
Please note that the word I want to replace is NOT hard-coded. It comes from an input file, so you can consider it a variable.
You wrote: "Unfortunately, if the word contains $ or #, this solution still cannot work, because Perl keep $ and # as variable prefix."
This is not true.
If the value that you want to replace is in a Perl variable, then quotemeta will work on the variable's contents just fine, including the characters $ and #:
echo 'pre$foo to .$foobar' | perl -pe 'my $from = q{.$foo}; s/\Q$from\E/.to/g'
Outputs:
pre$foo to .tobar
If the words that you want to replace are in an external file, then simply load that data in a BEGIN block before composing your regular expressions for replacement.
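The same principle, escaping the search word before it reaches the regex engine, can be sketched in Python with re.escape, which plays the role of quotemeta here. The function name replace_literal is made up for this illustration, and the words are passed in as ordinary strings, as they would be after reading them from a file.

```python
import re


def replace_literal(text: str, old: str, new: str) -> str:
    """Replace every literal occurrence of `old` in `text` with `new`.

    re.escape neutralises regex metacharacters in `old`; the lambda keeps
    backslashes in `new` from being read as group references.
    (replace_literal is a hypothetical helper name.)
    """
    return re.sub(re.escape(old), lambda _match: new, text)


if __name__ == "__main__":
    print(replace_literal('"Precilla123";"aaaa aaa";"bbb bbb"',
                          '.Precilla123', '.Precill'))
    print(replace_literal('pre$foo to .$foobar', '.$foo', '.to'))
```

For a purely literal replacement like this one, text.replace(old, new) would do the same job without any regex; the regex form is mainly useful when the escaped word has to be embedded in a larger pattern.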
sed 's/\.Precilla123/.Precill/g'
Escape the meta character with \.
Be careful: the metacharacters are not the same on both sides. In the search pattern they are mainly the regex ones, []{}()\.*+^$, whereas in the replacement only & and \ are special (plus, on both sides, the separator, which is whichever character follows the s).

Extract info from the large file, use newline character in awk regex

I need to extract some information from the very large file.
I want to extract specific lines using regular expressions.
What is the fastest way to do this?
I'm coding in c++ on linux.
I want to use grep, but seems my regex is not working as expected.
For example \s, \w are not working properly.
man grep says that \w and [_[:alnum:]] are synonyms, so \w should work properly, but it doesn't.
I need to use newline characters in my regex, so I couldn't use grep; therefore, I decided to use awk.
How should I use newline character in awk regex?
Let's consider we have a file (test.txt) with the content below:
HELLO worl_d5 ;
some statement
HELLO world1 ;
some statement
hi hi
some statement
...
And I want to get only these lines:
HELLO worl_d5 ;
some statement
HELLO world1 ;
some statement
I.e., I want to find lines that start with the word HELLO followed by space character(s), then an alphanumeric word (possibly containing /) followed by space character(s), and then a single ;. But I want such lines only when they are followed by a some statement line.
I wrote:
awk '/HELLO[[:space:]]([[:alnum:]]|\/)+[[:space:]];\n[[:space:]]*some[[:space:]] statement [[:space:]];/ { print }' test.txt
But I couldn't get needed results.
Or just provide an example where newline is used in regex.
I solved this by using pcregrep and newline just worked fine!
pcregrep -M '(HELLO[[:space:]]([[:alnum:]]|\/|_)+[[:space:]];)[\r\n]([[:space:]]*some[[:space:]]statement[[:space:]];)' test.txt
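Since the question asked for an example of a newline inside a regex, here is a minimal sketch in Python that matches across a line break when the whole text is searched at once. It assumes each HELLO line ends with ; and the some statement line follows directly on the next line; that layout, and the names PAIR and find_pairs, are assumptions made for this example rather than something taken from the original file.

```python
import re

# Assumed layout: "HELLO <word> ;" on one line, "some statement" on the next.
# re.MULTILINE lets ^ match at the start of every line, and the explicit
# \r?\n in the pattern is what spans the line break.
# (PAIR and find_pairs are hypothetical names.)
PAIR = re.compile(
    r'^HELLO[ \t]+[\w/]+[ \t]+;[ \t]*\r?\n[ \t]*some[ \t]+statement[^\n]*',
    re.MULTILINE,
)


def find_pairs(text: str) -> list:
    """Return each HELLO line together with the 'some statement' line after it."""
    return PAIR.findall(text)


if __name__ == "__main__":
    sample = (
        "HELLO worl_d5 ;\n"
        "some statement\n"
        "HELLO world1 ;\n"
        "some statement\n"
        "hi hi\n"
        "some statement\n"
    )
    for pair in find_pairs(sample):
        print(pair)
```

This reads the whole text into memory, so for a very large file it may be worth processing it in chunks or keeping track of the previous line instead.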