Substitute words not in double quotes - regex

$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want a Unix sed command that changes basic only where it is not in quotes (change basic to ring).
Expected output:
$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring

If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*("[^"]*){2}*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will immediately edit the file, instead of returning the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ^. Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters. We repeat that an even number of times (with {2}*). And then we match basic. Because we matched everything in the line before basic, we would replace that as well. That's why this part is wrapped in another pair of parentheses, capturing the start of the line so it can be written back in the replacement with \1, followed by ring.
One caveat: if you have multiple basic occurrences in one line, this will only replace the last one that is not enclosed in double quotes, because the greedy prefix consumes everything before it and regex matches cannot overlap. A lookbehind would solve this, but it would have to be variable-length, which sed does not support (the .NET regex engine is one of the few that does). So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
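Rather than rerunning the command by hand, the repetition can be done inside sed with a label and the t command. The following is a minimal sketch, assuming GNU sed and no escaped quotes. The function name replace_unquoted is only illustrative, and the even-quote group is written as ("[^"]*"[^"]*)* instead of {2}* so that no two quantifiers are stacked.

```shell
# Hypothetical helper (not from the original answer): replace every
# unquoted "basic" by repeating the substitution until nothing changes.
replace_unquoted() {
  # ":a" defines a label; "ta" jumps back to it whenever the s command
  # made a replacement, so each pass fixes one more occurrence.
  sed -E ':a; s/^([^"]*("[^"]*"[^"]*)*)basic/\1ring/; ta' "$@"
}

printf '%s\n' 'basic "basic" basic x' | replace_unquoted
# prints: ring "basic" ring x
```

The loop terminates because the replacement text ring does not itself contain basic.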

$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you wanna edit file in place use --in-place option.

This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"]*)/\1\n/;ta' file

Not a sed solution, but it substitutes words not in quotes
Assuming that there are no escaped quotes in strings, i.e. "This is a trap \" hehe", awk might be able to solve this problem
awk -F\" 'BEGIN {OFS=FS}
{
for(i=1; i<=NF; i++){
if(i%2)
gsub(/basic/,"ring",$i)
}
print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
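For reference, the one-line form might look like the sketch below. The wrapper name replace_outside_quotes is made up for illustration, and the loop steps through the odd fields directly instead of testing i%2. The same assumption applies: no escaped quotes.

```shell
# Hypothetical wrapper around a one-line form of the awk approach.
replace_outside_quotes() {
  # Split on double quotes; fields 1, 3, 5, ... lie outside quotes.
  awk -F'"' -v OFS='"' '{ for (i = 1; i <= NF; i += 2) gsub(/basic/, "ring", $i) } 1' "$@"
}

printf '%s\n' 'basic "basic" basic' | replace_outside_quotes
# prints: ring "basic" ring
```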

If basic is at the beginning of line:
sed -e 's/^basic/ring/' file0

Related

Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but it gives me errors on the newlines. So I need to replace them with a space.
I run some other perl commands on these files to fix other errors, so I would like a solution as a one-line perl command.
I've found some similar questions on Stack Overflow, but they don't quite do the same thing and I couldn't solve my problem with the solutions mentioned there.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed with a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern).
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and if the second line doesn't start with four or more digits followed with a | char (see the [0-9]\{4,\}| POSIX BRE regex pattern), the line break between the two is replaced with a space. The search and replace repeats until no match or the end of file.
With perl, if the file is huge but it can still fit into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file
With -0777, you slurp the file and the \R++(?!\d{4,}\|) pattern matches any one or more line breaks (\R++) not followed with four or more digits followed with a | char. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot succeed by backtracking into the line-break match.
With your shown samples, this could simply be done with an awk program. Written and tested in GNU awk, it should work in any awk. It should be fast even on huge files, since it avoids slurping the whole file into memory (the OP mentioned that the files are huge).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val OFS $0;val=""};next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
gsub(/"/,"&")%2!=0{ ##Checking if the number of " on the line is odd, which means a quoted string is not closed on this line.
if(val==""){ val=$0 } ##Checking condition if val is NULL then set val to current line.
else {print val OFS $0;val=""} ##Else(if val NOT NULL) then print val, a space (OFS) and the current line, and nullify val here.
next ##next will skip further statements from here.
}
1 ##In case number of " are EVEN in any line it will skip above condition(gsub one) and simply print the line.
' Input_file ##Mentioning Input_file name here.
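The one-liner above pairs exactly two lines. If a quoted string could span three or more lines, one possible variation is to keep accumulating lines until the running quote count becomes even. This is only a sketch under the same assumption (no escaped quotes inside strings); the function name join_quoted_newlines is hypothetical, and lines are joined with a single space.

```shell
# Hypothetical helper: join physical lines until the accumulated
# record contains an even number of double quotes.
join_quoted_newlines() {
  awk '{
    # Append the current line to the pending record, space-separated.
    buf = (buf == "" ? $0 : buf " " $0)
    # Count quotes on a copy, because gsub modifies its target.
    tmp = buf
    if (gsub(/"/, "", tmp) % 2 == 0) { print buf; buf = "" }
  }
  END { if (buf != "") print buf }   # flush an unterminated record
  ' "$@"
}

printf '%s\n' '4455|"test newline' 'in string"||"x"' '4456|"ok"' | join_quoted_newlines
# prints:
# 4455|"test newline in string"||"x"
# 4456|"ok"
```

Like the original, this reads the input line by line, so memory use stays small as long as individual records are small.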

can sed replace words in pattern substring match in one line?

original line in file sed.txt:
outer_string_PATTERN_string(PATTERN_And_PATTERN_PATTERN_i)PATTERN_outer_string(i_PATTERN_inner)_outer_string
I only need to replace PATTERN with pattern where it appears inside the brackets. This is not simply lowercasing; the replacement could be some other word.
Expected result:
outer_string_PATTERN_string(pattern_And_pattern_pattern_i)PATTERN_outer_string(i_pattern_inner)_outer_string
I can use the ([^)]*) pattern to find the substrings in which words should be replaced. But I can't use this pattern to restrict the replacement to those substrings; it replaces every PATTERN in the whole line with pattern.
:/tmp$ sed 's/([^)]*)/---/g' sed.txt
outer_string_PATTERN_string---PATTERN_outer_string---_outer_string
:/tmp$ sed '/([^)]*)/s/PATTERN/pattern/g' sed.txt
outer_string_pattern_string(pattern_And_pattern_pattern_i)pattern_outer_string(i_pattern_inner)_outer_string
I also tried to use the regex group in sed to capture and replace the words, but I can't figure out the command.
Can sed implement that? And how to achieve that? THX.
Can sed implement that?
It can be done using GNU sed and basic regular expressions (BRE):
sed '
s/)/)\n/g
:1
s/\(([^)]*\)PATTERN\([^)]*)\n\)/\1pattern\2/
t1
s/\n//g
' < file
where
1st s inserts a newline after each )
2nd s replaces the last (* is greedy) PATTERN inside ()s with pattern
t loops back if a substitution was made
3rd s strips all inserted newlines
EDIT
2nd substitute command edited according to OP's suggestion
since there is no need to match \n inside ().
Can sed implement that?
Yes. But you do not want to do it in sed. Use another programming language, like Python, Perl, or awk.
how to achieve that?
Implementing non-greedy matching is not simple in sed. In general, it consists of:
take a chunk of the input
process the chunk
put it in hold space
shuffle hold space with pattern space to separate what has already been processed from what has not
repeat
shuffle with hold space
output
Anyway, the following script:
#!/bin/bash
sed <<<'outer_string_PATTERN_string(PATTERN_i_PATTERN_PATTERN_i)PATTERN_outer_string(i_PATTERN_inner)_outer_string' '
:loop;
/\([^(]*\)\(([^)]*)\)\(.*\)/{
# Lowercase the second part.
s//\1\L\2\E\n\3/;
# Mix with hold space.
G;
s/\(.*\)\n\(.*\)\n\(.*\)/\3\1\n\2/;
# Put processed stuff into hold space
h; s/\n.*//; x;
# Process the other stuff again.
s/.*\n//;
bloop;
};
# Is hold space empty?
x; /^$/!{
# Pattern space has trailing stuff - add it.
G; s/\n//;
# We will print it.
h;
# Clear hold space
s/.*//
};x;
'
outputs:
PATTERN_outer_string(i_pattern_inner)outer_string_PATTERN_string(pattern_i_pattern_pattern_i)_outer_string
As an alternative, it is easier to do this in gnu awk with RS that matches (...) substring:
awk -v RS='\\([^)]+)' '{gsub(/PATTERN/, "pattern", RT); ORS=RT} 1' file
outer_string_PATTERN_string(pattern_i_pattern_pattern_i)PATTERN_outer_string(i_pattern_inner)_outer_string
Steps:
RS='\\([^)]+)' captures a (...) string as record separator
gsub function then replaces PATTERN with pattern in matched text i.e. RT
ORS=RT sets ORS as the new modified RT
1 prints each record to stdout
Another alternative solution using lookahead assertion in a perl regex:
perl -pe 's/PATTERN(?=[^()]*\))/pattern/g' file
Solved by this:
:/tmp$ sed 's/(/\n(/g' sed.txt | sed 's/)/)\n/g' | sed '/([^)]*)/s/PATTERN/pattern/g' | sed ':a;N;$!ba;s/\n//g'
outer_string_PATTERN_string(pattern_And_pattern_pattern_i)PATTERN_outer_string(i_pattern_inner)_outer_string
put each (...) group on its own line
find the (...) lines and replace PATTERN with pattern
merge the lines back into one line
Thanks to "How can I replace a newline (\n) using sed?"

sed: Replacing a double quote in a quoted field within a delimited record

Given an optionally quoted, pipe delimited file with the following records:
"foo"|"bar"|123|"9" Nails"|"2"
"blah"|"blah"|456|"Guns "N" Roses"|"7"
"brik"|"brak"|789|""BB" King"|"0"
"yin"|"yang"|789|"John "Cougar" Mellencamp"|"5"
I want to replace any double quotes not next to a delimiter.
I used the following and it almost works. With one exception.
sed "s/\([^|]\)\"\([^|]\)/\1'\2/g" a.txt
The output looks like this:
"foo"|"bar"|123|"9' Nails"|"2"
"blah"|"blah"|456|"Guns 'N" Roses"|"7"
"brik"|"brak"|789|"'BB' King"|"0"
"yin"|"yang"|789|"John 'Cougar' Mellencamp"|"5"
It doesn't catch the second quote of a pair if the two are separated by a single character, as in Guns "N" Roses. Does anyone know why that is and how it can be fixed? In the meantime I'm just piping the output to a second regex to handle the special case. I'd prefer to do this in one pass since some of the files can be largish.
Thanks in advance.
You can use substitution twice in sed:
sed -r "s/([^|])\"([^|])/\1'\2/g; s/([^|])\"([^|])/\1'\2/g" file
"foo"|"bar"|123|"9' Nails"|"2"
"blah"|"blah"|456|"Guns 'N' Roses"|"7"
"brik"|"brak"|789|"'BB' King"|"0"
"yin"|"yang"|789|"John 'Cougar' Mellencamp"|"5"
sed kind of implements a "while" loop:
sed ':a; s/\([^|]\)"\([^|]\)/\1'\''\2/g; ta' file
The t command loops to the label a if the previous s/// command replaced something. So that will repeat the replacement until no other matches are found.
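As for why a single pass misses the second quote: matches found by the g flag cannot overlap. In Guns "N" Roses the first match consumes the N as its trailing [^|] character, so the N is no longer available as the leading character for the next quote. A minimal sketch comparing one pass with the loop is shown here; the function names single_pass and looped are made up for the comparison.

```shell
# Hypothetical helpers contrasting one pass with the looped version.
single_pass() {
  sed "s/\([^|]\)\"\([^|]\)/\1'\2/g"
}

looped() {
  # Repeat the substitution until no replaceable quote is left.
  sed -e ':a' -e "s/\([^|]\)\"\([^|]\)/\1'\2/g" -e 'ta'
}

printf '%s\n' '"Guns "N" Roses"|"7"' | single_pass
# prints: "Guns 'N" Roses"|"7"
printf '%s\n' '"Guns "N" Roses"|"7"' | looped
# prints: "Guns 'N' Roses"|"7"
```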
Also, perl handles your case without looping, thanks to zero-width look-ahead:
perl -pe 's/[^|]\K"(?!\||$)/'\''/g'
But it doesn't handle consecutive double quotes, so the loop:
perl -pe 's//'\''/g while /[^|]\K"(?!\||$)/' file
You may like to use \x27 instead of the awkward '\'' method to insert a single quote in a single quoted string. Works with perl and GNU sed.

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'
Glenn Jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p in s/\.png//p forces the print if a substitution was done, but does not force it otherwise.
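The difference only shows up on lines without .png. A short sketch with one matching and one non-matching line follows; the function names dup_all and dup_matching and the notes.txt line are invented for the comparison.

```shell
# Hypothetical helpers: the two variants discussed above.
dup_all() {
  # Prints every line, then prints it again (with .png removed if present).
  sed 'p; s/\.png//'
}

dup_matching() {
  # Prints every line once, and a second time only if .png was removed.
  sed -n 'p; s/\.png//p'
}

printf '%s\n' '#"Albania.png",' 'notes.txt' | dup_all
# prints:
# #"Albania.png",
# #"Albania",
# notes.txt
# notes.txt
printf '%s\n' '#"Albania.png",' 'notes.txt' | dup_matching
# prints:
# #"Albania.png",
# #"Albania",
# notes.txt
```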
That is pretty easy to do with sed, and you do not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a substitution command (s///). It matches anything starting with #" followed by non-period chars ([^.]*) and then by .png",. It also captures all non-period chars before .png", using the group brackets \( and \), so we can reuse what was matched by this group. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
Next comes the replacement part of the command. The & just inserts everything that was matched by #"\([^.]*\)\.png", into the changed content. If it were the only element of the replacement part, nothing would be changed in the output. However, following the & there is a newline character - represented by the backslash \ followed by an actual newline - and in the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
I prefer this over Carles Sala and Glenn Jackman's:
sed '/\.png/p;s/\.png//'
Could just say it's personal preference.
or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input

Why doesn't this regex work?

The regex:
^ *x *\=.*$
means "match a literal x preceded by an arbitrary count of spaces, followed by an arbitrary count of spaces, then an equal sign and then anything up to the end of line." Sed was invoked as:
sed -r -e 's|^ *x *\=.*$||g' file
However it doesn't find a single match, although it should. What's wrong with the regex?
To all: thanks for the answers and effort! It seems that the problem was in tabs present in input file, which are NOT matched by the space specifier ' '. However the solution with \s works regardless of present tabs!
^\s*x\s*=.*$
Maybe you must escape some chars, figure it out one by one.
BTW: Regex tags should really have three requirements: what is the input string, what is the output string and your platform/language.
sed processes the file line-by-line, and executes the given program for each. The simplest program that does what you want is
sed -re '/^ *x *=.*$/!d' file
"/^ *x *=.*$/" selects each line that matches the pattern.
"!" negates the result.
"d" deletes the line.
sed will by default print all lines unless told otherwise. This effectively prints the lines that match the pattern.
One alternative way of writing it is:
sed -rne '/^ *x *=.*$/p' file
"/^ *x *=.*$/" selects each line that matches the pattern.
"p" prints the line.
The difference here is that I used the "-n" switch to suppress the automatic printing of lines, and instead print only the lines I have selected.
You can also use "grep" for this task:
grep -E '^ *x *=.*$' file
or maybe '^[ ]*x[ ]*='. It's a bit more compatible, but will not match tabs or the like. And, if you don't need groups, why bother about the rest of the line?
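The tab problem the asker ran into can be reproduced with a one-line input. In the sketch below the function names are hypothetical, and the POSIX class [[:space:]] is used in place of \s, on the assumption that portability beyond GNU tools matters.

```shell
# Hypothetical helpers: count matching lines for each pattern.
count_space_only() {
  # A literal space in the pattern does not match a tab.
  grep -cE '^ *x *=' || true
}

count_any_whitespace() {
  # [[:space:]] matches spaces, tabs and other whitespace.
  grep -cE '^[[:space:]]*x[[:space:]]*=' || true
}

printf '\tx = 1\n' | count_space_only      # prints: 0
printf '\tx = 1\n' | count_any_whitespace  # prints: 1
```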