Use sed to replace a group match - regex

I have a simple regular expression that creates a group match for any semicolon contained within double quotes. I'm trying to use sed on Mac OS X to replace the semicolon with 'SEMICOLON'.
However, it's not working.
Here's the command I used:
sed -i.bu "s|.*?(;).*?|SEMICOLON|g" output/html/index.html
The result is that nothing is matched and nothing is replaced.
Desired behavior:
Input
"The man sat; the man cried;" cats; dogs;
Output
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
UPDATE:
Thanks for your help everyone. So my example wasn't very good. In reality, I process a JavaScript file that's been condensed to one line, and make sure each JavaScript statement has its own line. The problem is that the JavaScript is mostly translated text, so trying to make a simple regex that would insert a newline after each ; was difficult, because I obviously don't want a newline added if the semicolon is in quotes.
Long story short... I realized I was trying to reinvent the wheel, and decided to use js-beautify to pretty print the file. It's doing a little more than I need... but it's the best solution for now.
Thanks again!

Let's take this as a test file:
$ cat file
"The man sat; the man cried;" cats; dogs;
1; 2; "man;"; 3; ";dog";
Try this sed command:
$ sed -E ':a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
1; 2; "manSEMICOLON"; 3; "SEMICOLONdog";
How it works:
:a
This creates a label a that we can refer to later.
s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/
This replaces the last ; that is inside double-quotes with SEMICOLON. Let's look at ^(([^"]*"[^"]*")*[^"]*"[^"]*); in more detail:
^ matches at the beginning of a string.
([^"]*"[^"]*")* matches from the beginning of the line through any number of complete quoted strings.
Because, in sed, regular expressions are greedy (more precisely, leftmost-longest), this will try to match as many complete quoted strings as it can.
[^"]*"[^"]*; matches any non-quotes that follow the complete quoted strings (as above), followed by the next quote character, followed by any number of non-quote characters, followed by ;.
Since the above regex minus the final ; is itself inside parens, it is saved as group 1. We replace the matched text with group 1 followed by SEMICOLON.
ta
If the last command resulted in a substitution (in other words, we found a ; that needed to be replaced), then jump back to label a and repeat.
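To see why the loop is needed, it can help to run the substitution once without :a and ta. This is only a sketch: it assumes a sed that accepts -E, and the function names replace_once and replace_all are made up for illustration.

```shell
# Hypothetical helpers; both read stdin and write stdout.

# One pass: only the last ; that sits inside quotes is replaced.
replace_once() {
    sed -E 's/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/'
}

# The same substitution, repeated until it stops matching.
# Each command gets its own -e so the label name ends cleanly.
replace_all() {
    sed -E -e ':a' -e 's/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/' -e 'ta'
}

printf '%s\n' '"The man sat; the man cried;" cats; dogs;' | replace_once
# "The man sat; the man criedSEMICOLON" cats; dogs;
printf '%s\n' '"The man sat; the man cried;" cats; dogs;' | replace_all
# "The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
```

Each pass handles one semicolon, working from right to left, which is why the t command has to send control back to the label.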
Discussion
Let's consider:
sed "s|.*?(;).*?|SEMICOLON|g"
In Python and elsewhere, .*? is a non-greedy match. Sed, however, has no such concept. For that matter, by default, sed uses Basic Regular Expressions (BRE) in which ? just means a literal question mark.
Also, it is asking for trouble to put sed commands in double-quotes as this invites the shell to modify it.
So, since BRE are obsolete, let's (1) switch to Extended Regular Expressions (ERE) using the -E switch, (2) put the command in single-quotes, and (3) change .*? to .*:
$ sed -E 's|.*(;).*|SEMICOLON|g' file
SEMICOLON
(Compatibility note: if you are on a very old Linux system, you may need to replace -E with -r.)
.*(;).* matches everything up to the last semicolon on the line, followed by the semicolon, followed by whatever follows the last semicolon. In other words, if the line contains a semicolon, .*(;).* matches the whole line. That is why the output is just SEMICOLON.
Also, (;) matches a semicolon and saves it in group 1. Since we never use group 1 anywhere, this does nothing for us. We would get the same result with:
$ sed -E 's|.*;.*|SEMICOLON|g' file
SEMICOLON
If we remove the .*, then every ; will be replaced:
$ sed -E 's|;|SEMICOLON|g' file
"The man satSEMICOLON the man criedSEMICOLON" catsSEMICOLON dogsSEMICOLON
If we want to replace the last ; in the first quoted string, we could use:
$ sed -E 's|^([^"]*"[^"]*);|\1SEMICOLON|g' file
"The man sat; the man criedSEMICOLON" cats; dogs;
If we want to replace all ; that are within any quoted string on the line, then we are back to the command at the top.
Strings spanning across lines
Let's consider a test file with a string spanning 2 lines:
$ cat file2
"man;" cat "dog
;"; ";man";
If you have GNU sed:
$ sed -Ez ':a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file2
"manSEMICOLON" cat "dog
SEMICOLON"; "SEMICOLONman";
In general for any POSIX sed:
$ sed -E 'H;1h;$!d;x; :a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file2
"manSEMICOLON" cat "dog
SEMICOLON"; "SEMICOLONman";

sed is for simple s/old/new, that is all. With any awk:
$ awk 'match($0,/"[^"]+"/) {
    str = substr($0,RSTART,RLENGTH)
    gsub(/;/,"SEMICOLON",str)
    $0 = substr($0,1,RSTART-1) str substr($0,RSTART+RLENGTH)
} 1' file
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
That's assuming you actually want all semicolons in the quoted string treated the same way. If not, whatever it is you want to do is an easy tweak, e.g. if you want that last semicolon after cried removed instead of replaced as shown in your sample output:
$ awk 'match($0,/"[^"]+"/) {
    str = substr($0,RSTART+1,RLENGTH-2)
    sub(/;$/,"",str)
    gsub(/;/,"SEMICOLON",str)
    $0 = substr($0,1,RSTART) str substr($0,RSTART+RLENGTH-1)
} 1' file
"The man satSEMICOLON the man cried" cats; dogs;

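The awk commands above only touch the first quoted string on each line. If every quoted string should be handled, one possible approach is to split the line on the quote character, so that the even-numbered fields are the quoted ones. This is a sketch that assumes quotes are balanced and never escaped; replace_in_quotes is a made-up name.

```shell
# Hypothetical helper: replace ; inside every double-quoted string on a line.
replace_in_quotes() {
    awk 'BEGIN { FS = OFS = "\"" }   # split and rejoin on the quote character
         { for (i = 2; i <= NF; i += 2) gsub(/;/, "SEMICOLON", $i) }
         1'
}

printf '%s\n' '1; 2; "man;"; 3; ";dog";' | replace_in_quotes
# 1; 2; "manSEMICOLON"; 3; "SEMICOLONdog";
```
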
Related

Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but the newlines cause errors. So I need to replace them with a space.
I run some other perl commands on these files for other errors, so I would like a solution as a one-line perl command.
I've found some similar questions on Stack Overflow, but they don't quite do the same thing, and I couldn't solve my problem with the solutions mentioned there.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed by a | char (matched with the ^[0-9]{4,}\| POSIX ERE pattern).
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and if the second line doesn't start with four or more digits followed by a | char (see the [0-9]\{4,\}| POSIX BRE regex pattern), the line break between the two is replaced with a space. The search and replace repeats until there is no match or the end of the file is reached.
With perl, if the file is huge but it can still fit into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file
With -0777, you slurp the file, and the \R++(?!\d{4,}\|) pattern matches one or more line breaks (\R++) that are not followed by four or more digits and a | char. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot succeed by backtracking into the line-break match.
With your shown samples, this can be done with a simple awk program. It was written and tested in GNU awk and should work in any awk. It processes the input line by line, so it should stay fast on huge files and avoids slurping the whole file into memory.
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val " " $0;val=""};next} 1' Input_file
Explanation: a detailed breakdown of the above.
awk ' ##Start the awk program.
gsub(/"/,"&")%2!=0{ ##If the number of " on this line is odd, a quoted string is opened or closed here without its partner.
if(val==""){ val=$0 } ##If val is empty, this line opens the string: save it in val.
else {print val " " $0;val=""} ##Otherwise this line closes it: print val, a space, and the current line, then reset val.
next ##Skip the remaining statements for this line.
}
1 ##Lines with an even number of " skip the block above (the gsub condition) and are printed as-is.
' Input_file ##The input file name.
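The parity test above assumes a quoted string spans at most two lines. A minimal sketch of a variant that keeps buffering until the quotes balance is shown here. It assumes quotes are never escaped or doubled inside a field, and join_quoted_lines is a made-up name.

```shell
# Hypothetical helper: join lines until the double quotes are balanced.
join_quoted_lines() {
    awk '
        {
            # Continuation lines are appended to the buffer with a space.
            buf = (pending ? buf " " $0 : $0)
            tmp = buf
            # gsub returns the number of quotes; odd means a string is still open.
            pending = (gsub(/"/, "", tmp) % 2)
            if (!pending) print buf
        }
        # Flush an unterminated record instead of dropping it.
        END { if (pending) print buf }
    '
}

# Usage (file name taken from the question):
# join_quoted_lines < test.txt > test.fixed
```

Like the original, this reads one line at a time, so memory use is bounded by the longest record rather than by the file size.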

Sed - Replacing text string containing parentheses, commas and unspecified number of whitespaces

Assume an alphanumeric text string that contains a section comprising a keyword, parentheses, and commas as well as a line break and an unspecified number of whitespaces immediately following some or all of the commas. How do I replace such a section from the text string with a simple comma in bash (preferentially using sed)?
Example:
$ cat have.txt
foo (keyword(00001..00002),keyword(00003..00004),
keyword(00005..00006),keyword(00007..00008)) foo
$ cat want.txt
foo (keyword(00001..00002,00003..00004,00005..00006,00007..00008)) foo
Attempt:
$ sed 's/),keyword(/,/g' have.txt
foo (keyword(00001..00002,00003..00004),
keyword(00005..00006,00007..00008)) foo
(And, yes, I know that whitespaces can be captured via [[:space:]].)
With GNU sed:
sed -z 's/),\s*keyword(/,/g' file
With -z, you will be able to match linebreaks, the \s* will match zero or more whitespace including those linebreaks.
To actually modify the file use
sed -z -i 's/),\s*keyword(/,/g' file
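If -z is not available, the usual workaround is to collect all lines into the pattern space with N first. The following is a sketch of that idea, not a drop-in replacement: it holds the whole input in memory, and merge_keywords is a made-up name.

```shell
# Hypothetical helper: turn "),<whitespace>keyword(" into "," across line breaks.
merge_keywords() {
    # :a / N / ba appends every remaining line to the pattern space,
    # then the substitution runs once over the whole text.
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
        -e 's/),[[:space:]]*keyword(/,/g'
}

printf '%s\n' 'foo (keyword(00001..00002),keyword(00003..00004),' \
              'keyword(00005..00006),keyword(00007..00008)) foo' | merge_keywords
# foo (keyword(00001..00002,00003..00004,00005..00006,00007..00008)) foo
```
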
cat >> edrep.txt << EOF
2s/ //
%s/(//g
%s/)//g
%s/keyword//g
1s/00001/\(keyword\(00001/
2s/8/8))/
1,2j
wq
EOF
ed -s file.txt < edrep.txt
I know this one has already been answered, but I felt like posting this solution anyway. People here will probably consider it hackish; but the point is that you can climb around text files like a monkey with ed, and I don't know of any other program which lets you do that to the same degree.

Adding a line using sed

Can't seem to find the right way to do this, despite checking my regex in a reg checker.
Given a text file containing, amongst others, this entry:
zone "example.net" {
type master;
file "/etc/bind/zones/db.example.net";
allow-transfer { x.x.x.x;y.y.y.y; };
also-notify { x.x.x.x;y.y.y.y; };
};
I want to add lines after the also-notify line, for that domain specifically.
So using this sed command string:
sed '/"example\.net".*?also-notify.*?};/a\nxxxxxxx/s' named.conf.local
I thought should work to add 'xxxxxxx' after the line. But nope. What am I doing wrong?
With POSIX sed, you can use the a for append command with an escaped literal new line:
$ sed '/^[[:blank:]]*also-notify/ a\
NEW LINE' file
With GNU sed, a is slightly more natural since the new line is assumed:
$ gsed '/^[[:blank:]]*also-notify/ a NEW LINE' file
The issue with the sed in your example is twofold.
The first is that a sed regex address cannot match across multiple lines, as in "example\.net".*?also-notify.*?. That is more of a Perl-type match. You would need to use a range operator for the start, as in:
$ sed '/"example\.net/,/also-notify/{
/^[[:blank:]]*also-notify/ a\
NEW LINE
}' file
The second issue is the \n in the appended text. With POSIX sed, the \n is not supported in any context. With GNU sed, the new line is assumed and the \n is out of context (if immediately after the a) and interpreted as an escaped literal n. You can use \n with GNU sed after 1 character but not immediately after. In POSIX sed, leading spaces of the appended line will always be stripped.
Following awk may help on this.
awk -v new_lines="new_line here" '/also-notify/{flag=1;print new_lines} /^};/{flag=""} !flag' Input_file
In case you want to edit Input_file itself, append > temp_file && mv temp_file Input_file to the above command. Also, new_lines is a variable here; you could print the new lines directly instead.
You're pretty close already. Just use a range (/pattern/,/pattern/{ #commands }) to select the text you want to operate on and then use /pattern/a\ ... to add the line you want.
/"example\.net"/,/also-notify/{
/also-notify/a\
\ this is the text I want to add.
}
sed trims leading space on text to be appended. Adding a backslash \ at the start of the line prevents this.
In Bash, this would look something like:
sed -e '/"example\.net"/,/also-notify/{
/also-notify/a\
\ this is the text I want to add.
}' named.conf.local
Also note that sed uses an older dialect of regular expressions that doesn't support non-greedy quantifiers like *?.
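For comparison, the same "only inside this zone" idea can be sketched in awk with a flag instead of a range. This is an illustration under assumptions: each zone block ends with }; at the start of a line as in the question, and append_after_notify and its two arguments are made-up names. Note that awk expands backslash escapes in -v values.

```shell
# Hypothetical helper: append TEXT after the also-notify line of ZONE only.
# Usage: append_after_notify ZONE TEXT < named.conf.local
append_after_notify() {
    awk -v zone="$1" -v text="$2" '
        # index() does a literal search, so dots in the zone name are safe.
        index($0, "zone \"" zone "\"") { inzone = 1 }
        { print }
        inzone && /also-notify/ { print text; inzone = 0 }
        # A closing }; at the start of a line ends any zone block.
        /^};/ { inzone = 0 }
    '
}

# Example call (not run here):
# append_after_notify example.net 'xxxxxxx' < named.conf.local
```
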

Grep with reg ex

Trying to use regex with grep on the command line to give me lines that start with either a whitespace or lowercase int followed by a space. From there, they must end with either a semicolon or an o.
I tried
grep ^[\s\|int]\s+[\;\o]$ fileName
but I don't get what I'm looking for. I also tried
grep ^\s*int\s+([a-z][a-zA-Z]*,\s*)*[a-z]A-Z]*\s*;
but nothing.
Let's consider this test file:
$ cat file
 keep marco
polo
int keep;
int x
If I understand your rules correctly, two of the lines in the above should be kept and the other two discarded.
Let's try grep:
$ grep -E '^(\s|int\s).*[;o]$' file
 keep marco
int keep;
The above uses \s to mean space. \s is supported by GNU grep. For other greps, we can use a POSIX character class instead. After reorganizing the code slightly to reduce typing:
grep -E '^(|int)[[:blank:]].*[;o]$' file
How it works
In a Unix shell, the single quotes in the command are critical: they stop the shell from interpreting or expanding any character inside the single quotes.
-E tells grep to use extended regular expressions. This reduces the need for backslashes.
Let's examine the regular expression, one piece at a time:
^ matches at the beginning of a line.
(\s|int\s) This matches either a space or int followed by a space.
.* matches zero or more of any character.
[;o] matches any character in the square brackets which means that it matches either ; or o.
$ matches at the end of a line.
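The pieces above can be tried end to end without creating a file. This sketch uses the POSIX [[:blank:]] class instead of \s and spells out both alternatives rather than relying on an empty one. The name keep_matching is made up, and the first sample line is assumed to begin with a space.

```shell
# Hypothetical helper: keep lines that start with a blank or "int" plus a blank
# and end with ; or o.
keep_matching() {
    grep -E '^([[:blank:]]|int[[:blank:]]).*[;o]$'
}

printf '%s\n' ' keep marco' 'polo' 'int keep;' 'int x' | keep_matching
# Prints " keep marco" (with its leading space) and "int keep;".
```

Here polo is rejected because it does not start with a blank or int, and int x is rejected because it does not end with ; or o.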

process a delimited text file with sed

I have a ";" delimited file:
aa;;;;aa
rgg;;;;fdg
aff;sfg;;;fasg
sfaf;sdfas;;;
ASFGF;;;;fasg
QFA;DSGS;;DSFAG;fagf
I'd like to process it replacing the missing value with a \N .
The result should be:
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;\N
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
I'm trying to do it with a sed script:
sed "s/;\(;\)/;\\N\1/g" file1.txt >file2.txt
But what I get is
aa;\N;;\N;aa
rgg;\N;;\N;fdg
aff;sfg;\N;;fasg
sfaf;sdfas;\N;;
ASFGF;\N;;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
You don't need to enclose the second semicolon in parentheses just to use it as \1 in the replacement string. You can use ; in the replacement string:
sed 's/;;/;\\N;/g'
As you noticed, when it finds a pair of semicolons it replaces it with the desired string then skips over it, not reading the second semicolon again and this makes it insert \N after every two semicolons.
A solution is to use positive lookaheads; the regex is /;(?=;)/ but sed doesn't support them.
But it's possible to solve the problem using sed in a simple manner: duplicate the search command; the first command replaces the odd appearances of ;; with ;\N, the second one takes care of the even appearances. The final result is the one you need.
The command is as simple as:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
It duplicates the previous command and uses the ; between g and s to separate them. Alternatively you can use the -e command line option once for each search expression:
sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g'
Update:
The OP asks in a comment "What if my file have 100 columns?"
Let's try and see if it works:
$ echo "0;1;;2;;;3;;;;4;;;;;5;;;;;;6;;;;;;;" | sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
0;1;\N;2;\N;\N;3;\N;\N;\N;4;\N;\N;\N;\N;5;\N;\N;\N;\N;\N;6;\N;\N;\N;\N;\N;\N;
Look, ma! It works!
:-)
Update #2
I ignored the fact that the question doesn't ask to replace ;; with something else but to replace the empty/missing values in a file that uses ; to separate the columns. Accordingly, my expression doesn't fix the missing value when it occurs at the beginning or at the end of the line.
As the OP kindly added in a comment, the complete sed command is:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g;s/^;/\\N;/g;s/;$/;\\N/g'
or (for readability):
sed -e 's/;;/;\\N;/g;' -e 's/;;/;\\N;/g;' -e 's/^;/\\N;/g' -e 's/;$/;\\N/g'
The two additional steps replace ; when it is found at the beginning or at the end of the line.
You can use this sed command with 2 s (substitute) commands:
sed 's/;;/;\\N;/g; s/;;/;\\N;/g;' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
Or using lookarounds regex in a perl command:
perl -pe 's/(?<=;)(?=;)/\\N/g' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
The main problem is that the same characters can't be used by more than one match within a single substitution:
s/;;/..../g: The second ; can't be reused for the next match in a string like ;;;
If you want to do it with sed without using a Perl-like regex mode, you can use a loop with the conditional command t:
sed ':a;s/;;/;\\N;/g;ta;' file
:a defines a label "a"; ta jumps to this label only if something has been replaced.
For the ; at the end of the line (and to deal with possible trailing whitespace):
sed ':a;s/;;/;\\N;/g;ta; s/;[ \t\r]*$/;\\N/1' file
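Putting the loop together with both edge cases (an empty first field and an empty last field) could look like the sketch below. It assumes there is no trailing whitespace, and fill_empty_fields is a made-up name.

```shell
# Hypothetical helper: put \N into every empty ;-separated field.
fill_empty_fields() {
    # Loop: each pass fills one empty field between two semicolons.
    # Then handle an empty first field and an empty last field.
    sed -e ':a' -e 's/;;/;\\N;/' -e 'ta' \
        -e 's/^;/\\N;/' -e 's/;$/;\\N/'
}

printf '%s\n' 'sfaf;sdfas;;;' | fill_empty_fields
# sfaf;sdfas;\N;\N;\N
```

Passing each command with its own -e keeps the label on a line of its own, which some sed implementations require.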
This awk one-liner will give you what you want:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N"}7' file
if you really want the line: sfaf;sdfas;\N;\N;\N , this line works for you:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N";sub(/;$/,";\\N")}7' file
sed 's/;/;\\N/g;s/;\\N\([^;]\)/;\1/g;s/;[[:blank:]]*$/;\\N/' YourFile
Non-recursive, one-liner, POSIX compliant.
Concept:
change all ;
put back the ones that are followed by a value
handle the special case of a last ; possibly followed by blanks before the end of the line
This might work for you (GNU sed):
sed -r ':;s/^(;)|(;);|(;)$/\2\3\\N\1\2/g;t' file
There are 4 scenarios in which an empty field may occur: at the start of a record, between 2 field delimiters, an empty field following an empty field and at the end of a record. Alternation can be employed to cater for scenarios 1, 2 and 4, and scenario 3 can be catered for by a second pass using a loop (:;...;t). Multiple scenarios can be replaced in both passes using the g flag.