Sed - Replacing text string containing parentheses, commas and unspecified number of whitespaces

Assume an alphanumeric text string that contains a section comprising a keyword, parentheses, and commas, as well as a line break and an unspecified number of whitespaces immediately following some or all of the commas. How do I replace such a section in the text string with a simple comma in bash (preferably using sed)?
Example:
$ cat have.txt
foo (keyword(00001..00002),keyword(00003..00004),
keyword(00005..00006),keyword(00007..00008)) foo
$ cat want.txt
foo (keyword(00001..00002,00003..00004,00005..00006,00007..00008)) foo
Attempt:
$ sed 's/),keyword(/,/g' have.txt
foo (keyword(00001..00002,00003..00004),
keyword(00005..00006,00007..00008)) foo
(And, yes, I know that whitespaces can be captured via [[:space:]].)

With GNU sed:
sed -z 's/),\s*keyword(/,/g' file
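With -z, sed treats the input as NUL-separated rather than newline-separated, so the whole file lands in the pattern space and linebreaks can be matched; \s* then matches zero or more whitespace characters, including those linebreaks.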
To actually modify the file use
sed -z -i 's/),\s*keyword(/,/g' file
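If GNU sed's -z is not available, a rough equivalent is to collect all lines into the pattern space first and then run the same substitution with the POSIX class [[:space:]]. This is only a sketch: the function name join_keywords is made up for illustration, and the approach assumes the input is small enough to hold in memory.

```shell
# Hypothetical helper: read stdin, join everything into one pattern space,
# then collapse "),<any whitespace or newlines>keyword(" into a single comma.
join_keywords() {
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
        -e 's/),[[:space:]]*keyword(/,/g'
}

# Sample input taken from the question.
printf '%s\n' \
    'foo (keyword(00001..00002),keyword(00003..00004),' \
    'keyword(00005..00006),keyword(00007..00008)) foo' | join_keywords
```

The N/ba loop appends each following line to the pattern space until the last line is reached, so the substitution sees the embedded newline just as it does with -z.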

cat > edrep.txt << EOF
2s/ //
%s/(//g
%s/)//g
%s/keyword//g
1s/00001/\(keyword\(00001/
2s/8/8))/
1,2j
wq
EOF
ed -s file.txt < edrep.txt
I know this one has already been answered, but I felt like posting this solution anyway. People here will probably consider it hackish; but the point is that you can climb around text files like a monkey with ed, and I don't know of any other program which lets you do that to the same degree.

Related

How to add a new line to the beginning and end of every line matching a pattern?

How can sed be used to add a \n to the beginning and to the end of every line matching the pattern %%###? This is how far I got:
If foo.txt contains
foobar
%%## Foo
%%### Bar
foobar
then sed 's/^%%###/\n&\n/g' foo.txt gives
foobar
%%## Foo

%%###
 Bar
foobar
instead of
foobar
%%## Foo

%%### Bar

foobar
Note: This seems related to this post
Update: I'm actually looking for case where lines starting with the pattern are considered only.

With GNU sed:
sed 's/.*%%###.*/\n&\n/' file
Output:
foobar
%%## Foo

%%### Bar

foobar
&: Refer to that portion of the pattern space which matched
If you want to edit your file "in place" use sed's option -i.

It is cumbersome to directly add newlines via sed. But here is one option if you have perl available:
$ perl -pe 's/(.*%%###.*)/\n$1\n/' foo.txt
Here we capture every matching line, which is defined as any line containing the pattern %%### anywhere, and then we replace with that line surrounded by leading and trailing newlines.

This might work for you (GNU sed):
sed '/%%###/!b;G;i\\' file
For those lines that meet the criteria, append a newline from the hold space (the hold space by default contains a newline) and insert an empty line.
Another way:
sed -e '/%%###/!b;i\\' -e 'a\\' file
This time insert and then append empty lines.
N.B. The i and a commands must be followed by a newline; this can be achieved by putting them in separate -e invocations.
A third way:
sed '/%%###/!b;G;s/.*\(.\)/\1&/' file
As in the first way, append a newline from the hold space, then copy it i.e. the last character of the amended current line, and prepend it to the current line.
Yet another way:
sed '/%%###/{x;p;x;G}' file
Swap to the hold space, print the newline, swap back and append the newline.
N.B. If the hold space may not be empty (a previous x, h or H command may have changed it), the buffer may be cleared before it is printed (p) by using the zap command z.
And of course:
sed '/%%###/s/^\|$/\n/g' file
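The question's update asks for lines that start with the pattern only. A minimal sketch of the hold-space variant with the address anchored by ^ might look like this; the function name pad_heading_lines is just an illustrative choice, and it assumes nothing else in the script has written to the hold space.

```shell
# Hypothetical helper: surround lines that START with %%### by empty lines.
pad_heading_lines() {
    # x;p;x prints the (still empty) hold space as a blank line before the match;
    # G appends a newline plus the empty hold space, giving a blank line after it.
    sed '/^%%###/{x;p;x;G;}'
}

printf '%s\n' 'foobar' '%%## Foo' '%%### Bar' 'foobar' | pad_heading_lines
```

Because the address is /^%%###/, the %%## Foo line and any line that merely contains %%### further along are passed through unchanged.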

Personally I'd use a string comparison instead of a regexp comparison, since there are no regexp characters in the text you want to match (and if there were, you wouldn't want them treated as such), and just print newlines around the string instead of modifying the string itself:
awk '{print ($1=="%%###" ? ORS $0 ORS : $0)}' file
The above will work with any awk in every shell on every UNIX box and is easily modified if you don't want multiple blank lines between consecutive %%### lines or don't want blank lines added if the existing surrounding lines are already blank or you need to do anything else.
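As one possible tweak along those lines, the string test can be written with index() so that it checks whether the line begins with %%### regardless of how the fields split. This is a sketch, not the answerer's own code, and pad_marked_lines is a made-up name.

```shell
# Hypothetical helper: literal (non-regexp) prefix test with index().
pad_marked_lines() {
    awk 'index($0, "%%###") == 1 { print ""; print; print ""; next }
         { print }'
}

printf '%s\n' 'foobar' '%%### Bar' 'x %%### y' | pad_marked_lines
```

index() returns the position of the first occurrence of the string, so a result of 1 means the line starts with it; a %%### in the middle of a line gives a larger value and the line is printed as-is.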

Use sed to replace a group match

I have a simple regular expression that creates a group match for any semicolon contained within double quotes. I'm trying to use sed on Mac OS X to replace the semicolon with 'SEMICOLON'.
However, it's not working.
Here's the command I used:
sed -i.bu "s|.*?(;).*?|SEMICOLON|g" output/html/index.html
The result is that nothing is matched and nothing is replaced.
Desired behavior:
Input
"The man sat; the man cried;" cats; dogs;
Output
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
UPDATE:
Thanks for your help everyone. So my example wasn't very good. In reality, I process a JavaScript file that's been condensed to one line, and make sure each JavaScript statement has its own line. The problem is that the JavaScript is mostly translated text, so trying to make a simple regex that would insert a newline after each ; was difficult, because I obviously don't want a newline added if the semicolon is in quotes.
Long story short... I realized I was trying to reinvent the wheel, and decided to use js-beautify to pretty print the file. It's doing a little more than I need... but it's the best solution for now.
Thanks again!

Let's take this as a test file:
$ cat file
"The man sat; the man cried;" cats; dogs;
1; 2; "man;"; 3; ";dog";
Try this sed command:
$ sed -E ':a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
1; 2; "manSEMICOLON"; 3; "SEMICOLONdog";
How it works:
:a
This creates a label a that we can refer to later.
s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/
This replaces the last ; that is inside double-quotes with SEMICOLON. Let's look at ^(([^"]*"[^"]*")*[^"]*"[^"]*); in more detail:
^ matches at the beginning of a string.
([^"]*"[^"]*")* matches from the beginning of the line through any number of complete quoted strings.
Because, in sed, regular expressions are greedy (more precisely, leftmost-longest), this will try to match as many complete quoted strings as it can.
[^"]*"[^"]*; matches any non-quotes that follow the complete quoted strings (as above), followed by the next quote character, followed by any number of non-quote characters, followed by ;.
Since the above regex minus the final ; is itself inside parens, it is saved as group 1. We replace the matched text with group 1 followed by SEMICOLON.
ta
If the last command resulted in a substitution (in other words, we found a ; that needed to be replaced), then jump back to label a and repeat.
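To see what the :a ... ta loop contributes, it may help to compare a single pass of the substitution with the looping version on the second test line. The two function names below are made up for this sketch.

```shell
# Hypothetical helper: one pass only, so only the LAST quoted ; is replaced.
one_pass() {
    sed -E -e 's/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/'
}

# Hypothetical helper: same substitution, repeated until nothing matches.
mark_quoted_semicolons() {
    sed -E -e ':a' \
           -e 's/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/' \
           -e 'ta'
}

printf '%s\n' '1; 2; "man;"; 3; ";dog";' | one_pass
printf '%s\n' '1; 2; "man;"; 3; ";dog";' | mark_quoted_semicolons
```

The single pass leaves "man;" untouched and only rewrites ";dog", while the loop keeps going back to label a until every semicolon inside quotes has been replaced.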
Discussion
Let's consider:
sed "s|.*?(;).*?|SEMICOLON|g"
In Python and elsewhere, .*? is a non-greedy match. Sed, however, has no such concept. For that matter, by default, sed uses Basic Regular Expressions (BRE) in which ? just means a literal question mark.
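A small experiment, sketched here with throwaway inputs, shows both points: how ? is read in BRE versus ERE, and that .* always takes the longest possible match.

```shell
# BRE (the default): "?" is an ordinary character, so the pattern needs a literal "a?b".
printf 'b\n' | sed 's/a?b/X/'        # no match, prints: b

# ERE (-E): "?" makes the preceding "a" optional, so "b" on its own matches.
printf 'b\n' | sed -E 's/a?b/X/'     # prints: X

# ".*" is greedy: it runs up to the LAST semicolon, not the first.
printf 'a;b;c\n' | sed -E 's/.*;/X/' # prints: Xc
```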
Also, it is asking for trouble to put sed commands in double-quotes as this invites the shell to modify it.
So, let's (1) switch to Extended Regular Expressions (ERE) using the -E switch, which need fewer backslashes than BRE, (2) put the command in single-quotes, and (3) change .*? to .*:
$ sed -E 's|.*(;).*|SEMICOLON|g' file
SEMICOLON
(Compatibility note: if you are on a very old Linux system, you may need to replace -E with -r.)
.*(;).* matches everything up to the last semicolon on the line, followed by the semicolon, followed by whatever follows the last semicolon. In other words, if the line contains a semicolon, .*(;).* matches the whole line. That is why the output is just SEMICOLON.
Also, (;) matches a semicolon and saves it in group 1. Since we never use group 1 anywhere, this does nothing for us. We would get the same result with:
$ sed -E 's|.*;.*|SEMICOLON|g' file
SEMICOLON
If we remove the .*, then every ; will be replaced:
$ sed -E 's|;|SEMICOLON|g' file
"The man satSEMICOLON the man criedSEMICOLON" catsSEMICOLON dogsSEMICOLON
If we want to replace the last ; in the first quoted string, we could use:
$ sed -E 's|^([^"]*"[^"]*);|\1SEMICOLON|g' file
"The man sat; the man criedSEMICOLON" cats; dogs;
If we want to replace all ; that are within any quoted string on the line, then we are back to the command at the top.
Strings spanning across lines
Let's consider a test file with a string spanning 2 lines:
$ cat file2
"man;" cat "dog
;"; ";man";
If you have GNU sed:
$ sed -Ez ':a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file2
"manSEMICOLON" cat "dog
SEMICOLON"; "SEMICOLONman";
In general for any POSIX sed:
$ sed -E 'H;1h;$!d;x; :a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file2
"manSEMICOLON" cat "dog
SEMICOLON"; "SEMICOLONman";
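As a cross-check on the single-line case, the same result can be sketched without a loop by letting awk split each line on double quotes: the even-numbered fields are then the text inside quotes. This assumes the quotes on each line are balanced and that strings do not span lines; escape_quoted_semicolons is a made-up name.

```shell
# Hypothetical helper: with " as the field separator, fields 2, 4, 6, ...
# are the quoted strings, so only those get their semicolons replaced.
escape_quoted_semicolons() {
    awk 'BEGIN { FS = OFS = "\"" }
         { for (i = 2; i <= NF; i += 2) gsub(/;/, "SEMICOLON", $i); print }'
}

printf '%s\n' \
    '"The man sat; the man cried;" cats; dogs;' \
    '1; 2; "man;"; 3; ";dog";' | escape_quoted_semicolons
```

Assigning to a field makes awk rebuild the line with OFS, which is why OFS is set to the same double quote as FS.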

sed is for simple s/old/new substitutions, that is all. With any awk:
$ awk 'match($0,/"[^"]+"/) {
str = substr($0,RSTART,RLENGTH)
gsub(/;/,"SEMICOLON",str)
$0 = substr($0,1,RSTART-1) str substr($0,RSTART+RLENGTH)
} 1' file
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
That's assuming you actually want all semicolons in the quoted string treated the same way. If not, whatever it is you want to do is an easy tweak, e.g. if you want that last semicolon after cried removed instead of replaced as shown in your sample output:
$ awk 'match($0,/"[^"]+"/) {
str = substr($0,RSTART+1,RLENGTH-2)
sub(/;$/,"",str)
gsub(/;/,"SEMICOLON",str)
$0 = substr($0,1,RSTART) str substr($0,RSTART+RLENGTH-1)
} 1' file
"The man satSEMICOLON the man cried" cats; dogs;

How to extract base name and append it to path

Basically, I want to transform ./foo/bar to ./foo/bar/bar. I initially tried sed but the regex I came up with uses lookaround ((?<=s\/)(.*)) or the \K escape sequence (.*\/\K.*), which sed does not support.

For the general case of "how do I append /name to /path/to/name", you can use parameter expansion:
$ var='./foo/bar'
$ echo "$var/${var##*/}"
./foo/bar/bar
${var##*/} expands to the value of $var with the longest possible match of */ (anything ending with a slash) removed from its beginning.
If you have a file with one of these entries per line, you could do something like this with sed:
$ cat infile
./foo/bar
./foo/bar/baz
./path/to/file
$ sed 's/\([^/]*\)$/\1\/\1/' infile
./foo/bar/bar
./foo/bar/baz/baz
./path/to/file/file
This captures everything after the last / on each line and appends a slash and the captured sequence to the end of the line.
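The parameter expansion can also be applied line by line without sed. The loop below is a sketch of that idea; append_basename is a made-up name, and it assumes one path per line with no trailing slash.

```shell
# Hypothetical helper: for each path on stdin, append "/" plus its last component.
append_basename() {
    while IFS= read -r path; do
        # ${path##*/} strips the longest prefix ending in "/", leaving the base name.
        printf '%s/%s\n' "$path" "${path##*/}"
    done
}

printf '%s\n' ./foo/bar ./foo/bar/baz ./path/to/file | append_basename
```

IFS= and -r keep read from trimming whitespace or interpreting backslashes in the paths.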

Grep with reg ex

Trying to use regex with grep on the command line to give me lines that start with either a whitespace or lowercase int followed by a space. From there, they must end with either a semicolon or an o.
I tried
grep ^[\s\|int]\s+[\;\o]$ fileName
but I don't get what I'm looking for. I also tried
grep ^\s*int\s+([a-z][a-zA-Z]*,\s*)*[a-z]A-Z]*\s*;
but nothing.

Let's consider this test file (the first line begins with a space):
$ cat file
 keep marco
polo
int keep;
int x
If I understand your rules correctly, two of the lines in the above should be kept and the other two discarded.
Let's try grep:
$ grep -E '^(\s|int\s).*[;o]$' file
 keep marco
int keep;
The above uses \s to mean whitespace. \s is supported by GNU grep. For other greps, we can use a POSIX character class instead. After reorganizing the code slightly to reduce typing:
grep -E '^(int)?[[:blank:]].*[;o]$' file
How it works
In a Unix shell, the single quotes in the command are critical: they stop the shell from interpreting or expanding any character inside the single quotes.
-E tells grep to use extended regular expressions. This reduces the need for backslashes.
Let's examine the regular expression, one piece at a time:
^ matches at the beginning of a line.
(\s|int\s) This matches either a space or int followed by a space.
.* matches zero or more of any character.
[;o] matches any character in the square brackets which means that it matches either ; or o.
$ matches at the end of a line.
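Putting the pieces together, here is a sketch that feeds the four test lines through a POSIX-class version of the pattern. The optional group (int)? is one way to express "nothing or int" before the blank, and filter_lines is a made-up name.

```shell
# Hypothetical helper: keep lines that start with a blank, or with "int" plus
# a blank, and that end in ";" or "o".
filter_lines() {
    grep -E '^(int)?[[:blank:]].*[;o]$'
}

# Note the leading space on the first sample line.
printf '%s\n' ' keep marco' 'polo' 'int keep;' 'int x' | filter_lines
```

Only " keep marco" and "int keep;" pass: polo ends in o but does not start with a blank or int, and int x starts correctly but ends in x.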

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks

You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'

Glenn Jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p in s/\.png//p forces the print if a substitution was done, but does not force it otherwise.
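The difference between the two commands only shows up on a line without .png. A quick comparison, using a made-up two-line sample and illustrative function names:

```shell
# Hypothetical helpers wrapping the two commands discussed above.
dup_all()      { sed 'p; s/\.png//'; }        # p, then the implicit print
dup_matching() { sed -n 'p; s/\.png//p'; }    # p, then print only on substitution

# Made-up sample: one matching line, one non-matching line.
sample() { printf '%s\n' '#"Albania.png",' '// no extension here'; }

sample | dup_all
echo '---'
sample | dup_matching
```

With dup_all the non-matching line comes out twice; with dup_matching it comes out once, while the .png line is followed by its stripped copy in both cases.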

That is pretty easy to do with sed, and you do not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a substitution command (s///). It matches anything starting with #" followed by non-period chars ([^.]*) and then by .png",. The non-period chars before .png", are enclosed in the group brackets \( and \), so we can later refer to what this group matched. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
Next comes the replacement part of the command. The & just inserts everything that was matched by #"\([^.]*\)\.png", into the changed content. If it were the only element of the replacement part, nothing would be changed in the output. However, following the & there is a newline character, represented by the backslash \ followed by an actual newline, and on the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input

I prefer this over Carles Sala and Glenn Jackman's:
sed '/\.png/p;s/\.png//'
Could just say it's personal preference.

or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input
