Substitute second occurrence of a pattern in a line using awk

I need to replace the second occurrence of a pattern (one that matches the last field) with another string, and also keep a count of all such changes made in a file.
Example: try.txt
Hi
Change apple orange guava mango banana orange
It's hot outside
Change tom greg fred harry steve fred
George is a cool guy
Change mary lucy becky karly jill karly
thank you
In all the lines that contain the pattern "Change", I want to replace the last word, for example "orange" in the second line, with, say, pear. Note that the first orange should not be changed. I also want to add a suffix that shows the number of changes made in the file so far.
I tried the following, but it changed both occurrences (1st orange and 2nd orange, 1st fred and 2nd fred, 1st karly and 2nd karly), whereas I wanted to change only the second occurrence.
awk 'BEGIN {cntr=0} {if (/Change/) {gsub($NF,"pear"); OFS=""; print $0,cntr; cntr++} else {print}}' try.txt
The output is:
Hi
Change apple pear guava mango banana pear0
It's hot outside
Change tom greg pear harry steve pear1
George is a cool guy
Change mary lucy becky pear jill pear2
thank you
Desired output is:
Hi
Change apple orange guava mango banana pear0
It's hot outside
Change tom greg fred harry steve pear1
George is a cool guy
Change mary lucy becky karly jill pear2
thank you
When gsub is replaced with sub, it changes only the first occurrence. Any help is appreciated.

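This one-liner works for your input (the trailing 7 is simply a true condition, so every line is printed):
awk '/Change/{$NF="pear"(i++)}7' file
Assigning to $NF makes awk rebuild the line with OFS, so runs of spaces between fields collapse to a single space. If you want to keep the original spacing untouched, you can do:
awk '/Change/{sub(/\S+$/,"pear"(i++))}7' file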
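A portable sketch of the second one-liner, in case your awk lacks `\S` (a GNU awk extension). It assumes words are separated by spaces or tabs; the function name `replace_last` and the counter name `n` are made up for illustration.

```shell
# On lines containing "Change", replace only the trailing word with
# "pear" followed by a running count; all other lines pass through.
# The original spacing inside each line is preserved because $0 is
# edited with sub() instead of assigning to a field.
replace_last() {
  awk '/Change/ { sub(/[^ \t]+$/, "pear" (n++)) } { print }'
}

replace_last <<'EOF'
Hi
Change apple orange   guava mango banana orange
It's hot outside
Change tom greg fred harry steve fred
EOF
```

Because the pattern is anchored with `$`, the earlier "orange" and "fred" are never candidates for replacement, so no counting of occurrences is needed.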

I think I found a work-around:
awk 'BEGIN {cntr=0} {if (/Change/) {$NF=$NF_cntr; sub($NF,"pear"); OFS=""; print $0,cntr; OFS=" "; cntr++} else {print}}' try.txt
The output was as I desired.
But I would still like to hear from the community about better ways of achieving it.
Thanks

Related

Regex list/column to comma delimited

I have a column of data, in this case captured from a website. I would like to convert this list into a comma-separated list using a regex, either in the terminal or in gedit, etc.
My list:
Liam
Noah
William
James
Oliver
Benjamin
What I want is:
Liam, Noah, William, James, Oliver, Benjamin
or
(Liam, Noah, William, James, Oliver, Benjamin)
or similar.
What I have tried is ^([A-Za-z]+)$("$1",) . I think it finds each name but it is not replacing anything.
It would also be great if something like this worked with numbers as well. Like,
10
20
30
pie
to
10,20,30,pie

Like this (note that -i edits the file in place):
perl -i -pe 's/\n/, /' file
Output:
Liam, Noah, William, James, Oliver, Benjamin,
Or better:
perl -0ne 'my @a = (split /\n/, $_); print join (", ", @a) . "\n"' file
Output:
Liam, Noah, William, James, Oliver, Benjamin
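If Perl is not at hand, the same join can be sketched in awk. This is only an illustration: the function name `join_lines` is made up, and it reads standard input rather than a file.

```shell
# Join every input line with ", " and end with a single newline.
# The separator is printed before each line except the first, so no
# trailing comma is produced.
join_lines() {
  awk '{ printf "%s%s", (NR > 1 ? ", " : ""), $0 } END { if (NR) print "" }'
}

printf '%s\n' Liam Noah William James Oliver Benjamin | join_lines
printf '%s\n' 10 20 30 pie | join_lines
```

It treats each line as opaque text, so names and numbers are handled the same way.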

Sed remove only first occurrence of a string

I have several strings in my text file which look like this:
Brisbane, Queensland, Australia|BNE
I know how to use the sed command to replace any character with another one. This time I want to replace the comma-space with a pipe, but only for the first match, so that the country name is not affected.
I need to convert it to something like this:
Brisbane|Queensland, Australia|BNE
As you can see, only the first comma-space was replaced, not the second one, so the country name "Queensland, Australia" stays complete. Can someone help me achieve this? Thanks.
Here is a sample of my file:
Brisbane, Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
Running sed 's/, /|/' file.txt doesn't work.
The output should look like this:
Brisbane|Queensland, Australia|BNE

Simply don't use the g option. Your sed command should look like this:
sed 's/, /|/'
By default, the s command replaces only the first occurrence of the pattern in the pattern buffer, unless you pass the g option.
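A small side-by-side sketch of the difference, using one line from the question. The function names `first_only` and `every_match` are made up for illustration.

```shell
line='Brisbane, Queensland, Australia|BNE'

# Without the g flag, only the first ", " on each line is replaced.
first_only() { sed 's/, /|/'; }

# With the g flag, every ", " on the line is replaced.
every_match() { sed 's/, /|/g'; }

printf '%s\n' "$line" | first_only    # Brisbane|Queensland, Australia|BNE
printf '%s\n' "$line" | every_match   # Brisbane|Queensland|Australia|BNE
```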

Since you have not posted the expected output for your whole test file, we can only guess what you need. Here is my guess:
awk -F", *" 'NF>2{$0=$1"|"$2 OFS $3}1' OFS=", " file
Brisbane|Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
As you can see, it counts the fields to decide whether a | is needed. If it is, it reconstructs the line.

How to find lines with multiple occurrences of a(ny) word in a file?

I want to find lines that have multiple occurrences of a(ny) word. For example, if the input text is
John is a teacher, who is not highly paid.
abc abcde
James lives in Detroit.
abc abc abcde
Paul has 2 dogs and 2 cats.
The output should be
John is a teacher, who is not highly paid.
abc abc abcde
Paul has 2 dogs and 2 cats.
The first output line has "is" repeated, the second has "abc" repeated and the last has "2" repeated.

^(?=.*\b(\w+)\b.*\b\1\b).*$
Try this. See the demo:
https://www.regex101.com/r/rG7gX4/6
Use this with grep -P

Here is a simple way to do it in awk
awk '{f=0;delete a;for (i=1;i<=NF;i++) if (a[$i]++) f=1} f' file
John is a teacher, who is not highly paid.
abc abc abcde
Paul has 2 dogs and 2 cats.
It loops through every word and counts them in array a.
If any word is found more than once, it sets the flag f.
If flag f is true, the default action runs: print the line.
To see how many:
awk '{f=0;delete a;for (i=1;i<=NF;i++) if (a[$i]++) f=1} f {for (i in a) if (a[i]>1) printf "%sx\"%s\"-",a[i],i;print $0}' file
2x"is"-John is a teacher, who is not highly paid.
2x"abc"-abc abc abcde
2x"2"-Paul has 2 dogs and 2 cats.
Some improvement: Ignore case. Remove . and ,.
awk '{f=0;delete a;for (i=1;i<=NF;i++) {w=tolower($i);sub(/[.,]/,"",w);if (a[w]++) f=1}} f' file
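For readability, the last variant can be spelled out as a commented sketch. Two deliberate differences from the one-liner, both my own choices rather than the answer's: it clears the array with `split("", seen)` for awks that do not accept `delete` on a whole array, and it uses `gsub` so that every `.` or `,` in a word is removed, not just the first. The names `dup_lines`, `seen` and `dup` are made up for illustration.

```shell
# Print only the lines in which some word occurs more than once.
# Words are compared case-insensitively after stripping "." and ",".
dup_lines() {
  awk '{
    split("", seen)            # clear the per-line word counts
    dup = 0
    for (i = 1; i <= NF; i++) {
      w = tolower($i)
      gsub(/[.,]/, "", w)      # drop punctuation anywhere in the word
      if (seen[w]++) dup = 1   # second sighting of the same word
    }
  }
  dup'
}

dup_lines <<'EOF'
John is a teacher, who is not highly paid.
abc abcde
James lives in Detroit.
abc abc abcde
Paul has 2 dogs and 2 cats.
EOF
```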

How to remove unique values from an HTML select list with linux program like sed, awk, or grep?

I copy the HTML from a select box and am trying to figure out a quick way to remove the markup so that I am left with a list of names. Generally it's not a problem, but these options have unique values. I would prefer using a program like grep, sed, awk or vi. Right now I have to go through manually and edit each line. Any help would be great, thank you!
<option value="DL_54292">(DL)finance</option>
<option value="DL_54274">(DL)sales</option>
<option value="510496">Ben Smith</option>
<option value="510507">Christopher Jones</option>
<option value="510513">Dawn James</option>
<option value="510533">Joe Wilson</option>
<option value="551825">Mark Jackson</option>
<option value="510562">Ronnie Libby</option>
Edit: Output format suggested by Fede.
Trying to get a simple text list, with line feed or carriage return.
finance
sales
Ben Smith
Christopher Jones
Dawn James
Joe Wilson
Mark Jackson
Ronnie Libby

Use grep to get the texts between the tags,
$ grep -oP '(?<=>)[^<>]+' file
(DL)finance
(DL)sales
Ben Smith
Christopher Jones
Dawn James
Joe Wilson
Mark Jackson
Ronnie Libby

Since you mentioned vi, you can use this line
:%s_^<option value=".*">\(.*\)</option>$_\1_gi
%s -> substitute in all the file
^ -> start of line
.* -> any characters
\(.*\) -> any characters, remember those.
$ -> end of line
\1 -> first remembered match
gi -> ignore case and take all matches in line
_ -> substitution separator
:s is search and replace, s_foo_bar replaces foo by bar in current line
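The same substitution can be sketched outside the editor with sed, which uses the same basic regex syntax for `\(...\)` groups. This is an illustration only: the function name `option_text` is made up, and it assumes each option sits on its own line as in the question.

```shell
# Keep only the text between <option ...> and </option>.
# -n together with the p flag prints only lines where the substitution
# succeeded, so unrelated lines are dropped.
option_text() {
  sed -n 's_^<option value="[^"]*">\(.*\)</option>$_\1_p'
}

option_text <<'EOF'
<option value="DL_54292">(DL)finance</option>
<option value="510496">Ben Smith</option>
EOF
```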

awk can do this:
awk -F"<|>" '{print $3}'
(DL)finance
(DL)sales
Ben Smith
Christopher Jones
Dawn James
Joe Wilson
Mark Jackson
Ronnie Libby
To match your requested output exactly, the text in parentheses should be removed too (note the escaped closing parenthesis in the regex):
awk -F"<|>" '{sub(/[^)]*\)/,"",$3);print $3}'
finance
sales
Ben Smith
Christopher Jones
Dawn James
Joe Wilson
Mark Jackson
Ronnie Libby

If you don't mind using Notepad++, then you can use this regex:
.*>(.*)<.*
And replace with \1

Regex code for address separated by commas

How can I extract the state text which is before third comma only using the regex code?
54 West 21st Street Suite 603, New York,New York,United States, 10010
I've managed to extract the rest how I wanted but this one is a problem.
Also, how can I extract the "United States" please?

It looks like you want to use capturing groups:
.*,.*,(.*),(.*),.*
The first capturing group will be "New York" and the second will be "United States" (try it on Rubular).
Or you can split by commas (which will probably be even simpler) as @Jerry points out, assuming the language/tool you're using supports that.

You can use this regex:
(?:[^,]*,){2}([^,]*)
And use captured group #1 for your desired string.

TL;DR
A lot depends on your regular expression engine, and whether you really need a regular expression or field-splitting. You can do field-splitting in Ruby and Awk (among others), but sed and grep only do regular expressions. See some examples below to get you started.
Ruby
str = '54 West 21st Street Suite 603, New York,New York,United States, 10010'
str.match /(?:.*?,){2}([^,]+)/
$1
#=> "New York"
GNU sed
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
sed -rn 's/([^,]+,){2}([^,]+).*/\2/p'
GNU awk
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
awk -F, '{print $3}'
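Since the question also asks for "United States", here is a field-splitting sketch that returns both pieces. It assumes the address always has the same number of comma-separated parts; the function name `state_and_country` is made up for illustration.

```shell
addr='54 West 21st Street Suite 603, New York,New York,United States, 10010'

# Split on commas, absorbing any spaces around them into the separator,
# then print the third field (state) and the fourth (country).
state_and_country() {
  awk -F' *, *' '{ print $3; print $4 }'
}

printf '%s\n' "$addr" | state_and_country
```

Making the spaces part of the field separator means the fields come out already trimmed.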
