Change value on 11th column based on 9th column using sed - regex

I have a text file with whitespace-separated values. The 9th column holds the value that needs to be matched (ice), and the 11th column needs to be substituted when it matches. Example:
a b c d e f g h ice j k l m
Intended output :
a b c d e f g h ice j keep l m
I'm trying to use this:
sed -i -r 's/ice [^ ]*/ice keep/' test.log
But it gives this:
a b c d e f g h ice keep k l m
Please help me. I'm not familiar with sed and regex.

This is more suitable for awk or any other tool that understands columns:
awk '{if ($9=="ice") {$11="keep"} print}' inputfile
Fields in awk are delimited by space by default. $9 would denote the 9th field. If the 9th field is ice, change the 11th to keep.
For your input, it'd produce:
a b c d e f g h ice j keep l m
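As a quick check of the awk approach, here is a minimal sketch run against the sample line plus one non-matching line (the second line is made up for illustration). One thing to keep in mind: assigning to a field makes awk rebuild the record using single spaces (the default OFS), so runs of whitespace in modified lines are collapsed.

```shell
# Two test lines: the first has "ice" in column 9, the second (hypothetical) does not.
input='a b c d e f g h ice j k l m
a b c d e f g h tea j k l m'

# Set field 11 only when field 9 is exactly "ice"; the trailing 1 prints every line.
out=$(printf '%s\n' "$input" | awk '$9 == "ice" { $11 = "keep" } 1')
printf '%s\n' "$out"
```

Only the first line is changed; the second passes through untouched.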
You could do it using sed too, but it's not quite straight-forward:
sed -r 's/^(([^ ]+ ){8}ice \S+ )(\S+)/\1keep/' inputfile
Here ([^ ]+ ){8} would match the first 8 fields. (([^ ]+ ){8}ice \S+ ) matches 10 fields with the 9th being ice and captures it into a group that is substituted later making use of a backreference, \1.
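A variant of the same idea that avoids \S (a GNU extension) might look like the sketch below. It assumes fields are separated by exactly one space and that your sed accepts -E for extended regular expressions; the second test line is hypothetical.

```shell
# Group 1 captures the first 10 fields (with field 9 required to be "ice");
# the trailing [^ ]+ is field 11, which gets replaced.
expr='s/^(([^ ]+ ){8}ice [^ ]+ )[^ ]+/\1keep/'

out=$(printf '%s\n' 'a b c d e f g h ice j k l m' | sed -E "$expr")
# "ice" in column 1 instead of column 9: the pattern does not match.
other=$(printf '%s\n' 'ice b c d e f g h i j k l m' | sed -E "$expr")

printf '%s\n' "$out" "$other"
```

Because the pattern is anchored with ^ and counts exactly eight fields before ice, a stray ice in another column does not trigger the substitution.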

This might work for you (GNU sed):
sed -r '/^((\S+\s+){8}ice\s+\S+\s)\S+/s//\1keep/' file
This matches the 9th non-spaced value to ice and then changes the 11th non-spaced value to keep.

This works on your sample, but it does not take the column number into account. Do you really need the column reference, or just the content?
sed '/ice/ s/k/keep/' YourFile

Related

Regex - Match a string and not match another string in the same line

I am learning regular expressions. I was trying to print lines in a file that contain a particular string and do not contain another string.
I have a few lines in the file like
k 1 : abcd
jkjkj
l 1 : efgh
kjkjk
m 1 : abok
lklk
My intention is to match lines with 1 : and not match ab on the same line.
My desired output should be 1 : efgh (This line matches 1 : and this line doesnot contain ab).
For this I have tried with regular expression ^((?!ab).*1 :*)*$. But it does not work. Can some one point out where is the issue in my expression?
As mentioned in the comments, the shell does not support lookahead.
You could pipe your text through another program like grep to get your desired regex flavor (i.e. Perl):
cat test.txt | grep --perl '1\s:(?!.*ab)'
returns
l 1 : efgh
If you need the whole line, use awk:
awk '!/ab/ && /1[[:space:]]:/' inputfile > outputfile
It outputs lines not containing ab and containing 1 + space + :.
To get a part of a line:
sed -E -n '/ab/!s/.*(1 :.*)/\1/p' inputfile > outputfile
Skip all lines containing ab, and extract capturing group value with -n + p option/flag.
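Both commands can be tried against the sample from the question. The sketch below uses a plain space instead of [[:space:]] and the basic-regex form of the sed expression; those are simplifications for the test, not a change in approach.

```shell
input='k 1 : abcd
jkjkj
l 1 : efgh
kjkjk
m 1 : abok
lklk'

# Whole line: must contain "1 :" and must not contain "ab".
whole=$(printf '%s\n' "$input" | awk '/1 :/ && !/ab/')

# Part of the line: on lines without "ab", keep only the text from "1 :" onward.
part=$(printf '%s\n' "$input" | sed -n '/ab/!s/.*\(1 :.*\)/\1/p')

printf '%s\n' "$whole" "$part"
```

The awk version prints l 1 : efgh, while the sed version prints just 1 : efgh.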

Regex contain match that should not match

Given this ; delimited string
hap;; z
z ;d;hh
z;d;hh ;gfg;fdf ;ppp
ap;jj
lo mo;z
d;23
;;io;
b yio;b;12
b
a;b;bb;;;34
I am looking to get columns $1 $2 $3 from any line that contains ap or b or o m in column 1
Using this regex
^(?:(.*?(?:ap|b|o m).*?)(?:;([^\r\n;]*))?(?:;([^\r\n;]*))?(?:;.*)?|.*)$
as shown in this demo one can see that line 11 should not be matching, but it does.
Can not use negated character class to match the before and after sections of column 1, as far as I understand.
Any help making line 11, not match?
You may consider this perl one-liner that works like awk:
perl -F';' -MEnglish -ne 'BEGIN {$OFS=";"} print $F[0],$F[1],$F[2] if $F[0] =~ /ap|b|o m/' file
An awk solution would be even simpler:
awk 'BEGIN {FS=OFS=";"} $1 ~ /ap|b|o m/{print $1,$2,$3}' file
hap;; z
ap;jj;
lo mo;z;
b yio;b;12
b ;;
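To see why the troublesome line (a;b;bb;;;34) no longer matches, here is a small sketch that feeds three of the sample lines to the same awk program. The key point is that the regex is tested against $1 only, so a b in a later column is ignored.

```shell
input='a;b;bb;;;34
ap;jj
b yio;b;12'

# The pattern is applied to the first field only; missing fields print as empty.
out=$(printf '%s\n' "$input" |
      awk 'BEGIN { FS = OFS = ";" } $1 ~ /ap|b|o m/ { print $1, $2, $3 }')
printf '%s\n' "$out"
```

The first input line is dropped because its first field is just a.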
Here is a regex that match your data:
^([^;\n]*(?:ap|b|o m)[^;]*);((?(1)[^;]*));?((?(1)[^;]*))$
You can see it in action.

awk all lines between two words in one file with multiple occurrences

I am trying to get all lines in some SQL code between WHERE and GROUP, I have the below, which gets me the first occurrence of text between WHERE and GROUP, but there are multiple occurrences of the same I am after
awk '/WHERE/{p=1} p; /GROUP/{exit}' filename.txt
Output
WHERE something
Some SQL code
GROUP BY something
There are multiple sections of the code that start with WHERE and end with GROUP BY with in the file I would like to output
Can anybody help?
It is better in awk to do something along these lines:
awk '/WHERE/{f=1} f; /GROUP/{f=0}' file
The awk range operator , works similarly to sed. However, it is difficult to modify and you limit what awk can do.
Once your awk habit includes using a flag (rather than a range) it will be easier to print between marks such as:
$ echo "a
b
c
---
d
e
f
---
g
h" | awk '/^---$/{f= ! f; next} f'
d
e
f
Which is impossible with the range operator.
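Applied to the WHERE/GROUP question, the flag idiom might be exercised like this. The SQL fragment is invented for the test; the second command is a variation that excludes the marker lines, which is easy with a flag because you control where the print happens relative to setting and clearing it.

```shell
# Hypothetical SQL with two WHERE ... GROUP BY sections.
sql='SELECT a
WHERE x = 1
AND y = 2
GROUP BY a
ORDER BY a
SELECT b
WHERE z = 3
GROUP BY b'

# Inclusive: set the flag, print, then clear the flag.
inclusive=$(printf '%s\n' "$sql" | awk '/WHERE/ { f = 1 } f; /GROUP/ { f = 0 }')

# Exclusive: clear the flag, print, then set the flag.
exclusive=$(printf '%s\n' "$sql" | awk '/GROUP/ { f = 0 } f; /WHERE/ { f = 1 }')

printf '%s\n' "$inclusive" "---" "$exclusive"
```

The inclusive form prints both sections including their WHERE and GROUP BY lines; the exclusive form prints only AND y = 2, the single line strictly between markers.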
awk '/WHERE/,/GROUP/' filename.txt

Matching the last K occurrences of a pattern in a line

Is it possible using sed/awk to match the last k occurrences of a pattern in a line?
For simplicity's sake, say I just want to match the last 3 commas in each line, for example (note that the two lines have a different number of total commas):
10, 5, "Sally went to the store, and then , 299, ABD, F, 10
10, 6, If this is the case, and also this happened, then, 299, A, F, 9
I want to match only the commas starting from 299 until the end of the line in both cases.
Motivation: I'm trying to convert a CSV file with stray commas inside one of the fields to tab-delimited. Since the number of proper columns is fixed, my thinking was to replace the first couple commas with tabs up until the troublesome field (which is straightforward), and then go backwards from the end of the line to replace again. This should convert all proper delimiter commas to tabs, while leaving commas intact in the problematic field.
There's probably a smarter way to do this, but I figured this would be a good sed/awk teaching point anyways.
another sed alternative. Replace last 3 commas with tabs
$ rev file | sed 's/,/\t/;s/,/\t/;s/,/\t/' | rev
10, 5, "Sally went to the store, and then , 299 ABD F 10
with GNU sed, you can simply write
$ sed 's/,/\t/g5' file
10, 5, "Sally went to the store, and then , 299 ABD F 10
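This replaces every comma from the 5th onward, so it only works when the number of commas before the last three is fixed. It gives the intended result for the first sample line (seven commas) but not for the second (eight commas).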
You can use Perl to add the missing double quote into each line:
perl -aF, -ne '$F[-5] .= q("); print join ",", @F' < input > output
or, to turn the commas into tabs:
perl -aF'/,\s/' -ne 'splice @F, 2, -4, join ", ", @F[ 2 .. $#F - 4 ]; print join "\t", @F' < input > output
-n reads the input line by line.
-a splits the input into the @F array on the pattern specified by -F.
The first solution adds the missing quote to the fifth field from the right; the second one replaces the items from the third to the fifth from right with those elements joined by ", ", and separates the resulting array with tabs.
To fix the CSV, I would do this:
echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
perl -lne '
@F = split /, /; # field separator is comma and space
@start = splice @F, 0, 2; # first 2 fields
@end = splice @F, -4, 4; # last 4 fields
$string = join ", ", @F; # the stuff in the middle
$string =~ s/"/""/g; # any double quotes get doubled
print join(",", @start, "\"$string\"", @end);
'
outputs
10,5,"""Sally went to the store, and then ",299,ABD,F,10
One regex that matches each of the three last commas separately would require a negative lookahead, which sed does not support.
You can use the following sed-regex to match the last three fields and the commas directly before them all at once:
,[^,]*,[^,]*,[^,]*$
$ matches the end of the line.
[^,] matches anything but ,.
Groups allow you to re-use the field values in sed:
sed -r 's/,([^,]*),([^,]*),([^,]*)$/\t\1\t\2\t\3/'
For awk, have a look at How to print last two columns using awk.
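The anchored substitution can be checked against both sample lines. This sketch uses -E instead of -r and passes a literal tab through a shell variable, since \t in the replacement is not understood by every sed; the tr step at the end only makes the tabs visible as | for display.

```shell
input='10, 5, "Sally went to the store, and then , 299, ABD, F, 10
10, 6, If this is the case, and also this happened, then, 299, A, F, 9'

tab=$(printf '\t')

# The $ anchor pins the match to the last three commas, however many precede them.
out=$(printf '%s\n' "$input" |
      sed -E "s/,([^,]*),([^,]*),([^,]*)\$/${tab}\1${tab}\2${tab}\3/")

# Show tabs as "|" so the result is easy to read.
shown=$(printf '%s\n' "$out" | tr '\t' '|')
printf '%s\n' "$shown"
```

Unlike a count from the left, this works for both lines even though they contain a different number of commas. The space that followed each comma is left in place after the tab.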
There's probably a smarter way to do this
In case all your wanted commas are followed by a space and the unwanted commas are not, how about
sed 's/,\([^ ]\)/.\1/g'
This transforms a, b, 12,3, c into a, b, 12.3, c. The capture group is needed so that the character following the comma is kept.
Hi I guess this is doing the job
echo 'a,b,c,d,e,f' | awk -F',' '{i=3; for (--i;i>=0;i--) {printf "%s\t", $(NF-i) } print ""}'
Returns
d e f
But you need to ensure each line has at least 3 fields.
This will do what you're asking for with GNU awk for the 3rd arg to match():
$ cat tst.awk
{
gsub(/\t/," ")
match($0,/^(([^,]+,){2})(.*)((,[^,]+){3})$/,a)
gsub(/,/,"\t",a[1])
gsub(/,/,"\t",a[4])
print a[1] a[3] a[4]
}
$ awk -f tst.awk file
10 5 "Sally went to the store, and then , 299 ABD F 10
10 6 If this is the case, and also this happened, then, 299 A F 9
but I'm not convinced what you're asking for is a good approach so YMMV.
Anyway, note the first gsub() making sure you have no tabs on the input line - that is crucial if you want to convert some commas to tabs to use tabs as output field separators!
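If GNU awk is not available, roughly the same result can be had by splitting on commas and re-joining. This is only a sketch: the function name commas_to_tabs is made up, and it assumes every line has exactly 2 proper leading fields and 3 proper trailing fields, with all stray commas in between.

```shell
# Hypothetical helper: tabs between the 2 leading and 3 trailing fields,
# original commas kept inside the middle field.
commas_to_tabs() {
    awk -F',' '{
        gsub(/\t/, " ")                       # no tabs may survive from the input
        out = $1 "\t" $2 "\t" $3              # two leading fields, start of the middle
        for (i = 4; i <= NF - 3; i++)         # rest of the middle keeps its commas
            out = out "," $i
        for (i = NF - 2; i <= NF; i++)        # three trailing fields
            out = out "\t" $i
        print out
    }'
}

printf '%s\n' 'a,b,c,d,e,f,g' | commas_to_tabs | tr '\t' '|'
```

For the test line, c,d is treated as the problematic middle field and stays comma-separated. Lines with fewer than 5 commas would not satisfy the stated assumption and are not handled here.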

Using awk to find a domain name containing the longest repeated word

For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use linux awk regex expressions to find the line that contains the longest repeated [1] word, so in this case, it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
[1] Meaning "immediately repeated", i.e., abcabc, but not abcXabc.
A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I've added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number to avoid matches for lines numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
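For comparison, a brute-force version in plain awk sidesteps the overlap problem by comparing substrings directly instead of relying on regex backreferences. This is a sketch, not a tuned solution: the function name longest_repeat is made up, and the nested loops make it slow on long strings.

```shell
input='1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org'

out=$(printf '%s\n' "$input" | awk -F',' '
# Length of the longest unit that is immediately repeated in s (0 if none).
function longest_repeat(s,    n, len, i) {
    n = length(s)
    for (len = int(n / 2); len >= 1; len--)
        for (i = 1; i + 2 * len - 1 <= n; i++)
            if (substr(s, i, len) == substr(s, i + len, len))
                return len
    return 0
}
{
    r = longest_repeat($2)
    if (r > max) { max = r; best = $0 }           # new record: start over
    else if (r > 0 && r == max) best = best "\n" $0   # tie: keep both lines
}
END { if (max > 0) print best }')
printf '%s\n' "$out"
```

Because every start position and every length is tried, a string such as welwelcomewelcome is scored by welcome (length 7), not by the earlier wel.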
Alternative solution with more awk, less sort
As pointed out by tripleee in comments, this can be simplified by replacing the sort step and the two awk steps with a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
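The single-pass step can be tried in isolation on the intermediate output shown earlier. The sketch below differs from the code above in two small ways, both my own choices: it empties the array with split("", ...) and it prints with a numeric loop, because the order of for (idx in array) is not specified in awk.

```shell
matches='ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel'

out=$(printf '%s\n' "$matches" | awk '
{
    if (length($0) > max_len) {     # new longest match: reset the list
        max_len = length($0)
        split("", longest)          # empties the array
        n = 0
    }
    if (length($0) == max_len)      # current line ties the longest so far
        longest[++n] = $0
}
END {
    for (i = 1; i <= n; i++)        # numeric loop keeps input order
        print longest[i]
}')
printf '%s\n' "$out"
```

Only the two 14-character sequences survive, in the order they appeared in the input.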
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command for the first solution, or no keeping track of multiple longest matches with awk in the second solution), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't do grep -f - to read from stdin. In this case, the output of the pipeline up to that point has to be redirected to a temp file, which is then used with grep -f.
A way with perl:
perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
@m=sort { length $b <=> length $a} @m;
$cl=length @m[0];
if ($l<$cl) { @res=($_); $l=$cl; } elsif ($l==$cl) { push @res, ($_); }
}
END { print @res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field, then the match array is sorted and the longest substring becomes the first item in the array (@m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
details:
command line option:
-F, set the field separator to ,
-ane (e execute the following code, n read a line at a time and puts its content in $_, a autosplit, using the defined FS, and puts fields in the @F array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-known pattern to find all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position", but it doesn't match any character). To obtain the characters matched in the lookahead, all that you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care if the characters have been already captured in group 1 before).