Sed: removing lines where a pattern match also appears elsewhere - regex

Let's suppose I have this sample
foo/bar/123-465.txt
foo/bar/456-781.txt
foo/bar/102-445.txt
foo/bar/123-721.txt
I want to remove every line where the match of the regex /[0-9]*- also appears on another line. In other words: I want to remove every line whose file prefix occurs more than once in the file.
Therefore, keeping only:
foo/bar/456-781.txt
foo/bar/102-445.txt
I bet sed can do this, but how?

OK, I misunderstood your problem; here's how to do it:
grep -vf <(grep -o '/[0-9]*-' file | sort | uniq -d) file
In action:
cat file
foo/bar/123-465.txt
foo/bar/456-781.txt
foo/bar/102-445.txt
foo/bar/123-721.txt
grep -vf <(grep -o '/[0-9]*-' file | sort | uniq -d) file
foo/bar/456-781.txt
foo/bar/102-445.txt
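To see why this works, look at what the inner pipeline produces. A self-contained sketch (recreating the sample as a file named "file" in a temporary directory; note that <(...) is bash process substitution, so this needs bash rather than plain sh):

```shell
cd "$(mktemp -d)"
# Recreate the sample input
printf '%s\n' foo/bar/123-465.txt foo/bar/456-781.txt \
              foo/bar/102-445.txt foo/bar/123-721.txt > file
# Step 1: extract every "/digits-" prefix; uniq -d keeps only the
# prefixes that occur more than once.
grep -o '/[0-9]*-' file | sort | uniq -d
# prints: /123-
# Step 2: those duplicated prefixes become the patterns grep -v removes.
grep -vf <(grep -o '/[0-9]*-' file | sort | uniq -d) file
# prints: foo/bar/456-781.txt
#         foo/bar/102-445.txt
```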

awk '
match($0, /[0-9]*-/) {
    # id is the numeric prefix, e.g. "123-"
    id = substr($0, RSTART, RLENGTH)
    if (store[id])
        dup[id] = 1
    store[id] = $0
}
END {
    # note: for (id in store) visits keys in unspecified order
    for (id in store)
        if (!dup[id])
            print store[id]
}
' file

You can use the following awk script:
example.awk:
{
    # Get the value of interest (everything before the "-")
    prefix = substr($1, 1, match($1, /-/) - 1)
    # Increment the counter for this value
    counter[prefix]++
    # Buffer the current line
    buffer[prefix] = $0
}
# At the end, print every line whose value of interest appeared just once
END {
    for (idx in counter)
        if (counter[idx] == 1)
            print buffer[idx]
}
(idx is used instead of index because index is a built-in awk function name.)
Execute it like this:
awk -f example.awk input.file

Related

stop condition for emulating "grep -oE" with awk

I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man page says about the -o option is:
-o --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
    [ "$#" -ge 2 ] || return 1
    __regextract_ere=$1
    shift
    awk -v FS='^$' -v ERE="$__regextract_ere" '
    {
        while ( match($0, ERE) && RLENGTH > 0 ) {
            print substr($0, RSTART, RLENGTH)
            $0 = substr($0, RSTART + 1)
        }
    }
    ' "$@"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
    while (match($0, regex)) {
        if (RLENGTH) print substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        if ($0 == "") break
    }
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
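A minimal illustration of the escape difference (assuming a POSIX awk; the \t in the awk regex is translated to a literal TAB, which grep -E would not do):

```shell
# awk understands the C escape \t inside a regex constant:
printf 'a\tb\n' | awk '/a\tb/ { print "awk: matched a TAB" }'
# prints: awk: matched a TAB
```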
If you can consider a GNU awk solution, then using RS and RT can give behavior identical to grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using GNU awk with RS and RT and the same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
Thanks to the various comments and answers, I think I now have working, robust, and (maybe) efficient code:
tested on AIX/Solaris/FreeBSD/macOS/Linux
#!/bin/sh
regextract() {
    [ "$#" -ge 1 ] || return 1
    [ "$#" -eq 1 ] && set -- "$1" -
    awk -v FS='^$' '
    BEGIN {
        ere = ARGV[1]
        delete ARGV[1]
    }
    {
        tail = $0
        while ( tail != "" && match(tail, ere) ) {
            if (RLENGTH) {
                print substr(tail, RSTART, RLENGTH)
                tail = substr(tail, RSTART + RLENGTH)
            } else
                tail = substr(tail, RSTART + 1)
        }
    }
    ' "$@"
}
regextract "$@"
notes:
- I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by awk's regex engine, though (thanks @dan for pointing that out).
- Because assigning to $0 resets the values of all fields, I chose FS='^$' to limit that overhead.
- Copying $0 into a separate variable avoids the overhead of repeatedly reassigning $0 in the while loop (thanks @EdMorton for pointing that out).
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t
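The asymmetry above comes from awk's -v processing, which (per POSIX) expands C-style escape sequences in the assigned value before the regex engine ever sees it, while trailing file-style arguments are passed through verbatim. A quick way to observe it:

```shell
# A -v value undergoes C-escape expansion: '\t' arrives as one TAB character.
awk -v s='\t' 'BEGIN { print length(s) }'
# prints: 1
# The same two characters passed as a trailing argument stay untouched
# (read from ARGV before awk ever tries to open it as a file).
awk 'BEGIN { print length(ARGV[1]) }' '\t'
# prints: 2
```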
Your code will malfunction for a match that can be zero-length. Consider the following simple example: let the content of file.txt be
1A2A3
then
grep -Eo 'A*' file.txt
gives the output
A
A
Your while condition is match($0,ERE) && RLENGTH > 0; here the former part is true, but the latter is false because the match found is the zero-length one before the first character (RSTART is set to 1), so the body of the while loop is executed zero times.
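The zero-length first match is easy to observe directly with a POSIX awk:

```shell
# 'A*' first matches the empty string before '1': match() returns 1,
# with RSTART=1 and RLENGTH=0.
printf '1A2A3\n' | awk '{ m = match($0, "A*"); print m, RSTART, RLENGTH }'
# prints: 1 1 0
```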

SED regex find (and remove) option from a command text

I have a config file with param=option[,option...]. Using standard bash utilities, perhaps with the help of sed, I want to remove one option from the list.
#
param=aa,bb,cc
param=aa,bb
param=bb,cc
param=bb
#
in this example, I want to remove 'bb' (and the separator) from all lines, and in the last case, because 'bb' was the sole option, remove the complete line, so the final result will be
#
param=aa,cc
param=aa
param=cc
#
Option 'bb' can be alone or at the start, middle, or end of the list. Obviously, 'bb' embedded in another option name (i.e. xxbb, bbxx, etc.) should not be considered.
edit: fix typo, addn'l example
Here is a sed version to remove bb parameter from any position and delete the line if bb is the only parameter:
First the input file:
#
param=aa,bb,cc
param=aa,bb
param=bb,cc
param=bb
#
Now run this sed:
sed -E '/^param=/{/=bb$/d; s/,bb(,|$)/\1/; s/=bb,/=/;}' file
This will give:
#
param=aa,cc
param=aa
param=cc
#
To use inline editing use:
sed -i.bak -E '/^param=/{/=bb$/d; s/,bb(,|$)/\1/; s/=bb,/=/;}' file
Note: The solutions below do not address updating the input file; a simple (though not fully robust) approach is to use
awk '...' file > file.$$ && mv file.$$ file
A POSIX-compliant awk solution that should work robustly:
awk -F'=' '
$1 != "param" { print; next }
{
    sub(/,bb,/, ",", $2)     # bb in the middle of the list
    sub(/(^|,)bb$/, "", $2)  # bb at the end, or as the sole option
    sub(/^bb,/, "", $2)      # bb at the start of the list
    if ($2 != "") print $1 FS $2
}
' file
GNU awk allows for a simpler solution, using its (nonstandard) gensub() function:
awk -F'=' '
$1 != "param" { print; next }
{
    newList = gensub(/(^|,)bb(,|$)/, "\\2", 1, $2)
    sub(/^,/, "", newList)   # drop the leading comma left behind when bb was first
    if (newList != "") print $1 FS newList
}
' file
A (POSIX-compliant) field-based alternative (more verbose, but perhaps easier to generalize):
awk -F'=' '
$1 != "param" { print; next }
{
    n = split($2, opts, ","); optList = ""
    for (i=1; i<=n; ++i) {
        if (opts[i] != "bb") {
            optList = optList (optList == "" ? "" : ",") opts[i]
        }
    }
    if (optList != "") print $1 FS optList
}
' file
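One nice property of the field-based version is that the option to remove is a plain string comparison, so it is trivial to parameterize. A sketch (the variable name opt is my addition, not part of the answers above):

```shell
# Remove whichever option is passed via -v opt=... from each param= list.
printf 'param=aa,bb,cc\nparam=bb\n' |
awk -F'=' -v opt=bb '
$1 != "param" { print; next }
{
    n = split($2, opts, ","); optList = ""
    for (i=1; i<=n; ++i)
        if (opts[i] != opt)
            optList = optList (optList == "" ? "" : ",") opts[i]
    if (optList != "") print $1 FS optList
}'
# prints: param=aa,cc
```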
Let's say your Input_file is as follows:
param=aa,bb,cc
param=aa,bb
param=bb
Then the following code:
awk -F'=' '$2=="bb"{next} {sub(/,bb$/,""); sub(/,bb,/,","); print}' Input_file
outputs:
param=aa,cc
param=aa
(The anchored patterns avoid touching options that merely contain bb, such as aa,bbxx; note this variant does not handle bb at the start of a longer list, e.g. param=bb,cc.)
I'd use a temporary format to make the occurrences easier to find, and to remove lines I would suggest using grep:
sed 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d'
the s/=/=,/ converts it to:
param=,aa,bb,cc
param=,aa,bb
param=,bb
then s/$/,/ to:
param=,aa,bb,cc,
param=,aa,bb,
param=,bb,
then s/,bb,/,/:
param=,aa,cc,
param=,aa,
param=,
and s/=,/=/;s/,$// will remove the commas at the beginning and end again
removing empty options can be done with grep -v '=$', or some more advanced sed magic (so it can still be used with sed -i)
EDIT:
the "sed magic" is just appending '/=$/d'
tested this one, and it works fine:
sed -i 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d' filename
or
sed 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d' filename_in > filename_out
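Putting the whole chain together on a two-line sample (the param=bb line is dropped entirely by the final /=$/d):

```shell
# Wrap every option in commas, delete ",bb,", then strip the padding.
printf 'param=aa,bb,cc\nparam=bb\n' |
sed 's/=/=,/;s/$/,/;s/,bb,/,/;s/=,/=/;s/,$//;/=$/d'
# prints: param=aa,cc
```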

filtering some text from line using sed linux

I have a following content in the file:
NAME=ALARMCARDSLOT137 TYPE=2 CLASS=116 SYSPORT=2629 STATE=U ALARM=M APPL=" " CRMPLINK=CHASSIS131 DYNDATA="GL:1,15 ADMN:1 OPER:2 USAG:2 STBY:0 AVAL:0 PROC:0 UKNN:0 INH:0 ALM:20063;1406718801,"
I just want to extract the NAME, SYSPORT and ALM fields using sed.
Try the sed command below to extract the NAME, SYSPORT and ALM fields:
$ sed 's/.*\(NAME=[^ ]*\).*\(SYSPORT=[^ ]*\).*\(ALM:[^;]*\).*/\1 \2 \3/g' file
NAME=ALARMCARDSLOT137 SYSPORT=2629 ALM:20063
Why not use grep?
grep -oE 'NAME=\S*|SYSPORT=\S*|ALM:[^;]*'
test with your text:
kent$ echo 'NAME=ALARMCARDSLOT137 TYPE=2 CLASS=116 SYSPORT=2629 STATE=U ALARM=M APPL=" " CRMPLINK=CHASSIS131 DYNDATA="GL:1,15 ADMN:1 OPER:2 USAG:2 STBY:0 AVAL:0 PROC:0 UKNN:0 INH:0 ALM:20063;1406718801,"'|grep -oE 'NAME=\S*|SYSPORT=\S*|ALM:[^;]*'
NAME=ALARMCARDSLOT137
SYSPORT=2629
ALM:20063
Here is another awk approach:
awk -F" |;" -v RS=" " '/NAME|SYSPORT|ALM/ {print $1}' file
NAME=ALARMCARDSLOT137
SYSPORT=2629
ALM:20063
Whenever there are name=value pairs in input files, I find it best to first create an array mapping the names to the values and then operate on the array using the names of the fields you care about. For example:
$ cat tst.awk
function bldN2Varrs( i, fldarr, fldnr, subarr, subnr, tmp ) {
    for (i=2; i<=NF; i+=2) { gsub(/ /,RS,$i) }
    split($0, fldarr, /[[:blank:]]+/)
    for (fldnr in fldarr) {
        split(fldarr[fldnr], tmp, /=/)
        gsub(RS, " ", tmp[2])
        gsub(/^"|"$/, "", tmp[2])
        name2value[tmp[1]] = tmp[2]
        split(tmp[2], subarr, / /)
        for (subnr in subarr) {
            split(subarr[subnr], tmp, /:/)
            subName2value[tmp[1]] = tmp[2]
        }
    }
}
function prt( fld, subfld ) {
    if (subfld) print fld "/" subfld "=" subName2value[subfld]
    else        print fld "=" name2value[fld]
}
BEGIN { FS=OFS="\"" }
{
    bldN2Varrs()
    prt("NAME")
    prt("SYSPORT")
    prt("DYNDATA","ALM")
}
$ awk -f tst.awk file
NAME=ALARMCARDSLOT137
SYSPORT=2629
DYNDATA/ALM=20063;1406718801,
and if 20063;1406718801, isn't the desired value for the ALM field and you just want some subsection of it, simply tweak the array-construction function to suit whatever your criteria are.

Bash command to match n line

I have an index HTML file with a file/dir listing. It is just a usual file-browser page, like:
...content here...
<td>20120011/</td>
<td>20120111/</td>
<td>20120211/</td>
<td>20120411/</td>
...content here...
I don't understand how to extract the 2nd line from the bottom.
1) I downloaded HTML with curl
content=$(curl -sL "http://path-to-html")
2) then used
dir=$(echo $content | sed '/.*href="\([0-9]*\/\)".*/!d;s//\1/;q')
which gives me the last match : 20120411.
But how to get the previous one ?
I don't know the total count of items.
This awk program will print the penultimate line:
echo "${content}" | awk '{ pen = ult; ult = $0 } END { print pen }'
This will print the penultimate matching line:
echo "${content}" | awk '/href="([0-9]{8}\/)"/ { pen = ult; ult = $0 } END { print pen }'
If you just want to extract the first capture group:
echo "${content}" | awk 'match($0, /href="([0-9]{8}\/)"/, a) { pen = ult; ult = a[1] } END { print pen }'
Putting it all together:
bash-4.2$ dir=$(curl -sL http://www.arteetmarte.no/tmp/index.html |
awk 'match($0, /href="([0-9]{8}\/)"/, a) {
pen = ult
ult = a[1]
}
END {
print pen
}
')
bash-4.2$ echo ${dir}
20130918/
Tested with: GNU Awk 4.1.0, API: 1.0
It may be a bit easier with awk:
dir=$(echo "$content"|awk '/href=/{x=p;p=$0}END{sub(/.*">/,"",x);sub(/<.*/, "",x); print x}')
dir=$(echo "$content" | sed -n '/href="\([0-9]\{1,\}\/\)"/ {s|.*href="\([0-9]\{1,\}/\)".*|-\1-|;H;}
$ {x;s|.*-\([0-9]\{1,\}/\)-\(\n-[0-9]\{1,\}/-\)\{1\}$|\1|p;}')
The 1 in \{1\}$ specifies how many lines to count back from the end.
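If the hold-space juggling feels opaque, the same penultimate match can be obtained with an ordinary pipeline (a sketch, assuming the listing is saved as index.html and that your grep supports -o):

```shell
cd "$(mktemp -d)"
# Recreate a listing like the one in the question
printf '%s\n' '<td>20120011/</td>' '<td>20120111/</td>' \
              '<td>20120211/</td>' '<td>20120411/</td>' > index.html
# Pull out every 8-digit directory name, then keep the second-to-last one.
grep -o '[0-9]\{8\}/' index.html | tail -n 2 | head -n 1
# prints: 20120211/
```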

Remove all the text using sed

Format:
[Headword]{}"UC(icl>restriction)"(Attributes);(gloss)
The testme.txt file has 2 lines
[testme] {} "acetify" (V,lnk,CJNCT,AJ-V,VINT,VOO,VOO-CHNG,TMP,Vo) <H,0,0>;
[newtest] {} "acid-fast" (ADJ,DES,QUAL,TTSM) <H,0,0>;
The expected output is this:
testme = acetify
newtest = acid-fast
What I have achieved so far is:
cat testme.txt | sed 's/\[//g' | sed 's/\]//g' | sed 's/{}/=/g' | sed 's/"//'
testme = acetify" (V,lnk,CJNCT,AJ-V,VINT,VOO,VOO-CHNG,TMP,Vo) <H,0,0>;
newtest = acid-fast" (ADJ,DES,QUAL,TTSM) <H,0,0>;
How do I remove all the text from the second " to the end of the line?
Remove everything after the doublequote-space-openparenthesis " (:
sed 's/" (.*//g'
The whole process might be a little quicker with awk:
awk 'NF > 0 { print $1 " = " $3 }' testme.txt | tr -d '[]"'
This is how to do it with awk instead of all those sed commands, which are unnecessary. What you want is field 1 and field 3; use gsub() to remove the quotes and brackets:
$ awk '{gsub(/\"/,"",$3);gsub(/\]|\[/,"",$1);print $1" = "$3}' file
testme = acetify
newtest = acid-fast
Your whole sequence of multiple calls to sed can be replaced by:
sed 's/\[\([^]]*\)][^"]*"\([^"]*\).*/\1 = \2/' inputfile
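For reference, that single substitution applied to one line in the question's format:

```shell
# \1 captures the headword inside [...], \2 captures the quoted gloss;
# everything after the closing quote is swallowed by .*
printf '%s\n' '[testme] {} "acetify" (V,lnk) <H,0,0>;' |
sed 's/\[\([^]]*\)][^"]*"\([^"]*\).*/\1 = \2/'
# prints: testme = acetify
```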