Extract substring from rows with regex and remove rows with duplicate substring - regex

I have a text file with some rows in the following form
*,[anything, even blanks],[dog|log|frog],[dog|log|frog],[0|1],[0|1],[0|1]
I would like to remove duplicate rows that have the same value for * (case insensitive), i.e. everything to the left of ,[anything, even blanks],[dog|log|frog],[dog|log|frog],[0|1],[0|1],[0|1]
For example here's a sample text file
test,bar,log,dog,0,0,0
one
foo,bar,log,dog,0,0,0
/^test$/,bar,log,dog,0,0,0
one
FOO,,frog,frog,1,1,1
The resulting text file should have the duplicate foo removed (order does not matter to me so long as the duplicates are removed, leaving 1 unique)
test,bar,log,dog,0,0,0
one
/^test$/,bar,log,dog,0,0,0
one
FOO,,frog,frog,1,1,1
What's the simplest bash command I could do to achieve this?

awk -F, '!seen[tolower($1)]++' file

You can do this with awk like this (since you don't care which of the duplicates gets kept):
awk -F, '{lines[tolower($1)]=$0}END{for (l in lines) print lines[l]}'
If you wanted to keep the first instead:
awk -F, '{if (lines[tolower($1)]!=1) { print; lines[tolower($1)]=1 } }'
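To make the keep-first idiom concrete, here is a minimal sketch that wraps the short !seen[...]++ one-liner from this thread in a function reading standard input. The function name dedupe_first_field and the three sample rows are made up for illustration.

```shell
# Hypothetical helper: keep only the first row for each first field,
# compared case-insensitively. Reads rows from standard input.
dedupe_first_field() {
  # seen[key]++ evaluates to 0 (false) the first time a key appears,
  # so the negation is true only for the first occurrence, and awk's
  # default action prints that row.
  awk -F, '!seen[tolower($1)]++'
}

# Illustrative input: "FOO" duplicates "foo" once case is ignored,
# so only the first two rows are printed.
printf '%s\n' 'foo,bar,log,dog,0,0,0' 'test,bar,log,dog,0,0,0' 'FOO,,frog,frog,1,1,1' |
  dedupe_first_field
```

Note that rows without any comma are keyed on the whole line, so repeated plain lines such as one are collapsed as well.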

Search for
(?:(?<=\n)|^)(.*)((?:,(?:d|l|fr)og){2}(?:,[01]){3})(?=\n)([\s\S]*)(?<=\n).*\2(?:\n|$)
...and replace with
$1$2$3

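#!/bin/bash
# Print each line of file $1 unless its first field (case-insensitive) was already seen.
keys=()
while IFS= read -r line
do
key=$( printf '%s\n' "${line%%,*}" | awk '{print tolower($0)}')
found=0
for k in "${keys[@]}" ; do [[ "$k" == "$key" ]] && found=1 && break ; done
(( found )) && continue
printf '%s\n' "$line"
keys=( "${keys[@]}" "$key" )
done < "$1"
This uses a plain array instead of an associative array (hash), so every lookup is a linear scan and it is less performant. But it seems to work.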

This might work for you (GNU sed):
cat -n file |
sort -fk2,2 |
sed -r ':a;$!N;s/^.{7}([^,]*),[^,]*(,(d|l|fr)og){2}(,[01]){3}\n(.{7}\1,[^,]*(,(d|l|fr)og){2}(,[01]){3})$/\5/i;ta;P;D' |
sort -n |
sed -r 's/^.{7}//'
Number each line.
Sort by the first key (ignoring case)
Remove duplicates (based on specific criteria)
Sort reduced file back into original order
Remove line numbers
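The five steps above amount to a decorate, sort, undecorate pipeline. As a rough sketch of the same idea without the long sed expression, the following swaps in awk for the numbering and duplicate-removal steps; it is not the answer's exact command. The name dedupe_keep_order is hypothetical, keys are assumed to contain no tab characters, and unlike the sed version it keys on the first field of every line, whether or not the rest of the row has the dog/log/frog form.

```shell
# Hypothetical helper: drop rows whose first field (ignoring case) was
# already seen, keeping the earliest row and the original line order.
dedupe_keep_order() {
  tab=$(printf '\t')
  # 1. Decorate: line number, lowercased key, original line.
  awk -F, -v OFS='\t' '{ print NR, tolower($1), $0 }' |
    # 2. Sort by key, then by line number so the earliest row comes first.
    LC_ALL=C sort -t "$tab" -k2,2 -k1,1n |
    # 3. Drop rows whose key equals the previous key (compared as strings).
    awk -F '\t' 'NR == 1 || ($2 "") != (prev "") { print } { prev = $2 }' |
    # 4. Sort back into the original order.
    LC_ALL=C sort -t "$tab" -k1,1n |
    # 5. Undecorate: strip the line number and the key.
    cut -f3-
}

# Usage (file is a placeholder name): dedupe_keep_order < file
```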

Bash - numbers of multiple lines matching regex (possible oneliner?)

I'm not very fluent in bash but actively trying to improve, so I'd like to ask some experts here for a little suggestion :)
Let's say I've got the following text file:
Some
spam
about which I don't care.
I want following letters:
X1
X2
X3
I do not want these:
X4
X5
Nor this:
X6
But I'd like these, too:
I want following letters:
X7
And so on...
And I'd like to get the line numbers of the lines with these letters, so my desired output should look like:
5 6 7 15
To clarify: I want all lines matching the regex /\s*X./ that occur right after a match of another regex, /\sI want following letters:/
Right now I've got a working solution, which I don't really like:
cat data.txt | grep -oPz "\sI want following letters:((\s*X.)*)" | grep -oPz "\s*X." > tmp.txt
for entry in $(cat tmp.txt); do
grep -n $entry data.txt | cut -d ":" -f1
done
My question is: is there a smarter way, or a tool I don't know about, that can do this in one line? (I especially don't like having to use a temp file and a loop here.)
You can use awk:
awk '/I want following/{p=1;next}!/^X/{p=0;next}p{print NR}' file
Explanation in multiline version:
#!/usr/bin/awk -f
/I want following/{
# Just set a flag and move on with the next line
p=1
next
}
!/^X/ {
# On all other lines that don't start with an X
# reset the flag and continue to process the next line
p=0
next
}
p {
# If the flag p is set it must be a line with X+number.
# print the line number NR
print NR
}
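As a quick check of the flag logic, here is a sketch that feeds the question's sample text to the same awk program and joins the line numbers onto one line with paste, which gives the layout the question asked for. The function name wanted_line_numbers is made up for illustration.

```shell
# Hypothetical helper: print the line numbers of the X lines that directly
# follow an "I want following" line, space-separated on a single line.
wanted_line_numbers() {
  awk '/I want following/ { p = 1; next }
       !/^X/              { p = 0; next }
       p                  { print NR }' |
    paste -s -d ' ' -
}

# The sample file from the question; prints: 5 6 7 15
wanted_line_numbers <<'EOF'
Some
spam
about which I don't care.
I want following letters:
X1
X2
X3
I do not want these:
X4
X5
Nor this:
X6
But I'd like these, too:
I want following letters:
X7
And so on...
EOF
```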
The following may help you here.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1} flag' Input_file
The above also prints the lines containing I want following letters: themselves; if you don't want those, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag' Input_file
To print line numbers instead of the lines, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag{print FNR}' Input_file
First, let's optimize your current script a little:
#!/bin/bash
FILE="data.txt"
while read -r entry; do
[[ $entry ]] && grep -n "$entry" "$FILE" | cut -d ":" -f1
done < <(grep -oPz "\sI want following letters:((\s*X.)*)" "$FILE"| grep -oPz "\s*X.")
And here are some comments:
No need to use cat file | grep ...; use grep ... file instead
Do not use the for i in $(command) syntax; it's often the cause of multiple bugs and there's always a smarter solution.
No need to use a tmp file either
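To illustrate the second point, a small sketch: for ... in $(command) splits the command's output on any whitespace, not only on newlines, whereas while read consumes one line at a time. Both function names are hypothetical.

```shell
# Count loop iterations when iterating over the given lines with for ... in $(...).
count_with_for() {
  n=0
  # Unquoted $(...) is split on spaces, tabs and newlines (and glob-expanded).
  for item in $(printf '%s\n' "$@"); do
    n=$((n + 1))
  done
  echo "$n"
}

# Count loop iterations when reading the same lines with while read.
count_with_read() {
  printf '%s\n' "$@" | {
    n=0
    # IFS= and -r keep each line intact (no trimming, no backslash handling).
    while IFS= read -r item; do
      n=$((n + 1))
    done
    echo "$n"
  }
}

count_with_for  'X1 extra' 'X2'   # prints 3: "X1 extra" was split in two
count_with_read 'X1 extra' 'X2'   # prints 2: one iteration per line
```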
And then, there are a lot of shorter possible solutions. Here's one using awk:
$ awk '{ if($0 ~ "I want following letters:") {s=1} else if(!($0 ~ "^X[0-9]*$")) {s=0}; if (s && $0 ~ "^X[0-9]*$") {gsub("X", ""); print}}' data.txt
1
2
3
7

How to display words as per given number of letters?

I have created this basic script:
#!/bin/bash
file="/usr/share/dict/words"
var=2
sed -n "/^$var$/p" /usr/share/dict/words
However, it's not working as required (or it still needs some more logic).
Here, it should print only 2-letter words, but it gives different output.
Can anyone suggest ideas on how to achieve this with sed or with awk?
it should print only 2 letter words
Your sed command only searches for lines whose entire text is 2.
You can use awk for this:
awk 'length() == 2' file
Or using a shell variable:
awk -v n="$var" 'length() == n' file
What you are executing is:
sed -n "/^2$/p" /usr/share/dict/words
This means: all lines consisting of exactly the number 2 and nothing else. Of course this does not return anything, since /usr/share/dict/words contains words and not numbers (as far as I know).
If you want to print the lines consisting of exactly two characters, you need to use something like .. (since . matches any character):
sed -n "/^..$/p" /usr/share/dict/words
To make the number of characters variable, use an interval quantifier {} (note the \ before each brace, which sed's BRE syntax requires):
sed -n "/^.\{2\}$/p" /usr/share/dict/words
Or, with a variable:
sed -n '/^.\{'"$var"'\}$/p' /usr/share/dict/words
Note that the variable is placed outside the single quotes, in its own double quotes, for safety (thanks to Ed Morton in the comments for the reminder).
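Putting the pieces together, a minimal sketch of a reusable filter. The function name words_of_length is hypothetical, and the digit check is an added precaution rather than part of the answer: it keeps anything other than a plain number out of the sed expression.

```shell
# Hypothetical helper: print the input lines that are exactly N characters long.
words_of_length() {
  case $1 in
    ''|*[!0-9]*)
      # Reject empty or non-numeric lengths before they reach sed.
      echo "usage: words_of_length N" >&2
      return 1
      ;;
  esac
  # The variable sits outside the single quotes, in its own double quotes.
  sed -n '/^.\{'"$1"'\}$/p'
}

printf '%s\n' a ab abc to | words_of_length 2   # prints ab and to
# With a real word list: words_of_length 2 < /usr/share/dict/words
```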
Pure bash... :)
file="/usr/share/dict/words"
var=2
#building a regex
str=$(printf "%${var}s")
re="^${str// /.}$"
while read -r word
do
[[ "$word" =~ $re ]] && echo "$word"
done < "$file"
It builds a regex of the form ^..$ (the number of dots is variable), in 2 steps:
create a string of spaces of the desired length: given no argument to format, printf "%2s" prints only the padding, here 2 spaces
since the length is in the variable var, the format becomes %${var}s
replace all spaces in the string with .
But don't use this solution. It is too slow, and there are better utilities for this; the best is IMHO grep.
file="/usr/share/dict/words"
var=5
grep -P "^\w{$var}$" "$file"
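Returning to the regex-building steps of the pure bash version, they can be tried on their own. This sketch uses tr rather than bash's ${str// /.} so that it also runs in a plain POSIX shell, and it hands the result to grep -E instead of looping in the shell; build_length_regex is a made-up name.

```shell
# Hypothetical helper: print a regex matching lines of exactly N characters.
build_length_regex() {
  # Step 1: with an empty argument, %Ns prints N padding spaces.
  # Step 2: tr turns every space into a dot.
  dots=$(printf "%${1}s" '' | tr ' ' '.')
  printf '^%s$\n' "$dots"
}

re=$(build_length_regex 2)
echo "$re"                                   # prints ^..$
printf '%s\n' a ab abc to | grep -E "$re"    # prints ab and to
```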
Try awk:
awk -v var=2 '{if (length($0) == var) print $0}' /usr/share/dict/words
This can be shortened to
awk -v var=2 'length($0) == var' /usr/share/dict/words
which has the same effect.
To output only lines matching 2 alphabetic characters with grep:
grep '^[[:alpha:]]\{2\}$' /usr/share/dict/words
With GNU awk and mawk, at least (this relies on an empty FS, which is an extension rather than standard awk):
$ awk -F '' 'NF==2' /usr/share/dict/words #| head -5
aa
Ab
ad
ae
Ah
An empty FS puts each character into its own field, so NF equals the record length.

grep line with exact pattern in first column

I have this script :
while read line; do grep $line my_annot | awk '{print $2}' ; done < foo.txt
But it doesn't return what I want.
The problem is that when foo.txt contains, for instance, Contig1, the script also returns column 2 of my_annot for lines where the pattern found is Contig12 and not just Contig1!
I tried adding $ at the end of the pattern, but $ matches the end of the line, while the expression I am searching for is in column 1 and therefore not at the end of the line.
How can I tell grep to search for this EXACT pattern and not for strings that merely contain it?
####### ANSWER :
My script is :
annot='/home/mu/myannot'
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' "$1" "$annot" > out
It lets me pass the list of expressions I want to find as the first argument, by running ./myscript.sh mylist
And I redirect the result in a file called out.
Thank you guys !!!!
You should use awk to do the whole thing:
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' foo.txt my_annot
This reads each line of foo.txt, setting a key in the array line, then prints the second column of any lines whose first column exactly matches one of the keys in the array.
Of course I have made a guess that the format of your data is the same as in the other answer.
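To see the exact-match behaviour, here is a small sketch with made-up data (an ID in column 1, an annotation in column 2). The function name and the file contents are hypothetical.

```shell
# Hypothetical helper. $1: file with one ID per line; $2: annotation file.
lookup_annotations() {
  # While reading the first file (NR == FNR), store each line as an array key.
  # For the second file, print column 2 when column 1 is one of those keys.
  awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' "$1" "$2"
}

dir=$(mktemp -d)
printf '%s\n' 'Contig1' > "$dir/foo.txt"
printf '%s\n' 'Contig1 hugo' 'Contig12 paul' > "$dir/my_annot"
lookup_annotations "$dir/foo.txt" "$dir/my_annot"   # prints hugo
rm -r "$dir"
```

Because in tests for an exact array key, Contig12 does not match Contig1. One caveat: if the ID file is empty, NR == FNR is also true while reading the annotation file, and nothing is printed.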
So you have a file like
Contig1 hugo
Contig12 paul
right?
Then this will help:
awk '$1~/^Contig1$/ {print $2}' foo.txt
I think this is what you want
while read line; do grep -w $line my_annot | awk '{print $2}' ; done < foo.txt
But it's not 100% clear (because of a lack of example data) whether it will work in all cases.

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is the most elegant and simple method to achieve this: sed, awk or just unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
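A short sketch of how this behaves for the zero- and one-delimiter cases raised in the question. By default cut passes through unchanged any line that contains no delimiter (that is what its -s option turns off), so no special handling is needed. The function name is made up.

```shell
# Hypothetical helper: keep at most the first two dash-separated fields.
first_two_fields() {
  cut -d '-' -f 1,2
}

echo 'After-u-math-how-however' | first_two_fields   # prints After-u
echo 'After-u' | first_two_fields                     # prints After-u
echo 'After' | first_two_fields                       # prints After
```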
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
@EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one but since you asked about sed and awk:
With GNU sed for -r:
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1","")}1'
After-u
This can also be done with non-GNU sed using \( \) and *, and with non-GNU awk using match() and substr(), if necessary.
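As a sketch of what those portable versions might look like (one possible spelling, not necessarily what the answerer had in mind): BRE has no +, so [^-]+ becomes [^-][^-]*, and awk's match() sets RSTART and RLENGTH, which substr() then uses to cut the line. Both function names are hypothetical.

```shell
# Hypothetical helper: portable sed with a BRE group and no -r.
strip_after_second_sed() {
  sed 's/\([^-][^-]*-[^-]*\).*/\1/'
}

# Hypothetical helper: portable awk with match() and substr(), no gensub().
strip_after_second_awk() {
  awk 'match($0, /[^-]+-[^-]*/) { $0 = substr($0, 1, RSTART + RLENGTH - 1) }
       { print }'
}

echo 'After-u-math-how-however' | strip_after_second_sed   # prints After-u
echo 'After-u-math-how-however' | strip_after_second_awk   # prints After-u
echo 'After' | strip_after_second_sed                       # prints After
```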
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2 && i<=NF; i++) printf "-%s",$i; print ""}'

delete characters in lines starting with a unique pattern

I have a file consisting of many entries that look like this:
>1761420406686363113470.1
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA
i.e. a header line starting with > and many lines of sequence, followed by a header line.
I am trying to write a sed script that goes to only the lines that start with > (not the sequences lines) and deletes all but the first 10 numbers.
There are a lot of similar questions to this, but I can't figure it out. I've been trying variations on this code:
sed 's/^>..........*/^>........../' input.fasta
but clearly am not doing it right..
This might work for you (GNU sed):
sed -r 's/^(>.{10}).*/\1/p;d' file
This prints only the lines on which a substitution was made and deletes all the others. If you want to retain the sequence lines:
sed -r 's/^(>.{10}).*/\1/' file
should fit the bill.
You have to capture the first 10 characters in parentheses:
sed -e 's/^\(>..........\).*/\1/'
Which can be shortened to
sed -e 's/^\(>.\{10\}\).*/\1/'
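A quick sketch applying the shortened form to the record from the question. The /^>/ address is an optional addition that makes it explicit that only header lines are touched; the pattern itself already requires a leading >. The function name is hypothetical.

```shell
# Hypothetical helper: on header lines, keep ">" plus the next 10 characters;
# sequence lines pass through unchanged.
trim_fasta_headers() {
  sed -e '/^>/s/^\(>.\{10\}\).*/\1/'
}

# Prints >1761420406 followed by the untouched sequence line.
printf '%s\n' '>1761420406686363113470.1' \
  'CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA' |
  trim_fasta_headers
```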
As an alternative to sed, use cut:
$ echo ">1761420406686363113470.1" | cut -c1-11
>1761420406
To operate only on lines starting with >, wrap it in a bash while loop:
$ while IFS= read -r line; do if [[ $line == \>* ]]; then cut -c1-11 <<< "$line"; else echo "$line"; fi; done < input
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA
or using awk:
$ awk '{if ($0 ~ /^>/){print substr($0,1,11)}else{print}}' input
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA
Since good sed answers are already posted, here is a GNU awk solution.
gawk '/^>/{print gensub(/(.{11}).*/,"\\1","G",$1);next }1' inputFile