Print only '+' or '-' if string matches (two files) - regex

I would like to print only a '+' or '-' symbol depending on whether a string is found or not. Basically, I have two files:
Input file 1 (tab-delimited):
HPNK_00457
HPNK_00458
HPNK_00459
Input file 2 (tab-delimited):
HPNK_00457 AAA50325 1e-43 437 28 43 83 ATP-binding protein.
HPNK_00458 P25256 8e-43 429 28 43 82 RecName: Full=Tylosin resistance ATP-binding protein tlrC.
HPNK_00458 CAM96590 1e-42 429 27 42 87 ABC transporter ATP-binding protein [Streptomyces ambofaciens].
Desired output (tab-delimited, maintaining order of strings in file 1):
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
This is what I've been using up to now, but it needs updating:
while read vl; do grep "^$vl " file2 || printf -- "- -\n" ; done < file1
Thanks, trying to learn every day here.

Here's one way using awk:
awk 'FNR==NR { a[$1]; next } { print $1, ($1 in a ? "+" : "-" ) }' file2 file1
Results:
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
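Since the desired output is tab-delimited, a small variation on the same command sets the output field separator explicitly:
awk 'BEGIN { OFS = "\t" } FNR==NR { a[$1]; next } { print $1, ($1 in a ? "+" : "-") }' file2 file1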

You can use:
while read -r line
do
grep -q "$line" f2 && echo "$line +" || echo "$line -"
done < f1
grep -q just returns true if it has matched something; in that case we print the string followed by "+", otherwise we print the string followed by "-".
It returns:
$ while read -r line; do grep -q "$line" f2 && echo "$line +" || echo "$line -"; done < f1
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
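Note that the unanchored grep -q "$line" matches anywhere on a line, so an ID that happens to appear inside another ID or a description could produce a false "+". If that matters, a safer variant (assuming, as in the sample data, that the IDs sit at the start of the line) anchors the match the way the original attempt did:
while read -r line; do grep -q "^$line[[:blank:]]" f2 && echo "$line +" || echo "$line -"; done < f1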

perl -lane'
BEGIN{ $, = "\t"; $x = shift; @h{ map /(\S+)/, <> } = (); @ARGV = $x }
print @F, exists $h{$F[0]} ? "+" : "-";
' file1 file2
output
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -

Here's the algorithm:
Read file 2. For each line,
Get the first word
Store it in a hash.
Read file 1. For each line, chomp it, then
print $hash{$_}? '+' : '-'
I can write the code for you, but since you want to learn every day, it will be a useful exercise to write it yourself.
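If you do want a starting point, here is a minimal Bash sketch of the same algorithm (an assumption here: bash 4+ for associative arrays, standing in for the hash):
declare -A seen
while read -r first _; do seen[$first]=1; done < file2
while read -r id; do
    [[ ${seen[$id]} ]] && echo "$id +" || echo "$id -"
done < file1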

This simple Perl script should do the work
#!/usr/bin/perl
use strict;
use warnings;

## f1 and f2 are the 2 files containing your input data
open my $fh1, '<', 'f1' or die "Cannot open f1: $!";
open my $fh2, '<', 'f2' or die "Cannot open f2: $!";
my @file1data = <$fh1>;
my @file2data = <$fh2>;
foreach my $data (@file1data) {
    chomp($data);
    # look for the ID at the start of any line of f2, not just the same row
    if (grep { /^\Q$data\E\b/ } @file2data) {
        print $data . " " . "+\n";
    }
    else {
        print $data . " " . "-\n";
    }
}

awk 'FNR==NR { a[FNR] = $1; next }
{ b[$1] }
END {
  for (i = 1; i in a; i++)
    if (a[i] in b) { print a[i], "+" }
    else           { print a[i], "-" }
}' file1 file2


detect string case and apply to another one

How can I detect the case (lowercase, UPPERCASE, CamelCase [, maybe WhATevERcAse]) of a string to apply to another one?
I would like to do it as a one-liner with sed or whatever.
This is used for a spell checker which proposes corrections.
Let's say I get something like string_to_fix:correction:
BEHAVIOUR:behavior => get BEHAVIOUR:BEHAVIOR
Behaviour:behavior => get Behaviour:Behavior
behaviour:behavior => remains behaviour:behavior
Extra case to be handled:
MySpecalCase:myspecialcase => MySpecalCase:MySpecialCase (so the character would be the point of reference, not its position in the word)
With awk you can use the POSIX character classes to detect case:
$ cat case.awk
/^[[:lower:]]+$/ { print "lower"; next }
/^[[:upper:]]+$/ { print "upper"; next }
/^[[:upper:]][[:lower:]]+$/ { print "capitalized"; next }
/^[[:alpha:]]+$/ { print "mixed case"; next }
{ print "non alphabetic" }
$ echo chihuahua | awk -f case.awk
lower
$ echo WOLFHOUND | awk -f case.awk
upper
$ echo London | awk -f case.awk
capitalized
$ echo LaTeX | awk -f case.awk
mixed case
$ echo "Jaws 2" | awk -f case.awk
non alphabetic
Here's an example taking two strings and applying the case of the first to the second:
BEGIN { OFS = FS = ":" }
$1 ~ /^[[:lower:]]+$/ { print $1, tolower($2); next }
$1 ~ /^[[:upper:]]+$/ { print $1, toupper($2); next }
$1 ~ /^[[:upper:]][[:lower:]]+$/ { print $1, toupper(substr($2,1,1)) tolower(substr($2,2)); next }
$1 ~ /^[[:alpha:]]+$/ { print $1, $2; next }
{ print $1, $2 }
$ echo BEHAVIOUR:behavior | awk -f case.awk
BEHAVIOUR:BEHAVIOR
$ echo Behaviour:behavior | awk -f case.awk
Behaviour:Behavior
$ echo behaviour:behavior | awk -f case.awk
behaviour:behavior
With GNU sed:
sed -r 's/([A-Z]+):(.*)/\1:\U\2/;s/([A-Z][a-z]+):([a-z])/\1:\U\2\L/' file
Explanations:
s/([A-Z]+):(.*)/\1:\U\2/ : searches for uppercase letters up to the : and, using a backreference and the uppercase modifier \U, changes the letters after the : to uppercase
s/([A-Z][a-z]+):([a-z])/\1:\U\2\L/ : searches for a capitalized word before the : and, if found, uppercases the first letter after the : (the trailing \L ends the uppercase conversion there)
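For example:
$ echo 'BEHAVIOUR:behavior' | sed -r 's/([A-Z]+):(.*)/\1:\U\2/;s/([A-Z][a-z]+):([a-z])/\1:\U\2\L/'
BEHAVIOUR:BEHAVIOR
$ echo 'Behaviour:behavior' | sed -r 's/([A-Z]+):(.*)/\1:\U\2/;s/([A-Z][a-z]+):([a-z])/\1:\U\2\L/'
Behaviour:Behavior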
awk -F ':' '
{
# read Pattern to reproduce
Pat = $1
printf("%s:", Pat)
# generic
if ( $1 ~ /^[[:upper:]]*$/) { print toupper( $2); next}
if ( $1 ~ /^[[:lower:]]*$/) { print tolower( $2); next}
# Specific
gsub( /[^[:upper:][:lower:]]/, "~:", Pat)
gsub( /[[:upper:]]/, "U:", Pat)
gsub( /[[:lower:]]/, "l:", Pat)
LengPat = split( Pat, aDir, /:/)
# print with the corresponding pattern
LenSec = length( $2)
for( i = 1; i <= LenSec; i++ ) {
ThisChar = substr( $2, i, 1)
Dir = aDir[ (( i - 1) % LengPat + 1)]
if ( Dir == "U" ) printf( "%s", toupper( ThisChar))
else if ( Dir == "l" ) printf( "%s", tolower( ThisChar))
else printf( "%s", ThisChar)
}
printf( "\n")
}' YourFile
This handles all cases (and uses the same concept as @Jas for the quick all-upper or all-lower patterns).
It works for this structure only (separated by :).
The second part (the text) can be longer than the first part; the pattern is applied cyclically.
This might work for you (GNU sed):
sed -r '/^([^:]*):\1$/Is//\1:\1/' file
This uses the I flag to do a caseless match of the two fields against each other and then replaces both instances with the first.
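For example, with two fields that are caseless-equal:
$ echo 'BEHAVIOR:behavior' | sed -r '/^([^:]*):\1$/Is//\1:\1/'
BEHAVIOR:BEHAVIOR
Note that the address only matches when the two fields are the same word ignoring case, so pairs with different spellings pass through unchanged.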

Search strings from bulk data

I have a folder with many files containing text like the following:
blabla
chargeableDuration 00 01 03
...
timeForStartOfCharge 14 55 41
blabla
...
blabla
calledPartyNumber 123456789
blabla
...
blabla
callingPartyNumber 987654321
I require the output like:
987654321 123456789 145541 000103
I have been trying with following awk:
awk -F '[[:blank:]:=,]+' '/findstr chargeableDuration|dateForStartOfCharge|calledPartyNumber|callingPartyNumber/ && $4{
if (calledPartyNumber != "")
print dateForStartOfCharge, "NIL"
dateForStartOfCharge=$5
next
}
/calledPartyNumber/ {
for(i=1; i<=NF; i++)
if ($i ~ /calledPartyNumber/)
break
print chargeableDuration, $i
chargeableDuration=""
}' file
Cannot make it work. Please help.
Assuming you have your text in a file named "test.txt", the Linux shell command below will do the work for you.
egrep -o "[0-9 ]{1,}" test.txt | tr -d ' \t\r\f' | sort -nr | tr "\n" "\t"
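For the sample data this should print (tab-separated, largest number first because of sort -nr):
$ egrep -o "[0-9 ]{1,}" test.txt | tr -d ' \t\r\f' | sort -nr | tr "\n" "\t"
987654321	123456789	145541	000103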
Pretty much like Manish's answer:
tac test_regex.txt | grep -oP '(?<=chargeableDuration|timeForStartOfCharge|calledPartyNumber|callingPartyNumber)\s+([^\n]+)' | tr -d " \t\r\f" | tr "\n" " "
The only difference is that this keeps the original (reversed) order instead of sorting the result numerically. For your example both solutions produce the same output, but on other data you could end up with different results.
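For the sample data:
$ tac test_regex.txt | grep -oP '(?<=chargeableDuration|timeForStartOfCharge|calledPartyNumber|callingPartyNumber)\s+([^\n]+)' | tr -d " \t\r\f" | tr "\n" " "
987654321 123456789 145541 000103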
awk '/[0-9 ]+$/{
x=substr($0,( index($0," ") + 1 ) );
gsub(" ","",x);
a[$1]=x
}
END {
split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration",b," ");
for (i=1;i<=4;i++){
printf a[(b[i])]" "
}
}' file
/[0-9 ]+$/ : Find lines ending with digits, with or without separating spaces.
x=substr($0,( index($0," ") + 1 ) ) : Get the index after the first space in $0 and save the substring after that first space (i.e. the digits) to a variable x
gsub(" ","",x) : Remove whitespace in x
a[$1]=x : Create an array a indexed by $1 and assign x to it
END:
split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration",b," ") : Create array b whose indices 1, 2, 3 and 4 hold the required field names in the order you need
for (i=1;i<=4;i++){
printf a[(b[i])]" "
} : a for loop to print the values in array a indexed by b[1], b[2], b[3] and b[4]
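For the sample file, this should print:
987654321 123456789 145541 000103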

copying first string into second line

I have a text file in this format:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375 Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375 aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375 abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Here I call the first string before the first space a word (for example abacası).
The string which starts after the first space and ends with a number is a definition (for example Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875).
I want to do this: if a line includes more than one definition (the first line has one, the second line has two, the third line has three), insert a newline and put the first string (the word) at the beginning of each new line. Expected output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
I have almost 1,500,000 lines in my text file and the number of definitions is not fixed for each line. It can be 1 to 5.
This small Python script does the job. Input is expected in input.txt; output goes to output.txt.
import re

rf = re.compile(r'(\S+\s).+')
r = re.compile(r'(\S+\s:\s\d+\.\d+)')

with open("input.txt", "r") as f:
    text = f.read()

with open("output.txt", "w") as f:
    for l in text.split('\n'):
        offset = 0
        first = ""
        match = re.search(rf, l[offset:])
        if match:
            first = match.group(1)
            offset = len(first)
        while True:
            match = re.search(r, l[offset:])
            if not match:
                break
            s = match.group(1)
            offset += match.end(1)  # advance past this match, not just its length
            f.write(first + s + "\n")
I am assuming the following format:
word definitionkey : definitionvalue [definitionkey : definitionvalue …]
None of those elements may contain a space and they are always delimited by a single space.
The following code should work:
awk '{ for (i=2; i<=NF; i+=3) print $1, $i, $(i+1), $(i+2) }' file
Explanation (this is the same code but with comments and more spaces):
awk '
# match any line
{
# iterate over each "key : value"
for (i=2; i<=NF; i+=3)
print $1, $i, $(i+1), $(i+2) # prints each "word key : value"
}
' file
awk has some tricks that you may not be familiar with. It works on a line-by-line basis. Each stanza has an optional conditional before it (awk 'NF >=4 {…}' would make sense here since we'll have an error given fewer than four fields). NF is the number of fields and a dollar sign ($) indicates we want the value of the given field, so $1 is the value of the first field, $NF is the value of the last field, and $(i+1) is the value of the third field (assuming i=2). print will default to using spaces between its arguments and adds a line break at the end (otherwise, we'd need printf "%s %s %s %s\n", $1, $i, $(i+1), $(i+2), which is a bit harder to read).
With perl:
perl -a -F'[^]:]\K\h' -ne 'chomp(@F);$p=shift(@F);print "$p ",shift(@F),"\n" while(@F);' yourfile.txt
With bash:
while read -r line
do
pre=${line%% *}
echo "$line" | sed 's/\([0-9]\) /\1\n'$pre' /g'
done < "yourfile.txt"
This script reads the file line by line. For each line, the prefix is extracted with a parameter expansion (everything up to the first space), and spaces preceded by a digit are replaced with a newline plus the prefix, using sed.
Edit: as tripleee suggested, it's much faster to do it all with sed:
sed -i.bak ':a;s/^\(\([^ ]*\).*[0-9]\) /\1\n\2 /;ta' yourfile.txt
Assuming there are always 4 space-separated words for each definition:
awk '{for (i=1; i<NF; i+=4) print $i, $(i+1), $(i+2), $(i+3)}' file
Or if the split should occur after that floating point number
perl -pe 's/\b\d+\.\d+\K\s+(?=\S)/\n/g' file
(This is the perl equivalent of Avinash's answer)
Bash and grep:
#!/bin/bash
while IFS=' ' read -r in1 in2 in3 in4; do
if [[ -n $in4 ]]; then
prepend="$in1"
echo "$in1 $in2 $in3 $in4"
else
echo "$prepend $in1 $in2 $in3"
fi
done < <(grep -o '[[:alnum:]][^:]\+ : [[:digit:].]\+' "$1")
The output of grep -o puts each definition on a separate line, but definitions originating from the same input line (other than the first) are missing the "word" at the beginning:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
The while loop now reads this, using a space as the input field separator. If in4 is a zero-length string, we're on a line where the "word" is missing, so we prepend it.
The script takes the input file name as its argument, and saving output to an output file can be done with simple redirection:
./script inputfile > outputfile
Using perl:
$ perl -nE 'm/([^ ]*) (.*)/; my $word=$1; $_=$2; say $word . " " . $_ for / *(.*?[0-9]+\.[0-9]+)/g;' < input.log
Output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Explanation:
Split the line to separate the first field as word.
Then match the remaining line repeatedly against the regex .*?[0-9]+\.[0-9]+.
Print word concatenated with every match of the above regex.
I would approach this with one of the excellent Awk answers here; but I'm posting a Python solution to point to some oddities and problems with the currently accepted answer:
It reads the entire input file into memory before processing it. This is harmless for small inputs, but the OP mentions that the real-world input is kind of big.
It needlessly uses re when simple whitespace tokenization appears to be sufficient.
I would also prefer a tool which prints to standard output, so that I can redirect it where I want it from the shell; but to keep this compatible with the earlier solution, this hard-codes output.txt as the destination file.
with open('input.txt', 'r') as infile:
    with open('output.txt', 'w') as output:
        for line in infile:
            tokens = line.rstrip().split()
            if not tokens:
                continue  # skip blank lines
            word = tokens[0]
            for idx in range(1, len(tokens), 3):
                print(word, ' '.join(tokens[idx:idx+3]), file=output)
If you really, really wanted to do this in pure Bash, I suppose you could:
while read -r word analyses; do
set -- $analyses
while [ $# -gt 0 ]; do
printf "%s %s %s %s\n" "$word" "$1" "$2" "$3"
shift; shift; shift
done
done <input.txt >output.txt
Here is a Bash script that does the same job:
#!/bin/bash
# read.sh
while read -r variable
do
    for i in "$variable"
    do
        var=$(echo "$i" | wc -w)
        array_1=( $i )
        counter=0
        for (( j=1 ; j < var ; j++ ))
        do
            if [ $counter = 0 ] #1
            then
                echo -ne "${array_1[0]} "
            fi #1
            echo -ne "${array_1[$j]} "
            counter=$(( counter + 1 ))
            if [ $counter = 3 ] #2
            then
                counter=0
                echo
            fi #2
        done
    done
done
I have tested it and it works.
To test it, give the following command at the Bash shell prompt:
$ ./read.sh < input.txt > output.txt
where read.sh is the script, input.txt is the input file and output.txt is where the output is generated.
Here is sed in action:
sed -r '/^indirger(ken|di)/{s/([0-9]+[.][0-9]+ )(indirge)/\1\n\2/g}' my_file
output
indirgerdi indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]+YDH[Past] : 22.2626953125
indirge[Verb]+[Pos]+Hr[Aor]+YDH[Past]+[A3sg] : 18.720703125
indirgerken indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]-Yken[Adv+While] : 19.6201171875

Extracting columns from previous and next line by pattern matching

I have the following file : extract_info.txt
ABC
PNG
CHNS
and to_extractfrom.txt from which I need to retrieve information:
ABC 123 234 TCHSL
NBV 234 23764 DHG
CHNS 123 347 CGJKS
CVS 233 4747 JSHGD
PNG 122 324 HGH
SJDH 373 3487 JHG
and I am running the following code
while read line
do
gene=$(echo $line | awk -F' ' '{print $1}')
app1=$(awk -v comp1="$gene" '(comp1==$1) {print $1 }' to_extractfrom.txt)
done < extract_info.txt
However, my desired output is to extract information for each ID in extract_info.txt from the file to_extractfrom.txt, such that I get the first column of the next line on the left and the first column of the previous line on the right of the pattern-matched line. That is, for the IDs in the first file, I will have the output:
NBV ABC -
SJDH PNG CVS
CVS CHNS NBV
awk '
BEGIN {prev = "-"}
# first pass: remember the IDs to look up
NR == FNR {extract[$1] = 1; next}
# the line after a match supplies the "next line" column; print the saved match
is_match {print $1, m1, m2; is_match = 0}
# current line matches an ID: save it plus the previous line's first column
$1 in extract {is_match = 1; m1 = $1; m2 = prev}
{prev = $1}
' extract_info.txt to_extractfrom.txt
NBV ABC -
CVS CHNS NBV
SJDH PNG CVS
If you must have the output in the same order as the extract_info file, and you use GNU awk, you can do
gawk '
BEGIN {prev = "-"}
NR == FNR {extract[$1] = FNR; next}
is_match {output[m1] = $1 FS m1 FS m2; is_match = 0}
$1 in extract {is_match = 1; m1 = $1; m2 = prev}
{prev = $1}
END {
PROCINFO["sorted_in"] = "#val_num_asc"
for (key in extract) print output[key]
}
' extract_info.txt to_extractfrom.txt
NBV ABC -
SJDH PNG CVS
CVS CHNS NBV

translating awk script into perl

I'm trying to translate this code into perl.
gawk '/^>c/ {OUT=substr($0,2) ".fa";print " ">OUT}; OUT{print >OUT}' your_input
Can someone help me?
Perl has a utility to do this for you called a2p. If your script is called script.awk, then you would run:
$ a2p script.awk
Which produces:
#!/usr/bin/perl
eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
if $running_under_some_shell;
# this emulates #! processing on NIH machines.
# (remove #! line above if indigestible)
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
# process any FOO=bar switches
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
while (<>) {
chomp; # strip record separator
if (/^>c/) {
$OUT = substr($_, (2)-1) . '.fa';
&Pick('>', $OUT) &&
(print $fh ' ');
}
;
if ($OUT) {
&Pick('>', $OUT) &&
(print $fh $_);
}
}
sub Pick {
local($mode,$name,$pipe) = @_;
$fh = $name;
open($name,$mode.$name.$pipe) unless $opened{$name}++;
}
To save this to a file, use redirection:
$ a2p script.awk > script.pl
Perl also provides a tool for converting sed scripts: s2p.
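Usage mirrors a2p; for example, with a hypothetical script.sed:
$ s2p script.sed > script.pl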
#!/usr/bin/perl
my ($outf, $OUT);
while (<>) {
    if (/^>(c.*)/) {
        $OUT = "$1.fa";
        close($outf) if $outf;
        open($outf, ">", $OUT) or die "Cannot open $OUT: $!";
        print $outf " \n";
    }
    if ($outf) { print $outf $_ }
}
If the input is:
>caaa
sdf
sdff
>cbbb
ew
ew
it creates 2 files:
==> caaa.fa <==
>caaa
sdf
sdff
==> cbbb.fa <==
>cbbb
ew
ew
This Perl one-liner should be equivalent to that awk command:
perl -ane 'if($F[0] =~ /^>c/){$OUT=substr($F[0],1).".fa"; open(OUT,">",$OUT) or die "$OUT: $!"; print OUT " \n"} if($OUT){print OUT $_} END{close(OUT) if $OUT}' file
Indented command line:
perl -ane 'if ($F[0] =~ /^>c/) {
               $OUT = substr($F[0], 1) . ".fa";
               open(OUT, ">", $OUT) or die "$OUT: $!";
               print OUT " \n"
           }
           if ($OUT) {
               print OUT $_
           }
           END { close(OUT) if $OUT }' file