How to merge two files based on data in multiple columns?

I have two separate files, each containing a different number of columns which I want to merge based on data in multiple columns.
file1
VMNF01000015.1 1769465 1769675 . . - Focub_II5_mimp_1
VMNF01000014.1 3225875 3226081 . . + Focub_II5_mimp_1
VMNF01000014.1 3226046 3226081 . . - Focub_II5_mimp_1
VMNF01000014.1 3585246 3585281 . . - Focub_II5_mimp_1
VMNF01000014.1 3692468 3692503 . . - Focub_II5_mimp_1
VMNF01000014.1 3715380 3715415 . . + Focub_II5_mimp_1
VMNF01000014.1 2872478 2872511 . . - Focub_II5_mimp_1
file2
VMNF01000014.1 3225875-3226081(+) gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3226046-3226081(-) tacacacctgcgaatactttttgcatcccactgta
VMNF01000015.1 1769465-1769675(-) gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3692468-3692503(-) tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1 3715380-3715415(+) gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3585246-3585281(-) tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1 2872478-2872511(-) gtacttcagcctggattcaaacttattgcatcccactgta
First, I think I need to create another 2 columns in file2, separating numbers by "-" and creating a new column for "(*)", but I cannot work out how to separate the numbers without replacing "(-)" too. So far I have been using this command:
awk '{gsub("-","\t",$2);print;}'
Once this has been done, I would like to add the last column in file2 to file1. I have been able to do this using the following command:
awk 'NR==FNR {a[$1]=$3; next} {print $1,$2,$3,$4,$5,$6,$7,a[$1];}' file2 file1 > file3
However, the data does not match. It is matched based on the entry in column 1. The data in column 1 is the same in many instances, so the data in column 8 of file3 only matches one of the entries, and doesn't match the data in column 2 or 3 in file1 e.g.
file3:
VMNF01000015.1 1769465 1769675 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3225875 3226081 . . + Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3226046 3226081 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3585246 3585281 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3692468 3692503 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3715380 3715415 . . + Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 2872478 2872511 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
Even if I were able to separate the data in column 2 of file2, I would still have the same problem, as the data in column 2 is the same in some instances. What I need is code that says something along the lines of: separate the data in column 2 (see below);
VMNF01000014.1 3225875 3226081 (+) gtacttcagcctggattcaaacttattgcatcccactgta
then:
if $1,$2,$3 in file1 match $1,$2,$3 in file2, print $1,$2,$3,$4,$5,$6,$7 from file1 and add $5 from file2.
How can I do this? I know that awk can use if statements, but I don't know how to use them in awk.
Any advice?

Could you please try the following?
awk '
FNR==NR{
split($2,array,"[-(]")
mainarray[$1,array[1],array[2]]=$NF
next
}
(($1,$2,$3) in mainarray){
print $0,mainarray[$1,$2,$3]
}
' Input_file2 Input_file1
2nd solution: Since the OP is getting an error with the above code, here is a slightly changed version that builds the key explicitly.
awk '
FNR==NR{
split($2,array,"[-(]")
key=$1 OFS array[1] OFS array[2]
mainarray[key]=$NF
next
}
{ key = $1 OFS $2 OFS $3 }
(key in mainarray){
print $0,mainarray[key]
}
' Input_file2 Input_file1
Explanation: A detailed explanation of the first solution.
awk ' ##Starting awk program from here.
FNR==NR{ ##This condition is true only while Input_file2 (the first file) is being read.
split($2,array,"[-(]") ##Splitting 2nd field into an array named array, where the delimiter is - OR (
mainarray[$1,array[1],array[2]]=$NF ##Creating mainarray with index $1,array[1],array[2]; its value is the last field of the current line.
next ##next will skip all further statements from here.
}
(($1,$2,$3) in mainarray){ ##Checking whether $1,$2,$3 of the current line is present in mainarray.
print $0,mainarray[$1,$2,$3] ##Printing current line followed by the value of mainarray at index $1,$2,$3
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.
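The question's first step, splitting column 2 of file2 into start, end, and strand, is not needed by the solutions above, which only use the two coordinates. A minimal sketch of that step on its own might look like this. The name split_coords is a hypothetical helper, and the sketch assumes column 2 always ends in a three-character strand marker such as (+) or (-).

```shell
#!/bin/sh
# Hypothetical helper: turn "start-end(strand)" in column 2 into three columns.
# Assumes every line has the strand marker "(+)" or "(-)" at the end of field 2.
split_coords() {
    awk '{
        coords = $2
        strand = substr(coords, length(coords) - 2)   # last three characters, e.g. "(-)"
        sub(/\(.\)$/, "", coords)                     # drop the strand marker only
        split(coords, p, "-")                         # p[1] = start, p[2] = end
        print $1, p[1], p[2], strand, $3
    }' "$@"
}

# Demo on one line of file2 from the question.
printf '%s\n' 'VMNF01000014.1 3226046-3226081(-) tacacacctgcgaatactttttgcatcccactgta' | split_coords
```

Removing the strand marker before splitting on "-" is what keeps the minus strand "(-)" from being mistaken for a separator.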

$ awk '
{ key=$1 OFS $2 OFS $3 }
NR==FNR { map[key]=$NF; next }
{ print $0, map[key] }
' FS='[[:space:](-]+' file2 FS=' ' file1
VMNF01000015.1 1769465 1769675 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3225875 3226081 . . + Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 3226046 3226081 . . - Focub_II5_mimp_1 tacacacctgcgaatactttttgcatcccactgta
VMNF01000014.1 3585246 3585281 . . - Focub_II5_mimp_1 tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1 3692468 3692503 . . - Focub_II5_mimp_1 tacagtgggatgcaaaaagtattcgcaggtgt
VMNF01000014.1 3715380 3715415 . . + Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta
VMNF01000014.1 2872478 2872511 . . - Focub_II5_mimp_1 gtacttcagcctggattcaaacttattgcatcccactgta

Related

replace strings with certain format in bash

I have a file like this. It is a 7-column file whose separator is a single space (sep=" ").
However, the 4th column is a string of several words, so it also contains spaces. The last 3 columns are numbers.
test_find.txt
A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain;Sterile alpha motif domain . . 0.0007
G intergenic 0.673 BTB/POZ domain|BTB/POZ domain|BTB/POZ domain . . 0.0015
I want to replace each space with an underscore (e.g. replace "Sterile alpha motif domain" with "Sterile_alpha_motif_domain"). My idea is first to find the pattern that starts with letters and ends with "|", treat it as one string, and replace all its spaces with "_", then move to the next line and find the next pattern. (Is there an easier way to do it?)
I was able to use sed -i -e 's/Sterile alpha motif domain/Sterile_alpha_motif_domain/g' test_find.txt to fix only the first row, but I cannot generalize it.
I tried to find all patterns using sed -n 's/^[^[a-z]]*{\(.*\)\\[^\|]*$/\1/p' test_find.txt but it doesn't work.
Can anyone help me?
I want output like this:
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Thank you!!!!

We'll need two-step processing: first extract the 4th column, which may
contain spaces; then replace the spaces in the 4th column with underscores.
With GNU awk:
gawk '{
if (match($0, /^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$/, a)) {
gsub(/ /, "_", a[3])
print a[1] a[3] a[4]
}
}' test_find.txt
Output:
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
The regex ^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$ matches a whole line and
captures each submatch.
The 3rd argument a (a GNU awk extension) is the name of an array that
receives the capture groups. a[1] holds the 1st-3rd columns,
a[3] holds the 4th column, and a[4] holds the 5th-7th columns.
The gsub function replaces each space in a[3] with an underscore.
Then the columns are concatenated and printed.
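Where GNU awk is not available, the same idea can be sketched with plain field handling: keep the first three and last three fields as they are, and glue everything in between together with underscores. This is only a sketch under assumptions: join_middle is a hypothetical helper name, every line is assumed to have at least 7 space-separated fields, and the last three columns are assumed to contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: join fields 4 .. NF-3 with "_" (the multi-word 4th column).
# Assumes single-space separators and at least 7 fields per line.
join_middle() {
    awk '{
        out = $1 " " $2 " " $3 " " $4
        for (i = 5; i <= NF - 3; i++)      # remaining words of the 4th column
            out = out "_" $i
        print out, $(NF-2), $(NF-1), $NF
    }' "$@"
}

# Demo on the second sample line from the question.
printf '%s\n' 'G intergenic 0.673 BTB/POZ domain|BTB/POZ domain|BTB/POZ domain . . 0.0015' | join_middle
```

When the 4th column is a single word the loop body never runs, so such lines pass through unchanged.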

Assuming there is a punctuation character just before the final numeric column, you can try this sed:
$ sed -E 's~([[:alpha:]/]+) ~\1_~g;s/_([[:punct:]])/ \1/g' input_file
0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015

Without making any assumptions on the content of each field, you can 'brute force' the expected result by counting the number of characters in each field (+ the number of field separators) for the beginning of the line and the end of the line, and use this to manipulate the '4th column', e.g.
awk '{start=length($1)+length($2)+length($3)+4; end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6; text=substr($0, start, end); gsub(" ", "_", text); print $1, $2, $3, text, $(NF-2), $(NF-1), $NF}' test.txt
'Neat' version:
awk '{
start=length($1)+length($2)+length($3)+4
end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6
text=substr($0, start, end)
gsub(" ", "_", text)
print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Breakdown:
awk '{
# How many characters are there before column 4 begins (length of each field + total count of field separators (in this case, "4"))
start=length($1)+length($2)+length($3)+4;
# How many characters are there in column 4 (total - (first 3 fields + last 3 fields + total field separators (6)))
end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6;
# Use the substr function to define column 4
text=substr($0, start, end);
# Substitute spaces for underscores in column 4
gsub(" ", "_", text);
# Print everything
print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt

Replace text between 2 particular lines in a text file using sed

Similar questions have been asked but they are for Powershell.
I have a Markdown file like:
.
.
.
## See also
- [a](./A.md)
- [A Child](./AChild.md)
.
.
.
- [b](./B.md)
.
.
.
## Introduction
.
.
.
I wish to replace all occurrences of .md) with .html) between ## See also and ## Introduction :
.
.
.
## See also
- [a](./A.html)
- [A Child](./AChild.html)
.
.
.
- [b](./B.html)
.
.
.
## Introduction
.
.
.
I tried like this in Bash
orig="\.md)"; new="\.html)"; sed "s~$orig~$new~" t.md -i
But this replaces every occurrence in the file. I want the replacement to happen only between ## See also and ## Introduction.
Could you please suggest changes? I am using awk and sed as I am a little familiar with them. I also know a little Python; is it recommended to do such scripting in Python (if it is too complicated for sed or awk)?

$ sed '/## See also/,/## Introduction/s/\.md/.html/g' file
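To see the address range at work, here is a small sketch that runs a slightly stricter variant on inline sample text. The variant anchors both headings at the start of the line and matches the closing parenthesis as well, as the question asks; md_links_to_html is a hypothetical helper name, and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: rewrite ".md)" to ".html)" only between the two headings.
# The range /start/,/end/ includes both heading lines themselves.
md_links_to_html() {
    sed '/^## See also/,/^## Introduction/s/\.md)/.html)/g' "$@"
}

# Demo: only the link between the two headings should change.
printf '%s\n' \
    '[x](./X.md)' \
    '## See also' \
    '- [a](./A.md)' \
    '## Introduction' \
    '[y](./Y.md)' | md_links_to_html
```

Links before "## See also" and after "## Introduction" are left alone because the substitution only runs on lines inside the range.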

find first value matching the substring

The 9th column has multiple values separated by ";". I am trying to find the first occurrence of the string after "name_id" in column $9 of a tab-delimited file. The first line of the file looks like this, e.g.
1 NY state 3102016 3102125 . + . name_id "ENSMUSG8868"; trans_id "ENSMUST00000082908"; number "1"; id_name "Gm26206";ex_id "ENSMUSE000005";
There are multiple values separated by";" in 9th column. I could come up with this command that pulls out the last "ENSMUSE000005" id
sed 's|.*"\([0-9_A-Z]\+\)".*|\1|' input.txt | head
Can it be done with regex in awk? thanks a lot!

echo "$x" |awk -F';' '{split($1,a," ");gsub(/"/ ,"" ,a[10]);print a[10]}'
ENSMUSG8868
Where x is your line.
Based on OP's comments:
echo "$x" |awk -F';' '{split($1,a," ");gsub(/"/ ,"" ,a[10]);print a[1],a[10]}'
1 ENSMUSG8868
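Since the question asks whether this can be done with a regex in awk, here is a sketch that uses match() instead of counting fields, so it does not depend on name_id being the 10th word. The name first_name_id is a hypothetical helper, and the sketch assumes the value is always wrapped in double quotes right after name_id and a single space.

```shell
#!/bin/sh
# Hypothetical helper: print the quoted value that follows name_id on each line.
first_name_id() {
    awk '{
        # match() sets RSTART and RLENGTH for the first occurrence on the line
        if (match($0, /name_id "[^"]*"/)) {
            # skip the 9 characters of: name_id + space + opening quote;
            # RLENGTH also counts the closing quote, hence minus 10
            print substr($0, RSTART + 9, RLENGTH - 10)
        }
    }' "$@"
}

# Demo on the sample line from the question.
x='1 NY state 3102016 3102125 . + . name_id "ENSMUSG8868"; trans_id "ENSMUST00000082908"; number "1"; id_name "Gm26206";ex_id "ENSMUSE000005";'
printf '%s\n' "$x" | first_name_id
```

Lines without a name_id entry print nothing, which may or may not be what you want for your file.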

Remove middle character from variable names

I have variable names ending with an underscore (_), followed by a year code:
clear
set obs 1
foreach var in age_58 age_64 age_75 age_184 age_93 age99 {
generate `var' = rnormal()
}
list
+----------------------------------------------------------------------+
| age_58 age_64 age_75 age_184 age_93 age99 |
|----------------------------------------------------------------------|
1. | .1162236 -.8781271 1.199268 -1.475732 .9077238 -.0858719 |
+----------------------------------------------------------------------+
I would like to rename them into:
age58 age64 age75 age184 age93 age99
I know I can do this by renaming one variable at a time as follows:
rename age_58 age58
rename age_64 age64
rename age_75 age75
rename age_184 age184
rename age_93 age93
How can I remove the underscore from all the variable names at once?

In Stata 13 and later versions, this can be done in one line using the built-in command rename.
One merely has to specify the relevant rules, which can include wildcard characters:
rename *_# *#
list
+----------------------------------------------------------------------+
| age58 age64 age75 age184 age93 age99 |
|----------------------------------------------------------------------|
1. | .1162236 -.8781271 1.199268 -1.475732 .9077238 -.0858719 |
+----------------------------------------------------------------------+
Type help rename group for details on the various available specifiers.

For Stata 8 up, the community-contributed command renvars offers a solution:
renvars age_*, subst(_)
For documentation and download, see
. search renvars, historical
Search of official help files, FAQs, Examples, SJs, and STBs
SJ-5-4 dm88_1 . . . . . . . . . . . . . . . . . Software update for renvars
(help renvars if installed) . . . . . . . . . N. J. Cox and J. Weesie
Q4/05 SJ 5(4):607
trimend() option added and help file updated
STB-60 dm88 . . . . . . . . Renaming variables, multiply and systematically
(help renvars if installed) . . . . . . . . . N. J. Cox and J. Weesie
3/01 pp.4--6; STB Reprints Vol 10, pp.41--44
renames variables by changing prefixes, postfixes, substrings,
or as specified by a user supplied rule
For the 2001 paper, see this .pdf file.

You can loop over the variables using the macro extended function subinstr:
foreach var of varlist * {
local newname : subinstr local var "_" "", all
if "`newname'" != "`var'" {
rename `var' `newname'
}
}

RegEx for DNS Servers via IPCONFIG

Stack Overflow RegEx Wizards, I've scoured Google and haven't quite found a good solution for this. I need to pull out 1:N DNS servers from IPCONFIG results. In the example below, I would need the first three. However, there may be an instance where there are more or less.
Update: Optimally we want to place cursor at first colon(:) in the DNS string then capture IPs until we hit an alpha character. So if we can just scrape a string from that colon to that alpha character we can run another RegEx to match IPs.
DNS.*: gets us to the first colon (:)
Need to read-ahead until alpha character.
Important Note: Because of the third-party tool we're using we can only use RegEx :)
Here's the RegEx value I've been using as for IPs. This will capture all IP's instead of just the DNS ones...
(([0-9]){1,3}.){1,3}[0-9]{1,3}
IPCONFIG Example
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 152.225.244.1
DHCP Server . . . . . . . . . . . : 10.204.40.57
DNS Servers . . . . . . . . . . . : 10.204.127.11
10.207.2.50
10.200.10.6
Primary WINS Server . . . . . . . : 10.207.40.145
Secondary WINS Server . . . . . . : 10.232.40.38
Lease Obtained. . . . . . . . . . : Tuesday, August 28, 2012 6:45:12 AM
Lease Expires . . . . . . . . . . : Sunday, September 02, 2012 6:45:12 A

#!/usr/bin/env perl
use strict;
use warnings;
my $data = <<END;
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 152.225.244.1
DHCP Server . . . . . . . . . . . : 10.204.40.57
DNS Servers . . . . . . . . . . . : 10.204.127.11
10.207.2.50
10.200.10.6
Primary WINS Server . . . . . . . : 10.207.40.145
Secondary WINS Server . . . . . . : 10.232.40.38
Lease Obtained. . . . . . . . . . : Tuesday, August 28, 2012 6:45:12 AM
Lease Expires . . . . . . . . . . : Sunday, September 02, 2012 6:45:12 A
END
my @ips = ();
if ($data =~ /^DNS Servers[\s\.:]+((\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s*)+)/m) {
@ips = split(/\s+/, $1);
print "$_\n" foreach(@ips);
}

I would use unpack instead of regular expressions for parsing column-based data:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
my ($ip) = unpack 'x36 A*';
print "$ip\n";
}
__DATA__
DNS Servers . . . . . . . . . . . : 10.204.127.11
10.207.2.50
10.200.10.6
Primary WINS Server . . . . . . . : 10.207.40.145
Secondary WINS Server . . . . . . : 10.232.40.38
You may have to adjust the number 36 to the actual number of characters that should be skipped.

Personally, I'd go in a different direction. Instead of manually parsing the output of ipconfig, I'd use the Win32::IPConfig module.
Win32::IPConfig - IP Configuration Settings for Windows NT/2000/XP/2003
use Win32::IPConfig;
use Data::Dumper;
my $host = shift || "127.0.0.1";
my $ipconfig = Win32::IPConfig->new($host);
my @searchlist = $ipconfig->get_searchlist;
print Dumper \@searchlist;

Match
DNS.+?:(\s*([\d.]+).)+
and pull out the groups. This assumes you have the entire multi-line string in one blob, and that the extracted text may contain newlines and other whitespace.
The last dot is there to match the newline, so you need the option that lets . match newlines (/s in Perl, often called DOTALL or single-line mode elsewhere).

Match against this regex (see in action):
DNS Servers.*:\s*(.*(?:[\n\r]+\s+.*(?:[\n\r]+\s+.*)?)?)
The first capture group will hold your IPs (at most three), as you requested. You will still need to trim the whitespace.
Edit: Regex fixed to match at most three IPs. If there are fewer, it matches only those.
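If the regex-only restriction of the third-party tool can be relaxed, the same extraction can be sketched as a small state machine in awk: start collecting at the DNS Servers line and stop at the first line that is not a bare address. This is a sketch under assumptions: dns_servers is a hypothetical helper name, only IPv4 addresses are handled (an IPv6 address contains colons and would break the first substitution), and continuation lines are assumed to hold nothing but one address.

```shell
#!/bin/sh
# Hypothetical helper: print every IPv4 address listed under "DNS Servers".
dns_servers() {
    awk '
        /DNS Servers/ {
            in_dns = 1
            sub(/^.*:/, "")             # drop the label up to the colon
            gsub(/[ \t\r]/, "")         # drop padding and a possible CR
            print
            next
        }
        in_dns && /^[ \t\r]*[0-9]+(\.[0-9]+)+[ \t\r]*$/ {
            gsub(/[ \t\r]/, "")         # continuation line: a bare address
            print
            next
        }
        { in_dns = 0 }                  # any other line ends the DNS block
    ' "$@"
}

# Demo on an abbreviated version of the sample from the question.
printf '%s\n' \
    'DHCP Server . . . . . . . . . . . : 10.204.40.57' \
    'DNS Servers . . . . . . . . . . . : 10.204.127.11' \
    '                                    10.207.2.50' \
    '                                    10.200.10.6' \
    'Primary WINS Server . . . . . . . : 10.207.40.145' | dns_servers
```

The block ends as soon as a labelled line such as "Primary WINS Server" appears, so the helper works for any number of DNS entries.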
