replace strings with lines from another text file by matching patterns - regex

I have a file with a key -> value correspondence:
sort keyFile.txt | head
ENSMUSG00000000001 ENSMUSG00000000001_Gnai3
ENSMUSG00000000003 ENSMUSG00000000003_Pbsn
ENSMUSG00000000003 ENSMUSG00000000003_Pbsn
ENSMUSG00000000028 ENSMUSG00000000028_Cdc45
ENSMUSG00000000028 ENSMUSG00000000028_Cdc45
ENSMUSG00000000028 ENSMUSG00000000028_Cdc45
ENSMUSG00000000031 ENSMUSG00000000031_H19
ENSMUSG00000000031 ENSMUSG00000000031_H19
ENSMUSG00000000031 ENSMUSG00000000031_H19
ENSMUSG00000000031 ENSMUSG00000000031_H19
And I would like to replace every occurrence of a "key" with its "value" in temp.txt:
head temp.txt
ENSMUSG00000000001:001 515
ENSMUSG00000000001:002 108
ENSMUSG00000000001:003 64
ENSMUSG00000000001:004 45
ENSMUSG00000000001:005 58
ENSMUSG00000000001:006 63
ENSMUSG00000000001:007 46
ENSMUSG00000000001:008 11
ENSMUSG00000000001:009 13
ENSMUSG00000000003:001 0
The result should be:
out.txt
ENSMUSG00000000001_Gnai3:001 515
ENSMUSG00000000001_Gnai3:002 108
ENSMUSG00000000001_Gnai3:003 64
ENSMUSG00000000001_Gnai3:004 45
ENSMUSG00000000001_Gnai3:005 58
ENSMUSG00000000001_Gnai3:006 63
ENSMUSG00000000001_Gnai3:007 46
ENSMUSG00000000001_Gnai3:008 11
ENSMUSG00000000001_Gnai3:009 13
ENSMUSG00000000003_Pbsn:001 0
I have tried a few variations following this AWK example, but as you can see the result is not what I expected:
awk 'NR==FNR{a[$1]=$1;next}{$1=a[$1];}1' keyFile.txt temp.txt | head
515
108
64
45
58
63
46
11
13
0
My guess is that column 1 of temp.txt does not match column 1 of keyFile.txt exactly. Could someone please help me with this?
R/python/sed solutions are also welcome.

Use an awk command like this:
awk 'NR==FNR {a[$1]=$2; next} {
    split($1, b, ":");
    if (b[1] in a)
        print a[b[1]] ":" b[2], $2;
    else
        print $0;
}' keyFile.txt temp.txt

Code for GNU sed:
sed -nr '$!N;/^(.*)\n\1$/!bk;D;:k;s#\S+\s+(\w+)_(\w+)#/^\1/s/(\\w+)(:\\w+)\\s+(\\w+)/\\1_\2\\2 \\3/p#;P;s/^(.*)\n//' keyFile.txt | sed -nrf - temp.txt
ENSMUSG00000000001_Gnai3:001 515
ENSMUSG00000000001_Gnai3:002 108
ENSMUSG00000000001_Gnai3:003 64
ENSMUSG00000000001_Gnai3:004 45
ENSMUSG00000000001_Gnai3:005 58
ENSMUSG00000000001_Gnai3:006 63
ENSMUSG00000000001_Gnai3:007 46
ENSMUSG00000000001_Gnai3:008 11
ENSMUSG00000000001_Gnai3:009 13
ENSMUSG00000000003_Pbsn:001 0

Another awk option
awk -F: 'NR == FNR{split($0, a, " "); x[a[1]]=a[2]; next}{print x[$1]":"$2}' keyFile.txt temp.txt

Another awk version:
awk 'NR==FNR{a[$1]=$2;next}
{sub(/[^:]+/,a[substr($1,1,index($1,":")-1)])}1' keyFile.txt temp.txt
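For quick local testing, the split-based lookup above can be exercised end to end with two small throwaway files (a minimal sketch; file names and sample data follow the question):

```shell
# Build a tiny key -> value map and a file to rewrite (sample rows from the
# question, truncated).
cat > keyFile.txt <<'EOF'
ENSMUSG00000000001 ENSMUSG00000000001_Gnai3
ENSMUSG00000000003 ENSMUSG00000000003_Pbsn
EOF

cat > temp.txt <<'EOF'
ENSMUSG00000000001:001 515
ENSMUSG00000000003:001 0
EOF

# Split the first field on ":" so the bare key can be looked up in the map;
# unmatched lines pass through unchanged.
awk 'NR==FNR {a[$1]=$2; next}
     {split($1, b, ":");
      if (b[1] in a) print a[b[1]] ":" b[2], $2;
      else           print $0}' keyFile.txt temp.txt
# prints:
# ENSMUSG00000000001_Gnai3:001 515
# ENSMUSG00000000003_Pbsn:001 0
```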

Related

How to rename chromosome_position column in a Beagle file and match it with the index fai?

I have text files (tab separated) which have different columns. I need to rename my chromosome_position column (see MM.beagle.gz file below) since a program that I use doesn't allow multiple underscores in the chromosome name (causing a parsing issue, because NC_044592.1_3795 does not work as a name).
My indexed genome looks like this:
head my.fna.fai
Contains this:
NC_044571.1 115307910 88 80 81
NC_044572.1 151975198 116749435 80 81
NC_044573.1 113180500 270624411 80 81
NC_044574.1 71869398 385219756 80 81
The beagle file looks like this:
zcat MM.beagle.gz | head | cut -f 1-3
Which gives:
marker allele1 allele2
NC_044571.1_3795 G T
NC_044573.1_3796 G T
NC_044572.1_3801 T C
NC_044574.1_3802 G A
In R I can get the chromosome and position:
beag = read.table("MM.beagle.gz", header = TRUE)
chr=gsub("_\\d+$", "", beag$marker)
pos=gsub("^[A-Z]*_[0-9]*.[0-9]_", "", beag$marker)
But I'm not able to rename the beagle file in-place. I'd like to rename all contigs in the .fai file from 1:nrow(my.fna.fai) and match it to the beagle file.
So in the end the .fai should look like:
head my.fna.fai
Desired output:
1 115307910 88 80 81
2 151975198 116749435 80 81
3 113180500 270624411 80 81
4 71869398 385219756 80 81
And the beagle file:
zcat MM.beagle.gz | head | cut -f 1-3
Would give:
marker allele1 allele2
1_3795 G T
3_3796 G T
2_3801 T C
4_3802 G A
where 1_3795 is the concatenation of the contig number 1 and the position 3795, separated with an _.
The solution would preferably be in bash, as R is not practical due to the large file size of my final compressed beagle file (>210GB).
Someone proposed to change the .fai with this:
awk 'BEGIN{OFS="\t"}{print NR, $2, $3, $4, $5}' my.fna.fai
What I'm not able to figure out now is how to make sure that the .fai and the .beagle file are consistent with each other. For example, even if the first column (marker) of the .beagle file is shuffled, it should be possible to match it with the .fai file and rename the chromosome names in the .beagle file. For example, if NC_1234.1 is renamed to 142 in the .fai, then all NC_1234.1_XXX in the .beagle should become 142_XXX, where XXX are numbers.
Here is an attempt at the solution:
awk 'BEGIN{OFS="\t"}{print $1, NR}' my.fna.fai > my.fna.fai.nr
awk -F'\t' -v OFS='\t' '{split($1,a,"_"); print $0,a[1]"_"a[2],a[3]}' MM.beagle.txt | awk 'NR!=1 {print}' | awk 'BEGIN{OFS="\t"}{print $0, NR}'> file2.sep.txt
sort file2.sep.txt > file2.1.s.txt
join -1 4 -2 1 file2.1.s.txt my.fna.fai.nr | sort -k6 -n | awk 'BEGIN{OFS="\t"}{$1=$2=$6="";print $7"_"$5,$0}' | awk 'BEGIN{OFS="\t"}{$4=$5="";print $0}' > file4.txt
echo $(awk 'NR==1 {print}' MM.beagle.txt); cat file4.txt
Gives
marker allele1 allele2
1_3795 G T
3_3796 G T
2_3801 T C
4_3802 G A
To ensure the new FASTA index and modified Beagle files are consistent, we can use the FASTA index and an associative array to store each chromosome name with its line number. This lets us then parse the Beagle file and use the chromosome name to retrieve the line number from the array. Here's one way using awk:
Contents of rename_chroms.awk:
BEGIN {
FS=OFS="\t"
}
FNR==NR {
arr[$1]=NR
next
}
FNR==1 {
print
next
}
{
n = split($1, a, "_")
chrom = substr($1, 1, length($1) - (length(a[n]) + 1))
pos = a[n]
print arr[chrom] "_" pos, $2, $3
}
Run using:
awk -f rename_chroms.awk my.fna.fai <(zcat MM.beagle.gz)
Results:
marker allele1 allele2
1_3795 G T
3_3796 G T
2_3801 T C
4_3802 G A
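For the >210GB size concern, the same mapping can be applied as a stream, so only the small .fai table is held in memory. A sketch with miniature stand-ins for my.fna.fai and MM.beagle.gz (file names as in the question; only column 1 of the .fai matters here, so the other columns are repeated for brevity):

```shell
# Miniature stand-ins for the real inputs.
printf '%s\t115307910\t88\t80\t81\n' NC_044571.1 NC_044572.1 NC_044573.1 NC_044574.1 > my.fna.fai
printf 'marker\tallele1\tallele2\nNC_044571.1_3795\tG\tT\nNC_044573.1_3796\tG\tT\n' | gzip > MM.beagle.gz

# Stream the gzipped Beagle file through awk ("-" reads stdin), so nothing
# larger than one line is buffered; the .fai line number becomes the new name.
zcat MM.beagle.gz |
awk 'BEGIN{FS=OFS="\t"}
     NR==FNR {line[$1]=NR; next}   # pass 1 (.fai): chrom name -> line number
     FNR==1  {print; next}         # keep the Beagle header line
     {pos=$1; sub(/.*_/, "", pos)                    # trailing digits = position
      chrom=substr($1, 1, length($1)-length(pos)-1)  # strip "_<pos>"
      print line[chrom] "_" pos, $2, $3}' my.fna.fai -
# prints (tab separated):
# marker 	allele1	allele2
# 1_3795	G	T
# 3_3796	G	T
```

Because the lookup is by name, this stays correct even if the Beagle rows are shuffled.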

Bash select valid rows from file with awk

I have a large data set with some invalid rows. I want to copy to another file only rows which start with valid date (regex digits).
Basically: check if awk's $1 is all digits ([0-9]); if yes, write the whole row ($0) to the output file; if no, skip this row and go to the next one.
This is how I imagine it (both versions give a syntax error):
awk '{if ($1 =~ [0-9]) print $0 }' >> output.txt
awk '$1 =~ [0-9] {print $0}' filename.txt
while this does print the first field, I have no idea how to proceed.
awk '{ print $1 }' filename.txt
19780101
19780102
19780103
a
19780104
19780105
19780106
...
Full data set:
19780101 1 1 1 1 1
19780102 2 2 2 2 2
19780103 3 3 3 3 3
a a a a a a
19780104 4 4 4 4 4
19780105 5 5 5 5 5
19780106 6 6 6 6 6
19780107 7 7 7 7 7
19780108 8 8 8 8 8
19780109 9 9 9 9 9
19780110 10 10 10 10 10
19780111 11 11 11 11 11
19780112 12 12 12 12 12
19780113 13 13 13 13 13
19780114 14 14 14 14 14
19780115 15 15 15 15 15
19780116 16 16 16 16 16
a a a a a a
19780117 17 17 17 17 17
19780118 18 18 18 18 18
19780119 19 19 19 19 19
19780120 20 20 20 20 20
The data set can be reproduced with R
library(dplyr)
library(DataCombine)
N <- 20
df = as.data.frame(matrix(seq(N),nrow=N,ncol=5))
df$date = format(seq.Date(as.Date('1978-01-01'), by = 'day', len = N), "%Y%m%d")
df <- df %>% select(date, everything())
df <- InsertRow(df, NewRow = rep("a", 6), RowNum = 4)
df <- InsertRow(df, NewRow = rep("a", 6), RowNum = 18)
write.table(df,"filename.txt", quote = FALSE, sep="\t",row.names=FALSE)
Questions about reading the first N rows don't address my need, because my invalid rows could be anywhere. This solution doesn't work for some reason.
Since you have a large data set and such a simple requirement, you could just use grep for this as it'd be faster than awk:
grep '^[0-9]' file
Based on your data, you can check whether the first column has exactly 8 digits, representing a date in YYYYMMDD format, using this command:
awk '$1 ~ /^[0-9]{8}$/' file > output
You can just go with this:
awk '/^[0-9]+/' file.txt >> output.txt
By default awk works with lines, so you tell it to select the lines that start (^) with at least one digit ([0-9]+) and to print them, redirecting the output into output.txt.
Hope this helps.
You can also try this:
sed '/^[0-9]/!d' inputfile > outputfile
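One caveat worth noting: {8}-style interval expressions are not supported by every awk (mawk, for instance), and grep '^[0-9]' accepts any line that merely starts with a digit. A portable sketch that enforces the full eight-digit YYYYMMDD shape (the short "1978" row is invented here to show the difference):

```shell
cat > filename.txt <<'EOF'
19780101 1 1 1 1 1
a a a a a a
1978 2 2 2 2 2
19780102 3 3 3 3 3
EOF

# A digits-only test plus a length check works in any POSIX awk.
awk '$1 ~ /^[0-9]+$/ && length($1) == 8' filename.txt
# prints:
# 19780101 1 1 1 1 1
# 19780102 3 3 3 3 3
```

grep '^[0-9]' on the same file would also keep the short "1978" row.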

Regular Expression to get substrings in PowerShell

I need help with a regular expression. I have thousands of lines in a file with the following format:
+ + [COMPILED]\SRC\FileCheck.cs - TotalLine: 99 RealLine: 27 Braces: 18 Comment: 49 Empty: 5
+ + [COMPILED]\SRC\FindstringinFile.cpp - TotalLine: 103 RealLine: 26 Braces: 22 Comment: 50 Empty: 5
+ + [COMPILED]\SRC\findingstring.js - TotalLine: 91 RealLine: 22 Braces: 14 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\restinpeace.h - TotalLine: 95 RealLine: 24 Braces: 16 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\Getsomething.h++ - TotalLine: 168 RealLine: 62 Braces: 34 Comment: 51 Empty: 21
+ + [COMPILED]\SRC\MemDataStream.hh - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
+ + [CONTEXT]\SRC\MemDataStream.sql - TotalLine: 36 RealLine: 138 Braces: 80 Comment: 76 Empty: 59
I need a regular expression that can give me:
FilePath i.e. \SRC\FileMap.cpp
Extension i.e. .cpp
RealLine value i.e. 17
I'm using PowerShell to implement this and been successful in getting the results back using Get-Content (to read the file) and Select-String cmdlets.
The problem is it's taking a long time to get the various substrings and then writing them into the XML file (I have not included the code for generating the XML).
I've never used regular expressions before, but I know using a regular expression would be an efficient way to get the strings.
Help would be appreciated.
The Select-String cmdlet accepts the regular expression to search for the string.
Current code is as follows:
function Get-SubString
{
Param ([string]$StringtoSearch, [string]$StartOfTheString, [string]$EndOfTheString)
If($StringtoSearch.IndexOf($StartOfTheString) -eq -1 )
{
return
}
[int]$StartOfIndex = $StringtoSearch.IndexOf($StartOfTheString) + $StartOfTheString.Length
[int]$EndOfIndex = $StringtoSearch.IndexOf($EndOfTheString , $StartOfIndex)
if( $StringtoSearch.IndexOf($StartOfTheString)-ne -1 -and $StringtoSearch.IndexOf($EndOfTheString) -eq -1 )
{
[string]$ExtractedString=$StringtoSearch.Substring($StartOfTheString.Length)
}
else
{
[string]$ExtractedString = $StringtoSearch.Substring($StartOfIndex, $EndOfIndex - $StartOfIndex)
}
Return $ExtractedString
}
function Get-FileExtension
{
Param ( [string]$Path)
[System.IO.Path]::GetExtension($Path)
}
#For each file extension we will be searching all lines starting with + +
$SearchIndividualLines = "+ + ["
$TotalLines = select-string -Pattern $SearchIndividualLines -Path $StandardOutputFilePath -allmatches -SimpleMatch
for($i = $TotalLines.GetLowerBound(0); $i -le $TotalLines.GetUpperBound(0); $i++)
{
$FileDetailsString = $TotalLines[$i]
#Get File Path
$StartStringForFilePath = "]"
$EndStringforFilePath = "- TotalLine"
$FilePathValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForFilePath -EndOfTheString $EndStringforFilePath
#Write-Host FilePathValue is $FilePathValue
#GetFileExtension
$FileExtensionValue = Get-FileExtension -Path $FilePathValue
#Write-Host FileExtensionValue is $FileExtensionValue
#GetRealLine
$StartStringForRealLine = "RealLine:"
$EndStringforRealLine = "Braces"
$RealLineValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForRealLine -EndOfTheString $EndStringforRealLine
if([string]::IsNullOrEmpty($RealLineValue))
{
continue
}
}
Assume you have those in C:\temp\sample.txt
Something like this?
PS> (get-content C:\temp\sample.txt) | % { if ($_ -match '.*COMPILED\](\\.*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
FilePath Extention RealLine
-------- --------- --------
\SRC\FileCheck .cs 27
\SRC\FindstringinFile .cpp 26
\SRC\findingstring .js 22
\SRC\restinpeace .h 24
\SRC\Getsomething .h 62
\SRC\MemDataStream .hh 131
Update:
Stuff inside parentheses is captured, so if you want to capture [COMPILED], you just need to add that part into the regex:
Instead of
$_ -match '.*COMPILED\](\\.*)
use
$_ -match '.*(\[COMPILED\]\\.*)
The link in the comment to your question includes a good primer on the regex.
UPDATE 2
Now that you want to capture the whole path, I am guessing your sample looks like this:
+ + [COMPILED]C:\project\Rom\Main\Plan\file1.file2.file3\Cmd\Camera.culture.less-Late-PP.min.js - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
The technique above will work, you just need a very slight adjustment for the first parenthesis, like this:
$_ -match '(\[COMPILED\].*)'
This will tell the regex that you want to capture [COMPILED] and everything that comes after it, until
(\.\w+)
i.e. the extension, which is a dot and a couple of letters (which might not work if you had an extension like .3gp).
So, your original one liner would instead be:
(get-content C:\temp\sample.txt) | % { if ($_ -match '.(\[COMPILED\].*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
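For readers outside PowerShell, the same three-group capture can be sketched with sed -E (the pattern here is an assumption modeled on the answer's regex, and it shares the extension caveat above, e.g. for .h++):

```shell
# One sample line from the question.
cat > sample.txt <<'EOF'
+ + [COMPILED]\SRC\FileCheck.cs - TotalLine: 99 RealLine: 27 Braces: 18 Comment: 49 Empty: 5
EOF

# Three capture groups: path (without extension), extension, RealLine count.
sed -E 's/.*COMPILED\](.*)(\.[A-Za-z]+) - .*RealLine: ([0-9]+).*/\1 \2 \3/' sample.txt
# prints:
# \SRC\FileCheck .cs 27
```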

Getting the last column of a grep match for each line

Let's say I have
this is a test string
this is a shest string
this est is another example of sest string
I want the position in the string of the last "t" in the words matching [tsh]EST. How do I get it? (In bash)
EDIT2: I can get the wanted substring with [tsh]*est if I'm not wrong.
I cannot rely on the first match (awk where=match(regex,$0) ) since it gives the first character position but the size of the match is not always the same.
EDIT: Expected output ->
last t of [tsh]*est at char number: 14
last t of [tsh]*est at char number: 15
last t of [tsh]*est at char number: 35
Hope I was clear, I think I edited the question too many times sorry !
What you got wrong
where=match(regex,$0)
The syntax of match is wrong. It's the string followed by the regex, that is, match($0, regex).
Correction
$ awk '{print match($0, "t[^t]*$")}' input
17
18
38
EDIT
Get number of the character in string of the last "t" IN THE WORDS [tsh]EST,
$ awk '{match($0, "(t|sh|s)est"); print RSTART+RLENGTH-1}' input
14
15
35
OR
a much simpler version
$ awk 'start=match($0, "(t|sh|s)est")-1{$0=start+RLENGTH}1' input
14
15
35
Thanks Jidder for the suggestion
EDIT
To use the regex same as OP has provided
$ awk '{for(i=NF; match($i, "(t|sh|s)*est") == 0 && i > 0; i--); print index($0,$i)+RLENGTH-1;}' input
14
15
35
You can use this awk using same regex as provided by OP:
awk -v re='[tsh]*est' '{
i=0;
s=$0;
while (p=match(s, re)) {
p+=RLENGTH;
i+=p-1;
s=substr(s, p)
}
print i;
}' file
14
15
35
Try:
awk '{for (i=NF;i>=0;i--) { if(index ($i, "t") != 0) {print i; break}}}' myfile.txt
This will print the index of the last column whose word contains a t.
awk '{s=0;for (i=1;i<=NF;i++) if ($i~/t/) s=i;print s}' file
5
5
8
awk '{s=w=0;for (i=1;i<=NF;i++) if ($i~/t/) {s=i;w=$i};print "last t found in word="w,"column="s}'
last t found in word=string column=5
last t found in word=string column=5
last t found in word=string column=8
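To see what the RSTART/RLENGTH arithmetic in the answers above is doing, the one-liner can be replayed on the question's sample input (a quick sketch; the input file name is assumed):

```shell
cat > input <<'EOF'
this is a test string
this is a shest string
this est is another example of sest string
EOF

# match() sets RSTART (1-based position of the match) and RLENGTH (its
# length), so the last character of the match sits at RSTART + RLENGTH - 1.
awk '{match($0, "(t|sh|s)est"); print RSTART + RLENGTH - 1}' input
# prints:
# 14
# 15
# 35
```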

Using AWK to print lines between patterns after a pattern has been met

I am trying to print specific parts of a file that looks like this:
99999 1 55 127
{bunch of numbers here}
99999 2 55 126
{bunch of numbers here}
99999 3 55 144
{bunch of numbers here}
and basically I am trying to print the "bunch of numbers" (along with the preceding line) when a specific sequence is met. The 99999 is always constant and I don't care about the number right after it, but I want to condition on the next two numbers.
#! /usr/bin/awk -f
BEGIN{}
{
if ( $3 == 55 && $4 = 100 )
{next
do{print $0}
while($1 != 99999}
}}
END{}
I'm quite new to awk and would really appreciate the help! Thanks
Your question is not clear to me...
I guess you want to print out a block of lines after (inclusive) a 99999 x 55 100 line and before (exclusive) another 99999 ... line.
I used your example (btw, you should provide a better example, and with output), but I changed your criteria to $3==55 and $4==126 so that the block sits in the middle of your data.
awk '$1==99999{f=($3==55&&$4==126)?1:0}f' file
test:
kent$ cat f
99999 1 55 127
{bunch of numbers here}
1
2
99999 2 55 126
3
4
{bunch of numbers here}
99999 3 55 144
5
6
{bunch of numbers here}
kent$ awk '$1==99999{f=($3==55&&$4==126)?1:0}f' f
99999 2 55 126
3
4
{bunch of numbers here}
You can use a flag: set it when both the third and fourth fields match, and reset it (here by exiting) when the next line whose first field is 99999 is found:
awk '
$1 == 99999 && flag == 1 { exit }
$3 == 55 && $4 == 126 { flag = 1 }
flag == 1 { print }
' infile
It yields:
99999 2 55 126
{bunch of numbers here}
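The flag logic above can be checked against a stand-in file (a minimal sketch; the placeholder rows substitute for the real number blocks):

```shell
cat > infile <<'EOF'
99999 1 55 127
{bunch of numbers here}
99999 2 55 126
{bunch of numbers here}
99999 3 55 144
{bunch of numbers here}
EOF

# Start printing at the header whose 3rd/4th fields match; stop at the next
# 99999 header.
awk '$1 == 99999 && flag { exit }
     $3 == 55 && $4 == 126 { flag = 1 }
     flag' infile
# prints:
# 99999 2 55 126
# {bunch of numbers here}
```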