sed: Dynamically remove all text columns except positions defined by pattern - regex

By searching and experimenting (I am no regex expert), I have managed to process a text output using sed or grep and extract some lines, formatted this way:
Tree number 280:
1 0.500 1 node_15 6 --> H 1551.code
1 node_21 S ==> H node_20
Tree number 281:
1 0.500 1 node_16 S ==> M 1551.code
1 node_20 S --> H node_19
Then, using
sed 's/^.\{35\}\(.\{9\}\).*/\1/' infile, I get the desired part, plus some extra output which I get rid of later (not a problem).
Tree number 280:
6 --> H
S ==> H
Tree number 281:
S ==> M
S --> H
However, the horizontal position of the C --> C pattern may vary from file to file, although it is always aligned. Is there a way to extract the --> or ==> together with the single preceding and following characters, no matter which columns they are found in?
The Tree number # part is not necessary and could be left blank as well, but there has to be a separator of some kind.
UPDATE (alternative approach)
Trying to use grep, I issued:
grep -Eo '(([a-zA-Z0-9] -- |[a-zA-Z0-9] ==)> [a-zA-Z0-9]|Changes)' infile
A sample of my initial file follows; if anyone can think of a better, more efficient approach, or if my use of regex is insane, please comment!
..MISC TEXT...
Character change lists:
Character CI Steps Changes
----------------------------------------------------------------
1 0.000 1 node_235 H --> S node
1 node_123 S ==> 6 1843
1 node_126 S ==> H 2461
1 node_132 S ==> 6 1863
1 node_213 H --> I 1816
1 node_213 H --> 8 1820
..CT...
Character change lists:
Character CI Steps Changes
----------------------------------------------------------------
1 0.000 1 node_165 H --> S node
1 node_123 S ==> 6 1843
1 node_231 H ==> S 1823
..MISC TEXT...

Grep is a bit easier for just extracting the matching part of the regex. Inside a bracket expression the characters are simply listed, so [-=] matches either - or = (if you need more separator characters you can just add them to the list; note that a pipe inside brackets is taken literally):
grep -o '. [-=][-=]> .' infile
Or if you really want to use sed for this, the following should do: the first part matches only lines that have the pattern, and the second part extracts only the matching portion:
sed -n '/[-=][-=]>/{s/.*\(. [-=][-=]> .\).*/\1/p}' infile
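Since the question also asks for a separator of some kind, one option (a sketch using ERE, since the {2} repetition needs grep -E) is to keep the Tree number lines by adding them as an alternative to the arrow pattern:

```shell
# keep the "Tree number N:" lines as separators alongside the arrow matches
printf '%s\n' 'Tree number 280:' \
              '1 0.500 1 node_15 6 --> H 1551.code' \
              '1 node_21 S ==> H node_20' |
  grep -Eo 'Tree number [0-9]+:|. [-=]{2}> .'
# prints:
# Tree number 280:
# 6 --> H
# S ==> H
```

Here printf just stands in for the real infile.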

Related

Split file by vector of line numbers

I have a large file, about 10GB. I have a vector of line numbers which I would like to use to split the file. Ideally I would like to accomplish this using command-line utilities. As an example:
File:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
Vector of line numbers:
2 5
Desired output:
File 1:
1 2 3
File 2:
4 5 6
7 8 9
10 11 12
File 3:
13 14 15
16 17 18
Using awk:
$ awk -v v="2 5" '       # space-separated vector of indexes
BEGIN {
    n=split(v,t)         # reshape the vector into a hash
    for(i=1;i<=n;i++)
        a[t[i]]
    i=1                  # filename index
}
{
    if(NR in a) {        # if the record number is in the vector
        close("file" i)  # close the previous file
        i++              # increase the filename index
    }
    print > ("file" i)   # output to the current file
}' file
Sample output:
$ cat file2
4 5 6
7 8 9
10 11 12
Very slightly different from James's and kvantour's solutions: passing the vector to awk as a "file"
vec="2 5"
awk '
    NR == FNR {nr[$1]; next}
    FNR == 1 {filenum = 1; f = FILENAME "." filenum}
    FNR in nr {
        close(f)
        f = FILENAME "." ++filenum
    }
    {print > f}
' <(printf "%s\n" $vec) file
$ ls -l file file.*
-rw-r--r-- 1 glenn glenn 48 Jul 17 10:02 file
-rw-r--r-- 1 glenn glenn 7 Jul 17 10:09 file.1
-rw-r--r-- 1 glenn glenn 23 Jul 17 10:09 file.2
-rw-r--r-- 1 glenn glenn 18 Jul 17 10:09 file.3
This might work for you:
csplit -z file 2 5
or if you want regexp:
csplit -z file /2/ /5/
With the default values, the output files will be named xxnn where nn starts at 00 and is incremented by 1.
N.B. The -z (--elide-empty-files) option suppresses empty output files.
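On the sample data above, the csplit answer plays out like this (a sketch assuming GNU csplit; the xx* names are the defaults described above):

```shell
# split before lines 2 and 5 of the sample file
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' '10 11 12' '13 14 15' '16 17 18' > file
csplit -z file 2 5   # prints the byte count of each piece
cat xx00             # file 1: line 1
cat xx01             # file 2: lines 2-4
cat xx02             # file 3: lines 5-6
```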
Here is a little awk that does the trick for you:
awk -v v="2 5" 'BEGIN{v=" 1 "v" "}
index(v," "FNR" ") { close(f); f=FILENAME "." (++i) }
{ print > f }' file
This will create files of the form: file.1, file.2, file.3, ...
Ok, I've gone totally mental this morning, and I came up with a Sed program (with functions, loops, and all) to generate a Sed script that does what you want.
Usage:
put the script in a file (e.g. make.sed) and chmod +x it;
then use it as the script for this Sed command sed "$(./make.sed <<< '1 4')" inputfile¹
Note that ./make.sed <<< '1 4' generates the following sed script:
1,1{w file.1
be};1,4{w file.2
be};1,${w file.3
be};:e
¹ Unfortunately I misread the question, so my script works taking the line number of the last line of each block that you want to write to file, so your 2 5 has to be changed to 1 4 to be fed to my script.
#!/usr/bin/env -S sed -Ef
###########################################################
# Main
# make a template sed script, in which we only have to increase
# the number of each numbered output file, each of which is marked
# with a trailing \x0
b makeSkeletonAndMarkNumbers
:skeletonMade
# try putting a stencil on the rightmost digit of the first marked number on
# the line and loop, otherwise exit
b stencilLeastDigitOfNextMarkedNumber
:didStencilLeastDigitOfNextMarkedNumber?
t nextNumberStenciled
b exit
# continue processing next number by adding 1
:nextNumberStenciled
b numberAdd1
:numberAdded1
# try putting a stencil on the rightmost digit of the next marked number on
# the line and loop, otherwise we're done with the first marked number, we can
# clean its marker, and we can loop
b stencilNextNumber
:didStencilNextNumber?
t nextNumberStenciled
b removeStencilAndFirstMarker
:removeStencilAndFirstMarkerDone
b stencilLeastDigitOfNextMarkedNumber
###########################################################
# puts a \n on each side of the first digit marked on the right by \x0
:stencilLeastDigitOfNextMarkedNumber
tr
:r
s/([0-9])\x0;/\n\1\n\x0;/1
b didStencilLeastDigitOfNextMarkedNumber?
###########################################################
# makes desired sed script skeleton from space-separated numbers
:makeSkeletonAndMarkNumbers
s/$/ $/
s/([1-9]+|\$) +?/1,\1{w file.0\x0;be};/g
s/$/:e/
b skeletonMade
###########################################################
# moves the stencil to the next number followed by \x0
:stencilNextNumber
trr
:rr
s/\n(.)\n([^\x0]*\x0[^\x0]+)([0-9])\x0/\1\2\n\3\n\x0/
b didStencilNextNumber?
###########################################################
# +1 with carry to last digit on the line enclosed in between two \n characters
:numberAdd1
#i\
#\nprima della somma:
#l
:digitPlus1
h
s/.*\n([0-9])\n.*/\1/
y/0123456789/1234567890/
G
s/(.)\n(.*)\n.\n/\2\n\1\n/
trrr
:rrr
/[0-9]\n0\n/s/(.)\n0\n/\n\1\n0/
t digitPlus1
# the following line can be problematic for lines starting with number
/[^0-9]\n0\n/s/(.)\n0\n/\n\1\n10/
b numberAdded1
###########################################################
# remove stencil and first marker on line
:removeStencilAndFirstMarker
s/\n(.)\n/\1/
s/\x0//
b removeStencilAndFirstMarkerDone
###########################################################
:exit
# a bit of post-processing: the `w` command has to be followed
# by the filename and then by a newline, so we change the appropriate `;`s to `\n`
s/(\{[^;]+);/\1\n/g

Keeping specific rows with grep function

I have a large data sets and the variable includes different format
Subject Result
1 3
2 4
3 <4
4 <3
5 I need to go to school<>
6 I need to <> be there
7 2.3 need to be< there
8 <.3
9 .<9
10 ..<9
11 >3 need to go to school
12 <16.1
13 <5.0
I just want to keep the rows which include "< number" or "> number", and not the rows with text (for example, I want to exclude ">3 need to go to school" and "I need to go to school<>"). The problem is that some records are something like .<3, ..<9, >9., >:9. So how can I remove ".", "..", ":" from the data set and then keep the rows with the "< number" notation? How can I use the "grep" function?
Again, I just want to keep the following rows
Subject Result
3 <4
4 <3
8 <.3
9 .<9
10 ..<9
12 <16.1
13 <5.0
You can simply apply two greps: one to find the lines containing "<" or ">", and one to eliminate the lines containing letters:
grep "[><]" | grep -v "[A-Za-z]"
If you want to be pedantic, you can also apply another grep to keep only the lines containing numbers:
grep "[><]" | grep -v "[A-Za-z]" | grep "[0-9]"
"grep -v" means match and don't return, by the way.
Assuming you're certain that [.,:;] are the only problematic punctuation:
df$Result<-gsub("[.,;:]","", df$Result) # remove any cases of [.,;:] from your results column
df[grep("^\\s*[<>][0-9]+$", df$Result),] # find all cases of numbers preceded by < or > (with possible spaces) and succeeded by nothing else.
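If you would rather filter the raw text file before it ever reaches R, the same idea can be sketched as a single grep over the two-column lines (an assumption: the file is whitespace-separated with Subject then Result; the [.:]* part tolerates the stray dots and colons before < or >):

```shell
# keep only lines whose Result is (possibly dot/colon-prefixed) < or > plus a number
printf '%s\n' '3 <4' '5 I need to go to school<>' '9 .<9' '10 ..<9' '12 <16.1' |
  grep -E '^[0-9]+[[:space:]]+[.:]*[<>][.0-9]+$'
# prints:
# 3 <4
# 9 .<9
# 10 ..<9
# 12 <16.1
```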

Removing Leading 0 and applying Regex to Sed

I have several file names, for ease I've put them in a file as follows:
01.action1.txt
04action2.txt
12.action6.txt
2.action3.txt
020.action9.txt
10action4.txt
15action7.txt
021action10.txt
11.action5.txt
18.action8.txt
As you can see the formats aren't consistent. What I'm trying to do is extract the first number from each of these file names: 1, 4, 12, 2, 20, etc.
I have the following regex
(\.)?action\d{1,}.txt
which successfully matches .action[number].txt, but I also need to match the leading 0 so that when I substitute the match with blank in sed, I'm left with only the number. I'm having trouble matching the leading 0 and putting the whole thing into sed.
Thanks
With GNU sed:
sed -r 's/0*([0-9]*).*/\1/' file
Output:
1
4
12
2
20
10
15
21
11
18
See: The Stack Overflow Regular Expressions FAQ
I don't know if the below awk is helpful but it works as well:
awk '{print $1 + 0}' file
1
4
12
2
20
10
15
21
11
18
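The reason the awk version works is that awk converts a string to a number by taking its longest numeric leading prefix, which drops both the leading zeros and the trailing text; a quick sketch:

```shell
# "$1 + 0" forces numeric conversion: "020.action9.txt" -> 20, "10action4.txt" -> 10
printf '%s\n' '020.action9.txt' '10action4.txt' '2.action3.txt' | awk '{print $1 + 0}'
# prints:
# 20
# 10
# 2
```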

Using AWK, compare two files having a single column each and get a count against each matched item

I am going to split my problem as two problems
Problem 1
I have two numerically sorted files, each having a single column, as below. File t1.txt has unique values. File t2.txt has duplicate values.
file1: t1.txt
1
2
3
4
5
file2: t2.txt
0
2
2
3
4
7
8
9
9
The output I require is as below:
item matched ---> times it matched in t2.txt
With awk I am using this:
awk 'FNR==NR {a[$1]; next} $1 in a' t2.txt t1.txt
The output I get is:
2
3
4
However I want this:
2 --> 2
3 --> 1
4 --> 1
Problem 2
I am going to run this on large files. The actual target files have below line count:
t1.txt 9702304
t2.txt 32412065
How can we enhance the performance of the script/solution as the file size increases? Please consider that both files will have exactly one column and will be numerically sorted.
Will appreciate your help here. Thanks!
If you don't need to use awk, this pipeline gets you most of the way there:
$ grep -Fxf t1.txt t2.txt | sort | uniq -c
2 2
1 3
1 4
$ join <(sort t1.txt) <(sort t2.txt) | uniq -c | awk '{ print $2 " --> " $1}'
2 --> 2
3 --> 1
4 --> 1
(Of course you can skip the sort if the files are really already sorted, though I noticed in your sample data that 0 follows 9.)
For your problem1, this one-liner should help.
awk 'NR==FNR{a[$1];next}$1 in a{b[$1]++}END{for(x in b)printf "%s --> %s\n", x, b[x]}' f1 f2
tested with your data:
kent$ head f*
==> f1 <==
1
2
3
4
5
==> f2 <==
2
3
4
2
7
8
9
9
0
kent$ awk 'NR==FNR{a[$1];next}$1 in a{b[$1]++}END{for(x in b)printf "%s --> %s\n", x, b[x]}' f1 f2
2 --> 2
3 --> 1
4 --> 1
For problem 2, you can test this one-liner on your files and see if the performance is OK.
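Since the question guarantees that both files are numerically sorted, a streaming two-pointer merge is another option worth sketching: it holds only one key of t1.txt in memory at a time instead of hashing all ~9.7 million of them (the t1.txt/t2.txt names are from the question; this is a sketch, untested at that scale):

```shell
# stream t2.txt while walking t1.txt in lockstep via getline
awk -v keyfile=t1.txt '
    BEGIN { if ((getline key < keyfile) <= 0) exit }
    {
        # advance past keys smaller than the current value,
        # flushing the count collected for each key on the way
        while (key + 0 < $1 + 0) {
            if (count > 0) { print key " --> " count; count = 0 }
            if ((getline key < keyfile) <= 0) exit
        }
        if (key + 0 == $1 + 0) count++
    }
    END { if (count > 0) print key " --> " count }
' t2.txt
```

On the sample data this prints the desired "2 --> 2", "3 --> 1", "4 --> 1".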

how does this AWK associative array with two files work?

I am writing to ask for an explanation of some elements of this short AWK command, which I am using to print fields from test-file_long.txt that match fields in test-file_short.txt. The code works fine; I would just like to know exactly what the program is doing, since I am very new to programming and would like to be able to think on my toes for future commands that I will need to write. Here is the example:
$ cat test-file_long.txt
2 41647 41647 A G
2 45895 45895 A G
2 45953 45953 T C
2 224919 224919 A G
2 230055 230055 C G
2 233239 233239 A G
2 234130 234130 T G
$ cat test-file_short.txt
2 41647 41647 A G
2 45895 45895 A G
2 FALSE 224919 A G
2 233239 233239 A G
2 234130 234130 T G
$ awk 'NR==FNR{a[$2];next}$2 in a{print $0,FNR}' test-file_short.txt test-file_long.txt
2 41647 41647 A G 1
2 45895 45895 A G 2
2 233239 233239 A G 6
2 234130 234130 T G 7
It is a very simple matching problem for which I found the commands on this site a few weeks ago. My questions are:
1) What exactly does NR==FNR do? I know that they stand for the number of records and the number of records of the current input file, respectively, but why is this necessary for the code to operate? When I remove it from the command, the result is the same as paste test-file_long.txt test-file_short.txt.
2) For $2 in a, does AWK automatically read field 2 from the second file as part of the syntax here?
3) I just want to confirm that ;next means to skip all other blocks and go to the next line? So in other words, the code first performs a[$2] for every line of the first file and then performs the other blocks for each line of the second file? When I remove ;next I still get the filtered output, but only trailing a full printout of test-file_short.txt.
Thanks for any and all input, my goal is just to understand better how AWK works, since it has been extraordinarily useful for my current work (processing large genomics datasets).
Here is some information related to your code:
NR==FNR will only be true for the first file, since for the second file FNR starts from 1 again, whereas NR continues to increase.
$2 in a will only be evaluated for the second file; this is due to the next statement inside the first rule. Because of that next statement, the second rule is never reached while reading the first file.
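A quick way to see the NR/FNR mechanics for yourself (a sketch with two throwaway files):

```shell
# NR keeps counting across files while FNR restarts at 1 for each file,
# so NR==FNR is true only while awk is reading the first file
printf '%s\n' a b > f1
printf '%s\n' c d > f2
awk '{print FILENAME, NR, FNR, (NR==FNR ? "first" : "second")}' f1 f2
# prints:
# f1 1 1 first
# f1 2 2 first
# f2 3 1 second
# f2 4 2 second
```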