Extracting text file information via command line/script - regex

I'd like to extract only certain information from a block of text. I have had great luck asking the StackOverflow community for expert assistance, especially with tricky topics (regex, Perl, sed, awk).
The text is output from a tshark command; I would like to reformat it so that only the relevant columns are printed.
Any help would be appreciated. I am currently learning the ways of the aforementioned topics, but it's slow going!
Any script or command that achieves the following output would be seriously appreciated.
Original:
Host 1            <->  Host 2                                             Total         Relative    Duration
                                     Frames  Bytes    Frames  Bytes    Frames  Bytes     Start
192.168.0.14 <-> 192.168.0.13 3898 4872033 1971 120545 5869 4992578 0.001886000 283.6363
192.168.0.162 <-> 192.168.0.71 2 1992 2 1992 4 3984 176.765198000 77.0542
192.168.0.191 <-> 192.168.0.150 3 2988 0 0 3 2988 199.319020000 59.7055
192.168.0.227 <-> 192.168.0.157 3 2988 0 0 3 2988 197.013283000 76.7197
192.168.0.221 <-> 192.168.0.94 3 2988 0 0 3 2988 196.312847000 59.7065
192.168.0.75 <-> 192.168.0.58 2 1992 1 996 3 2988 191.995706000 59.7121
224.0.0.252 <-> 192.168.0.13 3 207 0 0 3 207 180.521299000 0.0536
192.168.0.191 <-> 192.168.0.50 1 996 2 1992 3 2988 173.452130000 59.6849
192.168.0.41 <-> 192.168.0.13 3 2988 0 0 3 2988 167.180087000 76.6960
192.168.0.206 <-> 192.168.0.153 1 996 1 996 2 1992 270.528070000 4.4070
Desired:
Host 1 Host 2 Total Bytes
x.x.x.x x.x.x.x N
x.x.x.x x.x.x.x N
x.x.x.x x.x.x.x N

Try:
awk '
BEGIN { printf "%-15s %-15s %s\n", "Host 1", "Host 2", "Total Bytes" }
NR>2 { printf "%-15s %-15s %11s\n", $1, $3, $9 }
' file
Adjust the output field widths as needed.
The BEGIN block prints the output header line.
NR > 2 ensures that the two input header lines are skipped.
printf is used with field-width specifiers to create column-aligned output.
A - before the width specifier indicates left-aligned output (e.g., %-15s); without it, the value is right-aligned (e.g., %11s).
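A quick way to see the alignment flags in action (the strings here are just placeholders):
$ awk 'BEGIN { printf "%-15s|%11s|\n", "left", "right" }'
left           |      right|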

In Perl:
tshark | perl -lane 'print join "\t", ($F[0], $F[2], $F[8])'
The -a option splits each line of stdin into an array called @F. The column numbers don't map directly onto the array indexes, because @F is zero-based and the literal <-> separator counts as a field of its own. You can set the split delimiter with -F if you like.
-F could help get the headers aligned correctly too, but to simply skip the misaligned headers, add next if $. < 3; before print to drop the first two lines.
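Putting the pieces together, with the header skip included (your tshark invocation goes up front, as before):
tshark | perl -lane 'next if $. < 3; print join "\t", @F[0,2,8]'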

Given your output is in filename:
sed 's/ \+/ /g' filename | tail -n +3 | cut -f1,3,9 -d ' ' | sed 's/ /\t/g' | sort -r -n -k3
replace multiple spaces with a single one, for tokenizing
discard the first two header lines
project columns 1, 3, and 9
replace spaces with tabs to have columns back
sort desc by total bytes
output:
192.168.0.14 192.168.0.13 4992578
192.168.0.162 192.168.0.71 3984
192.168.0.75 192.168.0.58 2988
192.168.0.41 192.168.0.13 2988
192.168.0.227 192.168.0.157 2988
192.168.0.221 192.168.0.94 2988
192.168.0.191 192.168.0.50 2988
192.168.0.191 192.168.0.150 2988
192.168.0.206 192.168.0.153 1992
224.0.0.252 192.168.0.13 207

Related

Split file by vector of line numbers

I have a large file, about 10 GB. I have a vector of line numbers which I would like to use to split the file. Ideally I would like to accomplish this using command-line utilities. For example:
File:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
Vector of line numbers:
2 5
Desired output:
File 1:
1 2 3
File 2:
4 5 6
7 8 9
10 11 12
File 3:
13 14 15
16 17 18
Using awk:
$ awk -v v="2 5" '         # space-separated vector of indexes
BEGIN {
    n=split(v,t)           # reshape the vector into a hash
    for(i=1;i<=n;i++)
        a[t[i]]
    i=1                    # filename index
}
{
    if(NR in a) {          # if the record number is in the vector
        close("file" i)    # close the previous file
        i++                # increment the filename index
    }
    print > ("file" i)     # output to the current file
}' file
Sample output:
$ cat file2
4 5 6
7 8 9
10 11 12
Very slightly different from James's and kvantour's solutions: passing the vector to awk as a "file"
vec="2 5"
awk '
NR == FNR {nr[$1]; next}
FNR == 1 {filenum = 1; f = FILENAME "." filenum}
FNR in nr {
close(f)
f = FILENAME "." ++filenum
}
{print > f}
' <(printf "%s\n" $vec) file
$ ls -l file file.*
-rw-r--r-- 1 glenn glenn 48 Jul 17 10:02 file
-rw-r--r-- 1 glenn glenn 7 Jul 17 10:09 file.1
-rw-r--r-- 1 glenn glenn 23 Jul 17 10:09 file.2
-rw-r--r-- 1 glenn glenn 18 Jul 17 10:09 file.3
This might work for you:
csplit -z file 2 5
or if you want regexp:
csplit -z file /2/ /5/
With the default values, the output files will be named xxnn where nn starts at 00 and is incremented by 1.
N.B. The -z option elides (removes) empty output files.
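For the six-line sample file above, a run would look something like this (the numbers csplit prints are the byte counts of each piece; GNU csplit assumed):
$ csplit -z file 2 5
6
21
18
$ cat xx01
4 5 6
7 8 9
10 11 12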
Here is a little awk that does the trick for you:
awk -v v="2 5" 'BEGIN{v=" 1 "v" "}
index(v," "FNR" ") { close(f); f=FILENAME "." (++i) }
{ print > f }' file
This will create files of the form: file.1, file.2, file.3, ...
OK, I've gone totally mental this morning and come up with a Sed program (with functions, loops, and all) that generates a Sed script to do what you want.
Usage:
put the script in a file (e.g. make.sed) and chmod +x it;
then use it as the script for this Sed command sed "$(./make.sed <<< '1 4')" inputfile¹
Note that ./make.sed <<< '1 4' generates the following sed script:
1,1{w file.1
be};1,4{w file.2
be};1,${w file.3
be};:e
¹ Unfortunately I misread the question, so my script works taking the line number of the last line of each block that you want to write to file, so your 2 5 has to be changed to 1 4 to be fed to my script.
#!/usr/bin/env -S sed -Ef
###########################################################
# Main
# make a template sed script, in which we only have to increase
# the number of each numbered output file, each of which is marked
# with a trailing \x0
b makeSkeletonAndMarkNumbers
:skeletonMade
# try putting a stencil on the rightmost digit of the first marked number on
# the line and loop, otherwise exit
b stencilLeastDigitOfNextMarkedNumber
:didStencilLeastDigitOfNextMarkedNumber?
t nextNumberStenciled
b exit
# continue processing next number by adding 1
:nextNumberStenciled
b numberAdd1
:numberAdded1
# try putting a stencil on the rightmost digit of the next marked number on
# the line and loop, otherwise we're done with the first marked number, we can
# clean its marker, and we can loop
b stencilNextNumber
:didStencilNextNumber?
t nextNumberStenciled
b removeStencilAndFirstMarker
:removeStencilAndFirstMarkerDone
b stencilLeastDigitOfNextMarkedNumber
###########################################################
# puts a \n on each side of the first digit marked on the right by \x0
:stencilLeastDigitOfNextMarkedNumber
tr
:r
s/([0-9])\x0;/\n\1\n\x0;/1
b didStencilLeastDigitOfNextMarkedNumber?
###########################################################
# makes desired sed script skeleton from space-separated numbers
:makeSkeletonAndMarkNumbers
s/$/ $/
s/([1-9]+|\$) +?/1,\1{w file.0\x0;be};/g
s/$/:e/
b skeletonMade
###########################################################
# moves the stencil to the next number followed by \x0
:stencilNextNumber
trr
:rr
s/\n(.)\n([^\x0]*\x0[^\x0]+)([0-9])\x0/\1\2\n\3\n\x0/
b didStencilNextNumber?
###########################################################
# +1 with carry to last digit on the line enclosed in between two \n characters
:numberAdd1
#i\
#\nbefore the sum:
#l
:digitPlus1
h
s/.*\n([0-9])\n.*/\1/
y/0123456789/1234567890/
G
s/(.)\n(.*)\n.\n/\2\n\1\n/
trrr
:rrr
/[0-9]\n0\n/s/(.)\n0\n/\n\1\n0/
t digitPlus1
# the following line can be problematic for lines starting with a number
/[^0-9]\n0\n/s/(.)\n0\n/\n\1\n10/
b numberAdded1
###########################################################
# remove stencil and first marker on line
:removeStencilAndFirstMarker
s/\n(.)\n/\1/
s/\x0//
b removeStencilAndFirstMarkerDone
###########################################################
:exit
# a bit of post-processing: the `w` command has to be followed
# by the filename, then by a newline, so we change the appropriate `;`s to `\n`.
s/(\{[^;]+);/\1\n/g

Using AWK, compare two files each having a single column, and get a count against each matched item

I am going to split my problem into two problems.
Problem 1
I have two numerically sorted files, each with a single column, as below. File t1.txt has unique values. File t2.txt has duplicate values.
file1: t1.txt
1
2
3
4
5
file2: t2.txt
0
2
2
3
4
7
8
9
9
The output I require is as below:
item matched ---> times it matched in t2.txt
With awk I am using this:
awk 'FNR==NR {a[$1]; next} $1 in a' t2.txt t1.txt
The output I get is:
2
3
4
However I want this:
2 --> 2
3 --> 1
4 --> 1
Problem 2
I am going to run this on large files. The actual target files have the line counts below:
t1.txt 9702304
t2.txt 32412065
How can we enhance the performance of the script/solution as the file size increases? Please consider that both files will have exactly one column and will be numerically sorted.
I will appreciate your help here. Thanks!
If you don't need to use awk, this pipeline gets you most of the way there:
$ grep -Fxf t1.txt t2.txt | sort | uniq -c
2 2
1 3
1 4
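If you want the exact arrow format, tack a small awk onto the end to swap the count and the value:
$ grep -Fxf t1.txt t2.txt | sort | uniq -c | awk '{ print $2 " --> " $1 }'
2 --> 2
3 --> 1
4 --> 1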
$ join <(sort t1.txt) <(sort t2.txt) | uniq -c | awk '{ print $2 " --> " $1}'
2 --> 2
3 --> 1
4 --> 1
(Of course you can skip the sort if the files are really already sorted, though I noticed in your sample data that 0 follows 9.)
For your problem 1, this one-liner should help.
awk 'NR==FNR{a[$1];next}$1 in a{b[$1]++}END{for(x in b)printf "%s --> %s\n", x, b[x]}' f1 f2
tested with your data:
kent$ head f*
==> f1 <==
1
2
3
4
5
==> f2 <==
2
3
4
2
7
8
9
9
0
kent$ awk 'NR==FNR{a[$1];next}$1 in a{b[$1]++}END{for(x in b)printf "%s --> %s\n", x, b[x]}' f1 f2
2 --> 2
3 --> 1
4 --> 1
For problem 2, you can test this one-liner on your files and see if the performance is acceptable.
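If memory becomes a problem at that scale, one option (a sketch, relying on t2.txt being sorted as stated) is to pre-aggregate the big file with uniq -c so awk only keeps the t1.txt keys in memory:
uniq -c t2.txt | awk 'NR==FNR{a[$1];next} $2 in a{printf "%s --> %s\n", $2, $1}' t1.txt -
Here uniq -c collapses each run of duplicates into a single "count value" line, and the trailing - tells awk to read that stream as its second input.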

Regex for soccer data

Why isn't my regex working? It just returns the original file. My file looks like this (for a few hundred lines):
1 Germany 1765 0 Equal
2 Argentina 1631 0 Equal
3 Colombia 1488 1 Up
4 Netherlands 1456 -1 Down
5 Belgium 1444 0 Equal
6 Brazil 1291 1 Up
7 Uruguay 1243 -1 Down
8 Spain 1228 -1 Down
9 France 1202 1 Up
...
192 US Virgin Islands 28 -1 Down
And I want this:
Germany,1
Argentina,2
Colombia,3
...
US Virgin Islands,192
This is the regex I tried:
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
But it just returns the original file.
EDIT:
Now I tried
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
and got
,1 Germany,,1765Equal,0,
,2 Argentina,,1631Equal,0,
,3 Colombia,,1488Up,1,
,4 Netherlands,,1456-Down,1,
,5 Belgium,,1444Equal,0,
You could try the below sed command if the fields are tab-separated.
sed 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
Add the in-place edit option -i to save the changes made.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
^ is the start-of-line anchor. + repeats the previous character one or more times; basic sed uses BRE, so you need to escape the + to get that repetition behavior. [^\t]* matches any character other than a tab, zero or more times.
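Assuming the fields really are tab-separated (and GNU sed, which understands \t and \+), the first few lines would come out as:
$ sed 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' fifa.csv | head -3
Germany,1
Argentina,2
Colombia,3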
The following is what you are looking for. The -i option specifies that files are to be edited in-place.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' fifa.csv
awk '{print $2 "," $1}' YourFile
Not sed, but easier to manage.
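Note, though, that $2 alone drops part of multi-word names such as US Virgin Islands. A sketch that instead joins every field between the leading rank and the trailing three columns:
awk '{name=$2; for(i=3;i<=NF-3;i++) name=name " " $i; print name "," $1}' YourFile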

SED- combine matching regex lines to make a csv file

I was wondering if it is possible to use sed to create a csv file by combining multiple lines onto a single line, separated by commas.
For example, I have written a sed statement that retrieves the lines I want:
sed -n -e '/ENTITIES/,/ENDSEC/p' Test.txt | sed -n -e '/ 8/{n;p;}' -e '/ 10/{n;p;}' -e '/ 20/{n;p;}' -e '/ 11/{n;p;}' -e '/ 21/{n;p;}' > out.csv
Which produces the output:
0
4.93
9.04
27.9
23.4
0
34.56
0.77
66.65
19.50
0
55.26
47.29
53.42
19.75
0
-18.22
44.35
19.74
53.28
But I would Like it to output;
0,4.93,9.04,27.9,23.4
0,34.56,0.77,66.65,19.50
0,55.26,47.29,53.42,19.75
0,-18.22,44.35,19.74,53.28
Is there any way to do this without a pipe? I'd rather not invoke another command, as the files I process are upwards of 100 million lines or so.
Thanks in advance for your help!
To add, here is a portion of my input file:
More Stuff Above
AcDbBlockEnd
0
ENDSEC
0
SECTION
2
ENTITIES
0
LINE
5
1B1
330
1F
100
AcDbEntity
8
0
100
AcDbLine
10
4.933855223957067
20
9.042372500389475
30
0.0
11
27.92566226775641
21
23.49207557886149
31
0.0
0
LINE
5
1B2
330
1F
100
AcDbEntity
8
0
100
AcDbLine
10
34.56437535704545
20
0.778745874786317
30
0.0
11
66.65564369957746
21
19.50612180407816
31
0.0
0
LINE
5
1B3
330
1F
100
AcDbEntity
8
0
100
AcDbLine
10
55.26446832764479
20
47.29118282642324
30
0.0
11
53.42718194719286
21
19.75092411476788
31
0.0
0
LINE
5
1B4
330
1F
100
AcDbEntity
ENDSEC
0
More stuff below.
Something like this might be what you're looking for, but as jaypal said, without seeing the input it's somewhat of a guess.
sed -n '
/ENTITIES/,/ENDSEC/{
/ 8/{n;h}
/ 10/{n;H}
/ 20/{n;H}
/ 11/{n;H}
/ 21/{n;H;g;s/\n/,/g;p}
}
' Test.txt > out.csv
With comments:
sed -n '
/ENTITIES/,/ENDSEC/{ # only work within the ENTITIES section
/ 8/{n;h}            # store next line in hold space
/ 10/{n;H}           # append next line to hold space (after a newline)
/ 20/{n;H}           # ditto
/ 11/{n;H}           # ditto
/ 21/{n;H;           # ditto
g;                   # copy hold space into pattern space
s/\n/,/g;            # substitute commas for newlines
p                    # print it
}
}
' Test.txt > out.csv
Just pipe your sed to paste:
sed 'your long sed command' | paste -d, - - - - -
the result will be
0,4.93,9.04,27.9,23.4
0,34.56,0.77,66.65,19.50
0,55.26,47.29,53.42,19.75
0,-18.22,44.35,19.74,53.28
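For reference, a sketch of the full pipeline using the filters from the question (Test.txt is fed to the first sed so nothing waits on stdin):
sed -n -e '/ENTITIES/,/ENDSEC/p' Test.txt |
sed -n -e '/ 8/{n;p;}' -e '/ 10/{n;p;}' -e '/ 20/{n;p;}' -e '/ 11/{n;p;}' -e '/ 21/{n;p;}' |
paste -d, - - - - -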
Got it, thanks to ooga! Before, I lacked an understanding of hold space vs. pattern space; now it has all become clear!
sed -n '
/ENTITIES/,/ENDSEC/{
/ 8/{n;h;};
/ 10/{n;H;};
/ 20/{n;H;};
/ 11/{n;H;};
/ 21/{n;H;g;s/\n/,/g;p};
}
' < Test.dxf > out.csv

Search and replace regex in VI, clarification needed

Given the following, I'd like to comment out lines starting with 1, 2, or 3
Some text
1 101 12
1 102 13
2 200 2
// Some comments inside
2 202 4
2 201 7
3 300 0
3 301 7
Some other text
The following regex looks right, and yet it does not work:
%s/^([123])(.+)/#\1\2/g
The same regex matches when used by egrep
egrep '^([123])(.+)' file_name
Please help me understand why this search and replace is failing in VI
You need to escape the characters: ()+. So you could do %s/^\([123]\)\(.\+\)/#\1\2/g, but it seems easier to do: :g/^[123]/s/^/#
Note that vi does have various options for changing the meaning of symbols in patterns (:help magic). In particular, you could use 'very magic' and do: :%s/\v^([123].+)/#\1/g (note that the g flag is completely redundant here!)
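For reference, running :g/^[123]/s/^/# on the sample buffer touches only the numeric lines:
Some text
#1 101 12
#1 102 13
#2 200 2
// Some comments inside
#2 202 4
#2 201 7
#3 300 0
#3 301 7
Some other text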
In Perl,
my $t = "Some text
1 101 12
1 102 13
2 200 2
2 202 4
2 201 7
3 300 0
3 301 7
Some other text";
foreach (split /^/, $t) {
    $_ =~ s/^([1-3])/# $1/;
    print $_;
}
Result:
Some text
# 1 101 12
# 1 102 13
# 2 200 2
# 2 202 4
# 2 201 7
# 3 300 0
# 3 301 7
Some other text