I have a log file with a standard format, e.g.:
31 Mar - Lorem Ipsom1
31 Mar - Lorem Ipsom2
31 Mar - Lorem Ipsom3
The replacement I want to implement is 31*31 to 31, so I'll end up with a log that contains only its last line; in this example it would look like:
31 Mar - Lorem Ipsom3
I wish to perform it on a customized linux machine that has no perl.
I tried to use sed like this:
sed -i -- 's/31*31/31/g' /var/log/prog/logFile
But it did nothing.
Any alternatives involving ninja bash commands are also welcome.
A way to keep only the last of consecutive lines that match a pattern is
sed -n '/^31/ { :a $!{ h; n; //ba; x; G } }; p' filename
This works as follows:
/^31/ {   # if a line begins with 31
  :a      # jump label for looping
  $!{     # if the end of input has not been reached (otherwise the current
          # line is the last line of the block by virtue of being the last
          # line)
    h     # hold the current line
    n     # fetch the next line (note that this doesn't print the line
          # because of -n)
    //ba  # if that line also begins with 31, go to :a. // attempts the
          # most recently attempted regex again, which was ^31
    x     # swap hold buffer and pattern space
    G     # append hold buffer to pattern space. The PS now contains
          # the last line of the block followed by the first line that
          # comes after it
  }
}
p         # in the end, print the result
This avoids some problems of multi-line regular expressions, such as matches that begin or end in the middle of a line. It also does not discard the lines between two blocks of matching lines, and it keeps the last line of each block.
* is not a wildcard here as it is in the shell; it is a quantifier. You need to quantify over . (any character). The command is thus:
sed ':a;N;$!ba;s/31.*31/31/g'
(I removed the -i flag so you can first test your file safely).
The :a;N;$!ba; part reads the whole file into the pattern space, which makes it possible for the substitution to match across newlines.
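For example, a quick run on the sample log (a sketch assuming GNU sed, whose one-liner syntax with ;-separated labels this relies on):

```shell
printf '31 Mar - Lorem Ipsom1\n31 Mar - Lorem Ipsom2\n31 Mar - Lorem Ipsom3\n' |
  sed ':a;N;$!ba;s/31.*31/31/g'
# prints: 31 Mar - Lorem Ipsom3
```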
Note however:
The regex will match any 31, so:
31 Mar - Lorem Ipsom1
31 Mar - Lorem 31 Ipsom2
Will result in
31 Ipsom2
It matches greedily; if the log reads:
31 Mar - Lorem Ipsom1
30 Mar - Lorem Ipsom2
31 Mar - Lorem Ipsom3
It will also remove the second line, even though that line does not start with 31.
You can solve the first problem by writing:
sed -E ':a;N;$!ba;s/(^|\n)31.*\n31/\131/g'
(note the -E flag: the grouping and alternation need extended regex syntax, and \1 puts back the captured line break). This forces both occurrences of 31 to be located at the beginning of a line.
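A quick sanity check of the anchored variant on the problematic mid-line case (a sketch assuming GNU sed; -E enables extended regex syntax and \1 keeps the captured line break):

```shell
printf '31 Mar - Lorem Ipsom1\n31 Mar - Lorem 31 Ipsom2\n' |
  sed -E ':a;N;$!ba;s/(^|\n)31.*\n31/\131/g'
# prints: 31 Mar - Lorem 31 Ipsom2
```

The mid-line "31" no longer triggers the substitution, so the whole second line survives.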
I think you might be looking for "tail" to get the last line of the file
e.g.
tail -1 /path/file
or if you want the last entry from each day then "sort" might be your solution
sort -ur -k 1,2 /path/file | sort
the -u flag specifies that only a single line for each set of key fields will be returned
the -k 1,2 specifies that the key fields are the first two fields, in this case the date and the month; fields by default are separated by whitespace
the -r flag reverses the sort order so that the last entry for each date is the one returned. Sort a second time to restore the original order.
If your log file has more than a single month of data and you wish to preserve order (e.g. if you have 31 Mar and 1 Apr in the same file), you can try:
cat -n tmp2 | sort -nr | sort -u -k 2,3 | sort -n | cut -f 2-
cat -n adds the line number to each log line before sorting.
sort as before, but use fields 2 and 3, because field 1 is now the original line number.
sort by the original line number to restore the original order.
use cut to remove the line numbers and restore the original line content.
e.g.
$ cat tmp2
30 Mar - Lorem Ipsom2
30 Mar - Lorem Ipsom1
31 Mar - Lorem Ipsom1
31 Mar - Lorem Ipsom2
31 Mar - Lorem Ipsom3
1 Apr - Lorem Ipsom1
1 Apr - Lorem Ipsom2
$ cat -n tmp2 | sort -r | sort -u -k 2,3 | sort | cut -f 2-
30 Mar - Lorem Ipsom1
31 Mar - Lorem Ipsom3
1 Apr - Lorem Ipsom2
Related
I am using a regex where, as a first preference, I intend to match the character (number or alphanumeric) immediately succeeding the string "Lecture", and otherwise match the last character of the line in the absence of the string "Lecture".
Current regex:
cat 1.txt | perl -ne 'print "$& \n" while /Lecture\h*\K\w+|^(?!.*Lecture).*\h\K[^.\s]+/g;/^.*?-(.*)/g' | perl -ne 'print "$& \n" while /(\d+\w*)/g'
The data to read is not very consistent. There could be spaces or a hyphen around the string "Lecture" or the end character, and a line may not end in .mp4.
My current regex is working almost well; it just has issues with the bottom 3 lines. I could have included only those lines here, but I don't want the solution regex to break for the other cases, so I'm including all the possibilities below:
cat 1.txt
54282068 Lecture74- AS 29 Question.mp4
174424104Lecture 74B - AS 29 Theory.mp4
Branch Accounts Lecture 105
Lecture05 - Practicals AS 28
Submissions 20.mp4
HW Section 77N
Residential status HWS Q.1 to 6 -60A
Residential status HWS Q.7 to 20 -60B
House property all HWS-60C
Salary HWS Q.11 to 13 - 60F
Salary HWS Q.1 to 5-60D
Salary HWS Q.6 to 10-60E
Salary HWS Q.14 to 20-60G
Operating Costing 351
Expected Output
74
74B
105
05
20
77N
60A
60B
60C
60F
60D
60E
60G
351
Exact issue: for the 3 lines above the last one, it is printing 5, 10 and 20 in addition to the end characters 60D, 60E and 60G.
I believe there's an issue somewhere in the last part of my regex that needs only a very small edit to fix. Hopefully someone can help me.
Please inspect the following piece of code for compliance with your requirements:
use strict;
use warnings;
use feature 'say';
while ( <DATA> ) {
    chomp;
    s/\.mp4//;
    say $1 if /Lecture\s*(\w+)/ or /(\d{2}[A-Z]?)\Z/;
}
__DATA__
54282068 Lecture74- AS 29 Question.mp4
174424104Lecture 74B - AS 29 Theory.mp4
Branch Accounts Lecture 105
Lecture05 - Practicals AS 28
Submissions 20.mp4
HW Section 77N
Residential status HWS Q.1 to 6 -60A
Residential status HWS Q.7 to 20 -60B
House property all HWS-60C
Salary HWS Q.11 to 13 - 60F
Salary HWS Q.1 to 5-60D
Salary HWS Q.6 to 10-60E
Salary HWS Q.14 to 20-60G
Output
74
74B
105
05
20
77N
60A
60B
60C
60F
60D
60E
60G
I have a large file, about 10 GB, and a vector of line numbers which I would like to use to split the file. Ideally I would like to accomplish this using command-line utilities.
File:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
Vector of line numbers:
2 5
Desired output:
File 1:
1 2 3
File 2:
4 5 6
7 8 9
10 11 12
File 3:
13 14 15
16 17 18
Using awk:
$ awk -v v="2 5" '            # space-separated vector of indexes
BEGIN {
    n=split(v,t)              # reshape the vector into a hash
    for(i=1;i<=n;i++)
        a[t[i]]
    i=1                       # filename index
}
{
    if(NR in a) {             # if the record number is in the vector
        close("file" i)       # close the previous file
        i++                   # increase the filename index
    }
    print > ("file" i)        # output to file
}' file
Sample output:
$ cat file2
4 5 6
7 8 9
10 11 12
Very slightly different from James's and kvantour's solutions: passing the vector to awk as a "file"
vec="2 5"
awk '
NR == FNR {nr[$1]; next}
FNR == 1 {filenum = 1; f = FILENAME "." filenum}
FNR in nr {
close(f)
f = FILENAME "." ++filenum
}
{print > f}
' <(printf "%s\n" $vec) file
$ ls -l file file.*
-rw-r--r-- 1 glenn glenn 48 Jul 17 10:02 file
-rw-r--r-- 1 glenn glenn 7 Jul 17 10:09 file.1
-rw-r--r-- 1 glenn glenn 23 Jul 17 10:09 file.2
-rw-r--r-- 1 glenn glenn 18 Jul 17 10:09 file.3
This might work for you:
csplit -z file 2 5
or if you want regexp:
csplit -z file /2/ /5/
With the default values, the output files will be named xxnn where nn starts at 00 and is incremented by 1.
N.B. The -z option prevents empty elided files.
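Applied to the sample file, this produces xx00, xx01 and xx02 (a sketch assuming GNU csplit; -s added here to silence the byte-count report, and a scratch directory used so nothing is clobbered):

```shell
cd "$(mktemp -d)"
printf '1 2 3\n4 5 6\n7 8 9\n10 11 12\n13 14 15\n16 17 18\n' > file
csplit -s -z file 2 5
cat xx01
# prints lines 2-4: 4 5 6 / 7 8 9 / 10 11 12
```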
Here is a little awk that does the trick for you:
awk -v v="2 5" 'BEGIN{v=" 1 "v" "}
index(v," "FNR" ") { close(f); f=FILENAME "." (++i) }
{ print > f }' file
This will create files of the form: file.1, file.2, file.3, ...
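A quick demonstration on the sample file (using a scratch directory so the generated file.N names don't collide with anything):

```shell
cd "$(mktemp -d)"
printf '1 2 3\n4 5 6\n7 8 9\n10 11 12\n13 14 15\n16 17 18\n' > file
awk -v v="2 5" 'BEGIN{v=" 1 "v" "}
  index(v," "FNR" ") { close(f); f=FILENAME "." (++i) }
  { print > f }' file
cat file.2
# prints: 4 5 6 / 7 8 9 / 10 11 12
```

Prepending " 1 " to the vector makes line 1 start the first output file; the space-padded index(...) lookup avoids matching 2 inside 12, for example.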
OK, I've gone totally mental this morning, and I came up with a sed program (with functions, loops, and all) that generates a sed script to do what you want.
Usage:
put the script in a file (e.g. make.sed) and chmod +x it;
then use it as the script for this Sed command sed "$(./make.sed <<< '1 4')" inputfile¹
Note that ./make.sed <<< '1 4' generates the following sed script:
1,1{w file.1
be};1,4{w file.2
be};1,${w file.3
be};:e
¹ Unfortunately I misread the question, so my script works taking the line number of the last line of each block that you want to write to file, so your 2 5 has to be changed to 1 4 to be fed to my script.
#!/usr/bin/env -S sed -Ef
###########################################################
# Main
# make a template sed script, in which we only have to increase
# the number of each numbered output file, each of which is marked
# with a trailing \x0
b makeSkeletonAndMarkNumbers
:skeletonMade
# try putting a stencil on the rightmost digit of the first marked number on
# the line and loop, otherwise exit
b stencilLeastDigitOfNextMarkedNumber
:didStencilLeastDigitOfNextMarkedNumber?
t nextNumberStenciled
b exit
# continue processing next number by adding 1
:nextNumberStenciled
b numberAdd1
:numberAdded1
# try putting a stencil on the rightmost digit of the next marked number on
# the line and loop, otherwise we're done with the first marked number, we can
# clean its marker, and we can loop
b stencilNextNumber
:didStencilNextNumber?
t nextNumberStenciled
b removeStencilAndFirstMarker
:removeStencilAndFirstMarkerDone
b stencilLeastDigitOfNextMarkedNumber
###########################################################
# puts a \n on each side of the first digit marked on the right by \x0
:stencilLeastDigitOfNextMarkedNumber
tr
:r
s/([0-9])\x0;/\n\1\n\x0;/1
b didStencilLeastDigitOfNextMarkedNumber?
###########################################################
# makes desired sed script skeleton from space-separated numbers
:makeSkeletonAndMarkNumbers
s/$/ $/
s/([1-9]+|\$) +?/1,\1{w file.0\x0;be};/g
s/$/:e/
b skeletonMade
###########################################################
# moves the stencil to the next number followed by \x0
:stencilNextNumber
trr
:rr
s/\n(.)\n([^\x0]*\x0[^\x0]+)([0-9])\x0/\1\2\n\3\n\x0/
b didStencilNextNumber?
###########################################################
# +1 with carry to last digit on the line enclosed in between two \n characters
:numberAdd1
#i\
#\nprima della somma:
#l
:digitPlus1
h
s/.*\n([0-9])\n.*/\1/
y/0123456789/1234567890/
G
s/(.)\n(.*)\n.\n/\2\n\1\n/
trrr
:rrr
/[0-9]\n0\n/s/(.)\n0\n/\n\1\n0/
t digitPlus1
# the following line can be problematic for lines starting with number
/[^0-9]\n0\n/s/(.)\n0\n/\n\1\n10/
b numberAdded1
###########################################################
# remove stencil and first marker on line
:removeStencilAndFirstMarker
s/\n(.)\n/\1/
s/\x0//
b removeStencilAndFirstMarkerDone
###########################################################
:exit
# a bit of post processing the `w` command has to be followed
# by the filename, then by a newline, so we change the appropriate `;`s to `\n`.
s/(\{[^;]+);/\1\n/g
I'm trying to convert some (multi-line) git history info (extract file name changes) into a CSV file. Here's my regex and sample file. It's working perfectly on that site.
Regex:
commit (.+)\n(?:.*\n)+?similarity index (\d+)+%\n(rename|copy) from (.+)\n\3 to (.+)\n
Sample input:
commit 2701af4b3b66340644b01835a03bcc760e1606f8
Author: ostrovsky.alex <ostrovsky.alex#a51b5712-02d0-11de-9992-cbdf800730d7>
Date: Sat Oct 16 20:44:32 2010 +0000
* Moved old sources to Maven src/main/java
diff --git a/alexo-chess/src/ao/chess/v2/move/Pawns.java b/alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
similarity index 100%
rename from alexo-chess/src/ao/chess/v2/move/Pawns.java
rename to alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
commit ea53898dcc969286078700f42ca5be36789e7ea7
Author: ostrovsky.alex <ostrovsky.alex#a51b5712-02d0-11de-9992-cbdf800730d7>
Date: Sat Oct 17 03:30:43 2009 +0000
synch
diff --git a/src/chess/v2/move/Pawns.java b/alexo-chess/src/ao/chess/v2/move/Pawns.java
similarity index 100%
copy from src/chess/v2/move/Pawns.java
copy to alexo-chess/src/ao/chess/v2/move/Pawns.java
commit b869f395429a2c1345ce100953bfc6038d9835f5
Author: ostrovsky.alex <ostrovsky.alex#a51b5712-02d0-11de-9992-cbdf800730d7>
Date: Wed Oct 7 22:43:06 2009 +0000
MctsPlayer works
diff --git a/ao/chess/v2/move/Pawns.java b/src/chess/v2/move/Pawns.java
similarity index 100%
copy from ao/chess/v2/move/Pawns.java
copy to src/chess/v2/move/Pawns.java
commit 4c697c510f5154d20be7500be1cbdecbaf99495c
Author: ostrovsky.alex <ostrovsky.alex#a51b5712-02d0-11de-9992-cbdf800730d7>
Date: Wed Sep 23 15:06:17 2009 +0000
* synch
diff --git a/v2/move/Pawns.java b/ao/chess/v2/move/Pawns.java
similarity index 95%
rename from v2/move/Pawns.java
rename to ao/chess/v2/move/Pawns.java
index e0172a3..e3659c5 100644
--- a/v2/move/Pawns.java
+++ b/ao/chess/v2/move/Pawns.java
However, when I try to run the following perl command (in git bash on Windows 10), I only get a single matching line (as opposed to the 4 lines in the sample you can see on the site I linked to above).
I know it's probably something stupid, like it needs to be in a loop. But I'm confused about slurping with -0777 and applying a pattern multiple times. I tried the -p option, but it prints the entire input, and I only want to see the output from the print (i.e., the CSV lines). I also thought /g would make the pattern apply multiple times to the input file, but since -0777 makes it all one "line", I'm not sure anymore.
<Pawns.java.history.txt perl -0777 -ne 'if (/commit (.+)\n(?:.*\n)+?similarity index (\d+)+%\n(rename|copy) from (.+)\n\3 to (.+)\n/g) { print $1.",".$2.",".$3.",".$4.",".$5."\n" }'
The output is only one line, whereas it should be 4 lines in total with the sample file:
2701af4b3b66340644b01835a03bcc760e1606f8,100,rename,alexo-chess/src/ao/chess/v2/move/Pawns.java,alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
Expected output:
2701af4b3b66340644b01835a03bcc760e1606f8,100,rename,alexo-chess/src/ao/chess/v2/move/Pawns.java,alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
ea53898dcc969286078700f42ca5be36789e7ea7,100,copy,src/chess/v2/move/Pawns.java,alexo-chess/src/ao/chess/v2/move/Pawns.java
b869f395429a2c1345ce100953bfc6038d9835f5,100,copy,ao/chess/v2/move/Pawns.java,src/chess/v2/move/Pawns.java
4c697c510f5154d20be7500be1cbdecbaf99495c,95,rename,v2/move/Pawns.java,ao/chess/v2/move/Pawns.java
You just need to change your if into a while:
perl -0777 -ne 'while (/commit (.+)\n(?:.*\n)+?similarity index (\d+)+%\n(rename|copy) from (.+)\n\3 to (.+)\n/g) { print $1.",".$2.",".$3.",".$4.",".$5."\n" }' file
2701af4b3b66340644b01835a03bcc760e1606f8,100,rename,alexo-chess/src/ao/chess/v2/move/Pawns.java,alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
ea53898dcc969286078700f42ca5be36789e7ea7,100,copy,src/chess/v2/move/Pawns.java,alexo-chess/src/ao/chess/v2/move/Pawns.java
b869f395429a2c1345ce100953bfc6038d9835f5,100,copy,ao/chess/v2/move/Pawns.java,src/chess/v2/move/Pawns.java
4c697c510f5154d20be7500be1cbdecbaf99495c,95,rename,v2/move/Pawns.java,ao/chess/v2/move/Pawns.java
The //g operator returns the captured results in list context. Since there are 5 sets of capturing parentheses and 4 matches, the returned list has 20 elements. You need to iterate over that list; your code only looks at the first match. Here's one technique:
perl -0777 -nE '
    @matches = /commit (.+)\n(?:.*\n)+?similarity index (\d+)+%\n(rename|copy) from (.+)\n\3 to (.+)\n/g;
    $" = ",";
    while (@matches) {
        @thismatch = splice @matches, 0, 5;
        say "@thismatch";
    }
' Pawns.java.history.txt
I have one logfile that is space delimited file. The structure is this
Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *
I want to be able to extract the filenames, which sometimes, to my luck, contain a space in the name, e.g. "file name.txt".
I cannot simply cut this using the field position, because of that space that sometimes appears in the name of the files.
The way I was thinking of doing this was getting what is between field 8 counting from the left and field 8 counting from the right.
But I cannot think of anything to help me with that.
Has anyone had to do this before and can shed some light?
Thanks
This is difficult to attempt without a larger data sample, but here is a rough solution that discards the tenth field if it does not match a specified pattern (this only works if there is a single space in the file name):
#!/bin/sh
STORE1=$( echo "Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *" | awk '{print $9}' )
STORE2=$( echo "Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *" | awk '{print $10}' )
# if the tenth field matches the string "X", discard it
if [ "$STORE2" != "X" ]; then
    STORE1="$STORE1 $STORE2"
fi
printf "%s" "$STORE1"
Here's a quick test with Python:
import re
txt = "Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *"
print(re.search(r"\d+(\.\d+){3}\s+\d+\s+(.*)(\s+\S+){8}", txt).group(2))
Yes, I realize this is not shell, but the regular expression picks up anything between the (IP address, integer) pair and the last 8 fields, as you were attempting. Just take the regex and apply it in your script.
echo "Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *" |
sed -r 's#.*/([^.]+\.[A-Za-z]*).*#\1#'
The regex could be explained as follows:
.*/ Matches every character until the last slash.
([^.]+\.[A-Za-z]*) Matches everything from there and up to the first dot, followed by alphabetic characters. This is the filename. The text matched is captured by the group.
.* Matches the rest of the line.
The whole line is therefore substituted with \1, the text captured by group 1 (the filename), and the result is written to standard output.
Some assumptions were made: the file must always have a slash from its path, the filename must have only one dot for the extension, and the extension consists of only alphabetic characters.
Thanks everyone for the input. I thought a bit more about it and used awk to get it done,
looping over the fields from field 9 up to the last field minus 8:
awk '{out=$9; for(i=10;i<=NF-8;i++) out=out" "$i; print out}' file
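On the sample line this picks up the two-word filename (field positions 9 through NF-8 are an assumption based on the format shown above):

```shell
echo 'Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *' |
  awk '{out=$9; for(i=10;i<=NF-8;i++) out=out" "$i; print out}'
# prints: /dir/file name.txt
```

For a filename without a space the loop body never runs and field 9 alone is printed.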
I have thousands of lines of data similar to
abc:1.35 (Johndoe 10-Oct-14): /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml - Wed Aug 27 17:57:37 2014 33 13590770 33056 1 422 6367 234
efg:1.1 (Jane 12-Oct-14): /usr/data/2013a/resources/source data/abstractbpweight/file.xml - Tue Aug 26 17:57:37 2014 33 13590770 33056 1 422 6367 234
To get just the first column and the fourth column (url) into another file, I was using
awk '{print $1 $4}' file > smallerfile
Now the fourth column (the url) sometimes has spaces, so the entire path did not get captured in some cases. I also suspect it might contain other characters too (e.g. -, _, etc.), hence I wasn't sure I could split on "-". How can I get just the first column and the fourth column in its entirety?
Thanks
Assuming your normal lines (i.e. those without extra spaces in the url) always have 17 fields:
awk '{printf "%s",$1;for(i=4;i<NF-12;i++)printf "%s%s",OFS,$i;if(NF)print ""}' input.txt
Output:
abc:1.35 /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml
efg:1.1 /usr/data/2013a/resources/source data/abstractbpweight/file.xml
It prints the first field, then field 4 plus any extra fields belonging to the url that pushed the total number of fields above 17. The if(NF) suppresses empty lines; delete it if you need to keep them.
You can try this way:
awk -F'[-:]' '{ split($2,a," "); print $1 ":" a[1] $5 }' file
The idea is to use - and : as field separators to allow any number of spaces inside the parentheses.
But indeed the path can contain a hyphen too. To guard against that you can use sed instead, checking for the space and hyphen after the path (note that extended regexes have no lazy quantifier, so a greedy .+ is used; it works because it must be followed by whitespace and a hyphen):
sed -r 's/^(\S+)[^:]+:\s+(.+)\s+-.*/\1 \t\2/' file
Use the pattern /\.xml/ to decide what to print:
awk '$4~/\.xml/{print $1,$4} $5~/\.xml/{print $1,$4,$5}' file
will produce output
abc:1.35 /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml
efg:1.1 /usr/data/2013a/resources/source data/abstractbpweight/file.xml
What it does:
$4~/\.xml/ checks if the pattern .xml is contained in the 4th field; if yes, it prints $1 and $4.
$5~/\.xml/ checks if the pattern .xml is contained in the 5th field; if yes, it prints $1, $4 and $5.