I have a SAS log file and I want to list only those lines that are between two words: data and run.
The file can contain these words many times, on many lines. For example:
MPRINT: data xxxxx;
yyyyy
xxxxxx
MPRINT: run;
fffff
yyyyy
data fff;
fffff
run;
I would like to get lines 1-4 and 7-9 of the example.
I tried something like
egrep -iz file -e '\sdata\s+\S*\s+(.|\s)*\srun\s'
but this expression lists all lines between the first data and the last run (the (.|\s) is there to match the newline character).
I may also want to add more words to the pattern between data and run, like:
MPRINT: data xxx;
fffff
NOTE: ffdd
set fff;
xxxxxx
MPRINT: run;
data fff;
yyyyyy
run;
In some cases I would like to list only the lines between data and run where the word set appears on some line.
I know there are many similar threads, but I didn't find any where the keywords can repeat multiple times.
I'm not familiar with awk or sed, but if they can help I can use them too.
[Edit]
Note that data and run are not necessarily at the beginning of the line (I updated the example). Also, there can't be another data between a data and its run.
[Edit2]
As Tom noted, every line I was looking for started with MPRINT(...):, so I filtered on those lines first.
Anubhava's answer helped me the most with my final solution, so I marked it as the answer.
The final expression looked like this:
grep -o path -e 'MPRINT.*' | cut -f '2-' -d ' ' |
grep -iozP '(?ms) data [^\(;\s]+.*?(set|infile).*?run[^\n]*\n'
You may use this gnu grep command with the -P (PCRE) option:
grep -ozP '(?ms).*?data .*?run[^\n]*\n' file
If you only want to print blocks that contain a line starting with set then use:
grep -ozP '(?ms).*?data .*?^set.*?run[^\n]*\n' file
MPRINT: data xxxxx;
yyyyy
set fff;
xxxxxx
MLOGIC: run;
You may use this awk to print between the 2 keywords only when the block contains a line starting with set:
awk '/data / {      # block start: set flag, clear the buffer
    p=1; buf=""
}
p && !y {           # inside a block, before set: buffer the lines
    if (/^set/)
        y=1
    else
        buf = buf $0 ORS
}
y {                 # set was seen: flush the buffer, then print as we go
    if (buf != "")
        printf "%s", buf
    buf=""
    print
}
/run/ {             # block end: reset the flags
    p=y=0
}' file
MPRINT: data xxxxx;
yyyyy
set fff;
xxxxxx
MLOGIC: run;
If you just want to print the lines between the 2 keywords in awk, it is simple:
awk '/data /,/run/' file
From what I understand, the following will do the trick
sed -n '/data.*;/,/run;/p' $FILENAME
Note that the '.*' after data can be improved to something like [a-zA-Z]{5}, so that you protect against matching the word data somewhere in the middle of a line.
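For example (a sketch using extended regexes via -E; the alphabetic class is an assumption about your dataset names):
sed -nE '/data [a-zA-Z]+.*;/,/run;/p' $FILENAME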
From there, matching from data to set would already require some external decision process, so the command would be
sed -n '/data.*;/,/set.*;/p' $FILENAME
(Probably learned along the way from How to use sed/grep to extract text between two words?)
Just try (?s)data.+?run;
Explanation:
(?s) - single line mode, . matches newline character
data - match data literally
.+? - match one or more of any character (including newline), non-greedy due to ?
run; - match run; literally
Demo
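Applied with GNU grep, where -z lets the pattern cross line boundaries (a sketch in the same spirit as the commands above):
grep -ozP '(?s)data.+?run;' file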
Related
I want to use grep/awk/sed to extract matched strings from each line of a log file, and then place them into a CSV file.
The strings I want from the first log line below are 1434, 53, and http://www.espn.com/.
If the input is:
2018-10-31 18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850, pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45, PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/
2018-10-31 18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0, loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=28, URL=http://www.diply.com/
Expected output for the above log lines:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
This is an example, and the actual Log File will have much more data.
My solution so far:
For now I used grep to get all lines containing the keyword 'FETCH DONE' (these lines contain the strings I am looking for).
I did come up with a regular expression that matches the data I need, but when I grep it and write it to the file, each string is printed on a new line, which is not quite what I am looking for.
The grep and regular expression I use (online regex tool: https://regexr.com/42cah):
echo -en 'URL,LoadTime,Objects\n'>test1.csv #add header
grep -Po '(?<=loadTime=).{1,5}(?= )|((?<=URL=).*|\/(?=.))|((?<=objects=).{1,5}(?=\,))'>>test1.csv #get matching strings
Actual output:
URL,LoadTime,Objects
http://www.espn.com
1434
53
http://www.diply.com
1301
54
Expected output:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
I tried using awk to match multiple regexes and print a comma in between, but I couldn't get it to work at all for some reason, even though my regex matches the correct strings.
Another idea I have is to use sed to replace some of the '\n' characters with ',':
for(i=1;i<=n;i++)
if(i % 3 != 0){
sed REPLACE "\n" with "," on i-th line
}
I'm pretty sure there is a more efficient way of doing it.
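(For the record, that idea needs no indexed loop: folding the grep output through paste joins every three lines. A sketch, assuming each log line yields its three matches in loadTime, objects, URL order, with logfile standing in for the input file:
grep 'FETCH DONE' logfile | grep -Po '(?<=loadTime=).{1,5}(?= )|((?<=URL=).*|\/(?=.))|((?<=objects=).{1,5}(?=\,))' | paste -d, - - - | awk -F, -v OFS=, '{print $3, $1, $2}' >> test1.csv
paste -d, - - - joins every three input lines with commas, and the final awk reorders the columns to URL,LoadTime,Objects.)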
Using sed:
sed -n 's/.*loadTime=\([0-9]*\)[^,]*, objects=\([0-9]*\).* URL=\(.*\)/\3,\1,\2/p' input | \
sed 1i'URL,LoadTime,Objects'
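If you'd rather stay in awk, here is a GNU awk sketch using match() with a capture-group array (a gawk extension); the field names are taken from the sample log:
gawk 'BEGIN { print "URL,LoadTime,Objects" }
      match($0, /loadTime=([0-9]+) ms, objects=([0-9]+).* URL=(.*)/, m) {
          print m[3] "," m[1] "," m[2]   # reorder to URL, load time, objects
      }' input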
I am merging a number of text files on a Linux server, but the lines in some differ slightly and I need to unify them.
For example, some files will have a line like
id='1244' group='american' name='fred',american
Other files will be like
id='2345' name='frank', english
finally others will be like
id='7897' group='' name='maria',scottish
What I need to do is: if group='' or group is not in the string at all, add it somewhere before the comma, set to the text that follows the comma. So in the 2nd example above the line would become:
id='2345' name='frank' group='english',english
and the same in the last example which would become
id='7897' name='maria' group='scottish',scottish
This is going into a bash script. I can't simply delete the line and re-add it at the end of the file, as it relates to the following line.
I've used the following:
sed -i.bak "s#group=''##" file
which deletes the group='' string, so the lines will either contain group='something' or won't contain it at all, and that works.
Then I tried to add the group if it doesn't exist using the following:
sed -i.bak '/group/! s#,(.*$)#group="\1",\1#' file
but that throws up the error
sed: -e expression #1, char 38: invalid reference \1 on `s' command's RHS
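(That error occurs because (.*$) is an extended-regex group; sed's default basic regexes write groups as \(...\), so there is no group for \1 to refer back to. Escaping the parentheses, or adding -r as below, makes the expression itself valid, though it still writes double quotes where the data uses single ones:
sed -i.bak -r '/group/! s#,(.*$)#group="\1",\1#' file)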
EDIT by Ed Morton to create a single sample input file and expected output:
Sample Input:
id='1244' group='american' name='fred',american
foo
id='2345' name='frank', english
bar
id='7897' group='' name='maria',scottish
Expected Output:
id='1244' group='american' name='fred',american
foo
id='2345' name='frank' group='english',english
bar
id='7897' name='maria' group='scottish',scottish
sed -r "
/group=''/ s/// # group is empty, remove it
/group=/! s/,[[:blank:]]*(.+)/ group='\\1',\\1/ # group is missing, add it
" file
id='1244' group='american' name='fred',american
foo
id='2345' name='frank' group='english',english
bar
id='7897' name='maria' group='scottish',scottish
The foo and bar lines are untouched because the s/// command did not match a comma followed by characters.
something like
sed '
/^[^,]*group[^,]*,/ ! {
s/, *\(.*\)/ group='\''\1'\'',\1/
}
/^[^,]*group='\'\''/ {
s/group='\'\''\([^,]*\), *\(.*\)/group='\''\2'\''\1,\2/
}
'
This GNU awk may help:
awk -v sq="'" '
BEGIN{RS="[ ,\n]+"; FS="="; found=0}
$1=="group"{
if($2==sq sq)
{next}
else
{found=1}
}
NF>1{
printf "%s=%s ",$1,$2
}
NF==1{
if(!found)
{printf "group=%s",$1}
print ","$1
found=0
}
' file
The script relies on the record separator RS which is set to get all key='value' pairs.
If the key group isn't found or is empty, it is printed when reaching a record with only one field.
Note that the variable sq holds the single quote character and is used to detect empty group field.
Sed can be pretty ugly. And your data format appears to be somewhat inconsistent. This MIGHT work for you:
$ sed -e "/group='[a-z]/b e" -e "s/group='' *//" -e "s/, *\([a-z]*\)$/ group='\1',\1/" -e ':e' input.txt
Broken out for easier reading, here's what we're doing:
/group='[a-z]/b e - If the line contains a valid group, branch to the end.
s/group='' *// - Remove any empty group,
s/, *\([a-z]*\)$/ group='\1',\1/ - add a new group from the text after the comma, keeping that text in place
:e - branch label for the first command.
And then the default action is to print the line.
I really don't like manipulating data this way. It's prone to error, and you'll be further ahead reading this data into something that accurately stores its data structure, then prints the data according to a new structure. A more robust solution would likely be tied directly to whatever is producing or consuming this data, and would not sit in the middle like this.
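To illustrate that point, here is a minimal awk sketch (illustrative only, not code from the question; it assumes the key='value' layout of the samples): split each line at the final comma, normalize group, and reprint:
awk -v q="'" '
    # pass through lines that do not end in ",word"
    !/,[ ]*[A-Za-z]+$/ { print; next }
    {
        tail = $0; sub(/.*,[ ]*/, "", tail)          # the word after the comma
        head = $0; sub(/,[ ]*[A-Za-z]+$/, "", head)  # everything before it
        gsub("group=" q q " ?", "", head)            # drop an empty group
        if (head !~ /group=/) head = head " group=" q tail q
        print head "," tail
    }
' file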
I have a bunch of daily printer logs in CSV format and I'm writing a script to keep track of how much paper is being used and save the info to a database, but I've come across a small problem
Essentially, some of the document names in the logs include commas in them (which are all enclosed within double quotes), and since it's in comma separated format, my code is messing up and pushing everything one column to the right for certain records.
From what I've been reading, it seems like the best way to go about fixing this would be using awk or sed, but I'm unsure which is the best option for my situation, and how exactly I'm supposed to implement it.
Here's a sample of my input data:
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"MicrosoftWordDocument1",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,
And here's what I have so far:
#!/bin/bash
#Get today's file name
yearprefix="20"
currentdate=$(date +"%m-%d-%y");
year=${currentdate:6};
year="$yearprefix$year"
month=${currentdate:0:2};
day=${currentdate:3:2};
filename="papercut-print-log-$year-$month-$day.csv"
echo "The filename is: $filename"
# Remove commas in between quotes.
#Loop through CSV file
OLDIFS=$IFS
IFS=,
[ ! -f "$filename" ] && { echo "Input file $filename not found"; exit 99; }
while read time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
do
#Remove headers
if [ "$user" != "" ] && [ "$user" != "User" ]
then
#Remove any file name with an apostrophe
if [[ "$document" =~ "'" ]];
then
document="REDACTED"; # Lazy. Need to figure out a proper solution later.
fi
echo "$time"
#Save results to database
mysql -u username -p -h localhost -e "USE printerusage; INSERT INTO printerlogs (time, username, pages, copies, printer, document, client, size, pcl, duplex, greyscale, filesize) VALUES ('$time', '$user', '$pages', '$copies', '$printer', '$document', '$client', '$size', '$pcl', '$duplex', '$greyscale', '$filesize');"
fi
done < "$filename"
IFS=$OLDIFS
Which option is more suitable for this task? Will I have to create a second temporary file to get this done?
Thanks in advance!
As I wrote in another answer:
Rather than interfere with what is evidently source data, i.e. the stuff inside the quotes, you might consider replacing the field-separator commas (with say |) instead:
s/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g
And then splitting on | (assuming none of your data has | in it).
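A runnable sketch of that substitution (perl, since the pattern uses a lookahead; file is a stand-in for your CSV):
perl -pe 's/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g' file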
Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern
There is probably an easier way using sed alone, but this should work. Loop on the file; for each line, match the quoted strings with grep -o, then replace the commas inside them (with spaces here, or whatever you would like to use to get rid of the commas; if you want to preserve the data you can use a non-printable character and explode it back to commas afterward).
i=1 && IFS=$(echo -en "\n\b") && for a in $(< test.txt); do
    var="${a}"
    # pull each quoted string out of the current line
    for b in $(sed -n ${i}p test.txt | grep -o '"[^"]*"'); do
        repl="$(sed "s/,/ /g" <<< "${b}")"         # commas -> spaces inside the quotes
        var="$(sed "s#${b}#${repl}#" <<< "${var}")"
    done
    let i+=1
    echo "${var}"
done
I have a lot of csv files that I am having trouble reading, since the delimiter is ',' and one of the fields is a list of comma-separated values in square brackets. As an example:
first,last,list
John,Doe,['foo','234','&3bar']
Johnny,Does,['foofo','abc234','d%9lk','other']
I would like to change the delimiter to '|' (or whatever else) to get:
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
How can I do this? I'm trying to use sed right now, but anything that works is fine.
I don't know whether it is possible through sed or awk, but you can do this easily with perl.
$ perl -pe 's/\[.*?\](*SKIP)(*F)|,/|/g' file
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
Run the below command to save the changes made to that file.
perl -i -pe 's/\[.*?\](*SKIP)(*F)|,/|/g' file
If it's always 2 values before the list, you could make use of the limit argument to split in perl:
perl -pe '$_ = join "|", split /,/, $_, 3' list
This splits on commas up to a maximum of 3 fields, then joins them back together with a pipe. The -p switch means that each line of input is stored in $_, the code runs, and then $_ is printed.
Output:
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
I am trying to filter out text between two patterns. I've seen a dozen examples but didn't manage to get exactly what I want:
Sample input:
START LEAVEMEBE text
 data
START DELETEME text
 data
 more data
 even more
START LEAVEMEBE text
 data
 more data
START DELETEME text
 data
 more
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I want to be left with:
START LEAVEMEBE text
 data
START LEAVEMEBE text
 data
 more data
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I tried running sed with:
sed '/^START DELETEME/,/^[^ ]/d'
and got an inclusive removal, so I tried adding "exclusions" (not sure if I really understand this syntax well):
sed '/^START DELETEME/,/^[^ ]/{/^[^ ]/!d}'
But my "START DELETEME" line is still there (yes, I can grep it out, but that's ugly :) and besides, it DOES remove the empty line in this sample as well, and I'd like to leave empty lines intact when they are my end pattern).
I am wondering if there is a way to do it with a single sed command.
I have an awk script that does this well:
BEGIN { flag = 0 }
{
if ($0 ~ "^START DELETEME")
flag=1
else if ($0 !~ "^ ")
flag=0
if (flag != 1)
print $0
}
But as you know "A is for awk which runs like a snail". It takes forever.
Thanks in advance.
Dave.
Using a loop in sed:
sed -n '/^START DELETEME/{:l n; /^[ ]/bl};p' input
GNU sed
sed '/LEAVEMEBE/,/DELETEME/!d;{/DELETEME/d}' file
I would stick with awk:
awk '
/LEAVE|SOMETHING/{flag=1}
/DELETE/{flag=0}
flag' file
But if you still prefer sed, here's another way:
sed -n '
/LEAVE/,/DELETE/{
/DELETE/b
p
}
' file