How to print a greedy range of lines using awk - regex

I've encountered the following problem and haven't found a solution, nor an explanation of why awk behaves in this strange way.
So let's say I have the following text in a file:
startcue
This shouldn't be found.
startcue
This is the text I want to find.
endcue
startcue
This shouldn't be found either.
And I want to find the lines "startcue", "This is the text I want to find.", and "endcue".
I naively assumed that a simple range search by awk '/startcue/,/endcue/' would do it, but this prints out the whole file. I guess awk somehow finds the first range, but as the third startcue triggers on the printing of lines, it prints all the lines until the end of the file (still, this all seems a bit strange to me).
Now to the question: How can I get awk to print out just the lines I want? And maybe as an extra question: Can anybody explain awk's behaviour?
Thanks

$ awk '/startcue/{f=1; buf=""} f{buf = buf $0 RS} /endcue/{printf "%s",buf; f=0}' file
startcue
This is the text I want to find.
endcue

Here is a simple way to do it.
Since the data is separated by blank lines, I set RS to an empty string.
This makes awk work with the data in blocks.
Then find all blocks starting with startcue and ending with endcue:
awk -v RS="" '/^startcue/ && /endcue$/' file
startcue
This is the text I want to find.
endcue
If startcue and endcue are always the first and last lines and each appears only once in the block, this should do. (PS: testing shows that it does not matter how many hits there are in the block. This always prints the block if both startcue and endcue are found.)
awk -v RS="" '/startcue/ && /endcue/' file
startcue
This is the text I want to find.
endcue
And this should work too:
awk -v RS="" '/startcue.*endcue/' file
startcue
This is the text I want to find.
endcue

To summarize the problem, you want to print lines from startcue to endcue, but not if the endcue is missing. Ed Morton's approach is good. Here is yet another approach:
$ tac file | awk '/endcue/,/startcue/' | tac
startcue
This is the text I want to find.
endcue
How it works
tac file
This prints the lines in reverse order. tac is just like cat except that the lines come out in reverse order.
awk '/endcue/,/startcue/'
This prints all lines starting from endcue and finishing with startcue. When done this way, passages with missing endcues are not printed.
tac
This reverses the lines once again so that they are back in the correct order.
How awk ranges work
Consider:
awk '/startcue/,/endcue/' file
This tells awk to start printing when it finds startcue and continue printing until it finds endcue. This is exactly what it does on your file.
There is no implied rule that the range /startcue/,/endcue/ cannot itself contain multiple instances of startcue. awk simply starts printing when it sees the first occurrence of startcue and continues until it finds endcue. If a later startcue opens a new range and no endcue follows, that range runs to the end of the file.
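To see this behaviour directly, here is a small sketch that recreates the sample file from the question and compares the plain range with a commented version of the buffering idea from the first answer. The temporary file and the helper names `plain_range` and `last_start_only` are made up for illustration.

```shell
#!/bin/sh
# Recreate the sample input from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
startcue
This shouldn't be found.
startcue
This is the text I want to find.
endcue
startcue
This shouldn't be found either.
EOF

# Plain range: printing starts at the first startcue and stops only at an endcue.
# The last startcue opens a new range that never closes, so it runs to end of file.
plain_range() {
    awk '/startcue/,/endcue/' "$1"
}

# Buffered variant: every startcue discards what was collected so far,
# and the buffer is printed only once an endcue is actually seen.
last_start_only() {
    awk '
        /startcue/ { collecting = 1; buf = "" }
        collecting { buf = buf $0 "\n" }
        /endcue/   { printf "%s", buf; collecting = 0; buf = "" }
    ' "$1"
}

plain_lines=$(plain_range "$sample" | wc -l | tr -d ' ')
fixed_out=$(last_start_only "$sample")
rm -f "$sample"

echo "plain range printed $plain_lines of 7 lines"
printf '%s\n' "$fixed_out"
```

The plain range prints every line of the sample, while the buffered variant prints only the block that is closed by an endcue.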

No buffering needed:
{m,n,g}awk 'BEGIN { _ +=_ ^= ORS = FS = RS = "\nendcue\n"
sub("end", "?start", RS)
__= substr(RS, _+--_) } (NF=_<NF) && $!_=__$_'
startcue
This is the text I want to find.
endcue

Related

Is there a way to use regex with awk to execute a command only when the pattern is matched?

I'm trying to write a bash script that gets user input, checks a .txt for the line that contains that input then plugs that into a wget statement to commence a download.
In testing the functionality awk seems to print out every line, not just pattern matched lines.
chosen=DSC01985
awk -v c="$chosen" 'BEGIN {FS="/"; /c/}
{print $8, "found", c}
END{print " done"}' ./imgLink.txt
The above should take from imgLink.txt, search for the pattern and return that the pattern is found. Instead it prints the 8th field of every line in the file.
I have tried moving /c/ out of the BEGIN block, but to no avail.
What's going on here?
Example input:
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01533.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01536.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01543.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01558.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01565.jpg
etc.
Example output:
...
DSC02028.jpg found DSC01985
DSC02030.jpg found DSC01985
DSC02032.jpg found DSC01985
DSC02038.jpg found DSC01985
DSC02042.jpg found DSC01985
etc.
You were close in your attempt. You can't match against an awk variable with /var/, because /c/ is a regex constant that searches for the literal letter c; use the ~ operator with the variable instead. Could you please try the following? It assumes that the string you are looking for appears in the URL values, which you have currently xxxed out in your post.
awk -v c="$chosen" -F'/' '$0 ~ c{print $NF " found " c}' Input_file
I'm not sure why you print done in your END block; you can add it here if you need it. Also, $NF means the last field of the current line; print a different field if that suits your needs better.
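As a rough illustration of the difference, the sketch below runs three variants against a stand-in for imgLink.txt (the file contents are assumed, following the URL shape in the question). In the original script, /c/ sat inside BEGIN, where it is evaluated once against an empty record and its result is thrown away, and the main block had no pattern at all, so it ran for every line.

```shell
#!/bin/sh
# Hypothetical stand-in for imgLink.txt, using the URL shape from the question.
links=$(mktemp)
cat > "$links" <<'EOF'
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01533.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01985.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC02028.jpg
EOF

chosen=DSC01985

# /c/ is a regex constant that looks for the literal letter "c";
# it never consults the awk variable c, so nothing matches here.
literal_c=$(awk -v c="$chosen" -F'/' '/c/ { print $NF }' "$links")

# $0 ~ c uses the contents of c as a dynamic regular expression.
regex_hit=$(awk -v c="$chosen" -F'/' '$0 ~ c { print $NF " found " c }' "$links")

# index() does a plain substring search, which avoids surprises if the
# user input contains regex metacharacters such as "." or "[".
substr_hit=$(awk -v c="$chosen" -F'/' 'index($0, c) { print $NF " found " c }' "$links")
rm -f "$links"

printf 'regex:     %s\n' "$regex_hit"
printf 'substring: %s\n' "$substr_hit"
```

Whether to prefer `$0 ~ c` or `index($0, c)` depends on whether the user input is meant to be a pattern or a literal string.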

Adding commas when necessary to a csv file using regex

I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second row is missing a comma. Is there some regex or awk-like tool that can append commas to the end of a line when commas are missing in these rows?
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was ok with any of them. Of course, the first alternative is the best but I wasn't sure if this was easy to implement or not.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.
You can use this awk command to pad every row after the header with empty cells, based on the number of columns in the header row, so that the column count is not hard-coded:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
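The padding relies on the fact that assigning to a field beyond NF makes awk create the missing fields as empty strings and rebuild the record with OFS. A minimal sketch of that idea, assuming a made-up extra row (`Orphan`) that is short by two columns:

```shell
#!/bin/sh
# Sample data: the header has 3 columns; two data rows are short.
csv=$(mktemp)
cat > "$csv" <<'EOF'
entity_name,data_field_name,type
Unit,id
Track,id,LONG
Orphan
EOF

padded=$(awk '
    BEGIN         { FS = OFS = "," }
    NR == 1       { nc = NF }      # column count comes from the header row
    NF && NF < nc { $nc = "" }     # creating field nc fills the gap with empty fields
    { print }
' "$csv")
rm -f "$csv"

printf '%s\n' "$padded"
```

Rows that already have at least as many fields as the header are printed unchanged, and blank lines are left alone because of the `NF &&` guard.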
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
I would use sed,
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar
Try this:
$ awk -F , 'NF==2{$2=$2","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
With another awk:
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv
To balance all the awk solutions, the following is a vim-only solution:
:v/,.*,/norm A,
Rationale:
/,.*,/ searches for two commas in a line
:v applies a command to each line NOT matching the search
norm A, enters normal mode and appends a , to the end of the line
This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG

Find text enclosed by patterns using sed

I have a config file like this:
[whatever]
Do I need this? no!
[directive]
This lines I want
Very much text here
So interesting
[otherdirective]
I dont care about this one anymore
Now I want to match the lines in between [directive] and [otherdirective] without matching [directive] or [otherdirective].
Also if [otherdirective] is not found all lines till the end of file should be returned. The [...] might contain any number or letter.
Attempt
I tried this using sed like this:
sed -r '/\[directive\]/,/\[[[:alnum:]]+\]/!d'
The only problem with this attempt is that the first line is [directive] and the last line is [otherdirective].
I know how to pipe this again to truncate the first and last line but is there a sed solution to this?
You can use the range, as you were trying, and inside it use a negated empty regex //. An empty regex reuses the last regular expression that was applied, so it skips both edge lines:
sed -n '/\[directive\]/,/\[otherdirective\]/ { //! p }' infile
It yields:
This lines I want
Very much text here
So interesting
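The sed range above hard-codes [otherdirective] as the end marker. As a sketch of the more general requirement in the question (stop at any [...] header, or run to the end of the file if there is none), a state flag in awk can be used. The helper name `section_body` is hypothetical, and headers are assumed to be lines consisting only of letters and digits in square brackets.

```shell
#!/bin/sh
# Sample config from the question.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[whatever]
Do I need this? no!
[directive]
This lines I want
Very much text here
So interesting
[otherdirective]
I dont care about this one anymore
EOF

# Print the body of one section: section_body NAME FILE
section_body() {
    awk -v want="[$1]" '
        # Any header line switches the state and is never printed itself.
        /^\[[A-Za-z0-9]+\]$/ { inside = ($0 == want); next }
        # Lines are printed only while we are inside the wanted section.
        inside
    ' "$2"
}

body=$(section_body directive "$conf")
last_body=$(section_body otherdirective "$conf")
rm -f "$conf"

printf '%s\n' "$body"
```

Because the flag is only cleared by the next header, the last section in a file is returned up to the end of the file.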
Here is a nice way with awk to get a section of data.
awk -v RS= '/\[directive\]/' file
[directive]
This lines I want
Very much text here
So interesting
Setting RS to nothing (RS=) divides the file into records separated by blank lines.
So when searching for [directive], awk prints that whole record.
Normally a record is one line, but because RS (the record separator) has been changed, you get the whole block.
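A small sketch of this paragraph mode, assuming a version of the config in which the sections really are separated by blank lines (the sample contents are an assumption; without blank lines the whole file is a single record and all of it would be printed):

```shell
#!/bin/sh
# Assumed layout: sections separated by blank lines.
blocks=$(mktemp)
cat > "$blocks" <<'EOF'
[whatever]
Do I need this? no!

[directive]
This lines I want
Very much text here

[otherdirective]
I dont care about this one anymore
EOF

# With RS empty, each run of non-blank lines is one record.
record_count=$(awk -v RS= 'END { print NR }' "$blocks")

# A pattern now selects whole blocks rather than single lines.
directive_block=$(awk -v RS= '/\[directive\]/' "$blocks")
rm -f "$blocks"

echo "records: $record_count"
printf '%s\n' "$directive_block"
```

Note that the header line [directive] is part of the record, so it is printed along with the body.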
Okay, damn, after more tries I found the solution, or at least one solution:
sed -rn '/\[buildout\]/,/\[[[:alnum:]]+\]/{
/\[[[:alnum:]]+\]/d
p }'
Is this what you want?
\[directive\](.*?)\[

sed: remove strings between two patterns leaving the 2nd pattern intact (half inclusive)

I am trying to filter out text between two patterns. I've seen a dozen examples but didn't manage to get exactly what I want:
Sample input:
START LEAVEMEBE text
data
START DELETEME text
data
more data
even more
START LEAVEMEBE text
data
more data
START DELETEME text
data
more
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I want to stay with:
START LEAVEMEBE text
data
START LEAVEMEBE text
data
more data
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I tried running sed with:
sed '/^START DELETEME/,/^[^ ]/d'
That gave me an inclusive removal, so I tried adding "exclusions" (not sure if I really understand this syntax well):
sed '/^START DELETEME/,/^[^ ]/{/^[^ ]/!d}'
But my "START DELETEME" line is still there (yes, I can grep it out, but that's ugly :) ). Besides, it DOES remove the empty line in this sample as well, and I'd like to keep empty lines intact when they are my end pattern.
I am wondering if there is a way to do it with a single sed command.
I have an awk script that does this well:
BEGIN { flag = 0 }
{
if ($0 ~ "^START DELETEME")
flag=1
else if ($0 !~ "^ ")
flag=0
if (flag != 1)
print $0
}
But as you know "A is for awk which runs like a snail". It takes forever.
Thanks in advance.
Dave.
Using a loop in sed:
sed -n '/^START DELETEME/{:l n; /^[ ]/bl};p' input
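For readers less familiar with sed loops, here is the same one-liner spread over several lines with comments. The sample input is an assumption: it follows the question, but with continuation lines starting with a space, which is what the `/^[ ]/` test (and the asker's awk script) implies. The label is renamed to `skip` for readability.

```shell
#!/bin/sh
# Assumed input: lines belonging to a block start with a space.
log=$(mktemp)
cat > "$log" <<'EOF'
START LEAVEMEBE text
 data
START DELETEME text
 data
 more data
START LEAVEMEBE text
 data
START DELETEME text
 data
SOMETHING that doesn't start with START
EOF

kept=$(sed -n '
# On a DELETEME header, enter a loop that swallows the block.
/^START DELETEME/ {
    :skip
    # n replaces the pattern space with the next line (no print because of -n).
    n
    # Still an indented line: keep swallowing.
    /^ /b skip
}
# Anything that reaches this point is printed, including the line that ended the loop.
p
' "$log")
rm -f "$log"

printf '%s\n' "$kept"
```

One limitation of this sketch: if a DELETEME block is followed immediately by another DELETEME header, that second header is the line that ends the loop, so it is printed.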
GNU sed
sed '/LEAVEMEBE/,/DELETEME/!d;{/DELETEME/d}' file
I would stick with awk:
awk '
/LEAVE|SOMETHING/{flag=1}
/DELETE/{flag=0}
flag' file
But if you still prefer sed, here's another way:
sed -n '
/LEAVE/,/DELETE/{
/DELETE/b
p
}
' file

How can I remove all characters in each line after the first space in a text file?

I have a large log file from which I need to extract file names.
The file looks like this:
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
.... about 10 million times
I need to extract the file names like this:
loremIpsumDolor.sit
anotherFile.ext
I figure my first strategy is to find/replace all /path/to/ with ''. But I'm stuck on how to remove all characters after the space.
Can you help?
sed 's/ .*//' file
It doesn't take any more than that. The transformed output appears on standard output, of course.
In theory, you could also use awk to grab the filename from each line as:
awk '{ print $1 }' input_file.log
That, of course, assumes that there are no spaces in any of the filenames. awk defaults to looking for whitespace as the field delimiters, so the above snippet would take the first "field" from your log file (your filename) for each line, and output it.
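Since the question asks for the bare file name rather than the full path, the first field can be split on slashes as well. A minimal sketch under the same assumption (no spaces in the paths), using the two sample lines from the question in a temporary file:

```shell
#!/bin/sh
# The two sample lines from the question.
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
EOF

names=$(awk '{
    n = split($1, parts, "/")   # break the first field on slashes
    print parts[n]              # the last piece is the bare file name
}' "$logfile")
rm -f "$logfile"

printf '%s\n' "$names"
```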
Pass it to cut:
cut '-d ' -f1 yourfile
A bash-only solution:
while read -r path otherstuff; do
    # strip everything up to the last slash, leaving the file name
    printf '%s\n' "${path##*/}"
done < filename