Find text enclosed by patterns using sed - regex

I have a config file like this:
[whatever]
Do I need this? no!
[directive]
This lines I want
Very much text here
So interesting
[otherdirective]
I dont care about this one anymore
Now I want to match the lines in between [directive] and [otherdirective] without matching [directive] or [otherdirective].
Also if [otherdirective] is not found all lines till the end of file should be returned. The [...] might contain any number or letter.
Attempt
I tried this using sed like this:
sed -r '/\[directive\]/,/\[[[:alnum:]]+\]/!d'
The only problem with this attempt is that the first line is [directive] and the last line is [otherdirective].
I know how to pipe this again to truncate the first and last line but is there a sed solution to this?

You can use the range, as you were trying, and inside it use a negated //. An empty regular expression reuses the last regular expression that was applied, so it skips both edge lines:
sed -n '/\[directive\]/,/\[otherdirective\]/ { //! p }' infile
It yields:
This lines I want
Very much text here
So interesting
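The question also asks that any [section] header should end the range, and that the range should run to the end of the file when no further header follows. A minimal sketch of that variant, using the same // trick; the function name section_body and the sample file are made up for illustration, and the header pattern assumes section names consist of letters and digits only:

```shell
# Hypothetical sample file mirroring the question's config.
conf=$(mktemp)
printf '%s\n' '[whatever]' 'Do I need this? no!' '[directive]' \
  'This lines I want' 'Very much text here' 'So interesting' \
  '[otherdirective]' 'I dont care about this one anymore' > "$conf"

# section_body NAME FILE: print the lines between [NAME] and the next
# [header] line (or the end of the file), excluding both header lines.
section_body() {
  sed -n "/^\[$1\]\$/,/^\[[[:alnum:]]*\]\$/ { //! p; }" "$2"
}

section_body directive "$conf"
```

Calling section_body otherdirective "$conf" prints everything after the last header, because the closing pattern is never found and the range stays open until the end of the file.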

Here is a nice way with awk to get a section of data.
awk -v RS= '/\[directive\]/' file
[directive]
This lines I want
Very much text here
So interesting
Setting RS to the empty string (RS=) makes awk split the file into records at blank lines (so this requires the sections to be separated by blank lines).
So when searching for [directive], it prints that whole record.
Normally a record is one line, but because RS (the record separator) is changed, you get the whole block.
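A small sketch of the paragraph-mode idea, assuming a file in which blank lines really do separate the sections. The function name section_block and the sample data are hypothetical; setting FS to a newline is an addition here so that the first field is the header line and can be compared exactly:

```shell
# Hypothetical sample in which blank lines separate the sections.
ini=$(mktemp)
printf '%s\n' '[whatever]' 'Do I need this? no!' '' \
  '[directive]' 'This lines I want' 'So interesting' '' \
  '[otherdirective]' 'I dont care about this one anymore' > "$ini"

# section_block NAME FILE: print the blank-line-delimited record whose
# first line is exactly [NAME]. FS is a newline, so $1 is the header line.
section_block() {
  awk -v RS= -v FS='\n' -v want="[$1]" '$1 == want' "$2"
}

section_block directive "$ini"
```

Comparing the header as a string avoids having to escape the square brackets in a regular expression.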

Okay, after more tries I found a solution, or at least one solution:
sed -rn '/\[buildout\]/,/\[[[:alnum:]]+\]/{
/\[[[:alnum:]]+\]/d
p }'

is this what you want?
\[directive\](.*?)\[

Related

Replace several lines by one using sed

I have an input like this:
This_is(A)
Goto(B,condition_1)
Goto(C,condition_2)
This_is(B)
Goto(A,condition_3)
This_is(C)
Goto(B,condition_1)
I want it to become like this
(A,B,condition_1)
(A,C,condition_2)
(B,A,condition_3)
(C,B,condition_1)
Does anyone know how to do this with sed?
Assuming you don't really need to do this with sed, this will work using any awk in any shell on every UNIX box:
$ awk -F'[()]' '/^[^[:space:]]/{s=$2; next} {sub(/[^[:space:]]*\(/,"("s",")} 1' file
(A,B,condition_1)
(A,C,condition_2)
(B,A,condition_3)
(C,B,condition_1)
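The command above tells the two kinds of line apart by leading whitespace. A variant sketch that keys on the literal This_is and Goto prefixes instead, so it works whether or not the Goto lines are indented; the function name goto_edges is made up for illustration, and the fixed prefixes are an assumption about the real data:

```shell
# goto_edges [FILE...]: emit "(source,target,condition)" for every Goto line,
# using the node named by the most recent This_is line as the source.
goto_edges() {
  awk -F'[()]' '
    /^This_is\(/          { src = $2; next }           # remember the current node
    /^[[:space:]]*Goto\(/ { print "(" src "," $2 ")" } # $2 holds "target,condition"
  ' "$@"
}

printf '%s\n' 'This_is(A)' 'Goto(B,condition_1)' 'Goto(C,condition_2)' \
  'This_is(B)' 'Goto(A,condition_3)' | goto_edges
```

Splitting on parentheses means $2 is the text inside the first pair of parentheses on either kind of line.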
This is a possible sed solution, where I have hardcoded a few bits, like This_is and Goto, because the OP did not clarify whether those strings change throughout the actual file:
sed '/^This_is/{:a;N;s/\(^This_is(\(.\)).*\)\(\n *\)Goto(\([^)]*)\)$/\1\3(\2,\4/;$!ta;s/[^\n]*\n//}' input_file
(Unfortunately, with all these parentheses, using -E does not shorten the command much.)
The code is slightly more readable if split on more lines:
sed '/^This_is/{
:a
N
s/\(^This_is(\(.\)).*\)\(\n *\)Goto(\([^)]*)\)$/\1\3(\2,\4/
$!ta
s/[^\n]*\n//
}' input_file
Here you can see that the code takes action only on the lines starting with This_is; when the program hits those lines, it does the following.
It uses the N command to append the next line to the pattern space (inserting a \n between the lines),
and it attempts a substitution with s/…/…/, which essentially picks the x in This_is(x) and puts it just after the last Goto( in the multiline pattern space,
and it keeps doing this as long as the substitution succeeds (ta branches to :a if s was successful) and the last line has not been read ($! matches all lines but the last).
In effect, this is a do-while loop, where :a marks the entry point to which control jumps back if the while-condition is true, and ta is the command that evaluates the condition.
When the loop terminates, the shorter s/…/…/ command removes the leading line from the multiline pattern space, which is the This_is line.
This might work for you (GNU sed):
sed -E '/^\S.*\(.*\)/{h;d};G;s/\S+\((.*\))\n.*(\(.*)\).*/\2,\1/;P;d' file
If a line starts with a non-whitespace character and contains parens, copy it to the hold space (HS) and then delete it.
Otherwise, append the HS, remove the non-whitespace characters up to the opening paren, insert the value between the parens from the stored line followed by a comma, print the first line and then delete the whole pattern space.
N.B. Lines that do not meet the substitution criteria will be unchanged.
An alternative solution using GNU parallel and sed:
parallel --pipe --recstart T -kqN1 sed -E '1{h;d};G;s/\S+\((.*)\n.*(\(.*)\).*/\2,\1/;P;d' <file

How can I remove a line only if it is followed by a line that starts with the same character?

I need some help with sed or awk.
How can I remove a line only if it is followed by a line that starts with the same character (in this case >)?
Example I have this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
I want to get this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
Note that not all the lines have the same numbers but they all have the same format, which is why I want to use regular expressions. If you could explain how to read the code you produce that would be really great.
Thank you so much!
If the whole file follows that pattern (some number of lines starting with >, of which you want only the last, followed by a single line that should always be printed), you could use something like this:
awk '/^>/ { latest=$0 } !/^>/ { if (latest) { print latest; latest="" } print }'
If the line starts with >, then it is remembered (stored in the variable latest) but not printed. If the line doesn't start with >, then it is printed, but only after first printing whatever was most recently stored in latest.
The conditional means each printed > line will appear only once, even if there are multiple non-> lines in a row. Since that doesn't happen in your sample data, you may not need the complication, and could use this simpler unconditional version:
awk '/^>/ { latest=$0 } !/^>/ { print latest; print }'
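Since the question asks how to read the code, here is the conditional version spread over several lines with comments, as a sketch. The function name keep_last_header and the shortened sample lines are made up for illustration:

```shell
# keep_last_header [FILE...]: within each run of ">" lines keep only the last.
keep_last_header() {
  awk '
    /^>/ { latest = $0; next }   # a header: remember it, print nothing yet
    {
      if (latest != "") {        # first data line after a run of headers
        print latest             # print the most recent header once
        latest = ""
      }
      print                      # always print the data line itself
    }
  ' "$@"
}

printf '%s\n' '>1_SRR1422294' 'ATCG' '>5_SRR1422298' '>5_SRR1422294' 'CGTC' \
  | keep_last_header
```

Each new header overwrites the previous one in latest, which is why only the last header of a run survives.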
A close result can be obtained with just the uniq command and its -w (--check-chars=N) option:
cat testfile | uniq -w 3
The output:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
-w, --check-chars=N
compare no more than N characters in lines
http://man7.org/linux/man-pages/man1/uniq.1.html
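It compares only the first N characters of each line when deciding whether adjacent lines are duplicates. Note that uniq keeps the first line of each run, so the output above retains >5_SRR1422298 and >6_SRR1422294 rather than the later headers the question asks for.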
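Because uniq keeps the first line of each run of duplicates, one way to keep the last line instead is to reverse the file before and after. This is a sketch that assumes GNU coreutils (tac, and uniq with -w); the function name keep_last_of_run is made up for illustration:

```shell
# keep_last_of_run [FILE]: like uniq -w 3, but keep the LAST line of each
# run of lines that share their first three characters.
keep_last_of_run() {
  tac "$@" | uniq -w 3 | tac
}

printf '%s\n' '>5_SRR1422298' '>5_SRR1422294' 'CGTC' \
  '>6_SRR1422294' '>6_SRR1422250' 'TGTT' | keep_last_of_run
```

As with the plain uniq call, this only drops a header when the next header shares its first three characters, which holds for the sample data but is narrower than "any line starting with >".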
Try this; if your data has the same layout as the given sample Input_file, the following may help:
awk '/^>/{A=$0;next} {print A ORS $0;A=""}' Input_file
This might work for you (GNU sed):
sed 'N;/^>.*\n>/!P;D' file
Read two lines into the pattern space and do not print the first of these lines if the first and second lines begin with >.
sed 'N;/^>.*\n\w/!D' file #(GNU sed)
N: read the next line into the pattern space. /^>.*\n\w/!D: unless the first line starts with ">" and the second line begins with a word character, delete the first line and restart the cycle with what remains.

Regex command line change format of each line

I have a file that contains lines in a format similar to this...
/data/file.geojson?10,20,30,40
/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-2.8482055664062496,54.38935426009769,-0.300750732421875,55.158473983815306
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
I've tried a combination of grep, sed, gawk, and |(pipes) to try and pattern match and then change the format to be more like this...
[10,40],[30,40],[30,20],[10,20],
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979].....
Hopefully you get the idea from the first line so I don't have to type out all the examples manually!
I've got the hang of regex to match the co-ordinates. In fact the input file is the result of extracting from apache access logs. It might be easier to read/understand answers if they just match positive integer numbers, I will then be able to slot in a more complicated pattern to match the right range.
To arrange the results the way you wish, it is important to be able to access the last four values on each line.
No pattern matching is required if you use awk. You can split the input strings by a set of delimiters and reassemble the resulting fields. 40 can be accessed as $(NF), 30 as $(NF-1) and so on.
awk -F'[?,=]' '
{printf "[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n",
$(NF-3),$(NF),$(NF-1),$(NF),
$(NF-1),$(NF-2),$(NF-3),$(NF-2)
}' file
I'm using ?, , or = as the field delimiters. This makes it simple to access the columns of interest.
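To see why the fields are addressed relative to NF, it can help to print how each kind of line splits. A small sketch with shortened, made-up coordinates; the function name show_fields is hypothetical:

```shell
# Show how each line splits on "?", "," and "=". The optional "bbox" field
# shifts the absolute positions, but the last four fields stay the same.
show_fields() {
  awk -F'[?,=]' '{ printf "NF=%d first=%s last=%s\n", NF, $(NF-3), $NF }' "$@"
}

printf '%s\n' '/data/file.geojson?10,20,30,40' \
  '/data/file.geojson?bbox=1,2,3,4' | show_fields
```

The first line splits into five fields and the second into six, yet $(NF-3) through $NF pick out the four numbers in both cases.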
Output:
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979],[3.0322265625,49.05227025601607],[-5.20751953125,49.05227025601607]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-2.8482055664062496,55.158473983815306],[-0.300750732421875,55.158473983815306],[-0.300750732421875,54.38935426009769],[-2.8482055664062496,54.38935426009769]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
By the way, sed can also be used here:
sed -r 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
The command captures each of the four numbers at the end in a separate capturing group and reassembles them in the replacement part.
Not all versions of sed support the -r option and the + quantifier. The most compatible version would look like this :)
sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
sed strips everything before the first number, then awk splits on commas and prints the fields in a different order, assuming the data is in a file called "td.txt":
sed 's/^[^0-9-]*//' td.txt|awk -F, '{print "["$1","$4"],["$3","$4"],["$3","$2"],["$1","$2"],"}'
This might work for you (GNU sed):
sed -r 's/^.*\?[^-0-9]*([^,]*),([^,]*),([^,]*),([^,]*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
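Or with more toothpicks (in basic regular expressions GNU sed treats \? as a quantifier, so the literal question mark is written as [?]):
sed 's/^.*[?][^-0-9]*\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file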
You can use the following to match:
(\/data\/file\.geojson\?(?:bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)
And replace with the following:
$1[$2,$3],[$4,$5]

How to print a greedy range of lines using awk

I've encountered the following problem and haven't found a solution nor why awk behaves in this strange way.
So let's say I have the following text in a file:
startcue
This shouldn't be found.
startcue
This is the text I want to find.
endcue
startcue
This shouldn't be found either.
And I want to find the lines "startcue", "This is the text I want to find.", and "endcue".
I naively assumed that a simple range search by awk '/startcue/,/endcue/' would do it, but this prints out the whole file. I guess awk somehow finds the first range, but as the third startcue switches the printing of lines on again, it prints all the lines until the end of the file (still, this all seems a bit strange to me).
Now to the question: How can I get awk to print out just the lines I want? And maybe as an extra question: Can anybody explain awk's behaviour?
Thanks
$ awk '/startcue/{f=1; buf=""} f{buf = buf $0 RS} /endcue/{printf "%s",buf; f=0}' file
startcue
This is the text I want to find.
endcue
Here is a simple way to do it.
Since data is separated by blank lines, I set RS to nothing.
This makes awk work with the data in blocks.
Then find all blocks starting with startcue and ending with endcue:
awk -v RS="" '/^startcue/ && /endcue$/' file
startcue
This is the text I want to find.
endcue
If startcue and endcue are always the first and last lines and each appears only once in the block, this should do (PS: testing shows that it does not matter if there are more or fewer hits in the block; this always prints the block if both startcue and endcue are found):
awk -v RS="" '/startcue/ && /endcue/' file
startcue
This is the text I want to find.
endcue
And this should work too:
awk -v RS="" '/startcue.*endcue/' file
startcue
This is the text I want to find.
endcue
To summarize the problem, you want to print the lines from startcue to endcue, but not if the endcue is missing. Ed Morton's approach is good. Here is yet another approach:
$ tac file | awk '/endcue/,/startcue/' | tac
startcue
This is the text I want to find.
endcue
How it works
tac file
This prints the lines in reverse order. tac is just like cat except that the lines come out in reverse order.
awk '/endcue/,/startcue/'
This prints all lines starting from endcue and finishing with startcue. When done this way, passages with missing endcues are not printed.
tac
This reverses the lines once again so that they are back in the correct order.
How awk ranges work
Consider:
awk '/startcue/,/endcue/' file
This tells awk to start printing when it finds startcue and continue printing until it finds endcue. This is exactly what it does on your file.
There is no implied rule that the range /startcue/,/endcue/ cannot itself contain multiple instances of startcue. awk simply starts printing when it sees the first occurrence of startcue and continues until it finds endcue. Once the range closes at endcue, the next startcue opens a new range, which then runs to the end of the file because no further endcue follows.
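The difference can be seen side by side in a short sketch. The sample file reproduces the question's text; the function names plain_range and closed_blocks are made up for illustration, and closed_blocks is a flag-and-buffer variant rather than anything built into awk:

```shell
# Hypothetical sample file with the question's contents.
cue=$(mktemp)
printf '%s\n' 'startcue' "This shouldn't be found." 'startcue' \
  'This is the text I want to find.' 'endcue' 'startcue' \
  "This shouldn't be found either." > "$cue"

# Plain range: opens at the first startcue, closes at endcue, then reopens
# at the third startcue and stays open until the end of the file.
plain_range() { awk '/startcue/,/endcue/' "$1"; }

# Flag variant: every startcue restarts the buffer, and the buffer is
# printed only once an endcue is actually seen.
closed_blocks() {
  awk '/startcue/ { inside = 1; buf = "" }
       inside     { buf = buf $0 "\n" }
       /endcue/   { if (inside) printf "%s", buf; inside = 0 }' "$1"
}

plain_range "$cue" | wc -l   # counts all 7 lines of the file
closed_blocks "$cue"         # prints only the 3 wanted lines
```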
No buffering needed:
{m,n,g}awk 'BEGIN { _ +=_ ^= ORS = FS = RS = "\nendcue\n"
sub("end", "?start", RS)
__= substr(RS, _+--_) } (NF=_<NF) && $!_=__$_'
startcue
This is the text I want to find.
endcue

How can I remove all characters in each line after the first space in a text file?

I have a large log file from which I need to extract file names.
The file looks like this:
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
.... about 10 million times
I need to extract the file names like this:
loremIpsumDolor.sit
anotherFile.ext
I figure my first strategy is to find/replace all /path/to/ with ''. But I'm stuck on how to remove all characters after the space.
Can you help?
sed 's/ .*//' file
It doesn't take any more than that. The transformed output appears on standard output, of course.
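The command above removes everything after the first space but leaves the directory part in place. A sketch that does both steps in one sed call, assuming the file names contain no spaces; the function name basenames is made up for illustration:

```shell
# basenames [FILE...]: keep the first space-delimited field, then strip
# everything up to and including its last slash.
basenames() {
  sed -e 's/ .*//' -e 's|.*/||' "$@"
}

printf '%s\n' \
  '/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here' \
  '/path/to/anotherFile.ext /more/text/here/differentText/here' | basenames
```

The second expression uses | as the delimiter so that the slash does not need escaping.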
In theory, you could also use awk to grab the filename from each line as:
awk '{ print $1 }' input_file.log
That, of course, assumes that there are no spaces in any of the filenames. awk defaults to looking for whitespace as the field delimiters, so the above snippet would take the first "field" from your log file (your filename) for each line, and output it.
Pass it to cut:
cut '-d ' -f1 yourfile
A bash-only solution (read -r keeps backslashes intact, and the quotes protect the name from word splitting):
while read -r path otherstuff; do
  echo "${path##*/}"
done < filename