sed print from match until other match NOT inclusive - regex

I want to print all lines from a match up to a second match, not including that second match.
What I have so far does everything and does too much, in that it prints the second match as well.
Specifically, let's say I want to print everything starting on a line containing 'test', up to, but not including, the first line starting with a number or an open bracket '['.
This goes some way, but not all the way:
sed -n '/test/,/^[0-9]\|^\[/p' file

It is much easier to do this via awk:
awk '/test/{p=1} /^([0-9]|\[)/{p=0} p' file

Using awk:
awk 'p && /^[0-9]|^\[/ { exit }; /test/{ p = 1 } p' file
Example:
$ cat temp.txt
4
1
2
3
4
5
$ awk 'p && /4/ { exit }; /2|1/{ p = 1 } p' temp.txt
1
2
3
Notice how it skipped 4 when /2|1/ wasn't found yet.
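To try the exit-based one-liner against the pattern from the question, here is a small sketch. The sample lines and the function name print_until are made up for illustration; only the awk program comes from the answer above.

```shell
# print_until: hypothetical wrapper around the awk one-liner above.
# Prints from the first line containing "test" up to, but not including,
# the next line that starts with a digit or "[".
print_until() {
  awk 'p && /^[0-9]|^\[/ { exit }; /test/ { p = 1 } p' "$@"
}

# Made-up sample input: the leading "1 skipped" line is ignored because
# no "test" line has been seen yet.
printf '%s\n' '1 skipped' 'a test line' 'kept' '[stop here' 'after' |
  print_until
```

With this input only "a test line" and "kept" are printed; the "[stop here" line triggers the exit before it reaches the print rule.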

sed -n '/test/,/^[0-9[]/ {
/test/ {
h;b
}
x;p
$ {
x
/^[^0-9[]/ p
}
}' YourFile
This should work, but it is not elegant.

stop condition for emulating "grep -oE" with awk

I'm trying to emulate GNU grep -Eo with a standard awk call.
What the man page says about the -o option is:
-o --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
For now I have this code:
#!/bin/sh
regextract() {
[ "$#" -ge 2 ] || return 1
__regextract_ere=$1
shift
awk -v FS='^$' -v ERE="$__regextract_ere" '
{
while ( match($0,ERE) && RLENGTH > 0 ) {
print substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+1)
}
}
' "$@"
}
My question is: In the case that the matching part is 0-length, do I need to continue trying to match the rest of the line or should I move to the next line (like I already do)? I can't find a sample of input+regex that would need the former but I feel like it might exist. Any idea?
Here's a POSIX awk version, which works with a* (or any POSIX awk regex):
echo abcaaaca |
awk -v regex='a*' '
{
while (match($0, regex)) {
if (RLENGTH) print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
if ($0 == "") break
}
}'
Prints:
a
aaa
a
POSIX awk and grep -E use POSIX extended regular expressions, except that awk allows C escapes (like \t) but grep -E does not. If you wanted strict compatibility you'd have to deal with that.
If you can consider a GNU awk solution, then using RS and RT may give behavior identical to grep -Eo.
# input data
cat file
FOO:TEST3:11
BAR:TEST2:39
BAZ:TEST0:20
Using grep -Eo:
grep -Eo '[[:alnum:]]+' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
Using gnu-awk with RS and RT using same regex:
awk -v RS='[[:alnum:]]+' 'RT != "" {print RT}' file
FOO
TEST3
11
BAR
TEST2
39
BAZ
TEST0
20
More examples:
grep -Eo '\<[[:digit:]]+' file
11
39
20
awk -v RS='\\<[[:digit:]]+' 'RT != "" {print RT}' file
11
39
20
Thanks to the various comments and answers, I think I now have working, robust, and (maybe) efficient code, tested on AIX/Solaris/FreeBSD/macOS/Linux:
#!/bin/sh
regextract() {
[ "$#" -ge 1 ] || return 1
[ "$#" -eq 1 ] && set -- "$1" -
awk -v FS='^$' '
BEGIN {
ere = ARGV[1]
delete ARGV[1]
}
{
tail = $0
while ( tail != "" && match(tail,ere) ) {
if (RLENGTH) {
print substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH)
} else
tail = substr(tail,RSTART+1)
}
}
' "$@"
}
regextract "$@"
notes:
I pass the ERE string along with the file arguments so that awk doesn't pre-process it (thanks @anubhava for pointing that out); C-style escape sequences will still be translated by the regex engine of awk though (thanks @dan for pointing that out).
Because assigning $0 resets the values of all fields, I chose FS = '^$' to limit the overhead.
Copying $0 into a separate variable avoids the overhead induced by assigning $0 in the while loop (thanks @EdMorton for pointing that out).
a few examples:
# Multiple matches in a single line:
echo XfooXXbarXXX | regextract 'X*'
X
XX
XXX
# Passing the regex string to awk as a parameter versus a file argument:
echo '[a]' | regextract_as_awk_param '\[a]'
a
echo '[a]' | regextract '\[a]'
[a]
# The regex engine of awk translates C-style escape sequences:
printf '%s\n' '\t' | regextract '\t'
printf '%s\n' '\t' | regextract '\\t'
\t
Your code will malfunction for a regex that can match zero characters. Consider the following simple example. Let the content of file.txt be
1A2A3
then
grep -Eo 'A*' file.txt
gives output
A
A
The condition of your while loop is match($0,ERE) && RLENGTH > 0. In this case the former part is true but the latter is false, because the match found is the zero-length one before the first character (RSTART is set to 1), so the body of the while loop runs zero times.
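The difference can be seen side by side in a minimal sketch. The function names extract_strict and extract_all are hypothetical: the first uses the loop condition from the question, the second steps one character forward after an empty match, as the answers suggest.

```shell
# extract_strict: loop condition from the question; it gives up as soon
# as the leftmost match is empty.
extract_strict() {
  awk -v ere="$1" '{
    tail = $0
    while (match(tail, ere) && RLENGTH > 0) {
      print substr(tail, RSTART, RLENGTH)
      tail = substr(tail, RSTART + RLENGTH)
    }
  }'
}

# extract_all: on an empty match, skip one character and try again.
extract_all() {
  awk -v ere="$1" '{
    tail = $0
    while (tail != "" && match(tail, ere)) {
      if (RLENGTH > 0) {
        print substr(tail, RSTART, RLENGTH)
        tail = substr(tail, RSTART + RLENGTH)
      } else {
        tail = substr(tail, RSTART + 1)
      }
    }
  }'
}

printf '%s\n' '1A2A3' | extract_strict 'A*'   # prints nothing
printf '%s\n' '1A2A3' | extract_all 'A*'      # prints A on two lines
```

Note that passing the regex with -v means awk processes backslash escapes in it, which is the pre-processing issue mentioned in the notes above.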

How can I group unknown (but repeated) words to create an index?

I have to create a shellscript that indexes a book (text file) by taking any words that are encapsulated in angled brackets (<>) and making an index file out of that. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are encapsulated within angled brackets.
I found a similar question that was asked but required words inside of square brackets and tried to manipulate their code but am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of the angled brackets and now I am receiving an error saying:
line 5: .*: ambiguous redirect
This has been answered
I also now need to take my index and reformat it like so, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!
Could you please try the following? The output order is not guaranteed; if you need it sorted, pipe the output through sort.
awk '
BEGIN{
FS=":"
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
for(key in name){
print key": "name[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=":" ##Setting field separator as : here.
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1 ##Creating array named name with index $2; each $1 is appended to the value already stored for that index.
}
END{ ##Starting END block of this code here.
for(key in name){ ##Traversing through name array here.
print key": "name[key] ##Printing key colon and array name value with index key
}
}
' Input_file ##Mentioning Input_file name here.
If you want to extract multiple occurrences of substrings in between angle brackets with GNU grep, you may consider a PCRE regex based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
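Since -P is specific to GNU grep, a rough equivalent in plain awk may be useful where it is unavailable. This is only a sketch: the function name index_words and the sample text are made up, and the output format (line number, colon, word) is chosen to mirror what grep -on prints.

```shell
# index_words: hypothetical helper; prints "lineno:word" for every
# <word> found, handling several tags on one line.
index_words() {
  awk '{
    rest = $0
    while (match(rest, /<[^<>]+>/)) {
      # RSTART points at "<", so the word starts one character later
      # and is two characters shorter than the whole match.
      print NR ":" substr(rest, RSTART + 1, RLENGTH - 2)
      rest = substr(rest, RSTART + RLENGTH)
    }
  }' "$@"
}

# Made-up sample input.
printf '%s\n' 'The <sun> is <big>' 'no tags here' '<but> why' | index_words
```

For the sample input this prints 1:sun, 1:big and 3:but, one per line.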
Something like this might be what you need, it outputs the paragraph number, line number within the paragraph, and character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!
I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
.
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
for (lineNr=1; lineNr<=NF; lineNr++) {
line = $lineNr
idx = 1
while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
word = substr(line,idx+RSTART,RLENGTH-2)
locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
idx += (RSTART + RLENGTH - 1)
}
}
}
END {
for (word in locs) {
print word, locs[word]
}
}
.
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns
GNU datamash is a handy tool for working on groups of columnar data (Plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8
To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
Shorter version of the same suggested by RavinderSingh13:
awk -F: '
{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort

AWK script to check first line of a file and then print the rest

I am trying to write an AWK script to parse a file of the form
> field1 - field2 field3 ...
lineoftext
anotherlineoftext
anotherlineoftext
and I am checking using regex if the first line is correct (begins with a > and then has something after it) and then print all the other lines. This is the script I wrote but it only verifies that the file is in a correct format and then doesn't print anything.
#!/bin/bash
# FASTA parser
awk ' BEGIN { x = 0; }
{ if ($1 !~ />.*/ && x == 0)
{ print "Not a FASTA file"; exit; }
else { x = 1; next; }
print $0 }
END { print " - DONE - "; }'
Basically you can use the following awk command:
awk 'NR==1 && /^>./ {p=1} p' file
On the first row NR==1 it checks whether the line starts with a > followed by "something" (/^>./). If that condition is true the variable p will be set to one. The p at the end checks whether p evaluates true and prints the line in that case.
If you want to print the error message, you need to invert the logic a bit:
awk 'NR==1 && !/^>./ {print "Not a FASTA file"; exit 1} 1' file
In this case the program prints the error message and exits if the first line does not start with a > followed by at least one character. Otherwise all lines get printed because 1 always evaluates to true.
For what the OP literally asked (check only the first line):
awk 'NR==1{p=$0~/^>/}p' YourFile
# shorter version, thanks to @EdMorton
awk 'NR==1{p=/^>/}p' YourFile
To print from the first line that starts with > (inclusive):
awk '!p{p=$0~/^>/}p' YourFile
# shorter version, thanks to @EdMorton
awk '!p{p=/^>/}p' YourFile
Since all you care about is the first line, you can just check that, then exit.
awk 'NR > 1 { exit (0) }
! /^>/ { print "Not a FASTA file" >"/dev/stderr"; exit (1) }' file
As noted in comments, the >"/dev/stderr" is a nonportable hack which may not work for you. Regard it as a placeholder for something slightly more sophisticated if you want a tool which behaves as one would expect from a standard Unix tool (run silently if no problems; report problems to standard error).
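One commonly used portable substitute is to pipe the message to a shell command that writes to standard error. The sketch below assumes that idiom; the function name check_fasta is hypothetical, and the test for "something after the >" borrows the /^>./ pattern from the earlier answer.

```shell
# check_fasta: hypothetical helper. Exits 0 silently when the first line
# looks like a FASTA header, otherwise reports on stderr and exits 1.
check_fasta() {
  awk 'NR > 1 { exit 0 }
       !/^>./ { print "Not a FASTA file" | "cat 1>&2"; exit 1 }' "$@"
}

# Made-up sample input; produces no output and a zero exit status.
printf '%s\n' '> field1 - field2 field3' 'lineoftext' | check_fasta
```

As with the original, an empty input passes unchecked because no line is ever read.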

Matching blocks with conditions

I need some help from a regexp guru.
I am trying to make a small config system for a home project, but it seems that I need a bit more regexp code than my regexp skills can come up with.
I need to be able to extract some info inside blocks based on conditions and actions. For example:
action1 [condition1 condition2 !condition3] {
Line 1
Line 2
Line 3
}
The conditions are stored in simple variables separated by spaces. I use these variables to create the regexp used to extract the block info from the file. Most of this is working fine, except that I have no idea how to make the "not matching" part, which basically means that a "word" must not be present in the condition variable.
VAR1="condition1 condition2"
VAR2="condition1 condition2 condition3"
When matched against the above, it should match VAR1 but not VAR2.
This is what I have so far
PARAMS="con1 con2 con3"
INPUT_PARAMS="[^!]\\?\\<$(echo $PARAMS | sed 's/ /\\>\\|[^!]\\?\\</g')\\>"
sed -n "/^$ACTION[ \t]*\(\[\($INPUT_PARAMS\)*\]\)\?[ \t]*{/,/}$/p" default.cfg | sed '/^[^{]\+{/d' | sed '/}/d'
Not sure how pretty this is, but it does work, except for not-matching.
EDIT:
Okay I will try to elaborate a bit.
Let's say that I have the below text/config file
action1 [con1 con2 con3] {
Line A
Line B
}
action2 [con1 con2 !con3] {
Line C
}
action3 [con1 con2] {
Line D
}
action4 {
Line E
}
and I have the following conditions to match against
ARG1="con1 con2 con3"
ARG2="con1 con2"
ARG3="con1"
ARG4="con1 con4"
# Matching against ARG1 should print Line A, B, D and E
# Matching against ARG2 should print Line C, D and E
# Matching against ARG3 should print Line E
# Matching against ARG4 should print Line E
Below is a Java-like example of action2 using a normal conditional check. It gives a better idea of what I am trying to do.
if (ARG2.contains("con1") && ARG2.contains("con2") && !ARG2.contains("con3")) {
// Print all lines in this block
}
The logic of how you're selecting which records to print lines from isn't clear to me so here's how to create sets of positive and negative conditions using awk:
$ cat tst.awk
BEGIN{
RS = ""; FS = "\n"
# create the set of the positive conditions in the "conds" variable.
n = split(conds,tmp," ")
for (i=1; i<=n; i++)
wanted[tmp[i]]
}
{
# create sets of the positive and negative conditions
# present in the first line of the current record.
delete negPresent # use split("",negPresent) in non-gawk
delete posPresent
n = split($1,tmp,/[][ {]+/)
for (i=2; i<n; i++) {
cond = tmp[i]
sub(/^!/,"",cond) ? negPresent[cond] : posPresent[cond]
}
allPosInWanted = 1
for (cond in posPresent)
if ( !(cond in wanted) )
allPosInWanted = 0
someNegInWanted = 0
for (cond in negPresent)
if (cond in wanted)
someNegInWanted = 1
if (allPosInWanted && !someNegInWanted)
for (i=2;i<NF;i++)
print $i
}
.
$ awk -v conds='con1 con2 con3' -f tst.awk file
Line A
Line B
Line D
Line E
$
$ awk -v conds='con1 con2' -f tst.awk file
Line C
Line D
Line E
$
$ awk -v conds='con1' -f tst.awk file
Line E
$
$ awk -v conds='con1 con4' -f tst.awk file
Line E
$
and now you just have to code whatever logic you like in that final block where the printing is being done to compare the conditions in each of the sets.

Print last match of a sed regex

I have the following:
cat /tmp/cluster_concurrentnodedump.out.20140501.103855 | sed -n '/Starting inject/s/.*[Ii]nject \([0-9]*\).*/\1/p'
Which gives a list of
0
1
2
..
How can I print only the last match with this sed?
Thanks.
Store the substitution results in the hold buffer then print it at the end:
sed -ne '
/Starting inject/ {
# do the substitution
s/.*[Ii]nject \([0-9]*\).*/\1/
# instead of printing, copy the results to the hold buffer
h
}
$ { # at the end of the file:
# copy the hold buffer back to the pattern buffer
x
# print the pattern buffer
p
}
' /tmp/cluster_concurrentnodedump.out.20140501.103855
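A quick way to check the hold-buffer approach is to run it on a few made-up lines. In this sketch the function name last_inject and the sample log lines are assumptions; the sed script is the one above with the comments removed.

```shell
# last_inject: hypothetical wrapper around the hold-buffer script above.
last_inject() {
  sed -n '
    /Starting inject/ {
      s/.*[Ii]nject \([0-9]*\).*/\1/
      h
    }
    $ {
      x
      p
    }
  ' "$@"
}

# Made-up log: three matching lines with unrelated lines in between.
printf '%s\n' 'Starting inject 0' 'noise' 'Starting inject 1' \
  'Starting inject 2' 'done' | last_inject
```

For this input the only output is 2. One caveat: if no line matches at all, the hold buffer is empty and the script prints an empty line.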
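Use tac to print the file in reverse (last line first) and exit after the first match:
tac /tmp/cluster_concurrentnodedump.out.20140501.103855 | sed -n '/Starting inject/{s/.*[Ii]nject \([0-9]*\).*/\1/p;q;}'
The q must sit inside the braces so that sed quits only after a matching line has been handled; a bare ;q after the substitution would quit on the very first line read, whether or not it matched:
sed -n '/..../{s/..../p;q;}'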
Example
Print last number:
$ cat a
1
2
3
4
5
6
7
8
9
$ tac a | sed -n '/[0-9]/{s/\([0-9]\)/\1/p;q;}'
9