awk concatenate strings till contain substring

awk concatenate strings till contain substring - regex

I have a awk script from this example:
awk '/START/{if (x) print x; x="";}{x=(!x)?$0:x","$0;}END{print x;}' file
Here's a sample file with lines:
$ cat file
START
1
2
3
4
5
end
6
7
START
1
2
3
end
5
6
7
So I need to stop concatenating when destination string would contain end word, so the desired output is:
START,1,2,3,4,5,end
START,1,2,3,end

Short Awk solution (though it will check for /end/ pattern twice):
awk '/START/,/end/{ printf "%s%s",$0,(/^end/? ORS:",") }' file
The output:
START,1,2,3,4,5,end
START,1,2,3,end
/START/,/end/ - range pattern
A range pattern is made of two patterns separated by a comma, in the
form ‘begpat, endpat’. It is used to match ranges of consecutive
input records. The first pattern, begpat, controls where the range
begins, while endpat controls where the pattern ends.
/^end/? ORS:"," - set delimiter for the current item within a range

here is another awk
$ awk '/START/{ORS=","} /end/ && ORS=RS; ORS!=RS' file
START,1,2,3,4,5,end
START,1,2,3,end
Note that /end/ && ORS=RS; is shortened form of /end/{ORS=RS; print}

You can use this awk:
awk '/START/{p=1; x=""} p{x = x (x=="" ? "" : ",") $0} /end/{if (x) print x; p=0}' file
START,1,2,3,4,5,end
START,1,2,3,end

Another way, similar to answers in How to select lines between two patterns?
$ awk '/START/{ORS=","; f=1} /end/{ORS=RS; print; f=0} f' ip.txt
START,1,2,3,4,5,end
START,1,2,3,end
this doesn't need a buffer, but doesn't check if START had a corresponding end
/START/{ORS=","; f=1} set ORS as , and set a flag (which controls what lines to print)
/end/{ORS=RS; print; f=0} set ORS to newline on ending condition. Print the line and clear the flag
f print input record as long as this flag is set

Since we seem to have gone down the rabbit hole with ways to do this, here's a fairly reasonable approach with GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='end' -v OFS=',' 'RT{$0=gensub(/.*(START)/,"\\1",1); $NF=$NF OFS RT; print}' file
START,1,2,3,4,5,end
START,1,2,3,end

Related

Replace a block of text

I have a file in this pattern:
Some text
---
## [Unreleased]
More text here
I need to replace the text between '---' and '## [Unreleased]' with something else in a shell script.
How can it be achieved using sed or awk?

Perl to the rescue!
perl -lne 'my #replacement = ("First line", "Second line");
if ($p = (/^---$/ .. /^## \[Unreleased\]/)) {
print $replacement[$p-1];
} else { print }'
The flip-flop operator .. tells you whether you're between the two strings, moreover, it returns the line number relative to the range.

This might work for you (GNU sed):
sed '/^---/,/^## \[Unreleased\]/c\something else' file
Change the lines between two regexp to the required string.

This example may help you.
$ cat f
Some text
---
## [Unreleased]
More text here
$ seq 1 5 >mydata.txt
$ cat mydata.txt
1
2
3
4
5
$ awk '/^---/{f=1; while(getline < c)print;close(c);next}/^## \[Unreleased\]/{f=0;next}!f' c="mydata.txt" f
Some text
1
2
3
4
5
More text here

awk -v RS="\0" 'gsub(/---\n\n## \[Unreleased\]\n/,"something")+1' file
give this line a try.

An awk solution that:
is portable (POSIX-compliant).
can deal with any number of lines between the start line and the end line of the block, and potentially with multiple blocks (although they'd all be replaced with the same text).
reads the file line by line (as opposed to reading the entire file at once).
awk -v new='something else' '
/^---$/ { f=1; next } # Block start: set flag, skip line
f && /^## \[Unreleased\]$/ { f=0; print new; next } # Block end: unset flag, print new txt
! f # Print line, if before or after block
' file

Using awk to find a domain name containing the longest repeated word

For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use linux awk regex expressions to find the line that contains the longest repeated1 word, so in this case, it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
1 Meaning "immediately repeated", i.e., abcabc, but not abcXabc.

A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I'ved added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number to avoid matches for lines numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
Alternative solution with more awk, less sort
As pointed out by tripleee in comments, this can be simplified to skip the sort step and combine the two awk steps and the sort step into a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command for the first solution, or no keeping track of multiple longest matches with awk in the second solution), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't do grep -f - to read from stdin. In this case, the output of the pipe until there has to be redirected to a temp file, and this temp file then used with grep -f.

A way with perl:
perl -F, -ane 'if (#m=$F[1]=~/(?=(.+)\1)/g) {
#m=sort { length $b <=> length $a} #m;
$cl=length #m[0];
if ($l<$cl) { #res=($_); $l=$cl; } elsif ($l==$cl) { push #res, ($_); }
}
END { print #res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field, then the match array is sorted and the longest substring becomes the first item in the array (#m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
details:
command line option:
-F, set the field separator to ,
-ane (e execute the following code, n read a line at a time and puts its content in $_, a autosplit, using the defined FS, and puts fields in the #F array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-know pattern to find all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position", but it doesn't match any character). To obtain the characters matched in the lookahead, all that you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care if the characters have been already captured in group 1 before).

get the last word in body of text

Given a body of text than can span a varying number of lines, I need to use a grep, sed or awk solution to search through many files for the same pattern and get the last word in the body.
A file can include formats such as these where the word I want can be named anything
call function1(input1,
input2, #comment
input3) #comment
returning randomname1,
randomname2,
success3
call function1(input1,
input2,
input3)
returning randomname3,
randomname2,
randomname3
call function1(input1,
input2,
input3)
returning anothername3,
randomname2, anothername3
I need to print out results as
success3
randomname3
anothername3
Also I need some the filename and line information about each .
I've tried
pcregrep -M 'function1.*(\s*.*){6}(\w+)$' filename.txt
which is too greedy and I still need to print out just the specific grouped value and not the whole pattern. The words function1 and returning in my sample code will always be named as this and can be hard coded within my expression.

Last word of code blocks
Split file in blocks using awk's record separator RS. A record will be defined as a block of text, records are separated by double newlines.
A record consists of fields, each two consecutive fields are separated by white space or a single newline.
Now all we have to do is print the last field for each record, resulting in following code:
awk 'BEGIN{ FS="[\n\t ]"; RS="\n\n"} { print $NF }' file
Explanation:
FS this is the field separator and is set to either a newline, a tab or a space: [\n\t ].
RS this is the record separator and is set to a doulbe newline: \n\n
print $NF this will print the field $ with index NF, which is a variable containing the number of fields. Hence this prints the last field.
Note: To capture all paragraphs the file should end in double newline, this can easily be achieved by pre processing the file using: $ echo -e '\n\n' >> file.
Alternate solution based on comments
A more elegant ans simple solution is as follows:
awk -v RS='' '{ print $NF }' file

How about the following awk solution:
awk 'NF == 0 {if(last) print last; last=""} NF > 0 {last=$NF} END {print last}' file
the $NF is getting the value of the last "word" where NF stands for number of fields. Then the last variable always stores the last word on a line and prints it if it encounters an empty line, representing the end of a paragraph.
New version with matches function1 condition.
awk 'NF == 0 {if(last && hasF) print last; last=hasF=""}
NF > 0 {last=$NF; if(/function1/)hasF=1}
END {if(hasF) print last}' filename.txt

This will produce the output you show from the input file you posted:
$ awk -v RS= '{print $NF}' file
success3
randomname3
anothername3
If you want to print FILENAME and line number like you mention then this may be what you want:
$ cat tst.awk
NF { nr=NR; last=$NF; next }
{ prt() }
END { prt() }
function prt() { if (nr) print FILENAME, nr, last; nr=0 }
$ awk -f tst.awk file
file 6 success3
file 13 randomname3
file 20 anothername3
If that doesn't do what you want, edit your question to provide clearer, more truly representative and accurate sample input and expected output.

This is the perl version of Shellfish's awk solution (plus the keywords):
perl -00 -nE '/function1/ and /returning/ and say ((split)[-1])' file
or, with one regex:
perl -00 -nE '/^(?=.*function1)(?=.*returning).*?(\S+)\s*$/s and say $1' file
But the key is the -00 option which reads the file a paragraph at a time.

Replace previous when match regular expression

I need to delete the "end of line" of the previous line when current line starts is not a number ^[!0-9], basically if match, append to the line before, I'm a sed & awk n00b, and really like them btw. thanks
edit:
$ cat file
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;tex
t
broke
4564;1;1;"also
";12,2121;546465
$ "script" file
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;text broke
4564;1;1;"also";12,2121;546465

You didn't post any sample input or expected output so this is a guess but it sounds like what you're asking for:
$ cat file
a
b
3
4
c
d
$ awk '{printf "%s%s",(NR>1 && /^[[:digit:]]/ ? ORS : ""),$0} END{print ""}' file
ab
3
4cd
On the OPs newly posted input:
$ awk '{printf "%s%s",(NR>1 && /^[[:digit:]]/ ? ORS : ""),$0} END{print ""}' file
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;textbroke
4564;1;1;"also";12,2121;546465

This might work for you (GNU sed):
sed -r ':a;$!N;s/\n([^0-9]|$)/\1/;ta;P;D' file
Keep two lines in the pattern space and if the start of the second line is empty or does not start with an integer, remove the newline.

if you have Ruby on your system
array = File.open("file").readlines
array.each_with_index do |val,ind|
array[ind-1].chomp! if not val[/^\d/] # just chomp off the previous item's \n
end
puts array.join
output
# ruby test.rb
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;textbroke
4564;1;1;"also";12,2121;546465

how to replace the next string after match (every) two blank lines?

is there a way to do this kind of substitution in Awk, sed, ...?
I have a text file with sections divived into two blank lines;
section1_name_x
dklfjsdklfjsldfjsl
section2_name_x
dlskfjsdklfjsldkjflkj
section_name_X
dfsdjfksdfsdf
I would to replace every "section_name_x" by "#section_name_x", this is, how to replace the next string after match (every) two blank lines?
Thanks,
Steve,

awk '
(NR==1 || blank==2) && $1 ~ /^section/ {sub(/section/, "#&")}
{
print
if (length)
blank = 0
else
blank ++
}
' file
#section1_name_x
dklfjsdklfjsldfjsl
#section2_name_x
dlskfjsdklfjsldkjflkj
#section_name_X
dfsdjfksdfsdf

hm....
Given your example data why not just
sed 's/^section[0-9]*_name.*/#/' file > newFile && mv newFile file
some seds support sed -i OR sed -i"" to overwrite the existing file, avoiding the && mv ... shown above.
The reg ex says, section must be at the beginning of the line, and can optionally contain a number or NO number at all.
IHTH

In gawk you can use the RT builtin variable:
gawk '{$1="#"$1; print $0 RT}' RS='\n\n' file
* Update *
Thanks to #EdMorton I realized that my first version was incorrect.
What happens:
Assigning to $1 causes the record to be rebuildt, which is not good in this cases since any sequence of white space is replaced by a single space between fields, and by the null string in the beginning and at the end of the record.
Using print adds an additional newline to the output.
The correct version:
gawk '{printf "%s", "#" $0 RT}' RS='\n\n\n' file

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

awk concatenate strings till contain substring - regex

here is another awk $ awk '/START/{ORS=","} /end/ && ORS=RS; ORS!=RS' file START,1,2,3,4,5,end START,1,2,3,end Note that /end/ && ORS=RS; is shortened form of /end/{ORS=RS; print}

You can use this awk: awk '/START/{p=1; x=""} p{x = x (x=="" ? "" : ",") $0} /end/{if (x) print x; p=0}' file START,1,2,3,4,5,end START,1,2,3,end

Since we seem to have gone down the rabbit hole with ways to do this, here's a fairly reasonable approach with GNU awk for multi-char RS, RT, and gensub(): $ awk -v RS='end' -v OFS=',' 'RT{$0=gensub(/.*(START)/,"\\1",1); $NF=$NF OFS RT; print}' file START,1,2,3,4,5,end START,1,2,3,end

Related

Replace a block of text

Using awk to find a domain name containing the longest repeated word

get the last word in body of text

Replace previous when match regular expression

how to replace the next string after match (every) two blank lines?

Categories

Resources