How to repeat a block of lines using awk? - regex

I'm trying to repeat the block of lines below each OCCURS line the number of times indicated on that line. The lines to repeat have a larger level number at the start of the line.
I mean, with this input:
01 PATIENT-TREATMENTS.
05 PATIENT-NAME PIC X(30).
05 PATIENT-SS-NUMBER PIC 9(9).
05 NUMBER-OF-TREATMENTS PIC 99 COMP-3.
05 TREATMENT-HISTORY OCCURS 2.
10 TREATMENT-DATE OCCURS 3.
15 TREATMENT-DAY PIC 99.
15 TREATMENT-MONTH PIC 99.
15 TREATMENT-YEAR PIC 9(4).
10 TREATING-PHYSICIAN PIC X(30).
10 TREATMENT-CODE PIC 99.
05 HELLO PIC X(9).
05 STACK OCCURS 2.
10 OVERFLOW PIC X(99).
This would be the output:
01 PATIENT-TREATMENTS.
05 PATIENT-NAME PIC X(30).
05 PATIENT-SS-NUMBER PIC 9(9).
05 NUMBER-OF-TREATMENTS PIC 99 COMP-3.
05 TREATMENT-HISTORY OCCURS 2.
10 TREATMENT-DATE OCCURS 3.
15 TREATMENT-DAY PIC 99.
15 TREATMENT-MONTH PIC 99.
15 TREATMENT-YEAR PIC 9(4).
15 TREATMENT-DAY PIC 99.
15 TREATMENT-MONTH PIC 99.
15 TREATMENT-YEAR PIC 9(4).
15 TREATMENT-DAY PIC 99.
15 TREATMENT-MONTH PIC 99.
15 TREATMENT-YEAR PIC 9(4).
10 TREATING-PHYSICIAN PIC X(30).
10 TREATMENT-CODE PIC 99.
15 TREATMENT-DAY PIC 99.
15 TREATMENT-MONTH PIC 99.
15 TREATMENT-YEAR PIC 9(4).
15 TREATMENT-DAY PIC 99.
15 TREATMENT-MONTH PIC 99.
15 TREATMENT-YEAR PIC 9(4).
15 TREATMENT-DAY PIC 99.
15 TREATMENT-MONTH PIC 99.
15 TREATMENT-YEAR PIC 9(4).
10 TREATING-PHYSICIAN PIC X(30).
10 TREATMENT-CODE PIC 99.
05 HELLO PIC X(9).
05 STACK OCCURS 2.
10 OVERFLOW PIC X(99).
10 OVERFLOW PIC X(99).
I tried it this way:
tac input.txt |
awk '
BEGIN {
lbuff="";
n=0;
}{
if($0 ~ /^\s*$/) {next;}
if ($3 == "OCCURS") {
lev_oc=$1
len_oc=$4
lstart=0
for (x=1; x<n; x++) {
split(saved[x],saved_level," ")
if (saved_level[1] <= lev_oc) {
print saved[x]
lstart=x+1
}
}
for (i=1; i<=len_oc; i++) {
for (x=lstart; x<n; x++) {
print saved[x]
}
}
print $0
}else if ($0) {
saved[n]=$0
n++
}
}' | tac
But I don't get the result I'm trying to obtain. Is awk the best way to do it? Do you have any alternative?

I used perl for this because it's easy to make arbitrarily complex data structures:
#!/usr/bin/perl
use strict;
use warnings;
# read the file into an array of lines.
open my $f, '<', shift;
my @lines = <$f>;
close $f;
my @occurring;
my @occurs;
# iterate over the lines of the file
for (my $i = 0; $i < @lines; $i++) {
    # extract the "level", the first word of the line
    my $level = (split ' ', $lines[$i])[0];
    # if this line contains the OCCURS string,
    # push some info onto a stack.
    # This marks the start of something to be repeated
    if ($lines[$i] =~ /OCCURS (\d+)/) {
        push @occurring, [$1-1, $level, $i+1];
        next;
    }
    # if this line is at the same level as the level of the start of the
    # last seen item on the stack, mark the last line of the repeated text
    if (@occurring and $level eq $occurring[-1][1]) {
        push @occurs, [@{pop @occurring}, $i-1];
    }
}
# If there's anything open on the stack, it ends at the last line
while (@occurring) {
    push @occurs, [@{pop @occurring}, $#lines];
}
# handle all the lines to be repeated by appending them to the last
# line of the repetition
for (@occurs) {
    my $repeated = "";
    my ($count, undef, $start, $stop) = @$_;
    $repeated .= join "", @lines[$start..$stop] for (1..$count);
    $lines[$stop] .= $repeated;
}
print @lines;
For your reading pleasure, here's an awk translation.
BEGIN {
    s = 0
    f = 0
}
function stack2frame(lineno) {
    f++
    frame[f,"reps"] = stack[s,"reps"]
    frame[f,"start"] = stack[s,"start"]
    frame[f,"stop"] = lineno
    s--
}
{
    lines[NR] = $0
    level = $1
}
# if this line contains the OCCURS string, push some info onto a stack.
# This marks the start of something to be repeated
$(NF-1) == "OCCURS" {
    s++
    stack[s,"reps"] = $NF-1
    stack[s,"level"] = level
    stack[s,"start"] = NR+1
    next
}
# if this line is at the same level as the level of the start of the
# last seen item on the stack, mark the last line of the repeated text
level == stack[s,"level"] {
    stack2frame(NR-1)
}
END {
    # If there's anything open on the stack, it ends at the last line
    while (s) {
        stack2frame(NR)
    }
    # handle all the lines to be repeated by appending them to the last
    # line of the repetition
    for (i=1; i<=f; i++) {
        repeated = ""
        for (j=1; j <= frame[i,"reps"]; j++) {
            for (k = frame[i,"start"]; k <= frame[i,"stop"]; k++) {
                repeated = repeated ORS lines[k]
            }
        }
        lines[frame[i,"stop"]] = lines[frame[i,"stop"]] repeated
    }
    for (i=1; i <= NR; i++)
        print lines[i]
}

Here's a ruby solution:
#!/usr/bin/env ruby
# -*- coding: utf-8 -*-
stack = []
def unwind_frame(stack)
frame = stack.pop
_,occurs,data = *frame
with_each = stack==[] ? ->(l){ puts l} : ->(l){stack.last[2].push l}
occurs.times { data.each &with_each }
end
while gets
$_.chomp! "\n"
if m=$_.match(/OCCURS ([0-9]*)\.\s*$/)
puts $_
occurs=m[1].to_i
level = $_.to_i
stack.push([level,occurs,[]])
next
end
if stack==[]; puts $_; next; end
level = $_.to_i
if level > stack.last[0]
stack.last[2].push $_
next
end
while(stack!=[] && level <= stack.last[0])
unwind_frame(stack)
stack!=[] ? stack.last[2].push($_) : puts($_)
end
end
while(stack!=[])
unwind_frame(stack)
end
The result matches what you expected to get.
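For comparison, the same expansion can be sketched recursively in Python. This is only a sketch under assumptions read off the expected output in the question: the OCCURS line itself is printed once, the block is every following line with a larger level number, and nested OCCURS lines are not printed again when an outer block repeats. The names `expand` and `level_of` are made up for this example.

```python
import re

OCCURS = re.compile(r"\bOCCURS\s+(\d+)")


def level_of(line):
    """Return the leading level number of a line, e.g. 5 for '05 HELLO ...'."""
    return int(line.split()[0])


def expand(lines):
    """Repeat the lines nested under each 'OCCURS n' line n times in total.

    Assumption (from the question's expected output): OCCURS lines are
    emitted once and are left out of the repeated copies.
    """
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        out.append(line)
        i += 1
        match = OCCURS.search(line)
        if not match:
            continue
        # The block is every following line with a deeper (larger) level.
        j = i
        while j < len(lines) and level_of(lines[j]) > level_of(line):
            j += 1
        first_copy = expand(lines[i:j])  # nested OCCURS are expanded first
        later_copy = [l for l in first_copy if not OCCURS.search(l)]
        out.extend(first_copy)
        out.extend(later_copy * (int(match.group(1)) - 1))
        i = j
    return out
```

Calling `expand` on the non-blank lines of input.txt (with trailing newlines stripped) returns the expanded list of lines, which can then be joined with newlines and printed.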

Related

How to match variable length number range 0 to $n using perl regex?

I need to match a numeric range from 0 to a number $n where $n can be any random number from 1 - 40.
For example,
if $n = 16, I need to strictly match only the numeric range from 0-16.
I tried m/([0-9]|[1-3][0-9]|40)/ but that is matching all 0-40. Is there a way to use regex to match from 0 to $n ?
The code snippet is attached for context.
$n = getNumber(); #getNumber() returns a random number from 1 to 40.
$answer = getAnswer(); #getAnswer() returns a user input.
#Check whether user enters an integer between 0 and $n
if ($answer =~ m/regex/){
print("Answer is an integer within specified range!\n");
}
I know I can probably do something like
if($answer >= 0 && $answer <=$n)
But I am just wondering if there is a regex way of doing it?
I wouldn't pull out the following trick if there's another reasonable way to solve the problem. There is, for instance, Matching Numeric Ranges with a Regular Expression.
The (?(condition)yes|no) construct is a regex conditional, and one of the backtracking verbs, (*FAIL), always fails a subpattern.
The condition can itself be a code block, (?{...}):
my $pattern = qr/
    \b        # anchor somehow
    (\d++)    # non-backtracking and greedy
    (?(?{ $1 > 42 })(*FAIL))
/x;
my @numbers = map { int( rand(100) ) } 0 .. 10;
say "@numbers";
foreach my $n ( @numbers ) {
    next unless $n =~ $pattern;
    say "Matched $n";
}
Here's a run:
74 69 24 15 23 26 62 18 18 43 80
Matched 24
Matched 15
Matched 23
Matched 26
Matched 18
Matched 18
This is handy when the condition is complex.
I only think about this because it's an encouraged feature in Raku (and I have several examples in Learning Perl 6). Here's some Raku code in the same form, and the pattern syntax is significantly different:
#!raku
my $numbers = map { 100.rand.Int }, 0 .. 20;
say $numbers;
for @$numbers -> $n {
next unless $n ~~ / (<|w> \d+: <?{ $/ <= 42 }>) /;
say $n
}
The result is the same:
(67 43 31 41 89 14 52 71 48 64 5 21 6 31 44 27 39 94 78 15 39)
31
41
14
5
21
6
31
27
39
15
39
You can dynamically create the pattern. I've used a non-capture group (?:) here to keep the start and end of string anchors outside the list of |-ed numbers.
my $n = int rand 40;
my $answer = 42;
my $pattern = join '|', 0 .. $n;
if ($answer =~ m/^(?:$pattern)$/) {
print "Answer is an integer within specified range";
}
Please keep in mind that for your purpose this makes little sense: the numeric comparison you already have is simpler and clearer.
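The same dynamically built alternation can be sketched in Python. This is an illustration only: `range_pattern` and `in_range` are hypothetical names, and the sketch assumes the answer arrives as a string and that forms with leading zeros such as "016" should be rejected.

```python
import re


def range_pattern(n):
    """Compile a pattern whose alternatives are the integers 0..n."""
    alternatives = "|".join(str(i) for i in range(n + 1))
    return re.compile(rf"(?:{alternatives})")


def in_range(answer, n):
    """True if the whole string is one of the integers 0..n."""
    # fullmatch anchors both ends, so "1" cannot match as a prefix of "17".
    return range_pattern(n).fullmatch(answer) is not None
```

With n = 16, `in_range("16", 16)` is true while `in_range("17", 16)` and `in_range("016", 16)` are false. A plain `0 <= int(answer) <= n` remains the simpler check once the input is known to be numeric.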

searching one file line in other file and the output with word after line number

Two files, A and B, are shown below. I want to search for each line of file A in file B and write each matching entry to another file, keeping only the one word that follows the line number.
A:
5000cca025884d5
5000cca025a1ee6
B:
0. c0t5000CCA025A1EE6Cd0 <preSUN30G-A2B0-279.40GB>
/scsi_vhci/disk@g5000cca025a1ee6c
1. c0t5000CCA025A28FECd0 <preSUN30G-A2B0-279.40GB>
i/disk@g5000cca025a28fec`
2. c0t5000CCA0258BA1DCd0 <HsdfdsSUN30G-A2B0 cyl 46873 alt 2 hd 20 sec >
i/disk@g5000cca0258ba1dc
3. c0t5000CCA025884D5Cd0 <UN300G cyl 46873 alt 2 hd 20 sec 625> solaris
i/disk@g5000cca025884d5c`
4. c0t5000CCA02592705Cd0 <UN300G cyl 46873 alt 2 hd 20 sec 625> solaris
i/disk@g5000cca02592705c
awk 'FNR == 1 && NR != 1 { start=1 } start != 1 { if ($0 != "" ) { lnes[$0]="" } } start==1 { for ( i in lnes ) { if ( $0 ~ i ) { print $1 >> "Cfile" } } }' Afile Bfile
Here Afile is the file with the two lines and Bfile is the other file; awk processes both in a single invocation. While Afile is being read, each non-empty line is stored as a key of the array lnes. Once the second file begins (FNR == 1 && NR != 1 sets start), every line of Bfile is tested against each key in lnes as a regular expression. On a match, the first whitespace-delimited field of that line is appended to the file Cfile.
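Note that awk's ~ match is case-sensitive, so with the sample data the lowercase IDs from A match the lowercase path lines of B, not the numbered lines. If the goal is the device name that follows the line number, a case-insensitive variant can be sketched in Python. This is a sketch under that assumption; `matching_devices` is a hypothetical name, and the function works on lists of lines so that reading and writing the files is left to the caller.

```python
def matching_devices(ids, listing_lines):
    """Return the word after the line number for entries that contain an ID.

    ids           -- lines of file A (one ID per line)
    listing_lines -- lines of file B
    The comparison ignores case, which is an assumption about the intent.
    """
    wanted = [i.strip().lower() for i in ids if i.strip()]
    found = []
    for line in listing_lines:
        fields = line.split()
        # Numbered entries look like "3. c0t5000CCA025884D5Cd0 <...>".
        if len(fields) < 2 or not fields[0].endswith("."):
            continue
        if not fields[0][:-1].isdigit():
            continue
        device = fields[1]
        if any(w in device.lower() for w in wanted):
            found.append(device)
    return found
```

Writing the returned names to Cfile, one per line, gives the file asked for in the question.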

Regular Expression to get substrings in PowerShell

I need help with the regular expression. I have 1000's of lines in a file with the following format:
+ + [COMPILED]\SRC\FileCheck.cs - TotalLine: 99 RealLine: 27 Braces: 18 Comment: 49 Empty: 5
+ + [COMPILED]\SRC\FindstringinFile.cpp - TotalLine: 103 RealLine: 26 Braces: 22 Comment: 50 Empty: 5
+ + [COMPILED]\SRC\findingstring.js - TotalLine: 91 RealLine: 22 Braces: 14 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\restinpeace.h - TotalLine: 95 RealLine: 24 Braces: 16 Comment: 48 Empty: 7
+ + [COMPILED]\SRC\Getsomething.h++ - TotalLine: 168 RealLine: 62 Braces: 34 Comment: 51 Empty: 21
+ + [COMPILED]\SRC\MemDataStream.hh - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
+ + [CONTEXT]\SRC\MemDataStream.sql - TotalLine: 36 RealLine: 138 Braces: 80 Comment: 76 Empty: 59
I need a regular expression that can give me:
FilePath i.e. \SRC\FileMap.cpp
Extension i.e. .cpp
RealLine value i.e. 17
I'm using PowerShell to implement this and have been successful in getting the results back using the Get-Content (to read the file) and Select-String cmdlets.
The problem is that it's taking a long time to extract the various substrings and then write them to the XML file. (I have not included the code for generating the XML.)
I've never used regular expressions before, but I know using one would be an efficient way to get the strings.
Help would be appreciated.
The Select-String cmdlet accepts the regular expression to search for the string.
Current code is as follows:
function Get-SubString
{
Param ([string]$StringtoSearch, [string]$StartOfTheString, [string]$EndOfTheString)
If($StringtoSearch.IndexOf($StartOfTheString) -eq -1 )
{
return
}
[int]$StartOfIndex = $StringtoSearch.IndexOf($StartOfTheString) + $StartOfTheString.Length
[int]$EndOfIndex = $StringtoSearch.IndexOf($EndOfTheString , $StartOfIndex)
if( $StringtoSearch.IndexOf($StartOfTheString)-ne -1 -and $StringtoSearch.IndexOf($EndOfTheString) -eq -1 )
{
[string]$ExtractedString=$StringtoSearch.Substring($StartOfTheString.Length)
}
else
{
[string]$ExtractedString = $StringtoSearch.Substring($StartOfIndex, $EndOfIndex - $StartOfIndex)
}
Return $ExtractedString
}
function Get-FileExtension
{
Param ( [string]$Path)
[System.IO.Path]::GetExtension($Path)
}
#For each file extension we will be searching all lines starting with + +
$SearchIndividualLines = "+ + ["
$TotalLines = select-string -Pattern $SearchIndividualLines -Path $StandardOutputFilePath -allmatches -SimpleMatch
for($i = $TotalLines.GetLowerBound(0); $i -le $TotalLines.GetUpperBound(0); $i++)
{
$FileDetailsString = $TotalLines[$i]
#Get File Path
$StartStringForFilePath = "]"
$EndStringforFilePath = "- TotalLine"
$FilePathValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForFilePath -EndOfTheString $EndStringforFilePath
#Write-Host FilePathValue is $FilePathValue
#GetFileExtension
$FileExtensionValue = Get-FileExtension -Path $FilePathValue
#Write-Host FileExtensionValue is $FileExtensionValue
#GetRealLine
$StartStringForRealLine = "RealLine:"
$EndStringforRealLine = "Braces"
$RealLineValue = Get-SubString -StringtoSearch $FileDetailsString -StartOfTheString $StartStringForRealLine -EndOfTheString $EndStringforRealLine
if([string]::IsNullOrEmpty($RealLineValue))
{
continue
}
}
Assume you have those in C:\temp\sample.txt
Something like this?
PS> (get-content C:\temp\sample.txt) | % { if ($_ -match '.*COMPILED\](\\.*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
FilePath Extention RealLine
-------- --------- --------
\SRC\FileCheck .cs 27
\SRC\FindstringinFile .cpp 26
\SRC\findingstring .js 22
\SRC\restinpeace .h 24
\SRC\Getsomething .h 62
\SRC\MemDataStream .hh 131
Update:
Stuff inside parentheses is captured, so if you want to capture [COMPILED] as well, you just need to add that part to the regex:
Instead of
$_ -match '.*COMPILED\](\\.*)
use
$_ -match '.*(\[COMPILED\]\\.*)
The link in the comment to your question includes a good primer on the regex.
UPDATE 2
Now that you want to capture the full path, I am guessing your sample looks like this:
+ + [COMPILED]C:\project\Rom\Main\Plan\file1.file2.file3\Cmd\Camera.culture.less-Late-PP.min.js - TotalLine: 336 RealLine: 131 Braces: 82 Comment: 72 Empty: 51
The technique above will still work; you just need a very slight adjustment to the first parenthesis, like this:
$_ -match (\[COMPILED\].*)
This tells the regex to capture [COMPILED] and everything that comes after it, up to
(\.\w+)
i.e. the extension, which is a dot followed by one or more word characters (so it will not fully capture an extension containing other characters, such as .h++, which comes out as .h in the table above).
So, your original one liner would instead be:
(get-content C:\temp\sample.txt) | % { if ($_ -match '.(\[COMPILED\].*)(\.\w+)\s*.*RealLine:\s*(\d+).*') { [PSCustomObject]@{FilePath=$matches[1]; Extention=$Matches[2]; RealLine=$matches[3]} } }
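To show the three captures side by side outside PowerShell, here is a small Python sketch. It makes some assumptions that are not in the answer: it accepts any bracketed tag (so [CONTEXT] lines match too), it treats everything up to " - TotalLine:" as the path, and it takes the extension with the standard library's `ntpath.splitext` instead of `\w+`, so `.h++` survives intact. `parse_line` is a hypothetical name.

```python
import ntpath
import re

# Tag in brackets, then the path (lazy), then the counters.
LINE = re.compile(
    r"\[\w+\](?P<path>.+?)\s+-\s+TotalLine:.*?RealLine:\s*(?P<real>\d+)"
)


def parse_line(line):
    """Return (path, extension, real_line_count), or None if no match."""
    match = LINE.search(line)
    if match is None:
        return None
    path = match.group("path")
    # ntpath applies Windows path rules on any operating system.
    _, extension = ntpath.splitext(path)
    return path, extension, int(match.group("real"))
```

Compiling the pattern once and reusing it for every line keeps the per-line cost low, which matters for files with thousands of lines.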

Perl while loop with a regex condition, does not seem to re-enter the loop

I am trying to read a diff file and group the delete hunks (lines with a leading -<SPACE>) and the add hunks (leading +<SPACE>). I have the code below, where an inner while loop is meant to read all the consecutive deletes or adds. However, after the first match the loop doesn't seem to re-evaluate its condition: once a line matches the condition on line 32, control enters the loop and I read another line into $_, but control then transfers out of the loop to line 36 instead of re-entering the loop and checking the condition again. (I checked this in the debugger.)
27 while (<$inh>) {
28 next if /^$/;
29 next if /^[^-+]/; # ignore non diff lines
30 chomp;
31 my $tmps = '';
32 while (/^\-\ /) { # Read the entire block (all lines beginning with -<SPACE>
33 $tmps .= getprintables($_);
34 $_ = <$inh>; # Read next line
35 }
36 push(@prevs, $tmps) if $tmps ne '';
37 $tmps = '';
38 while (/^\+\ /) { # Read the entire block (all lines beginning with +<SPACE>
39 $tmps .= getprintables($_);
40 $_ = <$inh>;
41 }
42 push(@curs, $tmps) if $tmps ne '';
43 $tmps = '';
44 }
45 close($inh);
I also tried the longer form while ($_ =~ m/^\+\ /) but same results. I don't see what is wrong here. I have Perl v5.14.2
I would suggest only ever doing your reading from a file handle from a single while loop, and using state variables to perform additional logic.
If I'm reading your intent properly, you can redesign your approach to use the Range operator .. instead like so:
while (<$inh>) {
    next if /^$/;
    next if /^[^-+]/; # ignore non diff lines
    chomp;
    if ( my $range = /^- / .. !/^- / ) {
        push @prevs, '' if $range == 1;
        $prevs[-1] .= getprintables($_) if $range !~ /E/;
    }
    if ( my $range = /^\+ / .. !/^\+ / ) {
        push @curs, '' if $range == 1;
        $curs[-1] .= getprintables($_) if $range !~ /E/;
    }
}
close($inh);
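The idea of reading in one place and grouping consecutive lines by kind can also be sketched in Python with `itertools.groupby`. This is an illustration, not a translation of the Perl above: `group_hunks` and `kind` are hypothetical names, and the asker's `getprintables` helper is not reproduced, so the raw line text is concatenated instead.

```python
from itertools import groupby


def kind(line):
    """Classify a diff line as a delete ('-'), an add ('+') or neither."""
    if line.startswith("- "):
        return "-"
    if line.startswith("+ "):
        return "+"
    return None


def group_hunks(lines):
    """Return (prevs, curs): joined runs of '- ' lines and of '+ ' lines."""
    prevs, curs = [], []
    # Drop blank and non-diff lines first, as the Perl loop does with 'next'.
    diff_lines = (l.rstrip("\n") for l in lines if l.startswith(("-", "+")))
    for key, run in groupby(diff_lines, key=kind):
        text = "".join(run)
        if key == "-":
            prevs.append(text)
        elif key == "+":
            curs.append(text)
    return prevs, curs
```

Because the input is consumed by a single iterator, no inner loop has to read ahead, which is the same design point the answer makes about reading from the file handle in only one loop.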

awk if statement and pattern matching

I have the next input file:
##Names
##Something
FVEG_04063 1265 . AA ATTAT DP=19
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
My desired output file:
##Names
##Something
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
Explanation: I want the output file to contain all lines beginning with "#" and all lines whose column 1 value is unique. If column 1 has repeated hits, first take the number in $2 and add the length of $5 (from the same line): if the result is smaller than the $2 of the next line, print both lines; BUT if the result is bigger than the $2 of the next line, compare the DP values and print only the line with the best DP.
What I've tried:
awk '/^#/ {print $0;} arr[$1]++; END {for(i in arr){ if(arr[i]>1){ HERE I NEED TO INTRODUCE MORE 'IF' I THINK... } } { if(arr[i]==1){print $0;} } }' file.txt
I'm new to the awk world... I think it would be simpler to write a small multi-line script... or maybe a bash solution is better.
Thanks in advance
As requested, an awk solution. I have commented the code heavily, so hopefully the comments will serve as explanation. As a summary, the basic idea is to:
Match comment lines, print them, and go to the next line.
Match the first line (done by checking whether we have started remembering col1 yet).
On all subsequent lines, check values against the remembered values from the previous line. The "best" record, i.e. the one that should be printed for each unique ID, is remembered each time and updated depending on the conditions set forth by the question.
Finally, output the last "best" record of the last unique ID.
Code:
# Print lines starting with '#' and go to next line.
/^#/ { print $0; next; }
# Set up variables on the first line of input and go to next line.
! col1 { # If col1 is unset:
    col1 = $1;
    col2 = $2;
    len5 = length($5);
    dp = substr($6, 4) + 0; # Note dp is turned into int here by +0
    best = $0;
    next;
}
# For all other lines of input:
{
    # If col1 is the same as previous line:
    if ($1 == col1) {
        # Check col2
        if (len5 + col2 < $2) # Previous len5 + col2 < current $2
            print best; # Print previous record
        # Check DP
        else if (substr($6, 4) + 0 < dp) # Current dp < previous dp:
            next; # Go to next record, do not update variables.
    }
    else { # Different ids, print best line from previous id and update id.
        print best;
        col1 = $1;
    }
    # Update variables to current record.
    col2 = $2;
    len5 = length($5);
    dp = substr($6, 4) + 0;
    best = $0;
}
# Print the best record of the last id.
END { print best }
Note: dp is calculated by taking the sub-string of $6 starting at index 4 and going to the end. The + 0 is added to force the value to be converted to an integer, to ensure the comparison will work as expected.
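The same steps can be sketched in Python, which may be easier to adapt. It is a sketch under assumptions: fields are whitespace-separated as in the sample, the sixth field always has the form DP=<integer>, and "best DP" means the larger value. `filter_records` is a hypothetical name.

```python
def filter_records(lines):
    """Apply the steps above: keep '#' lines, resolve overlaps by DP."""
    out = []
    best = None  # candidate record for the current run of lines
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            out.append(line)  # comment lines pass straight through
            continue
        if not line.strip():
            continue
        fields = line.split()
        rec = {
            "id": fields[0],
            "pos": int(fields[1]),
            "len5": len(fields[4]),
            "dp": int(fields[5].split("=", 1)[1]),
            "line": line,
        }
        if best is None:
            best = rec
        elif rec["id"] != best["id"] or best["pos"] + best["len5"] < rec["pos"]:
            # New ID, or same ID without overlap: the candidate is final.
            out.append(best["line"])
            best = rec
        elif rec["dp"] >= best["dp"]:
            best = rec  # overlap: keep the record with the better DP
    if best is not None:
        out.append(best["line"])
    return out
```

Run on the sample input from the question, this returns the seven lines of the desired output in the same order.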
Perl solution. You might need to fix the border cases as you didn't provide data to test them.
@last remembers the last line, @F is the current line.
#!/usr/bin/perl
use warnings;
use strict;
my (@F, @last);
while (<>) {
    @F = split;
    print and next if /^#/ or not @last;
    if ($last[0] eq $F[0]) {
        if ($F[1] + length $F[4] > $last[1] + length $last[4]) {
            print "@last\n";
        } else {
            my $dp_l = $last[5];
            my $dp_f = $F[5];
            s/DP=// for $dp_l, $dp_f;
            if ($dp_l > $dp_f) {
                @F = @last;
            }
        }
    } else {
        print "@last\n" if @last;
    }
} continue {
    @last = @F;
}
print "@last\n";