Regex substitute multiple lines for a single line - regex

I have a plain text file in which I need to substitute multiple consecutive lines of text with a single replacement line. For example, when I have a date and time, followed by a blank line, followed by a page number,
11/13/2018 08:33:00

Page 1 of 1
I'd like to replace it with a single line (e.g., PAGE BREAK).
I've tried
sed 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
and
perl -pe 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
but it leaves the text unchanged.

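Both sed and Perl process the input line by line by default, so a pattern containing \n\n can never match within a single line (sed also does not understand \d). You can tell Perl to load the whole file into memory with -0777, as long as the file is not too large: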
perl -0777 -pe 's=[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\n\nPage [0-9]+ of [0-9]+=PAGE BREAK=g'
Note that I used [0-9], because \d can match ٤, ໖, ६, or 𝟡.
I also used s=== instead of s/// so I don't have to backslash the slashes in the date part.

Another Perl variant
$ cat page_break.txt
123 45 jh kljl
11/13/2018 08:33:00

Page 1 of 1
ghjgjh hkjhj
fhfghfghfh
11/13/2018 08:33:00

Page 1 of 2
ghgigkjkj
$ perl -ne '{ if ( (/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}/ and $x++)or ( /^\s*$/ and $x++) or (/Page \d of \d/ and $x++) ){} if($x==0) { print "$_" } if($x==3) { print "PAGE BREAK\n"; $x=0} }' page_break.txt
123 45 jh kljl
PAGE BREAK
ghjgjh hkjhj
fhfghfghfh
PAGE BREAK
ghgigkjkj
$

Related

How to check last 3 chars of a string are alphabets or not using awk?

I want to check whether the last 3 characters in column 1 are letters and print those rows. What am I doing wrong?
My code:
awk -F '|' ' {print str=substr( $1 , length($1) - 2) } END{if ($str ~ /^[A-Za-z]/ ) print}' file
cat file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
.*/|982376
0NRT0|928731
expected output :
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
$ awk -F'|' '$1 ~ /[[:alpha:]]{3}$/' file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
Regarding what's wrong with your script:
You're doing the test for alphabetic characters in the END section for the final line read instead of once per input line.
You're trying to use shell variable syntax $str instead of awk str.
You're testing for literal character ranges in the bracket expression instead of using a character class so YMMV on which characters that includes depending on your locale.
You're testing for a string that starts with a letter instead of a string that ends with 3 letters.
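If you would rather keep the substr() idea from the question, the four points above can be applied to it directly. This is only a sketch: the data and result variables are illustrative, and the class is repeated three times rather than written with {3} because some older awk builds do not support interval expressions.

```shell
# Sample rows from the question.
data=$(printf '%s\n' '12300USD|0392' 'abc56eur|97834' '238aed|23911' 'aabccde|38731' '73716yen|19287' '.*/|982376' '0NRT0|928731')

result=$(printf '%s\n' "$data" | awk -F'|' '
{
    # Take the last three characters of column 1 on every input line.
    str = substr($1, length($1) - 2)
}
# Print the line only when str is exactly three letters (awk variable: no $).
str ~ /^[[:alpha:]][[:alpha:]][[:alpha:]]$/
')

printf '%s\n' "$result"
```

The test runs once per line instead of in END, and the anchors on both sides make sure all three characters are letters.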
Use grep:
grep -P '^[^|]*[A-Za-z]{3}[|]' in_file > out_file
Here, GNU grep uses the following option:
-P : Use Perl regexes.
The regex means this:
^ : Start of the line.
[^|]* : Any non-pipe character, repeated 0 or more times.
[A-Za-z]{3} : 3 letters.
[|] : Literal pipe.
sed -n '/^[^|]*[A-Za-z][A-Za-z][A-Za-z]|/p' file
grep '^[^|]*[A-Za-z][A-Za-z][A-Za-z]|' file
{m,g}awk '!+FS<NF' FS='^[^|]*[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '$!_!~"[|]"' FS='[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '($!_~"[|]")<NF' FS='[A-Za-z][A-Za-z][A-Za-z][|]' # to play it safe
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287

Search for Pattern in Text String, then Extract Matched Pattern

I am trying to match and then extract a pattern from a text string. I need to extract any pattern that matches the following in the text string:
10289 20244
Text File:
KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009
I am trying to achieve this using the following bash code:
Bash Code:
cat text_file | grep -Eow '\s10[0-9].*\s' | head -n 4 | awk '{print $1}'
The above code attempts to search for any group of approximately five numeric characters that begin with 10 followed by three numeric characters. After matching this pattern, the code prints out the rest of text string, capturing the second group of five numeric characters, beginning with 20.
I need a better, more reliable way to accomplish this because currently, this code fails. The numeric groups I need are separated by a space. I have attempted to account for this by inserting \s into the grep portion of the code.
grep solution:
grep -Eow '10[0-9]{3}\b.*\b20[0-9]{3}' text_file
The output:
10289 20244
[0-9]{3} - matches 3 digits
\b - word boundary
awk '{print $(NF-2),$(NF-1)}' text_file
10289 20244
This prints the third-from-last and second-from-last fields, so it relies on the two groups always sitting just before the final field.
awk '$17 ~ /^10[0-9]{3}$/ && $18 ~ /^20[0-9]{3}$/ { print $17, $18 }' text_file
This will check field 17 for "10xxx" and field 18 for "20xxx", and when BOTH match, print them.
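If the position of the groups can vary between reports, a possible variation is to scan every pair of adjacent fields instead of hard-coding 17 and 18. This is a sketch under that assumption; the metar and result variables are illustrative names.

```shell
# The sample report from the question.
metar='KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009'

result=$(printf '%s\n' "$metar" | awk '{
    # Look at each field together with its right-hand neighbour.
    for (i = 1; i < NF; i++)
        if ($i ~ /^10[0-9][0-9][0-9]$/ && $(i + 1) ~ /^20[0-9][0-9][0-9]$/)
            print $i, $(i + 1)
}')

echo "$result"
```

Fields such as 10SM are skipped because the whole field must be 10 followed by exactly three digits.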

Regex with perl one liner

I have the following:
XXUM_7_mauve_999119_ser_11.255255
UXUM_566_mauve_999119_ser_11.255255
IXUM_23_mauve_999119_ser_11.255255
and my attempt, which did not work, at a perl one liner to extract the first digit is as follows;
perl -pi -e "s/\S+_(\.+)_.+/Number$1/g" *.txt
I expected the following results:
Number 007
Number 566
Number 023
pls help
I'd use the -n option instead of the -p option and do the printing and formatting in the code:
perl -i~ -ne 'if (($num) = /[0-9]+/g) {
printf "Number %03d\n", $num;
} else {
print
}' *.txt
The problem is that this regex pattern /\S+_(\.+)_.+/ looks for a sequence of one or more literal dots . surrounded by underscores, so something like _..._ would match, but such a sequence doesn't exist in your file. I think you didn't mean to escape the dot. But even then, because the \S+ is greedy, it would find and capture the last field delimited by underscores, and so would capture ser from all three lines. Perhaps you meant to write \d+ instead of \.+, which is pretty much what I have written below.
This will do as you ask. It looks for the first occurrence of an underscore that is followed by a number of decimal digits, and uses printf to format the number as three digits.
You can add the -i qualifier, but I suggest you test it as it is first to save overwriting your data with erroneous results. Of course you could redirect the output to another file if you wished.
perl -ne'/_(\d+)/ and printf "Number %03d\n", $1' myfile
output
Number 007
Number 566
Number 023
cat > /tmp/test
XXUM_7_mauve_999119_ser_11.255255
UXUM_566_mauve_999119_ser_11.255255
IXUM_23_mauve_999119_ser_11.255255
perl -i -ne 'if ($_=~/^\w+\_(\d+)\_mauve/g) { printf "Number %03d\n", $1; }' /tmp/test
cat /tmp/test
Number 007
Number 566
Number 023

print lines between two patterns only when second pattern last column satisfy condition

File is like below:
#########################################
some text
some text
........
/pattern1/ some text here also in this line
some more text
some more text
/pattern2/ some text last_column/field
some text
some text
.........
/pattern1/
some text
.....
.....
/pattern2/ some text last_column/field
###########################################
NOTE:
Last_column/field is always a numeric value.
Both patterns (pattern1 and pattern2), with some lines between them, are guaranteed to be present.
Could anyone please help me out?
I need the following outputs:
I need to print all the lines between pattern1 and pattern2.
I need to print the lines between pattern1 and pattern2 only when the last column/field of the line matching pattern2 is greater than 10. I don't want to print the lines between these patterns if the condition is not satisfied, i.e. when the last column/field of the pattern2 line is less than 10.
awk, sed, grep: anything is fine.
The first is trivial:
sed -n '/pattern1/,/pattern2/p' input-file
For the second, I would do:
tac input-file |
awk '/pattern2/ && $NF > 10 { p=1} p; /pattern1/{p=0}' |
tac
If you do not have access to tac (which merely reverses lines of input), you could do:
awk '/pattern1/{p=1}
p{ b = sprintf( "%s%s\n", b, $0 )}
/pattern2/ { if( $NF > 10 && p ) printf "%s", b; b=""; p=0 }' input-file
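To see the buffering approach at work, here is a small self-contained run. The sample lines are invented for illustration, and the script is a lightly adjusted sketch of the one above: it also clears the buffer whenever pattern1 is seen, which is an addition of mine rather than part of the original answer.

```shell
# Two blocks: the first ends in 5 (should be dropped), the second in 42 (should be printed).
sample=$(printf '%s\n' 'header' \
    '/pattern1/ start A' 'a1' '/pattern2/ end A 5' \
    'junk' \
    '/pattern1/ start B' 'b1' '/pattern2/ end B 42')

result=$(printf '%s\n' "$sample" | awk '
/pattern1/ { p = 1; b = "" }            # start (or restart) buffering
p          { b = b $0 "\n" }            # collect lines while inside a block
/pattern2/ {
    if (p && $NF > 10) printf "%s", b   # flush only when the last field exceeds 10
    b = ""; p = 0
}')

printf '%s\n' "$result"
```

Only the second block is printed, including its pattern1 and pattern2 lines.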
You can do both with some fairly painless regular expressions with grep.
These examples will print to stdout
1: grep -Pzo '(?s)(?<=/pattern1/).*?(?=/pattern2/)' file
2: grep -Pzo '(?s)(?<=/pattern1/).*?(?=/pattern2/.*?[1-9][0-9]+)' file
Explanation
grep flags:
-P --perl-regexp (extended regex functionality)
-z treat the input as NUL-separated, so the whole file (newlines included) is read as one record
-o print only the matched part
Regular expression:
(?s) #PCRE_DOTALL (. matches any character)
(?<= #Positive look-behind (match this pattern, but don't include in the output)
/pattern1/
)
.*? #Find 0 or more of . (any character) in "non-greedy" mode
(?= #Positive look-ahead (match this pattern, but don't include in the output)
/pattern2/
.*? #Find 0 or more of . (any character) in "non-greedy" mode
[1-9][0-9]+ #Match a number with two or more digits, i.e. 10 or greater
# (one digit 1-9 followed by one or more digits 0-9)
)

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and I added p tag for printing match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n suppress the automatic printing of every line
-r (GNU sed; use -E on macOS/BSD) enables extended regexes, so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
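Tying this back to the myvalue=$( ... ) form from the question, a minimal sketch might look like the following. It uses -E instead of -r, on the assumption that your sed accepts it (current GNU and BSD/macOS versions do); the input variable is just a stand-in for the example file.

```shell
# Stand-in for the example input file from the question.
input=$(printf '%s\n' a b c abc12345xyz a b c)

# -n suppresses default output; the trailing p prints only lines
# where the substitution actually happened.
myvalue=$(printf '%s\n' "$input" | sed -nE 's/.*abc([0-9]+)xyz.*/\1/p')

echo "$myvalue"
```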
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl. The -n option instructs Perl to read in one line at a time from STDIN and execute the code; the -e option specifies the instruction to run.
The instruction runs a regexp on the line read and, if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also, e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
You can use gawk's match() with a third (array) argument to access the captured group; the array argument is a gawk extension:
$ gawk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, where matches[1] holds the text captured by ([0-9]+). Since match() returns the character position, or index, of where the matched substring begins (1, if it starts at the beginning of the string) and 0 when there is no match, a successful match triggers the print action.
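If gawk is not available, a possible fallback is the two-argument match() found in POSIX awk, which sets RSTART and RLENGTH. The sketch below assumes the literal abc and xyz markers from the question (3 characters each), which is where the offsets 3 and 6 come from; input and result are illustrative names.

```shell
# Stand-in for the example input file from the question.
input=$(printf '%s\n' a b c abc12345xyz a b c)

result=$(printf '%s\n' "$input" | awk '
match($0, /abc[0-9]+xyz/) {
    # Skip the 3-character "abc" prefix and drop the 3-character "xyz" suffix.
    print substr($0, RSTART + 3, RLENGTH - 6)
}')

echo "$result"
```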
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This matches the pattern [0-9]+ only when it occurs between abc and xyz, and prints just the digits.
perl has the cleanest syntax, but if you don't have perl (it is not always there, I understand), one way to use the components of a regex in gawk is the gensub function.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/, "\\1", 1); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire text matched by the regex (between the //), so you need to put the .* before and after abc([0-9]+)xyz to get rid of text before and after the number in the substitution. Keeping abc and xyz in the regex matters: with a bare .*([0-9]+).*, the greedy leading .* would swallow all but the last digit.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '^[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS swallow both ends of the line.
If $2, the leftover not swallowed by FS, consists only of digits, that's your answer to print out.
If you're extra cautious, confirm that $1 and $3 both have zero length.
** edited answer after realizing zero length $2 will trip up my previous solution
There's a standard piece of code from the awk channel called "FindAllMatches", but it's still very manual: literally just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
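For reference, the manual loop style described above looks roughly like this. It is a generic sketch of the while()/match()/substr() technique, not the actual "FindAllMatches" code, and it reuses the abc...xyz pattern from the question with illustrative variable names.

```shell
# Sample with one line that contains two matches.
input=$(printf '%s\n' a abc12345xyz 'abc23451xyz asdf abc34512xyz' c)

result=$(printf '%s\n' "$input" | awk '{
    s = $0
    # Keep matching against the not-yet-consumed remainder of the line.
    while (match(s, /abc[0-9]+xyz/)) {
        print substr(s, RSTART + 3, RLENGTH - 6)   # digits between abc and xyz
        s = substr(s, RSTART + RLENGTH)            # continue after this match
    }
}')

printf '%s\n' "$result"
```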
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 54 chars long :
# all digits, non-vowel letters, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(length(alnumstr)*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 57 bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
for (i = 2; i <= NF; i += 2) print $i
}'
If you also run OFS = ""; $1 = $1; afterwards, the fields of $0 follow a data1-sep1-data2-sep2-... pattern, without needing the 4-argument split() or patsplit() (both gawk-specific) to see what the regex separators were, while $0 itself looks EXACTLY the same as when you first read in the line: a plain print is byte-for-byte identical to printing immediately upon reading.
I once tested this to the extreme with a regex that represents valid UTF-8 characters. It took maybe 30 seconds or so for mawk2 to process a 167 MB text file with plenty of CJK Unicode all over, read into $0 all at once, and run this split logic, resulting in an NF of around 175,000,000, with each field being a single character, either ASCII or multi-byte UTF-8.
You can do it with the shell:
while read -r line
do
case "$line" in
*abc[0-9]*xyz* )
t="${line#*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"
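For awk, I would use the following script, which prints every line that matches the pattern (the whole line, not just the captured group):
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default, do nothing
}
or, as a one-liner:
gawk '/.*abc([0-9]+)xyz.*/' file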