sed regexp - extra unwanted line in matching output

I have this file
~/ % cat t
---
abc
def DEF
ghi GHI
---
123
456
and I would like to extract the content between the lines of three dashes, so I tried:
sed -En '{N; /^---\s{5}\w+/,/^---/p}' t
That is, three dashes followed by five whitespace characters (including the newline), then one or more word characters, with the range ending at another set of three dashes. This gives me the following output:
~/ % sed -En '{N; /^---\s{5}\w+/,/^---/p}' t
---
abc
def DEF
ghi GHI
---
123
I don't want the line with "123". Why am I getting that and how do I adjust my expression to get rid of it? [EDIT]: It is important that the four spaces of indentation after the first three dashes are matched in the expression.

This might work for you (GNU sed):
sed -En '/^---/{:a;N;/^ {4}\S/M!D;/\n---/!ba;p}' file
Turn on extended regexp (-E) and off implicit printing (-n).
If a line begins with --- and the following line is indented by 4 spaces, gather up the following lines until another line begins with ---, then print them.
If the following line does not meet that criterion, delete the first line and repeat.
All other lines pass through unprinted.
N.B. The M flag on the second regexp enables multiline matching: since the first line already begins with ---, the match can only come from an indented line after it.
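As for why the attempt in the question also printed "123": a small sketch (sample data rebuilt from the question, with the 4-space indent on "abc" assumed) suggests it comes from the unconditional N. N appends the next input line on every cycle, so sed works through the file in two-line chunks, and the range addresses are tested against whole chunks rather than single lines.

```shell
# Sample data modelled on the question's file (the indent on "abc" is an assumption).
# N joins each line with the next one; the s command makes the joined pair visible.
pairs=$(printf '%s\n' '---' '    abc' 'def DEF' 'ghi GHI' '---' '123' '456' |
  sed -n 'N;s/\n/ | /;p')
printf '%s\n' "$pairs"
```

The closing --- ends up in the same chunk as 123, so when the end address /^---/ matches that chunk, both lines are printed together.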

No need to use the pattern space here - a range pattern will do fine.
$ sed -n '/^---/,/^---/p' t
---
abc
def DEF
ghi GHI
---
Tested in GNU sed 4.7 and OSX sed.

I believe you can use
perl -0777 -ne '/^---\R(\s{4}\w.*?^---)/gsm && print "$1\n";' t
Details:
-0777 - slurps the file into a single variable
^---\R(\s{4}\w.*?^---) - start of a line (^), ---, a line break, then Group 1: four whitespaces, a word char, then zero or more chars as few as possible, and then --- at the start of a line
gsm - g requests global matching, s makes . match any character including line break characters, and m makes ^ match at the start of any line, not just at the start of the string. Note that in this scalar (&&) context the command prints only the first match.
&& print "$1\n" - if there is a match, print Group 1 value + a line break.

Related

Unable to match multiple digits in regex

I am simply trying to print 5 or 6 digit number present in each line.
cat file.txt
Random_something xyz ...64763
Random2 Some String abc-778986
Something something 676347
Random string without numbers
cat file.txt | sed 's/^.*\([0-9]\{5,6\}\+\).*$/\1/'
Current Output
64763
78986
76347
Random string without numbers
Expected Output
64763
778986
676347
The regex doesn't seem to work as intended with 6-digit numbers: it skips the first digit of each 6-digit number. It also prints the last line, which I don't need because it doesn't contain any 5- or 6-digit number.
grep is a better fit for this, with its -o option that prints only the matched string:
grep -Eo '[0-9]{5,6}' file
64763
778986
676347
-E is for enabling extended regex mode.
If you really want a sed, this should work:
sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p' file
64763
778986
676347
Details:
-n: Suppress normal output
(^|.*[^0-9]): Match either the start of the line or any text that ends with a non-digit
([0-9]{5,6}): Match 5 or 6 digits in capture group #2
.*: Match the remaining text
\2: The replacement; a back-reference that substitutes the digits captured in group #2
/p: Print the substituted text
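To see why the leading group matters, here is a minimal sketch comparing a simplified form of the question's expression (without the trailing \+) against the version above. The sample line is taken from the question; the variable names are only for illustration.

```shell
# One sample line from the question's file.
line='Random2 Some String abc-778986'

# Simplified form of the question's attempt: the greedy .* takes as much as it can,
# including the first digit, and leaves only five digits for the group.
greedy=$(printf '%s\n' "$line" | sed 's/^.*\([0-9]\{5,6\}\).*$/\1/')

# The version above: the prefix must end with a non-digit, so all six digits are captured.
anchored=$(printf '%s\n' "$line" | sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p')

printf 'greedy:   %s\nanchored: %s\n' "$greedy" "$anchored"
```

The first command yields 78986 and the second 778986. Using -n with the p flag is also what keeps lines without a match out of the output.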
With awk, you could try the following. It uses awk's match function with a regex for 5 to 6 digits on each line; if a match is found, it prints the matched part.
awk 'match($0,/[0-9]{5,6}/){print substr($0,RSTART,RLENGTH)}' Input_file

Regex, select the line that starts with my condition, but take only the characters after space

I have a file that has content similar to the following:
ptrn: 435324kjlkj34523453
Note1: rtewqtiojdfgkasdktewitogaidfks
Note2: t4rwe3tewrkterqwotkjrekqtrtlltre
I am trying to get the characters after the space on the line that starts with "ptrn:". I am trying the command below:
>>> cat daily.txt | grep '^p.*$' > dailynew.txt
and I am getting the result in the new file:
ptrn: 435324kjlkj34523453
But I want only the characters after space, which are " 435324kjlkj34523453" to be written in the new file without "ptrn:" at the beginning.
The result should be like:
435324kjlkj34523453
How can I achieve this with an efficient regex?
You can use
grep -oP '^ptrn:\s*\K.*' daily.txt > dailynew.txt
awk '/^ptrn:/{print $2}' daily.txt > dailynew.txt
sed -n 's/^ptrn:[[:space:]]*\(.*\)/\1/p' daily.txt > dailynew.txt
All three commands output 435324kjlkj34523453.
In the grep PCRE regex (enabled with the -P option), the patterns match:
^ - the start of the string
ptrn: - a ptrn: substring
\s* - zero or more whitespaces
\K - match reset operator that discards the text matched so far from the reported match
.* - the rest of the line.
In the awk command, ^ptrn: regex is used to find the line starting with ptrn: and then {print $2} prints the value after the first whitespace, from the second "column" (since the default field separator in awk is whitespace).
In sed, the command means
-n - suppresses the default line output
s - substitution command is used
^ptrn:[[:space:]]*\(.*\) - start of string, ptrn:, zero or more whitespace, and the rest of the line captured into Group 1
\1 - replaces the match with group 1 value
p - prints the result of the substitution.
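One difference worth knowing between the awk and sed variants: awk's $2 is only the second whitespace-separated field, while the sed group keeps the rest of the line. A small sketch with a made-up input line (the value containing a space is an assumption, not from the question):

```shell
# Hypothetical input: the value after "ptrn:" contains a space.
line='ptrn: 4353 24kj'

# awk prints only the second field; sed prints everything after "ptrn:" and the whitespace.
with_awk=$(printf '%s\n' "$line" | awk '/^ptrn:/{print $2}')
with_sed=$(printf '%s\n' "$line" | sed -n 's/^ptrn:[[:space:]]*\(.*\)/\1/p')

printf 'awk: %s\nsed: %s\n' "$with_awk" "$with_sed"
```

For the question's data the two agree, since the value has no embedded whitespace.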
You can use this sed:
sed -nE 's/^ptrn: (.*)/\1/p' file > output_file.txt

Delete all lines which don't match a pattern

I am looking for a way to delete all lines that do not follow a specific pattern (from a txt file).
Pattern which I need to keep the lines for:
x//x/x/x/5/x/
x could be any amount of characters, numbers or special characters.
5 is always a combination of alphanumeric - 5 characters - e.g Xf1Lh, always appears after the 5th forward slash.
/ are actual forward slashes.
Input:
abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
Desired output:
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
grep selects lines according to a regular expression and your x//x/x/x/5/x/ just needs minor changes to make it into a regular expression:
$ grep -E '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' file
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Explanation:
"x could be any amount of characters, numbers or special characters". In a regular expression that is .* where . means any character and * means zero or more of the preceding character (which in this case is .).
"5 is always a combination of alphanumeric - 5 characters". In POSIX regular expressions, [[:alnum:]] means any alphanumeric character. {5} means five of the preceding. [[:alnum:]] is unicode-safe.
Possible improvements
One issue is how x should be interpreted. In the above, x was allowed to be any character. As triplee points out, however, another reasonable interpretation is that x should be any character except /. In that case:
grep -E '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
Also, we might want this regex to match only complete lines. In that case, we can either surround the regex with ^ and $ or we can use grep's -x option:
grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
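A quick sketch of how the two interpretations differ. The first two sample lines come from the question; the third is a made-up line with an extra slash before the //, added here only to show the difference.

```shell
# Two lines from the question plus one hypothetical line ("a/b//...").
sample=$(printf '%s\n' \
  'abc//a/123/gds:/4AdFg/f3dsg34/' \
  '45wuhrt//x/x/dsfhsdfs54uhb/' \
  'a/b//c/d/e/abcde/f/')

# Count matching lines: x as "any character" versus x as "anything but /" on whole lines.
loose=$(printf '%s\n' "$sample" | grep -cE '.*//.*/.*/.*/[[:alnum:]]{5}/.*/')
strict=$(printf '%s\n' "$sample" | grep -cxE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/')

printf 'loose: %s, strict: %s\n' "$loose" "$strict"
```

The loose pattern accepts the made-up third line because .* can absorb the extra slash; the strict, whole-line pattern rejects it.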
I was figuring out how to do it in awk at the same time as the other answer and came up with:
awk -F/ '$2=="" && $6~/^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$/ && NF==8' file
The awk I worked it out on didn't support the {5} regexp frob.
You can use the -P option for Perl-compatible regex support, like
grep -P "^(?:[^/]*/){5}[A-Za-z0-9]{5}(?:/|$)" input
Output
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Regex Breakdown
^ #Start of line
(?: #Non capturing group
[^/]* #Match anything except /
/ #Match / literally
){5} #Repeat this 5 times
[A-Za-z0-9]{5} #Match exactly 5 alphanumerics. You can use \w if you want to allow _ along with [A-Za-z0-9]
(?: #Non capturing group
/ #Next character should be /
| #OR
$ #End of line
)
Using sed and in place edit to delete all lines that do not follow a specific pattern (from a txt file):
$ sed -i.bak -n "/.*\/\/.*\/.*\/.*\/[a-zA-Z0-9]\{5\}\/.*\//p" test.in
$ cat test.in
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
-i.bak edits the file in place and creates a test.in.bak backup file; -n suppresses automatic printing, so non-matching lines are not output;
and the p in ".../p" prints the matching lines.
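If the escaped slashes are hard to read, sed also lets you pick a different delimiter for an address by writing \ followed by the delimiter character. A minimal sketch of the same in-place edit, run in a scratch directory (the mktemp directory and the two-line sample are assumptions for illustration):

```shell
# Work in a scratch directory; the file name test.in mirrors the command above.
tmp=$(mktemp -d)
printf '%s\n' 'abc//a/123/gds:/4AdFg/f3dsg34/' '45wuhrt//x/x/dsfhsdfs54uhb/' > "$tmp/test.in"

# \#...# selects "#" as the regex delimiter, so "/" needs no backslash.
sed -i.bak -n '\#.*//.*/.*/.*/[a-zA-Z0-9]\{5\}/.*/#p' "$tmp/test.in"

# Capture the results before cleaning up the scratch directory.
kept=$(cat "$tmp/test.in")
backup_count=$(grep -c '' "$tmp/test.in.bak")
rm -rf "$tmp"

printf 'kept: %s\nlines in backup: %s\n' "$kept" "$backup_count"
```

The edited file keeps only the matching line, while the .bak copy still holds both original lines.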

Filter words starting and ending with hyphen but not when it's found in the middle

I have a list of words I want to filter: only those that start or end with a hyphen, but not those with a hyphen in the middle. That is, I want to keep entries like "a-" or "-cefalia" but not "castellano-manchego".
I have tried many options, and the closest I've found is grep -E '*\-' minilemario.txt; however, it matches every line containing a hyphen. Could you please provide me with a solution?
a
a-
aarónico
aaronita
amuzgo
an-
-án
ana
-ana
ana-
anabaptismo
anabaptista
blablá
bla-bla-bla
blanca
castellano
castellanohablante
castellano-leonés
castellano-manchego
castellanoparlante
cedulario
cedulón
-céfala
cefalalgia
cefalálgico
cefalea
-cefalia
cefálica
cefálico
cefalitis
céfalo
-céfalo
cefalópodo
cefalorraquídeo
cefalotórax
cefea
ciabogar
cian
cian-
cianato
cianea
cianhídrico
cianí
ciánico
cianita
ciano-
cianógeno
cianosis
cianótico
cianuro
ciar
ciática
ciático
zoo
zoo-
zoófago
Using grep, say:
grep -E '^-|-$' filename
to get the words starting or ending with -. And
grep -v -E '^-|-$' filename
to exclude the words starting or ending with -.
^ and $ are anchors denoting the start and end of line respectively. You used '*\-' which would match anything followed by - (it doesn't say that - is at the end of the line).
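A short sketch of both commands on a handful of ASCII entries in the spirit of the question's list (the shortened word list is an assumption, chosen to keep the example locale-independent):

```shell
# A few sample entries: some with edge hyphens, some with hyphens only in the middle.
words=$(printf '%s\n' 'a' 'a-' '-ana' 'bla-bla-bla' 'castellano-manchego' 'zoo-')

# Keep lines that start with "-" or end with "-".
edge=$(printf '%s\n' "$words" | grep -E '^-|-$')

# Invert the match to keep everything else.
rest=$(printf '%s\n' "$words" | grep -vE '^-|-$')

printf 'edge hyphen:\n%s\n\nother:\n%s\n' "$edge" "$rest"
```

Words such as bla-bla-bla land in the second group because neither anchor can match a hyphen in the middle of the line.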
Here is a bash only solution. Please see the comments for details:
#!/usr/bin/env bash
# Assign the first argument (e.g. a textfile) to a variable
input="$1"
# Bash 4 - read the data line by line into an array
readarray -t data < "$input"
# Bash 3 - read the data line by line into an array
#while IFS= read -r line; do
#  data+=("$line")
#done < "$input"
# For each item in the array do something
for item in "${data[@]}"; do
  # Line starts with "-" or ends with "-"
  [[ "$item" =~ ^-|-$ ]] && echo "$item"
done
This will produce the following output:
$ ./script input.txt
a-
an-
-án
-ana
ana-
-céfala
-cefalia
-céfalo
cian-
ciano-
zoo-

How to find/extract a pattern from a file?

Here are the contents of my text file named 'temp.txt'
---start of file ---
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
---end of file ----
I want to write a bash script in which I need to capture the string 'b687' in a variable. This is really a pattern (the letter 'b' followed by 'n' digits). I can do it the hard way by looping through the file and extracting the desired string (b687 in the example above). Is there an easy way to do so? Perhaps by using awk or sed?
Try using grep
v=$(grep -oE '\bb[0-9]{3}\b' file)
This will search for a word consisting of b followed by exactly 3 digits.
Using sed
v=$(sed -nr 's/.*\b(b[0-9]{3})\b.*/\1/p' file)
varname=$(awk '/HEROKU_POSTGRESQL_AQUA_URL/{print $4}' filename)
This reads the file and, on the line matching the pattern HEROKU_POSTGRESQL_AQUA_URL, prints the 4th whitespace-separated token, which in this case is b687.
Your other option is to use sed:
varname=$(sed -n 's/.* \(b[0-9][0-9]*\)/\1/p' filename)
In this case we are looking for the pattern you mentioned (b followed by digits) and printing only that. The -n option tells sed not to print lines automatically, so lines without the pattern produce no output. The rest of the sed command is a substitution: .* matches any string at the beginning, up to a space, followed by \(...\), which forms a group containing the regex that matches your b#####. The replacement \1 outputs only group 1 out of the whole match, and the p at the end tells sed to print the result (since -n turned off default printing).
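Put together, a minimal sketch of capturing the value in a variable. The file contents are recreated inline from the question rather than read from temp.txt, which is an assumption made to keep the example self-contained.

```shell
# Stand-in for temp.txt, recreated from the question.
contents='---start of file ---
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
---end of file ----'

# Same sed expression as above: keep only "b" plus digits that follow a space.
varname=$(printf '%s\n' "$contents" | sed -n 's/.* \(b[0-9][0-9]*\)/\1/p')

printf 'captured: %s\n' "$varname"
```

Because the expression uses [0-9][0-9]* rather than a fixed count, it accepts any number of digits after the b, which fits the "'n' digits" wording in the question.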