Extract string between combination of words and characters

I would like to keep the strings between FROM and as, and between FROM and the end of the line.
Input:
FROM some_registry as registry1
FROM another_registry
Output:
some_registry
another_registry
Using the following sed command, I can extract the strings. Is there a way to combine the two sed commands?
sed -e 's/.*FROM \(.*\) as.*/\1/' | sed s/"FROM "//

Merging the two into a single regex is hard here because POSIX regular expressions do not support lazy quantifiers.
With GNU sed, you can run both substitutions in one invocation:
sed 's/.*FROM \(.*\) as.*/\1/;s/FROM //' file
See this online demo.
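If the value after FROM never contains a space (an assumption that holds for the sample input but is not stated in the question), a single substitution is enough, because a negated bracket expression stops at the first space and no lazy quantifier is needed. A minimal sketch; the function name extract_image is made up for illustration:

```shell
# Hypothetical helper: print the token that follows "FROM " on each line.
# Assumes the token contains no spaces, so [^ ]* cannot run past " as".
extract_image() {
    sed 's/.*FROM \([^ ]*\).*/\1/' "$@"
}

printf 'FROM some_registry as registry1\nFROM another_registry\n' | extract_image
# prints:
# some_registry
# another_registry
```

As with the original commands, lines that do not contain FROM pass through unchanged.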
However, if you have GNU grep, you can use a slightly more precise expression:
#!/bin/bash
s='FROM some_registry as registry1
From another_registry'
grep -oP '(?i)\bFROM\s+\K.*?(?=\s+as\b|$)' <<< "$s"
See the online demo. Details:
(?i) - case insensitive matching ON
\b - a word boundary
FROM - a word
\s+ - one or more whitespaces
\K - "forget" all text matched so far
.*? - zero or more chars other than line break chars, as few as possible
(?=\s+as\b|$) - a positive lookahead that matches a location immediately followed by one or more whitespaces and then the whole word as, or by the end of the line.

Related

How to match strings with at most n free characters between two well-defined patterns?

Let's say I have the text from a bunch of articles. I want to be able to grep for patterns related to COVID-19. How would I search for such a thing, considering that people call it Cov2, CoV-2, COVID-2, COVID19, COVID-19, COVID 19, etc.?
Basically, the pattern I have so far is
grep "[Cc][Oo][Vv].{0,3}2\|[Cc][Oo][Vv].{0,3]19" file.txt
But this isn't working. I'm pretty sure the problem is the ".{0,3}" part. I'm not sure how to tell the computer to match up to 3 free characters, followed by 2 or 19, and preceded by [Cc][Oo][Vv]
Assuming you have GNU grep, your pattern contains two mistakes:
{0,3} - in a POSIX BRE pattern, a range quantifier is written with a pair of escaped braces, \{0,3\}
{0,3] - the same applies, and the closing brace was also mistyped as ].
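For reference, here is a sketch of the question's own pattern with only those two mistakes fixed, still in basic (BRE) syntax. It assumes GNU grep, since the \| alternation is a GNU extension in BRE; the variable name pattern is just for illustration:

```shell
# Corrected BRE form of the pattern from the question.
# \{0,3\} is the BRE interval; \| (alternation) is a GNU extension.
pattern='[Cc][Oo][Vv].\{0,3\}2\|[Cc][Oo][Vv].\{0,3\}19'

# The six spellings listed in the question, one per line; -c counts matching lines.
printf '%s\n' Cov2 CoV-2 COVID-2 COVID19 COVID-19 'COVID 19' | grep -c "$pattern"
# prints: 6
```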
You can use
grep -i -E "COV.{0,3}(2|19)" file
Or, a bit more precise:
grep -i -E "COV(ID)?[-[:space:]]?(2|19)"
See an online grep demo #1 and a demo #2.
Details
-i - case insensitive mode
-E - POSIX ERE syntax enabled (to avoid extra \ symbols in the regex pattern)
COV.{0,3}(2|19) - COV substring (case insensitive), then any zero to three chars, and then either 2 or 19
(ID)?[-[:space:]]? - matches an optional ID substring, and then an optional - or a whitespace char.
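To see what "a bit more precise" buys, the two patterns can be run against the spellings from the question plus one made-up near miss ("Cove 2", which is not in the original post). A small sketch:

```shell
# The six spellings from the question plus one false-positive candidate.
samples() {
    printf '%s\n' Cov2 CoV-2 COVID-2 COVID19 COVID-19 'COVID 19' 'Cove 2'
}

# Loose pattern: any 0-3 characters may sit between COV and the number,
# so "Cove 2" is counted as well.
samples | grep -ciE 'COV.{0,3}(2|19)'               # prints: 7

# Stricter pattern: only an optional ID and one optional hyphen or space.
samples | grep -ciE 'COV(ID)?[-[:space:]]?(2|19)'   # prints: 6
```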

Find and replace regular expression with alternate format

I have a file that has lines that contain text like this
something,12:3456789,somethingelse
foobar,12:345678,somethingdifferent
For lines where the second item has 6 digits after the :, I would like to change its format by adding a 0 at the front and shifting the :. For example, the above would change to:
something,12:3456789,somethingelse
foobar,01:2345678,somethingdifferent
I can't figure out how to do this using sed or any unix command line tool
You just need to match the middle section: 2 digits, then :, then exactly 6 digits. If you capture each part in its own group, you can rearrange the parts in the replacement. The \b word boundary at the end of the pattern ensures that exactly 6 digits are matched, so lines with the full 7 digits are left alone:
/\b(\d)(\d):(\d{6})\b/0\1:\2\3/
 |__________________| |______|
       pattern       replacement
This gives the expected output.
sed doesn't have Perl-style shorthands such as \d. Instead, you will need to use [[:digit:]]. Here is the updated regex that works with sed:
sed -E 's/\b([[:digit:]])([[:digit:]]):([[:digit:]]{6})\b/0\1:\2\3/g' myfile.txt
As @Jonathan Leffler pointed out, \b doesn't work in the sed that ships with macOS, so there you need to match the commas before and after the field instead and write them back in the replacement.
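A sketch of that comma-anchored variant, assuming the field always sits between two commas as in the sample lines (it would not fire on a first or last field). The function name pad_six_digit_field is hypothetical:

```shell
# The commas take over the job of \b and are written back in the
# replacement, so nothing but the middle field changes.
pad_six_digit_field() {
    sed -E 's/,([[:digit:]])([[:digit:]]):([[:digit:]]{6}),/,0\1:\2\3,/' "$@"
}

printf 'something,12:3456789,somethingelse\nfoobar,12:345678,somethingdifferent\n' |
    pad_six_digit_field
# prints:
# something,12:3456789,somethingelse
# foobar,01:2345678,somethingdifferent
```

The first line is untouched because its six digits are followed by a seventh digit rather than a comma.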

sed: ignore a substring that may or may not be present

With sed, I need to match and ignore a substring, that may or may not exist
Imagine I have these four strings, each on a separate line:
>package-1.22.3.src.tar.gz<
>package-1.22.4.src.tar.gz<
>package-1.23.tar.gz<
>package-1.23.1.tar.gz<
This is what I tried:
sed "s,.*>package-\(.[^<]*\)\(\.src\)\?\.tar.*<,\1,g"
I want a sed regex that will output this:
1.22.3
1.22.4
1.23
1.23.1
However, I get
1.22.3.src
1.22.4.src
1.23
1.23.1
The .[^<]* pattern matches any char with . and then [^<]* matches zero or more chars other than <. That already consumes the .src part, so the optional \(\.src\)\? has nothing left to match, and .src lands in Group 1.
If you want to fix your current code, just match digits and dots after package- with [0-9.]*:
sed "s,.*>package-\([0-9.]*\)\(\.src\)\?\.tar.*<,\1,g"
                    ^^^^^^^
See the online demo
If you have GNU grep you may also use a PCRE pattern like
grep -oP ">package-\K\d+(\.\d+)+"
See another online demo. Here, once >package- is matched, the \K operator drops it from the match; then one or more digits, followed by one or more repetitions of a . and one or more digits, are matched and printed thanks to the -o option.
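One portability note on the sed fix: \? in basic regex syntax is a GNU extension. The sketch below writes the same substitution in extended syntax (sed -E), where ? needs no backslash; the function name package_version is made up for illustration:

```shell
# Same idea as the [0-9.]* fix, written for sed -E.
# [0-9.]* backtracks off the trailing dot so that (\.src)? or \.tar can match.
package_version() {
    sed -E 's,.*>package-([0-9.]*)(\.src)?\.tar.*<,\1,' "$@"
}

printf '%s\n' '>package-1.22.3.src.tar.gz<' '>package-1.23.tar.gz<' | package_version
# prints:
# 1.22.3
# 1.23
```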
This sed should work:
sed -E -n 's/.*-(.*\.[0-9]+).*<$/\1/p'
Output:
1.22.3
1.22.4
1.23
1.23.1

Why does the regex (aba?)+ not match with abab?

Given (aba?)+ as the regex and abab as the string, why does it only match aba?
Since the last a in the regex is optional, isn't abab a match as well?
Tested on https://regex101.com/
The reason (aba?)+ only matches aba out of abab is greedy matching: The optional a in the first loop is tested before the group is tested again, and matches. Therefore, the remaining string is b, which does not match (aba?) again.
If you want to turn off greedy matching for this optional a, use a??, or write your regex differently.
Since (aba?)+ is greedy, your pattern tries to match as much as possible. And since it first matches "aba", the remaining "b" is not matched.
Try the non-greedy version (it will match the first and second "ab"'s):
$ echo "abab" | grep -Po "(aba?)+"
aba
$ echo "abab" | grep -Po "(aba??)+"
abab
The correct regex for this is:
^(aba??)+$
and not (aba??)+, as discussed with @WiktorStribizew and YSelf.
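A short sketch of how anchoring changes the result, assuming GNU grep built with PCRE support (-P), as in the commands above. With ^ and $ the engine is forced to backtrack, so even the greedy form gives up the optional a and covers all of abab:

```shell
# Unanchored, greedy: the first pass takes "aba" and the match ends there.
echo "abab" | grep -oP '(aba?)+'      # prints: aba

# Unanchored, lazy: a?? prefers to match nothing, so "ab" repeats twice.
echo "abab" | grep -oP '(aba??)+'     # prints: abab

# Anchored, greedy: $ fails after "aba", so the engine backtracks,
# releases the optional a, and matches "ab" a second time.
echo "abab" | grep -oP '^(aba?)+$'    # prints: abab
```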

Extract words containing question marks

I have tens of long text files (10k - 100k records each) where some characters were lost by careless handling and got replaced with question marks. I need to build a list of corrupted words.
I'm sure the most effective approach would be to process the files with a regex in sed or awk or some other bash tool, but I'm unable to compose a regex that would do the trick.
Here are a couple of sample records for processing:
?ilkin, Aleksandr, Zahhar, isa
?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.
Desired output would be:
?ilkin
?igadlo-?van
Veli?anõ
My best result so far seems to retrieve only words from the beginning of records:
awk '$1 ~/\?/ {print $1}' test.txt
->
?ilkin,
?igadlo-?van,
I need to build a list of corrupted words
If the aim is only to search for matches, grep would be the fastest and most powerful tool:
grep -Po '(^|)([^?\s]*?\?[^\s,]*?)(?=\s|,|$)' test.txt
The output:
?ilkin
?igadlo-?van
Veli?anõ
Explanation:
-P option - enables Perl-compatible regular expressions
-o option - prints only the matched substrings
(^|) - matches the start of the string or an empty value (we can't use the word boundary anchor \b here because the question mark ? is not a word character)
[^?\s]*? - matches any characters other than ? and whitespace \s, if present
\?[^\s,]*? - matches a question mark ? followed by any characters other than whitespace \s and , (which can sit at the right word boundary)
(?=\s|,|$) - positive lookahead assertion; ensures that the matched substring is followed by whitespace \s, a comma , or the end of the string
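Where grep lacks -P, a plainer pattern can get close. The following is a sketch, not the answer's method: it treats a "word" as a run of characters other than whitespace and commas, and adds sort -u so that each corrupted word is listed once. The function name corrupted_words is hypothetical:

```shell
# List each distinct word that contains a "?".
# In basic regex syntax an unescaped ? is a literal question mark.
# -h suppresses file names when several files are given.
corrupted_words() {
    grep -ho '[^[:space:],]*?[^[:space:],]*' "$@" | LC_ALL=C sort -u
}

printf '?ilkin, Aleksandr, Zahhar, isa\n?igadlo-?van, Maria, Karl, abikaasa\n' |
    corrupted_words
# prints:
# ?igadlo-?van
# ?ilkin
```

Called with file names, for example corrupted_words test.txt, it reads those files instead of standard input.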