search string preceded by either a space or a slash - regex

I have files with the following content:
76a6f0f631888fbd359420796093d19a3928123d remotes/origin/feature/ASC-122356
417435aceb671e41213697055b86d860d9a9a61c remotes/origin/feature/ASC-122356-3762
ae863a41fef068215be1529216e9dbba1314fa6f remotes/origin/master
I want to check whether the pattern origin/master is present in the file.
I'm currently using grep -e '^\S\+ origin/master$', but it doesn't match. How can I do it?

The following works with grep: one or more non-space characters, followed by a space, followed by a possibly empty sequence of non-space characters, followed by the expected string.
grep -P '\S+ \S*origin/master$' test
This can be tightened so that origin is either at the beginning of the second column or preceded by a /, which rules out strings like remotes/backup-origin/master:
grep -P '^\S+ (|\S*/)origin/master$' test
Note that these expressions require -P (Perl-compatible regular expressions).
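Where grep has no -P support, the same idea can probably be expressed in POSIX ERE. This is only a sketch: the backup-origin line and its all-zero hash are hypothetical sample data added to show the exclusion, and the variable names are made up for illustration.

```shell
# Sample data from the question plus one hypothetical line that should NOT match.
refs='76a6f0f631888fbd359420796093d19a3928123d remotes/origin/feature/ASC-122356
417435aceb671e41213697055b86d860d9a9a61c remotes/origin/feature/ASC-122356-3762
ae863a41fef068215be1529216e9dbba1314fa6f remotes/origin/master
0000000000000000000000000000000000000000 remotes/backup-origin/master'

# Hash, one space, an optional "some/path/" prefix, then origin/master at end of line.
matches=$(printf '%s\n' "$refs" |
  grep -E '^[^[:space:]]+ ([^[:space:]]*/)?origin/master$')
printf '%s\n' "$matches"
```

Only the remotes/origin/master line should be printed; the backup-origin line is rejected because origin is not directly preceded by a space or a slash.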

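Your pattern starts with '^\S\+ ', so (because of the '^') everything from the start of the line up to the space must be non-space, and origin/master must come directly after that space. In your data, remotes/ sits between the space and origin/master, so nothing matches.
Consider a similar version that only asks for ONE space, followed by a non-space prefix and then origin/master at the end of the line: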
grep -e ' \S\+origin/master$'

Related

Sed replace string on alphanumeric with certain length that must contain one capitalized letter and one number

I want to do a string replacement on any string that is surrounded by a word boundary that is alphanumeric and is 14 characters long. The string must contain at least one capitalized letter and one number. I know (I think I know) that I'll need to use positive look ahead for the capitalized letter and number. I am sure that I have the right regex pattern. What I don't understand is why sed is not matching. I have used online tools to validate the pattern like regexpal etc. Within those tools, I am matching the string like I expect.
Here is the regex and sed command I'm using.
\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b
The sed command I'm testing with is
echo "asdfASDF1234ds" | sed 's/\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b/NEW_STRING/g'
I would expect this to match on the echoed string.
sed understands a very limited form of regex. It does not have lookahead.
Using a tool with more powerful regex support is the simple solution.
If you must use sed, you could do something like:
$ sed '
# mark delimiters
s/[^a-zA-Z0-9]\{1,\}/\n&\n/g
s/^[^\n]/\n&/
s/[^\n]$/&\n/
# mark 14-character candidates
s/\n[a-zA-Z0-9]\{14\}\n/\n&\n/g
# mark if candidate contains capital
s/\n\n[^\n]*[A-Z][^\n]*\n\n/\n&\n/g
# check for a digit; if found, replace
s/\n\n\n[^\n]*[0-9][^\n]*\n\n\n/NEW_STRING/g
# remove marks
s/\n//g
' <<'EOD'
a234567890123n
,a234567890123n,
xx,a234567890123n,yy
a23456789A123n
XX,a23456789A123n,YY
xx,a23456789A1234n,yy
EOD
a234567890123n
,a234567890123n,
xx,a234567890123n,yy
NEW_STRING
XX,NEW_STRING,YY
xx,a23456789A1234n,yy
$
This might work for you (GNU sed):
sed -E 's/\<[A-Za-z0-9]{14}\>/\n&\n/
s/\n.*(([A-Z].*[0-9])|([0-9].*[A-Z])).*\n/NEW_STRING/
s/\n//g' file
Isolate a 14 alphanumeric word by delimiting it with newlines.
If the string between the newlines contains at least one uppercase alpha character and at least one numeric character, replace the string and its delimiters by NEW_STRING.
Remove the delimiters.
Or if multiple strings, perhaps:
sed -E 's/\b/\n/g
s#.*#echo "&"|sed -E "/^[a-z0-9]{14}$/I{/[A-Z]/{/[0-9]/s/.*/NEW_STRING/}}"#e
s/\n//g' file
sed doesn't support lookaheads, or many many many other modern regex Perlisms. The simple fix is to use Perl.
perl -pe 's/\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b/NEW_STRING/g' <<< "asdfASDF1234ds"

Regular expression to match string in line between single ":" field delimiters and exclude them, when the string also contains "::" field delimiters

Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing this input string is contained in a text file called file.txt, however the actual use case will be to parse /proc/cmdline, and I will need a solution that starts parsing, counting fields, and matching after encountering "ip=" until the next white space character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user#ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user#ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.
Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookbehind, assert not : to the left and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :
See a regex demo.
Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
View Demo here
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0
If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters.
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.
Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4
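For the /proc/cmdline use case mentioned in the question, the field-based approach could be combined with a step that first isolates the ip= token. A minimal sketch, assuming the parameters are separated by spaces; the command line below is a hypothetical stand-in for the real contents of /proc/cmdline.

```shell
# Hypothetical kernel command line; on a real system this would be read from /proc/cmdline.
cmdline='BOOT_IMAGE=/vmlinuz ro ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off quiet'

# Put each space-separated token on its own line, keep the ip= token,
# then take the 4th colon-separated field (empty fields still count).
netmask=$(printf '%s\n' "$cmdline" | tr -s ' ' '\n' | grep '^ip=' | cut -d: -f4)
printf '%s\n' "$netmask"
```

Because cut counts empty fields, the :: after the client address does not shift the netmask out of position 4.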

How to grep an exact string with slash in it?

I'm running macOS.
There are the following strings:
/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2
I want to grep only the bolded words.
I figured doing grep -wr 'superman/|/superman' would yield all of them, but it only yields /superman.
Any idea how to go about this?
You may use
grep -E '(^|/)superman($|/)' file
See the online demo:
s="/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2"
grep -E '(^|/)superman($|/)' <<< "$s"
Output:
/superman
/superman/batman
/superman/wonderwoman
/batman/superman
/wonderwoman/superman
The pattern matches
(^|/) - start of string or a slash
superman - a word
($|/) - end of string or a slash.
grep '/superman\>'
\> is the "end of word marker", and for "superman3", the end of word is not following "man"
The problems with your -w solution:
| is not special in a basic regex. You either need to escape it or use grep -E
read the man page about how -w works:
The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character
In the case where the line is /batman/superman,
the pattern superman/ does not appear
the pattern /superman is:
at the end of the line, which is OK, but
is preceded by the character "n" which is a word constituent character.
grep -w superman will give you better results, or if you need to have superman preceded by a slash, then my original answer works.
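If word-boundary rules feel too subtle, another option is to skip regexes and compare path components directly. This is a sketch of that alternative, not what either answer proposes: awk splits each line on / and prints it when any component is exactly superman.

```shell
# The sample paths from the question.
paths='/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2'

# Print a line as soon as one of its /-separated components equals "superman".
exact=$(printf '%s\n' "$paths" |
  awk -F/ '{ for (i = 1; i <= NF; i++) if ($i == "superman") { print; next } }')
printf '%s\n' "$exact"
```

This should print the same five lines as the grep -E version, since superman1, superman2 and superman3 never compare equal to superman.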

Extract words containing question marks

I have tens of long text files (10k - 100k records each) where some characters were lost by careless handling and got replaced with question marks. I need to build a list of corrupted words.
I'm sure the most effective approach would be to regex the file with sed or awk or some other bash tools, but I'm unable to compose regex that would do the trick.
Here are a couple of sample records for processing:
?ilkin, Aleksandr, Zahhar, isa
?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.
Desired output would be:
?ilkin
?igadlo-?van
Veli?anõ
My best result so far seems to retrieve only words from the beginning of records:
awk '$1 ~/\?/ {print $1}' test.txt
->
?ilkin,
?igadlo-?van,
I need to build a list of corrupted words
If the aim is only to search for matches, grep would be the fastest and most powerful tool:
grep -Po '(^|)([^?\s]*?\?[^\s,]*?)(?=\s|,|$)' test.txt
The output:
?ilkin
?igadlo-?van
Veli?anõ
Explanation:
-P option, enables Perl regular expressions
-o option, tells grep to print only the matched substrings
(^|) - matches the start of the string or the empty string (we can't use the word boundary anchor \b here because the question mark ? is a non-word character, so \b would not match before a leading ?)
[^?\s]*? - lazily matches zero or more characters other than ? and whitespace \s
\?[^\s,]*? - matches a question mark ? followed, lazily, by zero or more characters other than whitespace \s and , (which can sit at the right word boundary)
(?=\s|,|$) - positive lookahead, ensures that the matched substring is followed by whitespace \s, a comma , or the end of the string
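Where -P is unavailable, a plain basic regex may be enough, because ? is a literal character in BRE. A rough sketch under the same assumption as the answer, namely that words are delimited by whitespace and commas; the variable names are only for illustration.

```shell
# The two sample records from the question.
records='?ilkin, Aleksandr, Zahhar, isa
?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.'

# Any run of non-space, non-comma characters that contains at least one literal "?".
words=$(printf '%s\n' "$records" | grep -o '[^[:space:],]*?[^[:space:],]*')
printf '%s\n' "$words"
```

Piping the result through sort -u would collapse repeated words when the list is built from many files.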

How to grep for this pattern in Unix

I want to grep for this particular pattern. The pattern is as follows
**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887
inside the file test.txt which has the following data
NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv
I tried using grep "**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887" test.txt but it's not returning anything. Please advise.
EDIT:
Hi, basically I'm inside a loop, and only sometimes do I get files with this pattern. Currently I'm using grep "$i" test.txt, which works in all cases except when I encounter such patterns.
And I'm actually grepping for the exact file number and file sequence, so if it says 123_29887 it will be 123_29887. Thanks.
You could use:
grep -P "(?i)\*\*[a-z\d]+\*\*[a-z]+_\d+_\d+" somepath
(?i) turns on case-insensitive mode
\*\* matches the two opening stars
[a-z\d]+ matches letters and digits
\*\* matches two more stars
[a-z]+ matches letters
_\d+_\d+ matches underscore, digits, underscore, digits
If you need to be more specific (for instance, you know that a group of digits always has three digits), you can replace parts of the expression: for instance, \d+ becomes \d{3}
Matching a Literal but Yet Unknown Pattern: \Q and \E
If you receive literal patterns that you need to match, such as **xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887, the issue is that special regex characters such as * need to be escaped. If the whole string is a literal, we do this by escaping the whole string between \Q and \E:
grep -P "\Q**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887\E" somepath
And in a loop, of course, you can build that regex programmatically by concatenating \Q and \E on both sides.
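When the whole search term is literal, as with "$i" in the question's loop, grep's -F (fixed string) option is another way to avoid escaping altogether. A minimal sketch; the temporary file and the variable names are illustrative only.

```shell
# Throwaway file holding the line from the question.
tmp=$(mktemp)
printf '%s\n' 'NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv' > "$tmp"

# Stand-in for the loop variable $i from the question.
i='**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887'

# -F treats the pattern as a fixed string, so * is not a quantifier;
# -- ends option parsing in case a pattern begins with a dash.
found=$(grep -F -- "$i" "$tmp")
printf '%s\n' "$found"
rm -f "$tmp"
```

With -F the same command works for ordinary names and for names containing regex metacharacters, so the loop needs no special case.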