Matching string with regex

I keep banging my head against the wall looking for a regex that matches a string like any of these:
--7928ae02-A--
--7928ae02-B--
--7928ae02-F--
--7928ae02-H--
--7928ae02-Z--
The string is two dashes, 8 characters of any letter or number, a dash, an uppercase letter A-Z, and two dashes.
Here's an example of where I'm at:
grep '^--[a-fA-F0-9]{8}-[A-Z]--$'

This might work
grep -E -- '^--[[:alnum:]]{8}-[[:upper:]]--$'

You should probably run grep with the -P or -E flag:
grep -P '^--[a-fA-F0-9]{8}-[A-Z]--$'
-P interprets the pattern as a Perl-compatible regex; -E interprets it as an extended regex. Both support extras such as the {} repetition operator, which plain grep (basic regex mode) treats as literal text unless it is written as \{8\}.
Running this on your test strings, they all match.
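A quick way to check the pattern against sample input, sketched under the assumption that your grep supports -E. The last two sample lines and the variable names are made up for illustration; they are not from the question.

```sh
# Pattern from the answers above, in extended (ERE) syntax.
pattern='^--[a-fA-F0-9]{8}-[A-Z]--$'

# The first two lines come from the question; the last two are made-up
# strings that should be rejected (lowercase suffix, only 7 characters).
samples='--7928ae02-A--
--7928ae02-Z--
--7928ae02-a--
--7928ae0-A--'

# With -E, {8} is a repetition count, so the two valid lines match.
matched=$(printf '%s\n' "$samples" | grep -E -- "$pattern")
printf '%s\n' "$matched"

# Without -E (basic regex), {8} is literal text, so nothing matches.
# "|| true" only keeps a zero count from being treated as a failure.
basic_count=$(printf '%s\n' "$samples" | grep -c -- "$pattern" || true)
printf 'matches without -E: %s\n' "$basic_count"
```

The second grep call shows why the original attempt in the question found nothing.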

Related

How to use square brackets in grep for MINGW64?

Currently, I have the following regex. It should match a string that I am echoing:
echo "TBGFSGFI22800_D_REP_D_RISIKOEINHEIT" | grep -E 'TBGFSGFI\d\d\d\d\d[A-Za-z_]{1,100}'
It works as expected on OS X on my Mac and in Notepad++, but in Bash for Windows (MINGW64) I get an empty string. How can I use grep with flags, or how should I rewrite the regex, to match the pattern?
My grep version is 3.1. Bash: 4.4.23(1)
Thanks for help in advance!
You are using a POSIX ERE regex with the -E option, and that flavor does not support the \d construct. You also need the -o option to actually extract the matches.
Note you do not need to repeat \d five times, you can use a range quantifier, \d{5}.
You can use
echo "TBGFSGFI22800_D_REP_D_RISIKOEINHEIT" | grep -Po "TBGFSGFI\d{5}[A-Za-z_]{1,100}"
Where
-P means the regex is of a PCRE flavor
-o extracts matches only
TBGFSGFI\d{5}[A-Za-z_]{1,100} - a regex that matches TBGFSGFI, then any five digits and then 1-100 ASCII letters or _.
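If PCRE support (-P) is not available, one possible alternative is to stay with -E and spell the digit class as [0-9]. This is only a sketch; it assumes a grep that supports -o (GNU and BSD grep do), and the variable names are illustrative.

```sh
input='TBGFSGFI22800_D_REP_D_RISIKOEINHEIT'

# [0-9] is valid in every POSIX ERE implementation, unlike \d,
# so this form does not depend on -P.
result=$(printf '%s\n' "$input" | grep -Eo 'TBGFSGFI[0-9]{5}[A-Za-z_]{1,100}')
printf '%s\n' "$result"
```

For this input the whole string is printed, because everything after the five digits consists of ASCII letters and underscores.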

How to replace with one sed command first n letter to uppercase

I would like to convert the first n letters of a word to uppercase with a single sed command.
Example 'madrid' to 'MADrid'. (n=3)
I know how to change first letter to uppercase with this command:
sed -e "s/\b\(.\)/\U\1/g"
but I don't know how to change this command for my problem.
I tried to change
sed -e "s/\b\(.\)/\U\1/g"
to
sed -e "s/\b\(.\)/\U\3/g"
but this didn't work. I also googled and searched this site, but I couldn't find an exact answer to my problem.
Thank you.
I infer from your use of \U that you're using GNU sed:
n=3
echo 'madrid' | sed -r 's/\<(.{'"$n"'})/\U\1/g' # -> 'MADrid'
I've omitted the unnecessary -e option.
I have added -r to enable support for extended regular expressions, which have more familiar syntax and also offer more features.
I'm using a single-quoted sed script with a shell-variable value spliced in so as to avoid confusion between what the shell expands up front and what is interpreted by sed itself.
\< is used instead of \b because, unlike the latter, it only matches at the start of a word. (Thanks, Casimir et Hippolyte.)
The above replaces any 3 characters at the start of a word, however.
To limit it to at most $n letters:
sed -r 's/\<([[:alpha:]]{1,'"$n"'})/\U\1/g'
As for what you've tried:
The \3 in your attempt sed -e "s/\b\(.\)/\U\3/g" refers to the 3rd capture group (parenthesized subexpression, (...)) in the regex (which doesn't exist), it does not refer to 3 repetitions.
Instead, you have to make sure that your one and only capture group (which you can reference as \1 in the substitution) itself captures as many characters as desired - which is what the {<n>} quantifier is for; the related {<m>,<n>} construct matches a range of repetitions.
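The letter-limited variant above can be wrapped in a small shell function. This is a sketch that assumes GNU sed (for \U and \<); the function name upper_first_n is made up for illustration.

```sh
# upper_first_n N: uppercase up to N leading letters of every word on stdin.
# Assumes GNU sed; the count is spliced into the {1,N} quantifier.
upper_first_n() {
    n=$1
    sed -r 's/\<([[:alpha:]]{1,'"$n"'})/\U\1/g'
}

printf '%s\n' 'madrid' | upper_first_n 3          # prints MADrid
printf '%s\n' 'madrid is big' | upper_first_n 3   # prints MADrid IS BIG
```

Words shorter than n letters are uppercased completely, because the quantifier {1,n} accepts fewer than n letters.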
This might work for you (GNU sed):
sed -r 's/[a-z]/&\n/'"$n"';s/^([^\n]*)\n/\U\1/' file
Here $n is the number of letters to convert. Putting the question of word boundaries aside, this converts the first n letters in a-z, consecutive or not, to upper case, i.e. A-Z.
N.B. this is two sed commands not one!

sed does not match the regex

I wrote this regex:
/_([^_+\n][\w]+)_/g
and I wanted to test it out on my terminal with
echo "HELLO ___ _HELO_WORLD_" | sed "/_([^_+\n][\w]+)_/g"
However, it outputs
HELLO ___ _HELO_WORLD_
which means sed does not match anything.
The result needs to be:
_HELO_WORLD_
I am using OS X, and I tried both -E and -e as suggested by other posts, but that didn't change anything. What am I doing wrong here?
sed is not particularly well suited to this task: it is good at applying patterns to lines, less so to words, which makes the regexes overly complicated.
Word-oriented solution
Anyhow, here's an attempt, using two replacement patterns:
sed -e 's|\<[^_][^\> ]*[^_]\> *||g' -e 's|\<_*\> *||g'
The first expression replaces any word that neither starts nor ends with an underscore (and any trailing spaces) with nothing. \< marks the beginning of a word and \> the end, so \<[^_][^\> ]*[^_]\> reads: at the beginning of a word (\<) there is a character that is not an underscore ([^_]), followed by any number of non-space characters ([^\> ]*; inside a bracket expression \> is not a word boundary, so this class simply excludes backslash, > and space), followed by a character that is not an underscore ([^_]) right before the word ends (\>).
The second expression is simpler and replaces any word consisting solely of underscores with nothing.
Line-oriented processing
If you can arrange for your data to be one expression per line, you can use something like the following:
$ cat data.txt
HELLO
___
_HELO_WORLD_
$ cat data.txt | sed -n -e '/_[^_+\s]\w*_/p'
_HELO_WORLD_
$
The sed expression is almost the one you gave. sed doesn't accept the bare + here because, in basic regular expressions, + is a literal character (GNU sed needs \+ or the -E/-r flag), so I use * as a workaround.
The basic trick is to use the -n flag to disable the default printing of lines and the p command to explicitly print matching lines.
I am still not sure what you are asking, so I will answer what I guess you are asking. My guess is that you want to find strings surrounded by underscores with sed. The short answer is: no. The longer answer is: you cannot find overlapping parts of a string with sed, because it does not support lookahead.
If you take this string _HELLO_WORLD_ and the following pattern _[^_]*_, the pattern will match _HELLO_ and the remaining string is WORLD_, which will not match, because the leading underscore has already been consumed.
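The consumed-underscore effect described above can be observed directly. This sketch uses grep -o rather than sed, only because -o prints each non-overlapping match on its own line; the variable name is illustrative.

```sh
# Each match consumes its closing underscore, so the scan resumes at
# "WORLD_", which has no leading underscore left to start a new match.
out=$(printf '%s\n' '_HELLO_WORLD_' | grep -o '_[^_]*_')
printf '%s\n' "$out"   # prints only _HELLO_
```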
Sed is the wrong tool for this. Use Perl instead. This prints all strings surrounded by underscores:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/_([A-Z]+)(?=_)/print $1/ge'
HELOWORLD
Update reflecting your last comment:
If you want to find strings starting and ending with an underscore at word boundaries, use this one:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/\b_([A-Z]+[_A-Z]*[A-Z]*)_\b/print $1/ge'
HELO_WORLD
There are multiple problems:
Your sed command is a condition (an address). It should be an action, such as s/pattern/replacement/flags, or the condition should be followed by an action, e.g. /_([^_+\n][\w]+)_/p to print the line.
With sed, you either need to escape your parentheses and + or use the -r (--regexp-extended) flag.
[\w]: \w is already a character class by itself, so there is no need to wrap it in a bracket expression.
Finally, a shot at what I think you want with GNU grep:
grep -P -o "_[^_+\n\s]\w+_"
$ echo "HELLO ___ _HELO_WORLD_" | grep -P -o "_[^_+\n\s]\w+_"
_HELO_WORLD_
Using grep is enough, and easier, if you only need to match.
-o lets you retrieve only the matched part rather than the whole line.
-P uses Perl regexes, so you can use shorthand classes such as \n and \s.
I added \s to the negated class, because previously it could match the space before what you want to match, since \w can match the underscore.
If you can't use GNU grep, then it's back to sed, which is already answered by ceving.
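Another possibility when -P is unavailable, offered as a sketch rather than a drop-in replacement: the same idea written as a POSIX ERE, with bracket classes standing in for \w and \s. It assumes a grep that supports -o; the variable names are illustrative.

```sh
input='HELLO ___ _HELO_WORLD_'

# [^_[:space:]] replaces [^_+\n\s] and [[:alnum:]_] replaces \w.
result=$(printf '%s\n' "$input" | grep -Eo '_[^_[:space:]][[:alnum:]_]+_')
printf '%s\n' "$result"   # prints _HELO_WORLD_
```

The + and newline from the original negated class are dropped here, since neither can follow an underscore in this input.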
As many answers and the downvotes suggest, sed doesn't look like the right tool to use for this question, so I ended up using Python, which worked out really well, so I will just post it here for anyone in the future who might have same problem.
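import re
text = "HELLO ___ _HELO_WORLD_"  # sample input from the question
p = re.compile(r'_([^_+\n][\w ]+)_')  # raw string so \w and \n reach the regex engine
result = p.findall(text)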

Match string plus any non-whitespace character and insert whitespace

I'm trying to match and replace a string in a lot of files.
String to search for:
</ANON>[any non-whitespace char], e.g. "</ANON>." or "</ANON>)"
I want to stick a whitespace in between the tag and the non-whitespace char.
I have tried to do it with sed using something like:
sed -i -e 's/<\/ANON>/S/<\/ANON> /S/g'
but alas, that doesn't work.
Any help much appreciated.
Try the following:
sed -i -e 's|\(</ANON>\)\([^[:space:]]\)|\1 \2|g' file
sed is not Perl, so \S for non-whitespace characters is not portable (GNU sed accepts it as an extension, but POSIX sed does not); [^[:space:]] works everywhere. You should also capture groups and use them in the replacement part. Finally, /S cannot work: it is not a valid escape, and the slash is the delimiter sed uses to separate the pattern, replacement and flags.
P.S. Or you can use Perl if you like:
perl -p -i -e 's|(</ANON>)(\S)|$1 $2|g' file
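Before running either command with -i on real files, it may be worth trying the substitution on a sample line via stdin. A minimal sketch; the function name add_space_after_anon and the sample text are made up.

```sh
# Same substitution as the sed answer above, but reading stdin and
# writing stdout, so no file is modified.
add_space_after_anon() {
    sed -e 's|\(</ANON>\)\([^[:space:]]\)|\1 \2|g'
}

printf '%s\n' 'a</ANON>. b</ANON>) c</ANON> d' | add_space_after_anon
# prints: a</ANON> . b</ANON> ) c</ANON> d
```

The third tag is left alone because it is already followed by whitespace.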

Extract numbers from a string using sed and regular expressions

Another question for the sed experts.
I have a string representing a pathname that will have two numbers in it. An example is:
./pentaray_run2/Trace_220560.dat
I need to extract the second of these numbers - ie 220560
I have (with some help from the forums) been able to extract all the numbers together (ie 2220560) with:
sed "s/[^0-9]//g"
or extract only the first number with:
sed -r 's|^([^.]+).*$|\1|; s|^[^0-9]*([0-9]+).*$|\1|'
But what I'm after is the second number!! Any help much appreciated.
PS the number I'm after is always the second number in the string.
Is this OK?
sed -r 's/.*_([0-9]*)\..*/\1/g'
with your example:
kent$ echo "./pentaray_run2/Trace_220560.dat"|sed -r 's/.*_([0-9]*)\..*/\1/g'
220560
You can extract the last numbers with this:
sed -e 's/.*[^0-9]\([0-9]\+\)[^0-9]*$/\1/'
It is easier to think about this backwards:
1. From the end of the string, match zero or more non-digit characters.
2. Match (and capture) one or more digit characters.
3. Match at least one non-digit character.
4. Match all the characters to the start of the string.
Part 3 of the match is where the "magic" happens, but it also requires at least one non-digit before the number (i.e. you can't match a string whose only number is at the very start, although there is a simple workaround: insert a non-digit at the start of the string).
The magic is to counteract the left-to-right greediness of the .* (part 4). Without part 3, part 4 would consume everything it can, including all but one digit of the number; with it, the match has to leave at least a non-digit followed by a digit for parts 3 and 2, which allows the whole number to be captured.
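The behaviour described above, including the edge case of a number at the very start, can be checked with a small sketch. It assumes GNU sed (\+ is a GNU extension in basic regex mode); the function name last_number and the two short sample strings are made up.

```sh
# last_number: reduce each input line to its last run of digits.
last_number() {
    sed -e 's/.*[^0-9]\([0-9]\+\)[^0-9]*$/\1/'
}

printf '%s\n' './pentaray_run2/Trace_220560.dat' | last_number  # prints 220560
printf '%s\n' '123abc'  | last_number   # no non-digit before the number: line is unchanged
printf '%s\n' 'x123abc' | last_number   # with a leading non-digit: prints 123
```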
If grep is welcome :
$ echo './pentaray_run2/Trace_220560.dat' | grep -oP '\d+\D+\K\d+'
220560
Or, more portably, with Perl and the same regex:
echo './pentaray_run2/Trace_220560.dat' | perl -lne 'print $& if /\d+\D+\K\d+/'
220560
I think this approach is cleaner and more robust than using sed.
This might work for you (GNU sed):
sed -r 's/([^0-9]*([0-9]*)){2}.*/\2/' file
The command above extracts the second number. Changing {2} to {1} extracts the first:
sed -r 's/([^0-9]*([0-9]*)){1}.*/\2/' file
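The repetition count can be passed in as a parameter. This is a sketch assuming GNU sed and an input line that contains at least n numbers; the function name nth_number is made up.

```sh
# nth_number N: print the N-th run of digits on each input line.
# The count is spliced into the {N} quantifier; \2 holds the digits
# captured by the last repetition of the outer group.
nth_number() {
    n=$1
    sed -r 's/([^0-9]*([0-9]*)){'"$n"'}.*/\2/'
}

printf '%s\n' './pentaray_run2/Trace_220560.dat' | nth_number 1   # prints 2
printf '%s\n' './pentaray_run2/Trace_220560.dat' | nth_number 2   # prints 220560
```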