Sed not pattern matching - regex

I have a directory with these files in it:
abc12345abc
abc1234567abc
abc123456789abc
I want to grab the file that has 7 numerals in it. I need to do this using sed via the pipe. I thought this would work:
ls -l | sed -n '/[0-9]\{7\}/p'
It returns:
abc1234567abc
abc123456789abc

Regular expressions aren't anchored. abc123456789abc has a string of exactly 7 digits, 3 of them in fact: 1234567, 2345678, and 3456789. If you want the file names that don't have any longer matches, you need to check for non-digits before and after.
sed -n '/[^0-9][0-9]\{7\}[^0-9]/p'

You want to match a run of seven digits that has no other digit immediately before or after it.
You may use
sed -En '/(^|[^0-9])[0-9]{7}($|[^0-9])/p'
Details
-E - enables POSIX ERE syntax (now, there is no need to escape {x} interval quantifiers)
(^|[^0-9]) - start of string or a non-digit char
[0-9]{7} - seven digits
($|[^0-9]) - end of string or a non-digit char.
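As a minimal sketch of the pattern above, the following feeds the three names from the question (plus a bare 1234567, added here as an assumption to exercise the ^ and $ alternatives) through sed. The function name exactly_seven is hypothetical.

```shell
# Hypothetical helper: print only lines containing a run of exactly seven digits.
# The run must be bounded by a non-digit or by the start/end of the line.
exactly_seven() {
  sed -En '/(^|[^0-9])[0-9]{7}($|[^0-9])/p'
}

# Sample names from the question, plus a name that is nothing but seven digits.
printf '%s\n' abc12345abc abc1234567abc abc123456789abc 1234567 | exactly_seven
```

Only abc1234567abc and 1234567 should come through; the simpler [^0-9]...[^0-9] form would miss the second one because it needs a character on both sides.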

First rule of scripting: don't parse ls.
If you are trying to match files in a directory, use find; that's what it's meant for (-regextype requires GNU find):
find dir/ -regextype posix-extended -type f \
-regex ".*[^[:digit:]][[:digit:]]{7}[^[:digit:]].*"

This might work for you (GNU sed):
sed '/[0-9]\{7\}/!d;/[0-9]\{8\}/d' file
If there are not 7 consecutive digits or there are more, delete the line.
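A small sketch of this two-step filter on the names from the question; the helper name seven_digit_run is made up for illustration.

```shell
# Hypothetical helper: keep a line only if it has a run of at least seven digits
# (first command) and no run of eight or more (second command).
seven_digit_run() {
  sed '/[0-9]\{7\}/!d;/[0-9]\{8\}/d'
}

printf '%s\n' abc12345abc abc1234567abc abc123456789abc | seven_digit_run
```

One design difference from the boundary-based patterns: a line that contains both a seven-digit run and a separate, longer run is dropped here, because the second command deletes any line with eight consecutive digits.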

replace last n parts after splitting on delimiter using sed or regex

I need to replace last 2 parts of the string separated by delimiter with empty space to clean up the name.
Example:
something-useful-a12356-78929
=>
something-useful
something-more-useful-v35f62-2728902
=>
something-more-useful
I tried the following:
echo "something-useful-12345-67890" | sed -re 's/(-([0-9])+)//g'
This works if my last 2 elements of delimiter are numbers only, but wouldn't work for the example above. I need to remove the last 2 parts after splitting it on "-"
I can only use sed or regex to solve this.
Does sed 's/\(-[^-]*\)\{2\}$//' file do what you want?
Use [^-] to match anything other than -. Use $ to match the end of the string. Match hyphen followed by non-hyphens twice at the end.
sed -r 's/(-[^-]+){2}$//'
This might work for you (GNU sed):
sed -re 's/-[^-]*//2g' file
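Removes every occurrence of - followed by non-hyphen characters, starting from the second one. Note that this keeps only the first two parts, so it matches the expected output for something-useful-a12356-78929 but turns something-more-useful-v35f62-2728902 into something-more.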
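To make the end-anchored approach concrete, here is a sketch that takes the number of trailing parts as an argument. The function name strip_last_parts and its parameter are assumptions for illustration, not part of any answer above.

```shell
# Hypothetical helper: drop the last N hyphen-separated parts of each input line.
# $1 is spliced into the interval quantifier; the $ anchors the match at the end.
strip_last_parts() {
  n=$1
  sed -E 's/(-[^-]+){'"$n"'}$//'
}

printf '%s\n' something-useful-a12356-78929 \
              something-more-useful-v35f62-2728902 | strip_last_parts 2
```

Because the match is anchored at the end of the line, the number of parts at the front of the name does not matter.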

sed from constant regex

I tried to remove the unwanted symbols
%H1256
*+E1111
*;E2311
+-'E3211
{E4511
DE4513
so I tried by using this command
sed 's/+E[0-9]/E/g'
but it won't remove the blank spaces, and the digits need to be preserved.
expected:
H1256
E1111
E2311
E3211
E4511
E4513
EDIT
Special thanks to https://stackoverflow.com/users/3832970/wiktor-stribiżew my days have been saved by him
sed -n 's/.*\([A-Z][0-9]*\).*/\1/p' file or grep -oE '[A-Z][0-9]+' file
You may use either sed:
sed -n 's/.*\([[:upper:]][[:digit:]]*\).*/\1/p' file
or grep:
grep -oE '[[:upper:]][[:digit:]]+' file
Basically, the patterns match an uppercase letter ([[:upper:]]) followed by digits ([[:digit:]]* matches 0 or more digits in the POSIX BRE sed solution and [[:digit:]]+ matches 1+ digits in the POSIX ERE grep solution).
While the sed solution will extract a single value (the last one) from each line, grep will extract all values it finds from all lines.
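A quick sketch of the sed variant run against the sample lines from the question; extract_code is a hypothetical wrapper name.

```shell
# Hypothetical helper: print the last "uppercase letter + digits" token on each line.
# The leading .* is greedy, so the group lands on the last uppercase letter.
extract_code() {
  sed -n 's/.*\([[:upper:]][[:digit:]]*\).*/\1/p'
}

# Sample lines from the question; all are quoted so the shell leaves them alone.
printf '%s\n' '%H1256' '*+E1111' '*;E2311' "+-'E3211" '{E4511' 'DE4513' | extract_code
```

For DE4513 the greedy .* swallows the D, so the output is E4513, which is what the question expects.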
This strips leading non-alphanumeric characters (it leaves DE4513 as is, since D is alphanumeric):
sed -E 's/^[^[:alnum:]]+//' file
Or if it is only the last 5 characters you need
sed -E 's/.*(.{5})$/\1/' file

How to replace with one sed command first n letter to uppercase

I would like to replace with one sed command first n letter to uppercase.
Example 'madrid' to 'MADrid'. (n=3)
I know how to change first letter to uppercase with this command:
sed -e "s/\b\(.\)/\U\1/g"
but I don't know how to change this command for my problem.
I tried to change
sed -e "s/\b\(.\)/\U\1/g"
to
sed -e "s/\b\(.\)/\U\3/g"
but this didn't work. I also googled and searched this site, but I couldn't find an exact answer to my problem.
Thank you.
I infer from your use of \U that you're using GNU sed:
n=3
echo 'madrid' | sed -r 's/\<(.{'"$n"'})/\U\1/g' # -> 'MADrid'
I've omitted the unnecessary -e option
I have added -r to enable support for extended regular expressions, which have more familiar syntax and also offer more features.
I'm using a single-quoted sed script with a shell-variable value spliced in so as to avoid confusion between what the shell expands up front and what is interpreted by sed itself.
\< is used instead of \b, because unlike the latter it only matches at the start of a word. Thanks, Casimir et Hippolyte.
The above replaces any 3 characters at the start of a word, however.
To limit it to at most $n letters:
sed -r 's/\<([[:alpha:]]{1,'"$n"'})/\U\1/g'
As for what you've tried:
The \3 in your attempt sed -e "s/\b\(.\)/\U\3/g" refers to the 3rd capture group (parenthesized subexpression, (...)) in the regex (which doesn't exist), it does not refer to 3 repetitions.
Instead, you have to make sure that your one and only capture group (which you can reference as \1 in the substitution) itself captures as many characters as desired - which is what the {<n>} quantifier is for; the related {<m>,<n>} construct matches a range of repetitions.
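A minimal sketch wrapping the letters-only variant in a function, assuming GNU sed (both \U and \< are GNU extensions). The name upper_first_n is hypothetical.

```shell
# Hypothetical helper: uppercase up to N leading letters of every word (GNU sed).
# $1 is spliced into the {1,N} interval between the single-quoted script pieces.
upper_first_n() {
  n=$1
  sed -E 's/\<([[:alpha:]]{1,'"$n"'})/\U\1/g'
}

echo 'madrid' | upper_first_n 3
echo 'madrid barcelona' | upper_first_n 3
```

With the g flag each word is handled separately, so the second call should capitalize the first three letters of both words.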
This might work for you (GNU sed):
sed -r 's/[a-z]/&\n/'"$n"';s/^([^\n]*)\n/\U\1/' file
Here $n is the number of letters to convert. Setting aside the question of word boundaries, this converts the first n letters in a-z, consecutive or not, to upper case (A-Z).
N.B. this is two sed commands not one!

Extract numbers from a string using sed and regular expressions

Another question for the sed experts.
I have a string representing a pathname that will have two numbers in it. An example is:
./pentaray_run2/Trace_220560.dat
I need to extract the second of these numbers - ie 220560
I have (with some help from the forums) been able to extract all the numbers together (ie 2220560) with:
sed "s/[^0-9]//g"
or extract only the first number with:
sed -r 's|^([^.]+).*$|\1|; s|^[^0-9]*([0-9]+).*$|\1|'
But what I'm after is the second number!! Any help much appreciated.
PS the number I'm after is always the second number in the string.
is this ok?
sed -r 's/.*_([0-9]*)\..*/\1/g'
with your example:
kent$ echo "./pentaray_run2/Trace_220560.dat"|sed -r 's/.*_([0-9]*)\..*/\1/g'
220560
You can extract the last numbers with this:
sed -e 's/.*[^0-9]\([0-9]\+\)[^0-9]*$/\1/'
It is easier to think this backwards:
1. From the end of the string, match zero or more non-digit characters
2. Match (and capture) one or more digit characters
3. Match at least one non-digit character
4. Match all the characters to the start of the string
Part 3 of the match is where the "magic" happens, but it also limits your matches to have at least a non-digit before the number (ie. you can't match a string with only one number that is at the start of the string, although there is a simple workaround of inserting a non-digit to the start of the string).
The magic is to counter-act the left-to-right greediness of the .* (part 4). Without part 3, part 4 would consume all it can, which includes the numbers, but with it, matching makes sure that it stops in order to allow at least a non-digit followed by a digit to be consumed by parts 1 and 2, allowing the number to be captured.
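The following sketch combines the last-number pattern with the workaround mentioned above (inserting a non-digit at the start of the line). The function name last_number and the choice of : as the inserted character are assumptions for illustration.

```shell
# Hypothetical helper: print the last run of digits on each line.
#  1. prepend ':' so that a number at the very start still has a non-digit before it
#  2. apply the "last number" substitution from the answer
#  3. remove the ':' again if step 2 did not match (line without digits)
last_number() {
  sed -e 's/^/:/' \
      -e 's/.*[^0-9]\([0-9]\{1,\}\)[^0-9]*$/\1/' \
      -e 's/^://'
}

echo './pentaray_run2/Trace_220560.dat' | last_number
echo '123.dat' | last_number
```

The interval \{1,\} is used instead of \+ only to stay within POSIX BRE; with GNU sed the two behave the same.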
If grep is welcome:
$ echo './pentaray_run2/Trace_220560.dat' | grep -oP '\d+\D+\K\d+'
220560
A more portable option is Perl, with the same regex:
echo './pentaray_run2/Trace_220560.dat' | perl -lne 'print $& if /\d+\D+\K\d+/'
220560
I think the approach is cleaner & more robust than using sed
This might work for you (GNU sed):
sed -r 's/([^0-9]*([0-9]*)){2}.*/\2/' file
That extracts the second number, whereas
sed -r 's/([^0-9]*([0-9]*)){1}.*/\2/' file
extracts the first.
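Since the two commands differ only in the repetition count, they can be folded into one parameterized sketch. The helper name nth_number is hypothetical, and the behavior relies on the backreference holding the capture from the last repetition, as it does in GNU sed.

```shell
# Hypothetical helper: print the N-th number found on each line (GNU sed).
# Each repetition consumes "non-digits, then digits"; \2 refers to the digits
# captured by the final repetition.
nth_number() {
  n=$1
  sed -E 's/([^0-9]*([0-9]*)){'"$n"'}.*/\2/'
}

path='./pentaray_run2/Trace_220560.dat'
echo "$path" | nth_number 1
echo "$path" | nth_number 2
```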

Regex to match unique substrings

Here's a basic regex technique that I've never managed to remember. Let's say I'm using a fairly generic regex implementation (e.g., grep or grep -E). If I were to do a list of files and match any that end in either .sty or .cls, how would I do that?
ls | grep -E "\.(sty|cls)$"
\. matches literally a "." - an unescaped . matches any character
(sty|cls) - match "sty" or "cls" - the | means "or" and the parentheses group the alternatives.
$ forces the match to be at the end of the line
Note: you want grep -E or egrep, not grep -e; lowercase -e is a different option, used to supply one or more patterns.
egrep "\.sty$|\.cls$"
This regex:
\.(sty|cls)\z
will match any string that ends with .sty or .cls
EDIT:
for grep \z should be replaced with $ i.e.
\.(sty|cls)$
as jelovirt suggested.
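A short sketch of the grep -E form on a made-up list of names (the file names and the function name match_sty_cls are assumptions), including one name that contains "sty" without ending in it:

```shell
# Hypothetical helper: keep only names ending in .sty or .cls.
match_sty_cls() {
  grep -E '\.(sty|cls)$'
}

printf '%s\n' article.cls notes.txt custom.sty sty.bak | match_sty_cls
```

Single quotes are used around the pattern so the shell never tries to interpret the $ or the backslash.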