Extract numbers from a string using sed and regular expressions

Another question for the sed experts.
I have a string representing a pathname that will have two numbers in it. An example is:
./pentaray_run2/Trace_220560.dat
I need to extract the second of these numbers, i.e. 220560.
I have (with some help from the forums) been able to extract all the digits together (i.e. 2220560) with:
sed "s/[^0-9]//g"
or extract only the first number with:
sed -r 's|^([^.]+).*$|\1|; s|^[^0-9]*([0-9]+).*$|\1|'
But what I'm after is the second number!! Any help much appreciated.
PS the number I'm after is always the second number in the string.

is this ok?
sed -r 's/.*_([0-9]*)\..*/\1/g'
with your example:
kent$ echo "./pentaray_run2/Trace_220560.dat"|sed -r 's/.*_([0-9]*)\..*/\1/g'
220560

You can extract the last numbers with this:
sed -e 's/.*[^0-9]\([0-9]\+\)[^0-9]*$/\1/'
It is easier to think about this backwards:
1. From the end of the string, match zero or more non-digit characters.
2. Match (and capture) one or more digit characters.
3. Match one non-digit character.
4. Match all the characters to the start of the string.
Part 3 of the match is where the "magic" happens, but it also requires a non-digit before the number (i.e. you can't match a string whose only number is at the very start, although there is a simple workaround: insert a non-digit at the start of the string).
The magic is to counteract the left-to-right greediness of the .* (part 4). Without part 3, part 4 would consume all it can, which includes all but the last digit of the number. With part 3, .* has to stop early enough to leave a non-digit followed by the digits for parts 3 and 2, which allows the whole number to be captured.
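The limitation and its workaround are easy to check from the shell. This is only a minimal sketch, assuming GNU sed (needed for \+); the function name last_number is made up for illustration and is not part of the answer.

```shell
# Hypothetical helper: print the last run of digits in a string (GNU sed).
last_number() {
  printf '%s\n' "$1" | sed -e 's/.*[^0-9]\([0-9]\+\)[^0-9]*$/\1/'
}

last_number './pentaray_run2/Trace_220560.dat'   # prints 220560

# No non-digit precedes the only number, so the pattern does not match
# and the input comes back unchanged.
last_number '220560.dat'                         # prints 220560.dat

# Workaround: put a non-digit (here a space) in front of the string.
last_number ' 220560.dat'                        # prints 220560
```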

If grep is welcome :
$ echo './pentaray_run2/Trace_220560.dat' | grep -oP '\d+\D+\K\d+'
220560
And more portable with Perl with the same regex :
echo './pentaray_run2/Trace_220560.dat' | perl -lne 'print $& if /\d+\D+\K\d+/'
220560
I think the approach is cleaner & more robust than using sed

This might work for you (GNU sed):
sed -r 's/([^0-9]*([0-9]*)){2}.*/\2/' file
This extracts the second number. Changing the repetition count to 1 extracts the first:
sed -r 's/([^0-9]*([0-9]*)){1}.*/\2/' file

Related

replace last n parts after spliting on delimiter using sed or regex

I need to replace last 2 parts of the string separated by delimiter with empty space to clean up the name.
Example:
something-useful-a12356-78929
=>
something-useful
something-more-useful-v35f62-2728902
=>
something-more-useful
I tried the following:
echo "something-useful-12345-67890" | sed -re 's/(-([0-9])+)//g'
This works if my last 2 elements of delimiter are numbers only, but wouldn't work for the example above. I need to remove the last 2 parts after splitting it on "-"
I can only use sed or regex to solve this.

Does sed 's/\(-[^-]*\)\{2\}$//' file do what you want?

Use [^-] to match anything other than -. Use $ to match the end of the string. Match hyphen followed by non-hyphens twice at the end.
sed -r 's/(-[^-]+){2}$//'
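As a quick check against the examples from the question, here is a minimal sketch. The name strip_last_two is hypothetical, and -E is used as the spelling of -r that both GNU and BSD sed accept.

```shell
# Hypothetical helper: drop the last two hyphen-separated parts of a name.
strip_last_two() {
  printf '%s\n' "$1" | sed -E 's/(-[^-]+){2}$//'
}

strip_last_two 'something-useful-a12356-78929'         # prints something-useful
strip_last_two 'something-more-useful-v35f62-2728902'  # prints something-more-useful

# With fewer than two trailing parts the pattern does not match,
# so the name is left as it is.
strip_last_two 'only-one'                              # prints only-one
```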

This might work for you (GNU sed):
sed -re 's/-[^-]*//2g' file
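Removes every match of - followed by non-hyphen characters, from the second occurrence onward. Note that this keeps only the first two parts, so it gives the desired result only when the name to keep contains exactly one hyphen: something-more-useful-v35f62-2728902 becomes something-more.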

Why doesn't grep work in pattern with colon

I know a colon (:) should be literal, so I'm not clear why grep matches all lines. Here's a file called "test":
cat test
123|4444
4546|4444
666666|5678
7777777|7890675::1
I need to match the line with ::1. Of course, the real case is more complicated, so I can't simply search for "::1". I tried many iterations, like
grep -E '^[0-9]|[0-9]:' test
grep -E '^[0-9]|[0-9]::1' test
But they return all lines:
123|4444
4546|4444
666666|5678
7777777|7890675::1
I am expecting to match just the last line. Any idea why that is?
This is GNU/Linux bash.

The pipe needs to be escaped and you need to allow repeated digits:
grep -E '^[0-9]+\|[0-9]+:' test
Otherwise ^[0-9] is all that needs to match for a line to be retained by the grep.

Given:
$ echo "$txt"
123|4444
4546|4444
666666|5678
7777777|7890675::1
Use repetition (+ means 'one or more') and character classes:
$ echo "$txt" | grep -E '^[[:digit:]]+[|][[:digit:]]+[:]+'
7777777|7890675::1
Since | is a regex meta character, it has to be either escaped (\|) or in a character class.

There are two issues:
The regex [0-9] matches any single digit. Since you have multiple digits, you need to replace those parts with [0-9]+, which matches one or more digits. If you want to allow an empty sequence with no digits, replace the + with a *, which means “zero or more”.
The pipe character | means alternation in regex. What you provided will match either a digit at the start of the line, or a digit followed by a colon. Since every line has at least one of those, you match every line. To get a literal | character, you can use either [|] or \|; the second option is the more common style.
Applying both of these, you get ^[0-9]+\|[0-9]+::1.
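The difference between the two patterns can be seen side by side. This is a small sketch that keeps the sample lines in a shell variable named data (an illustrative choice) instead of the file "test".

```shell
data='123|4444
4546|4444
666666|5678
7777777|7890675::1'

# Unescaped | is alternation: "a digit at the start of the line" OR
# "a digit followed by ::1". Every line starts with a digit, so all four match.
printf '%s\n' "$data" | grep -cE '^[0-9]|[0-9]::1'     # prints 4

# Escaped \| is a literal pipe, and + allows runs of digits on both sides.
printf '%s\n' "$data" | grep -E '^[0-9]+\|[0-9]+::1'   # prints 7777777|7890675::1
```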

Another approach is to use a tool like awk that can process the fields of each line, and match lines where the 2nd field ends with "::1"
awk -F'|' '$2 ~ /::1$/' test

How to replace with one sed command first n letter to uppercase

I would like to convert the first n letters to uppercase with a single sed command.
Example: 'madrid' to 'MADrid' (n=3).
I know how to change the first letter to uppercase with this command:
sed -e "s/\b\(.\)/\U\1/g"
but I don't know how to change this command for my problem.
I tried to change
sed -e "s/\b\(.\)/\U\1/g"
to
sed -e "s/\b\(.\)/\U\3/g"
but this didn't work. I also googled and searched this site, but couldn't find an exact answer to my problem.
Thank you.

I infer from your use of \U that you're using GNU sed:
n=3
echo 'madrid' | sed -r 's/\<(.{'"$n"'})/\U\1/g' # -> 'MADrid'
I've omitted the unnecessary -e option
I have added -r to enable support for extended regular expressions, which have more familiar syntax and also offer more features.
I'm using a single-quoted sed script with a shell-variable value spliced in so as to avoid confusion between what the shell expands up front and what is interpreted by sed itself.
\< is used instead of \b because, unlike the latter, it only matches at the start of a word. Thanks, Casimir et Hippolyte.
The above replaces any 3 characters at the start of a word, however.
To limit it to at most $n letters:
sed -r 's/\<([[:alpha:]]{1,'"$n"'})/\U\1/g'
As for what you've tried:
The \3 in your attempt sed -e "s/\b\(.\)/\U\3/g" refers to the 3rd capture group (parenthesized subexpression, (...)) in the regex (which doesn't exist), it does not refer to 3 repetitions.
Instead, you have to make sure that your one and only capture group (which you can reference as \1 in the substitution) itself captures as many characters as desired - which is what the {<n>} quantifier is for; the related {<m>,<n>} construct matches a range of repetitions.
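To see how the two quantifiers differ in practice, here is a minimal sketch, assuming GNU sed (\U and \< are GNU extensions). The helper names upper_any and upper_letters are hypothetical; each takes the text and n as arguments.

```shell
# Uppercase exactly n characters (of any kind) at the start of each word.
upper_any() {
  printf '%s\n' "$1" | sed -r 's/\<(.{'"$2"'})/\U\1/g'
}

# Uppercase between 1 and n leading letters of each word.
upper_letters() {
  printf '%s\n' "$1" | sed -r 's/\<([[:alpha:]]{1,'"$2"'})/\U\1/g'
}

upper_any madrid 3       # prints MADrid
upper_any ab 3           # prints ab (fewer than 3 characters, so no match)
upper_letters ab 3       # prints AB
upper_letters madrid 3   # prints MADrid
```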

This might work for you (GNU sed):
sed -r 's/[a-z]/&\n/'"$n"';s/^([^\n]*)\n/\U\1/' file
Here $n is the number of letters to convert. Setting aside the question of word boundaries, this converts the first n letters in a-z, whether consecutive or not, to upper case (A-Z).
N.B. this is two sed commands not one!

Sed replace character after every number

I want to replace a character after every integer number with sed.
Example:
444 d
should go:
444,d
EDIT: The answer of stribizhev helped me out to find a solution with sed (GNU sed) 4.2.2
sed -r 's/([0-9]+)./\1,/g'
replaces an arbitrary character after a number with a comma. The only problem is that a number at the end of the line is also affected: with no character left after it, the . takes its last digit, which is replaced by a comma.

You can use capturing groups to do that:
sed 's/\([0-9]\+\)./\1,/g'
or (since with GNU sed you can avoid all the escaped parenthesis by using extended regular expressions)
sed -r 's/([0-9]+)./\1,/g'
The [0-9]+ pattern matches an integer (without decimals), even inside longer strings or words.
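The end-of-line problem mentioned in the question can be reproduced and, under one assumption, avoided: if the character after the number is never itself a digit, [^0-9] can stand in for the dot. The sketch below uses that assumption; the function names are hypothetical, and -E is the spelling of -r that both GNU and BSD sed accept.

```shell
# Original form: any character after the number becomes a comma.
comma_after_any() {
  printf '%s\n' "$1" | sed -E 's/([0-9]+)./\1,/g'
}

# Variant (an assumption, not from the answer): only a non-digit is replaced.
comma_after_nondigit() {
  printf '%s\n' "$1" | sed -E 's/([0-9]+)[^0-9]/\1,/g'
}

comma_after_any '444 d'          # prints 444,d
comma_after_any 'abc 444'        # prints abc 44,  (last digit eaten by the dot)
comma_after_nondigit '444 d'     # prints 444,d
comma_after_nondigit 'abc 444'   # prints abc 444
```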

grep or sed for word containing string

example file:
blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk
I want to grep out all the domains in the middle of each line. They all have a common part, a.site, which I can grep for, but I can't work out how to do it without returning the whole line.
Maybe sed or a regex is needed here, as a simple grep isn't enough?

You can do:
grep -o '[^ ]*a\.site[^ ]*' input
or
awk '{print $2}' input
or
sed -e 's/.* \([^ ]*a\.site[^ ]*\).*/\1/' input
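For reference, this is what the grep and awk variants print for the sample file. It is a small sketch that holds the sample lines in a shell variable named input (an illustrative choice) instead of a file.

```shell
input='blahblah 123.a.site.com some-junk
yoyoyoyo 456.a.site.com more-junk
hihohiho 123.a.site.org junk-in-the-trunk
lalalala 456.a.site.org monkey-junk'

# -o prints only the matched text: the run of non-space characters
# around "a.site" on each line.
printf '%s\n' "$input" | grep -o '[^ ]*a\.site[^ ]*'

# awk prints the second whitespace-separated field, which is the same
# text for this input.
printf '%s\n' "$input" | awk '{print $2}'

# Both commands print:
# 123.a.site.com
# 456.a.site.com
# 123.a.site.org
# 456.a.site.org
```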

Try this to find anything in that position
$ sed -r "s/.* ([0-9]*)\.(.*)\.(.*)/\2/g"
[0-9]* - matches zero or more digits.
.* - matches any character zero or more times.
\. - matches a literal dot.
() - captures the text matched by the enclosed expression, which can be used in the replacement as \1, \2, ... \9 (at most nine groups are available). \0 refers to the entire match.
