Replace first four characters of a string - regex

I would need to mask, with Regular Expression in Sed Editor, a string after 4 characters.
Example: 1234567890
Result: 1234XXXXXX
Can you help me?

In GNU sed:
echo "1234567890" | sed "s/./X/5g"

Following awk may help you here.
echo "1234567890" | awk '{for(i=5;i<=length($0);i++){val=val?val "X":"X"};print substr($0,1,4) val;val=""}'
Output will be as follows.
1234XXXXXX

$echo "1234567890"|awk '{for(i=5;i<=NF;i++)$i="X"}7' FS="" OFS=""
1234XXXXXX

With perl, you can put arbitrary perl code in the replacement text:
perl -pe 's/^(....)(.*)/$1 . "X" x length($2)/e' <<END
1
12
123
1234
12345
123456
1234567
12345678
123456789
1234567890
END
1
12
123
1234
1234X
1234XX
1234XXX
1234XXXX
1234XXXXX
1234XXXXXX

With sed
count=4
echo "1234567890" | \
sed ':A;s/\(\(.\)\{'"$count"'\}\)\(X*\)[^X]/\1\3X/;tA'
or
sed -E ':A;s/((.){'"$count"'})(X*)[^X]/\1\3X/;tA'

Related

How to extract a number out of a string preceded by zeroes

I got a string that looks like this SOMETHING00000076XYZ
How can I extract the number 76 out of the string using a shell script? Note that 76 is preceded by zeroes and followed by letters.
1st solution: If you are ok with awk could you please try following.
echo "SOMETHING00000076XYZ" | awk 'match($0,/0+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/0+/,"",val);print val;val=""}'
In case you want to save this into a variable use following.
variable="$(echo "SOMETHING00000076XYZ" | awk '{sub(/.*[^1-9]0+/,"");sub(/[a-zA-Z]+/,"")} 1')"
2nd solution: Adding 1 more awk solution here(keeping your sample in mind).
echo "SOMETHING00000076XYZ" | awk '{sub(/.*[^1-9]0+/,"");sub(/[a-zA-Z]+/,"")} 1'
Here is a sed option:
echo "SOMETHING00000076XYZ" | sed -r 's/[^0-9]*0*([0-9]+).*/\1/g';
76
Here is an explanation of the regex pattern used:
[^0-9]* match zero or more non digits
0* match zero or more 0's
([0-9]+) match AND capture any quantity of non zero digits
.* match the remainder of the string
Then, we just replace with \1, which is the first (and only) capture group.
echo 'SOMETHING00000076XYZ' | grep -o '[1-9][0-9]*'
Using gnu grep:
grep -oP '0+\K\d+' <<< 'SOMETHING00000076XYZ'
76
\K resets any matched information.
Here is another variant of awk:
awk -F '0+' 'match($2, /^[0-9]+/){print substr($2, 1, RLENGTH)}' <<< 'SOMETHING00000076XYZ'
76
You can try Perl as well
$ echo "SOMETHING00000076XYZ" | perl -ne ' /\D+0+(\d+)/ and print $1 '
76
$ a=$(echo "SOMETHING00000076XYZ" | perl -ne ' /\D+0+(\d+)/ and print $1 ')
$ echo $a
76
$
$ echo 'SOMETHING00000076XYZ' | awk '{sub(/^[^0-9]+/,""); print $0+0}'
76
You can use sed as
echo "SOMETHING00000076XYZ" | sed "s/[a-zA-Z]//g" | sed "s/^0*//"
The first step is for removing all letters
The second step is for removing leading zeroes

Extract substring from strings (in a particular format) from a file using bash or sed or awk

I have an input file, an example of which is shown below : (?U0 ?U2 ?U9 ?U11 ?U21) I want to extract all the numbers after ?U to an output file as: 0 2 9 11 21 Please help me in this regard, I am new to it.
Thanks
You could use grep but it produces output in each per line.
grep -oP '\?U\K\d+' file
or
$ echo '(?U0 ?U2 ?U9 ?U11 ?U21)' | grep -oP '\?U\K\d+' | paste -s -d " " -
0 2 9 11 21
Using sed you can do:
s='(?U0 ?U2 ?U9 ?U11 ?U21)'
sed 's/?U\([0-9]\+\)/\1/g; s/[()]//g' <<< "$s"
0 2 9 11 21
simple sed
echo "(?U0 ?U2 ?U9 ?U11 ?U21)" | sed 's/[()?U]//g'
output
0 2 9 11 21
deleting all unneeded characters, you can put in set [...] if needed another characters
or more universal
echo "(?U0 ?U2 ?U9 ?U11 ?U21)" | sed 's/[^0-9 ]*//g'
deleting all non-digit characters (and not space)
With grep:
out="$(grep -oP '(?<=\?U)\d+' filepath |tr -s '\n' ' ')"
out="${out% }"
echo "$out" >outfilepath

Using regex to extract a substring while excluding a certain phrase

Say for the string:
test.1234.mp4
I would like to extract the numbers
1234
without extracting the 4 in mp4
What would the regex be for this?
The numbers aren't always in the second position and can be in different positions and might not always be four digits. I would like to extract the number without extracting the 4 in mp4 essentially.
More examples:
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
Essentially only the numbers would be extracted. Hence, for the last example, 666 from e666 would not be extracte and only 123.
To extract I have been using
echo "example.123.mp4" | grep -o "REGEX"
Edit: test456 was meant to be test.456
The accepted answer will fail on "test.e666.123.mp4" (print 666).
This should work
$ cat | perl -ne '/\.(\d+)\./; print "$1\n"'
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
1234
456
111
123
Note that this will only print the first group of numbers, if we have test.123.456.mp4 only 123 will be printed.
The idea is to match a dot followed by numbers which we are interested in (parentheses to save the match), followed by another dot. This means that it will fail on 123.mp4.
To fix this you could have:
$ cat | perl -ne '/(^|\.)(\d+)\./; print "$2\n"'
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
781.test.mp4
1234
456
111
123
781
First match is either beginning of line (^) or a dot, followed by numbers and a dot. We use $2 here since $1 is either beginning of a line or a dot.
cut can make it:
$ echo "test.1234.mp4" | cut -d. -f2
1234
where
cut -d'.' -f2
delimiter 2nd field
If you provide more examples we can improve the output. With the current code you would extract any something in blablabla.something.blablabla.
Update: from your question update we can do this:
grep -o '\.[0-9]*\.' | sed 's/\.//g'
test:
$ echo "test.abc.1234.mp4
test456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4" | grep -o '\.[0-9]*\.' | sed 's/\.//g'
1234
111
123
grep -Po "(?<=\.)\d+(?=\.)"
echo "test.1234.mp4" | perl -lpe 's/[^.\d]+\d*//g;s/\D*(\d+).*/$1/'
or:
echo "1321.test.mp4" | perl -lpe 's/.*(?:^|\.)(\d+)\..*/$1/'
p is to print by default so that we don't need explicit print.
e says we have an expression, not a script file
l puts the newline
These will also work if you have a number at the first part of the name.
perl -F'\.' -lane 'print "$F[scalar(#F)-2]" if(/\d+\.mp4$/)' your_file
tested:
> perl -F'\.' -lane 'print "$F[scalar(#F)-2]" if(/\d+\.mp4$/)' temp
1234
111
123
$ cat file
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
$ sed 's/.*\.\([0-9][0-9]*\)\..*/\1/' file
1234
456
111
123

AWK regex convert 3 letter word beginning with 'a' to uppercase

I have my regex expression to find 3 letter words beginning with "a"...
\b[aA][a-z]{2}\b
(seems to work, according to this! check it out: http://rubular.com/r/Jil0E4WZnW)
Now I need to know how to take that result and replace the lowercase word with the three letter word in uppercase.
Thanks!
call toupper function in awk:
echo "Abc" | awk '{print toupper($0)}'
gets you:
ABC
You can make use of the uc($string); command of PERL.
You can do it with Sed like this:
echo 'Ass ass ant Ant' | sed -re 's/\ba[a-z]{2}\b/\U&/gI'
(with your example string)
Another way is to use tr:
echo "Abc" | tr 'a-z' 'A-Z'
This solution "cheats" because it uses a loop and sub instead of gsub, but it is in awk and it works.
echo "abc Ape baaa ab abcd ant" | awk '{for (i=1;i<=NF;i++) if (length($i)==3){sub(/[aA][a-z]{2}/,toupper($i),$i)};print}'
perl -pe '$_=~s/\b([aA][a-z]{2})\b/\U$1/g;' your_file
tested:
> echo "Abc ab Ab" | perl -pe '$_=~s/\b([aA][a-z]{2})\b/\U$1/g;'
ABC ab Ab
>
Taken from here
Here is the awk version:
awk '{for(i=1;i<=NF;i++)
if((length($i)==3) && $i~/[aA][a-zA-Z][a-zA-Z]/)
$i=toupper($i)
}1' your_file

How can I output only captured groups with sed?

Is there a way to tell sed to output only captured groups?
For example, given the input:
This is a sample 123 text and some 987 numbers
And pattern:
/([\d]+)/
Could I get only 123 and 987 output in the way formatted by back references?
The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want. This technique depends on knowing how many matches you're looking for. The grep command below works for an unspecified number of matches.
string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
This says:
don't default to printing each line (-n)
exclude zero or more non-digits
include one or more digits
exclude one or more non-digits
include one or more digits
exclude zero or more non-digits
print the substitution (p) (on one line)
In general, in sed you capture groups using parentheses and output what you capture using a back reference:
echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'
will output "bar". If you use -r (-E for OS X) for extended regex, you don't need to escape the parentheses:
echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'
There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:
echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'
outputs "a bar a".
If you have GNU grep:
echo "$string" | grep -Po '\d+'
It may also work in BSD, including OS X:
echo "$string" | grep -Eo '\d+'
These commands will match any number of digit sequences. The output will be on multiple lines.
or variations such as:
echo "$string" | grep -Po '(?<=\D )(\d+)'
The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
Sed has up to nine remembered patterns but you need to use escaped parentheses to remember portions of the regular expression.
See here for examples and more detail
you can use grep
grep -Eow "[0-9]+" file
run(s) of digits
This answer works with any count of digit groups. Example:
$ echo 'Num123that456are7899900contained0018166intext' \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Expanded answer.
Is there any way to tell sed to output only captured groups?
Yes. replace all text by the capture group:
$ echo 'Number 123 inside text' \
| sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'
123
s/[^0-9]* # several non-digits
\([0-9]\{1,\}\) # followed by one or more digits
[^0-9]* # and followed by more non-digits.
/\1/ # gets replaced only by the digits.
Or with extended syntax (less backquotes and allow the use of +):
$ echo 'Number 123 in text' \
| sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'
123
To avoid printing the original text when there is no number, use:
$ echo 'Number xxx in text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'
(-n) Do not print the input by default.
(/p) print only if a replacement was done.
And to match several numbers (and also print them):
$ echo 'N 123 in 456 text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'
123 456
That works for any count of digit runs:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Which is very similar to the grep command:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166
About \d
and pattern: /([\d]+)/
Sed does not recognize the '\d' (shortcut) syntax. The ascii equivalent used above [0-9] is not exactly equivalent. The only alternative solution is to use a character class: '[[:digit:]]`.
The selected answer use such "character classes" to build a solution:
$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
That solution only works for (exactly) two runs of digits.
Of course, as the answer is being executed inside the shell, we can define a couple of variables to make such answer shorter:
$ str='This is a sample 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"
But, as has been already explained, using a s/…/…/gp command is better:
$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987
That will cover both repeated runs of digits and writing a short(er) command.
Give up and use Perl
Since sed does not cut it, let's just throw the towel and use Perl, at least it is LSB while grep GNU extensions are not :-)
Print the entire matching part, no matching groups or lookbehind needed:
cat <<EOS | perl -lane 'print m/\d+/g'
a1 b2
a34 b56
EOS
Output:
12
3456
Single match per line, often structured data fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*/$1/g'
a1 b2
a34 b56
EOS
Output:
1
34
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/'
a1 b2
a34 b56
EOS
Multiple fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*?b(\d+).*/$1 $2/g'
a1 c0 b2 c0
a34 c0 b56 c0
EOS
Output:
1 2
34 56
Multiple matches per line, often unstructured data:
cat <<EOS | perl -lape 's/.*?a(\d+)|.*/$1 /g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
34 78
With lookbehind:
cat EOS<< | perl -lane 'print m/(?<=a)(\d+)/g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
3478
I believe the pattern given in the question was by way of example only, and the goal was to match any pattern.
If you have a sed with the GNU extension allowing insertion of a newline in the pattern space, one suggestion is:
> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers
These examples are with tcsh (yes, I know its the wrong shell) with CYGWIN. (Edit: For bash, remove set, and the spaces around =.)
Try
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
I got this under cygwin:
$ (echo "asdf"; \
echo "1234"; \
echo "asdf1234adsf1234asdf"; \
echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
1234
1234 1234
1 2 3 4 5 6 7 8 9
$
You need include whole line to print group, which you're doing at the second command but you don't need to group the first wildcard. This will work as well:
echo "/home/me/myfile-99" | sed -r 's/.*myfile-(.*)$/\1/'
It's not what the OP asked for (capturing groups) but you can extract the numbers using:
S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'
Gives the following:
123
987
I want to give a simpler example on "output only captured groups with sed"
I have /home/me/myfile-99 and wish to output the serial number of the file: 99
My first try, which didn't work was:
echo "/home/me/myfile-99" | sed -r 's/myfile-(.*)$/\1/'
# output: /home/me/99
To make this work, we need to capture the unwanted portion in capture group as well:
echo "/home/me/myfile-99" | sed -r 's/^(.*)myfile-(.*)$/\2/'
# output: 99
*) Note that sed doesn't have \d
You can use ripgrep, which also seems to be a sed replacement for simple substitutions, like this
rg '(\d+)' -or '$1'
where ripgrep uses -o or --only matching and -r or --replace to output only the first capture group with $1 (quoted to be avoid intepretation as a variable by the shell) two times due to two matches.