sed - capture a group and replace only one character

I have a question about a file containing a pattern like this:
1XYZ00
The result should be:
2XYZ00
I want to replace only the first digit with another digit, for example 9, without changing the rest of the pattern.
I need to match exactly this pattern (1XYZ00) and replace only its first digit.
The file can contain numbers in other patterns, and those must not be modified.
The OS is CentOS 7.
Here is what I have tested
sed -E 's/1{1}[A-Z]{3}[0-9]{2}/9{1}[A-Z]{3}[0-9]{2}/g' ~/input.txt > ~/output.txt
I also tried with capture group:
sed --debug -E -r 's/1\(1{1}[A-Z]{3}[0-9]{2}\)/9\1/g' ~/input.txt > ~/output.txt
sed's debug mode tells me that the pattern matches, but no replacement is made.
I think I am pretty close. Could anyone help, please?
Thanks a lot,

$ cat ip.txt
foo 1XYZ00
xyz 1 2 3
hi 3XYZ00
1XYZ0A
cool 3ABC23
$ # matches any digit followed by 3 uppercase letters and 2 digits
$ sed -E 's/[0-9]([A-Z]{3}[0-9]{2})/9\1/' ip.txt
foo 9XYZ00
xyz 1 2 3
hi 9XYZ00
1XYZ0A
cool 9ABC23
$ # matches digit '1' followed by 3 uppercase and 2 digit characters
$ sed -E 's/1([A-Z]{3}[0-9]{2})/9\1/' ip.txt
foo 9XYZ00
xyz 1 2 3
hi 3XYZ00
1XYZ0A
cool 3ABC23
Issues with OP's attempts:
1{1}[A-Z]{3}[0-9]{2} is the same as 1[A-Z]{3}[0-9]{2}
Using 9{1}[A-Z]{3}[0-9]{2} in the replacement section inserts those characters literally. They have no special meaning there.
s/1\(1{1}[A-Z]{3}[0-9]{2}\)/9\1/ does use a capture group, but () shouldn't be escaped when the -E option is active, and 1{1} shouldn't be part of the capture group
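The points above can be sketched as a small filter. This is only an illustration: the function name replace_leading_one is made up here, and the g flag is kept from the OP's attempt on the assumption that a line may contain several matching tokens.

```shell
# Hypothetical helper (not from the original answer): replace a leading 1
# with 9, but only where it is followed by three uppercase letters and
# two digits, e.g. 1XYZ00.
replace_leading_one() {
    # -E enables ERE, so ( ) group without backslashes; \1 restores the tail.
    sed -E 's/1([A-Z]{3}[0-9]{2})/9\1/g'
}

printf '%s\n' 'foo 1XYZ00' 'xyz 1 2 3' '1XYZ0A' | replace_leading_one
```

Only the first line changes (to foo 9XYZ00). The pattern is not anchored, so a longer token that merely contains 1XYZ00 would also be rewritten; add anchors or boundaries if that matters for your data.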

I'm not sure if this is enough for you, since it replaces the first 1 on every line regardless of what follows it:
sed -i 's/1/9/' input.txt

Related

Extract string between underscores and dot

I have strings like these:
/my/directory/file1_AAA_123_k.txt
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
So basically, the number of underscores is not fixed. I would like to extract the string between the first underscore and the dot. So the output should be something like this:
AAA_123_k
CCC
KK_45
I found this solution that works:
string='/my/directory/file1_AAA_123_k.txt'
tmp="${string%.*}"
echo $tmp | sed 's/^[^_:]*[_:]//'
But I am wondering if there is a more 'elegant' solution (e.g. 1 line code).

With bash version >= 3.0 and a regex:
[[ "$string" =~ _(.+)\. ]] && echo "${BASH_REMATCH[1]}"

You can use a single sed command like
sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p' <<< "$string"
Details:
^ - start of string
.* - any text
/ - a / char
[^_/]* - zero or more chars other than / and _
_ - a _ char
\([^/]*\) (POSIX BRE) / ([^/]*) (POSIX ERE, enabled with the -E option) - Group 1: any zero or more chars other than /
\. - a dot
[^./]* - zero or more chars other than . and /
$ - end of string.
With -n, default line output is suppressed and p only prints the result of successful substitution.
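As a rough sketch of how the ERE variant behaves on the sample paths, the command can be wrapped in a function. The name extract_middle is hypothetical, and printf with a pipe is used instead of the here-string so that the sketch does not depend on bash.

```shell
# Hypothetical helper: print the text between the first underscore of the
# file name and its last dot; prints nothing when the path does not match.
extract_middle() {
    printf '%s\n' "$1" | sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p'
}

extract_middle '/my/directory/file1_AAA_123_k.txt'   # AAA_123_k
extract_middle '/my/directory/file2_CCC.txt'         # CCC
extract_middle '/my/directory/file2_KK_45.txt'       # KK_45
```

Because of -n and p, a path whose file name has no underscore produces no output at all rather than being echoed back unchanged.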

With your shown samples, with GNU grep you could try following code.
grep -oP '.*?_\K([^.]*)' Input_file
Explanation: GNU grep's -o option prints only the matched text and -P enables PCRE. The regex .*?_\K([^.]*) yields the value between the first _ and the next . on the line.
Explanation of regex:
.*?_ ##Lazily match from the start of the line up to the first occurrence of _.
\K ##Discard everything matched so far, so that only the needed value is printed.
([^.]*) ##Match everything up to the first occurrence of a dot.

A simpler sed solution without any capturing group:
sed -E 's/^[^_]*_|\.[^.]*$//g' file
AAA_123_k
CCC
KK_45

If you need to process the file names one at a time (e.g., within a while read loop), you can perform two parameter expansions:
$ string='/my/directory/file1_AAA_123_k.txt.2'
$ tmp="${string#*_}"
$ tmp="${tmp%%.*}"
$ echo "${tmp}"
AAA_123_k
One idea to parse a list of file names at the same time:
$ cat file.list
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
$ sed -En 's/[^_]*_([^.]+).*/\1/p' file.list
AAA_123_k
CCC
KK_45
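The two parameter expansions can also be packaged as a function for use inside a loop. This is a sketch only: the function name is made up, and stripping the directory part first is an added assumption (it guards against underscores or dots in directory names, which the original expansions do not). The variable name is global as written.

```shell
# Hypothetical helper: text between the first underscore of the file name
# and the first dot after it, using only parameter expansion (no subprocess).
between_underscore_and_dot() {
    name=${1##*/}     # drop the directory part (an added assumption)
    name=${name#*_}   # drop everything through the first underscore
    printf '%s\n' "${name%%.*}"   # drop everything from the first dot on
}

between_underscore_and_dot '/my/directory/file1_AAA_123_k.txt.2'   # AAA_123_k
```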

Using sed
$ sed 's/[^_]*_//;s/\..*//' input_file
AAA_123_k
CCC
KK_45

This is easy, except that it includes the initial underscore:
ls | grep -o "_[^.]*"

Bash script to extract the 10 most common double-vowel words from a file

So I have tried to write a Bash script to extract the 10 most common double-vowels words from a file, like good, teeth, etc.
Here is what I have so far:
grep -E -o '[aeiou]{2}' $1|tr 'A-Z' 'a-z' |sort|uniq -c|sort -n | tail -10
I tried to use grep with the -E flag to find matches such as 'aa', 'ee', 'ii', etc., but it is not working at all:
all I get back is pairs like 'ai' and 'ea'. Can anyone help me figure out how to do this pattern match in a Bash script?

You can simply match any number of letters before or after a repeated vowel with this POSIX ERE regex, using GNU grep:
grep -oE '[[:alpha:]]*([aeiou])\1[[:alpha:]]*' words.txt
FreeBSD (non-GNU) grep does not support a backreference in the pattern, so you will have to list all possible vowel sequences:
grep -oE '[[:alpha:]]*(aa|ee|ii|oo|uu)[[:alpha:]]*' words.txt
Demo:
#!/bin/bash
s='Some good feed
Soot and weed'
grep -oE '[[:alpha:]]*([aeiou])\1[[:alpha:]]*' <<< "$s"
Details:
[[:alpha:]]* - zero or more letters
(aa|ee|ii|oo|uu) - one of the char sequences, aa, ee, ii, oo or uu (| is an alternation operator in a POSIX ERE regex)
([aeiou]) - Group 1: a vowel
\1 - the same vowel as in Group 1
[[:alpha:]]* - zero or more letters

Simple way to change your regex: replace [aeiou]{2} with aa|ee|ii|oo|uu. (This does not fix the issue of only finding the match rather than the full word.)

Building on Andrew's answer (re: matching double vowels):
$ cat words.txt
good food;foul make chicken,eek too brave
eye you yuu something:three food too tu too
$ grep -E -o '\<[[:alnum:]]*(aa|ee|ii|oo|uu)[[:alnum:]]*\>' words.txt
good
food
eek
too
yuu
three
food
too
too
The grep finds only words (\< and \> represent word boundaries) with letters and/or digits and containing a dual vowel, printing each word on a separate line.
Applying the rest of OP's counting/sorting logic:
$ grep -E -o '\<[[:alnum:]]*(aa|ee|ii|oo|uu)[[:alnum:]]*\>' words.txt | sort | uniq -c | sort -n
1 eek
1 good
1 three
1 yuu
2 food
3 too
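Putting the pieces together for the original "top 10" request might look like the sketch below. The function name and the optional second argument are assumptions made for illustration; lowercasing happens before grep (the OP did it afterwards) so that words such as Eek are matched, and only letters are treated as word characters.

```shell
# Hypothetical helper: print the n most frequent double-vowel words in a
# file as "count word" lines, most frequent first (n defaults to 10).
top_double_vowel_words() {
    file=$1
    n=${2:-10}
    tr 'A-Z' 'a-z' < "$file" |
        grep -oE '[[:alpha:]]*(aa|ee|ii|oo|uu)[[:alpha:]]*' |
        sort | uniq -c | sort -rn | head -n "$n" |
        awk '{ print $1, $2 }'   # drop the padding that uniq -c adds
}

# Usage: top_double_vowel_words words.txt 10
```

Using sort -rn with head puts the most frequent word first; the OP's sort -n with tail -10 selects the same words but lists them in ascending order.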

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?

You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
Demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
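For reuse, the same substitution can be wrapped in a function. This is a sketch; the name keep_first_hyphen_sed is made up, and printf with a pipe replaces the here-string so it also runs in plain sh.

```shell
# Hypothetical helper: keep the text between the last / and the second
# hyphen, using the BRE substitution shown above.
keep_first_hyphen_sed() {
    printf '%s\n' "$1" | sed 's,.*/\([^-]*-[^-]*\).*,\1,'
}

keep_first_hyphen_sed 'home/JOHNSMITH-4991-common-task-list'   # JOHNSMITH-4991
```

Since there is no -n/p pair here, an input that does not match (for example, one without any / or -) is printed unchanged.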

1st solution: With awk this is simple. Written and tested with your shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: Set the field separator to / or -, then print the 2nd field, a hyphen, and the 3rd field of the current line.
2nd solution: Using awk's match function.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep. The -o option prints only the matched text and the -P option enables PCRE (Perl-compatible regular expressions). The regex matches .*/, uses \K to discard that part of the match, and then [^-]*-[^-]* matches everything before the 2nd occurrence of - on the line.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'

Here is a simple alternative solution using cut with bash parameter expansion:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991

You could match up to the first occurrence of /, reset the match start with \K, and then repeat the character class one or more times on each side of a hyphen, so that at least one character is required before and after it.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using GNU grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If GNU awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991

Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
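The step-by-step expansions above can be condensed into one function, sketched below. The name is hypothetical, and unlike the walkthrough it takes the second field from the part after the last slash rather than from the full string, so a hyphen in a directory name does not interfere. The variables are global as written.

```shell
# Hypothetical helper: text between the last / and the second hyphen,
# built only from parameter expansions (no subprocess is spawned).
keep_first_hyphen_pe() {
    base=${1##*/}        # text after the last /
    first=${base%%-*}    # text before the first hyphen
    rest=${base#*-}      # text after the first hyphen
    printf '%s-%s\n' "$first" "${rest%%-*}"
}

keep_first_hyphen_pe 'home/dir/JOHNSMITH-4991-common-task-list'   # JOHNSMITH-4991
```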

Return last [0-9]\{6\} from a string with sed

I want to pass a long list of filenames in the form
something_0230232_long_5160mK.csv
something_0230232_long-025160mK.csv
simething_0230342_lingk425460mK.csv
to sed (or similar Linux shell tools) and always get the last run of digits before mK on each line.
This works if there are exactly 6 digits. How can I extend it to n digits?
echo "something_0230232_long_025160mK.csv" | sed -e "s/S.*\([0-9]\{6\}\)mK\.csv/\1/p"

Solution using GNU grep:
$ grep -Po '[0-9]+(?=mK)' file
5160
025160
425460
Explanation:
-o show only the part of the line that matches.
-P use perl regexp.
[0-9]+ # Match a string of digits (at least one)
(?=mK) # Followed by mK (positive lookahead)
And with sed (since you asked):
sed -E 's/.*[^0-9]([0-9]+)mK.*/\1/' file
-E use extended regexp (equivalent to -r, but more portable).
s/ # Substitution -
.* # Match as much as possible
[^0-9] # Then one character that's not a digit
([0-9]+) # Capture the last digit string
mK # Followed by the string mK
.* # Match everything left
/ # Replace with -
\1 # The captured digit string only
/ #
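A quick way to check the sed variant against the three sample names is to wrap it in a function. This is only a sketch, and the name digits_before_mk is invented for the example.

```shell
# Hypothetical helper: print the last run of digits directly before "mK".
digits_before_mk() {
    printf '%s\n' "$1" | sed -E 's/.*[^0-9]([0-9]+)mK.*/\1/'
}

digits_before_mk 'something_0230232_long_5160mK.csv'     # 5160
digits_before_mk 'something_0230232_long-025160mK.csv'   # 025160
digits_before_mk 'simething_0230342_lingk425460mK.csv'   # 425460
```

The pattern needs at least one non-digit character before the digit run, which holds for all of the sample names.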

You're on the right track with your sed command:
echo "something_0230232_long_025160mK.csv" |
sed -e 's/^.*[^0-9]\([0-9]\{1,\}\)mK\.csv/\1/'
Differences:
Replace S with ^. This matches at the start (there is no S in the data, so the original would never match).
Replace 6 with 1,. This means 'one or more digits' given the context (strictly, one or more repeats of the previous regex, but the previous regex was [0-9]).
Insert the [^0-9] to stop the .* from being too greedy. When the number of digits matched was fixed (\{6\}), the rigidity prevented the .* from being too greedy. When you have two flexible ranges, the first will be the longest possible. Without the [^0-9], you get a 0 printed for the sample string.
Drop the 'p' so the value is printed once. Alternatively, keep the p and add -n as an option.
Reminder to self: test before (or shortly after) you post.
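The effect of the [^0-9] guard is easy to see by running both patterns on the sample string. The two function names below are made up for this comparison.

```shell
# Hypothetical helpers contrasting the two patterns on one file name.
digits_greedy() {    # no [^0-9] guard: .* swallows all but the last digit
    printf '%s\n' "$1" | sed -e 's/^.*\([0-9]\{1,\}\)mK\.csv/\1/'
}
digits_guarded() {   # [^0-9] forces .* to stop just before the digit run
    printf '%s\n' "$1" | sed -e 's/^.*[^0-9]\([0-9]\{1,\}\)mK\.csv/\1/'
}

digits_greedy  'something_0230232_long_025160mK.csv'   # 0
digits_guarded 'something_0230232_long_025160mK.csv'   # 025160
```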

echo "something_0230232_long_025160mK.csv" | sed 's/^.*_//' | sed 's/mK.csv//'

Use grep to match a pattern in a line only once

I have this:
echo 12345 | grep -o '[[:digit:]]\{1,4\}'
Which gives this:
1234
5
I understand what's happening. How do I stop grep from trying to continue matching after one successful match?
How do I get only
1234

Do you want grep to stop matching, or do you only care about the first match? You could use head if the latter is true:
grep stuff | head -n 1
grep is a line-based utility, so the -m 1 flag tells it to stop after the first matching line; combined with head, that works well in practice.
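Applied to the question's own pattern, the head approach could be sketched like this. The function name first_match is hypothetical, and -m 1 is left out because it limits matching lines rather than matches within a line.

```shell
# Hypothetical helper: print only the first match of 1 to 4 digits.
first_match() {
    # grep -o prints every match on its own line; head keeps the first one.
    printf '%s\n' "$1" | grep -o '[[:digit:]]\{1,4\}' | head -n 1
}

first_match 12345   # 1234
```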

You need to do the grouping: \(...\) followed by the exact number of occurrence: \{<n>\} to do the job:
$ echo 12345 | grep -o '\([[:digit:]]\)\{4\}'
1234
Hope it helps. Cheers!!

Use sed instead of grep:
echo 12345 | sed -n '/^\([0-9]\{1,4\}\).*/s//\1/p'
This matches up to 4 digits at the beginning of the line, followed by anything, keeps just the digits, and prints them. The -n prevents lines from being printed otherwise. If the digit string might also appear mid-line, then you need a slightly more complex command.
Ideally you would use a sed with PCRE regular expressions, since a non-greedy match is what is really needed here; the following is a reasonable approximation.
Since you want the first string of up to 4 digits on the line, simply use sed to remove any non-digits and then print the digit string:
echo abc12345 | sed -n '/^[^0-9]*\([0-9]\{1,4\}\).*/s//\1/p'
This matches a string of non-digits followed by 1-4 digits followed by anything, keeps just the digits, and prints them.
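As a small sketch, the second sed command can be wrapped in a function to try it on a few inputs. The name first_digits is made up for the example.

```shell
# Hypothetical helper: print the first run of up to 4 digits on the line,
# or nothing when the line contains no digit.
first_digits() {
    # The empty regex in s//\1/ reuses the address pattern before it.
    printf '%s\n' "$1" | sed -n '/^[^0-9]*\([0-9]\{1,4\}\).*/s//\1/p'
}

first_digits abc12345   # 1234
first_digits 12345      # 1234
```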

If – as in your example – your numeric expression will appear at the beginning of the string you're starting with, you could just add a start-of-line anchor ^:
echo 12345 | grep -o '^\([[:digit:]]\)\{1,4\}'
Depending on which exact digits you want, an end-of-line anchor $ might help also.

grep manpage says on this topic (see chapter 'regular expressions'):
(…)
{n,}
The preceding item is matched n or more times.
{n,m}
The preceding item is matched at least n times, but not more than m times.
(…)
So the answer should be:
echo 12345 | grep -o '[[:digit:]]\{4\}'
I just tested it on cygwin terminal (2018) and it worked!

We Keep Coding
