Find and replace regular expression with alternate format

I have a file that has lines that contain text like this
something,12:3456789,somethingelse
foobar,12:345678,somethingdifferent
For lines where the second item in the line has 6 digits after the :, I would like to alter its format by adding a 0 at the front and shifting the : one place to the left. For example, the above would change to:
something,12:3456789,somethingelse
foobar,01:2345678,somethingdifferent
I can't figure out how to do this using sed or any unix command line tool

You just need to match the middle section where you have 2 digits followed by : followed by exactly 6 digits. If you capture the text in individual groups appropriately you can move them around in your result. Note the \b word boundary at the end of the pattern is to ensure that we match on exactly 6 digits and don't match on lines which have the full 7 digits:
/\b(\d)(\d):(\d{6})\b/0\1:\2\3/
 |__________________| |______|
       pattern       replacement
This gives the expected output. You can experiment with it online here
sed doesn't have Perl style specifiers such as \d. Instead, you will need to use [[:digit:]]. Here is the updated regex that works with sed
sed -E 's/\b([[:digit:]])([[:digit:]]):([[:digit:]]{6})\b/0\1:\2\3/g' myfile.txt
As @Jonathan Leffler pointed out, \b doesn't work with the sed that ships on a Mac, so there you will instead need to match the commas before and after the field in your pattern and then put them back in the replacement.
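A minimal sketch of that comma-anchored variant, assuming the field to rewrite always sits between two commas as in the sample lines. The function name is made up for illustration. It uses plain [0-9] ranges and no \b, so it should not depend on GNU-specific word boundaries:

```shell
# Hypothetical helper: rewrite ",NN:DDDDDD," (exactly 6 digits) as ",0N:NDDDDDD,".
# The commas are matched explicitly and written back in the replacement,
# so a 7-digit field such as 12:3456789 is left untouched.
reformat_field() {
  sed -E 's/,([0-9])([0-9]):([0-9]{6}),/,0\1:\2\3,/'
}

printf '%s\n' \
  'something,12:3456789,somethingelse' \
  'foobar,12:345678,somethingdifferent' | reformat_field
```

The g flag is left out on purpose: neighbouring fields share a comma, so a global replace would skip every second adjacent match, and the question only asks about the second field.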

Related

Regular expression to match string in line between single ":" field delimiters and exclude them, when the string also contains "::" field delimiters

Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing this input string is contained in a text file called file.txt, however the actual use case will be to parse /proc/cmdline, and I will need a solution that starts parsing, counting fields, and matching after encountering "ip=" until the next white space character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user@ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user@ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.

Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookbehind, assert not : to the left, and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :
See a regex demo.

Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
View Demo here
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0

If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by:
"What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters."
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.

Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4
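For the /proc/cmdline use case mentioned in the question, a rough sketch could first isolate the ip= parameter and then take the 4th colon-separated field. The sample command line and the function name below are made up for illustration; on a real system the input would come from /proc/cmdline instead.

```shell
# Hypothetical helper: read a kernel command line on stdin and print the
# 4th colon-separated field of the ip= parameter (the netmask in the sample).
netmask_from_cmdline() {
  tr ' ' '\n' |             # put each space-separated parameter on its own line
    sed -n 's/^ip=//p' |    # keep only the ip= value, without the prefix
    cut -d: -f4             # empty fields still count, so "::" is harmless
}

# Made-up sample input; on a real host: netmask_from_cmdline < /proc/cmdline
printf '%s\n' 'BOOT_IMAGE=/vmlinuz ro ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off quiet' |
  netmask_from_cmdline
```

This relies on the assumption that the netmask is always the 4th field of the ip= value, as in the sample string.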

adding string after regular expression

I have files with lines like this:
123.45 234.56 A foo
bar boo
As part of a bash shell script I want to replace spaces following a
number with another string (XYZ, let's say). I can do this which
replaces all spaces (no good)
sed 's/ /XYZ/g' foo.txt
Or I can do this which replaces the right spaces but also gets rid of
the last digit (also no good)
perl -pe 's/\d /XYZ/g' foo.txt
How can I achieve the effect I'm after?

Judging by your attempts, you need to replace a single space after a digit.
It is enough to use the following expression with sed:
sed 's/\([0-9]\) /\1XYZ/g'
See the online demo.
The \([0-9]\) is a capturing group matching a digit and storing it in a memory buffer, and a space is just matched. The replacement pattern contains the backreference to the value stored inside Group 1 buffer, so the digit is not lost, but restored in the result.
Note that if you need to replace all consecutive spaces after a digit with XYZ, you may use (two spaces before the *, so that at least one space is required)
sed 's/\([0-9]\)  */\1XYZ/g'
or, with GNU sed,
sed 's/\([0-9]\) \+/\1XYZ/g'
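As a quick check of the single-space version against the sample lines from the question, here is a small sketch (the function name is only illustrative):

```shell
# Hypothetical helper: replace the single space that follows a digit with XYZ.
# The digit is captured in group 1 and written back with \1, so it is not lost.
space_after_digit_to_xyz() {
  sed 's/\([0-9]\) /\1XYZ/g'
}

printf '%s\n' '123.45 234.56 A foo' 'bar boo' | space_after_digit_to_xyz
```

The first line should come out as 123.45XYZ234.56XYZA foo, while bar boo is unchanged because none of its spaces follows a digit.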

How does this sed command parse numbers with commas?

I'm having difficulty understanding a number-parsing sed command I saw in this article:
sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
I'm a sed newbie, so this is what I've been able to figure out:
& adds to what's already there rather than substitutes
the :a; ... ;ta calls the substitution recursively on the line until the search finds no more returns
Here's what I am hoping folks can explain
What does -i do? I can't seem to find it on the man pages though I'm sure it's there.
I'm a little fuzzy on what the \B is accomplishing here? Perhaps it helps with the left-right parsing priority, but I don't see how. So lastly...
Most importantly, why does this execute right to left instead of left to right? For example, which part of the command keeps this from doing something like: 1234566778,9 ---> 1234,566,778,9

Breaking down this command:
sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
-i # in-place editing: save the changes back to the input file
\B # opposite of \b: matches only where there is no word boundary, i.e. inside a word
[0-9] # match any digit
\{3\} # match exactly 3 of them
\> # end-of-word boundary
& # the whole matched text, reused in the replacement
:a # define label a
ta # branch back to label a if the last substitution succeeded, i.e. loop until \B[0-9]\{3\}\> no longer matches
Yes, this sed command starts matching and replacing at the rightmost 3 digits and keeps working leftwards until no group of 3 digits with another digit before it remains.
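Update: However, rather than running this inefficient sed command in a loop, I recommend this much simpler and faster awk instead (the ' flag in the format groups digits according to the current locale, so it needs a locale that defines a thousands separator):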
awk '/^[0-9]+$/{printf "%\047.f\n", $1}' file
20,130,607,215,015
607,220,701
992,171
Where input file is:
cat file
20130607215015
607220701
992171

The regex engine looks for the leftmost match, but the only place where three digits are NOT preceded by a word boundary (\B) and are followed by an end-of-word boundary (\>) is the end of the number, i.e. the rightmost three digits. After inserting the comma, the "goto" makes it match again, but the comma has introduced a new word boundary, so the next match ends just before the comma, one group further to the left.
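The right-to-left behaviour can be observed by running the substitution once and then in the loop. This is only a sketch; it assumes GNU sed, since \B and \> are GNU extensions, and the function name is made up:

```shell
# Assumes GNU sed (\B and \> are GNU extensions).
# One pass: only the last three digits are followed by an end of word,
# so the single comma lands before them.
printf '%s\n' '1234567' | sed 's/\B[0-9]\{3\}\>/,&/'

# Hypothetical helper: repeat the substitution until it no longer matches.
# Each inserted comma creates a new end-of-word one group further left.
add_commas() {
  sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
}

printf '%s\n' '1234567' | add_commas
```

With GNU sed the first command should print 1234,567 and the second 1,234,567. The loop stops there because the remaining 1 has no digit in front of it, so \B can no longer be satisfied before a group of three.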

How do I write a SED regex to extract a string delimited by another string?

I am using GNU sed version 4.2.1 and I am trying to write a non-greedy SED regex to extract a string that is delimited by two other strings. This is easy when the delimiting strings are single-character:
s:{\([^}]*\)}:\1:g
In that example the string is delimited by '{' on the left and '}' on the right.
If the delimiting strings are multiple characters, say '{{{' and '}}}' I can adjust the above expression like this:
s:{{{\([^}}}]*\)}}}:\1:g
so the centre expression matches anything not containing the '}}}' closing string. But this only works if the match string does not contain '}' at all. Something like:
{{{cannot match {this broken} example}}}
will not work but
{{{can match this example}}}
does work. Of course
s:{{{\(.*\)}}}:\1:g
always works but is greedy so isn't suitable where multiple patterns occur on the same line.
I understand [^a] to mean anything except a and [^ab] to mean anything except a or b so, despite it appearing to work, I don't think [^}}}] is the correct way to exclude that sequence of 3 consecutive characters.
So how do I write a regex for SED that matches a string that is delimited by two other strings?

You are correct that [^}}}] doesn't work. A negated character class matches anything that is not one of the characters inside it. Repeating characters doesn't change the logic. So what you wrote is the same as [^}]. (It is easy to see why this works when there are no braces inside the expression).
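A small sketch to see that equivalence in practice: it runs both bracket expressions over the two sample lines from the question. The variable names are only illustrative.

```shell
# The question's expression and the same expression with a single } in the class.
# Both use ':' as the s delimiter, as in the question.
with_triple='s:{{{\([^}}}]*\)}}}:\1:g'
with_single='s:{{{\([^}]*\)}}}:\1:g'

for line in '{{{can match this example}}}' '{{{cannot match {this broken} example}}}'; do
  printf '%s\n' "$line" | sed "$with_triple"
  printf '%s\n' "$line" | sed "$with_single"
done
```

Each input should produce the same output twice: the first line loses its delimiters under both scripts, and the second is left unchanged by both, because the inner } stops the bracket expression before the closing }}} is reached.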
In Perl and compatible regular expressions, you can use ? to make a * or + non-greedy:
s:{{{(.*?)}}}:$1:g
This will always match the first }}} after the opening {{{.
However, this is not possible in Sed. In fact, I don't think there is any way in Sed of doing this match. The only other way to do this is to use advanced features like look-ahead, which Sed also does not have.
You can easily use Perl in a sed-like fashion with the -pe options, which cause it to take a single line of code from the command line (-e) and automatically loop over each line and print the result (-p).
perl -pe 's:{{{(.*?)}}}:$1:g'
The -i option for in-place editing of files is also useful, but make sure your regex is correct first!
For more information see perlrun.

With sed you could do something like:
sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/ ; ta'
With:
{{{can match this example}}} {{{can match this 2nd example}}}
This gives:
can match this example can match this 2nd example
It is not lazy matching, but by replacing from right to left we can make use of sed's greediness.
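To illustrate, here is a sketch that wraps the loop in a function (the name is made up) and feeds it a line that also contains a single-brace pair inside one of the delimited strings. The branch is passed as a separate -e, which should behave the same as the one-liner above:

```shell
# Hypothetical helper: strip {{{ and }}} pairs, working from the right.
# The leading \(.*\) is greedy, so it pushes the match to the LAST {{{ on the
# line; the t command then loops until no {{{...}}} pair is left.
strip_triple_braces() {
  sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/' -e ta
}

printf '%s\n' '{{{first}}} and {{{second {nested} one}}}' | strip_triple_braces
```

This should print first and second {nested} one: the rightmost pair is removed on the first pass and the leftmost on the second.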

While replacing using regex, How to keep a part of matched string?

I have
12.hello.mp3
21.true.mp3
35.good.mp3
.
.
.
and so on, as file names listed in a text file.
I need to replace only the dot (.) that directly follows the number with a space (e.g. 12.hello.mp3 => 12 hello.mp3).
If I use the regex "[0-9].", it replaces the digit as well.
Please help me.

Replace
^(\d+)\.(.*mp3)$
with
\1 \2
Also, in recent versions of notepad++, it will also accept the following, which is also accepted by other IDEs/editors (eg. JetBrains products like Intellij IDEA):
$1 $2
This assumes that the notepad++ regex matching engine supports groups. What the regex basically means is: match the digits in front of the first dot as group 1 and everything after it as group 2 (but only if it ends with mp3)
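If the list of file names lives in a text file, the same find/replace pair can be tried on the command line. This is only a sketch of a sed equivalent, not something from the answer above: [0-9] stands in for \d, and the function name is made up.

```shell
# Hypothetical helper: command-line equivalent of the editor find/replace above.
# Group 1 holds the leading digits, group 2 everything after the first dot
# (only when the name ends in mp3); the dot between them becomes a space.
number_dot_to_space() {
  sed -E 's/^([0-9]+)\.(.*mp3)$/\1 \2/'
}

printf '%s\n' '12.hello.mp3' '21.true.mp3' '35.good.mp3' | number_dot_to_space
```

The three sample names should come out as 12 hello.mp3, 21 true.mp3 and 35 good.mp3, with the dot before mp3 preserved.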

I tested this with VS Code. You must use a capturing group (parentheses) in the regex.
Practical example
Start with this sample data:
1 a text
2 another text
3 yet more text
Search with a regex that finds a digit followed by whitespace. The group here is the digit, because it is surrounded by parentheses:
(\d)\s
Then run the replace, which swaps the whitespace for a dash but keeps the digit on each line:
$1-
Outputs
1-a text
2-another text
3-yet more text

Using the basic pattern, well described in the accepted answer here is an example to add the class="odd" and class="even" to every <tr> element in Notepad++ or any other regex compatible editor:
Find what: (<tr><td>)(.*?\r\n)(<tr><td>)(.*?\r\n)
Replace with: <tr class="odd"><td>\2<tr class="even"><td>\4
