sed replacement dependent on surroundings - regex

How do I do a sed replacement on a file containing things like
f[3][2,3][4] which should be replaced by f[3][2,3]["string"][4]. Of course the numbers could be different numbers and I want the same replacement to happen. Here we need to give full pattern, something like
f\[[0-9]*\]\[[0-9]*,[0-9]*\]
to pick out the right part but then I need the replacement to give the same numbers that where matched earlier (as well as attaching ["string"]). What is the best way to do this?

You may use
sed 's/\(f\[[0-9]*]\[[0-9]*,[0-9]*]\)\[\([0-9]*\)]/\1["string"][\2]/g'
Or
sed -E 's/(f\[[0-9]*]\[[0-9]*,[0-9]*])\[([0-9]*)]/\1["string"][\2]/g'
See the online sed demo
Basically, capture the first part from f to the third index, and capture the number in the third index, too. Then, reference these values with \1 and \2 placeholders from the RHS.
Details
(f\[[0-9]*]\[[0-9]*,[0-9]*]) (ERE, in BRE, the (...) must be escaped) - Group 1 matching f[, 0+ digits, ][, 0+ digits, , and 0+ digits followed with ]
\[ - a [
([0-9]*) - Group 2: 0+ digits
] - a ] char.

Related

How to delete two groups of characters with regex?

I have this type of string:
First part: [[archive 726|The Archive]] is a great start
And I want to print:
First part: The Archive is a great start
Here is what I've come to far:
input.gsub!(/\[\[(.*?)\|/,"")
print input
> "First part: The Archive]] is a great start"
How can I also match the ]]?
You may use
input.gsub!(/\[\[[^\]\[]*\|(.*?)\]\]/, '\1')
See the Rubular demo and a Ruby demo.
Details
\[\[ - a [[ substring
[^\]\[]* - any 0 or more chars other than [ and ], as many as possible (if there are multiple | chars inside [[...]], replace * with *? to match as few as possible)
\| - a | char
(.*?) - Group 1 (the group value is referred to with \1 from the replacement pattern, mind the single quotes around \1): any 0 or more chars other than line break chars, as few as possible
\]\] - a ]] substring.

Using regex replacement in Sublime 3

I am trying to use replace in Sublime using regular expressions but I'm stuck. I tried various combinations but don't seem to be getting there.
This is the input and my desired output:
Input: N_BBP_c_46137_n
Output : BBP
I tried combinations of:
[^BBP]+\b
\*BBP*+\g
But none of the above (and many others) don't seem to work.
To turn N_BBP_c_46137_n into BBP and according to the comment just want that entire long name such as N_BBP_ to be replaced by only BBP* you might also use a capture group to keep BBP.
\bN_(BBP)_\S*
\bN_ Match N preceded by a word boundary
(BBP) Capture group 1, match BBP (or use [A-Z]+ to match 1+ uppercase chars)
_\S* Match _ followed by 0+ times a non whitespace char
In the replacement use the first capturing group $1
Regex demo
You may use
(N_)[^_]*(_c_\d+_n)
Replace with ${1}some new value$2.
Details
(N_) - Group 1 ($1 or ${1} if the next char is a digit): N_
[^_]* - any 0 or more chars other than _
-(_c_\d+_n) - Group 2 ($2): _c_, 1 or more digits and then _n.
See the regex demo.

last year occurrence from string

I have strings like this:
ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar
I'm trying to get the last occurrence of a single year (from 1900 to 2050), so I need to extract only 1934 from that string.
I'm trying with:
grep -P -o '\s(19|20)[0-9]{2}\s(?!\s(19|20)[0-9]{2}\s)'
or
grep -P -o '((19|20)[0-9]{2})(?!\s\1\s)'
But it matches: 1910 and 1934
Here's the Regex101 example:
https://regex101.com/r/UetMl0/3
https://regex101.com/r/UetMl0/4
Plus: how can I extract the year without the surrounding spaces without doing an extra grep to filter them?
Have you ever heard this saying:
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
Keep it simple - you're interested in finding a number between 2 numbers so just use a numeric comparison, not a regexp:
$ awk -v min=1900 -v max=2050 '{yr=""; for (i=1;i<=NF;i++) if ( ($i ~ /^[0-9]{4}$/) && ($i >= min) && ($i <= max) ) yr=$i; print yr}' file
1934
You didn't say what to do if no date within your range is present so the above outputs a blank line if that happens but is easily tweaked to do anything else.
To change the above script to find the first instead of the last date is trivial (move the print inside the if), to use different start or end dates in your range is trivial (change the min and/or max values), etc., etc. which is a strong indication that this is the right approach. Try changing any of those requirements with a regexp-based solution.
I don't see a way to do this with grep because it doesn't let you output just one of the capture groups, only the whole match.
Wit perl I'd do something like
perl -lpe 'if (/^.*\b(19\d\d|20(?:0-4\d|50))\b/) { print $1 }'
Idea: Use ^.* (greedy) to consume as much of the string up front as possible, thus finding the last possible match. Use \b (word boundary) around the matched number to prevent matching 01900 or X1911D. Only print the first capture group ($1).
I tried to implement your requirement of 1900-2050; if that's too complicated, ((?:19|20)\d\d) will do (but also match e.g. 2099).
The regex to do your task using grep can be as follows:
\b(?:19\d{2}|20[0-4]\d|2050)\b(?!.*\b(?:19\d{2}|20[0-4]\d|2050)\b)
Details:
\b - Word boundary.
(?: - Start of a non-capturing group, needed as a container for
alternatives.
19\d{2}| - The first alternative (1900 - 1999).
20[0-4]\d| - The second alternative (2000 - 2049).
2050 - The third alternative, just 2050.
) - End of the non-capturing group.
\b - Word boundary.
(?! - Negative lookahead for:
.* - A sequence of any chars, meaning actually "what follows
can occur anywhere further".
\b(?:19\d{2}|20[0-4]\d|2050)\b - The same expression as before.
) - End of the negative lookahead.
The word boundary anchors provide that you will not match numbers - parts
of longer words, e.g. X1911D.
The negative lookahead provides that you will match just the last
occurrence of the required year.
If you can use other tool than grep, supporting call to a previous
numbered group (?n), where n is the number of another capturing
group, the regex can be a bit simpler:
(\b(?:19\d{2}|20[0-4]\d|2050)\b)(?!.*(?1))
Details:
(\b(?:19\d{2}|20[0-4]\d|2050)\b) - The regex like before, but
enclosed within a capturing group (it will be "called" later).
(?!.*(?1)) - Negative lookahead for capturing group No 1,
located anywhere further.
This way you avoid writing the same expression again.
For a working example in regex101 see https://regex101.com/r/fvVnZl/1
You may use a PCRE regex without any groups to only return the last occurrence of a pattern you need if you prepend the pattern with ^.*\K, or, in your case, since you expect a whitespace boundary, ^(?:.*\s)?\K:
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' file
See the regex demo.
Details
^ - start of line
(?:.*\s)? - an optional non-capturing group matching 1 or 0 occurrences of
.* - any 0+ chars other than line break chars, as many as possible
\s - a whitespace char
\K - match reset operator discarding the text matched so far
(?:19\d{2}|20(?:[0-4]\d|50)) - 19 and any two digits or 20 followed with either a digit from 0 to 4 and then any digit (00 to 49) or 50.
(?!\S) - a whitespace or end of string.
See an online demo:
s="ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' <<< "$s"
# => 1934

Looking for regex to match before and after a number

Given the string
170905-CBM-238.pdf
I'm trying to match 170905-CBM and .pdf so that I can replace/remove them and be left with 238.
I've searched and found pieces that work but can't put it all together.
This-> (.*-) will match the first section and
This-> (.[^/.]+$) will match the last section
But I can't figure out how to tie them together so that it matches everything before, including the second dash and everything after, including the period (or the extension) but does not match the numbers between.
help :) and thank you for your kind consideration.
There are several options to achieve what you need in Nintex.
If you use Extract operation, use (?<=^.*-)\d+(?=\.[^.]*$) as Pattern.
See the regex demo.
Details
(?<=^.*-) - a positive lookbehind requiring, immediately to the left of the current location, the start of string (^), then any 0+ chars other than LF as many as possible up to the last occurrence of - and the subsequent subpatterns
\d+ - 1 or more digits
(?=\.[^.]*$) - a positive lookahead requiring, immediately to the right of the current location, the presence of a . and 0+ chars other than . up to the end of the string.
If you use Replace text operation, use
Pattern: ^.*-([0-9]+)\.[^.]+$
Replacement text: $1
See another regex demo (the Context tab shows the result of the replacement).
Details
^ - a start of string anchor
.* - any 0+ chars other than LF up to the last occurrence of the subsequent subpatterns...
- - a hyphen
([0-9]+) - Group 1: one or more ASCII digits
\. - a literal .
[^.]+ - 1 or more chars other than .
$ - end of string.
The replacement $1 references the value stored in Group 1.
I don't know ninetex regex, but a sed type regex:
$ echo "170905-CBM-238.pdf" | sed -E 's/^.*-([0-9]*)\.[^.]*$/\1/'
238
Same works in Perl:
$ echo "170905-CBM-238.pdf" | perl -pe 's/^.*-([0-9]*)\.[^.]*$/$1/'
238

R- regex extracting a string between a dash and a period

First of all I apologize if this question is too naive or has been repeated earlier. I tried to find it in the forum but I'm posting it as a question because I failed to find an answer.
I have a data frame with column names as follows;
head(rownames(u))
[1] "A17-R-Null-C-3.AT2G41240" "A18-R-Null-C-3.AT2G41240" "B19-R-Null-C-3.AT2G41240"
[4] "B20-R-Null-C-3.AT2G41240" "A21-R-Transgenic-C-3.AT2G41240" "A22-R-Transgenic-C-3.AT2G41240"
What I want is to use regex in R to extract the string in between the first dash and the last period.
Anticipated results are,
[1] "R-Null-C-3" "R-Null-C-3" "R-Null-C-3"
[4] "R-Null-C-3" "R-Transgenic-C-3" "R-Transgenic-C-3"
I tried following with no luck...
gsub("^[^-]*-|.+\\.","\\2", rownames(u))
gsub("^.+-","", rownames(u))
sub("^[^-]*.|\\..","", rownames(u))
Would someone be able to help me with this problem?
Thanks a lot in advance.
Shani.
Here is a solution to be used with gsub:
v <- c("A17-R-Null-C-3.AT2G41240", "A18-R-Null-C-3.AT2G41240", "B19-R-Null-C-3.AT2G41240", "B20-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240", "A22-R-Transgenic-C-3.AT2G41240")
gsub("^[^-]*-([^.]+).*", "\\1", v)
See IDEONE demo
The regex matches:
^[^-]* - zero or more characters other than -
- - a hyphen
([^.]+) - Group 1 matching and capturing one or more characters other than a dot
.* - any characters (even including a newline since perl=T is not used), any number of occurrences up to the end of the string.
This can easily be achieved with the following regex:
-([^.]+)
# look for a dash
# then match everything that is not a dot
# and save it to the first group
See a demo on regex101.com. Outputs are:
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Transgenic-C-3
R-Transgenic-C-3
Regex
-([^.]+)\\.
Description
- matches the character - literally
1st Capturing group ([^\\.]+)
[^\.]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
. matches the character . literally
\\. matches the character . literally
Debuggex Demo
Output
MATCH 1
1. [4-14] `R-Null-C-3`
MATCH 2
1. [29-39] `R-Null-C-3`
MATCH 3
1. [54-64] `R-Null-C-3`
MATCH 4
1. [85-95] `R-Null-C-3`
MATCH 5
1. [110-126] `R-Transgenic-C-3`
MATCH 6
1. [141-157] `R-Transgenic-C-3`
This seems an appropriate case for lookarounds:
library(stringr)
str_extract(v, '(?<=-).*(?=\\.)')
where
(?<= ... ) is a positive lookbehind, i.e. it looks for a - immediately before the next captured group;
.* is any character . repeated 0 or more times *;
(?= ... ) is a positive lookahead, i.e. it looks for a period (escaped as \\.) following what is actually captured.
I used stringr::str_extract above because it's more direct in terms of what you're trying to do. It is possible to do the same thing with sub (or gsub), but the regex has to be uglier:
sub('.*?(?<=-)(.*)(?=\\.).*', '\\1', v, perl = TRUE)
.*? looks for any character . from 0 to as few as possible times *? (lazy evaluation);
the lookbehind (?<=-) is the same as above;
now the part we want .* is put in a captured group (...), which we'll need later;
the lookahead (?=\\.) is the same;
.* captures any character, repeated 0 to as many as possible times (here the end of the string).
The replacement is \\1, which refers to the first captured group from the pattern regex.