Regular expression replace, back referencing replaced characters sublime text 3 - regex

I have a file with the following lines:
A 123
B 323
Each line starts with either A or B, and is followed by a blank and a number.
I am trying to convert this into
'C [a-z]*A 123'
for each line. I use a regex in Find and Replace. The regex [AB] [0-9]* selects all the lines without a problem. I'm trying to replace it with 'C [a-z]*$1', but $1 is not printed in the replaced string, and I get:
'C [a-z]*'
What am I missing?

Your regex - [AB] [0-9]* - has no round brackets, i.e. no capturing groups, which must be present if you wish to reference the captured subtexts later in the replacement string. That is why $1 expands to nothing and you do not get the expected result.
You can use
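(?m)^([AB][ ][0-9]{3})
Or, if the digits are optional, use the * quantifier, which matches 0 or more occurrences of the preceding subpattern:
(?m)^([AB][ ][0-9]*)
In both patterns the capturing group wraps the letter, the blank, and the digits, so that $1 reproduces the whole matched line.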
And replace with
'C [a-z]*$1'
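The effect of the capturing group can also be checked outside Sublime Text. Below is a minimal sketch using Python's re module, picked only for illustration. Python writes the backreference as \1 (or \g<1>) where Sublime Text uses $1, and the sketch assumes the group should cover the whole line so that the letter, the blank, and the number all reappear after the prefix, as in the desired output from the question.

```python
import re

text = "A 123\nB 323"

# Group 1 wraps the letter, the blank, and the digits,
# so \1 in the replacement re-inserts the whole matched line.
pattern = re.compile(r"^([AB] [0-9]*)", re.MULTILINE)
converted = pattern.sub(r"'C [a-z]*\1'", text)

print(converted)
```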

Related

How can I get the first and last part of one wordcombination using regex

How can I get only the middle part of a combined name with PCRE regex?
name: 211103_TV_storyname_TYPE
result: storyname
I have used this single line: .(\d)+.(_TV_) to remove the first part: 211103_TV_
Another idea is to use (_TYPE)$, but the problem is that not all variations of the names contain a space that would mark off a second word, so I cannot use ^ for the first word and $ for the second.
The _TYPE and _TV_ parts of the combined name are fixed.
The numbers are changing according to the date. And the storyname is variable.
Any ideas?
Thanks
With your shown samples, please try the following regex; it creates one capturing group that contains the matched value.
.*?_TV_([^_]*)(?=_TYPE)
Or (a small variation of the above solution, following fourth bird's nice suggestion), the following works without the lazy match .*? used above:
_TV_([^_]*)(?=_TYPE)
Explanation: Adding detailed explanation for above.
.*?_ ##Using Lazy match to match till 1st occurrence of _ here.
TV_ ##Matching TV_ here.
([^_]*) ##Creating 1st capturing group which has everything before next occurrence of _ here.
(?=_TYPE) ##Making sure previous values are followed by _TYPE here.
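As a rough check of the explanation above, here is a small sketch in Python's re module (an assumption for illustration; the question itself targets a PCRE-based tool). It uses the shorter variant without the leading .*? and reads the result from capturing group 1.

```python
import re

name = "211103_TV_storyname_TYPE"

# Group 1 captures everything between _TV_ and the next underscore;
# the lookahead checks that _TYPE follows without consuming it.
match = re.search(r"_TV_([^_]*)(?=_TYPE)", name)
story = match.group(1) if match else None

print(story)
```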
You could match as few chars as possible after _TV_ until you match _TYPE
\d_TV_\K.*?(?=_TYPE)
\d_TV_ Match a digit and _TV_
\K Forget what is matched until now
.*? Match as few characters as possible
(?=_TYPE) Assert _TYPE to the right
Another option without a non greedy quantifier, and leaving out the digit at the start:
_TV_\K[^_]*+(?>_(?!TYPE)[^_]*)*(?=_TYPE)
_TV_ Match literally
\K[^_]*+ Forget what is matched until now and optionally match any char except _
(?>_(?!TYPE)[^_]*)* Only allow matching _ when not directly followed by TYPE
(?=_TYPE) Assert _TYPE to the right
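Python's built-in re module does not support \K, so the sketch below swaps in a fixed-width lookbehind (?<=_TV_) to play the same role, and drops the possessive quantifier and atomic group for portability. The helper name middle_part and the sample name containing an extra underscore are hypothetical, added only to show that an underscore inside the story name is tolerated.

```python
import re

# (?<=_TV_) stands in for "_TV_\K": it requires _TV_ to the left without
# including it in the match. An underscore is allowed inside the match
# only when it is not the start of _TYPE.
pattern = re.compile(r"(?<=_TV_)[^_]*(?:_(?!TYPE)[^_]*)*(?=_TYPE)")

def middle_part(name):
    """Return the text between _TV_ and _TYPE, or None if there is no match."""
    m = pattern.search(name)
    return m.group(0) if m else None

print(middle_part("211103_TV_storyname_TYPE"))
print(middle_part("211103_TV_story_name_TYPE"))
```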
Edit
If you want to replace the 2 parts, you can use an alternation and replace with an empty string.
If it should be at the start and the end of the string, you can prepend ^ and append $ to the pattern.
\b\d{6}_TV_|_TYPE\b
\b\d{6}_TV_ A word boundary, match 6 digits and _TV_
| Or
_TYPE\b Match _TYPE followed by a word boundary
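The replace-with-empty-string idea can be sketched as follows, again using Python's re module as a stand-in for the PCRE-based tool from the question.

```python
import re

name = "211103_TV_storyname_TYPE"

# Either alternative is removed wherever it matches:
# the six-digit date plus _TV_ at the front, and _TYPE at the end.
stripped = re.sub(r"\b\d{6}_TV_|_TYPE\b", "", name)

print(stripped)
```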
Here I have added some additional screenshots to the post, together with the documentation that appears behind the help button, so you can see the forms and what I see.
Documentation
The regular expressions we use are based on PCRE - Perl Compatible Regular Expressions. Full specification can be found here: http://www.pcere.org and http://perldoc.perl.org/perlre.html
Summary of some useful terms:
Metacharacters
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
Quantifiers
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
Character Classes
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
Capture buffers
The bracketing construct (...) creates capture buffers. To refer to the current contents of a buffer later on, within the same pattern, use \1 for the first, \2 for the second, and so on. Outside the match use "$" instead of "\". The \ notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.
Referring back to another part of the match is called a backreference.
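The two uses of a capture buffer described here can be illustrated with a short sketch. It uses Python's re module as an assumption; the tool in the question is PCRE-based and writes $1 in the replacement, whereas Python writes \1 in both places. The sample sentence is made up.

```python
import re

sentence = "the the cat sat on on the mat"

# Inside the pattern, \1 must match the same text that group 1 captured,
# so this finds a word immediately repeated after a single space.
# In the replacement, \1 inserts the captured word once.
deduplicated = re.sub(r"\b(\w+) \1\b", r"\1", sentence)

print(deduplicated)
```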
Examples
Replace story with certain prefix letters M N or E to have the prefix "AA":
`srcPattern "(M|N|E ) ([A-Za-z0-9\s]*)"`
`trgPattern "AA$2" `
`"N StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"E StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"M StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
"NoMatchWord StoryWord1 StoryWord2" -> "NoMatchWord StoryWord1 StoryWord2" (no match found, name remains the same)

How to use Ruby gsub with regex to do partial string substitution

I have a pipe delimited file which has a line
H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||
I want to substitute the date (28092017), matched with the regex "[0-9]{8}", if the first character is "H".
I tried the following example to test my understanding, where I'm trying to substitute "a" with "i".
str = "|123||a|"
str.gsub /\|(.*?)\|(.*?)\|(.*?)\|/, "\|\\1\|\|\\1\|i\|"
But this gives the output
"|123||123|i|"
Any clue how this can be achieved?
You may replace the first occurrence of 8 digits inside pipes if a string starts with H using
s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"
p s.gsub(/\A(H.*?\|)[0-9]{8}(?=\|)/, '\100000000')
# or
p s.gsub(/\AH.*?\|\K[0-9]{8}(?=\|)/, '00000000')
Here, the value is replaced with 8 zeros.
Pattern details
\A - start of string (^ is the start of a line in Ruby)
(H.*?\|) - Capturing group 1 (you do not need it when using the variation with \K): H, then any 0+ chars other than line break chars, as few as possible, and then a |
\K - match reset operator that discards the text matched so far
[0-9]{8} - eight digits
(?=\|) - the next char must be |, but it is not added to the match value since it is a positive lookahead that does not consume text.
The \1 in the first gsub is a replacement backreference to the value in Group 1.
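For comparison, the capturing-group variant can be sketched in Python's re module (an assumption; the question is about Ruby). One difference worth noting: Python's replacement template needs \g<1> here, because a bare \1 followed directly by digits would not be read as group 1 followed by zeros. The \K variant is left out since Python's built-in re does not support \K.

```python
import re

s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"

# \A anchors at the start of the string, group 1 keeps everything up to
# the pipe in front of the first 8-digit field, and the lookahead requires
# a pipe right after those 8 digits.
pattern = re.compile(r"\A(H.*?\|)[0-9]{8}(?=\|)")
masked = pattern.sub(r"\g<1>00000000", s)

print(masked)
```

The 10-digit field 1010032000 is skipped because its first eight digits are not followed by a pipe.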

Match all non-numeric characters between two underscores

I am using a regular expression to extract all non-numeric characters between two underscores from a string.
JohnDoe_King234_sample
I need the following output from the string: King
I have tried the following regular expression: (?<=_).\D*(?=_)
(Look behind positively for _, then match non-numeric characters, then look ahead positively for _)
If my string is:
JohnDoe_King_sample
then my expression returns King. If my string is:
JohnDoe_King234_sample
then my expression does not match.
(?<=_).\D*(?=_)
Expected results: King
Actual results:
You may use
(?<=_)[^_\d]+(?=\d*_)
See the regex demo
Details
(?<=_) - a _ should be right before the current location
[^_\d]+ - any 1 or more chars other than _ and digits
(?=\d*_) - there must be 0 or more digits followed with one _ immediately to the right of the current location.
NOTE: In case you may have digits anywhere inside that substring between underscores, if you have a way to process the string with some programming language, you might consider a _([^_]+)_ regex to extract the first match, then grab Group 1 value and remove all digits from it using a simple \d+ pattern with a regex replace method/function.
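Both the lookaround pattern and the two-step approach from the note can be sketched in Python (chosen here as an assumption, since the question names no language). The function names and the third sample string, with digits in the middle of the name, are hypothetical.

```python
import re

def between_underscores(s):
    """Non-digit, non-underscore run right after _ and before optional digits plus _."""
    m = re.search(r"(?<=_)[^_\d]+(?=\d*_)", s)
    return m.group(0) if m else None

def between_underscores_any_digits(s):
    """Two-step variant: grab the text between underscores, then strip all digits."""
    m = re.search(r"_([^_]+)_", s)
    return re.sub(r"\d+", "", m.group(1)) if m else None

print(between_underscores("JohnDoe_King234_sample"))
print(between_underscores_any_digits("JohnDoe_Ki2ng34_sample"))
```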

Regular expression not working

I am using the regex (?<=^\d+\s*).*?\t to extract just the Resources\blahblah part from the following text:
10 _Resources\index.test FAIL
11 _Resources\index.test FAIL
12 Resources\index.test FAIL
13set\Relicensing Statement.test FAIL
but it captures the following text:
0 _Resources\index.test
1 _Resources\index.test
2 Resources\index.test
3set\Relicensing Statement.test
I just want the parts like Resources\index.test, without the leading numbers and spaces. Why is it failing? If I execute just ^\d+\s* it matches any number of digits and spaces, but it does not work as a lookbehind prefix.
Since you commented you were using Notepad++, how about matching ^\d+\s*([^\t]*).*$ and replacing by \1 ?
From NSRegularExpression (I saw it was tagged):
Look-behind assertion. True if the parenthesized pattern matches text
preceding the current input position, with the last character of the
match being the input character just before the current position. Does
not alter the input position. The length of possible strings matched
by the look-behind pattern must not be unbounded (no * or +
operators.)
The same problem holds in most of the languages.
Can't you match (?:^\d+\s*)(.*?\t) and extract $1 instead?
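The restriction and the capturing-group workaround can both be seen in a short Python sketch. Python's re is used only as an example of an engine that rejects unbounded look-behind, and the sample text assumes the path and FAIL are separated by a tab, which the \t in the question's regex suggests but the post does not show.

```python
import re

# Assumed layout: a tab between the path and FAIL.
text = (
    "10 _Resources\\index.test\tFAIL\n"
    "11 _Resources\\index.test\tFAIL\n"
    "12 Resources\\index.test\tFAIL\n"
    "13set\\Relicensing Statement.test\tFAIL\n"
)

# A look-behind containing + or * has no fixed width, so compiling fails.
try:
    re.compile(r"(?<=^\d+\s*).*?\t")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Workaround: consume the numeric prefix normally and capture the rest
# of the field in group 1; findall returns just the group.
paths = re.findall(r"^\d+\s*([^\t]*)", text, flags=re.MULTILINE)

print(lookbehind_ok, paths)
```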

Do parentheses change the length of a regular expression?

Let Sigma = {a,b}. The regular expression RE = (ab)(ab)*(aa|bb)*b over Sigma.
Give a string of length 5 in the set denoted by RE.
Correct answer: abaab
My answer: (ab)aab
I placed the parentheses there because they are in the RE. I understand why I don't need to, but is my answer incorrect? I tested it using RegEx, and the expression (ab)aab matched the text abaab, but it did not match when I reversed this.
() is part of regex syntax and has its own semantic meaning.
As with ^, & and other reserved characters in regex, you have to handle parentheses specially (by escaping them) to match them literally; see, for example: Regex to Match Symbols: !$%^&*()_+|~-=`{}[]:";'<>?,./
Also, specifically in the context of your question, () should not appear as part of the string, as these characters are not in the character set (alphabet) {a,b}. And the string you provide has a length of 7 instead of 5, so it is correct to say it is wrong.
Your answer is wrong because the parentheses do not belong to your set of symbols. The string (ab)aab cannot be generated using only symbols present in the {a,b} set.
Even more, you were asked to provide a string of 5 symbols but (ab)aab has length 7.
Parentheses have special meaning in regex. They create sub-regexps and capturing groups. For example, (ab)* means ab can be matched any number of times, including zero. Without parentheses, ab* means the regex matches one a followed by any number of bs. That's a different expression.
For example:
the regular expression (ab)* matches the empty string (ab zero times), ab, abab, ababab, abababab and so on;
the regular expression ab* matches a (followed by zero bs), ab, abb, abbb, abbbb and so on.
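The difference between the two expressions can be checked with a small sketch, using Python's re.fullmatch so that the whole string has to match. The candidate strings are picked only for illustration.

```python
import re

grouped = re.compile(r"(ab)*")   # the star applies to the whole group "ab"
ungrouped = re.compile(r"ab*")   # the star applies only to "b"

# Keep the candidates that each expression matches in full.
grouped_hits = [s for s in ["", "ab", "abab", "abb"] if grouped.fullmatch(s)]
ungrouped_hits = [s for s in ["a", "ab", "abb", "abab"] if ungrouped.fullmatch(s)]

print(grouped_hits)
print(ungrouped_hits)
```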
The first set of parentheses in your example is useless if you are looking only for sub-regexps. Both (ab) and ab expressions match only the ab string. But they can be used to capture the matched part of the string and re-use it either with back references or for replacement.
When parentheses are used for sub-expressions in regular expressions, they are meta-characters and do not match anything in the string. In order to match an open parenthesis character ( (found in the string) you have to escape it in the regex: \(.
Several strings that match the regular expression (ab)(ab)*(aa|bb)*b over Sigma = { 'a', 'b' }: abb, ababb, abababababb, ababababaabbaaaabbb.
The last string (ababababaabbaaaabbb) matches the regex pieces as follows:
ab - (ab)
ababab - (ab)* - ('ab' 3 times)
aabbaaaabb - (aa|bb)* - ('aa' or 'bb', 5 times in total)
b - b
A regex that matches the string (ab)aab is \(ab\)(ab)*(aa|bb)*b but in this case
Sigma = { 'a', 'b', '(', ')' }
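These points can be verified mechanically. The sketch below, in Python as an illustrative choice, enumerates every string of length 5 over {a, b} that the expression matches in full, and then checks the string with literal parentheses against both the original and the escaped pattern.

```python
import re
from itertools import product

RE = re.compile(r"(ab)(ab)*(aa|bb)*b")

# Enumerate every string of length 5 over the alphabet {a, b} and keep
# those that the whole expression matches.
length_5 = sorted(
    "".join(chars)
    for chars in product("ab", repeat=5)
    if RE.fullmatch("".join(chars))
)

# With the parentheses typed literally, the string has 7 symbols and
# only matches once they are escaped in the pattern.
literal = "(ab)aab"
escaped = re.compile(r"\(ab\)(ab)*(aa|bb)*b")

print(length_5)
print(len(literal), RE.fullmatch(literal), bool(escaped.fullmatch(literal)))
```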