perl regex, remove what is captured - regex

I've successfully captured data with this:
/^.{144}(.{15}).{34}(.{1})/
which results in this:
TTGGCCCCCACTCTC T
I want to remove the same characters from the same locations. I tried a simple substitution:
s/^.{144}(.{15}).{34}(.{1})//
That removes everything described. How do I remove only (...)?

Substitution works like
s/match/replace/
So it will replace your complete "match" with "replace". If you want to keep part of your match, you must reference the corresponding groups in the replacement string.
s/^.{144}(.{15}).{34}(.{1})// # replace all with nothing
s/^.{144}(.{15}).{34}(.{1})/$1/ # replace all with group 1 (.{15}) -> not what you want
s/^(.{144}).{15}(.{34}).{1}/$1$2/ # keeps group 1 and 2 and removes ".{15}" between them and all at the end.
The last one is the one you need.
Try regex101. There you can give a pattern and it shows you the groups. There is a debugger, too.
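For readers who want to try the idea outside Perl, here is a minimal Python sketch of the same "capture what you keep, put it back" approach. The sample string and the widths (4, 3, 5, 1) are made-up stand-ins for the question's 144, 15, 34 and 1, and the variable names are only illustrative.

```python
import re

# Scaled-down stand-in for the question's layout:
# 4 chars kept, 3 removed, 5 kept, 1 removed, the rest untouched.
line = "AAAATTTCCCCCGrest"

pattern = re.compile(r"^(.{4})(.{3})(.{5})(.)")

# Note the pieces that are about to be removed.
m = pattern.match(line)
removed = (m.group(2), m.group(4))

# Put back only groups 1 and 3; groups 2 and 4 are dropped.
cleaned = pattern.sub(r"\1\3", line, count=1)

print(removed)  # ('TTT', 'G')
print(cleaned)  # AAAACCCCCrest
```

Everything after the matched prefix is left alone, because a substitution only replaces the part of the string the pattern actually matched.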

The replacement side in the regex is substituted instead of everything that was matched (while there are ways to alter this to some extent), so you need to capture things intended to be kept as well, and put them back in the replacement side. Like
$var =~ s/^(.{144})(.{15})(.{34})(.)(.*)/$1$3$5/;
(the last capture was added in a comment)   or
$var =~ s/^(.{144})\K(.{15})(.{34})(.)(.*)/$3$5/;
Now the 15 chars and the single char are removed from $var, while you still have all of $N (1--5) available to work with as needed. (In the second version the \K keeps all matches previous to it so that they are not getting replaced, and thus we don't need $1 in the replacement side.) Please see perlretut for details.
However, as a comment enlightens us, there is a problem with this: It is not known before runtime which groups need be kept! So it could be 1,3,5 or perhaps 2 and 4 (or 7 and 11?).
What need be kept becomes known, and need be set, before the regex runs.
One way to do that: once the list of capture groups to keep is known store their indices in an array, then capture all matches into an array† and form the replacement and rewrite the string by hand
my @keep_idx = qw(0 2 4); # indices of capture groups to keep
my @captures = $var =~ /^(.{144})(.{15})(.{34})(.)(.*)/;
# Rewrite the variable using only @keep_idx -indexed captures
$var = join '', grep { defined } @captures[@keep_idx];
# Use @captures as needed...
The code above simply filters by grep any possibly non-existent "captures" -- a pattern may allow for a variable number of capture groups (so there may not exist group #5 for example). But I'd rather check those @captures explicitly (were there as many as expected? were they all of the expected form? etc).
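The same "capture everything, keep by index" idea can be sketched in Python as follows. The helper name keep_groups and the scaled-down widths are assumptions made for illustration; they are not part of the answer above.

```python
import re

def keep_groups(text, pattern, keep_idx):
    """Rebuild text from the capture groups listed in keep_idx (0-based).

    Returns (rebuilt_text, all_captures). Groups that did not take part in
    the match are None and are skipped, mirroring the grep { defined } step.
    """
    m = re.match(pattern, text)
    if m is None:
        return text, ()  # nothing matched: leave the text unchanged
    captures = m.groups()
    kept = "".join(captures[i] for i in keep_idx if captures[i] is not None)
    return kept, captures

# Hypothetical, scaled-down layout: groups 0, 2 and 4 are kept.
new_text, captures = keep_groups(
    "AAAATTTCCCCCGrest", r"^(.{4})(.{3})(.{5})(.)(.*)", [0, 2, 4]
)
print(new_text)  # AAAACCCCCrest
print(captures)  # ('AAAA', 'TTT', 'CCCCC', 'G', 'rest')
```

Because the list of indices is ordinary data, it can be decided at run time without building replacement code as a string.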
There are other ways to do this.‡
† In newer perls (from version 5.25.7) there is the @{^CAPTURE} predefined variable with all captures, so one can run the match $var =~ /.../; and then use it. No need to assign captures.
‡ I'd like to mention one way that may be tempting, and can be seen around, but is best avoided.
One can form a string for the replacement side and double-evaluate it, like so
my $keep = q($1.$3.$5); # perl *code*, concatenating variables
$var =~ s/.../$keep/ee; # DANGEROUS. Runs any code in $keep
Here the modifiers /ee evaluate the right-hand side, and in a way that exposes the program to evaluating code (in $keep) that may have been slipped to it. Search for this for more information but I'd say best don't use it where it matters.

Thanks for everyone's help. I don't get how the comments work and kept fouling those up. I've decided that the cleanest (if not most elegant) way is to create two patterns. I'm keeping the other solutions for future study. This is a different example.
The list of data I want to note, then delete:
/.{41}.{24}(\D{4}).{63}.{16}(\D{2}).{22}.{228}/
Data I want to keep:
/(.{41})(.{24})\D{4}(.{63})(.{16})\D{2}(.{22})(.{228})/
It's genetic data I'm working with. I need to note insertions then delete them to re-establish the original positions for alignment purposes.
If I understand correctly, I need to upvote this to close. An idiot as myself can only do what he can do. I'll try. :)

Related

Autohotkey variable assignment with dynamic hotstrings module

I have been pointed toward this AutoHotKey module which allows for dynamic hotstrings.
One of the examples is the calculation of percentages while typing (e.g. 5/40% will become 8%). To do this, the following code is necessary:
hotstrings("(\d+)\/(\d+)%", "percent")
percent:
p := Round($1 / $2 * 100)
Send, %p%`%
Return
I want to use this module to replace dots . with middle dots ∙ within words. I have figured out how to "find" the text, but not how to replace it correctly. I need to reference the initial text in order to put it in the replacement text. In the above code, using p := Round($1 / $2 * 100) uses the numbers input to calculate the percentage, but I can't figure out how to do the same with letters.
My code is the following:
hotstrings("[a-z]\.[a-z](\.[a-z])*", "word")
word:
a := $1
b := $2
Send, a{U+22C5}b
Return
But this just replaces the whole thing with a single middle dot and doesn't replace the surrounding letters. Also, I don't know how to consider the possibility of multiple dots (a.b.c.d for example). In python, I'd do a for loop but I don't really know AutoHotKey.
How can I do this?
Thanks
Few problems here.
First one is the regex.
Firstly, you don't really want to think of the regex as if it was matching an infinitely long string of a.b.c.d.e.f.g.h.i.j.k.l.... instead you just want to think of a single case x.y. Those cases can be right next to each other.
So get rid of (\.[a-z])*.
Secondly, you don't have capture groups. Or well, you do have one in there, but I'm assuming you accidentally did it. If you're not yet familiar with Regex capture groups, I'd recommend learning them, they're quite useful in certain cases (like here!).
But anyway, to create capture groups, you just put ( ) around the part of the Regex you want to capture.
So you want to capture the characters before and after the . (or well, actually only the latter one, this approach will have a problem, more on that later). So your Regex would now look like this:
([a-z])\.([a-z])
Upon a match, the hotstrings() function would output two variables, $1 and $2 (that's all they are, names of variables).
When you refer to the variables, $1 gives you the character before the ., and $2 gives you the character after the ..
So now we get onto the second problem, referring to the capture group variables.
a := $1
b := $2
Send, a{U+22C5}b
Here you create the variables a and b for no reason, though that's not an issue of course, but how you try to refer to the variables a and b is a problem.
You're using a Send command, so you're in legacy AHK syntax. In legacy AHK syntax you refer to variables by wrapping them in % signs.
So your send command would look like this:
Send, %a%{U+22C5}%b%
But let's not write legacy AHK (even though the hotstrings() function totally is legacy AHK).
To switch over to modern AHK (expression syntax) we specify a single % followed by a space. And then we can do this:
SendInput, % $1 "{U+22C5}" $2
I also skipped defining the useless variables a and b and switched over to SendInput, since it is the recommended, faster and more reliable send mode.
And now we would have an almost working script like so:
hotstrings("([a-z])\.([a-z])", "word")
return
word:
SendInput, % $1 "{U+22C5}" $2
Return
It just would have the problem of chaining multiple a.b.c.d.e.f.g... not working very well. But that's fine, since the Regex could do with more improvements.
We want to use a positive lookbehind and capture only the character after the . like so:
(?<=[a-z])\.([a-z])
Also, I'd say it would be fitting to replace [a-z] with \w (match any word character). So the Regex and the whole script would be:
hotstrings("(?<=\w)\.(\w)", "word")
return
word:
SendInput, % "{U+22C5}" $1
Return
And now it should work just as requested.
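To see why the lookbehind matters for chained input, here is a small Python sketch (not AutoHotkey) that applies both regexes as plain substitutions. It only illustrates the regex behavior; the function names are made up, and it says nothing about how hotstrings() processes keystrokes.

```python
import re

MIDDLE_DOT = "\u22c5"  # same code point as {U+22C5} in the script above

def with_capture(text):
    # Consumes the letter on BOTH sides, so the letter after one dot
    # is no longer available as the letter before the next dot.
    return re.sub(r"(\w)\.(\w)", rf"\1{MIDDLE_DOT}\2", text)

def with_lookbehind(text):
    # The lookbehind only checks the preceding character without consuming it.
    return re.sub(r"(?<=\w)\.(\w)", rf"{MIDDLE_DOT}\1", text)

print(with_capture("a.b.c.d"))     # a⋅b.c⋅d  (every second dot is missed)
print(with_lookbehind("a.b.c.d"))  # a⋅b⋅c⋅d
```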
And if my talks about legacy vs modern AHK confuse you (that's to be expected if you don't know the difference), I'd recommend giving e.g. this a read:
https://www.autohotkey.com/docs/Language.htm

auto-generating substitution in perl

I'm trying to autogenerate a regex pattern in perl based on some input, to handle various variables that are created by token pasting in a Makefile... So, for example, I might have a pattern such as:
foo_1_$(AB)_$(CB)
Given this pattern, I need to create a regex that will convert all instances of foo_1_\$(\w+)_\$(\w+) to bar_1_\$($1)_\$($2). The main issue I'm having is with the "to" side of the pattern -- I need to increment the $ number reference each time -- notice that there may be a variable number of tokens in any given pattern.
So... I'm thinking something like this:
foreach $pattern (@patterns) {
my $from = $pattern;
# foo_1_$(AB)_$(CD)
$from =~ s/\$\(\w+\)/\$\(\\w\\\+\)/g;
# foo_1_$(\w+)_$(\w+)
my $to = $pattern =~ s/foo/bar/r;
# bar_1_$(AB)_$(CD);
$to =~ s/\$\(\w+\)/\\\$\(\$?)/g; #???
# bar_1_\$($1)_\$($2)
# ^ ^
#this next part is done outside of this loop, but for the example code:
$line =~ s/\Q$from\E/$to/;
}
How do I cause each subsequent replacement in my to to have an incremental index?
Writing code to generate regex off of a given pattern is a complex undertaking (except in simplest cases), and that's when it is precisely specified what that pattern can be. In this case I also don't see why one can't solve the problem by writing the regex for a given type of a pattern (instead of writing code that would write regex).†
In either case one would need those regex so here's some of that. Since no precise rules for what the patterns may be are given, I use some basic assumptions drawn from hints in the question.
I take it that the pattern to replace (foo_) is followed by a number, and then by the pattern _$(AB) (literal dollar and parens with chars inside), repeated any number of times ("there may be a variable number of tokens").
One way to approach this is by matching the whole following pattern (all repetitions). With lookahead
s/[a-z]+_([0-9]+)(?=_(\$\(\w+\))+)/XXX_$1/;
A simple minded test in a one-liner
perl -wE'$_=q{foo_1_$(AB)_$(CB)}; s/[a-z]+_([0-9]+)(?=_(\$\(\w+\))+)/XXX_$1/; say'
replaces foo with XXX. It also works with only one group _$(AB), and with more than two.
This does not match the lone foo_1, without following _$(AB), decided based on the "spirit" of the question (since such a requirement is not spelled out). If such a case in fact should be matched as well then that is possible with a few small changes (mostly related to moving _ into the pattern to be replaced, as optional ([a-z]+_[0-9]+_?))
Update If the "tokens" that follow foo_ (to be replaced) can in fact be anything (so not necessarily $(..)), except that they are strung together with _, then we can use a modification like
s/[a-z]+_(\d?)(?=(_[^_]+)*)/XXX_$1/;
where the number after foo_ is optional, per example given in a comment. But then it's simpler
s/[a-z]+(?=(_[^_]+)*)/XXX/;
Example
perl -wE'
$_=q{foo_$(AB)_123_$(CD)_foo_$(EF)}; say;
s/[a-z]+(?=(_[^_]+)*)/XXX/; say'
prints
foo_$(AB)_123_$(CD)_foo_$(EF)
XXX_$(AB)_123_$(CD)_foo_$(EF)
Note: what the above regex does is also done by s/[a-z]+(?=_)/XXX/. However, the more detailed regex above can be tweaked and adapted for more precise requirements and I'd use that, or its variations, as a main building block for complete solutions.
If the rules for what may be a pattern are less structured (less than "any tokens connected with _") then we need to know them, and probably very precisely.
This clearly doesn't generate the regex from a given pattern, as asked, but is a regex to match such a (class of) patterns. That can solve the problem given sufficient specification for what those patterns may be like -- which would be necessary for regex generation as well.
† Another option is that some templating system is used but then you are again directly writing regex to match given types of patterns.
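For completeness, the incrementing-index part of the question can be sketched in Python like this. It is only an illustration under simple assumptions: tokens always look like $(NAME), the literal text between them contains no backslashes, and build_rule, old and new are hypothetical names.

```python
import re

TOKEN = re.compile(r"\$\(\w+\)")  # a Makefile-style token such as $(AB)

def build_rule(pattern, old="foo", new="bar"):
    """Return (compiled 'from' regex, 'to' template) for one input pattern."""
    literals = TOKEN.split(pattern)  # the literal text between the tokens
    from_parts = [re.escape(literals[0])]
    to_parts = [literals[0].replace(old, new, 1)]
    # Each token becomes capture group number `index` on the 'from' side
    # and a \g<index> back-reference on the 'to' side.
    for index, literal in enumerate(literals[1:], start=1):
        from_parts.append(r"\$\((\w+)\)" + re.escape(literal))
        to_parts.append(rf"$(\g<{index}>)" + literal)
    return re.compile("".join(from_parts)), "".join(to_parts)

rule, template = build_rule("foo_1_$(AB)_$(CB)")
print(template)                                   # bar_1_$(\g<1>)_$(\g<2>)
print(rule.sub(template, "x := foo_1_$(X)_$(YZ)"))  # x := bar_1_$(X)_$(YZ)
```

The index comes from counting the pieces while the two sides are assembled, so no counter has to be threaded through a substitution.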

Notepad++ masschange using regular expressions

I have trouble performing a mass change in a huge logfile.
Apart from the file size, which causes issues for Notepad++, I have a problem using 10 or more parameters in the replacement; up to 9 it works fine.
I need to change numerical values in a file where these values are enclosed in quotation marks, with a leading and a trailing comma: ,"123,456,789,012.999",
I used this exp to find and replace the format to:
,123456789012.999, (so that there are no quotation marks and no comma within the num.value)
The exp used to find is:
([,])(["])([0-9]+)([,])([0-9]+)([,])([0-9]+)([,])([0-9]+)([\.])([0-9]+)(["])([,])
and the exp to replace is:
\1\3\5\7\9\10\11\13
The problem is that parameters \11 and \13 are not working (the characters, e.g. .999 in the example, do not appear in the changed values).
So now the question is - is there any limit for parameters?
It seems to me that it's not working above 10. For shorter numeric values, where I need only up to 9 parameters, the search and replacement strings work fine; for the example above the search works but the replacement does not, and the end of the changed value gets corrupted.
Also, it occurred to me that instead of using Notepad++ I could maybe change the logfile directly on the Unix server; however, I had trouble building the correct Perl syntax. Could anyone help with that?
After having a little play myself, it looks like back-references \11-\99 are invalid in Notepad++ (which is not that surprising, since this is commonly omitted from regex languages). However, there are several things you can do to improve that regular expression, in order to make this work.
Firstly, you should consider using fewer groups, or alternatively non-capturing groups. Did you really need to store 13 variables in that regex, in order to do the replacement? Clearly not, since you're not even using half of them!
To put it simply, you could just remove some brackets from the regex:
[,]["]([0-9]+)[,]([0-9]+)[,]([0-9]+)[,]([0-9]+)[.]([0-9]+)["][,]
And replace with:
,\1\2\3\4.\5,
...But that's not all! Why are you using square brackets to say "match anything inside", if there's only one thing inside?? We can get rid of these, too:
,"([0-9]+),([0-9]+),([0-9]+),([0-9]+)\.([0-9]+)",
(Note I added a "\" before the ".", so that it matches a literal "." rather than "anything".)
Also, although this isn't a big deal, you can use "\d" instead of "[0-9]".
This makes your final, optimised regex:
,"(\d+),(\d+),(\d+),(\d+)\.(\d+)",
And replace with:
,\1\2\3\4.\5,
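Since the question also asks about doing this outside Notepad++, here is a hedged Python sketch of the same replacement. The sample line is invented, and strip_commas is a hypothetical helper that handles any number of comma-separated digit groups instead of exactly four.

```python
import re

line = 'a,"123,456,789,012.999",b'

# Direct translation of the simplified regex and replacement above.
fixed = re.sub(r',"(\d+),(\d+),(\d+),(\d+)\.(\d+)",', r",\1\2\3\4.\5,", line)

def strip_commas(match):
    # Drop the quotes and the thousands separators, keep the outer commas.
    return "," + match.group(1).replace(",", "") + ","

# One capture group is enough when a function builds the replacement.
general = re.sub(r',"(\d+(?:,\d+)*\.\d+)",', strip_commas, line)

print(fixed)    # a,123456789012.999,b
print(general)  # a,123456789012.999,b
```

For a huge logfile the substitution could be applied line by line while reading the file, so the whole file never has to be held in memory.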
Not sure whether regex groups have a limit, but you could use lookarounds to save 2 groups, and you could also merge some groups in your example. But first, let's get rid of some useless character classes:
(,)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
We could merge those groups:
(,)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
^^^^^^^^^^^^^^^^^^^^
We get:
(,)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(,)
Let's add lookarounds:
(?<=,)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(?=,)
The replacement would be \2\4\6\8.
If you have a fixed number of digit groups at all times, it's fairly simple to do what you have done. Even though your expression is poorly written, it does the job. If this is the case, look at Tom Lord's answer.
I played around with it a little bit myself, and I would probably use two expressions - makes it much easier. If you have to do it in one, this would work, but be pretty unsafe:
(?:"|(\d+),)|(\.\d+)"(?=,) replace by \1\2
Live demo: http://regex101.com/r/zL3fY5

Regex exception

I'd like to have a regex that would match every [[ except those starting with some word, e.g.:
Match [[DEF, but not match [[ABC:DEF.
Thanks for help and sorry for my English.
EDIT:
My regex (Python) is (\[\[)|(\{\{([Tt]emplate:|)[Cc]ategory).
It matches every [[ and {{category}} or {{Template:Category}} or {{template:category}}, but I don't want to match [[ if it is followed by e.g. ABC. More examples:
Match [[SOMETHING, but not match [[ABC: SOMETHING,
Match [[EXAMPLE, but not match [[ABC: EXAMPLE.
EDIT2: "define ex. ABC"
I want match every [[ not followed by some string, for example ABC.
This depends heavily on the regex engine you are using. If I can assume it can handle look-arounds, the regex would probably be \[\[(?!ABC) for matching two opening brackets not followed by the three characters ABC.
match every [[ but don't match [[ if it starting by ex. ABC
Maybe you mean:
\[\[(?!ABC)
...or maybe something more like:
\[\[(?!\w+:)
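A quick way to compare the two suggestions is to run them over a sample string. The sketch below uses Python's re module; the sample text and the extra (\w+) group (added only to show which [[ was matched) are assumptions for illustration.

```python
import re

text = "[[DEF]] [[ABC:DEF]] [[XYZ:page]] [[EXAMPLE]]"

# Reject only [[ that is immediately followed by the literal ABC.
not_abc = re.findall(r"\[\[(?!ABC)(\w+)", text)

# Reject [[ followed by any word and a colon.
no_prefix = re.findall(r"\[\[(?!\w+:)(\w+)", text)

print(not_abc)    # ['DEF', 'XYZ', 'EXAMPLE']
print(no_prefix)  # ['DEF', 'EXAMPLE']
```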
Finally, after 8 years, here's an easy copy-paste code that should cover every possible case.
Watch out for:
Be careful when using this for "any-word-except": make sure to put \b in the REGEX_BEFORE part, as you should be doing anyway when finding words.
If your regex is really complex, and you need to use this code in two different places in one regex expression, make sure to use exceptions_group_1 for the first time, exceptions_group_2 for the second time, etc. Read the explanation below to understand this better.
Copy/Paste Code:
In the following regex, ONLY replace the all-caps sections with your regex.
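Python regex (atomic groups need Python 3.11 or newer in the built-in re module, or the third-party regex module)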
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript
Impossible as of 6/17/2020, and probably won't be possible in the near future.
Full Examples
REGEX_BEFORE = [[
YOUR_NORMAL_PATTERN = \w+\d*
REGEX_AFTER = ]]
EXCEPTION_PATTERN = MyKeyword\d+
Python regex
pattern = r"\[\[(?>(?P<exceptions_group_1>MyKeyword\d+)|\w+\d*)(?(exceptions_group_1)always(?<=fail)|)\]\]"
Ruby regex
pattern = /\[\[(?>(?<exceptions_group_1>MyKeyword\d+)|\w+\d*)(?(<exceptions_group_1>)always(?<=fail)|)\]\]/
PCRE regex
\[\[(?>(?<exceptions_group_1>MyKeyword\d+)|\w+\d*)(?(exceptions_group_1)always(?<=fail)|)\]\]
How does it work?
This uses decently complicated regex, namely Atomic Groups, Conditionals, Lookbehinds, and Named Groups.
The (?> is the start of an atomic group, which means it's not allowed to backtrack: if that group matches once but then later gets invalidated because a lookbehind failed, the whole group fails to match. (We want this behavior in this case.)
The (?<exceptions_group_1> creates a named capture group. It's just easier than using numbers.
Note that the atomic pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to find always(?<=fail). That pattern (as it says) will always fail, because it's looking for the word "always" and then it checks 'does "ways"=="fail"', which it never will.
Because the conditional fails, the atomic group fails, and because it's atomic it's not allowed to backtrack (to try to look for the normal pattern) because it already matched the exception.
This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.
Exact answer to the original question:
pattern = r"(\[\[(?>(?P<exceptions_group_1>ABC: )|(SOMETHING|EXAMPLE))(?(exceptions_group_1)always(?<=fail)|))"

Regex: Matching against groups in different order without repeating the group

Let's say I have two strings like this:
XABY
XBAY
A simple regex that matches both would go like this:
X(AB|BA)Y
However, I have a case where A and B are complicated strings, and I'm looking for a way to avoid having to specify each of them twice (on each side of the |). Is there a way to do this (that presumably is simpler than having to specify them twice)?
Thanks
X(?:A()|B()){2}\1\2Y
Basically, you use an empty capturing group to check off each item when it's matched, then the back-references ensure that everything's been checked off.
Be aware that this relies on undocumented regex behavior, so there's no guarantee that it will work in your regex flavor--and if it does, there's no guarantee that it will continue to work as that flavor evolves. But as far as I know, it works in every flavor that supports back-references. (EDIT: It does not work in JavaScript.)
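As a quick check of the trick in one concrete flavor, here is a Python sketch. A and B are arbitrary stand-ins for the "complicated" sub-patterns and must not contain capturing groups of their own, otherwise the numbering of the two empty groups shifts.

```python
import re

A = r"\d+"     # hypothetical stand-in for the first complicated piece
B = r"[a-z]+"  # hypothetical stand-in for the second one

# Each alternative "ticks" its own empty group; \1\2 then require both ticks.
either_order = re.compile(rf"X(?:{A}()|{B}()){{2}}\1\2Y")

for candidate in ("X12abY", "Xab12Y", "X1234Y", "XabcdY"):
    print(candidate, bool(either_order.fullmatch(candidate)))
# X12abY True
# Xab12Y True
# X1234Y False
# XabcdY False
```

The last two inputs fail because one of the empty groups never takes part in the match, so its back-reference cannot succeed.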
EDIT: You say you're using named groups to capture parts of the match, which adds a lot of visual clutter to the regex, if not real complexity. Well, if you happen to be using .NET regexes, you can still use simple numbered groups for the "check boxes". Here's a simplistic example that finds and picks apart a bunch of month-day strings without knowing their internal order:
Regex r = new Regex(
@"(?:
(?<MONTH>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)()
|
(?<DAY>\d+)()
){2}
\1\2",
RegexOptions.IgnorePatternWhitespace);
string input = @"30Jan Feb12 Mar23 4Apr May09 11Jun";
foreach (Match m in r.Matches(input))
{
Console.WriteLine("{0} {1}", m.Groups["MONTH"], m.Groups["DAY"]);
}
This works because in .NET, the presence of named groups has no effect on the ordering of the non-named groups. Named groups have numbers assigned to them, but those numbers start after the last of the non-named groups. (I know that seems gratuitously complicated, but there are good reasons for doing it that way.)
Normally you want to avoid using named and non-named capturing groups together, especially if you're using back-references, but I think this case could be a legitimate exception.
You can store regex pieces in variables, and do:
A=/* relevant regex pattern */
B=/* other regex pattern */
regex = X($A$B|$B$A)Y
This way you only have to specify each regex once, on its own line, which should make it easier to maintain.
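In Python, for instance, that could look like the sketch below. The two sub-patterns are invented examples; the point is only that each piece is written once and then interpolated in both orders.

```python
import re

A = r"\d{4}-\d{2}"  # hypothetical first piece, e.g. a year-month stamp
B = r"[A-Z]{3}"     # hypothetical second piece, e.g. a three-letter code

# Each piece is defined once and used on both sides of the alternation.
regex = re.compile(rf"X(?:{A}{B}|{B}{A})Y")

print(bool(regex.fullmatch("X2020-01ABCY")))  # True
print(bool(regex.fullmatch("XABC2020-01Y")))  # True
print(bool(regex.fullmatch("XABCABCY")))      # False
```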
Sidenote: You're trying to find permutations, which is ok since you're only looking at 2 subregexes. But if you wanted to add a third (or fourth), your regex permutations grow drastically - (abc|acb|bac|bca|cab|cba) - or worse. If you need to go down the road of permutations, there's some good discussion on that here on stackoverflow. It's for letter permutation, and the solutions use awk/bash/perl, but that at least gives you a starting point.
try this
X((A|B){2})Y
If there are several strings, with any kind of characters in there, you'll be better off with:
X(.)+Y
Only numbers then
X([0-9])+Y
Only letters
X([a-zA-Z])+Y
Letters and numbers
X([a-zA-Z0-9])+Y