In Emacs, can I use alternation in the regexp for align-regexp?

For example, I have the following snippet:
'abc' => 1,
'abcabc' =>2,
'abcabcabc' => 3,
And I want to format it to:
'abc'       => 1,
'abcabc'    => 2,
'abcabcabc' => 3,
I know there are easier ways to do it, but here I just want to practice my understanding of align-regexp. I've tried this command, but it does not work:
C-u M-x align-regexp \(\s-+\)=\|\(>\s-*\)\d 1 1 y
Where am I going wrong?
Thanks.

So the question is: with \(\s-+\)=\|\(>\s-*\)\d matching either \(\s-+\)= or \(>\s-*\)\d [1], can align-regexp align on each of those alternatives throughout a line?
The answer is no. align-regexp modifies one specific group of the regexp. In this case that is group 1, and group 1 is the \(\s-+\) at the beginning. Group 1 of the regexp does not vary depending on which alternative actually matched, so it never refers to \(>\s-*\) [2].
If, however, you can express your regexp so that a single group is what should be adjusted for every match throughout the line, you can get the effect you want.
For example, >?\(\s-*\)[0-9=] would, at least for the data shown, give the desired result.
[1] In Emacs regexps, \d matches a literal d. Use [0-9] instead.
[2] You generally don't want any non-whitespace in the alignment group, as Emacs replaces the content of that group.
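The point about group 1 not following the alternation can be seen outside Emacs as well. The following is only a rough sketch in Python, whose `re` syntax differs from Emacs regexps (`\s` stands in for `\s-`, and groups are written without backslashes); the sample line is taken from the question.

```python
import re

line = "'abcabc' =>2,"

# Alternation with two groups: each match fills only one of them,
# so "group 1" is empty (None) whenever the second branch matched.
alternation = re.compile(r"(\s+)=|(>\s*)\d")
alt_groups = [m.groups() for m in alternation.finditer(line)]

# A single group that takes part in every match, as in the suggested
# >?\(\s-*\)[0-9=] pattern.
single = re.compile(r">?(\s*)[0-9=]")
single_groups = [m.group(1) for m in single.finditer(line)]

print(alt_groups)     # [(' ', None), (None, '>')]
print(single_groups)  # [' ', '']
```

In the second pattern group 1 is always present (possibly empty), which is what a tool that rewrites one fixed group needs.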

Related

Get value in between underscores at a certain occurrence

I'm trying to get the value in between 2 underscores at a certain occurrence.
Ex:
HOL_1026-03_OW_01.9000_01.3400_0.2800_CL_32, 0"_0, 0"_0, 0"_RR_NORM_CR-HSR_CR-SUP_ALLHOL-013_FCNO_NOFIN_VRA-010_HXHHH_.
I'm trying to extract the "CR-HSR" and "CR-SUP" from this. I originally came up with this
(?!(.*?_){8}).*(?=(.*?_){7}) and (?!(.*?_){7}).*(?=(.*?_){6})
which works in regexr.com
I'm using this with PL/SQL, and when I run REGEXP_SUBSTR(), it returns null.
Oracle regular expressions do not support lookarounds. Without them, you can repeat a group that matches one or more characters other than _ followed by an _.
^([^_]+_){12}([^_]+)
Then get the capture group 2 value.
SELECT regexp_substr('HOL_1026-03_OW_01.9000_01.3400_0.2800_CL_32, 0"_0, 0"_0, 0"_RR_NORM_CR-HSR_CR-SUP_ALLHOL-013_FCNO_NOFIN_VRA-010_HXHHH_.', '^([^_]+_){12}([^_]+)', 1,1,NULL,2) from dual;
Output
CR-HSR
To match CR-SUP, change the quantifier from {12} to {13}.
As an alternative, a shorter way might be
REGEXP_SUBSTR(c, '[^_]+', 1, 13)
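For readers who want to check the field counting outside the database, here is a minimal Python sketch of both approaches. It is not Oracle code: `field_by_group` and `field_by_occurrence` are hypothetical helper names, and Python's `re` module only stands in for REGEXP_SUBSTR.

```python
import re

s = ('HOL_1026-03_OW_01.9000_01.3400_0.2800_CL_32, 0"_0, 0"_0, 0"_RR_NORM_'
     'CR-HSR_CR-SUP_ALLHOL-013_FCNO_NOFIN_VRA-010_HXHHH_.')

def field_by_group(text, skip):
    """Skip `skip` underscore-terminated fields, then capture the next one."""
    m = re.match(r'^([^_]+_){' + str(skip) + r'}([^_]+)', text)
    return m.group(2) if m else None

def field_by_occurrence(text, n):
    """Return the n-th (1-based) run of non-underscore characters."""
    fields = re.findall(r'[^_]+', text)
    return fields[n - 1] if 0 < n <= len(fields) else None

print(field_by_group(s, 12), field_by_group(s, 13))
print(field_by_occurrence(s, 13), field_by_occurrence(s, 14))
```

Both helpers assume that no field is empty; two consecutive underscores would shift the count in either approach.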

Regex Erasing all except numbers with limited digits

What I want to do is erase everything except \d{4,7} only by replacing.
Any ideas to get this?
ex)
G-A15239L → 15239
(G-A and L should be selected and replaced by empty strings)
now200316stillcovid19asdf → 200316
(now and stillcovid19asdf should be selected and replaced by empty strings)
Also, the replacement text is not limited to an empty string.
Substitutions such as $1 are possible too.
I'm using regex in the 'Kustom' apps (including KLCK, KLWP, KWGT).
I don't know which engine they use because there is no information about it.
You may use
(\d{4,7})?.?
Or
(\d{4,7})|.
and replace with $1.
Details
(\d{4,7})? - an optional capturing group (because of the ? after it; without the ?, the group is mandatory) matching 4 to 7 digits, 1 or 0 times
| - or (in the second pattern)
.? - any one char other than line break chars, 1 or 0 times when ? is right after it.
So, any match of 4 to 7 digits is kept (since $1 refers to the Group 1 value), and if there is a char after it, it is removed.
It looks as if the regex engine is Java-based, since all non-matching groups are replaced with null.
So the only possible solution is a second pass that post-processes the results: replace null with some kind of delimiter, a newline for example.
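The (\d{4,7})|. substitution can be tried out with a short Python sketch. This says nothing about the engine Kustom actually uses; `keep_digit_runs` is a hypothetical helper, and the replacement is written as a function so that a group that did not take part in the match is turned into an empty string explicitly.

```python
import re

PATTERN = re.compile(r'(\d{4,7})|.')

def keep_digit_runs(text):
    """Keep runs of 4 to 7 digits and delete every other character."""
    # Group 1 is None when the "." branch matched, so substitute "" then.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

print(keep_digit_runs('G-A15239L'))                  # 15239
print(keep_digit_runs('now200316stillcovid19asdf'))  # 200316
```

The short run "19" in the second example is removed character by character because it never satisfies \d{4,7}.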
Search: .*?(\d{4,7})[^\d]+|.*
Replace: $1
This works with your examples in, for instance, Notepad++ 6.0 or later (which comes with built-in PCRE support):
jalsdkfilwsehf
now200316stillcovid19asdf
G-A15239L
becomes:
200316
15239

Replacing a single term in a regex pattern

I am using regexp_filter in Sphinx to replace terms
In most cases I can do so e.g. misspellings are easy:
regexp_filter = Backround => Background
Even swapping using capturing group notation:
regexp_filter = (Left)(Right) => \2\1
However, I am having more trouble when using a pattern match to find a given word I want to replace:
regexp_filter = (PatternWord1|PatternWord2)\W+(?:\w+\W+){1,6}?(SearchTerm)\b => NewSearchTerm
Where NewSearchTerm would be the term I want to replace just \2 with (leaving \1 and the rest of the pattern alone).
So if I had the text 'Pizza and Taco Parlor' then:
regexp_filter = (Pizza)\W+(?:\w+\W+){1,6}?(Parlor)\b => Store
Would convert to 'Pizza and Taco Store'
I know in this case the SearchTerm is \2, but I'm not sure how to convert it. I know I could append, e.g., \2s to make it plural, but how can I actually replace it, since it is just one capturing group of several and I only want to replace that group?
So, if I understand the question, you have strings that match the following criteria:
Begin with PatternWord1 or PatternWord2
Followed by one or more non-word characters (\W+)
Followed by one to six intervening words, each followed by non-word characters ((?:\w+\W+){1,6}?, matched lazily)
Followed by "SearchTerm" at a word boundary
Let's use this as a baseline:
PatternWord1 Hello SearchTerm
And you only want to replace SearchTerm in the string.
So you need another capturing group around everything you want to keep:
regexp_filter = ((PatternWord1|PatternWord2)\W+(?:\w+\W+){1,6}?)(SearchTerm)\b => \1World
Your captured groups would be:
\1: PatternWord1 Hello (including the trailing space)
\2: PatternWord1
\3: SearchTerm
Your result would be:
PatternWord1 Hello World
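The idea of wrapping everything you want to keep in its own group and echoing it back in the replacement can be checked with the Pizza example from the question. This is a sketch using Python's `re` module as a stand-in; Sphinx's regexp_filter uses its own regex engine, so treat the syntax equivalence as an assumption.

```python
import re

# Group 1 wraps the prefix to keep; group 3 is the word being replaced.
pattern = re.compile(r'((Pizza)\W+(?:\w+\W+){1,6}?)(Parlor)\b')
text = 'Pizza and Taco Parlor'

kept_prefix = pattern.search(text).group(1)
result = pattern.sub(r'\g<1>Store', text)

print(repr(kept_prefix))  # 'Pizza and Taco '
print(result)             # Pizza and Taco Store
```

`\g<1>` is Python's unambiguous spelling of `\1`; in the Sphinx configuration the replacement would be written `\1Store`.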

Tcl regular expressions

set d(aa1) 1
set d(aa2) 1
set d(aa3) 1
set d(aa4) 1
set d(aa5) 1
set d(aa6) 1
set d(aa7) 1
set d(aa8) 1
set d(aa9) 1
set d(aa10) 1
set d(aa11) 1
set regexp "a*\[1-9\]"
set res [array names d -glob $regexp]
puts "res = $res"
In this case, the result is:
res = aa11 aa6 aa2 aa7 aa3 aa8 aa4 aa9 aa5 aa1
But when I change the regexp from a*\[1-9\] to a*\[1-10\], the result becomes:
res = aa11 aa10 aa1
You have an error in your character class.
[1-10] does not mean a number from 1 to 10.
It means the range 1-1 (i.e., simply a 1) or a 0. This explains your output.
To express a number from 1 to 10, use (?:10?|[2-9]) (one of several ways to do it).
Your regex therefore becomes a*(?:10?|[2-9])
Note that if your engine does not allow non-capturing groups, you need to remove the ?:, giving a*(10?|[2-9])
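How the character class is read can be confirmed with a quick experiment. The sketch below uses Python's `re` module rather than Tcl, on the assumption that bracket expressions are interpreted the same way for this simple case.

```python
import re

# [1-10] is the range 1-1 plus the character 0, i.e. the set {0, 1}.
digits_matched = [c for c in '0123456789' if re.fullmatch(r'[1-10]', c)]

# (?:10?|[2-9]) accepts exactly the numbers 1 through 10.
numbers_matched = [n for n in range(0, 13)
                   if re.fullmatch(r'(?:10?|[2-9])', str(n))]

print(digits_matched)   # ['0', '1']
print(numbers_matched)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```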
You need to be sure what you're trying to match because glob style matching and regexp style matching are different in many aspects.
From the docs, glob has the following:
* matches any sequence of characters in string, including a null string.
? matches any single character in string.
[chars] matches any character in the set given by chars. If a sequence of the form x-y appears in chars, then any character between x and y, inclusive, will match. When used with -nocase, the end points of the range are converted to lower case first. Whereas {[A-z]} matches _ when matching case-sensitively (since _ falls between the Z and a), with -nocase this is considered like {[A-Za-z]} (and probably what was meant in the first place).
\x matches the single character x. This provides a way of avoiding the special interpretation of the characters *?[]\ in pattern.
Since you are using glob style matching, your current expression (a*\[1-9\]) matches an a, followed by any characters and any one of 1 through 9 (meaning it would also match something like abcjdne1).
If you want to match at least one a followed by numbers from 1 through 10, you will need something like this, using the -regexp mode:
set regexp {a+(?:[1-9]|10)}
set res [array names d -regexp $regexp]
I believe this regexp is the more natural one for a beginner: (?:[1-9]|10) means either 1 through 9, or 10. You can also use the form that zx81 suggested, (?:10?|[2-9]), meaning 1 with an optional 0 (for 10), or 2 through 9.
+ means that a must appear at least once for the array name to match.
If you now need to match the full names, you will need to use anchors:
^a+(?:[1-9]|10)$
Note: you cannot use glob matching if you want to match at least one a followed by digits, and alternation (the pipe, |) and quantifiers (?, + or *) as they behave in regexp are not supported by glob matching.
One last thing: use braces to avoid having to escape your pattern (unless the pattern contains a variable or command substitution and you can't do otherwise).
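The difference between the two matching styles can be illustrated with a small comparison. This is a Python sketch, not Tcl: `fnmatch.fnmatchcase` is used as an approximation of glob-style matching and `re.search` as an approximation of `-regexp`; the extra name abcjdne1 is the example mentioned above.

```python
import fnmatch
import re

names = ['aa%d' % i for i in range(1, 12)] + ['abcjdne1']

# Glob: "a", then anything, then one final character from 1 to 9.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, 'a*[1-9]')]

# Anchored regexp: one or more "a", then a number from 1 to 10, nothing else.
regex_hits = [n for n in names if re.search(r'^a+(?:[1-9]|10)$', n)]

print(glob_hits)   # aa1..aa9, aa11 and abcjdne1, but not aa10
print(regex_hits)  # aa1..aa10 only
```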

greedy matching in regexp

I have the following output:
Player name: RON_98
Player name: RON_97
player name: RON_96
I need to get the RON part and the numeric part after it (for example 98). I used the following regexp: regexp "(RON)_(\[0-9]*)". Will this match the RON_96 on the last line? "*" is a greedy match, so how do I match only the first line of the output? Is there something like (RON)_(match digits only) that prevents it from matching the rest of the line?
Always put regular expressions in braces in Tcl.
It's not technically necessary (you can use Tcl's language definition to exactly work out what backslashes would be needed to do it any other way) but it's simpler in all cases that you're likely to encounter normally.
The examples below will use this.
Regular expressions start matching as soon as they can. Then, under normal (greedy) circumstances they match as much text as they can. Thus, with your sample code and text, the matcher starts trying to match at the R on the first line and goes on to consume up to the 8, at which point it has a match and stops. You can verify this by asking regexp to report the indices into the string where the match happened instead of the substring that was matched (via the -indices option, documented on the manual page).
To get all the matches in a string, you have two options:
Pass the -all -inline options to regexp and process the list of results with foreach:
# Three variables in foreach; one for whole match, one for each substring
foreach {a b c} [regexp -all -inline {(RON)_([0-9]*)} $thedata] {
    puts "matched '$a', with b=$b and c=$c"
}
Use the -indices option together with the -start option, all in a while loop, so you step through the string:
set idx 0
while {[regexp -start $idx -indices {(RON)_([0-9]*)} $thedata a b c]} {
    puts "matched at '$a', with subranges '$b' and '$c'"
    set extracted [string range $thedata {*}$c]
    puts "the extracted value is '$extracted'"
    # Advance the place where the next search will start from
    set idx [expr {[lindex $a 1] + 1}]
}
I'd normally recommend using the first option; it's much easier to use! Sometimes the second is better as it provides more information and uses less intermediate storage, but it's also much trickier to get right.
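The same behaviour (the first match stops after the digits on the first line, and all matches can be collected in one pass) can be reproduced outside Tcl. The following is a Python sketch with the sample output from the question as `data`; `re.search` and `re.findall` are only rough counterparts of plain `regexp` and `regexp -all -inline`.

```python
import re

data = "Player name: RON_98\nPlayer name: RON_97\nplayer name: RON_96"
pattern = re.compile(r'(RON)_([0-9]*)')

# A single search returns the leftmost match only.
first = pattern.search(data)
first_text = data[first.start():first.end()]

# Collect every match as a (name, digits) pair.
all_pairs = pattern.findall(data)

print(first_text)  # RON_98
print(all_pairs)   # [('RON', '98'), ('RON', '97'), ('RON', '96')]
```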
Even if you apply your regex to multiple lines, it will not match more than the first occurrence of the pattern, which is "RON_98". It will stop after the last digit of the first match. You could even force it to stop at the end of a line by using $ at the end of your regex together with the -line option (without -line, $ in Tcl matches only at the end of the string).
For reference, [0-9] can be written more concisely as \d (digit):
(RON)_(\d*)
which is easier to read.