How to refer to the match order in the replace string? - regex

I'm looking for a way to insert each match's position in the sequence of matches into the replacement string. For example, the 1st matched string should get the value 1, the 2nd should get 2, and the nth should get n.
Example for what I'm trying to get
Let's say that I have this original content ...
<"BOY"(GUN)><"GIRL"(BAG)><"SISTERS"(CANDY)><"JOHN"(HAT)>
... and I want it to be manipulated to be like this ...
1
BOY
GUN
2
GIRL
BAG
3
SISTERS
CANDY
4
JOHN
HAT
I already know that I need <"(.*?)"\((.*?)\)> to match each element. For the replacement I think I need something like #MATCH ORDER REFERENCE#\n$1\n$2\n.
Note
I'm using Perl on Windows.

Use the /e modifier to evaluate the replacement. See Regexp Quote-Like Operators.
Then you can increase a counter on each replacement.
Code
my $text = '<"BOY"(GUN)><"GIRL"(BAG)><"SISTERS"(CANDY)><"JOHN"(HAT)>';
my $counter = 1;
$text =~ s/<"([^"]+)"\(([^()]+)\)>/$counter++."\n$1\n$2\n\n"/ge;
print $text;
Output
1
BOY
GUN
2
GIRL
BAG
3
SISTERS
CANDY
4
JOHN
HAT
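For comparison, here is a minimal Python sketch of the same counter idea. It is not part of the Perl answer: `number_match` and `counter` are names made up for this illustration, and a replacement function passed to `re.sub` plays the role of the `/e` modifier.

```python
import itertools
import re

text = '<"BOY"(GUN)><"GIRL"(BAG)><"SISTERS"(CANDY)><"JOHN"(HAT)>'

# Hypothetical helper: itertools.count(1) yields 1, 2, 3, ... on each call to next().
counter = itertools.count(1)

def number_match(match):
    """Build the replacement for one match, prefixed with its match order."""
    return f"{next(counter)}\n{match.group(1)}\n{match.group(2)}\n"

# re.sub calls number_match once per match, left to right.
result = re.sub(r'<"([^"]+)"\(([^()]+)\)>', number_match, text)
print(result, end="")
```

The counter lives outside the function, so running the substitution a second time would continue from 5 unless a fresh counter is created.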

Related

Combine category with code name [DS code format]

Some DS code systems don't readily support categories. Is this expression the most efficient way to programmatically combine the category with code name?
perl -ne '$data = $_ ; $cat = $1 if $data =~ /CAT (.*)/ ; $cde = $1 if $data =~ /CODE \d (.*)/ ; print "$cat, $cde\n" if /CODE \d /' 'Mario Kart DS (USA).mch'
Example 1 - melonDS, Mario Kart DS (USA).mch
CAT Mission 1 Codes
CODE 0 3 Star Rank - Mission 1-1
223D00C4 0000000F
CODE 0 3 Star Rank - Mission 1-2
223D00C5 0000000F
CAT Mission 2 Codes
CODE 0 3 Star Rank - Mission 2-1
223D00CD 0000000F
CAT Mission 3 Codes
CODE 0 3 Star Rank - Mission 3-1
223D00D6 0000000F
Output:
Mission 1 Codes, 3 Star Rank - Mission 1-1
Mission 1 Codes, 3 Star Rank - Mission 1-2
Mission 2 Codes, 3 Star Rank - Mission 2-1
Mission 3 Codes, 3 Star Rank - Mission 3-1
Regex can't capture the CAT and prepend it to CODE. This was the best expression I could come up with:
perl -0777 -pe 's/CAT (.*)(?s).+?(?-s)(?:CODE \d (.*)(?s).+?(?-s))+(?=CAT|CODE|\z)/\1, \2\n/gi' 'Mario Kart DS (USA).mch'
In order to search and replace, I have to capture each group of CODE preceded by CAT. perl -0777 and (?s)(?-s) allows me to slurp the input file and anchor CODE matches to the initial CAT match while stepping across the end of line. I can repeat the CODE match, as capture group 2, but it will only ever get the last one.
The expression above reads like so:
For a line starting with 'CAT ' capture to end of line, step across lines in the least greedy way until we reach CODE. For every group that starts with 'CODE [number] ' capture to the end of line, then step across lines until reaching either CAT, CODE, or the end of file. Repeat the code group as many times as possible.
With example above, this is the output:
Mission 1 Codes, 3 Star Rank - Mission 1-2
Mission 2 Codes, 3 Star Rank - Mission 2-1
Mission 3 Codes, 3 Star Rank - Mission 3-1
Debating what is most efficient or not is perhaps not too interesting in this case. If you have a solution that works, that should perhaps suffice.
Here is another solution, based on paragraph mode.
-00: sets the input record separator to the empty string ($/ = ''), which enables paragraph mode. Records are then separated by one or more blank lines.
-l automatic chomp
-E enable say (since there is an interaction with print and -l)
Then just store the header if /^CAT/, else clean up and print.
$ perl -00 -nlwE'if (s/^CAT //) { $k = $_ } else { s/^CODE \d+ //; s/\n.*//; say "$k, $_"; }' mission.txt
Mission 1 Codes, 3 Star Rank - Mission 1-1
Mission 1 Codes, 3 Star Rank - Mission 1-2
Mission 2 Codes, 3 Star Rank - Mission 2-1
Mission 3 Codes, 3 Star Rank - Mission 3-1
As a file:
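use strict;
use warnings;
use feature 'say';
# Reads the records from a __DATA__ section appended to this file.
$/ = '';    # paragraph mode: records are separated by blank lines
my $key;
while (<DATA>) {
    chomp;
    if (s/^CAT //) {
        $key = $_;          # remember the current category
    } else {
        s/^CODE \d+ //;     # strip the "CODE n " prefix
        s/\n.*//;           # drop the code line that follows the name
        say "$key, $_";
    }
}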
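The "store the header, then print each entry with it" logic can also be sketched line by line, without relying on blank lines between records. This is a Python illustration under that assumption, not the answer's method; `combine` is a hypothetical name and the sample data is Example 1 from the question.

```python
import re

SAMPLE = """\
CAT Mission 1 Codes
CODE 0 3 Star Rank - Mission 1-1
223D00C4 0000000F
CODE 0 3 Star Rank - Mission 1-2
223D00C5 0000000F
CAT Mission 2 Codes
CODE 0 3 Star Rank - Mission 2-1
223D00CD 0000000F
CAT Mission 3 Codes
CODE 0 3 Star Rank - Mission 3-1
223D00D6 0000000F
"""

def combine(lines):
    """Pair every CODE name with the most recent CAT line seen before it."""
    category = None
    combined = []
    for line in lines:
        if line.startswith("CAT "):
            category = line[len("CAT "):]      # remember the current category
            continue
        code = re.match(r"CODE \d+ (.*)", line)
        if code:                               # raw code lines are skipped
            combined.append(f"{category}, {code.group(1)}")
    return combined

pairs = combine(SAMPLE.splitlines())
print("\n".join(pairs))
```

Keeping the category in a variable is what gives the "one category, many codes" behaviour that a single substitution could not express.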
To elaborate on the initial question, it's important to note that I know some regex and no Perl, so I don't know what an efficient Perl expression looks like. From my experience, regular expressions are great at capturing 'one this or one that' but we need 'one this and many that'.
If I were talking about the title of a book chapter and each subsequent paragraph, the goal would be to merge the title as the first sentence of each paragraph for each chapter.
A regular expression could capture the title and indent of each paragraph but must limit itself to one chapter at a time. The title becomes capture group 1 while the paragraphs are capture group 2. We can't have 'one and many'; 'one or the other' would return all chapters and paragraphs (as capture group 1 or 2) but wouldn't allow them to be merged together.
Perl language allows this simply by storing the title in a variable to be added as part of the substitution for each paragraph. Since the title occurs first, and only once, per chapter, it can easily be merged in a 'one this many that' situation.
The initial example was flawed in that it was extracting information when it should have removed the categories and merged them with the code names. With that goal, an expression like this would suffice:
perl -pe '$cat = $1 if s/(?:^CAT ([^\v]+).*\n)// ; s/(^CODE \d )/$1$cat, /'
For the non-capture group (?:...) that starts with 'CAT ' store every character that doesn't match the end of line [vertical whitespace] ([^\v]+) up to the end of line .*\n (which captures all modern line endings for Win, MacOS X+, and Linux since each ends in \n or linefeed) and remove the entire match including the final linefeed //. This expression captures the category while removing the line.
The next expression (separated by semicolon) captures the phrase 'CODE # ' (^CODE \d ), for each line that matches, then repeats the phrase /$1$cat, / while adding the result of the category variable. This is the result for Example 1:
CODE 0 Mission 1 Codes, 3 Star Rank - Mission 1-1
223D00C4 0000000F
CODE 0 Mission 1 Codes, 3 Star Rank - Mission 1-2
223D00C5 0000000F
CODE 0 Mission 2 Codes, 3 Star Rank - Mission 2-1
223D00CD 0000000F
CODE 0 Mission 3 Codes, 3 Star Rank - Mission 3-1
223D00D6 0000000F
Unfortunately, the melonDS code format insists there be at least one category for the file to be read properly so we'd have to add something generic back in on the first line e.g., CAT Cheats.
A better use case would be a RetroArch formatted cheat file since it doesn't directly support categories. The cheat files that ship with the program use a trick to simulate this in the form of a numbered cheat description that lacks a subsequent code.
Example 2: RetroArch, Mario Kart DS (USA).cht
cheats = 514
cheat0_desc = "Misc Codes"
cheat1_desc = "Freeze Time"
cheat1_code = "621755FC+00000000+B21755FC+00000000+10000000+00000000+D2000000+00000000"
cheat1_enable = false
cheat2_desc = "Start for Final Lap"
cheat2_code = "94000130+FFF70000+023CDD3F+00000001+D2000000+00000000"
cheat2_enable = false
With this expression:
perl -0777 -pe 's|(cheat(\d+)_)desc(?=.*\n(?!cheat\2_code))|\1cat|gi' 'Mario Kart DS (USA).cht' | perl -pe '$cat = $1 if s/(?:^cheat\d+_cat = \"(.*)\".*\n)// ; s/(^cheat\d+_desc = \")/$1$cat, /'
The result is:
cheats = 514
cheat1_desc = "Misc Codes, Freeze Time"
cheat1_code = "621755FC+00000000+B21755FC+00000000+10000000+00000000+D2000000+00000000"
cheat1_enable = false
cheat2_desc = "Misc Codes, Start for Final Lap"
cheat2_code = "94000130+FFF70000+023CDD3F+00000001+D2000000+00000000"
cheat2_enable = false
The expression, from a high level, slurps the input file and for each numbered cheat description cheat0_desc that is not immediately followed by a cheat code name cheat0_code we rename it from cheat0_desc to cheat0_cat then send the changes to the next expression (basically a repeat of the one shown above) that replaces on 'cheat#_desc = "' with itself and the category.
I feel the question was valuable but poorly asked due to lack of knowledge and the continuing learning process.
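As a rough cross-check of the two-step Perl pipeline, the same rule can be sketched in Python in a single pass: a cheatN_desc line that is not immediately followed by cheatN_code is treated as a category and merged into the descriptions after it. `merge_categories` is a hypothetical name, and the input is the Example 2 excerpt.

```python
import re

SAMPLE = """\
cheats = 514
cheat0_desc = "Misc Codes"
cheat1_desc = "Freeze Time"
cheat1_code = "621755FC+00000000+B21755FC+00000000+10000000+00000000+D2000000+00000000"
cheat1_enable = false
cheat2_desc = "Start for Final Lap"
cheat2_code = "94000130+FFF70000+023CDD3F+00000001+D2000000+00000000"
cheat2_enable = false
"""

def merge_categories(text):
    """Drop category-only descriptions and prepend them to later descriptions."""
    lines = text.splitlines()
    merged_lines = []
    category = None
    for index, line in enumerate(lines):
        desc = re.match(r'cheat(\d+)_desc = "(.*)"', line)
        if not desc:
            merged_lines.append(line)          # code/enable/count lines pass through
            continue
        number, name = desc.groups()
        following = lines[index + 1] if index + 1 < len(lines) else ""
        if following.startswith(f"cheat{number}_code"):
            prefix = f"{category}, " if category else ""
            merged_lines.append(f'cheat{number}_desc = "{prefix}{name}"')
        else:
            category = name                    # a description without a code is a category
    return "\n".join(merged_lines)

merged = merge_categories(SAMPLE)
print(merged)
```

Like the Perl version, this sketch leaves the cheats = 514 count and the cheat numbering untouched.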

Regexmatch to find all string cells that match multiple words

I'm using ArrayFormula and FILTER combination to list all cells in a column that contain all of the search term words. I'm using REGEXMATCH rather than QUERY/CONTAINS/LIKE because my FILTER has other criteria that return TRUE/FALSE.
My problem seems to be precedence. So the following regex works in a limited way.
=ArrayFormula(filter(A1:A5,regexmatch(A1:A5,"(?i)^"&"(.*?\bbob\b)(.*?\bcat\b)"&".*$")))
It will find Bob and cat but only if Bob precedes cat.
Google sheets fails if I try to use lookahead ?= ie
=ArrayFormula(filter(A1:A5,regexmatch(A1:A5,"(?i)^"&"(?=.*?\bbob\b)(?=.*?\bcat\b)"&".*$")))
I don't want to use the '|' alternation in the string (repeat and reverse) as the input words may be many more than two so alternation becomes exponentially more complex.
Here's the test search array (each row is a single cell containing a string)...
Bob ate the dead cat
The cat ate live bob
No cat ate live dog
Bob is dead
Bob and the cat are alive
... and the desired results I'm after.
Bob ate the dead cat
The cat ate live bob
Bob and the cat are alive
Once I have the regex sorted out, the final solution will be a user input text box where they simply enter the words that must be found in a string ie 'Bob cat'. This input string I think I can unpack into its separate words and concatenate to the above expression, however, if there's a 'best practice' way of doing this I'd like to hear.
Find 2 strings
Try:
=FILTER(A:A,REGEXMATCH(A:A,"(?i)bob.*cat|cat.*bob"))
You don't need to use ArrayFormula because FILTER is itself an array formula.
(?i) - to make search case insensitive
bob.*cat|cat.*bob - match "bob→cat" or "cat→bob"
Find multiple strings
There's a more complex formula for matching more than 2 words.
Suppose we have a list in column A:
Bob ate the dead cat
The cat ate live bob
No cat ate live dog
Bob is dead
Bob and the cat are alive
Cat is Bob
ate Cat bob
And need to find all matches of 3 words, put them in column C:
cat
ate
bob
The formula is:
=FILTER(A:A,MMULT(--REGEXMATCH(A:A,
"(?i)"&TRANSPOSE(C1:C3)),ROW(INDIRECT("a1:a"&COUNTA(C1:C3)))^0)=COUNTA(C1:C3))
It uses RegexMatch of transposed list of words C1:C3, and then mmult function sums matches and =COUNTA(C1:C3) compares the number of matches with the number of words in list.
The result is:
Bob ate the dead cat
The cat ate live bob
ate Cat bob
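The counting logic behind the MMULT formula (a row passes only if the number of matched words equals the number of words in the list) can be sketched outside Sheets. The following Python illustration is an assumption-laden analogue, not a Sheets feature: `contains_all` is a hypothetical name, and unlike the formula it escapes each word so regex metacharacters are taken literally.

```python
import re

phrases = [
    "Bob ate the dead cat",
    "The cat ate live bob",
    "No cat ate live dog",
    "Bob is dead",
    "Bob and the cat are alive",
    "Cat is Bob",
    "ate Cat bob",
]
words = ["cat", "ate", "bob"]

def contains_all(phrase, search_words):
    """True when every search word occurs somewhere in the phrase, ignoring case."""
    hits = sum(
        1 for word in search_words
        if re.search(re.escape(word), phrase, re.IGNORECASE)
    )
    return hits == len(search_words)   # same comparison as "=COUNTA(C1:C3)"

matches = [phrase for phrase in phrases if contains_all(phrase, words)]
print("\n".join(matches))
```

Because each word is tested independently, the order of the words in the phrase does not matter and adding more words does not require more alternations.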
See if this does what you want. In B1 enter:
=arrayformula(filter(A1:A5,regexmatch(lower(A1:A5),lower(index(split(C2," "),0,1)))*regexmatch(lower(A1:A5),lower(index(split(C2," "),0,2)))))
In C2 enter your search words with a space between them (cat Bob).
All words are changed to lower case. The index split separates the words in C2 and the separate words go in the regexmatch. Below is my shared test spreadsheet:
https://docs.google.com/spreadsheets/d/1sDNnSeqHbi0vLosxhyr8t8KXa3MzWC_WJ26eSVNnG80/edit?usp=sharing
Expanding on Max's very good answer, this will change the formula for the list of words in column C. I added an example to the shared spreadsheet (Sheet2).
=FILTER(A:A,MMULT(--REGEXMATCH(A:A,"(?i)"&TRANSPOSE(INDIRECT( "C1:C" & counta(C1:C ) ))),ROW(INDIRECT("a1:a"&COUNTA(INDIRECT( "C1:C" & counta(C1:C ) ))))^0)=COUNTA(INDIRECT( "C1:C" & counta(C1:C ) )))
Maybe a bit easier to understand (I hate MMULT)
=query({A1:A},"select Col1 where "&join(" and ",arrayformula("Col1 matches '.*"&filter(B:B,B:B<>"")&".*'")))
Where A contains your list of phrases and B contains your criteria words.
This part of the formula, =join(" and ",arrayformula("Col1 matches '.*"&filter(B:B,B:B<>"")&".*'")), builds a query string from the terms in B. For example:
Col1 matches '.*cats.*' and Col1 matches '.*dogs.*'
This string then gets concatenated into the whole "select" expression:
select Col1 where Col1 matches '.*cats.*' and Col1 matches '.*dogs.*'

How can I identify a number of variable length with a regex?

I need a Perl regex to pull a number of between six and ten digits out of a string. The number will always follow a particular word followed by a space (case unknown).
For example, if the word I was looking for is 'string':
some random text blah blah blahSTRING 1234567890some more random text
Desired output:
1234567890
Another example:
yet more random textra ra rastring 654321hey hey my my
Desired output:
654321
I want to load the result into a variable.
/string ([0-9]{6,10})/i
string matches both STRING and string, because the expression ends with i (case-insensitive matching)
the space after string matches a single space
( starts a capture group to capture the number you are trying to get
[0-9]{6,10} matches a number with 6 to 10 digits
https://regex101.com/r/mB1zF4/1
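To show the pattern loading its capture group into a variable, here is a small Python sketch using the two example strings from the question. `extract_number` and its `word` parameter are hypothetical names chosen for this illustration; the question itself asks for Perl.

```python
import re

def extract_number(text, word="string"):
    """Return the 6 to 10 digit number that follows `word` and a space, or None."""
    pattern = re.escape(word) + r" ([0-9]{6,10})"
    found = re.search(pattern, text, re.IGNORECASE)
    return found.group(1) if found else None

first = extract_number(
    "some random text blah blah blahSTRING 1234567890some more random text"
)
second = extract_number("yet more random textra ra rastring 654321hey hey my my")
print(first)
print(second)
```

A run of fewer than six digits after the word gives no match at all, so the function returns None in that case.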
Group 1 should contain your number with
/^.*string (\d+).*$/i
Thanks everyone, between all the responses and a bit of googling I ended up with
#!/usr/local/bin/perl -w
use strict;
my $string = 'sgtusadl;fdsas;adlhstring 12345678daf;slkdfja;dflk';
my ( $number ) = $string =~ m/string\s\d{6,10}/gi;
$number =~ s/[^0-9]//g;
print "number is $number\n";
exit 0;

Regex: match when string has repeated letter pattern

I'm using the Regex interpreter found in XYplorer file browser. I want to match any string (in this case a filename) that has repeated groups of 'several' characters. More specifically, I want a match on the string:
jack johnny - mary joe ken johnny bill
because it has 'johnny' at least twice. Note that it has spaces and a dash too.
It would be nice to be able to specify the length of the group to match, but in general 4, 5 or 6 will do.
I have looked at several previous questions here, but either they are for specific patterns or involve some language as well. The one that almost worked is:
RegEx: words with two letters repeated twice (eg. ABpoiuyAB, xnvXYlsdjsdXYmsd)
where the answer was:
\b\w*(\w{2})\w*\1
However, this fails when there are spaces in the strings.
I'd also like to limit my searches to .jpg files, but XYplorer has a built-in filter to only look at image files so that isn't so important to me here.
Any help will be appreciated, thanks.
EDIT -
The regex by OnlineCop below answered my original question, thanks very much:
(\b\w+.*\b).*(\1)
I see that it matches words, not arbitrary string chunks, but that works for my present need. And I am not interested in capturing anything, just in detecting a match.
As a refinement, I wonder if it can be changed or extended to allow me to specify the length of words (or string chunks) that must be the same in order to declare a match. So, if I specified a match length of 5 and my filenames are:
1) jack john peter paul mary johnnie.jpg
2) jack johnnie peter paul mary johnnie.jpg
the first one would not match since no substring of five characters or more is repeated. The second one would match since 'johnnie' is repeated and is more than 5 chars long.
Do you wish to capture the word 'johnny' or the stuff between them (or both)?
This example shows that it selects everything from the first 'johnny' to the last, but it does not capture the stuff between:
Re: (\b\w+\b).*(\1)
Result: jack bill
This example allows some whitespace between names/words:
Re: (\b\w+.*\b).*(\1)
String: Jackie Chan fought The Dragon who was fighting Jackie Chan
Result: Jackie Chan Jackie Chan
Use perl:
#!/usr/bin/perl
use strict;
use warnings;
while ( my $line = <STDIN> ) {
    chomp $line;
    # Split the line into whitespace-separated words
    my @words = split ( /\s+/, $line );
    my %seen;
    foreach my $word ( @words ) {
        # Report the line as soon as any word shows up a second time
        if ( $seen{$word} ) { print "Match: $line\n"; last }
        $seen{$word}++;
    }
}
And yes, it's not as neat as a one line regexp, but it's also hopefully a bit clearer what's going on.
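For the follow-up about requiring a minimum length, one possible approach is to put a counted quantifier inside the captured group before the backreference. The sketch below is in Python and makes two assumptions: the regex engine in use supports `{n,}` and `\1`, and "chunk" means a run of word characters. `has_repeated_chunk` and `min_len` are hypothetical names.

```python
import re

def has_repeated_chunk(name, min_len=5):
    """True if some run of at least `min_len` word characters occurs twice in `name`."""
    # (\w{5,}) captures a chunk of 5 or more word characters,
    # .* skips anything (spaces and dashes included), \1 requires the same chunk again.
    pattern = re.compile(r"(\w{%d,}).*\1" % min_len)
    return pattern.search(name) is not None

print(has_repeated_chunk("jack john peter paul mary johnnie.jpg"))
print(has_repeated_chunk("jack johnnie peter paul mary johnnie.jpg"))
```

This matches chunks rather than whole words, so with a minimum length of 4 the first filename would also match because "john" reappears inside "johnnie".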

perl regexp with sas - exact match one or the other

I need to extract numbers written in words or in figures in a text.
I have a table that looks like that,
... 1 child ...
... three children ...
...four children ...
...2 children...
...five children
I want to capture a number written in words or in numeric figures. There is one number per line. So the desired output would be:
1
three
four
2
five
My regex looks like this:
prxparse("/one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|child|\d\d?/")
Any help?
Description
This regex will match the numbers in the string, provided each number is bounded on both sides by whitespace or by the start or end of the string.
(?<=\s|^)(?:[0-9]+|one|two|three|four|five|six|seven|eight|nine|ten)(?=\s|$)
Live Example: http://www.rubular.com/r/6ua7fTb8IS
To match spelled-out numbers beyond one to ten, you'll need to add those words to the alternation. This regex will capture the numbers from zero to one hundred [barring any typos]
(?<=\s|^)(?:[0-9]+|(?:(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)\s)?(?:one(?:[\s-]hundred)?|two|three|four|five|six|seven|eight|nine)|ten|eleven|twelve|(?:thir|four|fif|six|seven|eight|nine)teen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|zero)(?=\s|$)
Live Example: http://www.rubular.com/r/EIa18nx731
Perl Example
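use strict;
use warnings;

my $string = <<END;
... 1 child ...
... three children ...
... four children ...
... 2 children...
... five children
END
# (?:...) groups the alternatives so both boundary checks apply to every one of them.
# (?<!\S) and (?!\S) stand in for (?<=\s|^) and (?=\s|$); Perl versions before 5.30 may reject the variable-length lookbehind.
my @matches = $string =~ m/(?<!\S)(?:[0-9]+|one|two|three|four|five|six|seven|eight|nine|ten)(?!\S)/gi;
print join("\n", @matches);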
Yields
1
three
four
2
five
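The same extraction can be sketched in Python, taking one match per line as the question states. This is an illustration rather than SAS code: the `NUMBER` pattern swaps in `(?<!\S)` and `(?!\S)` because Python's `re` module only accepts fixed-width lookbehind, and the sample rows are the ones from the Perl example.

```python
import re

# One alternative per spelled-out number, grouped so the boundary checks
# on both sides apply to every alternative.
NUMBER = re.compile(
    r"(?<!\S)(?:[0-9]+|one|two|three|four|five|six|seven|eight|nine|ten)(?!\S)",
    re.IGNORECASE,
)

rows = [
    "... 1 child ...",
    "... three children ...",
    "... four children ...",
    "... 2 children...",
    "... five children",
]

found = []
for row in rows:
    match = NUMBER.search(row)      # the question states one number per line
    if match:
        found.append(match.group(0))

print("\n".join(found))
```

Without the grouping, the leading check would bind only to the digits and the trailing check only to "ten", which lets words such as "often" produce a match.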