Regex that extracts a string whose length is encoded in the string itself - regex

I have the following string to parse:
X4IitemX6Nabc123
that is structured as follows:
X... marker for 'field identifier'
4... length of item (name), will change according to length of item name
I... identifier for item name, must not be extracted, fixed
item... value that should be extracted as "name"
X... marker for 'field identifier'
6... length of item number, will change according to length of item number
N... identifier for item number, must not be extracted, fixed
abc123... value that should be extracted as "num"
Only these two values will be contained in the string, and the sequence is always the same (name, number).
What I have so far is
\AX(?<namelen>\d+)I(?<name>.+)X(?<numlen>\d+)N(?<num>.+)$
But that does not take into account that the length of the name is contained in the string itself. Somehow the .+ in the name group should be replaced by .{4}. I tried {$1} and {${namelen}}, but neither yields the result I expect (on rubular.com or regex101.com).
Any ideas or further references?

What you ask for is only possible in languages that allow code insertions in the regex pattern.
Here is a Perl example:
#!/usr/bin/perl
use warnings;
use strict;
my $text = "X4IitemX6Nabc123";
if ($text =~ m/^X(?<namelen>[0-9]+)I(?<name>(??{".{".$^N."}"}))X(?<numlen>[0-9]+)N(?<num>.+)$/) {
    print $text . ": PASS!\n";
} else {
    print $text . ": FAIL!\n";
}
# -> X4IitemX6Nabc123: PASS!
In other languages, use a two-step approach:
- Extract the number after X.
- Build a regex dynamically using the result of the first step.
See a JavaScript example:
const text = "X4IitemX6Nabc123";
const rx1 = /^X(\d+)/;
const m1 = rx1.exec(text);
if (m1) {
  const rx2 = new RegExp(`^X(?<namelen>\\d+)I(?<name>.{${m1[1]}})X(?<numlen>\\d+)N(?<num>.+)$`);
  if (rx2.test(text)) {
    console.log(text, '-> MATCH!');
  } else console.log(text, '-> FAIL!');
} else {
  console.log(text, '-> FAIL!');
}
See the Python demo:
import re
text = "X4IitemX6Nabc123"
rx1 = r'^X(\d+)'
m1 = re.search(rx1, text)
if m1:
    rx2 = fr'^X(?P<namelen>\d+)I(?P<name>.{{{m1.group(1)}}})X(?P<numlen>\d+)N(?P<num>.+)$'
    if re.search(rx2, text):
        print(text, '-> MATCH!')
    else:
        print(text, '-> FAIL!')
else:
    print(text, '-> FAIL!')
# => X4IitemX6Nabc123 -> MATCH!
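The demos above enforce only the first length (namelen); numlen is captured but never applied. A minimal Python sketch that applies the same two-step idea to every field in turn might look like the following. The helper name parse_fields, the HEADER pattern, and the dictionary return shape are illustrative assumptions, not part of the original answer.

```python
import re

# One field header: 'X', a decimal length, then a one-letter identifier.
HEADER = re.compile(r'X(\d+)([A-Z])')

def parse_fields(text):
    """Return {identifier: value} for a string of X<len><id><value> fields.

    Hypothetical helper: raises ValueError if a header is malformed or a
    value is shorter than its declared length.
    """
    fields = {}
    pos = 0
    while pos < len(text):
        m = HEADER.match(text, pos)
        if m is None:
            raise ValueError(f"bad field header at offset {pos}")
        length = int(m.group(1))
        # Step 2: use the length just read to slice out the value.
        value = text[m.end():m.end() + length]
        if len(value) != length:
            raise ValueError(f"field {m.group(2)!r} is shorter than declared")
        fields[m.group(2)] = value
        pos = m.end() + length
    return fields

print(parse_fields("X4IitemX6Nabc123"))
# {'I': 'item', 'N': 'abc123'}
```

Because each value is skipped by its declared length, a value that itself contains an X does not confuse the parser.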

Related

How to remove an ID from a string

I have a string that looks like this, they are ids in a table:
1,2,3,4,5,6,7,8,9
If someone deletes something from the database, I need to update the string. The code below removes the value but not the commas. How can I check whether the id has a comma before and after it so that my string doesn't break?
$new_values = $original_values[0];
$new_values =~ s/$car_id//;
Result: 1,2,,4,5,6,7,8,9 using the above sample (bad). It should be 1,2,4,5,6,7,8,9.
To remove the $car_id from the string:
my $car_id = 3;
my $new_values = q{1,2,3,4,5,6,7,8,9};
$new_values = join q{,}, grep { $_ != $car_id }
split /,/, $new_values;
say $new_values;
# Prints:
# 1,2,4,5,6,7,8,9
If you already removed the id(s), and you need to remove the extra commas, reformat the string like so:
my $new_values = q{,,1,2,,4,5,6,7,8,9,,,};
$new_values = join q{,}, grep { /\d/ } split /,/, $new_values;
say $new_values;
# Prints:
# 1,2,4,5,6,7,8,9
You can use
s/^$car_id,|,$car_id\b//
Details
^ - start of string
$car_id - variable value
, - comma
| - or
, - comma
$car_id - variable value
\b - word boundary.
If $car_id may contain regex metacharacters, quote it with \Q...\E:
s/^\Q$car_id\E,|,\Q$car_id\E\b//
Another approach is to store an extra leading and trailing comma (,1,2,3,4,5,6,7,8,9,)
The main benefit is that it makes it easier to search for the id using SQL (since you can search for ,$car_id,). Same goes for editing it.
On the Perl side, you'd use
s/,\K\Q$car_id\E,// # To remove
substr($_, 1, -1) # To get actual string
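The comma-wrapped storage idea can be sketched in Python as well. This is only an illustration under the assumption that the stored string always starts and ends with a comma; the function names remove_id and unwrap are made up for this example.

```python
def remove_id(wrapped, car_id):
    """Remove one id from a comma-wrapped list such as ',1,2,3,'."""
    token = f",{car_id},"
    # Replace ',id,' with a single ',' so the wrapping commas stay intact.
    return wrapped.replace(token, ",", 1)

def unwrap(wrapped):
    """Strip the leading and trailing comma to get the plain list."""
    return wrapped[1:-1]

stored = ",1,2,3,4,5,6,7,8,9,"
stored = remove_id(stored, 3)
print(stored)          # ,1,2,4,5,6,7,8,9,
print(unwrap(stored))  # 1,2,4,5,6,7,8,9
```

Searching for the id with both commas attached also means that removing 3 cannot accidentally touch 13 or 31.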
Ugly way: use regex to remove the value, then simplify
$new_values = $original_values[0];
$new_values =~ s/$car_id//;
$new_values =~ s/,+/,/;
Nice way: split and merge
$new_values = $original_values[0];
my @values = split(/,/, $new_values);
my $index = 0;
$index++ until $values[$index] eq $car_id;
splice(@values, $index, 1);
$new_values = join(',', @values);

Use dict to replace word in string

I'm trying to replace one part of my string using a dict.
s = 'I am a string replaceme'
d = {
'replaceme': 'replace me'
}
I've tried lots of variations like
s = s.replace(d, d[other])
That throws NameError: name 'other' is not defined. If I do
s = s.replace('replaceme', 'replace me')
It works. How can I achieve my goal?
You have to replace each KEY of your dict with its associated VALUE. What value does the variable other hold? Is it a valid KEY of your substitutions dict?
You can try with this solution.
for k in d:
    s = s.replace(k, d[k])
Each key in the dictionary is the text to be replaced, and d[k] is the corresponding replacement VALUE.
If the dictionary is big, this loop may perform poorly, because every key triggers a full pass over the string.
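One way to avoid a separate pass per key is to compile all keys into a single alternation and substitute in one pass. The sketch below is an assumption about what would suit a large dictionary, not something the answer above proposes; replace_all is a hypothetical helper name.

```python
import re

def replace_all(text, mapping):
    """Replace every key of mapping found in text with its value, in one pass."""
    if not mapping:
        return text
    # Longest keys first so a key is not shadowed by one of its own prefixes.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # The callback looks up the replacement for whichever key matched.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

s = 'I am a string replaceme'
d = {'replaceme': 'replace me'}
print(replace_all(s, d))  # I am a string replace me
```

Unlike chained str.replace calls, text inserted by one replacement is not scanned again, so a mapping such as {'a': 'b', 'b': 'a'} swaps the two letters instead of collapsing them.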
You could split the string and rejoin:
s = 'I am a string replaceme'
d = {
'replaceme': 'replace me'
}
print(" ".join([w if w not in d else d[w] for w in s.split(" ")]))
That won't match substrings the way str.replace does. If you need to match substrings, iterate over d.items() and replace each key with its value:
d = {
'replaceme': 'replace me'
}
for k, v in d.items():
    s = s.replace(k, v)
print(s)
# I am a string replace me
Here is a different approach, using reduce (in Python 3 it must be imported from functools):
from functools import reduce
s = 'I am a string replaceme'
d = {'replaceme': 'replace me', 'string': 'phrase,'}
s = reduce(lambda text, old_new_pair: text.replace(*old_new_pair), d.items(), s)
# s is now 'I am a phrase, replace me'

RegEx and split camelCase

I want to get an array of all the words with capital letters that are included in the string. But only if the line begins with "set".
For example:
- string "setUserId", result array("User", "Id")
- string "getUserId", result false
Without the "set" restriction, the regex looks like /([A-Z][a-z]+)/
$str ='setUserId';
$rep_str = preg_replace('/^set/','',$str);
if ($str != $rep_str) {
    $array = preg_split('/(?<=[a-z])(?=[A-Z])/', $rep_str);
    var_dump($array);
}
Your regex will also work:
$str = 'setUserId';
if(preg_match('/^set/',$str) && preg_match_all('/([A-Z][a-z]*)/',$str,$match)) {
    var_dump($match[1]);
}
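For readers who want the same logic outside PHP, here is a rough Python sketch: check the "set" prefix first, then collect the capitalized words. The function name setter_words is hypothetical, and returning False for non-setters simply mirrors the behaviour requested in the question.

```python
import re

def setter_words(name):
    """Return the capitalized words of a 'set...' name, or False otherwise."""
    if not name.startswith("set"):
        return False
    # Each word is one uppercase letter followed by any lowercase letters.
    return re.findall(r"[A-Z][a-z]*", name[3:])

print(setter_words("setUserId"))  # ['User', 'Id']
print(setter_words("getUserId"))  # False
```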

In Perl, how many groups are in the matched regex?

I would like to tell the difference between a number 1 and string '1'.
The reason that I want to do this is because I want to determine the number of capturing parentheses in a regular expression after a successful match. According to the perlop doc, a list (1) is returned when there are no capturing groups in the pattern. So if I get a successful match and a list (1), I cannot tell whether the pattern has no parens or has one paren that matched a '1'. I can resolve that ambiguity if there is a difference between number 1 and string '1'.
You can tell how many capturing groups are in the last successful match by using the special @+ array. $#+ is the number of capturing groups. If that's 0, then there were no capturing parentheses.
For example, bitwise operators behave differently for strings and integers:
~1 = 18446744073709551614
~'1' = Î ('1' = 0x31, ~'1' = ~0x31 = 0xce = 'Î')
#!/usr/bin/perl
($b) = ('1' =~ /(1)/);
print isstring($b) ? "string\n" : "int\n";
($b) = ('1' =~ /1/);
print isstring($b) ? "string\n" : "int\n";
sub isstring {
    return ($_[0] & ~$_[0]);
}
isstring returns either 0 (the result of a numeric bitwise op), which is false, or "\0" (the result of a bitwise string op, see perldoc perlop), which is true because it is a non-empty string.
If you want to know the number of capture groups a regex matched, just count them. Don't look at the values they return, which appears to be your problem.
You can get the count by looking at the result of the list assignment, which in scalar context returns the number of items on its right-hand side:
my $count = my @array = $string =~ m/.../g;
If you don't need to keep the capture buffers, assign to an empty list:
my $count = () = $string =~ m/.../g;
Or do it in two steps:
my @array = $string =~ m/.../g;
my $count = @array;
You can also use the @+ or @- variables, using some of the tricks I show in the first pages of Mastering Perl. These arrays have the starting and ending positions of each of the capture buffers. The values in index 0 apply to the entire pattern, the values in index 1 are for $1, and so on. The last index, then, is the total number of capture buffers. See perlvar.
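For comparison only, the same principle (ask the regex engine how many groups there are instead of inspecting the captured values) can be sketched in Python, where a compiled pattern exposes its group count directly. This is a parallel illustration, not a Perl technique.

```python
import re

pattern = re.compile(r"foo(1)?(bar)?")
# The number of capturing groups is a property of the pattern itself.
print(pattern.groups)  # 2

m = pattern.match("foo1")
# Groups that did not participate in the match are reported as None.
print(m.groups())  # ('1', None)

# A pattern without parentheses reports zero groups.
print(re.compile(r"foo").groups)  # 0
```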
Perl converts between strings and numbers automatically as needed. Internally, it tracks the values separately. You can use Devel::Peek to see this in action:
use Devel::Peek;
$x = 1;
$y = '1';
Dump($x);
Dump($y);
The output is:
SV = IV(0x3073f40) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 1
SV = PV(0x30698cc) at 0x3073484
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x3079bb4 "1"\0
CUR = 1
LEN = 4
Note that the dump of $x has a value for the IV slot, while the dump of $y doesn't but does have a value in the PV slot. Also note that simply using the values in a different context can trigger stringification or numification and populate the other slots. For example, if you did $x . '' or $y + 0 before peeking at the value, you'd get this:
SV = PVIV(0x2b30b74) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 1
PV = 0x3079c5c "1"\0
CUR = 1
LEN = 4
At which point 1 and '1' are no longer distinguishable at all.
Check for the definedness of $1 after a successful match. The logic goes like this:
If the list is empty then the pattern match failed
Else if $1 is defined then the list contains all the captured substrings
Else the match was successful, but there were no captures
Your question doesn't make a lot of sense, but it appears you want to know the difference between:
$a = "foo";
@f = $a =~ /foo/;
and
$a = "foo1";
@f = $a =~ /foo(1)?/;
Since they both return the same thing regardless of whether a capture was made.
The answer is: don't try to use the returned array. Check whether $1 is not equal to "".

Remove a number from a comma separated string while properly removing commas

FOR EXAMPLE: Given a string... "1,2,3,4"
I need to be able to remove a given number and the comma after/before depending on if the match is at the end of the string or not.
remove(2) = "1,3,4"
remove(4) = "1,2,3"
Also, I'm using javascript.
As jtdubs shows, an easy way is to use a split function to obtain an array of elements without the commas, remove the required element from the array, and then rebuild the string with a join function.
For JavaScript, something like this might work:
function remove(array, to_remove)
{
  var elements = array.split(",");
  var remove_index = elements.indexOf(to_remove);
  // indexOf returns -1 when the value is absent; splice(-1, 1) would then drop the last element
  if (remove_index !== -1) {
    elements.splice(remove_index, 1);
  }
  return elements.join(",");
}
var string = "1,2,3,4,5";
var newstring = remove(string, "4"); // newstring will contain "1,2,3,5"
document.write(newstring + "<br>");
newstring = remove(string, "5"); // newstring will contain "1,2,3,4"
document.write(newstring + "<br>");
You also need to consider the behavior you want if there are repeats. Say the string is "1,2,2,4" and I call remove(2): should it remove both instances or just the first? This function removes only the first instance.
Just use multiple substitutions.
s/^$removed,//;
s/,$removed$//;
s/,$removed,/,/;
This will be easier than trying to invent a single replacement that handles all those cases.
string input = "1,2,3,4";
List<string> parts = new List<string>(input.Split(new char[] { ',' }));
parts.RemoveAt(2);
string output = String.Join(",", parts);
Instead of using regex, I would do something like:
- split on comma
- delete the right element
- join with comma
Here is a perl script that does the job:
#!/usr/bin/perl
use 5.10.1;
use strict;
use warnings;
my $toremove = 5;
my $string = "1,2,3,4,5";
my @tmp = split /,/, $string;
@tmp = grep { $_ != $toremove } @tmp;
$string = join ',', @tmp;
say $string;
Output:
1,2,3,4
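The same split, delete, join steps can be sketched in Python. The helper name remove_item and the all_occurrences flag are assumptions added for illustration; the flag makes the repeated-value question raised earlier an explicit choice.

```python
def remove_item(csv, item, all_occurrences=False):
    """Remove item from a comma-separated string and rejoin the rest."""
    parts = csv.split(",")
    target = str(item)
    if all_occurrences:
        parts = [p for p in parts if p != target]
    elif target in parts:
        parts.remove(target)  # list.remove drops only the first occurrence
    return ",".join(parts)

print(remove_item("1,2,3,4", 2))  # 1,3,4
print(remove_item("1,2,3,4", 4))  # 1,2,3
print(remove_item("1,2,2,4", 2, all_occurrences=True))  # 1,4
```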
Javascript has improved since this question was posted.
I use the following regex to remove items from a CSV string:
let searchStr = "359";
let regex = new RegExp("^" + searchStr + ",?|," + searchStr);
csvStr = csvStr.replace(regex, "");
Whether searchStr is the first, a middle, the last, or the only item, it is replaced.
If searchStr is at the start of csvStr, it and any trailing comma are replaced. Otherwise searchStr must be preceded by a comma, and searchStr together with that comma is replaced by an empty string.
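As written, the pattern has no check on what follows searchStr, so ",359" would also match inside ",3591". A hedged Python sketch of the same single-regex idea with that boundary added is shown below; remove_from_csv is a hypothetical name, and the lookahead is my addition rather than part of the answer above.

```python
import re

def remove_from_csv(csv_str, item):
    """Remove the first occurrence of item from a comma-separated string."""
    esc = re.escape(str(item))
    # Either the item at the very start, followed by a comma or the end,
    # or a comma plus the item, with a comma or the end right after it.
    pattern = re.compile(rf"^{esc}(?:,|$)|,{esc}(?=,|$)")
    return pattern.sub("", csv_str, count=1)

print(remove_from_csv("359,1,2", 359))  # 1,2
print(remove_from_csv("1,359,2", 359))  # 1,2
print(remove_from_csv("1,2,359", 359))  # 1,2
print(remove_from_csv("1,3591", 359))   # 1,3591
```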