Shorten Code Logic For String Manipulation - regex

Examples
"123456" would be ["123", "456"].
"1234567891011" would be ["123", "456", "789", "10", "11"].
I have come up with this logic using regex to solve the challenge but I am being asked if there is a way to make the logic shorter.
def ft(str)
end
The result from the scan contains a lot of whitespace, so after the join operation I am left with double or triple dashes; I used .gsub(/-+/, '-') to fix that. I also noticed there is sometimes a dash at the beginning or the end of the string, so I used .gsub(/^-|-$/, '') to fix that too.
Any Ideas?

Slice the string in chunks of max 3 digits (s.scan(/.{1,3}/)).
Check if the last chunk has only 1 character. If so, take the last char of the chunk before and prepend it to the last.
Glue the chunks together using join(" ")
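The three steps above can be sketched roughly as follows. The method name chunk_digits is hypothetical, and the sketch assumes the input already contains only digits, since the steps do not mention cleaning it first.

```ruby
# Hypothetical helper following the three steps above.
# Assumes s contains digits only.
def chunk_digits(s)
  # Step 1: slice into chunks of at most 3 characters.
  chunks = s.scan(/.{1,3}/)

  # Step 2: if the last chunk is a single character, move the last
  # character of the previous chunk in front of it.
  if chunks.size > 1 && chunks.last.size == 1
    chunks[-1] = chunks[-2].slice!(-1) + chunks[-1]
  end

  # Step 3: glue the chunks together.
  chunks.join(" ")
end

puts chunk_digits("123456")         # 123 456
puts chunk_digits("1234567891011")  # 123 456 789 10 11
```

Passing '-' to join instead of " " would give the dashed form mentioned in the question.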

Inspired by @steenslag's recommendation. (There are quite a few other ways to achieve the same with varying levels of verbosity and esotericism)
Here is how I would go about it:
def format_str(str)
numbers = str.delete("^0-9").scan(/.{1,3}/)
# there are a number of ways this operation can be performed
numbers.concat(numbers.pop(2).join.scan(/../)) if numbers.last.length == 1
numbers.join('-')
end
Breakdown:
numbers = str.delete("^0-9") - delete any non numeric character from the string
.scan(/.{1,3}/) - scan them into groups of 1 to 3 characters
numbers.concat(numbers.pop(2).join.scan(/../)) if numbers.last.length == 1 - If the last element has a length of 1 then remove the last 2 elements join them and then scan them into groups of 2 and add these groups back to the Array
numbers.join('-') - join the numbers with a hyphen to return a formatted String
Example:
require 'securerandom'
10.times do
s = SecureRandom.hex
puts "Original: #{s} => Formatted: #{format_str(s)}"
end
# Original: fd1bbce41b1c784ce6ad5303d868bbe9 => Formatted: 141-178-465-303-86-89
# Original: af04bd4d4d6beb5a0412a692d5d3d42d => Formatted: 044-465-041-269-253-42
# Original: 9a1833a43cbef51c3f3c21baa66fe996 => Formatted: 918-334-351-332-166-996
# Original: 4104ae13c998cec896997b9919bdafb3 => Formatted: 410-413-998-896-997-991-93
# Original: 0eb49065472240ba32b3c029f897b30d => Formatted: 049-065-472-240-323-029-897-30
# Original: 4c68f9f68e8f6132c0ed5b966d639cf4 => Formatted: 468-968-861-320-596-663-94
# Original: 65987ee04aea8fb533dbe38c0fea7d63 => Formatted: 659-870-485-333-807-63
# Original: aa8aaf1cf59b52c9ad7db6d4b1ae0cbb => Formatted: 815-952-976-410
# Original: 8eb6b457059f91fd06ccbac272db8f4e => Formatted: 864-570-599-106-272-84
# Original: 1c65825ed59dcdc6ec18af969938ea57 => Formatted: 165-825-596-189-699-38-57
That being said to modify your existing code this will work as well:
def format_str(str)
str
.delete("^0-9")
.scan(/(?=\d{5})\d{3}|(?=\d{3}$)\d{3}|\d{2}/)
.join('-')
end

Here are three more ways to do that.
Use String#scan with a regular expression
def fmt(str)
str.delete("^0-9").scan(/\d{2,3}(?!\d\z)/)
end
The regular expression reads, "match two or three digits provided they are not followed by a single digit at the end of the string". (?!\d\z) is a negative lookahead (which is not part of the match). As matches are greedy by default, the regex engine will always match three digits if possible.
Solve by recursion
def fmt(str)
recurse(str.delete("^0-9"))
end
def recurse(s)
case s.size
when 2,3
[s]
when 4
[s[0,2], s[2,2]]
else
[s[0,3], *recurse(s[3..])]
end
end
Determine the last two matches from the size of the string
def fmt(str)
s = str.delete("^0-9")
if s.size % 3 == 1
s[0..-5].scan(/\d{3}/) << s[-4,2] << s[-2,2]
else
s.scan(/\d{2,3}/)
end
end
All methods exhibit the following behaviour.
["5551231234", "18883319", "123456", "1234567891011"].each do |str|
puts "#{str}: #{fmt(str)}"
end
5551231234: ["555", "123", "12", "34"]
18883319: ["188", "833", "19"]
123456: ["123", "456"]
1234567891011: ["123", "456", "789", "10", "11"]

An approach:
def foo(s)
s.gsub(/\D/, '').scan(/\d{1,3}/).each_with_object([]) do |x, arr|
if x.size == 3 || arr == []
arr << x
else
y = arr.last
arr[-1] = y[0...-1]
arr << "#{y[-1]}#{x}"
end
end
end
Remove all non-digits characters, then scan for 1 to 3 digit chunks. Iterate over them. If it's the first time through or the chunk is three digits, add it to the array. If it isn't, take the last digit from the previous chunk and prepend it to the current chunk and add that to the array.
Alternatively, without generating a new array.
def foo(s)
s.gsub(/\D/, '').scan(/\d{1,3}/).instance_eval do |y|
y.each_with_index do |x, i|
if x.size == 1
y[i] = "#{y[i-1][-1]}#{x}"
y[i-1] = y[i-1][0...-1]
end
end
end
end

Without changing your code too much and without adjusting your actual regex, I might suggest replacing scan with split in order to avoid all the extra nil values; replacing gsub with tr which is much faster; and then using reject(&:empty?) to loop through and remove any blank array elements before joining with whatever character you want:
string = "12345fg\n\t\t\t 67"
string.tr("^0-9", "")
.split(/(?=\d{5})(\d{3})|(?=\d{3}$)(\d{3})|(\d{1,2})/)
.reject(&:empty?)
.join("-")
#=> 123-45-67

Not suggesting this is the best approach, but wanted to offer a little food for thought:
You can basically reduce the logic for your challenge to test for 1 single condition and to use 2 very simple pattern matches:
Condition to test for: Number of characters is more than 3 and has a modulo(3) of 1. This condition will require the use of both pattern matches.
All other conditions will use a single pattern match so no reason to test for those.
This could probably be made a little less verbose but it’s all spelled out pretty well for clarity:
def format(s)
n = s.delete("^0-9")
regex_1 = /.{1,3}/
regex_2 = /../
if [n.length-3, 0].max.modulo(3) == 1
a = n[0..-5].scan(regex_1) + n[-4..-1].scan(regex_2)
else
a = n.scan(regex_1)
end
a.join("-")
end

Related

RegEx for matching 3 alphabets and 1-2 digits

I am trying to write a regular expression to find matches in a text of at least 100 characters. A match is any substring that starts with 3 letters followed by at least 1 and at most 2 digits.
Examples -
abcjkhklfdpdn24hjkk - In this case I want to extract pdn24
hjdksfkpdf1lkjk - In this case I want to extract pdf1
hjgjdkspdg34kjfs dhj khk678jkfhlds1 - In this case I want both pdg34 and lds1
How do I write a regex for this ? The length of the starting letters for a match is always 3 and the digits length can be either 1 or 2 (not more not less)
This is what works if there are 2 digits after the 3 letter string.
[A-Za-z]{3}[0-9]{2}
But the length of the digits can vary between 1 and 2. How do I include the varying length in the regex?
The expression we wish to design is quite interesting. We can first add your original expression with a slight modification in a capturing group, then we should think of left and right boundaries around it. For instance, on the right we might want to use \D:
([A-Za-z]{3}[0-9]{1,2})\D
We can surely define an exact restricted expression. However, this might just work.
Based on Cary Swoveland's advice, we can also use this expression, which is much better:
\p{L}{3}\d{1,2}(?!\d)
Test
re = /([A-Za-z]{3}[0-9]{1,2})\D/m
str = 'abcjkhklfdpdn24hjkk
hjdksfkpdf1lkjk
hjgjdkspdg34kjfs dhj khk678jkfhlds1 '
# Print the match result
str.scan(re) do |match|
puts match.to_s
end
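The lookahead variant suggested above can be tried the same way. This is only a sketch: the helper name extract_codes is hypothetical, and the input reuses the sample text from the question. One difference worth noting is that the \D version needs a character after the digits, while the lookahead version also matches when the digits end the string.

```ruby
# Hypothetical helper using the lookahead expression quoted above:
# three letters, one or two digits, not followed by another digit.
def extract_codes(text)
  text.scan(/\p{L}{3}\d{1,2}(?!\d)/)
end

sample = "abcjkhklfdpdn24hjkk\nhjdksfkpdf1lkjk\nhjgjdkspdg34kjfs dhj khk678jkfhlds1 "
p extract_codes(sample)   # ["pdn24", "pdf1", "pdg34", "lds1"]

# At the very end of a string only the lookahead version matches.
p extract_codes("abc1")                               # ["abc1"]
p "abc1".scan(/([A-Za-z]{3}[0-9]{1,2})\D/).flatten    # []
```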
This script shows how the capturing group works:
const regex = /([A-Za-z]{3}[0-9]{1,2})\D/gm;
const str = `abcjkhklfdpdn24hjkk
hjdksfkpdf1lkjk
hjgjdkspdg34kjfs dhj khk678jkfhlds1 `;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
At least 3 alphabets: [a-zA-Z]{3,}
1 or 2 digits (not more not less): [0-9]{1,2}
This gives us:
/[a-zA-Z]{3,}[0-9]{1,2}/

Comparing filenames and determine their incremental digits

Imagine i have a sequence of files, e.g.:
...
segment8_400_av.ts
segment9_400_av.ts
segment10_400_av.ts
segment11_400_av.ts
segment12_400_av.ts
...
When the filenames are known, i can match against the filenames with a regular expression like:
/segment(\d+)_400_av\.ts/
Because i know the incremental pattern.
But what would be a generic approach to this? I mean how can i take two file names out of the list, compare them and find out where in the file name the counting part is, taking into account any other digits that can occur in the filename (the 400 in this case)?
Goal: What i want to do is to run the script against various file sequences to check for example for missing files, so this should be the first step to find out the numbering scheme. File sequences can occur in many different fashions, e.g.:
test_1.jpg (simple counting suffix)
test_2.jpg
...
or
segment9_400_av.ts (counting part in between, with other static digits)
segment10_400_av.ts
...
or
01_trees_00008.dpx (padded with zeros)
01_trees_00009.dpx
01_trees_00010.dpx
Edit 2: My problem can probably be described more simply: With a given set of files, I want to:
Find out, if they are a numbered sequence of files, with the rules below
Get the first file number, get the last file number and file count
Detect missing files (gaps in the sequence)
Rules:
As melpomene summarized in his answer, the file names only differ in one substring, which consists only of digits
The counting digits can occur anywhere in the filename
The digits can be padded with 0's (see example above)
I can do #2 and #3, what i am struggling with is #1 as a starting point.
You tagged this question regex, so here's a regex-based solution:
use strict;
use warnings;
my $name1 = 'segment12_400_av.ts';
my $name2 = 'segment10_400_av.ts';
if (
"$name1\0$name2" =~ m{
\A
( \D*+ (?: \d++ \D++ )* ) # prefix
( \d++ ) # numeric segment 1
( [^\0]* ) # suffix
\0 # separator
\1 # prefix
( \d++ ) # numeric segment 2
\3 # suffix
\z
}xa
) {
print <<_EOT_;
Result of comparing "$name1" and "$name2"
Common prefix: $1
Common suffix: $3
Varying numeric parts: $2 / $4
Position of varying numeric part: $-[2]
_EOT_
}
Output:
Result of comparing "segment12_400_av.ts" and "segment10_400_av.ts"
Common prefix: segment
Common suffix: _400_av.ts
Varying numeric parts: 12 / 10
Position of varying numeric part: 7
It assumes that
the strings are different (guard the condition with $name1 ne $name2 && ... if that's not guaranteed)
there's only one substring that's different between the input strings (otherwise it won't find any match)
the differing substring consists of digits only
all digits surrounding the first point of difference are part of the varying increment (e.g. the example above recognizes segment as the common prefix, not segment1)
The idea is to combine the two names into a single string (separated by NUL, which is unambiguous because filenames can't contain \0), then let the regex engine do the hard work of finding the longest common prefix (using greediness and backtracking).
Because we're in a regex, we can get a bit more fancy than just finding the longest common prefix: We can make sure that the prefix doesn't end with a digit (see the segment1 vs. segment case above) and we can verify that the suffix is also the same.
See if this works for you:
use strict;
use warnings;
sub compare {
my ( $f1, $f2 ) = @_;
my @f1 = split /(\d+)/sxm, $f1;
my @f2 = split /(\d+)/sxm, $f2;
my $i = 0;
my $out1 = q{};
my $out2 = q{};
foreach my $p (@f1) {
if ( $p eq $f2[$i] ) {
$out1 .= $p;
$out2 .= $p;
}
else {
$out1 .= sprintf ' ((%s)) ', $p;
$out2 .= sprintf ' ((%s)) ', $f2[$i];
}
$i++;
}
print $out1 . "\n";
print $out2 . "\n";
return;
}
print "Test1:\n";
compare( 'segment8_400_av.ts', 'segment9_400_av.ts' );
print "\n\nTest2:\n";
compare( 'segment999_8_400_av.ts', 'segment999_9_400_av.ts' );
You basically split the strings at each run of digits, then loop through the items and compare each of the 'pieces'. If they are equal, you accumulate. If not, you highlight the differences and accumulate.
Output (I'm using ((number)) for the highlight)
Test1:
segment ((8)) _400_av.ts
segment ((9)) _400_av.ts
Test2:
segment999_ ((8)) _400_av.ts
segment999_ ((9)) _400_av.ts
I assume that only the counter differs across the strings
use warnings;
use strict;
use feature 'say';
my ($fn1, $fn2) = ('segment8_400_av.ts', 'segment12_400_av.ts');
# Collect all numbers from all strings
my @nums = map { [ /([0-9]+)/g ] } ($fn1, $fn2);
my ($n, $pos); # which number in the string, at what position
# Find which differ
NUMS:
for my $j (1..$#nums) { # strings
for my $i (0..$#{$nums[0]}) { # numbers in a string
if ($nums[$j]->[$i] != $nums[0]->[$i]) { # it is i-th number
$n = $i;
$fn1 =~ /($nums[0]->[$i])/g; # to find position
$pos = $-[$i];
say "It is $i-th number in a string. Position: $pos";
last NUMS;
}
}
}
We loop over the array with arrayrefs of numbers found in each string, and over elements of each arrayref (eg [8, 400]). Each number in a string (0th or 1st or ...) is compared to its counterpart in the 0-th string (array element); all other numbers are the same.
The number of interest is the one that differs and we record which number in a string it is ($n-th).
Then its position in the string is found by matching it again and using the @- regex variable with (the just established) index $n, which gives the offset of the start of the n-th match. This part may be unneeded; while the question edits helped, I am still unsure whether the position is useful.
Prints, with position counting from 0
It is 0-th number in a string. Position: 7
Note that, once it is found that it is the $i-th number, we can't use index to find its position; a number earlier in the string may happen to be the same as the $i-th one.
To test, modify input strings by adding the same number to each, before the one of interest.
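The point about index can be illustrated with a small Ruby sketch (not a port of the Perl above). Both method names are hypothetical; the sketch assumes the two names contain the same number of digit runs and compares them numerically, recording each run's start offset while scanning so that no second search is needed.

```ruby
# Collect every run of digits together with its start offset.
def numbers_with_offsets(name)
  found = []
  name.scan(/\d+/) do
    m = Regexp.last_match
    found << [m[0], m.begin(0)]
  end
  found
end

# Return which digit run differs between the two names and where it
# starts in the first name, or nil if no run differs.
def varying_number(name1, name2)
  a = numbers_with_offsets(name1)
  b = numbers_with_offsets(name2)
  i = a.each_index.find { |k| b[k] && a[k][0].to_i != b[k][0].to_i }
  i && { index: i, position: a[i][1] }
end

result = varying_number('seg8_8_400.ts', 'seg8_9_400.ts')
puts result[:index]               # 1
puts result[:position]            # 5
puts 'seg8_8_400.ts'.index('8')   # 3, the earlier, unrelated 8
```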
Per question update, to examine the sequence (for missing files for instance), with the above findings you can collect counters for all strings in an array with hashrefs (num => filename)
use Data::Dump qw(dd);
my @seq = map { { $nums[$_]->[$n] => $fnames[$_] } } 0..$#fnames;
dd \@seq;
where @fnames contains the filenames (like the two picked for the example above, $fn1 and $fn2). This assumes that the file list was sorted to begin with; add the sort if it wasn't
my @seq =
sort { (keys %$a)[0] <=> (keys %$b)[0] }
map { { $nums[$_]->[$n] => $fnames[$_] } }
0..$#fnames;
The order is maintained by array.
Adding this to the above example (with two strings) adds to the print
[
{ 8 => "segment8_400_av.ts" },
{ 12 => "segment12_400_av.ts" },
]
With this, all goals in "Edit 2" should be straightforward.
I suggest that you build a regex pattern by changing all digit sequences to (\d+) and then see which captured values have changed
For instance, with segment8_400_av.ts and
segment9_400_av.ts you would generate a pattern /segment(\d+)_(\d+)_av\.ts/. Note that s/\d+/(\d+)/g will return the number of numeric fields, which you will need for the subsequent check
The first would capture 8 and 400 while the second would capture 9 and 400. 8 is different from 9, so it is in that region of the string where the number varies
I can't really write much code as you don't say what sort of result you want from this process
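As a rough illustration of that suggestion, here is a Ruby sketch rather than Perl. The names digit_pattern and changed_fields are hypothetical, and the sketch assumes both file names follow the same scheme; it returns nil when they do not.

```ruby
# Hypothetical helper: build a pattern from a file name in which every
# run of digits becomes a (\d+) capture group.
def digit_pattern(name)
  body = Regexp.escape(name).gsub(/\d+/) { '(\d+)' }
  Regexp.new('\A' + body + '\z')
end

# Return the indexes of the numeric fields whose values differ.
def changed_fields(name1, name2)
  pattern = digit_pattern(name1)
  m1 = pattern.match(name1)
  m2 = pattern.match(name2)
  return nil unless m1 && m2   # the names do not share one scheme
  m1.captures.each_index.reject { |i| m1[i + 1] == m2[i + 1] }
end

p changed_fields('segment8_400_av.ts', 'segment9_400_av.ts')   # [0]
p changed_fields('01_trees_00008.dpx', '01_trees_00010.dpx')   # [1]
```

If exactly one index comes back, that field is the counter; its captured values across all files can then be converted to integers to look for gaps.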

groovy regex, how to match array items in a string

The string looks like this "[xx],[xx],[xx]"
Where xx is a polygon like this "(1.0,2.3),(2.0,3)..."
Basically, we are looking for a way to get the string between each pair of square brackets into an array.
E.g. String source = "[hello],[1,2],[(1,2),(2,4)]"
would result in an object a such that:
a[0] == 'hello'
a[1] == '1,2'
a[2] == '(1,2),(2,4)'
We have tried various strategies, including using groovy regex:
def p = "[12],[34]"
def points = p =~ /(\[([^\[]*)\])*/
println points[0][2] // yields 12
However, this yields the following two-dimensional array:
[[12], [12], 12]
[, null, null]
[[34], [34], 34]
so if we took the 3rd item from every even row we would be ok, but this does not look very correct. We are not taking the ',' into account, and we are not sure why we are getting "[12]" twice, when it should be zero times?
Any regex experts out there?
I think that this is what you're looking for:
def p = "[hello],[1,2],[(1,2),(2,4)]"
def points = p.findAll(/\[(.*?)\]/){match, group -> group }
println points[0]
println points[1]
println points[2]
This script prints:
hello
1,2
(1,2),(2,4)
The key is the non-greedy .*?, which matches the shortest possible text between [ and ]. A greedy .* would pair the first [ with the last ] and give a single match, hello],[1,2],[(1,2),(2,4). findAll with the closure then returns only the captured group.
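The difference between the two quantifiers is easy to see side by side. The sketch below is in Ruby rather than Groovy, and the method name bracket_items is hypothetical; the regular expression itself is the one from the answer.

```ruby
# Hypothetical helper: return the text inside each pair of square brackets.
def bracket_items(source)
  # .*? is lazy, so each match stops at the first closing bracket.
  source.scan(/\[(.*?)\]/).flatten
end

source = "[hello],[1,2],[(1,2),(2,4)]"
p bracket_items(source)               # ["hello", "1,2", "(1,2),(2,4)"]

# For comparison, a greedy .* runs to the last closing bracket.
p source.scan(/\[(.*)\]/).flatten     # ["hello],[1,2],[(1,2),(2,4)"]
```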
Hope it helps,

R code to check if word matches pattern

I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
word = trim(word)
chars = strsplit(word, NULL)[[1]]
all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW i guess...is there a faster way to do this? Regex maybe?
Thanks!
PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix,
where the row names and column names are the allowed characters. Then i iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation: I was 'appending' instead of pre-allocating the final vector, so the code was taking 30 minutes to execute, instead of 20 seconds!
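The 'letter precedence' matrix described in the PS might be built along these lines. This is a Ruby sketch, not the R code the PS refers to; the method name precedence_matrix is hypothetical, and it assumes every word has already passed the validation above. The matrix is allocated once up front, which is the point the PS makes about pre-allocation.

```ruby
# Hypothetical helper: count how often each allowed character is
# immediately followed by each other allowed character.
def precedence_matrix(words)
  allowed = ('a'..'z').to_a + ['!', '.', '?']       # 29 symbols
  index   = allowed.each_with_index.to_h            # char => row/column
  # Allocate the full 29 x 29 matrix once, filled with zeros.
  matrix  = Array.new(allowed.size) { Array.new(allowed.size, 0) }

  words.each do |word|
    # Walk over each pair of neighbouring characters.
    word.each_char.each_cons(2) do |a, b|
      matrix[index[a]][index[b]] += 1
    end
  end
  matrix
end

m = precedence_matrix(['special'])
puts m[18][15]   # row s, col p: 1
```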
There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The glaring issue here is strsplit. Comparing the equality of things character-by-character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, then MAX can be omitted, but the comma must stay ({MIN,}); without the comma, {MIN} means exactly MIN repetitions. Likewise, if there is neither maximum nor minimum, everything within the {} including the braces themselves can be replaced with a *, which signifies the preceding item will be matched zero or more times; this is equivalent to {0,}.
This ensures that the regex only matches strings where every character after any leading and trailing whitespace is from the set of
* a lower case letter
* a bang (exclamation point)
* a period
* a question mark
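The MIN and MAX placeholders can be tried out with a short Ruby sketch (Ruby rather than R, and the method name valid_word? is hypothetical). In Ruby, ^ and $ match at line boundaries, so the sketch uses \A and \z to anchor the whole string.

```ruby
# Hypothetical helper; min and max play the roles of MIN and MAX above.
def valid_word?(word, min: 2, max: nil)
  # With max = nil the interpolation produces {min,}, i.e. no upper bound.
  pattern = /\A\s*[a-z!.?]{#{min},#{max}}\s*\z/
  word.match?(pattern)
end

p valid_word?("!hello ")          # true
p valid_word?(" d ")              # false, shorter than the minimum
p valid_word?("nA!")              # false, upper case is not allowed
p valid_word?("abcdef", max: 3)   # false, longer than the maximum
```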
Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that this approach always costs at least O(n), as the split causes n strings to be generated. This means that even FAILING strings must do at least n actions to reject the string.
Hopefully this clarifies why you were having performance issues.

How can I return true only if one of a set of strings matches?

I want to return true if the user enters only one of a set of possible matches. Similar to an XOR operator, but only one string out of the entire group may exist in the input. Here is my code:
if input.match?(/str|con|dex|wis|int|cha/)
The following inputs should return true:
+2 int
-3con
str
con
wisdom
dexterity
The following inputs should return false:
+1 int +2 cha
-4dex+3con-1cha
int cha
str dex
con wis cha
strength intelligence
strdex
I'd probably go with String#scan and a simple regex so that you can understand what you've done later:
if input.scan(/str|dex|con|int|wis|cha/).length == 1
# Found exactly one
else
# Didn't find it or found too many
end
That also makes it easier to distinguish between the various ways it can fail.
Presumably your strings will be relatively small so scanning the string for all the matches won't have any noticeable overhead.
The following are three ways to answer the question without creating an intermediate array. All employ the regular expression:
R = /str|con|dex|wis|int|cha/
and return the following:
one_match? "It wasn't a con, really" #=> true
one_match? "That sounds to me like a wild guess." #=> falsy (nil or false)
one_match? "Both int and dex are present." #=> falsy (nil or false)
one_match? "Three is an integer." #=> true
one_match? "Both int and indexes are present." #=> falsy (nil or false)
#1 Do first and last match begin at the same index?
def one_match?(s)
(idx = s.index(R)) && idx == s.rindex(R)
end
See String#index and String#rindex.
#2 Use the form of String#index that takes an argument equal to the index at which the search is to begin.
def one_match?(s)
s.index(R) && s.index(R, Regexp.last_match.end(0)).nil?
end
See Regexp::last_match and MatchData#end. Regexp.last_match can be replaced by $~.
#3 Use the form of String#gsub that takes one argument and no block to create an enumerator that generates matches
def one_match?(s)
s.gsub(/str|con|dex|wis|int|cha/).count { true } == 1
end
See Enumerable#count.
Alternatively,
s.gsub(/str|con|dex|wis|int|cha/).to_a.size == 1
though this has the disadvantage of creating a temporary array.
To match whole words only
In the penultimate example 'int' matches 'int' in 'integer' and in the last match 'dex' matches 'dex' in 'indexes'. To enforce full-word matches the regular expression can be changed to:
/\b(?:str|con|dex|wis|int|cha)\b/
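Combined with the first approach above (comparing index and rindex), the whole-word expression could be used like this. The method name one_whole_word_match? is hypothetical and the sketch is only one way of putting the two pieces together.

```ruby
# Hypothetical helper: truthy only if exactly one whole-word match exists.
def one_whole_word_match?(s)
  re = /\b(?:str|con|dex|wis|int|cha)\b/
  # The first and the last match must start at the same index.
  (idx = s.index(re)) && idx == s.rindex(re)
end

p one_whole_word_match?("Three is an integer.")                # nil
p one_whole_word_match?("Both int and indexes are present.")   # true
p one_whole_word_match?("Both int and dex are present.")       # false
```

Note that whole-word matching also rejects inputs such as -3con and wisdom from the question, because there is no word boundary directly before con in -3con or after wis in wisdom.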
If you have to use a regex you may use
/\A(?!(?:.*(str|con|dex|wis|int|cha)){2}).*\g<1>/m
Details
\A - start of string
(?!(?:.*(str|con|dex|wis|int|cha)){2}) - no two occurrences of any 0+ chars followed with str, con, dex, wis, int, cha
.* - any 0+ chars as many as possible
\g<1> - Group 1 pattern (str, con, dex, wis, int or cha).
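A quick way to check this expression against the inputs from the question is sketched below. The method name exactly_one? is hypothetical; the regular expression is copied unchanged from the answer.

```ruby
# Hypothetical wrapper around the expression above.
def exactly_one?(input)
  input.match?(/\A(?!(?:.*(str|con|dex|wis|int|cha)){2}).*\g<1>/m)
end

# Inputs the question expects to pass.
["+2 int", "-3con", "str", "con", "wisdom", "dexterity"].each do |s|
  puts "#{s.inspect}: #{exactly_one?(s)}"
end

# Inputs the question expects to fail.
["+1 int +2 cha", "-4dex+3con-1cha", "int cha", "str dex",
 "con wis cha", "strength intelligence", "strdex"].each do |s|
  puts "#{s.inspect}: #{exactly_one?(s)}"
end
```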