How to capture multiple instances of the same group in Rust regex? - regex

Here is my text:
hello: 3 32 2 8
I want to capture it with the following regex:
^([a-z]+):( [0-9]+)+$
I'm doing this:
let txt = "hello: 3 32 2 8";
let re = Regex::new("^([a-z]+):( [0-9]+)+$")?;
let caps = re.captures(txt);
println!("{caps:?}");
I'm getting only the last number 8 in the second capture group:
Some(Captures({0: Some("hello: 3 32 2 8"), 1: Some("hello"), 2: Some(" 8")}))
I suspect that it is an expected behavior of captures, but what is the workaround?

I would simply capture the whole sequence of integers.
Since we know this substring has the expected shape, we can split and parse it with confidence (except if one integer has too many digits).
Note that I added some tolerance around white-spaces.
use regex::Regex;
fn detect(txt: &str) -> Result<(&str, Vec<u32>), Box<dyn std::error::Error>> {
let re = Regex::new(r"^\s*([a-z]+)\s*:((\s*[0-9]+)+)\s*$")?;
let caps = re.captures(txt).ok_or("no match")?;
// starting from here, we know that all the expected substrings exist
// thus we can unwrap() the options/errors
let name = caps.get(1).unwrap().as_str();
let values = caps
.get(2)
.unwrap()
.as_str()
.split_ascii_whitespace()
.filter_map(|s| s.parse().ok()) // FIXME: overflow ignored
.collect();
Ok((name, values))
}
fn main() {
for txt in [
"hello: 3 32 2 8",
"hello :\t3 32 2 8",
"\thello :\t3 32 2 8 ",
"hello:",
"hello:9999999999 3",
] {
println!("{:?} ~~> {:?}", txt, detect(txt));
}
}
/*
"hello: 3 32 2 8" ~~> Ok(("hello", [3, 32, 2, 8]))
"hello :\t3 32 2 8" ~~> Ok(("hello", [3, 32, 2, 8]))
"\thello :\t3 32 2 8 " ~~> Ok(("hello", [3, 32, 2, 8]))
"hello:" ~~> Err("no match")
"hello:9999999999 3" ~~> Ok(("hello", [3]))
*/

Related

In Raku, how does one write the equivalent of Haskell's span function?

In Raku, how does one write the equivalent of Haskell's span function?
In Haskell, given a predicate and a list, one can split the list into two parts:
the longest prefix of elements satisfying the predicate
the remainder of the list
For example, the Haskell expression …
span (< 10) [2, 2, 2, 5, 5, 7, 13, 9, 6, 2, 20, 4]
… evaluates to …
([2,2,2,5,5,7],[13,9,6,2,20,4])
How does one write the Raku equivalent of Haskell's span function?
Update 1
Based on the answer of #chenyf, I developed the following span subroutine (additional later update reflects negated predicate within span required to remain faithful to the positive logic of Haskell's span function) …
sub span( &predicate, #numberList )
{
my &negatedPredicate = { ! &predicate($^x) } ;
my $idx = #numberList.first(&negatedPredicate):k ;
my #lst is Array[List] = #numberList[0..$idx-1], #numberList[$idx..*] ;
#lst ;
} # end sub span
sub MAIN()
{
my &myPredicate = { $_ <= 10 } ;
my #myNumberList is Array[Int] = [2, 2, 2, 5, 5, 7, 13, 9, 6, 2, 20, 4] ;
my #result is Array[List] = span( &myPredicate, #myNumberList ) ;
say '#result is ...' ;
say #result ;
say '#result[0] is ...' ;
say #result[0] ;
say #result[0].WHAT ;
say '#result[1] is ...' ;
say #result[1] ;
say #result[1].WHAT ;
} # end sub MAIN
Program output is …
#result is ...
[(2 2 2 5 5 7) (13 9 6 2 20 4)]
#result[0] is ...
(2 2 2 5 5 7)
(List)
#result[1] is ...
(13 9 6 2 20 4)
(List)
Update 2
Utilizing information posted to StackOverflow concerning Raku's Nil, the following updated draft of subroutine span is …
sub span( &predicate, #numberList )
{
my &negatedPredicate = { ! &predicate($^x) } ;
my $idx = #numberList.first( &negatedPredicate ):k ;
if Nil ~~ any($idx) { $idx = #numberList.elems ; }
my List $returnList = (#numberList[0..$idx-1], #numberList[$idx..*]) ;
$returnList ;
} # end sub span
sub MAIN()
{
say span( { $_ == 0 }, [2, 2, 5, 7, 4, 0] ) ; # (() (2 2 5 7 4 0))
say span( { $_ < 6 }, (2, 2, 5, 7, 4, 0) ) ; # ((2 2 5) (7 4 0))
say span( { $_ != 9 }, [2, 2, 5, 7, 4, 0] ) ; # ((2 2 5 7 4 0) ())
} # end sub MAIN
I use first method and :k adverb, like this:
my #num = [2, 2, 2, 5, 5, 7, 13, 9, 6, 2, 20, 4];
my $idx = #num.first(* > 10):k;
#num[0..$idx-1], #num[$idx..*];
A completely naive take on this:
sub split_on(#arr, &pred) {
my #arr1;
my #arr2 = #arr;
loop {
if not &pred(#arr2.first) {
last;
}
push #arr1: #arr2.shift
}
(#arr1, #arr2);
}
Create a new #arr1 and copy the array into #arr2. Loop, and if the predicate is not met for the first element in the array, it's the last time through. Otherwise, shift the first element off from #arr2 and push it onto #arr1.
When testing this:
my #a = [2, 2, 2, 5, 5, 7, 13, 9, 6, 2, 20, 4];
my #b = split_on #a, -> $x { $x < 10 };
say #b;
The output is:
[[2 2 2 5 5 7] [13 9 6 2 20 4]]
Only problem here is... what if the predicate isn't met? Well, let's check if the list is empty or the predicate isn't met to terminate the loop.
sub split_on(#arr, &pred) {
my #arr1;
my #arr2 = #arr;
loop {
if !#arr2 || not &pred(#arr2.first) {
last;
}
push #arr1: #arr2.shift;
}
(#arr1, #arr2);
}
So I figured I'd throw my version in because I thought that classify could be helpful :
sub span( &predicate, #list ) {
#list
.classify({
state $f = True;
$f &&= &predicate($_);
$f.Int;
}){1,0}
.map( {$_ // []} )
}
The map at the end is to handle the situation where either the predicate is never or always true.
In his presentation 105 C++ Algorithms in 1 line* of Raku (*each) Daniel Sockwell discusses a function that almost answers your question. I've refactored it a bit to fit your question, but the changes are minor.
#| Return the index at which the list splits given a predicate.
sub partition-point(&p, #xs) {
my \zt = #xs.&{ $_ Z .skip };
my \mm = zt.map({ &p(.[0]) and !&p(.[1]) });
my \nn = mm <<&&>> #xs.keys;
return nn.first(?*)
}
#| Given a predicate p and a list xs, returns a tuple where first element is
#| longest prefix (possibly empty) of xs of elements that satisfy p and second
#| element is the remainder of the list.
sub span(&p, #xs) {
my \idx = partition-point &p, #xs;
idx.defined ?? (#xs[0..idx], #xs[idx^..*]) !! ([], #xs)
}
my #a = 2, 2, 2, 5, 5, 7, 13, 9, 6, 2, 20, 4;
say span { $_ < 10 }, #a; #=> ((2 2 2 5 5 7) (13 9 6 2 20 4))
say span { $_ < 5 }, [6, 7, 8, 1, 2, 3]; #=> ([] [6 7 8 1 2 3])
Version 6.e of raku will sport the new 'snip' function:
use v6.e;
dd (^10).snip( * < 5 );
#«((0, 1, 2, 3, 4), (5, 6, 7, 8, 9)).Seq␤»

Find position of first non-zero decimal

Suppose I have the following local macro:
loc a = 12.000923
I would like to get the decimal position of the first non-zero decimal (4 in this example).
There are many ways to achieve this. One is to treat a as a string and to find the position of .:
loc a = 12.000923
loc b = strpos(string(`a'), ".")
di "`b'"
From here one could further loop through the decimals and count since I get the first non-zero element. Of course this doesn't seem to be a very elegant approach.
Can you suggest a better way to deal with this? Regular expressions perhaps?
Well, I don't know Stata, but according to the documentation, \.(0+)? is suported and it shouldn't be hard to convert this 2 lines JavaScript function in Stata.
It returns the position of the first nonzero decimal or -1 if there is no decimal.
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
Explanation
We remove from input string a dot followed by optional consecutive zeros.
The difference between the lengths of original input string and this new string gives the position of the first nonzero decimal
Demo
Sample Snippet
function getNonZeroDecimalPosition(v) {
var v2 = v.replace(/\.(0+)?/, "")
return v2.length !== v.length ? v.length - v2.length : -1
}
var samples = [
"loc a = 12.00012",
"loc b = 12",
"loc c = 12.012",
"loc d = 1.000012",
"loc e = -10.00012",
"loc f = -10.05012",
"loc g = 0.0012"
]
samples.forEach(function(sample) {
console.log(getNonZeroDecimalPosition(sample))
})
You can do this in mata in one line and without using regular expressions:
foreach x in 124.000923 65.020923 1.000022030 0.0090843 .00000425 {
mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
}
4
2
5
3
6
Below, you can see the steps in detail:
. local x = 124.000823
. mata:
: /* Step 1: break Stata's local macro x in tokens using . as a parsing char */
: a = tokens(st_local("x"), ".")
: a
1 2 3
+----------------------------+
1 | 124 . 000823 |
+----------------------------+
: /* Step 2: tokenize the string in a[1,3] using 0 as a parsing char */
: b = tokens(a[3], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: /* Step 3: find which values are different from zero */
: c = b :!= "0"
: c
1 2 3 4
+-----------------+
1 | 0 0 0 1 |
+-----------------+
: /* Step 4: find the first index position where this is true */
: d = selectindex(c :!= 0)[1]
: d
4
: end
You can also find the position of the string of interest in Step 2 using the
same logic.
This is the index value after the one for .:
. mata:
: k = selectindex(a :== ".") + 1
: k
3
: end
In which case, Step 2 becomes:
. mata:
:
: b = tokens(a[k], "0")
: b
1 2 3 4
+-------------------------+
1 | 0 0 0 823 |
+-------------------------+
: end
For unexpected cases without decimal:
foreach x in 124.000923 65.020923 1.000022030 12 0.0090843 .00000425 {
if strmatch("`x'", "*.*") mata: selectindex(tokens(tokens(st_local("x"), ".")[selectindex(tokens(st_local("x"), ".") :== ".") + 1], "0") :!= "0")[1]
else display " 0"
}
4
2
5
0
3
6
A straighforward answer uses regular expressions and commands to work with strings.
One can select all decimals, find the first non 0 decimal, and finally find its position:
loc v = "123.000923"
loc v2 = regexr("`v'", "^[0-9]*[/.]", "") // 000923
loc v3 = regexr("`v'", "^[0-9]*[/.][0]*", "") // 923
loc first = substr("`v3'", 1, 1) // 9
loc first_pos = strpos("`v2'", "`first'") // 4: position of 9 in 000923
di "`v2'"
di "`v3'"
di "`first'"
di "`first_pos'"
Which in one step is equivalent to:
loc first_pos2 = strpos(regexr("`v'", "^[0-9]*[/.]", ""), substr(regexr("`v'", "^[0-9]*[/.][0]*", ""), 1, 1))
di "`first_pos2'"
An alternative suggested in another answer is to compare the lenght of the decimals block cleaned from the 0s with that not cleaned.
In one step this is:
loc first_pos3 = strlen(regexr("`v'", "^[0-9]*[/.]", "")) - strlen(regexr("`v'", "^[0-9]*[/.][0]*", "")) + 1
di "`first_pos3'"
Not using regex but log10 instead (which treats a number like a number), this function will:
For numbers >= 1 or numbers <= -1, return with a positive number the number of digits to the left of the decimal.
Or (and more specifically to what you were asking), for numbers between 1 and -1, return with a negative number the number of digits to the right of the decimal where the first non-zero number occurs.
digitsFromDecimal = (n) => {
dFD = Math.log10(Math.abs(n)) | 0;
if (n >= 1 || n <= -1) { dFD++; }
return dFD;
}
var x = [118.8161330, 11.10501660, 9.254180571, -1.245501523, 1, 0, 0.864931613, 0.097007836, -0.010880074, 0.009066729];
x.forEach(element => {
console.log(`${element}, Digits from Decimal: ${digitsFromDecimal(element)}`);
});
// Output
// 118.816133, Digits from Decimal: 3
// 11.1050166, Digits from Decimal: 2
// 9.254180571, Digits from Decimal: 1
// -1.245501523, Digits from Decimal: 1
// 1, Digits from Decimal: 1
// 0, Digits from Decimal: 0
// 0.864931613, Digits from Decimal: 0
// 0.097007836, Digits from Decimal: -1
// -0.010880074, Digits from Decimal: -1
// 0.009066729, Digits from Decimal: -2
Mata solution of Pearly is very likable, but notice should be paid for "unexpected" cases of "no decimal at all".
Besides, the regular expression is not a too bad choice when it could be made in a memorable 1-line.
loc v = "123.000923"
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
Below code tests with more values of v.
foreach v in 124.000923 605.20923 1.10022030 0.0090843 .00000425 12 .000125 {
capture local x = regexm("`v'","(\.0*)")*length(regexs(0))
di "`v': The wanted number = `x'"
}

Reading In Integers in Python

So, my question is simple. I'm simply struggling with syntax here. I need to read in a set of integers, 3, 11, 2, 4, 4, 5, 6, 10, 8, -12. What I want to do with those integers is place them in a list as I'm reading them. n = n x n array in which these will be presented. so if n = 3, then i will be passed something like this 3 \n 11 2 4 \n 4 5 6 \n 10 8 -12 ( \n symbolizing a new line in input file)
n = int(raw_input().strip())
a = []
for a_i in xrange(n):
value = int(raw_input().strip())
a.append(value)
print(a)
I receive this error from the above code code:
value = int(raw_input().strip())
ValueError: invalid literal for int() with base 10: '11 2 4'
The actual challenge can be found here, https://www.hackerrank.com/challenges/diagonal-difference .
I have already completed this in Java and C++, simply trying to do in Python now but I suck at python. If someone wants to, they don't have too, seeing the proper way to read in an entire line, say " 11 2 4 ", creating a new list out that line, and adding it to an already existing list. So then all I have to do is search said index of list[ desiredInternalList[ ] ].
You can split the string at white space and convert the entries into integers.
This gives you one list:
for a_i in xrange(n):
a.extend([int(x) for x in raw_input().split()])
and this a list of lists:
for a_i in xrange(n):
a.append([int(x) for x in raw_input().split()]):
You get this error because you try to give all inputs in one line. To handle this issue you may use this code
n = int(raw_input().strip())
a = []
while len(a)< n*n:
x=raw_input().strip()
x = map(int,x.split())
a.extend(x)
print(a)

Reg expression subset work individually but when in function nothing happens

Good evening,
I have a list "a" which I successfully subset using regular expressions.
a=a[grep("Macy*|Nors*", a$Geography, perl=TRUE),]
a=a[grep("Levis*|Diesel*|Replay*", a$Brand.Name, perl=TRUE), ]
a=a[grep("Week*", a$Time, perl=TRUE), ]
I created the function clean below but when I apply it to my list "a"nothing happens
clean=function(x){
x=x[grep("Macy*|Nors*", x$Geography, perl=TRUE),]
x=x[grep("Levis*|Diesel*|Replay*", x$Brand.Name, perl=TRUE), ]
x=x[grep("Week*", x$Time, perl=TRUE), ]
return (x)
}
clean(a) just returns original "a"
I tried printing each individual step but literally nothing happens.
Thank you for your help
a = data.frame(Geography = c(paste('Macy',1:5, sep = " "),'john'),stringsAsFactors = FALSE)
a
Geography
1 Macy 1
2 Macy 2
3 Macy 3
4 Macy 4
5 Macy 5
6 john
clean = function(x){
x = x[grep("Macy*|Nors*", x[['Geography']], perl = TRUE),]
return(x)
}
clean(a)
[1] "Macy 1" "Macy 2" "Macy 3" "Macy 4" "Macy 5"

R regex find ranges in strings

I have a bunch of email subject lines and I'm trying to extract whether a range of values are present. This is how I'm trying to do it but am not getting the results I'd like:
library(stringi)
df1 <- data.frame(id = 1:5, string1 = NA)
df1$string1 <- c('15% off','25% off','35% off','45% off','55% off')
df1$pctOff10_20 <- stri_match_all_regex(df1$string1, '[10-20]%')
id string1 pctOff10_20
1 1 15% off NA
2 2 25% off NA
3 3 35% off NA
4 4 45% off NA
5 5 55% off NA
I'd like something like this:
id string1 pctOff10_20
1 1 15% off 1
2 2 25% off 0
3 3 35% off 0
4 4 45% off 0
5 5 55% off 0
Here is the way to go,
df1$pctOff10_20 <- stri_count_regex(df1$string1, '^(1\\d|20)%')
Explanation:
^ the beginning of the string
( group and capture to \1:
1 '1'
\d digits (0-9)
| OR
20 '20'
) end of \1
% '%'
1) strapply in gsubfn can do that by combining a regex (pattern= argument) and a function (FUN= argument). Below we use the formula representation of the function. Alternately we could make use of betweeen from data.table (or a number of other packages). This extracts the matches to the pattern, applies the function to it and returns the result simplifying it into a vector (rather than a list):
library(gsubfn)
btwn <- function(x, a, b) as.numeric(a <= as.numeric(x) & as.numeric(x) <= b)
transform(df1, pctOff10_20 =
strapply(
X = string1,
pattern = "\\d+",
FUN = ~ btwn(x, 10, 20),
simplify = TRUE
)
)
2) A base solution using the same btwn function defined above is:
transform(df1, pctOff10_20 = btwn(gsub("\\D", "", string1), 10, 20))