Ruby regex counting characters - regex

I am trying to create a regex in Ruby that matches strings containing at least 10 characters that are not special characters, i.e. characters that \w would match.
So far I have come up with this:
/\w{10,}/
but the issue is that it only counts a consecutive sequence of word characters. I want to match any string that contains at least 10 "word" characters in total, wherever they appear. Is this possible? I am fairly new to regex as a whole, so any help would be appreciated.

If I understood correctly, this should work:
/(?:\w[^\w]*){9,}\w/
Explanation:
We start with a single
\w
We want to capture all the other characters until another \w, hence:
\w[^\w]*
[^<list of chars>] matches any character other than listed in the brackets, so [^\w] means any character that is not a word character. * denotes 0 or more. The above will match "a-- ", "b" and "c!" in "a-- bc!" string.
Since we need 10 \w, we will match 9 (or more) groups like that, followed by a single \w
(\w[^\w]*){9,}\w
We don't really care about captures here (a repeated group only keeps its last repetition in Ruby anyway), so we make the group non-capturing:
(?:\w[^\w]*){9,}\w
Alternatively we could just use simpler regex:
(?:\w[^\w]*){10,}
But it will also cover characters after the last word character in a string - not sure if this is required here.
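As a quick check of the pattern above, here is a minimal Ruby sketch. The constant name AT_LEAST_TEN and the helper ten_word_chars? are made up for illustration; the helper is an alternative that counts word characters directly instead of relying on the regex alone.

```ruby
# Pattern from the answer above: ten word characters, with any run of
# non-word characters allowed between them.
AT_LEAST_TEN = /(?:\w[^\w]*){9,}\w/

# Hypothetical helper (not from the answer): count word characters directly.
def ten_word_chars?(text)
  text.scan(/\w/).size >= 10
end

["ab-cd ef!gh ij", "abcdefghij", "a-b-c", "1 2 3 4 5 6 7 8 9"].each do |s|
  puts "#{s.inspect}: regex=#{AT_LEAST_TEN.match?(s)}, count=#{ten_word_chars?(s)}"
end
```

Both approaches should agree on every input; the counting version may be easier to read if the threshold changes often.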

Match anywhere in the string:
/\w(?:\W*\w){9,19}/
/(?:\W*\w){10,20}/
Validate that the whole string contains 10 to 20 word characters:
/\A(?:\W*\w){10,20}\W*\z/
Prefer non-capturing groups, particularly when extracting found matches.
Watch out for ^ and $: in Ruby's regex they mark the start and end of a line, not of the whole string, so use \A and \z to anchor the whole string.
EXPLANATION
--------------------------------------------------------------------------------
\A           the beginning of the string
--------------------------------------------------------------------------------
(?:          group, but do not capture (between 10 and 20 times
             (matching the most amount possible)):
--------------------------------------------------------------------------------
\W*          non-word characters (all but a-z, A-Z, 0-9, _)
             (0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
\w           word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
){10,20}     end of grouping
--------------------------------------------------------------------------------
\W*          non-word characters (all but a-z, A-Z, 0-9, _)
             (0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
\z           the end of the string
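To illustrate the anchor caveat, here is a small Ruby sketch comparing the \A...\z version with a ^...$ version of the same pattern. The constant names and the sample strings are assumptions chosen for the demonstration.

```ruby
# Whole-string validation, as in the answer above.
WHOLE_STRING = /\A(?:\W*\w){10,20}\W*\z/
# Same body with line anchors, shown only for comparison.
PER_LINE     = /^(?:\W*\w){10,20}\W*$/

# First line has 10 word characters, second line has 30.
two_lines = "abcdefghij\n" + "x" * 30

puts WHOLE_STRING.match?("abcde-fghij") # 10 word characters in total
puts WHOLE_STRING.match?("abc")         # too few word characters
puts WHOLE_STRING.match?(two_lines)     # 40 word characters overall
puts PER_LINE.match?(two_lines)         # the first line alone satisfies ^...$
```

The last two lines differ because ^ and $ let the pattern succeed on a single line of a multi-line string, which is usually not what a validation is meant to do.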

Related

Extract the last path-segments of a URI or path using RegEx

I am trying to extract the last section of the following string :
"/subscriptions/5522233222-d762-666e-555a-e6666666666/resourcegroups/rg-sql-Belguim-01/providers/Microsoft.Compute/snapshots/vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG"
I want to capture:
"snapshots/vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG"
I tried below with no luck:
(\w*?\/\w*?)$
How to pull this off using regex?
Use
[^\/]+\/[^\/]+$
EXPLANATION
--------------------------------------------------------------------------------
[^\/]+       any character except: '/' (1 or more times
             (matching the most amount possible))
--------------------------------------------------------------------------------
\/           '/'
--------------------------------------------------------------------------------
[^\/]+       any character except: '/' (1 or more times
             (matching the most amount possible))
--------------------------------------------------------------------------------
$            before an optional \n, and the end of the string
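A minimal Ruby sketch of this pattern applied to the path from the question. The constant name LAST_TWO_SEGMENTS is made up; %r{...} is used so the slash needs no escaping, and \z is used instead of $ on the assumption that the input is a single-line string. A split-based alternative is included for comparison.

```ruby
# Last two path segments: non-slashes, one slash, non-slashes, end of string.
LAST_TWO_SEGMENTS = %r{[^/]+/[^/]+\z}

path = "/subscriptions/5522233222-d762-666e-555a-e6666666666/resourcegroups/" \
       "rg-sql-Belguim-01/providers/Microsoft.Compute/snapshots/" \
       "vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG"

via_regex = path[LAST_TWO_SEGMENTS]
# Alternative without a regex: split on "/" and keep the last two parts.
via_split = path.split("/").last(2).join("/")

puts via_regex
puts via_split
```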
Your issues
(\w*?/\w*?)$ only works when the last 2 segments are simple or empty (tested), e.g.
it matched hello/world/subscriptions123/snap_shots, capturing subscriptions123/snap_shots
it matched /1/2//, capturing the last 2 empty segments
What was OK:
the capture group
/ to match the last path separator before the end ($)
\w*? intended to match a path segment of any length
What to improve:
*? is too permissive: use the quantifier + for at least one character (rather than * for any number, or ? for zero or one)
\w is the word metacharacter and does not match hyphens or dots (fine for snapshots, but not for the given last segment)
Quick fix
(\w+/[\w\.-]+)$ (tested)
added dot \. and hyphen - to the character set containing \w
Simple but solid
(snapshots/[^\/]+)$ (tested)
the second-to-last path segment is assumed to be the fixed constant snapshots
[^\/] matches any character except (^) a slash in the last segment
Note: the slash does not need to be escaped as \/ (as in Ryszard's answer) unless / is also the regex delimiter
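The three patterns discussed in this answer can be compared side by side in Ruby. This is only a sketch: the constant names are made up, and the sample path is a shortened tail of the one in the question.

```ruby
ORIGINAL     = %r{(\w*?/\w*?)$}        # the question's attempt
QUICK_FIX    = %r{(\w+/[\w\.-]+)$}     # \w plus dot and hyphen in the last segment
FIXED_PREFIX = %r{(snapshots/[^/]+)$}  # second-to-last segment hard-coded

# Shortened tail of the path from the question.
path = "providers/Microsoft.Compute/snapshots/vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG"

p path[ORIGINAL, 1]      # nil: \w cannot cross the hyphens and dots
p path[QUICK_FIX, 1]
p path[FIXED_PREFIX, 1]
p "hello/world/subscriptions123/snap_shots"[ORIGINAL, 1]
```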

Regex to match Strings which contain non Chinese characters between two Chinese Characters

I'm trying to figure out how to write a regex to match this pattern
测试1003##$%#测试
Chinese characters + non-Chinese characters + Chinese characters. The non-Chinese characters can be anything, and the Chinese characters are always the same (测试).
I know we can use ^((?!\p{Han}).)*$ to match non-Chinese characters, but I'm not sure how to make sure the head and tail are always the same Chinese characters (测试 in this case).
Use
^(\p{Han}+)\P{Han}*\g{1}$
EXPLANATION
--------------------------------------------------------------------------------
^            the beginning of the string
--------------------------------------------------------------------------------
(            group and capture to \1:
--------------------------------------------------------------------------------
\p{Han}+     Chinese (Han script) characters (1 or more times
             (matching the most amount possible))
--------------------------------------------------------------------------------
)            end of \1
--------------------------------------------------------------------------------
\P{Han}*     characters that are not Han script (0 or more times
             (matching the most amount possible))
--------------------------------------------------------------------------------
\g{1}        matches the same text as most recently matched
             by the 1st capturing group
--------------------------------------------------------------------------------
$            before an optional \n, and the end of the string
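For readers trying this in Ruby, here is a minimal sketch of the same idea. It assumes Ruby's regex engine, where the backreference is written \1 rather than \g{1}, and uses \A and \z for whole-string anchoring; the constant name SAME_ENDS and the second and third sample strings are made up. Single-quoted strings avoid any interpolation of the # and $ characters.

```ruby
# Han prefix captured in group 1, non-Han middle, then the same Han text again.
SAME_ENDS = /\A(\p{Han}+)\P{Han}*\1\z/

puts SAME_ENDS.match?('测试1003##$%#测试')  # same head and tail
puts SAME_ENDS.match?('测试1003测验')       # tail differs from head
puts SAME_ENDS.match?('测试12中34测试')     # Han character inside the middle part
```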
If prefix = suffix = 测试, then use
^测试\P{Han}*测试$
Or, if the suffix and prefix can include more Chinese characters:
^测试\p{Han}*\P{Han}*\p{Han}*测试$
If there should be at least a single character other than \p{Han} you can match \P{Han}.
Capture the \p{Han} chars in capture group 1, and add a backreference at the end to group 1.
^(\p{Han}+)\P{Han}.*\1$
^ Start of string
(\p{Han}+) Capture group 1, match 1+ chars in the han script
\P{Han} Match at least a char other than \p{Han}
.* Match the rest of the string
\1$ Match a backreference to group 1 at the end of the string
To also match only 测试 you can use:
^(\p{Han}+)(?:\P{Han}.*\1)?$
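A short Ruby sketch of the optional-group variant, again assuming \A and \z in place of ^ and $; the constant name and sample strings are illustrative. Note that because the middle part uses .*, Han characters are allowed between the prefix and the suffix here.

```ruby
# Group 1 alone, or group 1 + one non-Han char + anything + group 1 again.
OPTIONAL_MIDDLE = /\A(\p{Han}+)(?:\P{Han}.*\1)?\z/

puts OPTIONAL_MIDDLE.match?('测试')            # prefix only
puts OPTIONAL_MIDDLE.match?('测试1测试')        # prefix, middle, suffix
puts OPTIONAL_MIDDLE.match?('测试12中34测试')   # .* lets Han appear in the middle
puts OPTIONAL_MIDDLE.match?('测试1')           # middle without the closing suffix
```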

Regex to pick the alias from email address

I need to identify all email addresses in a given cell enclosed in any special character, written in any number of multiple lines.
This is something that I built.
"(!\s<,;-)[a-zA-Z0-9]*#"
Is there any improvement?
The pattern (!\s<,;-)[a-zA-Z0-9]*# starts with capturing !\s<,;- literally. If you want to match 1 of the listed characters, you can use a character class [!\s<,;-] instead.
If you want to match xyz123 in xyz123#gmail.com you can use:
[a-zA-Z0-9]+(?=#)
The pattern matches
[a-zA-Z0-9]+ Match 1+ occurrences of any of the listed ranges
(?=#) Assert (not match) an # directly to the right of the current position
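As a rough Ruby illustration of the lookahead pattern, the sketch below scans a multi-line cell for every alias followed by #. The constant name ALIAS_BEFORE_HASH and the cell contents are made up for the example.

```ruby
# One or more alphanumerics, only when directly followed by "#".
ALIAS_BEFORE_HASH = /[a-zA-Z0-9]+(?=#)/

# Hypothetical cell: addresses wrapped in various separators, over two lines.
cell = "a1#x.com, <bob#y.org>;\ncarol9#z.net"

aliases = cell.scan(ALIAS_BEFORE_HASH)
p aliases
```

Because the lookahead does not consume the #, each match is just the alias itself.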
Use
([a-zA-Z0-9]\w*)#
EXPLANATION
--------------------------------------------------------------------------------
(            group and capture to \1:
--------------------------------------------------------------------------------
[a-zA-Z0-9]  any character of: 'a' to 'z', 'A' to 'Z', '0' to '9'
--------------------------------------------------------------------------------
\w*          word characters (a-z, A-Z, 0-9, _) (0 or more times
             (matching the most amount possible))
--------------------------------------------------------------------------------
)            end of \1
--------------------------------------------------------------------------------
#            '#'
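The capture-group pattern behaves differently from a plain [a-zA-Z0-9]+ when the alias contains an underscore. A small Ruby sketch, with made-up constant names and a made-up address:

```ruby
WITH_GROUP     = /([a-zA-Z0-9]\w*)#/     # pattern from this answer
WITH_LOOKAHEAD = /[a-zA-Z0-9]+(?=#)/     # alphanumerics only, for comparison

address = "first_last#x.com"

# scan returns one array per match when the pattern has a capture group.
p address.scan(WITH_GROUP).flatten
p address.scan(WITH_LOOKAHEAD)
```

Which result is preferable depends on whether underscores are considered part of the alias.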

Regex select all BUT group

So I'm in a situation where I must use only regex to select everything but a specific word. For the purposes of example, the word will be foobar. This is an example of what should happen:
this should be highlighted, and
same with this. but any sentence
that has the word
foobar
shouldnt be, and same for any regular
sentence with foobar <-- like that
foobar beginning a sentence should invalidate
the entire sentence, same with at the end foobar
only foobar, and nothing else of the sentence
more words here more irrelevant stuff to highlight
and nothing of the key word
what about multiple foobar on the same foobar line?
And what should be matched, would look something like this:
The best I could get is /\b(?!foobar)[^\n]+\n?/g, which works if the word foobar is alone on its own separate line, formatted like this:
not foobar
foobar (ignored)
totallynotfoobar
nobar
foobutts
foobar (ignored)
notagain
And the rest is matched... but this is not what I want.
So my question is, how would I accomplish the original example? Is it even possible?
Here's one way:
\W*\b(?!foobar).+?\b\W*
The ? in .+? is to ensure we stop matching as soon as we get a \b, otherwise we might skip over some foobar's.
The \W*'s are necessary to consume any leading or trailing non-word characters in the string.
Every word and every word separator are matched separately here, which might not be ideal.
Full explanation:
NODE         EXPLANATION
--------------------------------------------------------------------------------
\W*          non-word characters (all but a-z, A-Z, 0-9, _)
             (0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
\b           the boundary between a word char (\w) and
             something that is not a word char
--------------------------------------------------------------------------------
(?!          look ahead to see if there is not:
--------------------------------------------------------------------------------
foobar       'foobar'
--------------------------------------------------------------------------------
)            end of look-ahead
--------------------------------------------------------------------------------
.+?          any character except \n (1 or more times
             (matching the least amount possible))
--------------------------------------------------------------------------------
\b           the boundary between a word char (\w) and
             something that is not a word char
--------------------------------------------------------------------------------
\W*          non-word characters (all but a-z, A-Z, 0-9, _)
             (0 or more times (matching the most amount possible))
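To see how the pattern splits a line into separate matches, here is a minimal Ruby sketch. The constant name NOT_FOOBAR and the short sample string are assumptions; brackets stand in for the highlighting described in the question.

```ruby
NOT_FOOBAR = /\W*\b(?!foobar).+?\b\W*/

line = "a foobar b"

# Each word (with its adjacent separators) is a separate match.
pieces = line.scan(NOT_FOOBAR)
p pieces

# Wrap every match in brackets to visualise what would be highlighted.
highlighted = line.gsub(NOT_FOOBAR) { |m| "[#{m}]" }
puts highlighted
```

The word foobar itself is never part of a match here, while the text on either side of it is matched piece by piece.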
A variation with look-behind and look-ahead (with /gs or /gm):
(?<=^|\bfoobar\b)(?!foobar\b)(.*?)(?=\bfoobar\b|$)
I believe all those \b's are necessary to correctly handle the cases where foobar appears as part of a longer word. If foobar inside a longer word should also be excluded, removing all the \b's should work.

regex filename of a unixpath without the first two digits

I have a filename in a Unix path that starts with two digits. How can I extract the name without the digits and without the extension?
/this/is/my/path/to/the/file/01filename.ext should be filename
I currently have [^/]+(?=\.ext$) so I get 01filename, but how do I get rid of the first two digits?
You can add a look-behind in front of what you already have, looking for two digits:
(?<=\d\d)[^/]+(?=\.ext$)
This only works if the prefix is exactly two digits! Unfortunately, most regex engines do not allow quantifiers like * or + in look-behinds.
(?<=\d\d) - checks for two digits before the match
[^/]+ - matches 1 or more characters, except /
(?=\.ext$) - checks for .ext after the match
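A minimal Ruby sketch of the look-behind approach, using the path from the question. The constant name is made up, \z replaces $ on the assumption of a single-line input, and the File.basename line is an alternative that avoids look-arounds entirely.

```ruby
# Two digits before the match, ".ext" at the very end after it.
NAME_AFTER_DIGITS = %r{(?<=\d\d)[^/]+(?=\.ext\z)}

path = "/this/is/my/path/to/the/file/01filename.ext"

via_regex = path[NAME_AFTER_DIGITS]
# Alternative: strip directory and extension first, then the two leading digits.
via_basename = File.basename(path, ".ext").sub(/\A\d\d/, "")

puts via_regex
puts via_basename
```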
Try this one:
/\d\d(.*?)\.\w{3}$
Explanation:
/\d\d : a slash followed by two digits
(.*?) : the capture
\.\w{3} : a dot followed by three word characters
$ : end of string
It works for me on Expresso
Consider the following Regex...
(?<=\d{2})[^/]+(?=\.ext$)
Good Luck!
A more general regex:
(?:^|\/)[\d]+([^.]+)\.[\w.]+$
Explanation:
(?:          group, but do not capture:
^            the beginning of the string
|            OR
\/           '/'
)            end of grouping
[\d]+        any character of: digits (0-9) (1 or more times
             (matching the most amount possible))
(            group and capture to \1:
[^.]+        any character except: '.' (1 or more times
             (matching the most amount possible))
)            end of \1
\.           '.'
[\w.]+       any character of: word characters (a-z, A-Z, 0-9, _), '.'
             (1 or more times (matching the most amount possible))
$            before an optional \n, and the end of the string
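A short Ruby sketch of the more general pattern, reading the name from capture group 1. The constant name GENERAL and the second and third sample inputs are made up for illustration.

```ruby
# Start of string or a slash, leading digits, the name (group 1),
# then a dot and the extension(s).
GENERAL = %r{(?:^|/)\d+([^.]+)\.[\w.]+$}

p "/this/is/my/path/to/the/file/01filename.ext"[GENERAL, 1]
p "123report.tar.gz"[GENERAL, 1]  # any number of digits, multi-part extension
p "/dir/notes.txt"[GENERAL, 1]    # no leading digits, so no match
```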