Generate search words from CamelCase text using regex

I want to generate search words from this CamelCase text.
I do not know whether that is possible with regex alone, but I am already close.
I use it in the scripting language AutoHotkey (https://autohotkey.com/docs/misc/RegEx-QuickRef.htm).
data: reCommended for future AutoHotkeyReleases.
regEx: (((\b[^A-Z\s]*)?([A-Z][a-z]+)|([\W_-]?[a-z]+))) ( https://regex101.com/r/NgRmXZ/2 )
expected Groups:
reCommended
re
Commended
for
future
Auto
Hotkey
AutoHotkey
HotkeyReleases
Releases
AutoHotkeyReleases.
I also tried these, but they do not work for me:
(?=\p{Lu}\p{Ll})|(?<=\p{Ll})(?=\p{Lu}) from Splitting CamelCase with regex
(([a-z]*)(?<=[a-z])((?:[A-Z])[a-z]+)) https://regex101.com/r/NgRmXZ/3
(?<=[a-z])([A-Z])|(?<=[A-Z])([A-Z][a-z]) https://regex101.com/r/NgRmXZ/4
((?<!^)([A-Z][a-z]+|(?<=[a-z])[A-Z][a-z]+)) https://regex101.com/r/B5vXaZ/1
I have already started to implement my prototype here:
https://gist.github.com/sl5net/ba5aef19f44fe68204ccb6c96e7c96e0

I have made a regex that almost satisfies your need. However, I'm missing one combination (HotkeyReleases). I don't think it's possible, because it would require parentheses to overlap: 'Hotkey' would have to be part of two different overlapping groups.
Well, here's the regex:
/\b((\w+?(?=[A-Z]|\b))([A-Z][a-z]*)?)([A-Z][a-z]*)?/g
It starts at a word boundary, then creates two groups. Group 2 matches any word character one or more times (lazily) until a lookahead finds a capital letter or a word boundary.
Group 3 matches a capital letter followed by zero or more lowercase letters. It is optional.
Group 1 combines Group 2 and Group 3.
Finally, Group 4 matches a capital letter followed by zero or more lowercase letters. It is also optional.
As mentioned, I don't think it's possible to create a group that combines Group 3 and Group 4 while Group 1 already contains Group 3, since the two would overlap. Other than that, this should work as you want.
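As a rough check, the pattern can be tried in Python (an assumption; the question uses AutoHotkey, but the pattern only relies on common syntax). The sketch below collects every group and builds the Group 3 + Group 4 combination in code rather than in the regex. The function name `search_words` is hypothetical.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r"\b((\w+?(?=[A-Z]|\b))([A-Z][a-z]*)?)([A-Z][a-z]*)?")

def search_words(text):
    """Collect the full match and all groups of every match, without duplicates."""
    words = []
    for m in PATTERN.finditer(text):
        g1, g2, g3, g4 = m.groups()
        candidates = [m.group(0), g1, g2, g3, g4]
        # The overlapping combination cannot be a regex group,
        # so it is assembled here from Group 3 and Group 4.
        if g3 and g4:
            candidates.append(g3 + g4)
        for word in candidates:
            if word and word not in words:
                words.append(word)
    return words

print(search_words("reCommended for future AutoHotkeyReleases."))
```

Joining the two groups after matching sidesteps the overlap limitation, at the cost of a little post-processing.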

Related

Regex to capture 2 sets of 8 non capital letters, sandwiched by dots

I'm hoping to capture some rando weirdness that is being dynamically generated and stuffed at the end of my temporary ASP files. The weirdness is a pattern of two sets of eight characters (appear to be all non capital letters) sandwiched by dots.
I am trying to find the best regex to capture both sets of 8 non capital letter characters, sandwiched by dots.
Here is my current regex:
\.([^A-Z]{8})\.
My current regex works ok to capture the first set, but doesn't capture the second set. I believe it's because the dot is getting eaten after the first match and so there's no dot to trigger the second set from matching.
How can I improve this regex so it captures both sets of dynamic weirdness? Would greatly appreciate any help folks can provide!
Data set to match:
String
Expected match
\Windows\Microsoft.NET\Framework64\v4.0.30319\Temporary ASP.NET Files\svc_pr30\701d8ff1\10cc0653\App_Web_defaultwsdlhelpgenerator.aspx.cdcab7d2.3sl-aaqs.dll
cdcab7d2.3sl-aaqs
\Windows\Microsoft.NET\Framework64\v4.0.30319\Temporary ASP.NET Files\svc_pr21\a201b637\20c58f14\App_Web_defaultwsdlhelpgenerator.aspx.cdcab7d2.xqj2w-wv.dll
cdcab7d2.xqj2w-wv
\Windows\Microsoft.NET\Framework64\v4.0.30319\Temporary ASP.NET Files\web_releaseapi\638ee986\2f0d9ef4\App_Web_defaultwsdlhelpgenerator.aspx.cdcab7d2.-qsn3y9x.dll
cdcab7d2.-qsn3y9x
\Windows\Microsoft.NET\Framework64\v4.0.30319\Temporary ASP.NET Files\web_releaseapi\638ee986\2f0d9ef4\App_Web_defaultwsdlhelpgenerator.aspx.cdcab7d2.pyn4enbe.dll
cdcab7d2.pyn4enbe
\Windows\Microsoft.NET\Framework64\v4.0.30319\Temporary ASP.NET Files\cmuserservice_windowsauth\a10d69fc\d9424d7d\App_Web_defaultwsdlhelpgenerator.aspx.cdcab7d2.thhlx9xi.dll
cdcab7d2.thhlx9xi
Instead of focusing on non-capital letters, note what your dataset actually contains: two runs of lowercase letters, digits, and dashes, separated by a dot.
And the part you want to extract sits just before the .dll
So you can use this regex to extract it:
([a-z0-9-]+?\.[-a-z0-9]+?)\.dll
Then, for your result, simply take group 1 of each regex match.
I presume you know about regex grouping.
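A minimal sketch of reading group 1, assuming Python's `re` module (the question does not name a language). The two paths are copied from the question's data set.

```python
import re

# Pattern from the answer above: two runs of lowercase letters, digits,
# or dashes joined by a dot, directly before ".dll".
PATTERN = re.compile(r"([a-z0-9-]+?\.[-a-z0-9]+?)\.dll")

paths = [
    r"\Windows\Microsoft.NET\Framework64\v4.0.30319\Temporary ASP.NET Files\svc_pr30\701d8ff1\10cc0653\App_Web_defaultwsdlhelpgenerator.aspx.cdcab7d2.3sl-aaqs.dll",
    r"\Windows\Microsoft.NET\Framework64\v4.0.30319\Temporary ASP.NET Files\web_releaseapi\638ee986\2f0d9ef4\App_Web_defaultwsdlhelpgenerator.aspx.cdcab7d2.-qsn3y9x.dll",
]

# search() returns the leftmost match; group(1) is the part before ".dll".
extracted = [PATTERN.search(p).group(1) for p in paths]
print(extracted)
```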
You can use
\.([^A-Z.]{8}\.[^A-Z.]{8})\.
Details:
\. - a dot
([^A-Z.]{8}\.[^A-Z.]{8}) - Group 1: eight chars other than a dot and uppercase ASCII letters, a . and then again eight chars other than a dot and uppercase ASCII letters
\. - a dot.
Group 1 values for each tested string will be:
cdcab7d2.3sl-aaqs
cdcab7d2.xqj2w-wv
cdcab7d2.-qsn3y9x
cdcab7d2.pyn4enbe
cdcab7d2.thhlx9xi
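The sketch below, assuming Python's `re` module, contrasts the question's original pattern with this one on a shortened file name (the tail of one sample path). It illustrates why the original stops after the first set: the dot between the sets is consumed by the first match.

```python
import re

ORIGINAL = re.compile(r"\.([^A-Z]{8})\.")
COMBINED = re.compile(r"\.([^A-Z.]{8}\.[^A-Z.]{8})\.")

# Shortened sample: only the tail of one path from the question.
tail = "aspx.cdcab7d2.pyn4enbe.dll"

# The original pattern consumes the dot after the first set, so no dot
# is left to start a second match.
first_only = ORIGINAL.findall(tail)

# The combined pattern captures both sets inside a single group.
both_sets = COMBINED.search(tail).group(1)

print(first_only, both_sets)
```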

How to select with regex this character?

For example, I have these four IP addresses:
10.100.0.11; wrong
10.100.1.12; good
10.100.11.4; good
10.100.44.1; wrong
The task has simple rules: the 3rd place can't be 0, and the 4th place can't be a solo 1.
I need to select them from an IP table on different routers, and these rules are all I know.
My solution:
^(10.100.[1-9]{1,3}.[023456789]{1,3})$
but in this case every number containing a 1, like 10, 100 etc., is missed, so this solution is wrong.
^(10.100.[1-9]{1,3}.[1-9]{2,3})$
This solves the problem of the single 1, but creates another one.
From the rules you have given, this regex should work:
10\.100\.([123456789]\d*|\d{2,})\.([^1]$|\d{2,})
It also matches a 3rd-position number that contains a 0 alongside other digits,
so 10.100.10.4 will match, as will 10.100.02.4.
I don't know if that's the intended behavior, since I'm not familiar with IP addresses.
The last part \.([^1]$|\d{2,}) reads like this:
"after the 3rd dot is either
a character which is not 1 followed by the end of the line
or two or more digits"
If you want to avoid matching malformed strings that contain a non-digit character, like 10.100.12.a, you should replace [^1] with [023456789] or, lazier (and therefore better ;), with [02-9]
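The behavior described above can be checked with a short sketch, assuming Python's `re` module. `STRICT` is the variant with [02-9] in place of [^1]; the helper name `accepted` is hypothetical.

```python
import re

ORIGINAL = re.compile(r"10\.100\.([123456789]\d*|\d{2,})\.([^1]$|\d{2,})")
STRICT = re.compile(r"10\.100\.([123456789]\d*|\d{2,})\.([02-9]$|\d{2,})")

def accepted(pattern, addresses):
    """Return the addresses in which the pattern finds a match."""
    return [a for a in addresses if pattern.search(a)]

samples = ["10.100.0.11", "10.100.1.12", "10.100.11.4", "10.100.44.1"]
edge_cases = ["10.100.10.4", "10.100.02.4", "10.100.12.a"]

print(accepted(ORIGINAL, samples))     # the two addresses marked "good"
print(accepted(ORIGINAL, edge_cases))  # includes the malformed 10.100.12.a
print(accepted(STRICT, edge_cases))    # the malformed address is rejected
```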
I use https://regex101.com to debug regex. It's just awesome.
You might use
^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$
The pattern matches
^ Start of string
10\.100\. Match 10.100. (note to escape the dot to match it literally)
[1-9]{1,3} Match 1 to 3 times a digit 1-9
\. Match a dot
(?: Non capture group
[02-9] Match a digit 0 or 2-9
| Or
\d{2,3} Match 2 or 3 digits 0-9
) Close the group
$ End of string
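A quick check of this anchored pattern against the four sample addresses, assuming Python's `re` module; the function name `is_valid` is hypothetical. Since [1-9] excludes 0 in every position of the third number, an address such as 10.100.10.4 is rejected as well, which may or may not be what the rules intend.

```python
import re

PATTERN = re.compile(r"^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$")

def is_valid(address):
    """True if the whole address satisfies the anchored pattern."""
    return PATTERN.search(address) is not None

for address in ["10.100.0.11", "10.100.1.12", "10.100.11.4", "10.100.44.1", "10.100.10.4"]:
    print(address, is_valid(address))
```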

Regex to detect preferred stock symbols

To start off, regex is probably the least talented aspect within my programming belt, this is what I have so far:
\D{1,5}(PR)\D+$
\D{1,5} because common stock symbols are always a maximum of 5 letters
(PR) because that is part of the pattern that needs to be searched (more below in the background info)
\D+$ because I'm trying to match any single letter at the end of the string
A small tidbit of background
Preferred stock symbols are not standardized and so every platform, exchange, etc has their own way to display them. Having said that, most display a special character in their name, which makes those guys easy to detect. The characters are
[] {'.', '/', '-', ' ', '+'};
The trickier ones all have a similar pattern:
{symbol}PR{0}
{symbol}p{0}
{symbol}P{0}
Where 0 is just any single letter A-Z
Here is a sample data set for the trickier ones:
PSAPRZ
PSApA
PSApZ
PSAPA
PSAPZ
My regex seems to be working for the first one, since I'm specifically looking for (PR) and matching any single letter character at the end, but I can't for the life of me figure out how to also detect the patterns that end in p{0} or P{0} in the same regex. I completely gave up trying to incorporate finding the special symbols because I can easily just do a string.Contains on the target string for any of those chars. The more important part is figuring out these trickier ones.
How do I get my regex statement to also detect the p{0} and P{0} matches within the same regex statement?
Edit 1
If you're curious at the madness of different possibilities, including the "easy to detect" versions, grab a popcorn, here you go :)
PSA.PA
PSA.PR.A
PSA/PA
PSAPRA
PSA-A
PSA PRA
PSA.PRA
PSA.PA
PSA+A
PSA/PRA
PSApA
PSAPA
PSA-PA
This should do it:
^[A-Z]{1,5}([Pp]|PR)[A-Z]$
Explanation:
^ - anchor at start
[A-Z]{1,5} - one to five uppercase letters
([Pp]|PR) - capture group used for: uppercase P or lowercase p or uppercase PR
[A-Z] - one uppercase letter
$ - anchor at end
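A minimal sketch of this pattern on the question's sample set, assuming Python's `re` module (the question's `string.Contains` suggests a different language). The helper name `preferred_marker` is hypothetical; it returns the captured group, i.e. which marker was found.

```python
import re

PREFERRED = re.compile(r"^[A-Z]{1,5}([Pp]|PR)[A-Z]$")

def preferred_marker(symbol):
    """Return the captured marker (P, p, or PR), or None if there is no match."""
    m = PREFERRED.match(symbol)
    return m.group(1) if m else None

for symbol in ["PSAPRZ", "PSApA", "PSApZ", "PSAPA", "PSAPZ", "PSA"]:
    print(symbol, preferred_marker(symbol))
```

Note that an ordinary symbol whose second-to-last letter is P, such as AAPL, also satisfies the pattern, so it likely needs to be combined with other checks.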
UPDATE after EDIT 1 in question. To support the odd formats with ., /, -, + use this:
^[A-Z]{1,5}[.\/\s\+\-]?([Pp]|PR\.?)[A-Z]$
Explanation:
^ - anchor at start
[A-Z]{1,5} - one to five uppercase letters
[.\/\s\+\-]? - optional single character: ., /, whitespace, +, or -
([Pp]|PR\.?) - capture group used for: uppercase P, or lowercase p, or uppercase PR followed by optional .
[A-Z] - one uppercase letter
$ - anchor at end
Note on anchors: Use ^...$ anchors if you only have the stock symbol in the string. If you have text with a stock symbol anywhere within, use word boundaries \b...\b instead.
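The sketch below, assuming Python's `re` module, runs the extended pattern over the variants listed in the question's Edit 1 and shows the word-boundary form on a made-up sentence. Two variants, PSA-A and PSA+A, carry no P marker and are not matched by this pattern; they would still be caught by the special-character check the question mentions.

```python
import re

EXTENDED = re.compile(r"^[A-Z]{1,5}[.\/\s\+\-]?([Pp]|PR\.?)[A-Z]$")
# Same pattern with word boundaries, for symbols embedded in longer text.
IN_TEXT = re.compile(r"\b[A-Z]{1,5}[.\/\s\+\-]?([Pp]|PR\.?)[A-Z]\b")

variants = [
    "PSA.PA", "PSA.PR.A", "PSA/PA", "PSAPRA", "PSA-A", "PSA PRA",
    "PSA.PRA", "PSA+A", "PSA/PRA", "PSApA", "PSAPA", "PSA-PA",
]

matched = [v for v in variants if EXTENDED.match(v)]
unmatched = [v for v in variants if not EXTENDED.match(v)]
print(matched)
print(unmatched)

# Hypothetical sentence, only to illustrate the \b...\b form.
found = IN_TEXT.search("Bought PSApA today")
print(found.group(0) if found else None)
```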
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

Regex to match number(s) or UUID

I need regex which loosely matches UUIDs and numbers. I expect my filename to be formatted like:
results_SOMETHING.csv
This SOMETHING should ideally be a number (a count of how many times a script has been run) or a UUID.
This regex encompasses a huge set of filenames:
^results_?.*.csv$
and this one:
^results_?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}.csv$
matches only UUIDs. I want a regex whose range is somewhere in between. Mostly I don't want matches like result__123.csv.
Note: This doesn't directly answer the OP question, but given the title, it will appear in searches.
Here's a proper regex to match a uuid based on this format without the hex character constraint:
(\w{8}(-\w{4}){3}-\w{12}?)
If you want it to match only hex characters, use:
/([a-f\d]{8}(-[a-f\d]{4}){3}-[a-f\d]{12}?)/i
(Note the / delimiters used in Javascript and the /i flag to denote case-insensitivity; depending on your language, you may need to write this differently, but you definitely want to handle both lower and upper case letters).
If you're prepending results_ and appending .csv to it, that would look like:
^results_([a-z\d]{8}(-[a-z\d]{4}){3}-[a-z\d]{12}?)\.csv$
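In Python (an assumption; the answer writes the pattern with JavaScript delimiters), the /i flag becomes `re.IGNORECASE`. A minimal sketch of the hex-only pattern, with a hypothetical helper name `find_uuid`:

```python
import re

# Hex-only UUID pattern from above; re.IGNORECASE plays the role of /i.
UUID_HEX = re.compile(r"([a-f\d]{8}(-[a-f\d]{4}){3}-[a-f\d]{12}?)", re.IGNORECASE)

def find_uuid(text):
    """Return the first UUID-shaped substring, or None."""
    m = UUID_HEX.search(text)
    return m.group(1) if m else None

print(find_uuid("results_0a0b0c0d-884f-0099-aa95-1234567890ab.csv"))
print(find_uuid("results_0A0B0C0D-884F-0099-AA95-1234567890AB.csv"))
print(find_uuid("results_123.csv"))
```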
-----EDITED / UPDATED-----
Based on the comments you left, there are some other patterns you want to match (this was not clear to me from the question). This makes it a little more challenging - to summarize my current understanding:
results.csv - match (NEW)
results_1A.csv - match (NEW)
results_ABC.csv - ? no match (I assume)
result__123.csv - no match
results_123.csv - match
Results_123.cvs - ? no match
results_0a0b0c0d-884f-0099-aa95-1234567890ab.csv - match
You will find the following modification works according to the above "specification":
results(?:_(?:[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}|(?=.*[0-9])[A-Z0-9]+))?\.csv
Breaking it down:
results matches characters "results" literally
(?:_ ….)? non-capturing group, repeated zero or one time:
"this is either there, or there is nothing"
[0-9a-f]{8}- exactly 8 characters from the group [0-9a-f]
followed by hyphen "-"
(?:[0-9a-f]{4}-){3} ditto but group of 4, and repeated three times
[0-9a-f]{12} ditto, but group of 12
| OR...
(?=.*[0-9]) lookahead: at least one digit somewhere after this point
[A-Z0-9]+ at least one capital letter or number
\.csv the literal string ".csv" (the '.' has to be escaped)
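The "specification" above can be turned into a small table-driven check. This is a sketch assuming Python's `re` module; it uses `fullmatch`, which requires the whole filename to match, whereas the pattern as written is unanchored.

```python
import re

PATTERN = re.compile(
    r"results(?:_(?:[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}|(?=.*[0-9])[A-Z0-9]+))?\.csv"
)

# Filenames and expected outcomes, taken from the list above.
expected = {
    "results.csv": True,
    "results_1A.csv": True,
    "results_ABC.csv": False,
    "result__123.csv": False,
    "results_123.csv": True,
    "Results_123.cvs": False,
    "results_0a0b0c0d-884f-0099-aa95-1234567890ab.csv": True,
}

actual = {name: PATTERN.fullmatch(name) is not None for name in expected}
for name, ok in actual.items():
    print(name, ok)
```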

Noob regex poser (match MAY contain and MUST have)

Probably really simple for you Regex masters :) I'm a noob at regex, just having picked up some PHP, but wanting to learn (once this project is complete, I'll knuckle down and crack regular expressions).
I'd like to understand how to compose a regex that may contain some data, but must contain other.
My example being, the match MAY begin with numbers but doesn't have to, however if it does, I need the number and the following 2 words. If it doesn't begin with a number, just the first 2 words. The data will be at the beginning of the string.
The following would match:
123 Fore Street, Fiveways (123 Fore Street returned(no comma))
Our House Village (Our House returned)
7 Eightnine (7 Eightnine returned)
Thanks
Something like this should work:
^((?:\d+\s)?\w+(?:\s\w+)?)
You can test it out somewhere like http://rubular.com/ before coding it, it's usually easier.
What it means:
^ -> beginning of the line
(?:\d+\s)? -> a non-capturing group (marked by ?:), consisting of one or more digits and a whitespace character; since we follow it with ?, it's optional.
\w+(?:\s\w+)? -> one or more word characters (look up what \w means), optionally followed by a whitespace character and another "word", again in a non-capturing group.
The whole thing is encapsulated in a capturing group, so group 1 will contain your match.
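A minimal sketch of the pattern on the three examples from the question, written in Python for illustration (the question mentions PHP; the pattern itself is unchanged). The function name `leading_words` is hypothetical.

```python
import re

PATTERN = re.compile(r"^((?:\d+\s)?\w+(?:\s\w+)?)")

def leading_words(text):
    """Return group 1: the optional number plus up to two words."""
    m = PATTERN.match(text)
    return m.group(1) if m else None

for line in ["123 Fore Street, Fiveways", "Our House Village", "7 Eightnine"]:
    print(leading_words(line))
```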
Use this regex with the multiline option:
^(\d+(\s*\b[a-zA-Z]+\b){1,2}|(\s*\b[a-zA-Z]+\b){1,2})
Group 1 contains your required data.
\d+ matches a digit (\d) one or more times (+)
\s* matches whitespace (\s) zero or more times (*)
(\s*\b[a-zA-Z]+\b){1,2} matches one or two words
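With the multiline option, ^ matches at the start of every line, so several addresses can be processed in one pass. A sketch assuming Python's `re.MULTILINE` as the equivalent flag, using the three examples from the question as one block of text:

```python
import re

PATTERN = re.compile(
    r"^(\d+(\s*\b[a-zA-Z]+\b){1,2}|(\s*\b[a-zA-Z]+\b){1,2})",
    re.MULTILINE,
)

text = "123 Fore Street, Fiveways\nOur House Village\n7 Eightnine"

# Group 1 of each match holds the number (if any) and up to two words.
results = [m.group(1) for m in PATTERN.finditer(text)]
print(results)
```

Because \s also matches a newline, a line holding a single word could pull in the first word of the next line; the sample text does not trigger this.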