Regex word count - matching words with apostrophe - regex

I'm trying to count words using Regex, with the following pattern:
#"\\w+"
This works, but it splits it's into:
it
s
Is there a better way to match words that contain punctuation?
Also, words surrounded by punctuation, for example 'word', should also be matched (without the ').

One way to handle such cases is:
#"\\w+(?:'\\w+)?"
So it will match both its and it's, but only its in its'.
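As a quick check of that behaviour, here is a minimal Python sketch. Python and the sample text are assumptions made for illustration (the question does not name them), and the pattern is written with single backslashes inside a raw string.

```python
import re

# Word characters, optionally followed by one apostrophe plus more word characters.
WORD = re.compile(r"\w+(?:'\w+)?")

# Hypothetical sample text covering the cases discussed above.
text = "it's its its' 'word'"

words = WORD.findall(text)
print(words)       # ["it's", 'its', 'its', 'word']
print(len(words))  # 4
```

Because the group is non-capturing, findall returns the whole match for each word, so the length of the list is the word count.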

I find this style readable. This one allows hyphenated words:
'?([a-zA-Z'-]+)'?
This one does not:
'?([a-zA-Z']+)'?
If you want quick and dirty regex testing with visual feedback, you can use one of the many online regex testing tools. I like rubular.com (even for non-Ruby regex testing).

Related

Matching a string between two sets of characters without using lookarounds

I've been working on some regex to try and match an entire string between two characters. I am trying to capture everything from "System", all the way down to "prod_rx." (I am looking to include both of these strings in my match). Below is the full text that I am working with:
\"alert_id\":\"123456\",\"severity\":\"medium\",\"summary\":\"System generated a Medium severity alert\\\\prod_rx.\",\"title\":\"123456-test_alert\",
The regex that I am using right now is...:
(?<=summary\\":\\").*?(?=\\")
This works perfectly when I am able to use lookarounds, such as in Regex101: https://regex101.com/r/jXltNZ/1. However, the regex parser in the software that my company uses does not support lookarounds (crazy, right?).
Anyway - my question is basically how can I match the above text described without using lookaheads/lookbehinds. Any help is VERY MUCH appreciated!!
Well, since lookarounds are unavailable, we can use a capturing group instead, as in this simple expression:
.+summary\\":\\"(.+?)\\",
Our data is in this capturing group, which is lazy so that it stops at the first closing boundary rather than the last one:
(.+?)
Our right boundary is:
\\",
and our left boundary is:
.+summary\\":\\"
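The capturing-group idea can be sketched in Python as follows. This is an illustration under assumptions, not the parser from the question: Python's re module stands in for the unnamed software, the group is made lazy so it stops at the first closing boundary, and the leading .+ is dropped because search() already scans the whole string.

```python
import re

# Sample text from the question; a raw string keeps every backslash literal.
text = r'\"alert_id\":\"123456\",\"severity\":\"medium\",\"summary\":\"System generated a Medium severity alert\\\\prod_rx.\",\"title\":\"123456-test_alert\",'

# Left boundary, lazy capturing group, right boundary (no lookarounds).
pattern = re.compile(r'summary\\":\\"(.+?)\\",')

match = pattern.search(text)
summary = match.group(1) if match else None
print(summary)
```

The wanted text is read from group 1 rather than from the overall match, which is what replaces the lookbehind and lookahead.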

Finding all possible Acronyms

I have created a script using VBA to go through a Word document and find all words that could possibly be acronyms, but I found that my regEx pattern does not find all of them.
The regEx pattern I am using is "([A-Z]{2,})(-([A-Z]{2,})[A-Za-z0-9])"
With this pattern I am able to find
AA
AAA
AA-BB
AA-BBB
AAA-BB
AAA-BBB
AAA-1234
AAA-BBB-1234
but it does not find these words
B2B
B2B-1234
B2B-A1A-1234
The expected match is a word whose first character is a letter and which contains at least two uppercase letters and at least one number. In addition, if there are dashes in the word, then the characters before the dash must meet the same expectation.
Is there a way to extend the regEx pattern above to also include the letter-digit-letter acronyms?
Milco, welcome to StackOverflow. I think that the following regex will work for you:
([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*
This regex accommodates digits and an optional number of hyphenated terms and matches each of your cases above. I tested it out at regextesteronline.com - I'm assuming that VB.net regexes are the same as VBA, which they should be, at least for basic regexes.
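To check the suggested pattern against every example from the question, here is a small Python sketch. Python is an assumption (the question uses VBA), but the pattern only relies on basic syntax that should behave the same way there.

```python
import re

ACRONYM = re.compile(r"([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*")

# All examples listed in the question, found and not found alike.
candidates = [
    "AA", "AAA", "AA-BB", "AA-BBB", "AAA-BB", "AAA-BBB",
    "AAA-1234", "AAA-BBB-1234", "B2B", "B2B-1234", "B2B-A1A-1234",
]

# fullmatch requires the whole candidate to match, not just a prefix.
unmatched = [c for c in candidates if ACRONYM.fullmatch(c) is None]
print(unmatched)  # []
```

Note that the pattern is more permissive than the rule stated in the question: a string such as A1 (one letter and one digit) also matches, so a stricter requirement would need an extra check.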

Regex for extracting each word between hyphens

I am learning regex and trying to write a pattern that exactly matches each of the strings without the '-', so that I can iterate over the groups and print the respective strings.
I have a string that looks like "Abcd001-wd2s-vwe1-20180e3103.txt"
I was able to write a regex for extracting Abcd001, wd2s and .txt from above text as shown below
(\A[^-]+) => Abcd001
(-[^-]+-) => wd2s
(\..*) => .txt
However, I was unable to come up with the correct pattern for extracting the exact strings vwe1 and 20180e3103
It will be really helpful if you can guide me on this or if there is a better approach to achieve this?
Please note: [^-.]+ may give me all the words separately, but I am looking for an option where I have a group defined for each of these strings so that it's a one-to-one mapping.
Thanks!
To get vwe1 or 20180e3103 from the example data, you might use a quantifier {2} or {3} to repeat matching one or more word characters followed by a hyphen (?:\w+-){2}.
Then you could capture in a group ([^-.]+) matching not a hyphen or a dot.
(?:\w+-){2}([^-.]+)
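A short Python sketch of this approach, assuming Python's re module (the question does not name a language). The helper segment_after is a hypothetical name introduced only for this illustration; it fills the quantifier in from its first argument.

```python
import re

name = "Abcd001-wd2s-vwe1-20180e3103.txt"

def segment_after(skip: int, text: str):
    """Return the segment that follows `skip` hyphen-terminated word runs."""
    m = re.search(r"(?:\w+-){%d}([^-.]+)" % skip, text)
    return m.group(1) if m else None

print(segment_after(2, name))  # vwe1
print(segment_after(3, name))  # 20180e3103
```

The non-capturing group only skips the earlier segments, so group 1 always holds exactly one segment.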
Try the below regex
/\-([^\)]+)\-/gmi;
Also check the similar implementation:
https://stackoverflow.com/a/50336050/8179245

regex look ahead behind (look around) negative problems

I am having trouble understanding negative regex lookahead / lookbehind. I got the impression from reading tutorials that when you set a criterion to look for, the criterion doesn't form part of the search match.
That seems to hold for the positive lookahead examples I tried, but when I tried these negative ones, they matched the entire test string. First, it shouldn't have matched anything, and second, even if it did, it wasn't supposed to include the lookahead criterion?
(?<!^And).*\.txt$
with input
And.txt
See: https://regex101.com/r/vW0aXS/1
and
^A.*(?!\.txt$)
with input:
A.txt
See: https://regex101.com/r/70yeED/1
PS: if you're going to ask me which language, I don't know. We've been told to use regex without any reference to a specific language. I tried clicking various options on regex101.com and they all came up the same.
Lookarounds only try to match at their current position.
You are using a lookbehind at the beginning of the string, (?<!^And).*\.txt$, and a lookahead at the end of the string, ^A.*(?!\.txt$), which won't work. (.* will always consume the whole string on its first attempt.)
To disallow "And", for example, you can put the lookahead at the beginning of the string with a greedy quantifier .* inside it, so that it scans the whole string. Anchor it with ^ so that the engine cannot satisfy the lookahead by starting the match at a later position:
^(?!.*And).*\.txt$
https://regex101.com/r/1vF50O/1
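A minimal Python sketch of the leading-lookahead idea, assuming Python's re module. The leading ^ is an addition made for this sketch: without it, an engine that searches for a match can skip the first character and still succeed. The file names are hypothetical.

```python
import re

# Assumption: ^ pins the lookahead to position 0 so it scans the whole string.
NO_AND_TXT = re.compile(r"^(?!.*And).*\.txt$")

results = {n: bool(NO_AND_TXT.search(n))
           for n in ["And.txt", "MeAndYou.txt", "Other.txt", "Other.doc"]}
print(results)

# Without the anchor, the search restarts at position 1 and succeeds there.
unanchored = re.search(r"(?!.*And).*\.txt$", "And.txt")
print(unanchored.group())  # nd.txt
```

Only Other.txt is accepted by the anchored pattern; the unanchored one still finds nd.txt inside And.txt.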
Your understanding is correct and the issue is not with the lookbehind/lookahead. The issue is with .*, which matches the entire string in both cases. The period . matches any character, and following it with * makes it match a string of any length. Remove it and both of your regexes will work:
(?<!^And)\.txt$
^A(?!\.txt$)
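The difference can be observed with a short Python sketch (Python's re module is an assumption; the question is language-agnostic). It runs the original patterns and the versions without .* on the inputs from the question.

```python
import re

# Original patterns: .* lets the lookaround be checked where it cannot fail.
with_star_behind = re.search(r"(?<!^And).*\.txt$", "And.txt")
with_star_ahead = re.search(r"^A.*(?!\.txt$)", "A.txt")
print(with_star_behind.group())  # And.txt
print(with_star_ahead.group())   # A.txt

# With .* removed, the lookaround is checked right next to ".txt".
no_star_behind = re.search(r"(?<!^And)\.txt$", "And.txt")
no_star_ahead = re.search(r"^A(?!\.txt$)", "A.txt")
print(no_star_behind, no_star_ahead)  # None None
```

One consequence worth keeping in mind: for a name that is allowed, the lookbehind version matches only the .txt suffix rather than the whole file name.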

How to regex replace all characters of specific type in lines that have a specific word

There is a multiline text in which I'm interested in specific lines, indicated by specific words. For example, I am interested in the lines that have ".jpg" in them.
I'm trying to use a lookahead:
(?=\.jpg)
In these lines I would like to delete specific characters, for example all matches of "_".
Sample input:
https://somewebpage/stuff1_stuff2_stuff3.jpg
Desired output:
https://somewebpage/stuff1stuff2stuff3.jpg
I'm trying to write this regex for the latest Notepad++.
My problem is that I can't seem to properly combine the positive lookahead with my regex so that it applies to every match:
([^_]*)(_?)
Any help is appreciated.
[_-](?=.*\.jpg) worked for me. Replace with an empty string to remove the characters, or just do a find. You can expand the character list, of course, but I think this covers you.
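The same replacement can be tried outside Notepad++ with a small Python sketch. Python's re.sub and the second sample line are assumptions added for illustration. In Python, . does not match a newline by default, so the lookahead only sees the rest of the current line; in Notepad++ this corresponds to leaving ". matches newline" unchecked.

```python
import re

# First line is the sample from the question; the .png line is hypothetical.
text = (
    "https://somewebpage/stuff1_stuff2_stuff3.jpg\n"
    "https://somewebpage/keep_this.png"
)

# Remove _ or - only when ".jpg" appears later on the same line.
cleaned = re.sub(r"[_-](?=.*\.jpg)", "", text)
print(cleaned)
```

The .jpg line loses its underscores while the .png line is left unchanged.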