Regex to match a word or a dot - regex

This should be a fairly trivial question, but I have spent quite some time on it and I'm unable to solve it.
If this is my string:
"this/DT word/NN is/VBZ a/DT dot/NN ./."
I want to extract the immediate neighbors of /, be it a word, a comma or a full stop.
(\\w+)/(\\w+) gives the words before and after / but not the full stops etc.
I tried "\\.\\/\\.|(\\w+)/(\\w+)" for grabbing the full stops, but it doesn't seem to work.
Can someone help, please? (I am trying this in R.)
Thanks!

Note that \w only matches letters, digits and an underscore. A dot/period belongs to punctuation and can be captured with Perl-like \p{P} or POSIX class [:punct:]. Thus, theoretically, you could use something like ([\\w[:punct:]]+)/([\\w[:punct:]]+) (or even a more POSIXish ([[:alpha:][:punct:]]+)/([[:alpha:][:punct:]]+)), but I guess matching non-whitespace characters on both sides of / suits your purpose best.
Here is an alternative to the (\\S+)/(\\S+) regex:
([^\\s]+)/([^\\s]+)
See regex demo
The [^\s] means any symbol other than a whitespace. Note that \S also means any non-whitespace character.
If there may be no non-whitespace characters at all on one or both sides of /, I believe
([^\\s]*)/([^\\s]*)
or
(\\S*)/(\\S*)
will work better for you since * will match 0 or more characters.
See another demo
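As a quick check of the patterns above, here is a minimal sketch in Python rather than R (an assumption made purely for illustration; in Python raw strings the backslashes are not doubled). The variable names are hypothetical.

```python
import re

text = "this/DT word/NN is/VBZ a/DT dot/NN ./."

# \w+ only matches letters, digits and underscores, so "./." is skipped.
word_pairs = re.findall(r"(\w+)/(\w+)", text)

# \S+ matches any run of non-whitespace characters, so "./." is captured too.
pairs = re.findall(r"(\S+)/(\S+)", text)

# \S* additionally allows an empty side next to the slash.
loose_pairs = re.findall(r"(\S*)/(\S*)", "a/ /b")

print(word_pairs)
print(pairs)
print(loose_pairs)
```

The first list lacks the full-stop pair, the second includes it, and the third shows the * variant accepting a slash with nothing on one side.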

You can use this regex
"(\\S+)/(\\S+)"
i.e. grab each non-space text before and after /.
RegEx Demo

Related

Regex Extraction - Match before a space, or NOT before a space

Here are my potential inputs:
brian#muck.co, brian#gmail.com
brian#gmail.com, brian#muck.co
What I want to do is extract the #muck.co email address.
What I have tried is:
\s.*#muck.co
The problem is that this only grabs an email address if it is preceded by a space (so it would only match the second example input above). How would I write a regular expression to match either input?
\s matches a whitespace character, so you should use something like [^\s]*#muck.co, which means any number of non-whitespace characters followed by #muck.co. [] defines a set of symbols and ^ negates it.
It does not work for me, because \s in my regex flavour does not seem to include the regular space, but this works: [^[:space:]]\+#muck\.co. It also uses \+ instead of * to require one or more non-space characters rather than any number, and escapes the dot as \. because an unescaped dot stands for any single character.
You can use a negated character class so as not to cross the #, and a word boundary at the end to prevent a partial word match:
[^\s#]+#muck\.co\b
Regex demo
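To see how the negated class and the word boundary behave on both inputs, here is a small sketch in Python (the language choice and the names PATTERN, inputs, matches and partial are assumptions for illustration, since the question does not name a regex flavour).

```python
import re

# One or more characters that are neither whitespace nor '#',
# then the literal "#muck.co", then a word boundary.
PATTERN = re.compile(r"[^\s#]+#muck\.co\b")

inputs = [
    "brian#muck.co, brian#gmail.com",
    "brian#gmail.com, brian#muck.co",
]
matches = [PATTERN.search(s).group(0) for s in inputs]

# The word boundary rejects a longer domain such as muck.com.
partial = PATTERN.search("brian#muck.com")

print(matches)
print(partial)
```

Both inputs yield the same address regardless of its position, and the muck.com case is rejected because no word boundary follows "co" there.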

RegEx in VSCode: capture every character/letter - not just ASCII

I am working with historical text and I want to reformat it with RegEx. Problem is: There are lots of special characters (that is: letters) in the text that are not matched by RegEx character classes like [a-z] / [A-Z] or \w .
For example I want to match the dot (and only the dot) in the following line:
<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1>
Without the ÿ I could easily work with the mentioned character classes, like:
(?<=(<tag1>(\w|\s)*))\.(?=((\w|\s)*</tag1>))
But it does not work with special characters that are not covered by ASCII. I tried lots of things but I can't make it work so the RegEx really only captures the dot in this very line. If I use more general Expressions like (.)* (instead of (\w|\s)* ) I get many more of the dots in the document (for example dots that are not between an opening and a closing tag but in between two such tagsets), which is not what I want. Any ideas for an expression that covers like all unicode letters?
You may match any run of characters other than < and >, non-ASCII letters included, with [^<>]*:
(?<=(<tag1>[^<>]*))\.(?=([^<>]*</tag1>))
See the regex demo. Not sure you need all those capturing groups, you might get what you need without them:
(?<=<tag1>[^<>]*)\.(?=[^<>]*</tag1>)
See this regex demo. Details:
(?<=<tag1>[^<>]*) - a location immediately preceded by <tag1> and then any zero or more chars other than < and >
\. - a dot
(?=[^<>]*</tag1>) - a location immediately followed by any zero or more chars other than < and > and then </tag1>.
Use a negated character class that excludes the dot and the opening angle bracket:
(?<=<tag1>[^.<]*(?:<(?!/tag1>)[^.<]*)*)\.
With this kind of pattern, it isn't even necessary to check the closing tag. But if you absolutely want to check it, end the pattern with:
(?=[^<]*(?:<(?!/tag1>)[^<]*)*</tag1>)
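The lookbehind patterns above rely on variable-length lookbehind, which VSCode's search supports but Python's built-in re module does not. So the sketch below swaps in a different technique to show the same [^<>]* idea: match the whole <tag1>...</tag1> element first, then locate the dots inside its content. The function name dots_inside_tag1 and the sample text around the tag are hypothetical.

```python
import re

# Content between the tags: any characters except < and >,
# so non-ASCII letters such as ÿ are accepted without special handling.
TAG1 = re.compile(r"<tag1>([^<>]*)</tag1>")

def dots_inside_tag1(text):
    """Return the offsets of every dot located inside a <tag1>...</tag1> element."""
    offsets = []
    for match in TAG1.finditer(text):
        content_start = match.start(1)
        for i, ch in enumerate(match.group(1)):
            if ch == ".":
                offsets.append(content_start + i)
    return offsets

sample = "Intro. <tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1> Outro."
found = dots_inside_tag1(sample)
print(found)
```

Only the dot after "Demosth" is reported; the dots outside the element are ignored. Unlike the second pattern above, this sketch does not handle other tags nested inside <tag1>.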

How to find a particular string

I'm using Visual Studio 2017 and, in a very long text file, I'm searching for a particular function call but I'm unable to find it.
Here's the regex I'm using:
c\.CreateMap\<(\w)+\,\s+Address\>
and I want to match these:
c.CreateMap<ClientAddress, Address>()
c.CreateMap<Responses.SiteAddress, Data.Address>()
and so on.
As soon as I add "Address" to the regex, it stops matching anything.
What am I doing wrong?
You can try this
c\.CreateMap\<\w+\.?\w+?\,\s*\w*?\.?Address\>
Explanation
c\.CreateMap\< - Matches the literal text c.CreateMap<.
\w+ - Matches a word character one or more times.
\.? - Matches '.' zero or one time.
\w+? - Matches a word character one or more times (lazily).
\, - Matches ','.
\s* - Matches whitespace zero or more times.
\w*? - Matches a word character zero or more times (lazily).
\.? - Matches '.' zero or one time.
Address\> - Matches the literal text Address>.
Demo
P.S- In case you also want to match something like this.
c.CreateMap<Responses.SiteAddress.abc, Data.Address.xyz>()
You can use this.
c\.CreateMap\<(\w+\.?\w+?)*\,\s*(?:\w*?\.?)*Address(\.\w*)?\>
Demo
Here is a general regex I can suggest:
c\.CreateMap\<[\w.]+,\s+(?:[\w.]+\.)?Address\>\s*\(\s*\)
This will match any term made of dots or word characters in the first position in the diamond. In the second position, it will match Address, optionally preceded by parent class names and a dot separator.
Demo
Note that I also include the empty function call parentheses in the regex. I also allow for flexible whitespace after the diamond and between the parentheses.
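For anyone who wants to try the general pattern outside Visual Studio, here is a hedged sketch in Python (an assumption for illustration; Visual Studio itself uses the .NET regex engine). The angle brackets are left unescaped, which Python accepts, and the last two test lines are hypothetical additions that should not match.

```python
import re

# Same structure as the pattern above; < and > need no escaping in Python.
PATTERN = re.compile(r"c\.CreateMap<[\w.]+,\s+(?:[\w.]+\.)?Address>\s*\(\s*\)")

lines = [
    "c.CreateMap<ClientAddress, Address>()",
    "c.CreateMap<Responses.SiteAddress, Data.Address>()",
    "c.CreateMap<ClientAddress, Person>()",   # different target type
    "c.CreateMap<Client, HomeAddress>()",     # Address is only a suffix here
]
results = [bool(PATTERN.search(line)) for line in lines]
print(results)
```

The optional group (?:[\w.]+\.)? must end in a dot, which is why a type such as HomeAddress is rejected while Data.Address is accepted.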
In your second example, you have an extra dot which is not handled. Your regex needs a little modification. Also, you don't need to escape <, > or the comma. Use this:
c\.CreateMap<([\w.])+,\s+[\w.]*Address>
Demo
To match any of the functions in your question, you can use:
c\.CreateMap[^)]+\)
Regex Demo

Match asterisk followed by space in PCRE

I'm just having trouble figuring out how to regex properly. What I need is to match an asterisk followed by a space followed by any amount of characters that aren't \n. (Similar to reddit list formatting)
Example:
* Test
* Test2
* Test3
The closest I got was this, but it wasn't working.
/^[*][ ](.*?)/s
Can anyone familiar with PCRE help me.
You should not use a lazy dot pattern at the end of the regex because it will never consume a single char: a lazy quantifier matches as little as possible, and since there is nothing after it that still has to match, .*? settles for the empty string.
Use the greedy dot pattern:
^\* (.*)
See the regex demo
Other notes: you may use \h to match any horizontal whitespace instead of the regular space in the pattern. To match the start of each line with ^, use the m modifier. Only use the s modifier if you need . to match any char including a newline (and carriage return, depending on the PCRE verbs that are active).
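The difference between the lazy and the greedy dot at the end of a pattern can be seen in a short sketch. It uses Python's re module rather than PCRE (an assumption for illustration; re.MULTILINE plays the role of the m modifier, and \h is not available there, so a plain space is used).

```python
import re

text = "* Test\n* Test2\n* Test3"

# Greedy: .* runs to the end of each line.
greedy = re.findall(r"^\* (.*)", text, flags=re.MULTILINE)

# Lazy: .*? is satisfied with zero characters because nothing follows it.
lazy = re.findall(r"^\* (.*?)", text, flags=re.MULTILINE)

print(greedy)
print(lazy)
```

The greedy version returns the three list items, while the lazy version returns three empty strings.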

Regex to match words after dot until a whitespace occurs

Given the following string
span.a.b this.is.really.confusing
I need to return the matches a and b. I've been able to get close with the following regex:
(?<=\.)[\w]+
But it's also matching is, really, and confusing. When I include a positive lookahead I get even closer, but I'm still not there.
(?<=\.)[\w]+(?=\s) # matches b, confusing
How can I match words after a dot until a whitespace occurs?
NB: this is language-agnostic pseudo-code, but should work.
regex = "^[^\s.]+\.(\S+).*"
targets = <extracted_group>.split(".")
Regex explanation:
"^": begins with
"[^\s.]+\." 1 or more non-whitespace, non-period characters, followed by a literal period.
"(\S+)": group and capture all of the following non-whitespace characters
".*": matches 0 or more of any non-newline character
If the split function takes a regex instead of a string, you'll need to escape the '.' or use a character class.
NB: You can do it without the split, but I think that the split is more transparent.
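One possible concrete version of the capture-then-split idea, sketched in Python (the language and the function name words_after_dot are assumptions; the period after the first token is escaped as \. so that it only matches a literal dot):

```python
import re

def words_after_dot(text):
    """Capture everything after the first dot up to the first whitespace, then split on dots."""
    match = re.match(r"^[^\s.]+\.(\S+).*", text)
    if match is None:
        return []
    # str.split takes a plain string here, so "." needs no escaping.
    return match.group(1).split(".")

targets = words_after_dot("span.a.b this.is.really.confusing")
print(targets)
```

For the example string this gives a and b; a first token without any dot, such as "span other.words", yields an empty list.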
I am not sure if this is good enough for all your possible cases, but it should work with the provided example:
\.([\w]+)\.([\w]+)\s
$1 = a, $2 = b