How to distinguish between saved segment and alternative?

From the following text...
Acme Inc.<SPACE>12345<SPACE or TAB>bla bla<CRLF>
... I need to extract company name + zip code + rest of the line.
Since either a TAB or a SPACE character can separate the second from the third tokens, I tried using the following regex:
FIND:^(.+) (\d{5})(\t| )(.+)$
REPLACE:\1\t\2\t\3
However, the contents of the alternative part are put in \3, so the result is this:
Acme Inc.<TAB>12345<TAB><TAB or SPACE here>$
How can I tell the (Perl) regex engine that (\t| ) is an alternative instead of a token to be saved in RAM?
Thank you.

You want:
^(.+?) (\d{5})[\t ](.+)$
Since you are matching one character or the other, you can use a character class instead. Also, I made your first quantifier non-greedy (+? instead of +) to reduce the amount of backtracking the engine has to do to find the match.
In general, if you want to make capture groups not capture anything, you can add ?: to it, like:
^(.+?) (\d{5})(?:\t| )(.+)$
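For illustration, here is a minimal sketch of both patterns in Python. Python's re module is an assumption here (the question targets a Perl-style engine), and the sample lines are made up from the question's template. It suggests that with either form, \3 ends up holding the rest of the line rather than the separator.

```python
import re

# Sample lines built from the question's template: a space before the ZIP
# code, then either a SPACE or a TAB before the rest of the line.
line_with_space = "Acme Inc. 12345 bla bla"
line_with_tab = "Acme Inc. 12345\tbla bla"

# Character class: [\t ] matches one TAB or one SPACE and captures nothing.
char_class = re.compile(r"^(.+?) (\d{5})[\t ](.+)$")
# Non-capturing group: (?:...) groups the alternation without numbering it.
non_capturing = re.compile(r"^(.+?) (\d{5})(?:\t| )(.+)$")


def to_tabs(pattern, line):
    """Rewrite the line as name<TAB>zip<TAB>rest."""
    return pattern.sub(r"\1\t\2\t\3", line)


for pattern in (char_class, non_capturing):
    for line in (line_with_space, line_with_tab):
        print(repr(to_tabs(pattern, line)))
```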

Use non-capturing parentheses:
^(.+) (\d{5})(?:\t| )(.+)$

One way is to use \s instead of ( |\t) which will match any whitespace char.
See Backslash-sequences for how Perl defines "whitespace".
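One thing to keep in mind with \s is that it is broader than "TAB or SPACE": it also accepts line breaks. A small sketch, using Python's re module as a stand-in (an assumption; Perl's exact definition of whitespace is in the page referenced above):

```python
import re

# A ZIP code followed by a newline instead of a TAB or SPACE.
text = "12345\nbla"

# The explicit class only accepts TAB or SPACE, so this does not match.
tab_or_space = re.search(r"\d{5}[\t ]bla", text)
# \s accepts the newline as well, so this does match.
any_whitespace = re.search(r"\d{5}\sbla", text)

print(tab_or_space is None)
print(any_whitespace is not None)
```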

Related

Case analysis with REGEX

I have some data like
small_animal/Mouse
BigAnimal:Elephant
Not an animal.
What I want to get is:
Mouse
Elephant
Not an animal.
Thus, I need a regular expression that searches for / or : as follows: If one of these is found, take the text behind that character. If neither / nor : exists, take the whole string.
I tried a lot. For example this will work for mouse and elephant, but not for the third line:
(?<=:)[^:]*|(?<=/)[^/]*
And this will always give the full string ...
(?<=:)[^:]*|(?<=/)[^/]*|^.*$
My head is burning^^ Maybe, somebody can help? :) Thanks a lot!
EDIT:
@The fourth bird offered a nice solution for single characters. But what if I want to search for strings like
animal::Dog
Another123Cat
Not an animal.
How can I split on :: or 123?
You might use
^(?:[^:/]*[:/])?\K.+
^ Start of string
(?:[^:/]*[:/])? Optionally match any char except : or / till matching either : or /
\K Forget what is matched so far
.+ Match 1+ times any char
If you don't want to cross a newline, you can extend the character class with [^:/\r\n]*
Another option could be using an alternation
^[^:/]*[:/]\K.+|.+
Or perhaps making use of a SKIP FAIL approach by matching what you want to omit
^[^:/]*[:/](*SKIP)(*F)|.+
If you want to use multiple characters, you might also use
^(?:(?:(?!123|::|[:/]).)*+(?:123|::|[:/]))?\K.+
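The patterns above rely on \K and (*SKIP)(*F), which Python's built-in re module does not support. If you happen to be working in Python (an assumption; the question does not name an engine), a rough equivalent is to capture the tail in a group, or to split once on the delimiters. The function names below are hypothetical.

```python
import re

# Single-character delimiters: optionally skip up to the first : or /,
# then capture whatever follows in group 1 (stands in for \K).
SINGLE = re.compile(r"^(?:[^:/]*[:/])?(.+)$")

# Multi-character delimiters: split once on the first delimiter found.
MULTI = re.compile(r"::|123|[:/]")


def after_single(s):
    m = SINGLE.match(s)
    return m.group(1) if m else s


def after_multi(s):
    # The last piece is the text after the first delimiter, or the whole
    # string when no delimiter is present.
    return MULTI.split(s, maxsplit=1)[-1]


for s in ("small_animal/Mouse", "BigAnimal:Elephant", "Not an animal."):
    print(after_single(s))
for s in ("animal::Dog", "Another123Cat", "Not an animal."):
    print(after_multi(s))
```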

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators can't even confirm which flavour of regex it is. They believe it's Python, but I'm unsure.
Update
Unfortunately, none of the patterns below has worked so far. I tested each one in the live system (rather than only in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
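A minimal sketch of these two operations, assuming a Python environment where a replace is available (the question says the target tool may not offer one). The function name find_codes is hypothetical.

```python
import re

# Step 1 pattern: a positioner is a question mark followed by two digits.
POSITIONER = re.compile(r"\?[0-9]{2}")
# Step 2 pattern: T followed by exactly seven digits.
TARGET = re.compile(r"T[0-9]{7}")


def find_codes(text):
    """Strip the positioners, then collect every T-code."""
    cleaned = POSITIONER.sub("", text)
    return TARGET.findall(cleaned)


print(find_codes("T123?214567\nT?211234567\nT0000001"))
```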
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
First, match the ?21 and replace it with a distinctive character such as #:
\?21
and you may try this regex to find what you want
(T(?:\d{7}|[\#\d]{8}))\s
The target string is captured in group 1 (\1).
Finally, replace # with ?21 or something you like.
Python script may be like this
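import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
# Step 1: swap each positioner for a single placeholder character.
rexpre = re.compile(r'\?21')
# Step 2: match T plus 7 digits, or T plus 8 digits/placeholders,
# followed by a whitespace character (so the last line is not matched).
regx = re.compile(r'(T(?:\d{7}|[#\d]{8}))\s')
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Step 3: put the positioner back in place of the placeholder.
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))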
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
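As a sketch of how that replacement behaves, assuming Python's re module (the question leaves the actual engine unconfirmed) and a hypothetical helper name:

```python
import re

# Group 1: T and any digits before the positioner.
# Optional non-capturing part: the positioner itself (? plus two digits).
# Group 2: the digits after the positioner.
CODE = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")


def strip_positioner(text):
    """Rebuild each code from group 1 + group 2, dropping the positioner."""
    return CODE.sub(r"\1\2", text)


print(strip_positioner("T123?214567"))
print(strip_positioner("T?211234567"))
print(strip_positioner("T0000001"))
```

Note that this pattern does not check the total number of digits, so a length check on the result may still be needed.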
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

Regexp to search and replace with transformed result

In our project we had translations called like
Resources.Blabla.MooFoo.GetString("I am a whatever string!");
It was using Resgen, and now we want to use "standard" way for it.
We parsed the source text files (removing special characters from the keys) into resx files, and now want to search and replace across the whole project to change every call to
Resources.Translate("Iamawhateverstring");
The point here is that aside from replacing the call signature, which is not a problem, I need to parse out symbols like spaces, dots, etc. from parameter so that
"I am a whatever string!"
turns into
"Iamawhateverstring"
How can I do it?
Regex for spaces replacement:
(?<=(GetString\(")[A-Za-z0-9 ]+) (?=(.*?("\)){1}))
(?<=(GetString\(")[A-Za-z0-9 ]+) is a lookbehind: it requires the space to be preceded by GetString(" and then one or more letters, digits, or spaces. This is a variable-length lookbehind, which .NET supports but many other engines do not.
(?=(.*?("\)){1})) looks after space character for ")
I would replace it using this regex:
(Resources.Blabla.MooFoo.GetString)(\(".*"\);)
then when you get the capture groups:
Replace capture group 1 with "Resources.Translate"
Replace capture group 2 using captureGroup2.replace(/\s/g, '')
So essentially it is a two step process.
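The two-step idea can also be done in one pass with a replacement callback. Below is a minimal sketch in Python, chosen here only for illustration (the project in the question appears to be .NET). It assumes that a key keeps only ASCII letters and digits, which matches the single example given but is otherwise a guess, and that the string literal contains no escaped quotes. The names to_key, rewrite_call, and migrate are hypothetical.

```python
import re

# Matches the old call and captures the string literal's contents.
# Assumption: the literal contains no escaped double quotes.
CALL = re.compile(r'Resources\.Blabla\.MooFoo\.GetString\("([^"]*)"\)')


def to_key(text):
    # Assumption: keys keep only ASCII letters and digits.
    return re.sub(r"[^A-Za-z0-9]", "", text)


def rewrite_call(match):
    # Step 1: new call signature. Step 2: cleaned-up key.
    return 'Resources.Translate("{}")'.format(to_key(match.group(1)))


def migrate(source):
    return CALL.sub(rewrite_call, source)


print(migrate('Resources.Blabla.MooFoo.GetString("I am a whatever string!");'))
```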

How to capture text between two markers?

For clarity, I have created this:
http://rubular.com/r/ejYgKSufD4
My strings:
http://blablalba.com/foo/bar_soap/foo/dir2
http://blablalba.com/foo/bar_soap/dir
http://blablalba.com/foo/bar_soap
My Regular expression:
\/foo\/(.*)
This returns:
/foo/bar_soap/dir/dir2
/foo/bar_soap/dir
/foo/bar_soap
But I only want
/foo/bar_soap
Any ideas how I can achieve this? As illustrated above, I want everything after foo up until the first forward slash.
Thanks in advance.
Edit: I only want the text after foo until the next forward slash. Some directories may also be named foo, and this would produce incorrect results. Thanks
. will match anything, so you should change it to [^/] (not slash) instead:
\/foo\/([^\/]*)
Some of the other answers use + instead of *. That might be correct depending on what you want to do. Using + forces the regex to match at least one non-slash character, so this URL would not match since there isn't a trailing character after the slash:
http://blablalba.com/foo/
Using * instead would allow that to match since it matches "zero or more" non-slash characters. So, whether you should use + or * depends on what matches you want to allow.
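The difference between the two quantifiers can be checked with a short sketch. Python is used here as an assumption (the question links to Rubular, so the original context is Ruby), but the patterns behave the same way for this case.

```python
import re

STAR = re.compile(r"/foo/([^/]*)")  # zero or more non-slash characters
PLUS = re.compile(r"/foo/([^/]+)")  # at least one non-slash character

bare = "http://blablalba.com/foo/"
nested = "http://blablalba.com/foo/bar_soap/foo/dir2"

# With *, the bare URL still matches and the group is an empty string.
print(repr(STAR.search(bare).group(1)))
# With +, the bare URL does not match at all.
print(PLUS.search(bare))
# Both patterns stop at the first slash after /foo/.
print(STAR.search(nested).group(1), PLUS.search(nested).group(1))
```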
Update
If you want to filter out query strings too, you could also filter against ?, which must come at the front of all query strings. (I think the examples you posted below are actually missing the leading ?):
\/foo\/([^?\/]*)
However, rather than rolling your own solution, it might be better to just use split from the URI module. You could use URI::split to get the path part of the URL, then use String#split to split it up by /, and grab the first one. This would handle all the weird cases for URLs. One that you probably haven't thought of yet is a URL with a specified fragment, e.g.:
http://blablalba.com/foo#bar
You would need to add # to your filtered-character class to handle those as well.
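The same parse-then-split idea, sketched in Python rather than Ruby (a substitution for illustration: urllib.parse.urlsplit plays the role of URI::split here). The helper name segment_after and its marker parameter are hypothetical.

```python
from urllib.parse import urlsplit


def segment_after(url, marker="foo"):
    """Return the path segment right after the first `marker` segment.

    Returns None when the marker is missing or is the last segment.
    """
    # urlsplit separates the path from any query string or fragment.
    parts = urlsplit(url).path.split("/")
    if marker in parts:
        i = parts.index(marker)
        if i + 1 < len(parts):
            return parts[i + 1]
    return None


print(segment_after("http://blablalba.com/foo/bar_soap/foo/dir2"))
print(segment_after("http://blablalba.com/foo/bar_soap?x=1"))
print(segment_after("http://blablalba.com/foo#bar"))
```

Because the query string and fragment are removed before splitting, no extra characters need to be added to a character class.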
You can try this regular expression
/\/foo\/([^\/]+)/
\/foo\/([^\/]+)
[^\/]+ gives you a series of characters that are not a forward slash.
the parentheses cause the regex engine to store the matched contents in a group ([^\/]+), so you can get bar_soap out of the entire match of /foo/bar_soap
For example, in javascript you would get the matched group as follows:
regexp = /\/foo\/([^\/]+)/ ;
match = regexp.exec("/foo/bar_soap/dir");
console.log(match[1]); // prints bar_soap

Using RegEx with something of the format "xxx:abc" to match just "abc"?

I've not done much RegEx, so I'm having issues coming up with a good RegEx for this script.
Here are some example inputs:
document:ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2, file:90jfa9_189204hsfiansdIASDNF, pdf:a09srjbZXMgf9oe90rfmasasgjm4-ab, spreadsheet:ASr0gk0jsdfPAsdfn
And here's what I'd want to match on each of those examples:
ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2, 90jfa9_189204hsfiansdIASDNF, a09srjbZXMgf9oe90rfmasasgjm4-ab, ASr0gk0jsdfPAsdfn
What would be the best and perhaps simplest RegEx to use for this? Thanks!
.*:(.*) should get you everything after the last colon in the string as the value of the first group (or second group if you count the 'match everything' group).
An alternative would be [^:]*$ which gets you all characters at the end of the string up to but not including the last character in the string that is a colon.
Use something like below:
([^:]*)(,|$)
and get the first group. You can use a non-capturing group (?:ABC) if needed for the last. Also this makes the assumption that the value itself can have , as one of the characters.
I don't think answers like (.*)\:(.*) would work. It will match entire string.
(.*)\:(.*)
And take the second capture group...
Simplest seems to be [^:]*:([^,]*)(?:,|$).
That is find something that has something (possibly nothing) up to a colon, then a colon, then anything not including a comma (which is the thing matched), up to a comma or the end of the line.
Note the use of a non-capturing group at the end to encapsulate the alternation. The only capturing group appearing is the one which you wish to use.
So in python:
import re
exampStr = "document:ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2, file:90jfa9_189204hsfiansdIASDNF, pdf:a09srjbZXMgf9oe90rfmasasgjm4-ab, spreadsheet:ASr0gk0jsdfPAsdfn"
regex = re.compile(r"[^:]*:([^,]*)(?:,|$)")
result = regex.findall(exampStr)
print(result)
#
# Result:
#
# ['ASoi4jgt0w9efcZXNDOFzsdpfoasdf-zGRnae4iwn2', '90jfa9_189204hsfiansdIASDNF', 'a09srjbZXMgf9oe90rfmasasgjm4-ab', 'ASr0gk0jsdfPAsdfn']
#
#
A good introduction is at: http://www.regular-expressions.info/tutorial.html .