Regex trouble capturing everything before last letter or number - regex

I want to capture everything before the last letter or number. I do not want to match any white space, "-", or "#013" after the last letter or number.
This is the regex I currently have, but it seems to match everything:
(?<system_name>.*\w(?:[a-zA-Z]|[0-9]))
Current data:
469869-system
476657-SYSTEM
476657-system
681125-system#013
981765-system#013
687755-system#013
438105-system#013
281055-system#013
485548-SYSTEM
785455-system
489418-system
589568-system
489661-SYSTEM
486328-system - - #015
286728-system - - #015
SYSTEM-433455
system
What I want to match:
469869-system
476657-SYSTEM
476657-system
681125-system
981765-system
687755-system
438105-system
281055-system
485548-SYSTEM
785455-system
489418-system
589568-system
489661-SYSTEM
486328-system
286728-system
SYSTEM-433455
system

You can use:
^[\w-]+
where:
^ # beginning of line
[\w-]+ # character class, 1 or more word character or hyphen
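As a quick check, here is a minimal Python sketch that applies the pattern above to a few of the sample lines from the question. The names `samples` and `cleaned` and the use of `re.match` are illustrative assumptions, not part of the original answer. Note that `\w` also accepts `_`, and that a hyphen attached directly to the name (with no space before it) would be kept.

```python
import re

# Pattern from the answer: one or more word characters or hyphens at line start.
pattern = re.compile(r"^[\w-]+")

# A few lines copied from the question's sample data (hypothetical list name).
samples = [
    "469869-system",
    "681125-system#013",
    "486328-system - - #015",
    "SYSTEM-433455",
    "system",
]

cleaned = []
for line in samples:
    m = pattern.match(line)
    # Keep the matched prefix; fall back to an empty string if nothing matches.
    cleaned.append(m.group(0) if m else "")

print(cleaned)
```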

Related

Regex conditional lookout

My input text file is like
A={5,6},B={2},C={3}
B={2,4}
A={5},B={1},C={3}
A={5},B={2},C={3,4,QWERT},D={TXT}
I would like to match all the lines where A=5,B=2 and C=3. The catch is, if variable is not mentioned, then that variable can take any value and hence that line also needs to be matched.
The above should match lines 1, 2 and 4.
I tried
.*?(?:(?=A)A\{.*?5).*?(?:(?=B)B\{.*?2).*?(?:(?=C)C\{.*?3)
https://regex101.com/r/NN9qk5/1
But it is not working.
I will be using this regex in Python 3.6 code.
If you want to solve it with a regex, you may use
^
(?!.*\bA={(?![^{}]*\b5\b))
(?!.*\bB={(?![^{}]*\b2\b))
(?!.*\bC={(?![^{}]*\b3\b))
.*
The point is to fail a match if there is a key that contains no given number value inside braces.
E.g. (?!.*\bA={(?![^{}]*\b5\b)) is a negative lookahead that fails the match if, immediately to the right of the current location, there is:
- .* - any 0+ chars other than line break chars
- \bA - a whole word A
- ={ - ={ substring
- (?![^{}]*\b5\b) - that is not followed with any 0+ chars other than { and } and then followed with 5 as a whole word.
Sample usage in Python 3.6:
import re

s = """A={5,6},B={2},C={3}
B={2,4}
A={5},B={1},C={3}
A={5},B={2},C={3,4,QWERT},D={TXT}"""
given = {'A': '5', 'B': '2', 'C': '3'}
reg_pattern = ''
for key, val in given.items():
    reg_pattern += r"(?!.*\b{}={{(?![^{{}}]*\b{}\b))".format(key, val)
reg = re.compile(reg_pattern)
for line in s.splitlines():
    if reg.match(line):
        print(line)
Output:
A={5,6},B={2},C={3}
B={2,4}
A={5},B={2},C={3,4,QWERT},D={TXT}
Note the use of re.match: this method only looks for a match at the start of the string, so there is no need to add the ^ anchor (which matches the start of the string).

How can I find words in Notepad++?

I have a lot of queries like this:
select categorych0_.category_id as category3_2_0_, categorych0_.id as
id1_2_0_, categorych0_.id as id1_2_1_, categorych0_.category_id as
category3_2_1_, categorych0_.check_id as check_id4_2_1_,
categorych0_.tenantid as tenantid2_2_1_, check1_.id as id1_5_2_,
check1_.check_group as check_gr2_5_2_,
check1_.check_group_description_label as check_gr3_5_2_,
check1_.check_group_label as check_gr4_5_2_, check1_.check_name_label
as check_na5_5_2_, check1_.check_number as check_nu6_5_2_,
check1_.check_scope as check_sc7_5_2_, check1_.display_order as
display_8_5_2_, check1_.tenantid as tenantid9_5_2_ from
category_checks categorych0_ left outer join checks check1_ on
categorych0_.check_id=check1_.id where categorych0_.category_id=?
I need to remove the 'as' phrases; that is, all alias phrases need to be removed.
Try this regex:
as[^,]*?(?=,|from)
Replace each match with a blank string
Explanation:
as - matches as literally
[^,]*? - matches 0+ occurrences of any character that is not a , as few as possible
(?=,|from) - positive lookahead to validate that the above match must be followed by a , or the text from
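Outside Notepad++, the same replacement can be tried in Python. This is a minimal sketch on a shortened version of the query from the question; the variable names are illustrative assumptions. The pattern has no word boundaries, so it would also start matching at an "as" inside a longer identifier; the shortened query below happens not to contain one.

```python
import re

# Shortened version of the query from the question (two aliased columns).
query = (
    "select categorych0_.category_id as category3_2_0_, "
    "categorych0_.id as id1_2_0_ from category_checks categorych0_"
)

# Pattern from the answer: "as", then as few non-comma characters as possible,
# stopping right before a comma or the text "from".
alias_re = re.compile(r"as[^,]*?(?=,|from)")

# Replace each match with an empty string.
stripped = alias_re.sub("", query)
print(stripped)
```

The result still contains the space that preceded each `as`, so a follow-up whitespace cleanup may be wanted.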

Parsing out particular text in a big text column in a Dataframe - R

Suppose I have the following data,
data
text
abc/1234&
qwertyabc/5555&
a&sdfghabc/ppp&plksa&
z&xabc/lkjh&poiuw&
lkjqwefasrjabc/855698&plkjdhweb
For example, I want to extract the text between abc/ and the first occurrence of & that follows it. That is, I want the text between the first occurrence of abc/ and the first occurrence of & after that abc/.
My output should be as follows,
data
text parsed_out
abc/1234& 1234
qwertyabc/5555& 5555
a&sdfghabc/ppp&plksa& ppp
z&xabc/lkjh&poiuw& lkjh
lkjqwefasrjabc/855698&plkjdhweb 855698
The following is my attempt:
data1 = within(data, FOO<-data.frame(do.call('rbind', strsplit(as.character(text), 'abc/', fixed=TRUE))))
data2 = within(data1, FOO1<-data.frame(do.call('rbind', strsplit(as.character(FOO$X1), '&', fixed=TRUE))))
This uses too much memory, since the text file has 8 million rows, and data2 ends up with several columns because the text contains several '&'. Can anybody help me extract the text between these two markers into a single column, in an efficient way that doesn't use too much memory?
x = "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"
Here, the function should look for http://google.com/ and extract everything up to the first &. The output should be needstobeparsedout.
new_x = "\"http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3&""
Why is it not working with this link?
Thanks
I actually want to extract a few parts of the URL; for example, the text between "http://www.google.com/" and the first occurrence of "&".
Use
sub(".*?https?://(?:www\\.)?google\\.com/([^&]+).*", "\\1", x)
The pattern matches:
(optionally add a ^ in front to match the start of string position)
.*? - 0+ chars as few as possible from the start till the first
https?:// - either https:// or http:// followed with
(?:www\\.)? - 1 or 0 (optional) sequence www.
google\\.com/ - literal text google.com/
([^&]+) - 1 or more chars other than & (Capture group 1)
.* - any 0+ chars (up to the end of string).
In the replacement pattern, \1 refers to the text captured into Group 1.
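For comparison, the same pattern can be ported to Python. This is only an illustrative sketch, not the R solution itself: the helper name `parse_out` is hypothetical, and it uses `re.search` with a capture group instead of a substitution, so it returns `None` when the URL prefix is absent (R's `sub` would return the input unchanged).

```python
import re

# Same core pattern as in the R call, written as a Python raw string.
url_re = re.compile(r"https?://(?:www\.)?google\.com/([^&]+)")

def parse_out(text):
    """Return the text between the google.com/ prefix and the first '&'."""
    m = url_re.search(text)
    return m.group(1) if m else None

# The two strings from the question.
x = ("thesearepresentinthestartingwhichisnotneeded"
     "http://google.com/needstobeparsedout&reoccurencenotneeded&")
new_x = '"http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3&"'

print(parse_out(x))
print(parse_out(new_x))
```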

Regex to grab formulas

I am trying to parse a file that contains parameter attributes. The attributes are setup like this:
w=(nf*40e-9)*ng
but also like this:
par_nf=(1) * (ng)
The issue is, all of these parameter definitions are on a single line in the source file, and they are separated by spaces. So you might have a situation like this:
pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0
The current algorithm just splits the line on spaces and then for each token, the name is extracted from the LHS of the = and the value from the RHS. My thought is if I can create a Regex match based on spaces within parameter declarations, I can then remove just those spaces before feeding the line to the splitter/parser. I am having a tough time coming up with the appropriate Regex, however. Is it possible to create a regex that matches only spaces within parameter declarations, but ignores the spaces between parameter declarations?
Try this RegEx:
(?<=^|\s) # Start of each formula (start of line OR [space])
(?:.*?) # Attribute Name
= # =
(?: # Formula
(?!\s\w+=) # DO NOT Match [space] Word Characters = (Attr. Name)
[^=] # Any Character except =
)* # Formula Characters repeated any number of times
When checking formula characters, it uses a negative lookahead to check for a Space, followed by Word Characters (Attribute Name) and an =. If this is found, it will stop the match. The fact that the negative lookahead checks for a space means that it will stop without a trailing space at the end of the formula.
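The idea can be sketched in Python as follows. Two adaptations are assumptions of this sketch rather than part of the answer: Python's `re` only supports fixed-width lookbehinds, so `(?<!\S)` stands in for `(?<=^|\s)`, and the attribute name is captured with `(\w+)` so that name and formula come back as pairs. Spaces inside each formula are then removed, which is what the question was after.

```python
import re

# Sample line from the question.
line = "pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0"

pair_re = re.compile(r"""
    (?<!\S)          # start of line or preceded by whitespace
    (\w+)            # attribute name
    =
    (                # formula
      (?:
        (?!\s\w+=)   # stop before the whitespace that precedes the next name=
        [^=]         # any character except =
      )*
    )
""", re.VERBOSE)

# Map each attribute name to its formula with inner spaces stripped.
params = {name: value.replace(" ", "") for name, value in pair_re.findall(line)}
print(params)
```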
Thanks to @Andy for the tip:
In this case I'll probably just match on the parameter name and equals, but replace the preceding whitespace with some other "parse-able" character to split on, like so:
(\s*)\w+[a-zA-Z_]=
Now my first capturing group can be used to insert something like a colon, semicolon, or line-break.
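A minimal Python sketch of this delimiter idea follows. It deliberately uses a slightly different pattern, `\s+(?=\w+=)`, which is an assumption of this sketch: the pattern quoted above needs at least two characters before the `=`, so it would not fire on a one-letter name such as `m=1`. The `;` delimiter is also an assumption and only works if formulas never contain it.

```python
import re

# Sample line from the question.
line = "pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0"

# Replace only the whitespace that is directly followed by "name=" with ";".
marked = re.sub(r"\s+(?=\w+=)", ";", line)

# Split on the inserted delimiter, then separate name and value at the first "=".
pairs = [token.split("=", 1) for token in marked.split(";")]
print(pairs)
```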
You need to add Perl tag. :-( Maybe this will help:
I ended up using this in C#. The idea is to break the line into name/value pairs, using a negative lookahead on the next key to stop one match and start a new one.
using System.Linq;
using System.Text.RegularExpressions;

var data = @"pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0";
var pattern = @"
(?<Key>[a-zA-Z_\s\d]+)          # Key is any alpha, digit and _
=                               # = is a hard anchor
(?<Value>[.*+\-\\\/()\w\s]+)    # Value is any combinations of text with space(s)
(\s|$)                          # Soft anchor of either a \s or EOB
((?!\s[a-zA-Z_\d\s]+\=)|$)      # Negative lookahead to stop matching if a space then key then equal found or EOB
";
Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
     .OfType<Match>()
     .Select(mt => new
     {
         LHS = mt.Groups["Key"].Value,
         RHS = mt.Groups["Value"].Value
     });

Regexp variable alphanumeric matching

I am trying to match the following sample:
ZU2A ZS6D-9 ZT0ER-7 ZR6PJH-12
It is a combination of letters and numbers (alphanumeric).
Here is an explanation:
It will always start with a capital (uppercase) Z
Followed always by only ONE(1) of R,S,T or U "[R|S|T|U]"
Followed always by only ONE(1) number "[0-9]"
Followed always by a minimum of ONE(1) and optionally a maximum of THREE(3) capital (uppercase) letters like this [A-Z]{1,3}
Optionally followed by "-" and a minimum of ONE(1) and a maximum of TWO(2) numbers
At the moment I have this:
Z[R|S|T|U][0-9][A-Z]{1,}(\-)?([0-9]{1,3})
But that does not seem to catch all the samples.
EDIT: Here is a sample of a complete string:
ZU0D>APT314,ZT1ER,WIDE1,ZS3PJ-2,ZR5STU-12*/V:/021414z2610.07S/02814.02Ek067/019/A=005475!w%<!
Any help would be appreciated.
Thank You
Danny
Your main problem is that the whole optional part should be surrounded by one set of parentheses marked with ? (=optional). All in all, you want
Z[RSTU][0-9][A-Z]{1,3}(?:-[0-9]{1,2})?
A couple of extra notes:
In a character class, you can simply list the characters. So for the second rule you want either [RSTU] or (?:R|S|T|U); inside [...] the | is a literal character, so [R|S|T|U] would also accept |.
A group in the form of (?:example) instead of (example) is non-capturing: the sub-expression is not returned as a separate group. It has no effect on which inputs are matched.
You don't need to escape - with a backslash outside of a character class.
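To illustrate the note about `(?:...)`, here is a small Python sketch comparing the two group forms with `re.findall`. The sample string is a fragment of the complete string from the question; the variable names are illustrative assumptions. When a pattern contains a capturing group, `findall` returns the group contents rather than the whole match.

```python
import re

# Fragment of the complete sample string from the question.
text = "ZT1ER,ZS3PJ-2,ZR5STU-12"

# Same expression twice: once with a capturing group, once non-capturing.
capturing = re.compile(r"Z[RSTU][0-9][A-Z]{1,3}(-[0-9]{1,2})?")
non_capturing = re.compile(r"Z[RSTU][0-9][A-Z]{1,3}(?:-[0-9]{1,2})?")

# With a capturing group, findall returns only what the group captured
# (an empty string where the optional group did not take part).
suffixes = capturing.findall(text)

# With a non-capturing group, findall returns the full matches.
callsigns = non_capturing.findall(text)

print(suffixes)
print(callsigns)
```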
Here's an example test case script in Python:
import re

s = r'Z[RSTU][0-9][A-Z]{1,3}(?:-[0-9]{1,2})?'
rex = re.compile(s)
for test in ('ZU2A', 'ZS6D-9', 'ZT0ER-7', 'ZR6PJH-12'):
    assert rex.match(test), test
long_test = 'ZU0D>APT314,ZT1ER,WIDE1,ZS3PJ-2,ZR5STU-12*/V:/021414z2610.07S/02814.02Ek067/019/A=005475!w%<!'
found = rex.findall(long_test)
assert found == ['ZU0D', 'ZT1ER', 'ZS3PJ-2', 'ZR5STU-12'], found