I have some text that looks like this:
UPPERCASE TEXT {wildcard amount of text} {Anchor word}
This pattern repeats multiple times. I want to extract each of these matches, which I can do with
[A-Z][A-Z ]+.+anchor
However I don't want it to match if there is UPPERCASE text within the wildcard text. I can check for this with a negative lookahead
[A-Z][A-Z ]+(?!.+[A-Z][A-Z ]+).+anchor
However, the lookahead runs into the uppercase text of all the following matches, so it cancels them out. I could limit how far the lookahead scans, but the distance between the uppercase words and the anchor is sometimes small and sometimes large, so no single limit matches everything.
In your negative lookahead you don't need the + quantifier. You want it to fail if there are two or more uppercase characters, which is equivalent to failing if there are two.
Your negative lookahead needs to be tested at every character in the intermediate section.
https://regex101.com/r/tcxch6/1
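A minimal Python sketch of testing the lookahead at every character of the intermediate section. The sample text, the literal word anchor, and the name extract are made up for illustration, and the pattern is one reading of the advice above, not necessarily the one behind the link.

```python
import re

# Uppercase heading, then a wildcard section in which no position may start
# a run of two uppercase letters, then the literal anchor word.
PATTERN = re.compile(r"[A-Z][A-Z ]+(?:(?![A-Z]{2}).)+?anchor")

def extract(text):
    """Return every heading-to-anchor span whose wildcard part has no uppercase run."""
    return PATTERN.findall(text)

# Hypothetical input: the second span contains SHOUT in its wildcard part.
sample = (
    "FIRST TITLE some words anchor "
    "SECOND TITLE other SHOUT words anchor "
    "THIRD ONE more anchor"
)
print(extract(sample))
```

With this input the span starting at SECOND TITLE is rejected as a whole, but the shorter span starting at SHOUT still qualifies as a match of its own, so extra anchoring may be needed if that is unwanted.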
Why not just match uppercase letters, then space, then any amount of not-uppercase characters, then space then the anchor?
/([A-Z ]+) ([^A-Z]+) (anchor)/
I guess the problem is that if the text is
UPPERCASE TEXT {wildcard OTHERTEXT etc} anchor
Then this will find OTHERTEXT as the first match. Maybe the answer is to fix it to the start of the line, like this
/^([A-Z ]+) ([^A-Z]+) (anchor)/
If that isn't right, I recommend giving some more examples of the input and the required matches, because the question isn't all that clear at the moment.
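As a rough Python sketch of the line-anchored version, assuming each candidate sits on its own line (the sample lines and the name parse_lines are invented for illustration):

```python
import re

# Anchored at the start of each line (MULTILINE); the middle group
# may not contain any uppercase letter.
LINE_PATTERN = re.compile(r"^([A-Z ]+) ([^A-Z]+) (anchor)", re.MULTILINE)

def parse_lines(text):
    """Return (heading, wildcard, anchor) tuples for lines that fit the pattern."""
    return LINE_PATTERN.findall(text)

sample = (
    "UPPERCASE TEXT some words anchor\n"
    "UPPERCASE TEXT some OTHERTEXT etc anchor"
)
print(parse_lines(sample))
```

Only the first line is reported here, because the second one has uppercase text inside the wildcard part and the start-of-line anchor stops OTHERTEXT from being picked up as a heading.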
I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am fairly new to regex and I am having difficulty creating the expression, though I believe it should be fairly simple. The expression should pick up words that have two or more capitalised letters. It should also pick up words where a dash has been used in between and report the whole word (both before and after the dash). If numbers are also present, they should be reported with the word.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to only give me words that have at least two capitalised letters. I understand that it will also "mistakenly" pick up words such as "Abc-Abc", but I don't believe there is a way to avoid these.
If a lookahead is supported and you don't want to match double -- you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times: optionally the allowed characters, followed by an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also not match when there are hyphens surrounding the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.
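A short Python check of the first pattern against the words from the question. The function name find_abbreviations and the way the sample words are joined are only illustrative choices.

```python
import re

# Lookahead requires two uppercase letters within the word (hyphens allowed),
# then the word itself is matched as hyphen-separated alphanumeric parts.
ABBREVIATION = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)

def find_abbreviations(text):
    """Return all words that contain at least two uppercase letters."""
    return ABBREVIATION.findall(text)

wanted = "ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC"
unwanted = "A-bc, a-b-c"

print(find_abbreviations(wanted))
print(find_abbreviations(unwanted))
```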
I have a text document with a lot of large integers, e.g. 123456789. I want to automatically insert commas into these to make them more readable: 123,456,789. However, my document also contains decimals, and these should remain untouched. Is there a regular expression that will insert the commas? An answer on a similar question suggested (?<=\d)(?=(\d\d\d)+(?!\d)), but this also matches inside decimal numbers. What's more, I am unable to insert the commas using either Notepad++ or Overleaf. What should I replace this regex with?
If you don't want to touch the decimals you could use (*SKIP)(*FAIL) to match a dot and 1+ digits to consume the characters that should not be part of the match.
(Tested on Notepad++ 7.7.1)
\.\d+(*SKIP)(*FAIL)|\B(?=(?:\d{3})+(?!\d))
In the replacement use a comma ,
In parts
\.\d+(*SKIP)(*FAIL) Match a dot literally and 1+ digits (match to be left untouched)
| Or
\B Anchor that matches where \b does not match
(?= Positive lookahead, assert what is directly on the right is
(?:\d{3})+ Repeat 1+ times matching 3 digits
(?!\d) Negative lookahead, assert what is directly on the right is not a digit
) Close lookahead
Regex demo
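Python's built-in re module does not support (*SKIP)(*FAIL), so the sketch below swaps in a different technique to get the same effect: the fractional part is captured and written back unchanged, and only the zero-width positions get a comma. The name add_commas and the sample numbers are assumptions for illustration.

```python
import re

# Alternative 1 captures a dot plus digits so it can be put back untouched;
# alternative 2 is the zero-width position where a comma belongs.
GROUPING = re.compile(r"(\.\d+)|\B(?=(?:\d{3})+(?!\d))")

def add_commas(text):
    """Insert thousands separators into integer parts, leaving fractions alone."""
    # group(1) is None when the zero-width alternative matched.
    return GROUPING.sub(lambda m: m.group(1) or ",", text)

print(add_commas("123456789 and 1234.56789"))
```

Because the first alternative consumes the digits after the dot, the second alternative never gets a chance to match inside them.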
My guess is that maybe,
(?<=\d)(?=(?:\d{3})+(?!\d|\.))
or
(?!^)(?=(?:\d{3})+(?!\.|\d))
Demo 2
or
\d+\.\d*(*SKIP)(*FAIL)|(?!^)(?=(?:\d{3})+(?!\.|\d))
Demo 3
might be close to what you're trying to write; you can then simply replace each match with a comma.
If you wish to simplify, modify, or explore the expression, it is explained in the top right panel of regex101.com. If you'd like, you can also see in this link how it matches against some sample inputs.
I have got the following regex expression so far:
used-cars\/((?:\d+[a-z]|[a-z]+\d)[a-z\d]*)
This is sort of working, I need it to match basically ANYTHING apart from JUST numbers after used-cars/
Match:
used-cars/page-1
used-cars/1eeee
used-cars/page-1?*&_-
Not Match:
used-cars/2
used-cars/400
Can someone give me a hand? Been trying get this working for a while now!
There are a few shortcomings in your regex used-cars\/((?:\d+[a-z]|[a-z]+\d)[a-z\d]*).
It checks for used-cars/ followed by one or more digits and then one character within a-z, OR one or more characters within a-z and then one digit.
[a-z\d]* matches only letters or digits, and it is optional.
So it does not describe the pattern you actually want.
Try the following regex.
Regex: ^used-cars\/(?!\d+$)\S*$
Explanation:
used-cars\/ searches for literal used-cars/
(?!\d+$) is a negative lookahead for digits only up to the end. If only digits are present, then it won't be a match.
\S* matches zero or more characters other than whitespace.
Regex101 Demo
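A quick Python check of this regex against the examples from the question (the function name is_valid_path is made up for illustration; the slash needs no escaping in Python):

```python
import re

# Reject paths where everything after "used-cars/" is digits only.
USED_CARS = re.compile(r"^used-cars/(?!\d+$)\S*$")

def is_valid_path(path):
    """True if the path is used-cars/ followed by anything but a bare number."""
    return USED_CARS.match(path) is not None

for path in ["used-cars/page-1", "used-cars/1eeee", "used-cars/page-1?*&_-",
             "used-cars/2", "used-cars/400"]:
    print(path, is_valid_path(path))
```

Note that a bare used-cars/ with nothing after the slash also passes, since \S* allows zero characters.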
Using a regular expression, I want to match
any line of input that has at least one word repeated two or more times.
Here is how far I got.
/(\b\w+\b).*\1
But it is wrong, because the second occurrence only has to match the characters, not a whole word.
input: i might be ill
output: < i might be i>ll
<> marks the matched part.
So I tried (\b\w+\b)(\b\w+\b)*\1
but that does not work either.
Can someone help?
Thanks.
This should work:
(\b\w+\b).*\b\1\b
Greedy matching will ensure the longest match. If you want the second instance to be a separate word, you have to add the boundaries there as well. So it's the same as
\b(\w+)\b.*\b\1\b
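To see the difference the extra boundaries make, here is a small Python comparison on the line from the question. The names LOOSE, STRICT and repeated_word are illustrative only.

```python
import re

LOOSE = re.compile(r"(\b\w+\b).*\1")        # backreference may land inside a word
STRICT = re.compile(r"\b(\w+)\b.*\b\1\b")   # backreference must be a whole word

def repeated_word(line):
    """Return the first word that occurs again later in the line, or None."""
    m = STRICT.search(line)
    return m.group(1) if m else None

print(LOOSE.search("i might be ill").group(0))
print(repeated_word("i might be ill"))
print(repeated_word("the cat saw the dog"))
```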
Positive lookahead is not a must here:
/\b([A-Za-z]+)\b[\s\S]*\b\1\b/g
EXPLANATION
\b([A-Za-z]+)\b # match any word
[\s\S]* # match any character (newline included) zero or more times
\b\1\b # word repeated
REGEX 101 DEMO
To check for repeated words you can use positive lookahead like this.
Regex: (\b[A-Za-z]+\b)(?=.*\b\1\b)
Explanation:
(\b[A-Za-z]+\b) will capture any word.
(?=.*\b\1\b) will lookahead if the word captured by group is present or not. If yes then a match is found.
Note: This will produce repeated results, because a word that has already been matched will be matched again when the regex reaches its next occurrence, as long as yet another occurrence follows.
You will have to strip the repeated results in code.
Regex101 Demo
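One way the stripping could look in Python, as a sketch: collect all hits with findall and drop duplicates while keeping first-seen order. The helper name repeated_words and the sample sentence are made up.

```python
import re

REPEATED = re.compile(r"\b([A-Za-z]+)\b(?=.*\b\1\b)")

def repeated_words(line):
    """Return each word that occurs again later in the line, without duplicates."""
    hits = REPEATED.findall(line)
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(hits))

line = "the cat and the dog and the bird"
print(REPEATED.findall(line))
print(repeated_words(line))
```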
I need to extract the last number that is inside a string. I'm trying to do this with regex and negative lookaheads, but it's not working. This is the regex that I have:
\d+(?!\d+)
And these are some strings, just to give you an idea, and what the regex should match:
ARRAY[123] matches 123
ARRAY[123].ITEM[4] matches 4
B:1000 matches 1000
B:1000.10 matches 10
And so on. The regex matches the numbers, but all of them. I don't get why the negative lookahead is not working. Any one care to explain?
Your regex \d+(?!\d+) says
"match any number that is not immediately followed by a number",
which is not what you want. A number is the last one if it is not followed by any other number anywhere after it, not just immediately.
When translated to regex we have:
(\d+)(?!.*\d)
Rubular Link
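A small Python check of both patterns against the strings from the question, assuming single-line input (the name last_number is just for illustration):

```python
import re

LAST_NUMBER = re.compile(r"(\d+)(?!.*\d)")

def last_number(s):
    """Return the last run of digits in s, or None if there is none."""
    m = LAST_NUMBER.search(s)
    return m.group(1) if m else None

# The original pattern reports every number, not only the last one.
print(re.findall(r"\d+(?!\d+)", "ARRAY[123].ITEM[4]"))

for s in ["ARRAY[123]", "ARRAY[123].ITEM[4]", "B:1000", "B:1000.10"]:
    print(s, last_number(s))
```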
I took it this way: you need to make sure the match is close enough to the end of the string; close enough in the sense that only non-digits may intervene. What I suggest is the following:
/(\d+)\D*\z/
\z at the end means that that is the end of the string.
\D* before that means that an arbitrary number of non-digits can intervene between the match and the end of the string.
(\d+) is the matching part. It is in parentheses so that you can pick it up, as was pointed out by Cameron.
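The pattern above uses Ruby syntax. A rough Python equivalent, assuming the re module, where the end-of-string anchor is written \Z (trailing_number is a made-up name):

```python
import re

# Digits, then only non-digits up to the very end of the string.
TRAILING_NUMBER = re.compile(r"(\d+)\D*\Z")

def trailing_number(s):
    """Return the number that only non-digits separate from the end of s."""
    m = TRAILING_NUMBER.search(s)
    return m.group(1) if m else None

for s in ["ARRAY[123]", "ARRAY[123].ITEM[4]", "B:1000", "B:1000.10"]:
    print(s, trailing_number(s))
```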
You can use
.*(?:\D|^)(\d+)
to get the last number; this is because the matcher will gobble up all the characters with .*, then backtrack to the first non-digit character or the start of the string, then match the final group of digits.
Your negative lookahead isn't working because on the string "1 3", for example, the 1 is matched by the \d+, then the space matches the negative lookahead (since it's not a sequence of one or more digits). The 3 is never even looked at.
Note that your example regex doesn't have any groups in it, so I'm not sure how you were extracting the number.
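The greedy-then-backtrack behaviour described above can be tried out in Python like this; the number ends up in group 1. The name last_number_greedy is illustrative, and single-line input is assumed.

```python
import re

# .* first swallows the whole string, then backtracks until a non-digit
# (or the start of the string) is directly followed by a run of digits.
GREEDY_LAST = re.compile(r".*(?:\D|^)(\d+)")

def last_number_greedy(s):
    """Return the last run of digits in s via group 1, or None."""
    m = GREEDY_LAST.match(s)
    return m.group(1) if m else None

for s in ["ARRAY[123]", "ARRAY[123].ITEM[4]", "B:1000", "B:1000.10", "123"]:
    print(s, last_number_greedy(s))
```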
I still had issues with managing the capture groups
(for example, if using Inline Modifiers (?imsxXU)).
This worked for my purposes -
.(?:\D|^)\d(\D)