Different behavior between two regex patterns [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I am trying to match the letters 'C' or 'c' as they appear in a file.
They must be stand alone and NOT followed by a '+' or '.'.
The following two patterns give me the same result using Regex101, but I get a different result
in the Dataquest IDE and my home PC.
The two patterns are:
pattern = r'\b[Cc]\b(?!\+|\.)'
pattern = r"\b[Cc]\b[^.+]"
The problem line in question is: (Line 223 from the hacker_news.csv file)
MemSQL (YC W11) Raises $36M Series C
On my home PC and Dataquest's IDE:
The regex using the negative lookahead matches that line.
The other regex does not.
On Regex101 they both match that line.
I am NOT supposed to match it.
I wrote the lookahead regex, which fails in Dataquest's IDE.
The non-lookahead version is their answer, which passes.
I think they should both yield the same result, but they do not.
I am running Python 3.7.6
What am I missing?

(?!\+|\.) is a negative lookahead. It doesn't include any additional characters in the match; it only adds a requirement at the position after the C: whatever comes next can't be a . or a +. In your input string, the C at the end is not followed by either of these characters (it is followed by nothing at all), so the match succeeds.
[^.+] matches a single character that is not a . or a +. There are no characters after the C so the match fails.
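The difference can be reproduced with a minimal sketch in Python's re module, using the problem line from the question. The variable names are illustrative only. The last check appends a newline to show that any trailing character other than . or + is enough for the second pattern to match; whether that explains the Regex101 result is an assumption, not something the question confirms.

```python
import re

# The problem line from the question; nothing follows the final "C".
line = "MemSQL (YC W11) Raises $36M Series C"

lookahead = re.compile(r"\b[Cc]\b(?!\+|\.)")  # asserts what must NOT follow
consuming = re.compile(r"\b[Cc]\b[^.+]")      # requires one more character

lookahead_hit = lookahead.search(line) is not None
consuming_hit = consuming.search(line) is not None

# Same line with a trailing newline (assumption: how it might look
# when the text is pasted or read with its line ending intact).
consuming_hit_with_newline = consuming.search(line + "\n") is not None

print(lookahead_hit, consuming_hit, consuming_hit_with_newline)
# prints: True False True
```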

Related

Find and replace a Regex pattern occurring more than once [duplicate]

This question already has answers here:
How can I match overlapping strings with regex?
(6 answers)
Matching when an arbitrary pattern appears multiple times
(1 answer)
Closed 2 years ago.
I'm trying to find-and-replace instances where consecutive commas appear throughout a string; replacing them w/ something like ",N/A,". I was using a very simple /,,/g pattern, and that works on things like ",,abc" and ",,,,abc" (with even numbers of commas). However, it doesn't catch things like ",,,abc". That's because the first two commas are considered a match, and then the third comma is just considered part of a new ",abc" string. Is there a way to handle this w/ a RegEx pattern or options? Otherwise, I'm going to need to perform multiple searches.
FWIW - I'm working in JavaScript, but I'm guessing this is just a general RegEx question/answer.
/,,/g only matches once with three commas because, after each match, the global search resumes after the last consumed character. You need a way to check for the second comma without consuming it.
If your language supports it, use a positive lookahead. A positive lookahead lets a regex check that some additional characters follow, without consuming them in the match.
/,(?=,)/g
In English, this means:
,    # match a comma, then
(?=  # start a lookahead: what follows must match, but is not consumed
,    # a comma
)    # end of the lookahead
See more about this here: https://www.regular-expressions.info/lookaround.html
JavaScript supports positive lookahead. :)
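The same idea can be sketched in Python, whose re module also supports lookahead. The helper name fill_empty_fields and the "N/A" filler are illustrative assumptions. Because the lookahead leaves the second comma unconsumed, the replacement text is ",N/A" rather than ",N/A,".

```python
import re

def fill_empty_fields(text: str, filler: str = "N/A") -> str:
    """Insert filler between every pair of adjacent commas."""
    # The lookahead checks for a following comma without consuming it,
    # so that comma is still available to start the next match.
    return re.sub(r",(?=,)", "," + filler, text)

# Naive version: each match consumes both commas, so one gap is missed.
naive = re.sub(r",,", ",N/A,", ",,,abc")
fixed = fill_empty_fields(",,,abc")

print(naive)  # ,N/A,,abc
print(fixed)  # ,N/A,N/A,abc
```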

Regex: Match words that only contain certain letters [duplicate]

This question already has answers here:
regex to match entire words containing only certain characters
(4 answers)
Closed 3 years ago.
I am using a Regex dictionary located here, and want to find words that contain ONLY the following letters: B, C, D, E, H, I, K, O. So, for example: cod, hoe, and hob.
I thought the simple way of doing this would be with the following regex query: [bcdehiko]+, but this yields many words that contain at least one instance of the bracketed letters, and any other letter.
For that website, the easiest solution is to combine your starting regex with line start and line end anchors. This ensures that the word contains nothing but the characters you want. Here is the regex you want to use to get your results:
^[bcdehiko]+$
If you're okay with - in words, you can use this as well:
^[bcdehiko]+(-[bcdehiko]+)*$
Credit to @ctwheels for the improvement on the second regex.
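As a quick check of the two anchored patterns, here is a small sketch in Python. The word list is made up for illustration; it is not taken from the dictionary site in the question.

```python
import re

# Hypothetical sample words, one per "line" as on the dictionary site.
words = ["cod", "hoe", "hob", "codz", "hide-bed", "ran"]

plain = re.compile(r"^[bcdehiko]+$")
hyphenated = re.compile(r"^[bcdehiko]+(-[bcdehiko]+)*$")

plain_hits = [w for w in words if plain.match(w)]
hyphenated_hits = [w for w in words if hyphenated.match(w)]

print(plain_hits)       # ['cod', 'hoe', 'hob']
print(hyphenated_hits)  # ['cod', 'hoe', 'hob', 'hide-bed']
```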
Since you haven't specified a language (and I think that others looking for such answers might find this useful), here is an answer to your question in Python without the use of regex.
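l = 'bcdehiko'       # allowed letters
d = ['cod', 'codz']  # words to check
for w in d:
    # True only if every character of w is one of the allowed letters
    print(all(x in l for x in w))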
This method loops over the dictionary* d and checks that all characters in each word exist in the string l. See it working here.
* dictionary in the OP's original question refers to a dictionary in the wordbook sense, not in the computing sense. In the script, the variable d is a list.
Alternatively, if you want to check that a word contains at least one character from the allowed letters, you can replace all with any in the above script (you can test this by adding the word ran to the list d; it doesn't contain a single letter from the string l). See it working here.
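The all versus any distinction can be seen side by side in a short sketch. Using a set for the allowed letters is a choice made here for illustration, not part of the original script.

```python
allowed = set("bcdehiko")
words = ["cod", "codz", "ran"]

# all(): every letter of the word must be allowed.
only_allowed = {w: all(ch in allowed for ch in w) for w in words}
# any(): at least one letter of the word must be allowed.
some_allowed = {w: any(ch in allowed for ch in w) for w in words}

print(only_allowed)  # {'cod': True, 'codz': False, 'ran': False}
print(some_allowed)  # {'cod': True, 'codz': True, 'ran': False}
```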
You are using this regex:
[bcdehiko]+
Which means match one or more instances of given characters in square brackets.
However this regex will also allow matching other characters in a word since there is no word boundary in use.
You may want to wrap your regex with \b on either side to ensure there are no other characters allowed:
\b[bcdehiko]+\b
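A minimal sketch of the word-boundary version in Python, run over a made-up sentence rather than the dictionary from the question. Unlike the ^...$ form, it finds qualifying words anywhere in a longer text.

```python
import re

text = "cod codz hoe ran hob"  # hypothetical input

# \b on both sides forces the run of allowed letters to be a whole word.
whole_words = re.findall(r"\b[bcdehiko]+\b", text)
# Without boundaries, parts of longer words are matched too.
fragments = re.findall(r"[bcdehiko]+", text)

print(whole_words)  # ['cod', 'hoe', 'hob']
print(fragments)    # ['cod', 'cod', 'hoe', 'hob']
```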

Regex to match exact number [duplicate]

This question already has answers here:
Regex match exact number not if it exist in string
(2 answers)
Closed 2 years ago.
I have a lot of LOC of a project in visual studio and I want to search for every line which uses the numbers 12 and 13. It can't be part of a bigger number, I need to retrieve only the code that actually uses the constants 12 and 13. I think it is possible to do with regex but I'm having a hard time here.
Any help will be very appreciated.
Brief
You want to use the Find and Replace window found at Edit -> Find and Replace -> Find in Files with the regex \b1[23]\b and the Find Options Use Regular Expressions checkbox selected.
Code
\b Word boundary assertion
Matches, without consuming any characters, at a position between a character matched by \w and a character not matched by \w (in either order), or at the start or end of the string next to a \w character. It does not match between two word characters or between two non-word characters.
1 Match this literally
[23] Match a character in the set (2 or 3)
\b Word boundary assertion
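To see what \b1[23]\b selects, here is a small sketch in Python. The sample lines are invented stand-ins for source code; Visual Studio's own search uses the .NET regex engine, which treats this pattern the same way for these inputs.

```python
import re

pattern = re.compile(r"\b1[23]\b")

# Hypothetical lines of code to search through.
lines = [
    "x = 12;",
    "timeout = 123;",
    "arr[13] = 0;",
    "id = 4120;",
    "abc12hbd",
]

hits = [ln for ln in lines if pattern.search(ln)]
print(hits)  # ['x = 12;', 'arr[13] = 0;']
```

Note that abc12hbd is not matched, because letters are word characters and so there is no boundary between c and 1.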
(?<![0-9])1[23](?![0-9])
Will match
12
13
abc12hbd
but not
3456324123656
234564567546
121212
13121312
1
3
123
If your 12 or 13 might appear in a hexadecimal string you can exclude that with
(?<![0-9a-fA-F])1[23](?![0-9a-fA-F])
You need to decide what characters are allowed to be on either side of the 12 or 13 and then exclude the others. See https://regex101.com/ for more help
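The two lookaround patterns can be compared with a short sketch in Python. The sample strings come from the lists above, plus one invented example (len=13).

```python
import re

not_in_number = re.compile(r"(?<![0-9])1[23](?![0-9])")
not_in_hex = re.compile(r"(?<![0-9a-fA-F])1[23](?![0-9a-fA-F])")

samples = ["12", "13", "abc12hbd", "121212", "123", "len=13"]

number_hits = [s for s in samples if not_in_number.search(s)]
hex_hits = [s for s in samples if not_in_hex.search(s)]

print(number_hits)  # ['12', '13', 'abc12hbd', 'len=13']
print(hex_hits)     # ['12', '13', 'len=13']
```

The hexadecimal-aware pattern rejects abc12hbd because the 12 is preceded by c, which is a hex digit.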
This might be a solution:
^\D*(?<p>12|13)\D*
The group named p would hold the 12 or 13.
You would do better to test it in an online regex tester such as https://regex101.com/ or any other that shows up on Google.

Understanding Regex expression [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 years ago.
I have a file where the application is configured to check the following Regex
[\x00-\x1F\x7F&&[^\x0A]&&[^\x0D]]
Can anyone please tell me exactly what this regex means? I do know that it ignores line feed and carriage return. I even validated my file on http://regexr.com/ with the above regex and it shows no match found, so I don't understand why the regex is matching in the application.
FYI: I do not want the regex to match file as it is stopping my processing.
It could be that in Java and Ruby && inside a character class means character class intersection, while http://regexr.com/ doesn't support that operator and is trying to match literal & symbols. The regex you posted means: match any character from \x00 to \x1F, or \x7F, as long as it's not \x0A (line feed) or \x0D (carriage return).
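To find out which character in the file triggers the match, one option is to write the intersected set out explicitly and list the offending positions. The sketch below does this in Python, whose re module has no character class intersection either. The helper name find_control_chars is hypothetical, and the explicit class is a hand expansion of the original regex rather than something taken from the application.

```python
import re
from typing import List, Tuple

# Hand-expanded equivalent of [\x00-\x1F\x7F&&[^\x0A]&&[^\x0D]]:
# control characters 0x00-0x1F plus 0x7F, minus LF (0x0A) and CR (0x0D).
CONTROL_CHARS = re.compile(r"[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]")

def find_control_chars(text: str) -> List[Tuple[int, str]]:
    """Return (index, hex code) for every matching control character."""
    return [
        (m.start(), f"0x{ord(m.group()):02X}")
        for m in CONTROL_CHARS.finditer(text)
    ]

print(find_control_chars("line one\r\nline two\n"))  # []
print(find_control_chars("a\tb\x00"))                # [(1, '0x09'), (3, '0x00')]
```

Note that a tab (0x09) is inside the matched range, so a file containing tabs would trigger the original regex.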

Regex Meaning (Regex Golf) [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
Regex newbie here, so I was trying this website for fun: https://regex.alf.nu
In particular, I'm concerned about the "Ranges" section here: https://regex.alf.nu/2
I was able to get as far as ^[a-f]+, and couldn't figure out the rest. By accident, I added a $ to get ^[a-f]+$ which was actually the answer.
Trying to wrap my mind around the meaning of this regex. Can someone give the plain English explanation of what's happening here?
It seems to say "a string that starts and ends with one or more of the letters a through f," but that doesn't quite make sense for me, for instance, with the word "canjac" which seems to satisfy those conditions.
For those who can't see the URL, it's asking me to match these words:
abac
accede
adead
babe
bead
bebed
bedad
bedded
bedead
bedeaf
caba
caffa
dace
dade
daff
dead
deed
deface
faded
faff
feed
But NOT match these:
beam
buoy
canjac
chymia
corah
cupula
griece
hafter
idic
lucy
martyr
matron
messrs
mucose
relose
sonly
tegua
threap
towned
widish
yite
In English it means: Match any words which contain only the letters a thru f.
Your pattern, when broken down:
^ assert position at start of the string
[a-f]+ match a single character present in the list below:
    + Between one and unlimited times, as many times as possible, giving back as needed
    a-f a single character in the range between a and f (case sensitive)
$ assert position at end of the string
You can also see a quick explanation of your patterns on the Regex101 webpage.
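The effect of the trailing $ can be checked with a short sketch in Python, using a few words from each list in the question. Without the $, the pattern only requires that the word start with letters from a to f, which is why canjac slips through.

```python
import re

should_match = ["abac", "accede", "deface", "faff"]
should_not_match = ["beam", "canjac", "hafter", "idic"]

anchored = re.compile(r"^[a-f]+$")   # the whole word must be a-f
prefix_only = re.compile(r"^[a-f]+")  # only the start must be a-f

anchored_ok = all(anchored.match(w) for w in should_match)
anchored_false_hits = [w for w in should_not_match if anchored.match(w)]
prefix_false_hits = [w for w in should_not_match if prefix_only.match(w)]

print(anchored_ok)          # True
print(anchored_false_hits)  # []
print(prefix_false_hits)    # ['beam', 'canjac']
print(prefix_only.match("canjac").group())  # ca
```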