Regexp /(\#)(.*?)(\;)/ failing on quotations - regex

I have a helper method that goes through any given block of text and replaces substrings that are in the format '#something;' with a link. It works with all test cases I've tried, including
#user; #user name; #user.name; ##user.name; #user*name;
but gets hung up on quotations, as in
#I'll fight you;
but it still matches up to that point. Below, for hacky debugging purposes, I have the helper method put three asterisks ('***') on either side of the assumed match, so the above tag results in
***#I'***ll fight you;
I can't figure it out.
(And if anyone has any additional tips and tricks on how to get it to match a tag like '#username;;', where the end character is also a part of the name, lemme know. I figured that might be too complicated and better done programmatically.)
module PostsHelper
  def tag_users(content)
    # User tagging in format '#multiword name;'
    # Regexp /(\#)(.*?)(\;)/ for debugging; user configurable eventually
    start_character = '#'
    end_character = ';'
    tag_pattern = eval('/(#{start_character})(.*?)(#{end_character})/')
    name_pattern = eval('/(?<=#{start_character})(.*?)(?=#{end_character})/')
    # Iterate through all tags and replace each with a link
    content.gsub(tag_pattern) do
      tag = Regexp.last_match(0)
      tagged_name = tag[name_pattern, 1]
      tagged_user = User.where('lower(name) = ?', tagged_name.downcase).first
      if tagged_user
        "<a href='#{user_path(tagged_user.id)}'>##{tagged_name}</a>"
      else
        '***' + tag + '***'
      end
    end
  end
end
Edit: I called a quotation mark a comma. I hate myself.

What about something like this?
/(?<=#).[^;]*/
It should match everything between the # and the ; -- as tried at http://www.rubular.com/.
I'd also caution against allowing the termination character within the username -- it will be difficult to differentiate #user;; from a mention of #user; in a sentence that happens to be followed by a semicolon.
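A quick way to sanity-check the suggested pattern is to run it through Python's re engine, which handles the lookbehind the same way as Ruby's (just a sketch for verification, not part of the Ruby helper):

```python
import re

# the suggested pattern: lookbehind for '#', then everything up to the ';'
pattern = re.compile(r"(?<=#).[^;]*")

print(pattern.search("#I'll fight you;").group(0))  # I'll fight you
print(pattern.search("#user name;").group(0))       # user name
```

The apostrophe is no longer special here because `[^;]*` only stops at the semicolon.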

Related

How to remove/replace specials characters from a 'dynamic' regex/string on ruby?

So I had this code working for a few months already. Let's say I have a table called Categories, which has a string column called name. I receive a string and I want to know if any category was mentioned (a mention occurs when the string contains the substring #name_of_a_category). The approach I followed was something like the below:
categories.select { |category_i| content_received.downcase.match(/##{category_i.downcase}/)}
That worked pretty well until today, when I suddenly started to receive an unmatched close parenthesis exception. I realized that category names can contain special chars, so I decided to not consider special chars or spaces anymore (I don't want to add restrictions for the user, and at the same time I don't want to deal with those cases, so the policy is just to ignore them).
So the question is: is there a clean way of removing these special chars (keeping the #) and matching the string (I don't want to modify the data, just ignore the special chars while looking for mentions)?
You can also use
prep_content_received = content_received.gsub(/[^\w\s]|_/, '')
p categories.select { |c|
  prep_content_received.match?(/\b#{c.gsub(/[^\w\s]|_/, '').strip}\b/i)
}
See the Ruby demo
Details:
The prep_content_received = content_received.gsub(/[^\w\s]|_/, '') line creates a copy of content_received with no special chars and no _. Doing this once reduces overhead when there are a lot of categories.
Then, you iterate over the categories list and each time check whether prep_content_received matches \b (word boundary) + the category with all special chars, _, and leading/trailing whitespace stripped from it + \b, in a case-insensitive way (see the /i flag; no need for .downcase).
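For anyone more comfortable reading it that way, here is the same approach sketched in Python (the Ruby snippet above is the actual answer; the helper name here is illustrative):

```python
import re

def clean(s):
    # drop everything that is not a word character or whitespace, plus "_"
    return re.sub(r'[^\w\s]|_', '', s)

content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']

# clean the content once, then reuse it for every category
prep = clean(content_received)
mentioned = [c for c in categories
             if re.search(r'\b' + re.escape(clean(c).strip()) + r'\b', prep, re.I)]
print(mentioned)  # ['comedy :)']
```

The logic is identical: strip the specials from both sides once, then do a word-boundary, case-insensitive search.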
So after looking around I found some answers on the platform, but nothing with my specific requirements (maybe I missed something; if so, please let me know). This is how I fixed it for my case:
content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']
temp_content = content_received.downcase
categories.select { |category_i|
  temp_content.gsub(/[^\sa-zA-Z0-9]/, '#' => '#')
              .match?(/##{category_i.downcase.gsub(/[^\sa-zA-Z0-9]/, '')}/)
}
For the sake of the example, I reduced the categories to a simple array of strings. Basically, the first gsub removes any character that is not a letter, a number, or whitespace (i.e. any special character), replacing each # with itself so it survives; the second gsub is a simpler version of the first.
You can test the snippet above here

Advanced grouping in domain name regex with Python3

I have a program written in python3 that should parse several domain names every day and extrapolate data.
Parsed data should serve as input for a search function, for aggregation (statistics and charts) and to save some time to the analyst that uses the program.
Just so you know: I don't really have the time to study machine learning (which seems to be a pretty good solution here), so I chose to start with regex, that I already use.
I already searched the regex documentation inside and outside StackOverflow and worked on the debugger on regex101 and I still haven't found a way to do what I need.
Edit (24/6/2019): I mention machine learning because of the reason I need a complex parser, that is, to automate things as much as possible. It would be useful for making automatic choices like blacklisting, whitelisting, etc.
The parser should consider a few things:
a maximum number of 126 subdomains plus the TLD
each subdomain must not be longer than 64 characters
each subdomain can contain only alphanumeric characters and the - character
each subdomain must not begin or end with the - character
the TLD must not be longer than 64 characters
the TLD must not contain only digits
but I want to go a little deeper:
the first string can (optionally) contain a "usage type" like cpanel., mail., webdisk., autodiscover. and so on... (or maybe a simple www.)
the TLD can (optionally) contain a particle like .co, .gov, .edu and so on (.co.uk for example)
the final part of the TLD is not really checked against any list of ccTLD/gTLDs right now and I don't think it will be in the future
What I thought useful to solve the problem is a regex group for the optional usage type, one for each subdomain and one for the TLD (the optional particle must be inside the TLD group)
With these rules in mind I came up with a solution:
^(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?\.)?([a-z\d][a-z\d\-]{0,62}[a-z\d])?((\.[a-z\d][a-z\d\-]{0,62}[a-z\d]){0,124}?)(?P<TLD>(\.co|\.com|\.edu|\.net|\.org|\.gov)?\.(?!\d+)[a-z\d]{1,64})$
The above solution doesn't return the expected results; I report a couple of examples here.
A couple of strings to parse:
without.further.ado.lets.travel.the.forest.com
www.without.further.ado.lets.travel.the.forest.gov.it
The groups I expect to find:
Full match: without.further.ado.lets.travel.the.forest.com
group2: without
group3: further
group4: ado
group5: lets
group6: travel
group7: the
group8: forest
groupTLD: .com
Full match: www.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGE: www.
group2: without
group3: further
group4: ado
group5: lets
group6: travel
group7: the
group8: forest
groupTLD: .gov.it
The groups I actually find:
Full match: without.further.ado.lets.travel.the.forest.com
group2: without
group3: .further.ado.lets.travel.the.forest
group4: .forest
groupTLD: .com
Full match: www.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGE: www.
group2: without
group3: .further.ado.lets.travel.the.forest
group4: .forest
groupTLD: .gov.it
group6: .gov
As you can see from the examples, a couple of particles are found twice, and that is not the behavior I was looking for. Any attempt to edit the formula results in unexpected output.
Any idea about a way to find the expected results?
This is a simple, well-defined task. There is no fuzziness, no complexity, no guessing, just a series of easy tests to figure out everything on your checklist. I have no idea how "machine learning" would be appropriate or helpful here. Even regex is completely unnecessary.
I've not implemented everything you want to verify, but it's not hard to fill in the missing bits.
import string

double_tld = ['gov', 'edu', 'co', 'add_others_you_need']

# we'll use this instead of regex to check subdomain validity
valid_sd_characters = string.ascii_letters + string.digits + '-'
valid_trans = str.maketrans('', '', valid_sd_characters)

def is_invalid_sd(sd):
    return sd.translate(valid_trans) != ''

def check_hostname(hostname):
    subdomains = hostname.split('.')
    # each subdomain can contain only alphanumeric characters and
    # the - character
    invalid_parts = list(filter(is_invalid_sd, subdomains))
    # TODO react if there are any invalid parts

    # "the TLD can (optionally) contain a particle like
    # .co, .gov, .edu and so on (.co.uk for example)"
    if subdomains[-2] in double_tld:
        subdomains[-2] += '.' + subdomains[-1]
        subdomains = subdomains[:-1]

    # "a maximum number of 126 subdomains plus the TLD"
    # TODO check list length of subdomains

    # "each subdomain must not begin or end with the - character"
    # "the TLD must not be longer than 64 characters"
    # "the TLD must not contain only digits"
    # TODO write loop, check first and last characters, length, isnumeric
    # TODO return something
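Not part of the original answer, but one way the TODOs might be filled in, following the checklist (the double-TLD list and the names are illustrative):

```python
import string

double_tld = ['gov', 'edu', 'co']

valid_sd_characters = string.ascii_letters + string.digits + '-'
valid_trans = str.maketrans('', '', valid_sd_characters)

def is_valid_hostname(hostname):
    subdomains = hostname.split('.')
    # only alphanumerics and "-" are allowed in each label
    if any(sd.translate(valid_trans) != '' for sd in subdomains):
        return False
    # fold a known particle into the TLD (e.g. "gov" + "it" -> "gov.it")
    if len(subdomains) >= 2 and subdomains[-2] in double_tld:
        subdomains[-2:] = ['.'.join(subdomains[-2:])]
    # a maximum of 126 subdomains plus the TLD
    if len(subdomains) > 127:
        return False
    # the TLD must be at most 64 characters and must not be all digits
    tld = subdomains[-1]
    if len(tld) > 64 or tld.replace('.', '').isdigit():
        return False
    # no label may be empty, start/end with "-", or exceed 64 characters
    return all(sd and not sd.startswith('-') and not sd.endswith('-')
               and len(sd) <= 64
               for sd in subdomains)
```

Note that the character check runs before the particle is folded into the TLD, so the dot introduced by the fold never trips it.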
I don't know if it is possible to get the output exactly as you asked. I don't think a single pattern can catch the results in different groups (group2, group3, ...).
I found one way to get almost the result you expect, using the third-party regex module:
import regex

match = regex.search(r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?)\.)?(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.){0,124}?(?P<TLD>(?:co|com|edu|net|org|gov)?\.(?!\d+)[a-z\d]{1,64})$', 'www.without.further.ado.lets.travel.the.forest.gov.it')
Output:
match.captures(0)
['www.without.further.ado.lets.travel.the.forest.gov.it']
match.captures(1) or match.captures('USAGE')
['www.']
match.captures(2)
['without', 'further', 'ado', 'lets', 'travel', 'the', 'forest']
match.captures(3) or match.captures('TLD')
['gov.it']
Here, to avoid taking the . into the captured groups, I have wrapped each label together with its dot in a non-capturing group, like this
(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.)
Hope it helps.

Using regex to divide substrings by comma based on repeating groups

This is a coding exercise. I'm supposed to parse HTML to a string using Python, such that a string of HTML like the following:
"<div><p><b></b></p><p></p><p></p></div>"
Becomes:
"DIV([P([B([])]),P([]),P([])])"
where the tags enclosed by each global tag have to be returned separated by commas.
I understand that regex is not the best choice for this kind of job. Nevertheless, I have a limited set of tools available of which regex is one of them.
So far, what I have is the following:
repl_from = ["<div>", "</div>", "<img />", "<p>", "</p>", "<b>", "</b>"]
for i in repl_from:
    if i == "<div>":
        j = "DIV(["
    elif i == "<img />":
        j = "IMG({})"
    elif i == "<p>":
        j = "P(["
    elif i == "<b>":
        j = "B(["
    else:
        j = "])"
    html = html.replace(i, j)
This gets me DIV([P([B([])])P([])P([])]). Now I have to divide the inner arguments by commas, and this is where I thought about using regexes. But I'm lost in this regard.
I have pseudo code that goes something like this:
1) Find the opening of a global tag (patternI = '[A-Z]+\(\[')
2) Check if what follows are repeating tags (patternII = '[A-Z]+\(\[\]\)+')
3) If so, get the start and end indices of patternII, and then just do a replace with the commas. This last part can simply be executed by splitting with split() and later joining with join(), I think.
How can I implement the last part of the algorithm?
EDIT
Ok, I think I made a mistake when explaining the situation. For any tag that encloses another set of tags (like <div><p></p><p></p></div>), the enclosed tags must be parsed as arguments to the global one (therefore: DIV([P([]), P([])])); if the global tag encloses only one, then no commas are added (<div><p></p></div> will turn out to be DIV([P([])])). In the case where tags are not enclosed by anything (like <p></p><b></b>), they carry no commas in between once transformed (as such: P([])B([])).
I am sure I am not understanding something here, but if this is the case, why not just use a simple:
a="DIV([P([B([])])P([])P([])])"
import re
print(re.sub(r"\)[A-Z]","),P",a))
This will give:
'DIV([P([B([])]),P([]),P([])])'
I must apologise for the fact that I don't know HTML at all, so I can only try to match whatever you come up with as the value of a (since I can't imagine all the possible cases that may exist). In regard to your recent comment,
let's say:
a="DIV([P([B([])])P([])B([])])P([])B([])"
This we can fix with a mix of re.findall and re.sub:
first we will find all that we want to replace in a:
b=re.findall(r"\)[A-Z]",a)
print(b)
this will give:
[')P', ')B', ')P', ')B']
after that we will need to insert a comma between the two characters of each element, as we will use these to substitute the original elements:
for i in range(len(b)):
    b[i] = b[i][0] + "," + b[i][1]
print(b)
this will give:
['),P', '),B', '),P', '),B']
then we will use this b to substitute:
for i in range(len(b)):
    a = re.sub(r"\)[A-Z]", b[i], a, count=1)
print(a)
which will give:
DIV([P([B([])]),P([]),B([])]),P([]),B([])
so the entire above code will look like:
import re

a = "DIV([P([B([])])P([])B([])])P([])B([])"
b = re.findall(r"\)[A-Z]", a)
for i in range(len(b)):
    b[i] = b[i][0] + "," + b[i][1]
for i in range(len(b)):
    a = re.sub(r"\)[A-Z]", b[i], a, count=1)
print(a)
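The find-then-substitute loop can also be collapsed into a single re.sub call with a backreference, which keeps whatever capital letter follows the parenthesis instead of hard-coding ,P (an alternative sketch with the same output as above):

```python
import re

a = "DIV([P([B([])])P([])B([])])P([])B([])"
# ")X" becomes "),X" for any tag letter X
result = re.sub(r"\)([A-Z])", r"),\1", a)
print(result)  # DIV([P([B([])]),P([]),B([])]),P([]),B([])
```

The capture group `([A-Z])` plus the `\1` backreference makes the comma insertion letter-agnostic in one pass.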
P.S.: please just share the possible values of a for which it won't work and the final result you expect from them; I will be able to adapt it.

Regular expression for matching git branch prefixes in specified order

I am trying to write a git hook that will not allow a user to push a commit to a branch that does not follow our branch-prefix rules.
So, for matching the branch name I need to write a regular expression.
Here is the example list representing ordering of the branch prefixes:
ex - experimental
d - devops
b - back-end
f - front-end
So, for instance, we have the following branch names that should match:
<developer>-ex-d - a developer's experimental branch with a devops implementation of git hooks
ex-b-f - an experimental branch with back-end and front-end api implementations
A branch name may have additional words after the prefixes (e.g. -git-hooks, -api).
Each branch name has to contain at least one prefix (e.g. -ex).
Branches names that shouldn't match:
<developer>-d-ex-f-b - illegal prefixes order
exp-front-back - illegal prefixes
The most difficult part for me is to understand how to match the prefixes in the correct order without repetition.
Thanks in advance for answering!
Either I haven't understood you correctly, or this is what you want
/^([^-]+)?(-ex)?(-d)?(-b)?(-f)?(?!((-ex)|(-d)|(-b)|(-f)))(-.*)?$/
See https://regexr.com/3l4t4 to see it working
This looks for your prefixes in sequence, then uses a negative lookahead to enforce no duplicates, then allows other name chunks subsequently.
Actually, there's a bug: this will disallow (e.g.) <developer>-ex-d-dev because it thinks there's a duplicate -d. I'll leave the answer here in case someone else can improve it.
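The pattern, and the duplicate-prefix bug noted above, can be checked in Python, where the same syntax works:

```python
import re

# the pattern from the answer
pattern = re.compile(r'^([^-]+)?(-ex)?(-d)?(-b)?(-f)?(?!((-ex)|(-d)|(-b)|(-f)))(-.*)?$')

print(bool(pattern.match('<developer>-ex-d')))        # True
print(bool(pattern.match('ex-b-f')))                  # True
print(bool(pattern.match('<developer>-d-ex-f-b')))    # False -- illegal order
print(bool(pattern.match('exp-front-back')))          # False -- illegal prefixes
# the bug: a legal branch rejected because "-dev" looks like a duplicate "-d"
print(bool(pattern.match('<developer>-ex-d-dev')))    # False
```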
I wasn't able to answer this with regex, but in python it can be done like this (I know it's not a pretty answer but it works):
import re

def check_prefixes(string_to_check):
    lst_prefixes = ['-ex-', '-d-', '-b-', '-f-']
    # find the position of each prefix (prefixes that are not present are skipped)
    lst_positions = [string_to_check.find(prefix)
                     for prefix in lst_prefixes
                     if string_to_check.find(prefix) != -1]
    string_correct = True
    # if the prefixes aren't in the right order the string isn't correct
    if sorted(lst_positions) != lst_positions:
        string_correct = False
    else:
        # find all the prefixes
        # the string needs to be adjusted so that re.findall actually finds all
        # the prefixes and doesn't skip the even-numbered ones or the first/last
        string_to_check = '-' + string_to_check.split('>-')[-1].replace('-', '--') + '-'
        all_prefixes = re.findall('(-.*?-)', string_to_check)
        # check that all the prefixes found are legal
        for prefix in all_prefixes:
            if prefix not in lst_prefixes:
                string_correct = False
                break
    return string_correct

to_check = ['<developer>-ex-d',
            'ex-b-f',
            '<developer>-d-ex-f-b',
            'exp-front-back']
for string_to_check in to_check:
    print('{} = {}'.format(string_to_check, check_prefixes(string_to_check)))
Perhaps you can make it work in whatever coding language you're most comfortable in.
To do it with a single regex, it may be easier to match the unwanted branches:
without suffix ^[-]*$
other than ex,d,b,f : -(?!(ex|d|b|f)(-|$))[^-]*
not followed by one of expected suffixes
-ex-(?!(d|b|f)(-|$))|
-d-(?!(b|f)(-|$))|
-b-(?!f(-|$))
can be checked on regex101
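Combined into one alternation, the unwanted-branch idea can be sketched like this (Python used for the check; a branch is considered valid when the pattern finds nothing):

```python
import re

# a branch is suspect if any of these alternatives fires
unwanted = re.compile(
    r'-(?!(ex|d|b|f)(-|$))[^-]*'  # a chunk that is not a known prefix
    r'|-ex-(?!(d|b|f)(-|$))'      # -ex- not followed by d, b or f
    r'|-d-(?!(b|f)(-|$))'         # -d- not followed by b or f
    r'|-b-(?!f(-|$))'             # -b- not followed by f
)

print(bool(unwanted.search('<developer>-ex-d')))      # False -- valid
print(bool(unwanted.search('ex-b-f')))                # False -- valid
print(bool(unwanted.search('<developer>-d-ex-f-b')))  # True -- illegal order
print(bool(unwanted.search('exp-front-back')))        # True -- illegal prefixes
```

Note this sketch would also flag the additional-word suffixes from the question (e.g. -api), so those would need extra handling on top of it.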

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input = input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to work; I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")... which I honestly didn't expect to work, but it was the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation. I have a bad habit of wiping something if it doesn't work, but I'll append the last thing I tried to the very end, in case that helps.
If input Like "Guar*###/###-####" Then
    input = input.Replace("Guar:", "")
    input = input.Replace(" ", ControlChars.Tab)
    input = input.Replace(",", ControlChars.Tab)
    input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result gives me this, because this chunk of data was just fine as-is... before I removed the Tabs. Now I need a way to put them back after they've been removed, or to exempt this small group from the document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy that chunk to a variable, remove the Tabs from the document, then insert the variable back without losing its Tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} form what is known as a capture group, which puts whatever is captured into a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion: it peeks at the next character to check that it's not a digit (because then it could be a number of 6 or more digits, of which the first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
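The same replacement can be tried quickly in Python to see it in isolation (the tab-delimited line is made up for illustration):

```python
import re

line = "C\t99999/9\tLASTNAME\tFIRSTNAME\t999 E 99TH ST\tCITY\tST 99999\t999/999-9999"
# a space followed by exactly five digits becomes a tab plus the digits
fixed = re.sub(r" (\d{5})(?!\d)", "\t\\1", line)
print(fixed.split("\t"))
```

Only "ST 99999" is touched: "999 E 99TH ST" survives because none of its space-separated chunks are five digits, and "99999/9" survives because it is preceded by a tab, not a space.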