python - Wrong regex used? - regex

Here is my function:
@register.filter
def load_human_key(key):
    """
    Load util based on type for key.
    return: More readable key
    """
    regex = re.findall('[A-Z][^A-Z]*', key)
    if regex:
        joined_regex = " ".join(regex)
        return joined_regex
    return key
When I use load_human_key("JsonKey"), it works fine and returns "Json Key", but load_human_key("JsonKEY") returns "Json K E Y", which is not the behaviour I want. Can somebody help me fix my function so that load_human_key("JsonKEY") returns the same as load_human_key("JsonKey")? I am completely new to regex.
Thanks!

A regex alone cannot change characters from upper case to lower case, so you'll need to map each match with Python code to take care of that.
Not your question, but the naming in your code is confusing: regex is not the regex, but the list of matches you get from executing one.
Here is how you could do it:
import re

def load_human_key(key):
    # Match each run of capitals plus the non-capitals that follow, then normalise its case.
    return re.sub('[A-Z]+[^A-Z]*', lambda m: ' ' + m[0].capitalize(), key).lstrip()
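To see what the substitution does on a few inputs, here is a small self-contained check of the re.sub approach. The sample keys other than "JsonKey" and "JsonKEY" are made up for illustration.

```python
import re


def load_human_key(key):
    # One or more capitals followed by any non-capitals form one word;
    # capitalize() then lowercases everything except the first letter.
    return re.sub(
        "[A-Z]+[^A-Z]*",
        lambda match: " " + match.group(0).capitalize(),
        key,
    ).lstrip()


# "plain" and "HTTPResponse" are hypothetical keys, not from the question.
samples = ["JsonKey", "JsonKEY", "plain", "HTTPResponse"]
for sample in samples:
    print(sample, "->", load_human_key(sample))
```

Note the last sample: a run of capitals directly followed by a capitalized word ("HTTPResponse") is matched as a single word and comes out as "Httpresponse", so this pattern suits keys like "JsonKEY", where the capital run ends the word.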

Related

How do we fetch multiple occurrences of a regex in a single string in Groovy?

I have a string that I need to fetch the ID field from -
{"jobs":[{"id":"6369c112a2ee5ca08adaa1d01b7e5c74","status":"RUNNING"},{"id":"bbfd87f15334c8e27b40bc46896e95c7","status":"RUNNING"},{"id":"90c5a32e8300da7d43ce351f7f72f0d2","status":"RUNNING"}]}
I need all the matched IDs stored in an array.
I tried the following regexes, but couldn't fetch the IDs -
/"id"\ *:\ *"(.*?)"/
/"id"\ *:\ *"(?<id>.*?)"/
I'm not sure whether they match, or how to fetch the matched data.
Try this:
def str = '{"jobs":[{"id":"6369c112a2ee5ca08adaa1d01b7e5c74","status":"RUNNING"},{"id":"bbfd87f15334c8e27b40bc46896e95c7","status":"RUNNING"},{"id":"90c5a32e8300da7d43ce351f7f72f0d2","status":"RUNNING"}]}'
def pattern = /(?<="id":")\w+(?=")/
def matcher = str =~ /$pattern/
assert matcher.collect() == ["6369c112a2ee5ca08adaa1d01b7e5c74", "bbfd87f15334c8e27b40bc46896e95c7", "90c5a32e8300da7d43ce351f7f72f0d2"]
Since your input is JSON, it is surely more appropriate to process it with a JSON parser:
def s = '''{"jobs":
[{"id":"6369c112a2ee5ca08adaa1d01b7e5c74","status":"RUNNING"},
{"id":"bbfd87f15334c8e27b40bc46896e95c7","status":"RUNNING"},
{"id":"90c5a32e8300da7d43ce351f7f72f0d2","status":"RUNNING"}]}'''
def ids = new groovy.json.JsonSlurper().parseText(s).jobs.collect { it.id }
And that sets ids to [6369c112a2ee5ca08adaa1d01b7e5c74, bbfd87f15334c8e27b40bc46896e95c7, 90c5a32e8300da7d43ce351f7f72f0d2]
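
For comparison, the same two routes (regex versus JSON parser) can be sketched in Python. This is only an illustration of the idea, not a translation of the Groovy API; the variable names are made up, and the regex mirrors the first pattern from the question with \s* in place of the escaped spaces.

```python
import json
import re

payload = (
    '{"jobs":[{"id":"6369c112a2ee5ca08adaa1d01b7e5c74","status":"RUNNING"},'
    '{"id":"bbfd87f15334c8e27b40bc46896e95c7","status":"RUNNING"},'
    '{"id":"90c5a32e8300da7d43ce351f7f72f0d2","status":"RUNNING"}]}'
)

# Regex route: with one capture group, findall returns that group for every match.
ids_regex = re.findall(r'"id"\s*:\s*"(.*?)"', payload)

# Parser route: decode the JSON and read the field from each job.
ids_json = [job["id"] for job in json.loads(payload)["jobs"]]

print(ids_regex == ids_json)
```

The parser route keeps working if the producer reorders fields or changes whitespace, which is the main reason to prefer it.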

regex year format authentication

I have a program where the user is asked for the session year which needs to be in the form of 20XX-20XX. The constraint here is that it needs to be a year followed by its next year. Eg. 2019-2020.
For example,
Valid formats:
2019-2020
2018-2019
2000-2001
Invalid formats:
2019-2021
2000-2000
2019-2018
I am trying to validate this input using regular expressions.
My work:
import re

def add_pages(matchObject):
    return "{0:0=3d}".format(int(matchObject) + 1)

try:
    a = input("Enter Session")
    p = r'2([0-9]{3})-2'
    p1 = re.compile(p)
    x = add_pages(p1.findall(a)[0])
    p2 = r'2([0-9]{3})-2' + x
    p3 = re.compile(p2)
    l = p3.findall(a)
    if not l:
        raise Exception
    else:
        print("Authenticated")
except Exception as e:
    print("Enter session. Eg. 2019-2020")
Question:
So far I have not been able to come up with a single regex that validates this input. I did have a look at backreferences in regex, but they only solved half of my problem. I am looking for ways to improve this validation. Is there a single regex that will check this constraint? Let me know if you need any more information.
Do you really need to get the session year in one input?
I think it's better to have two inputs (or just automatically set the second year to the first year + 1).
I don't know if you're aiming for something bigger and this is just an example, but regex just doesn't seem appropriate for this task to me.
For example, you could do this:
print("Enter session year")
first_year = int(input("First year: "))
second_year = int(input("Second year: "))
if second_year != first_year + 1:
    # validation failed: reject the input or ask again
    print("Enter session. Eg. 2019-2020")
else:
    # program continues
    print("Authenticated")
First of all, why regex? Regex is terrible at math. It would be easier to do something like:
def check_years(string):
    # e.g. string = "2011-2012"
    years = string.split("-")
    return int(years[0]) == int(years[1]) - 1
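
If the 20XX-20XX shape should still be enforced, one option is to let the regex check only the format and leave the "next year" rule to plain arithmetic. This is a minimal sketch under that assumption; the names SESSION_RE and is_valid_session are made up for illustration.

```python
import re

# Format only: two four-digit years starting with 20, separated by a hyphen.
SESSION_RE = re.compile(r"(20[0-9]{2})-(20[0-9]{2})")


def is_valid_session(text):
    # fullmatch rejects anything before or after the two years.
    match = SESSION_RE.fullmatch(text.strip())
    if match is None:
        return False
    first, second = (int(group) for group in match.groups())
    # The arithmetic constraint is checked in Python, not in the pattern.
    return second == first + 1


for candidate in ["2019-2020", "2018-2019", "2019-2021", "2000-2000", "2019-2018"]:
    print(candidate, is_valid_session(candidate))
```

Unlike a bare split on the hyphen, this version returns False for malformed input instead of raising an exception.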

Possible to combine two lines of code into one

I am searching through a database looking for matches. I need to log the entries that match as well as those that don't, so that I have the full database, but for those that match I specifically need to know the part that matches.
serv = ['6:00am-9:00pm', 'Unavailable', '7:00am-10:00pm', '8:00am-9:00pm', 'Closed']
if self.serv[data] == 'Today':
    clotime.append('')
elif self.serv[data] == 'Tomorrow':
    clotime.append('')
elif self.serv[data] == 'Yesterday':
    clotime.append('')
else:
    clo = re.findall('-(.*?):', self.serv[data])
    clotime.append(clo[0])
The bulk of the data ends up running through re.findall, but some is still caught by the initial if/elif checks.
Is there a way to condense this code and do it all with re.findall, maybe even in one line? I need everything (the entire database) gone through and logged so I can process the database correctly when I display the data on a map.
Using anchors you can match a whole string:
clo = re.search(r'^(?:To(?:day|morrow)|Yesterday)$|-(.*?):', self.serv[data])
if clo is not None:
    # group(1) is None when one of the whole-string keywords matched
    clotime.append(clo.group(1) or '')
With your example list:
import re

serv = ['6:00am-9:00pm', 'Unavailable', '7:00am-10:00pm', '8:00am-9:00pm', 'Closed']
clotime = []
for data in serv:
    clo = re.search(r'^(?:To(?:day|morrow)|Yesterday)$|-(.*?):', data)
    if clo is not None:
        clotime.append(clo.group(1) or '')
print(clotime)
I would try something like this:
clo = re.findall(r'-(\d+):', self.serv[data])
clotime.append(clo[0] if clo else '')
If I understood your existing code correctly, you want to append an empty string whenever a closing hour can't be found in the string. This example extracts the closing hour and falls back to an empty string whenever the regex doesn't match anything.
Also, if you're only matching digits, it's better to be explicit about that.
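
As a runnable sketch of the fallback idea, here it is applied to the example list with one entry per input, so the result stays aligned with the database. The helper name closing_hour is made up, and the extra 'Today' entry is added only to show that the keyword cases need no special branch.

```python
import re

# Example list from the question, plus a hypothetical 'Today' entry.
serv = ['6:00am-9:00pm', 'Unavailable', '7:00am-10:00pm', '8:00am-9:00pm', 'Closed', 'Today']


def closing_hour(entry):
    # Digits between the hyphen and the next colon, or '' when absent.
    hours = re.findall(r'-(\d+):', entry)
    return hours[0] if hours else ''


clotime = [closing_hour(entry) for entry in serv]
print(clotime)
```

Because every entry produces exactly one value, clotime has the same length and order as serv.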

regex to return all values not just first found one

I'm learning Pig Latin and am using regular expressions. Not sure if the regex is language agnostic or not but here is what I'm trying to do.
If I have a table with two fields: tweet id and tweet, I'd like to go through each tweet and pull out all mentions up to 3.
So if a tweet goes something like "#tim bla #sam #joe something bla bla" then the line item for that tweet will have tweet id, tim, sam, joe.
The raw data has twitter ids not the actual handles so this regex seems to return a mention (.*)#user_(\\S{8})([:| ])(.*)
Here is what I have tried:
a = load 'data.txt' AS (id:chararray, tweet:chararray);
b = foreach a generate id, LOWER(tweet) as tweet;
-- filter data so only tweets with mentions
c = FILTER b BY tweet MATCHES '(.*)#user_(\\S{8})([:| ])(.*)';
-- try to pull out the mentions.
d = foreach c generate id,
    REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1}',3) as mention1,
    REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1,2}',3) as mention2,
    REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){2,3}',3) as mention3;
e = limit d 20;
dump e;
So in that try I was playing with quantifiers, trying to return the first, second and 3rd instance of a match in a tweet {1}, {1,2}, {2,3}.
That did not work, mention 1-3 are just empty.
So I tried changing d:
d = foreach c generate id,
    REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)',2) as mention1,
    REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',5) as mention2,
    REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',8) as mention3;
But instead of returning each user mentioned, this returned the same mention 3 times. I had expected that by copying and pasting the expression again I'd get the second match, and that pasting it a 3rd time would get the 3rd match.
I'm not sure how well I've managed to word this question but to put it another way, imagine that the function regex_extract() returned an array of matched terms. I would like to get mention[0], mention[1], mention[2] on a single line item.
Whenever you use the REGEX_EXTRACT or REGEX_EXTRACT_ALL UDF, keep in mind that it is just pure regex handled by Java.
It is easier to test the regex through a local Java test. Here is a regex I found to be acceptable:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile(".*?#(\\S+)(?:.*?#(\\S+))?(?:.*?#(\\S+))?.*");
String input = "So if a tweet goes something like #tim bla #sam #joe #bill something bla bla";
Matcher m = p.matcher(input);
if (m.find()) {
    // group 0 is the whole match, groups 1 to 3 are the mentions
    for (int i = 0; i <= m.groupCount(); i++) {
        System.out.println(i + " -> " + m.group(i));
    }
}
Each later mention sits in its own optional group that begins with a lazy .*?, so the engine has to look for the next # before it gives up on that group. The leading .*? and trailing .* let the same pattern work whether it is applied as a search or has to match the whole string.
With this regex, if there is at least one mention, it returns three fields, the second and/or third being null if a second/third mention is not found.
Therefore, you may use the following Pig code:
d = foreach c generate id, REGEX_EXTRACT_ALL(
    tweet, '.*?#(\\S+)(?:.*?#(\\S+))?(?:.*?#(\\S+))?.*');
You do not even need to filter the data first.
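
Outside Pig, the "array of matched terms" the question asks for is what a find-all call gives directly. The following Python sketch is only meant for trying the idea locally, not as a Pig replacement; the names MENTION_RE and first_mentions are made up, and the pattern assumes a mention is # followed by non-space characters.

```python
import re

# A mention is '#' followed by a run of non-whitespace characters.
MENTION_RE = re.compile(r"#(\S+)")


def first_mentions(tweet, limit=3):
    # Collect every mention, keep the first `limit`, and pad with None
    # so each tweet always yields the same number of fields.
    found = MENTION_RE.findall(tweet)[:limit]
    return found + [None] * (limit - len(found))


print(first_mentions("#tim bla #sam #joe something bla bla"))
print(first_mentions("only #tim here"))
```

Padding to a fixed width mirrors the fixed mention1, mention2, mention3 columns the Pig script is trying to produce.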

How to highlight searched for words using a regular expression

Hi,
I am working on a Groovy application that requires me to highlight (add spans to) the word that is searched for. For instance, given the text below:
youtube
[href="youtube.com] i am here , in Youtube[/a]
I want to search for the word "youtube", and the text above should then look like this:
[span]youtube[/span]
[href="youtube.com] i am here , in [span]Youtube[/span] [/a]
Any occurrence of "youtube" inside the href or inside an iframe must be ignored.
At the moment I have the following code :
def m = test =~ /([^<]*)?(youtube)/
println m[0]

def highLightText = { attrs, body ->
    def postBody = attrs.text
    def m = postBody =~ /(?i:${attrs.searchTerm})/
    def array = []
    m.each{
        array << it as String
    }
    array.unique()
    String result = postBody
    array.each{
        result = result.replaceAll("${it}", "<span class='highlight'>${it}</span>")
    }
    out << result
}
And it returns:
[span]youtube[/span]
[href="[span]youtube[/span].com] i am here , in [span]Youtube[/span] [/a]
Can anyone help me with a regular expression that selects only words that are not contained in links or other tags?
Thanks
A maintainable solution is unlikely to be achievable using regular expressions - the problem is too complex.
Parse your HTML into a DOM and consider only text nodes as being suitable for potential highlighting. Text nodes will, by definition, be only those pieces of content that are rendered and will not be element names, attributes/attribute values and so on.
The complexity of your problem is then reduced to: how do I find and highlight a string within another string?
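
To illustrate the text-node idea, here is a minimal sketch in Python using the standard library's event-based html.parser rather than a full DOM (a Groovy application would use an HTML parser available on the JVM instead). The class name Highlighter and the function highlight are made up for illustration. The sketch copies tags through unchanged and only rewrites text; it drops comments and doctype declarations and does not special-case script or style content, so treat it as a starting point only.

```python
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    """Re-emit markup, wrapping matches of a search term in text only."""

    def __init__(self, term):
        # convert_charrefs=False reports entities such as &amp; separately,
        # so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(re.escape(term), re.IGNORECASE)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Tags and their attribute values are copied through untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text between tags is searched and wrapped.
        self.out.append(self.pattern.sub(
            lambda m: f'<span class="highlight">{m.group(0)}</span>', data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def highlight(markup, term):
    parser = Highlighter(term)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(highlight('youtube <a href="youtube.com">i am here, in Youtube</a>', "youtube"))
```

The href value is never passed to handle_data, which is why the occurrence inside the link target is left alone while both visible occurrences are wrapped.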