extract substring using regex in groovy - regex

If I have the following pattern in some text:
def articleContent = "<![CDATA[ Hellow World ]]>"
I would like to extract the "Hellow World" part, so I use the following code to match it:
def contentRegex = "<![CDATA[ /(.)*/ ]]>"
def contentMatcher = ( articleContent =~ contentRegex )
println contentMatcher[0]
However, I keep getting a null pointer exception because the regex doesn't seem to be working. What would be the correct regex for "any piece of text", and how do I collect it from a string?

Try:
def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[ 0 ][ 1 ]
However, I worry that you are planning to parse XML with regular expressions. If this CDATA is part of a larger valid XML document, it is better to use an XML parser.

The code below shows the substring extraction using regex in groovy:
class StringHelper {
    @NonCPS
    static String stripSshPrefix(String gitUrl) {
        def match = (gitUrl =~ /ssh:\/\/(.+)/)
        if (match.find()) {
            return match.group(1)
        }
        return gitUrl
    }
    static void main(String... args) {
        def gitUrl = "ssh://git@github.com:jiahut/boot.git"
        def gitUrl2 = "git@github.com:jiahut/boot.git"
        println(stripSshPrefix(gitUrl))
        println(stripSshPrefix(gitUrl2))
    }
}

A little late to the party, but try using backslashes when defining your pattern, for example:
def articleContent = "real groovy"
def matches = (articleContent =~ /gr\w{4}/) //grabs 'gr' and its following 4 chars
def firstmatch = matches[0] //firstmatch would be 'groovy'
You were on the right track; it was just the pattern definition that needed to be altered.
References:
https://www.regular-expressions.info/groovy.html
http://mrhaki.blogspot.com/2009/09/groovy-goodness-matchers-for-regular.html

One more single-line solution, in addition to tim_yates's:
def result = articleContent.replaceAll(/<!\[CDATA\[(.+)]]>/,/$1/)
Please take into account that if the regexp doesn't match, result will be equal to the source string. In contrast, in the case of
def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[0][1]
an exception will be raised.

In my case, the actual string was multi-line like below
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Start Date : 2020-11-25 00:00:00
I wanted to extract the Start Date value from this string, so here is what my script looks like:
def matches = (originalData =~ /(?<=Actual Start Date :).*/)
def extractedData = matches[0]
This regex uses a lookbehind to extract the rest of any line that follows the prefix given in the pattern (here Actual Start Date :), so the prefix must match the label in your data exactly.
In my case, the result is 2020-11-25 00:00:00
Note: if your originalData is a multi-line string, then in Groovy you can define it as follows:
def originalData =
"""
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Start Date : 2020-11-25 00:00:00
"""
This script looks simple, but it took me a good while to figure out a few things, so I'm posting it here.

Related

Fastest way to replace phrases from sentences with Python?

I have a list of 3800 names I want to remove from 750K sentences.
The names can contain multiple words such as "The White Stripes".
Some names might also look like a subset of a larger name, e.g. 'Ame' may be one name and 'Amelie' may be another.
This is what my current implementation looks like:
import re
def find_whole_word(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search
names_lowercase = ['the white stripes', 'the beatles', 'slayer', 'ame', 'amelie'] # 3800+ names
def strip_names(sentence: str):
    token = sentence.lower()
    has_name = False
    matches = []
    for name in names_lowercase:
        match = find_whole_word(name)(token)
        if match:
            matches.append(match)
    def get_match(match):
        return match.group(1)
    matched_strings = list(map(get_match, matches))
    matched_strings.sort(key=len, reverse=True)
    for matched_string in matched_strings:
        # strip names at the start, end and when they occur in the middle of text (with whitespace around)
        token = re.sub(rf"(?<!\S){matched_string}(?!\S)", "", token)
    return token
sentences = [
    "how now brown cow",
    "die hard fan of slayer",
    "the white stripes kill",
    "besides slayer I believe the white stripes are the best",
    "who let ame out",
    "amelie has got to go"
] # 750K+ sentences
filtered_list = [strip_names(sentence) for sentence in sentences]
# Expected: filtered_list = ["how now brown cow", "die hard fan of ", " kill", "besides I believe are the best", "who let out", " has got to go"]
My current implementation takes several hours. I don't care about readability as this code won't be used for long.
Any suggestions on how I can reduce the run time?
My previous solution was overkill.
All I really had to do was use the word boundary \b as described in the documentation.
Usage example: https://regex101.com/r/2CZ8el/1
import re
names_joined = "|".join(names_lowercase)
names_whole_words_filter_expression = re.compile(rf"\b({names_joined})\b", flags=re.IGNORECASE)
def strip_names(text: str):
    return re.sub(names_whole_words_filter_expression, "", text).strip()
Now it takes a few minutes instead of a few hours 🙌
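A minimal sketch of the same single-pattern idea, with two precautions that are my additions rather than part of the answer above: names are passed through re.escape in case any contain regex metacharacters, and they are sorted longest first so that longer names are tried before their prefixes. The helper name build_names_pattern is hypothetical, and the name list is just the sample from the question.

```python
import re

# Sample data from the question; the real list has 3800+ names.
names_lowercase = ['the white stripes', 'the beatles', 'slayer', 'ame', 'amelie']

def build_names_pattern(names):
    # Longest names first, so 'amelie' is tried before 'ame'.
    ordered = sorted(names, key=len, reverse=True)
    # re.escape keeps characters such as '+' or '.' in a name literal.
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"\b(?:{alternation})\b", flags=re.IGNORECASE)

# Compile once and reuse the pattern for every sentence.
NAMES_PATTERN = build_names_pattern(names_lowercase)

def strip_names(text):
    # Remove every whole-word occurrence of a name, then trim the ends.
    return NAMES_PATTERN.sub("", text).strip()
```

Compiling the pattern once, outside the per-sentence call, is what avoids repeating the work for each of the 750K sentences. Note that whitespace left behind in the middle of a sentence is not collapsed.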

How to extract files with date pattern using python

I have n-files in a folder like
source_dir
abc_2017-07-01.tar
abc_2017-07-02.tar
abc_2017-07-03.tar
pqr_2017-07-02.tar
Let's consider a single pattern for now, 'abc'
(but I get this pattern randomly from a database, so I need double filtering: one for the pattern and one for the last day).
I want to pick out the file for the last day, i.e. '2017-07-02'.
Here I can get the files matching the pattern, but not exactly the last-day files:
Code
pattern = 'abc'
allfiles=os.listdir(source_dir)
m_files=[f for f in allfiles if str(f).startswith(pattern)]
print m_files
output:
[ 'abc_2017-07-01.tar' , 'abc_2017-07-02.tar' , 'abc_2017-07-03.tar' ]
This gives me all files related to the abc pattern, but how can I filter out only the last-day file for that pattern?
Expected :
[ 'abc_2017-07-02.tar' ]
Thanks
Just a minor tweak to your code can get you the desired result.
import os
from datetime import datetime, timedelta
allfiles=os.listdir(source_dir)
file_date = datetime.now() + timedelta(days=-1)
pattern = 'abc_' +str(file_date.date())
m_files=[f for f in allfiles if str(f).startswith(pattern)]
Hope this helps!
latest = max(m_files, key=lambda x: x[-14:-4])
will find the filename with latest date among filenames in m_files.
Use Python's regex module (re), like:
import re
import os
files = os.listdir(source_dir)
for file in files:
    match = re.search(r'abc_2017-07-(\d{2})\.tar', file)
    day = match.group(1)
and then you can work with day in the loop to do whatever you want, such as creating that list:
import re
import os
def extract_day(name):
    # Search the name passed in (the original searched the undefined variable 'file').
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    day = match.group(1)
    return day
files = os.listdir(source_dir)
days = [extract_day(file) for file in files]
If the month is also variable, you can substitute '07' with '\d\d' or '\d{2}'. Be careful if you have files that don't match the pattern at all: match.group() will then raise an error, since match is None. In that case use:
def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    try:
        day = match.group(1)
    except AttributeError:  # match is None when the name does not fit the pattern
        day = None
    return day
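The two filters the question asks for (prefix and day) can also be combined in one pass. This is only a sketch: the function name files_for_day is hypothetical, and it assumes every file of interest is named prefix_YYYY-MM-DD.tar as in the listing above. Names that don't fit are skipped rather than raising an error.

```python
import re
from datetime import date

def files_for_day(filenames, prefix, day):
    # Assumed naming scheme: <prefix>_<YYYY-MM-DD>.tar
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d{{4}}-\d{{2}}-\d{{2}})\.tar$")
    wanted = day.isoformat()  # e.g. '2017-07-02'
    selected = []
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            continue  # different prefix or unexpected name
        if match.group(1) == wanted:
            selected.append(name)
    return selected

# Sample listing from the question.
sample_files = ['abc_2017-07-01.tar', 'abc_2017-07-02.tar',
                'abc_2017-07-03.tar', 'pqr_2017-07-02.tar']
print(files_for_day(sample_files, 'abc', date(2017, 7, 2)))
```

For "yesterday" relative to the current run, the day argument could be computed as date.today() - timedelta(days=1), in the spirit of the first answer.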

part of a string contained in another string regex python

Is there a way to check if any part of a string matches with another string in python?
For e.g.: I have URLs which look like this
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
and I have strings which look like:
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
string = '|'.join(string_list)
I would like to match string with url.
Anastasia Beverly Hills with www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA and
www.ulta.com/beautyservices/benefitbrowbar/ with Benefit Cosmetics.
I've been trying url['urls'].str.contains('('+string+')', case = False), but this does not match.
What's the correct way to do this?
I can't do it as a regex in one line but here is my attempt using itertools and any:
import pandas as pd
from itertools import product
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
# For each pair in the Cartesian product (the different combinations)
# of string_list and urls.
for x in list(product(string_list, url['urls'])):
    # If any of the words in the string (x[0]) are present in
    # the URL (x[1]), disregarding case.
    if any(word.lower() in x[1].lower() for word in x[0].split()):
        # Show the match.
        print("Match String: %s URL: %s" % (x[0], x[1]))
Outputs:
Match String: Benefit Cosmetics URL: www.ulta.com/beautyservices/benefitbrowbar/
Match String: Anastasia Beverly Hills URL: www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA
Updated:
The way you were looking at it you could alternatively use:
import pandas as pd
import warnings
pd.set_option('display.width', 100)
# Suppress the warning it will give on a match.
warnings.filterwarnings("ignore", 'This pattern has match groups')
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
# Create a pandas DataFrame.
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
# Using one string at a time.
for string in string_list:
    # Get the individual words in the string and concatenate them
    # using a pipe to create a regex pattern.
    s = "|".join(string.split())
    # Update the DataFrame with True or False where the regex
    # matches the URL.
    url[string] = url['urls'].str.contains('('+s+')', case = False)
# Show the result.
print(url)
which would output:
                                                urls  Benefit Cosmetics  Anastasia Beverly Hills
0  www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00...              False                     True
1        www.ulta.com/beautyservices/benefitbrowbar/               True                    False
Which I guess, if you want it in a DataFrame, may be better but I prefer the first way.
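The attempt in the question fails because the joined pattern 'Benefit Cosmetics|Anastasia Beverly Hills' looks for each full phrase, spaces included, and neither phrase appears verbatim in a URL. Below is a rough sketch of the word-by-word idea using only the re module; the helper names word_pattern and match_phrases are hypothetical, and the data is the sample from the question.

```python
import re

urls = ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA',
        'www.ulta.com/beautyservices/benefitbrowbar/']
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']

def word_pattern(phrase):
    # One alternative per word of the phrase, escaped and case-insensitive.
    # A non-capturing group is used, so pandas' str.contains would not
    # warn about match groups if this pattern's source were passed to it.
    words = "|".join(re.escape(word) for word in phrase.split())
    return re.compile(f"(?:{words})", flags=re.IGNORECASE)

def match_phrases(phrases, candidates):
    # Map each phrase to the candidates containing at least one of its words.
    matches = {}
    for phrase in phrases:
        pattern = word_pattern(phrase)
        matches[phrase] = [c for c in candidates if pattern.search(c)]
    return matches

print(match_phrases(string_list, urls))
```

Matching on any single word is deliberately loose; short or common words in a phrase would produce false positives on larger data.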

Regular Expression Python3---why

I have written the following python code
import re
def get_items():
    text = '''
Quantitative Finance
Statistics
General information
Support and Governance Model
Find
'''
    pattern = re.compile(r'(.*?)', re.S)
    items = re.match(pattern, text).group(1)
    print(items)
get_items()
but it doesn't work. Why?
The regular expression is as follows:
pattern = re.compile(r'(.*?)', re.S)
Your regex is correct, but you are using the wrong calls to iterate over the matches. See the corrected version below, which uses pattern.finditer(text) and match.group(1).
import re
def get_items():
    text = '''
Quantitative Finance
Statistics
General information
Support and Governance Model
Find
'''
    pattern = re.compile(r'(.*?)', re.S)
    for match in pattern.finditer(text):
        yield match.group(1)
for item in get_items():
    print(item)
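To illustrate the difference between the calls, here is a small sketch. The pattern line_pattern is a stand-in of my own (it captures each non-empty line), not the pattern from the question; the point is only that re.match tries the very start of the string, search returns the first hit, and finditer walks over all of them.

```python
import re

text = '''
Quantitative Finance
Statistics
General information
Support and Governance Model
Find
'''

# Hypothetical pattern: the content of each non-empty line, without
# surrounding spaces or tabs. re.M makes ^ and $ work per line.
line_pattern = re.compile(r'^[ \t]*(\S.*?)[ \t]*$', re.M)

def first_item(source):
    # search() returns only the first match, or None if there is none.
    match = line_pattern.search(source)
    return match.group(1) if match else None

def all_items(source):
    # finditer() yields one match object per occurrence.
    return [m.group(1) for m in line_pattern.finditer(source)]

# match() only tries position 0, which here is an empty first line.
print(line_pattern.match(text))
print(first_item(text))
print(all_items(text))
```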

How to remove unwanted items from a parse file

from googlefinance import getQuotes
import json
import time as t
import re
List = ["A","AA","AAB"]
Time=t.localtime() # Sets variable Time to retrieve date/time info
Date2= ('%d-%d-%d %dh:%dm:%dsec'%(Time[0],Time[1],Time[2],Time[3],Time[4],Time[5])) #formats time stamp
while True:
    for i in List:
        try: #allows elements to be called and if an error does the next step
            Data = json.dumps(getQuotes(i.lower()),indent=1) #retrieves Data from google finance
            regex = ('"LastTradePrice": "(.+?)",') #sets parse
            pattern = re.compile(regex) #compiles parse
            price = re.findall(pattern,Data) #retrieves parse
            print(i)
            print(price)
        except: #sets Error coding
            Error = (i + ' Failed to load on: ' + Date2)
            print (Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.
Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110
Try using the type() function to find out the data type; in your case, type(price).
If the data type is list, use print(price[0])
and you will get the number as output. For the brackets, you need to check the Google data and the regex.
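re.findall always returns a list, which is why the brackets and quotes show up when the list itself is printed. The sketch below shows taking the first element safely, and, as an alternative I am suggesting rather than something from the answers, reading the field from the parsed JSON instead of using a regex. The payload is made up for illustration; only the LastTradePrice key comes from the question, and StockSymbol is an assumed field.

```python
import json
import re

# Hypothetical payload standing in for json.dumps(getQuotes(...), indent=1).
data = json.dumps([{"StockSymbol": "A", "LastTradePrice": "42.14"}], indent=1)

def price_via_regex(text):
    # findall returns a list of captured strings; take the first, if any.
    found = re.findall(r'"LastTradePrice": "(.+?)"', text)
    return found[0] if found else None

def price_via_json(text):
    # Parse the JSON text back into a list of dicts and read the key directly.
    quotes = json.loads(text)
    return quotes[0]["LastTradePrice"] if quotes else None

print(price_via_regex(data))
print(price_via_json(data))
```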