extract substring using regex in groovy

If I have the following pattern in some text:
def articleContent = "<![CDATA[ Hellow World ]]>"
I would like to extract the "Hellow World" part, so I use the following code to match it:
def contentRegex = "<![CDATA[ /(.)*/ ]]>"
def contentMatcher = ( articleContent =~ contentRegex )
println contentMatcher[0]
However, I keep getting a null pointer exception because the regex doesn't seem to be working. What would be the correct regex for "any piece of text", and how do I collect it from a string?

Try:
def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[ 0 ][ 1 ]
However, I worry that you are planning to parse XML with regular expressions. If this CDATA is part of a larger valid XML document, it is better to use an XML parser.

The code below shows the substring extraction using regex in groovy:
class StringHelper {
    @NonCPS // Jenkins Pipeline annotation; drop it when running as plain Groovy
    static String stripSshPrefix(String gitUrl) {
        def match = (gitUrl =~ /ssh:\/\/(.+)/)
        if (match.find()) {
            return match.group(1)
        }
        return gitUrl
    }
    static void main(String... args) {
        def gitUrl = "ssh://git@github.com:jiahut/boot.git"
        def gitUrl2 = "git@github.com:jiahut/boot.git"
        println(stripSshPrefix(gitUrl))
        println(stripSshPrefix(gitUrl2))
    }
}

A little bit late to the party, but try defining your pattern as a slashy string (/.../), so that backslash escapes such as \w need no doubling. Example:
def articleContent = "real groovy"
def matches = (articleContent =~ /gr\w{4}/) //grabs 'gr' and its following 4 chars
def firstmatch = matches[0] //firstmatch would be 'groovy'
You were on the right track; it was just the pattern definition that needed to be altered.
References:
https://www.regular-expressions.info/groovy.html
http://mrhaki.blogspot.com/2009/09/groovy-goodness-matchers-for-regular.html

One more single-line solution, in addition to tim_yates's:
def result = articleContent.replaceAll(/<!\[CDATA\[(.+)]]>/,/$1/)
Please take into account that if the regexp doesn't match, result will be equal to the source string. By contrast,
def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[0][1]
raises an exception when there is no match.

In my case, the actual string was multi-line like below
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Start Date : 2020-11-25 00:00:00
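I wanted to extract the Start Date value from this string, so here is what my script looks like:
def matches = (originalData =~ /(?<=Actual Start Date :).*/)
def extractedData = matches[0]
The lookbehind (?<=...) makes the regex match the rest of every line that follows the given prefix, here Actual Start Date :. The prefix has to appear literally in the data; for lines that read Start Date :, as in the sample above, the lookbehind would be (?<=Start Date :) and matches[0] would be the first such line.
In my case, the result is 2020-11-25 00:00:00
Note: if your originalData is a multi-line string, in Groovy you can define it as follows: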
def originalData =
"""
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Start Date : 2020-11-25 00:00:00
"""
This script looks simple, but it took me a good while to figure out a few things, so I'm posting it here.

Related

Fastest way to replace phrases from sentences with Python?

I have a list of 3800 names I want to remove from 750K sentences.
The names can contain multiple words such as "The White Stripes".
Some names might also look like a subset of a larger name, e.g. 'Ame' may be one name and 'Amelie' may be another.
This is what my current implementation looks like:
import re
def find_whole_word(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search
names_lowercase = ['the white stripes', 'the beatles', 'slayer', 'ame', 'amelie'] # 3800+ names
def strip_names(sentence: str):
    token = sentence.lower()
    has_name = False
    matches = []
    for name in names_lowercase:
        match = find_whole_word(name)(token)
        if match:
            matches.append(match)
    def get_match(match):
        return match.group(1)
    matched_strings = list(map(get_match, matches))
    matched_strings.sort(key=len, reverse=True)
    for matched_string in matched_strings:
        # strip names at the start, end and when they occur in the middle of text (with whitespace around)
        token = re.sub(rf"(?<!\S){matched_string}(?!\S)", "", token)
    return token
sentences = [
    "how now brown cow",
    "die hard fan of slayer",
    "the white stripes kill",
    "besides slayer I believe the white stripes are the best",
    "who let ame out",
    "amelie has got to go"
] # 750K+ sentences
filtered_list = [strip_names(sentence) for sentence in sentences]
# Expected: filtered_list = ["how now brown cow", "die hard fan of ", " kill", "besides I believe are the best", "who let out", " has got to go"]
My current implementation takes several hours. I don't care about readability, as this code won't be used for long.
Any suggestions on how I can reduce the run time?

My previous solution was overkill.
All I really had to do was use the word boundary \b as described in the documentation.
Usage example: https://regex101.com/r/2CZ8el/1
import re
names_joined = "|".join(names_lowercase)
names_whole_words_filter_expression = re.compile(rf"\b({names_joined})\b", flags=re.IGNORECASE)
def strip_names(text: str):
    return re.sub(names_whole_words_filter_expression, "", text).strip()
Now it takes a few minutes instead of a few hours 🙌
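A slightly more defensive variant of the same idea, as a minimal sketch: it assumes some names may contain regex metacharacters (hence re.escape) and sorts the alternatives longest first so that a longer name is tried before a shorter one sharing its prefix. The sample names and sentences are only stand-ins for the real data.

```python
import re

# Hypothetical sample standing in for the 3800+ names.
names = ["the white stripes", "the beatles", "slayer", "ame", "amelie"]

# Longest names first, so a longer name is tried before a shorter one that
# shares its prefix; re.escape keeps characters such as "." or "+" literal.
alternation = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
names_pattern = re.compile(rf"\b(?:{alternation})\b", flags=re.IGNORECASE)


def strip_names(text: str) -> str:
    """Remove every whole-word occurrence of a known name from text."""
    return names_pattern.sub("", text).strip()


for sentence in ["die hard fan of slayer", "who let ame out", "amelie has got to go"]:
    print(repr(strip_names(sentence)))
```

Compiling one pattern up front means each sentence is scanned once, instead of once per name.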

How to extract files with date pattern using python

I have n-files in a folder like
source_dir
abc_2017-07-01.tar
abc_2017-07-02.tar
abc_2017-07-03.tar
pqr_2017-07-02.tar
Let's consider a single pattern for now, 'abc'
(but I get this pattern at random from a database, so I need double filtering: one for the pattern and one for the last day).
I want to pick out the file for the last day, i.e. '2017-07-02'.
With the code below I can get the files sharing the pattern, but not exactly the last-day file.
Code
pattern = 'abc'
allfiles=os.listdir(source_dir)
m_files=[f for f in allfiles if str(f).startswith(pattern)]
print m_files
output:
[ 'abc_2017-07-01.tar' , 'abc_2017-07-02.tar' , 'abc_2017-07-03.tar' ]
This gives me all files matching the abc pattern, but how can I filter out only the last-day file for that pattern?
Expected :
[ 'abc_2017-07-02.tar' ]
Thanks

Just a minor tweak to your code gets you the desired result:
import os
from datetime import datetime, timedelta
allfiles=os.listdir(source_dir)
file_date = datetime.now() + timedelta(days=-1)
pattern = 'abc_' +str(file_date.date())
m_files=[f for f in allfiles if str(f).startswith(pattern)]
Hope this helps!
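The same approach can be wrapped in a small function, sketched here under the assumption that file names always look like prefix_YYYY-MM-DD.tar. The function name and the today parameter are hypothetical; the parameter only exists so the result can be checked against a fixed date.

```python
from datetime import date, timedelta


def files_for_previous_day(all_files, prefix, today=None):
    """Return the names that start with '<prefix>_<yesterday as YYYY-MM-DD>'."""
    today = today or date.today()
    wanted = f"{prefix}_{(today - timedelta(days=1)).isoformat()}"
    return [name for name in all_files if name.startswith(wanted)]


# Hypothetical directory listing, mirroring the question.
all_files = [
    "abc_2017-07-01.tar",
    "abc_2017-07-02.tar",
    "abc_2017-07-03.tar",
    "pqr_2017-07-02.tar",
]
print(files_for_previous_day(all_files, "abc", today=date(2017, 7, 3)))
```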

latest = max(m_files, key=lambda x: x[-14:-4])
will find the filename with latest date among filenames in m_files.
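If the prefix or extension length may vary, slicing by fixed positions gets fragile. A minimal sketch that parses the date instead, assuming the date always sits between the last underscore and the extension (file_date is a hypothetical helper name):

```python
from datetime import datetime

# Hypothetical listing, mirroring the file names in the question.
m_files = ["abc_2017-07-01.tar", "abc_2017-07-03.tar", "abc_2017-07-02.tar"]


def file_date(name: str) -> datetime:
    """Parse the YYYY-MM-DD part that sits between the last '_' and the extension."""
    stem = name.rsplit(".", 1)[0]  # drop the extension
    return datetime.strptime(stem.rsplit("_", 1)[1], "%Y-%m-%d")


latest = max(m_files, key=file_date)
print(latest)
```

Parsing with strptime also raises a ValueError on malformed names, rather than silently comparing unrelated text.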

Use Python's re module, like this:
import re
import os
files = os.listdir(source_dir)
for file in files:
    match = re.search(r'abc_2017-07-(\d{2})\.tar', file)
    day = match.group(1)
You can then work with day inside the loop to do whatever you want, for example to build that list:
import re
import os
def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    day = match.group(1)
    return day
files = os.listdir(source_dir)
days = [extract_day(file) for file in files]
If the month is also variable, you can replace '07' with '\d\d' or '\d{2}'. Be careful if you have files that don't match the pattern at all: re.search then returns None, so match.group(1) raises an AttributeError. In that case use:
def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    try:
        day = match.group(1)
    except AttributeError:  # match is None when the name does not fit the pattern
        day = None
    return day
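To handle any prefix and any date in one pass, a sketch with named groups may help. It assumes prefixes consist of lowercase letters only, which is a guess based on the sample names; FILE_RE and parse_name are hypothetical names.

```python
import re

# Assumption: lowercase-letter prefix, then '_', an ISO date, and '.tar'.
FILE_RE = re.compile(r"^(?P<prefix>[a-z]+)_(?P<date>\d{4}-\d{2}-\d{2})\.tar$")


def parse_name(name):
    """Return (prefix, date string), or None when the name does not fit."""
    match = FILE_RE.match(name)
    return (match.group("prefix"), match.group("date")) if match else None


names = ["abc_2017-07-01.tar", "pqr_2017-07-02.tar", "notes.txt"]
parsed = [p for p in map(parse_name, names) if p is not None]
print(parsed)
```

Checking the match object explicitly avoids the try/except and makes the "no match" case visible at the call site.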

part of a string contained in another string regex python

Is there a way to check whether any part of a string matches another string in Python?
For example, I have URLs which look like this:
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
and I have strings which look like:
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
string = '|'.join(string_list)
I would like to match string with url.
Anastasia Beverly Hills with www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA and
www.ulta.com/beautyservices/benefitbrowbar/ with Benefit Cosmetics.
I've been trying url['urls'].str.contains('('+string+')', case = False) but this does not match.
What's the correct way to do this?

I can't do it as a regex in one line but here is my attempt using itertools and any:
import pandas as pd
from itertools import product
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
# For each element of the Cartesian product (the different combinations) of
# string_list and urls.
for x in list(product(string_list, url['urls'])):
    # If any of the words in the string (x[0]) are present in
    # the URL (x[1]), disregarding case.
    if any(word.lower() in x[1].lower() for word in x[0].split()):
        # Show the match.
        print("Match String: %s URL: %s" % (x[0], x[1]))
Outputs:
Match String: Benefit Cosmetics URL: www.ulta.com/beautyservices/benefitbrowbar/
Match String: Anastasia Beverly Hills URL: www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA
Updated:
The way you were looking at it you could alternatively use:
import pandas as pd
import warnings
pd.set_option('display.width', 100)
# Suppress the warning it will give on a match.
warnings.filterwarnings("ignore", 'This pattern has match groups')
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
# Create a pandas DataFrame.
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
# Using one string at a time.
for string in string_list:
    # Get the individual words in the string and concatenate them
    # using a pipe to create a regex pattern.
    s = "|".join(string.split())
    # Update the DataFrame with True or False where the regex
    # matches the URL.
    url[string] = url['urls'].str.contains('('+s+')', case = False)
# Show the result.
print(url)
which would output:
                                                urls  Benefit Cosmetics  Anastasia Beverly Hills
0  www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00...              False                     True
1        www.ulta.com/beautyservices/benefitbrowbar/               True                    False
I guess this may be better if you want the result in a DataFrame, but I prefer the first way.
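The word-by-word matching can also be done with the standard library alone. This is a minimal sketch, assuming (as both versions above do) that a single shared word counts as a match; brand_pattern is a hypothetical helper name.

```python
import re

urls = [
    "www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA",
    "www.ulta.com/beautyservices/benefitbrowbar/",
]
string_list = ["Benefit Cosmetics", "Anastasia Beverly Hills"]


def brand_pattern(brand):
    """Compile a case-insensitive pattern matching any single word of brand."""
    words = (re.escape(word) for word in brand.split())
    return re.compile("|".join(words), flags=re.IGNORECASE)


# Map each URL to the brand strings that share at least one word with it.
matches = {
    url: [brand for brand in string_list if brand_pattern(brand).search(url)]
    for url in urls
}
for url, brands in matches.items():
    print(url, "->", brands)
```

Matching on any single word is loose: a short or common word in a brand name could produce false positives on longer lists.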

Regular Expression Python3---why

I have written the following python code
import re
def get_items():
    text = '''
Quantitative Finance
Statistics
General information
Support and Governance Model
Find
'''
    pattern = re.compile(r'(.*?)', re.S)
    items = re.match(pattern, text).group(1)
    print(items)
get_items()
But it doesn't work. Why?
The regular expression is as follows:
pattern = re.compile(r'(.*?)', re.S)

Your regex is correct, but you are using the wrong calls to iterate over the matches. See the corrected version below, which uses pattern.finditer(text) and match.group(1).
import re
def get_items():
    text = '''
Quantitative Finance
Statistics
General information
Support and Governance Model
Find
'''
    pattern = re.compile(r'(.*?)', re.S)
    for match in pattern.finditer(text):
        yield match.group(1)
for item in get_items():
    print(item)
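To see the difference between the two calls, here is a small sketch. The markup around (.*?) is not shown in the posts above, so the <li> tags below are purely an assumption made for illustration.

```python
import re

# Assumed input: the <li> tags are hypothetical, chosen only to give the
# lazy group (.*?) something to be delimited by.
text = """
<li>Quantitative Finance</li>
<li>Statistics</li>
<li>General information</li>
"""
pattern = re.compile(r"<li>(.*?)</li>", re.S)

# re.match only tries at the very start of the string, which is a newline here.
first = pattern.match(text)

# finditer scans the whole string and yields one match object per item.
items = [m.group(1) for m in pattern.finditer(text)]

print(first)
print(items)
```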

How to remove unwanted items from a parse file

from googlefinance import getQuotes
import json
import time as t
import re
List = ["A","AA","AAB"]
Time=t.localtime() # Sets variable Time to retrieve date/time info
Date2= ('%d-%d-%d %dh:%dm:%dsec'%(Time[0],Time[1],Time[2],Time[3],Time[4],Time[5])) #formats time stamp
while True:
    for i in List:
        try: #allows elements to be called and if an error does the next step
            Data = json.dumps(getQuotes(i.lower()),indent=1) #retrieves Data from google finance
            regex = ('"LastTradePrice": "(.+?)",') #sets parse
            pattern = re.compile(regex) #compiles parse
            price = re.findall(pattern,Data) #retrieves parse
            print(i)
            print(price)
        except: #sets Error coding
            Error = (i + ' Failed to load on: ' + Date2)
            print (Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.

Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110

Try using the type() function to find out the data type; in your case, type(price).
If the data type is list, use print(price[0])
and you will get the number as output. For the brackets, you need to check the Google data and the regex.
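The reason for the brackets is that re.findall always returns a list, even for a single hit. A minimal sketch on made-up data (the JSON text below is hypothetical and only mimics the field used in the question):

```python
import json
import re

# Hypothetical JSON text; only the "LastTradePrice" field name comes from the question.
data = '{\n "StockSymbol": "A",\n "LastTradePrice": "42.14",\n "Index": "NYSE"\n}'

pattern = re.compile(r'"LastTradePrice": "(.+?)",')
prices = pattern.findall(data)               # a list of captured strings
first_price = prices[0] if prices else None  # guard against an empty list

# Alternative: parse the JSON and read the field directly, with no regex.
parsed_price = json.loads(data)["LastTradePrice"]

print(prices)
print(first_price)
print(parsed_price)
```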

We Keep Coding
