Regular Expression Python3---why

I have written the following Python code:
import re
def get_items():
    text = '''
Quantitative Finance
Statistics
General information
Support and Governance Model
Find
'''
    pattern = re.compile(r'(.*?)', re.S)
    items = re.match(pattern, text).group(1)
    print(items)
get_items()
But it doesn't work. Why?
The regular expression is as follows:
pattern = re.compile(r'(.*?)', re.S)

Your regex is correct, but you are using the wrong calls to iterate over the matches. See the corrected version below, which uses pattern.finditer(text) and match.group(1).
import re
def get_items():
    text = '''
Quantitative Finance
Statistics
General information
Support and Governance Model
Find
'''
    pattern = re.compile(r'(.*?)', re.S)
    for match in pattern.finditer(text):
        yield match.group(1)
for item in get_items():
    print(item)
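To make the difference between the two calls concrete, here is a minimal sketch. The <li> tags in the sample text and in the pattern are an assumption made for illustration only, and ITEM, first_item, and all_items are hypothetical names. re.match only tries to match at the very start of the string, while finditer scans the whole string and yields every match.

```python
import re

# Hypothetical sample: the <li> markup is assumed for illustration only.
TEXT = """
<li>Quantitative Finance</li>
<li>Statistics</li>
<li>Find</li>
"""

ITEM = re.compile(r"<li>(.*?)</li>", re.S)


def first_item(text):
    # match() is anchored at position 0; the text starts with a newline,
    # so there is no match and the result is None.
    m = ITEM.match(text)
    return m.group(1) if m else None


def all_items(text):
    # finditer() walks the whole string and yields one match per item.
    return [m.group(1) for m in ITEM.finditer(text)]


if __name__ == "__main__":
    print(first_item(TEXT))
    print(all_items(TEXT))
```

Checking the result of match() for None before calling group() avoids an AttributeError when nothing matches.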

Related

PostgreSQL Full Text Search Accuracy

Is there any way to improve the accuracy of Full Text Search on Postgres? I'm using it with Django and a simple search for invest doesn't return results with the word investor. I assume this is because the stemming algorithm is returning invest* and investor as two different stems.
def get_queryset(self):
    query_string = self.request.GET.get('q')
    vector = SearchVector('description', weight='A') + SearchVector('location', weight='A') + SearchVector('name', weight='A')
    query = SearchQuery(query_string)
    return PeopleSnapshot.objects.annotate(rank=SearchRank(vector, query)).order_by('-rank')

For your particular example a "synonym dictionary" should help.
There are also more sophisticated "thesaurus dictionaries" and you can customise the actual stemming by changing the "ispell dictionary". Both mentioned on that same page.

I assume that you are using the english text search configuration.
investor is not reduced to invest by the stemming algorithm:
SELECT to_tsvector('english', 'investor');
to_tsvector
--------------
'investor':1
(1 row)
If you want a prefix match, you'll have to do it like this:
SELECT to_tsvector('english', 'investor')
@@ to_tsquery('english', 'invest:*');
?column?
----------
t
(1 row)
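On the Django side, one way to get the same prefix behaviour is to build the tsquery string yourself. The sketch below is only a rough idea under stated assumptions: to_prefix_tsquery is a hypothetical helper, and the commented usage assumes a Django version whose SearchQuery accepts search_type="raw" (2.2 or later).

```python
import re


def to_prefix_tsquery(user_input):
    """Turn free text into a tsquery string where every word is a prefix match."""
    # Keep only word characters so tsquery operators typed by the user are dropped.
    words = re.findall(r"\w+", user_input)
    # "invest fund" -> "invest:* & fund:*"
    return " & ".join(f"{word}:*" for word in words)


# Possible use inside get_queryset (assumption, not tested against a database):
#   raw = to_prefix_tsquery(query_string)
#   query = SearchQuery(raw, search_type="raw", config="english")

if __name__ == "__main__":
    print(to_prefix_tsquery("invest"))
```

An empty result should be handled before querying, since an empty tsquery string is not useful to search with.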

How to extract files with date pattern using python

I have n-files in a folder like
source_dir
abc_2017-07-01.tar
abc_2017-07-02.tar
abc_2017-07-03.tar
pqr_2017-07-02.tar
Let's consider a single pattern for now, 'abc'
(I get this pattern from a database at random, so I need two filters: one for the pattern and one for the last day.)
I want to extract the file for the last day, i.e. '2017-07-02'.
So far I can get all files for the pattern, but not only the last day's file.
Code:
pattern = 'abc'
allfiles=os.listdir(source_dir)
m_files=[f for f in allfiles if str(f).startswith(pattern)]
print m_files
output:
[ 'abc_2017-07-01.tar' , 'abc_2017-07-02.tar' , 'abc_2017-07-03.tar' ]
This gives me all files related to the abc pattern, but how can I filter out only the last day's file for that pattern?
Expected:
[ 'abc_2017-07-02.tar' ]
Thanks

Just a minor tweak to your code gets you the desired result.
import os
from datetime import datetime, timedelta
allfiles=os.listdir(source_dir)
file_date = datetime.now() + timedelta(days=-1)
pattern = 'abc_' +str(file_date.date())
m_files=[f for f in allfiles if str(f).startswith(pattern)]
Hope this helps!

latest = max(m_files, key=lambda x: x[-14:-4])
will find the filename with latest date among filenames in m_files.

Use Python's re module, like this:
import re
import os
files = os.listdir(source_dir)
for file in files:
    match = re.search(r'abc_2017-07-(\d{2})\.tar', file)
    day = match.group(1)
Then you can work with day inside the loop to do whatever you want, for example to build that list:
import re
import os
def extract_day(name):
    # Search the name that was passed in, not a variable from the caller.
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    day = match.group(1)
    return day
files = os.listdir(source_dir)
days = [extract_day(file) for file in files]
If the month is also variable, you can substitute '07' with '\d\d' or '\d{2}'. Be careful if you have files that don't match the pattern at all: match will then be None, so match.group() will raise an error. In that case use:
def extract_day(name):
    match = re.search(r'abc_2017-07-(\d{2})\.tar', name)
    try:
        day = match.group(1)
    except AttributeError:
        # re.search returned None because the name did not match.
        day = None
    return day
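Putting the two filters together (prefix and date), a minimal sketch could look like this. NAME_RE and files_for_day are hypothetical names, and the sketch assumes every file of interest is named <prefix>_<YYYY-MM-DD>.tar as in the question.

```python
import re
from datetime import date

# Assumed naming scheme: <prefix>_<YYYY-MM-DD>.tar
NAME_RE = re.compile(r"^(?P<prefix>.+)_(?P<day>\d{4}-\d{2}-\d{2})\.tar$")


def files_for_day(names, prefix, day):
    """Return the names that have the given prefix and the given date."""
    wanted = day.isoformat()  # e.g. '2017-07-02'
    matched = []
    for name in names:
        m = NAME_RE.match(name)
        if m is None:
            continue  # the name does not follow the scheme at all
        if m.group("prefix") == prefix and m.group("day") == wanted:
            matched.append(name)
    return matched


if __name__ == "__main__":
    names = [
        "abc_2017-07-01.tar",
        "abc_2017-07-02.tar",
        "abc_2017-07-03.tar",
        "pqr_2017-07-02.tar",
    ]
    print(files_for_day(names, "abc", date(2017, 7, 2)))
```

For "yesterday", the day argument could be date.today() - timedelta(days=1), and names would come from os.listdir(source_dir).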

part of a string contained in another string regex python

Is there a way to check if any part of a string matches another string in Python?
For example, I have URLs which look like this:
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
and I have strings which look like:
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
string = '|'.join(string_list)
I would like to match string with url.
Anastasia Beverly Hills with www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA and
www.ulta.com/beautyservices/benefitbrowbar/ with Benefit Cosmetics.
I've been trying url['urls'].str.contains('('+string+')', case = False), but this does not match.
What's the correct way to do this?

I can't do it as a regex in one line, but here is my attempt using itertools and any:
import pandas as pd
from itertools import product
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
# For each pair in the Cartesian product (the different combinations)
# of string_list and urls.
for x in list(product(string_list, url['urls'])):
    # If any of the words in the string (x[0]) are present in
    # the URL (x[1]), disregarding case.
    if any(word.lower() in x[1].lower() for word in x[0].split()):
        # Show the match.
        print("Match String: %s URL: %s" % (x[0], x[1]))
Outputs:
Match String: Benefit Cosmetics URL: www.ulta.com/beautyservices/benefitbrowbar/
Match String: Anastasia Beverly Hills URL: www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA
Updated:
Given the way you were approaching it, you could alternatively use:
import pandas as pd
import warnings
pd.set_option('display.width', 100)
# Suppress the warning it will give on a match.
warnings.filterwarnings("ignore", 'This pattern has match groups')
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
# Create a pandas DataFrame.
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
# Using one string at a time.
for string in string_list:
    # Get the individual words in the string and concatenate them
    # using a pipe to create a regex pattern.
    s = "|".join(string.split())
    # Update the DataFrame with True or False where the regex
    # matches the URL.
    url[string] = url['urls'].str.contains('('+s+')', case = False)
# Show the result.
print(url)
which would output:
   urls                                               Benefit Cosmetics  Anastasia Beverly Hills
0  www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00...  False              True
1  www.ulta.com/beautyservices/benefitbrowbar/        True               False
Which, I guess, may be better if you want the result in a DataFrame, but I prefer the first way.
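The same word-level idea can also be sketched with the standard re module alone, without pandas. The full phrase "Benefit Cosmetics" never appears in a URL, so the pattern has to be an alternation of the individual words. brand_pattern and match_brands are hypothetical names; re.escape is used so that any special characters in a brand name are treated literally.

```python
import re


def brand_pattern(brand):
    """Compile a case-insensitive pattern matching any single word of the brand."""
    words = [re.escape(word) for word in brand.split()]
    if not words:
        return None  # an empty brand would otherwise match every URL
    return re.compile("|".join(words), re.IGNORECASE)


def match_brands(brands, urls):
    """Map each URL to the list of brands that share at least one word with it."""
    patterns = {brand: brand_pattern(brand) for brand in brands}
    result = {}
    for url in urls:
        result[url] = [
            brand
            for brand, pattern in patterns.items()
            if pattern is not None and pattern.search(url)
        ]
    return result


if __name__ == "__main__":
    urls = [
        "www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA",
        "www.ulta.com/beautyservices/benefitbrowbar/",
    ]
    brands = ["Benefit Cosmetics", "Anastasia Beverly Hills"]
    for url, found in match_brands(brands, urls).items():
        print(url, found)
```

Note that matching on single words can produce false positives for short or common words, so a minimum word length may be worth adding.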

How to remove unwanted items from a parse file

from googlefinance import getQuotes
import json
import time as t
import re
List = ["A","AA","AAB"]
Time=t.localtime() # Sets variable Time to retrieve date/time info
Date2= ('%d-%d-%d %dh:%dm:%dsec'%(Time[0],Time[1],Time[2],Time[3],Time[4],Time[5])) #formats time stamp
while True:
    for i in List:
        try: #allows elements to be called and if an error does the next step
            Data = json.dumps(getQuotes(i.lower()),indent=1) #retrieves Data from google finance
            regex = ('"LastTradePrice": "(.+?)",') #sets parse
            pattern = re.compile(regex) #compiles parse
            price = re.findall(pattern,Data) #retrieves parse
            print(i)
            print(price)
        except: #sets Error coding
            Error = (i + ' Failed to load on: ' + Date2)
            print (Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.

Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110

Try using the type() function to find out the data type, in your case type(price).
If the data type is list, use print(price[0])
and you will get the output (number). For the brackets, you need to check the Google data and the regex.
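The brackets and quotes appear because re.findall always returns a list of strings, even when there is only one match. A minimal sketch of taking the first element safely follows; the sample string is made up for illustration and first_price is a hypothetical helper, not part of googlefinance.

```python
import re

# Same pattern as in the question.
PRICE_RE = re.compile(r'"LastTradePrice": "(.+?)",')


def first_price(data):
    """Return the first captured price as a string, or None if nothing matched."""
    prices = PRICE_RE.findall(data)  # always a list, possibly empty
    return prices[0] if prices else None


if __name__ == "__main__":
    # Made-up sample resembling the JSON text built in the question.
    sample = '{"LastTradePrice": "42.14", "StockSymbol": "A"}'
    print(first_price(sample))
```

Returning None for an empty list avoids the IndexError that price[0] would raise when a quote fails to load.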

extract substring using regex in groovy

If I have the following pattern in some text:
def articleContent = "<![CDATA[ Hellow World ]]>"
I would like to extract the "Hellow World" part, so I use the following code to match it:
def contentRegex = "<![CDATA[ /(.)*/ ]]>"
def contentMatcher = ( articleContent =~ contentRegex )
println contentMatcher[0]
However, I keep getting a null pointer exception because the regex doesn't seem to be working. What would be the correct regex for "any piece of text", and how do I collect it from a string?

Try:
def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[ 0 ][ 1 ]
However, I worry that you are planning to parse XML with regular expressions. If this CDATA is part of a larger valid XML document, it is better to use an XML parser.
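The key point is that the square brackets must be escaped, otherwise they start a character class. For comparison, the same escaping is sketched here in Python rather than Groovy; CDATA_RE and cdata_text are hypothetical names.

```python
import re

# "[" and "]" are escaped so they are matched literally.
CDATA_RE = re.compile(r"<!\[CDATA\[(.+?)\]\]>", re.S)


def cdata_text(fragment):
    """Return the text inside the first CDATA section, or None if there is none."""
    m = CDATA_RE.search(fragment)
    return m.group(1).strip() if m else None


if __name__ == "__main__":
    print(cdata_text("<![CDATA[ Hellow World ]]>"))
```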

The code below shows substring extraction using a regex in Groovy:
class StringHelper {
    @NonCPS
    static String stripSshPrefix(String gitUrl){
        def match = (gitUrl =~ /ssh:\/\/(.+)/)
        if (match.find()) {
            return match.group(1)
        }
        return gitUrl
    }
    static void main(String... args) {
        def gitUrl = "ssh://git@github.com:jiahut/boot.git"
        def gitUrl2 = "git@github.com:jiahut/boot.git"
        println(stripSshPrefix(gitUrl))
        println(stripSshPrefix(gitUrl2))
    }
}

A little bit late to the party, but try using a slashy string (/.../) when defining your pattern, for example:
def articleContent = "real groovy"
def matches = (articleContent =~ /gr\w{4}/) //grabs 'gr' and its following 4 chars
def firstmatch = matches[0] //firstmatch would be 'groovy'
You were on the right track; it was just the pattern definition that needed to be altered.
References:
https://www.regular-expressions.info/groovy.html
http://mrhaki.blogspot.com/2009/09/groovy-goodness-matchers-for-regular.html

One more single-line solution, in addition to tim_yates's:
def result = articleContent.replaceAll(/<!\[CDATA\[(.+)]]>/,/$1/)
Please take into account that if the regexp doesn't match, result will be equal to the source string, unlike
def result = (articleContent =~ /<!\[CDATA\[(.+)]]>/)[0][1]
which will raise an exception in that case.

In my case, the actual string was multi-line, like below:
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Start Date : 2020-11-25 00:00:00
I wanted to extract the Start Date value from this string, so here is what my script looks like:
def matches = (originalData =~ /(?<=Actual Start Date :).*/)
def extractedData = matches[0]
This regex extracts the content of each line that has a prefix matching Start Date :
In my case, the result is 2020-11-25 00:00:00
Note: if your originalData is a multi-line string, in Groovy you can include it as follows:
def originalData = """
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Volume : 11
Page URL : null
Commitment : 1200.00
Start Date : 2020-11-25 00:00:00
"""
This script looks simple, but it took me a good while to figure out a few things, so I'm posting it here.
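For comparison, the lookbehind idea can be sketched in Python rather than Groovy. This sketch assumes the prefix is exactly "Start Date :" as it appears in the sample data; START_DATE_RE and start_dates are hypothetical names. The dot does not match a newline by default, so each match stops at the end of its line.

```python
import re

# Lookbehind: match whatever follows "Start Date :" up to the end of the line.
START_DATE_RE = re.compile(r"(?<=Start Date :).*")


def start_dates(data):
    """Return every Start Date value found in the text, in order of appearance."""
    return [value.strip() for value in START_DATE_RE.findall(data)]


if __name__ == "__main__":
    original_data = """
ID : AB-223
Product : Standard Profile
Start Date : 2020-11-19 00:00:00
Subscription : Annual
Start Date : 2020-11-25 00:00:00
"""
    print(start_dates(original_data))
```

Collecting all matches makes it explicit which occurrence is being used when the same label appears more than once.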
