Find Phone Numbers - regex

Looking to find phone numbers from multiple sites, so each site more than likely has them in different sections/classes/formats, etc. I am having a hard time finding phone numbers using regex or classes containing them, so any help is appreciated. My code is:
def parse1(self, response):
    hxs = Selector(response)
    titles = hxs.xpath('/html/body')
    items = []
    for title in titles:
        item = GenericCrawlerItem()
        # Python's re has no /.../gm delimiter syntax; flags are passed as arguments,
        # and Scrapy's response.text is the decoded page (response.body is bytes)
        item["phone"] = re.findall(r'\s*(?:\+?(\d{1,3}))?([-. (]*(\d{3})[-. )]*)?((\d{3})[-. ]*(\d{2,4})(?:[-.x ]*(\d+))?)\s*', response.text)
        # note: this next line overwrites the findall result above
        item["phone"] = title.xpath('//div[contains(text(), "tel")]/text()').extract()
        items.append(item)
    return items
Thanks!
Edit: the formats I'm looking for will mainly be standard ones, I suspect, such as:
(xxx)xxx-xxxx
xxx)xxx-xxxx
xxx.xxx.xxxx
xxx xxx xxxx
x(xxx)xxx-xxxx
x(xxx)xxx.xxxx
x.xxx.xxx.xxxx
+x(xxx)xxx-xxxx
+x.xxx.xxx.xxxx
Even if a pattern doesn't cover every one of these, matching a couple of them would be super helpful!

The regex:
(\d\.?|\+\d\.?)?\(?\d{3}(\.| |-|\))\d{3}(\.| |-)\d{4}
...will match all of your examples.
If you would like clarification on any part, or if it doesn't work for you, leave a comment and we can try to figure it out. A common reason a pattern like this fails is that something isn't escaped properly (I developed this regex in Sublime Text, not Python, and Python may require some additional escaping here and there), or that your regex engine differs from mine. For example, not all regex engines support the \d metacharacter for the digits 0-9, and not all engines support {n} to denote a specific number of repetitions.
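As a quick sanity check, the pattern can be tried directly in Python (a sketch; the sample numbers below are made up to match the asker's formats):

```python
import re

pattern = re.compile(r'(\d\.?|\+\d\.?)?\(?\d{3}(\.| |-|\))\d{3}(\.| |-)\d{4}')

samples = [
    "(555)123-4567",
    "555.123.4567",
    "1(555)123-4567",
    "+1.555.123.4567",
]
for s in samples:
    print(s, bool(pattern.search(s)))  # each prints True
```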

I found a good-enough answer. It matches results such as:
xxx.xxx.xxxx
or
xxx-xxx-xxxx
def parse1(self, response):
    hxs = Selector(response)
    titles = hxs.xpath('/html/body')
    items = []
    for title in titles:
        item = GenericCrawlerItem()
        item["email"] = re.findall(r'[\w.-]+@[\w.-]+', response.text)
        item["website"] = response.url
        item["links"] = title.xpath('//a/@href').extract()
        item["phone"] = re.findall(r'(\d{3}[-.()]\d{3}[-.]\d{4})', response.text)  # results such as xxx-xxx-xxxx or xxx.xxx.xxxx
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        items.append(item)
    return items
Stand-alone:
item["phone"] = re.findall(r'(\d{3}[-.()]\d{3}[-.]\d{4})', response.text)  # results such as xxx-xxx-xxxx or xxx.xxx.xxxx
Shout-out to everyone who helped!


Complicated QSortFilterProxyModel.setFilterRegExp. Is it possible to match within a match?

I'm looking for help on a tricky QRegExp that I'd like to pass to QSortFilterProxyModel.setFilterRegExp. I've been struggling to find a solution that handles my use case.
From the sample code below, I need to capture items with two underscores (_), but ONLY if they contain george or brian. I do not want items that have more or fewer than two underscores.
string_list = [
'john','paul','george','ringo','brian','carl','al','mike',
'john_paul','paul_george','john_ringo','george_ringo',
'john_paul_george','john_paul_brian','john_paul_ringo',
'john_paul_carl','paul_mike_brian','john_george_brian',
'george_ringo_brian','paul_george_ringo','john_george_ringo',
'john_paul_george_ringo','john_paul_george_ringo_brian','john_paul_george_ringo_brian_carl',
]
view = QListView()
model = QStringListModel(string_list)
proxy_model = QSortFilterProxyModel()
proxy_model.setSourceModel(model)
view.setModel(proxy_model)
view.show()
The first part (matching two underscores) can be accomplished with the line (simplified here, but really each token can be composed of any alphanumeric character other than _, so [a-zA-Z0-9]*):
proxy_model.setFilterRegExp('^[a-z]*_[a-z]*_[a-z]*$')
The second part can be accomplished independently with:
proxy_model.setFilterRegExp('george|brian')
To complicate matters, these additional criteria apply:
The list may grow to several thousand items.
The tokenization may reach up to 10 or so tokens.
The tokens can occur in any order (so george could occur at the beginning, middle, or end).
We may also want to capture georgeH and brianW35 when they occur, so long as they begin with george or brian.
We may have N names we're searching for (i.e. george|brian|jim|al), but only when they're in strings with two underscores.
To simplify them:
Lines will never begin or end with "_", and should only ever begin/end with [a-zA-Z0-9]
Do the QRegExp and QSortFilterProxyModel even have the capabilities I'm looking for, or will I need to resort to some other approach?
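For what it's worth, the two conditions can be combined into a single pattern with a lookahead. This is a sketch in plain Python re; whether it carries over to QRegExp depends on your Qt version's lookahead support, so treat it as an illustration rather than a guaranteed QSortFilterProxyModel solution:

```python
import re

# The lookahead enforces exactly two underscores; the remainder requires
# george or brian somewhere in the string.
pattern = re.compile(r'^(?=[^_]*_[^_]*_[^_]*$).*(?:george|brian)')

string_list = [
    'george', 'george_ringo', 'john_paul_george',
    'john_paul_ringo', 'john_paul_george_ringo',
]
print([s for s in string_list if pattern.search(s)])  # → ['john_paul_george']
```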
For very complex conditions a regex is not very useful; in that case it is better to override the filterAcceptsRow method, where you can implement the filter function, as shown in the following trivial example:
class FilterProxyModel(QSortFilterProxyModel):
    _words = None
    _number_of_underscore = -1

    def filterAcceptsRow(self, source_row, source_parent):
        text = self.sourceModel().index(source_row, 0, source_parent).data()
        if not self._words or self._number_of_underscore < 0:
            return True
        return (
            any(word in text for word in self._words)
            and text.count("_") == self._number_of_underscore
        )

    @property
    def words(self):
        return self._words

    @words.setter
    def words(self, words):
        self._words = words
        self.invalidateFilter()

    @property
    def number_of_underscore(self):
        return self._number_of_underscore

    @number_of_underscore.setter
    def number_of_underscore(self, number):
        self._number_of_underscore = number
        self.invalidateFilter()
view = QListView()
model = QStringListModel(string_list)
proxy_model = FilterProxyModel()
proxy_model.setSourceModel(model)
view.setModel(proxy_model)
view.show()
proxy_model.number_of_underscore = 2
proxy_model.words = (
    "george",
    "brian",
)

Pandas: Grouping rows by list in CSV file?

In an effort to make our budgeting life a bit easier and to help myself learn, I am creating a small program in Python that takes data from our exported bank CSV.
I will give you an example of what I want to do with this data. Say I want to group all of my fast-food expenses together. There are many different names with different totals in the description column, but I want to see it all tabulated as one "Fast Food" expense.
For instance, the CSV is set up like this:
Date     Description            Debit  Credit
1/20/20  POS PIN BLAH BLAH ###  1.75   NaN
I figured out how to group them with an or statement:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
I ultimately would like to have it read off of a list. I would like to group all my expenses into categories, check those category variable names, and only output matches from that list.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also, I have looked through quite a few posts here on Stack and have yet to find the answer (although I am sure I overlooked it).
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
df = pd.DataFrame({"description": ['Macdonald something', 'Whataburger something', 'pizza hut something',
                                   'Whataburger something', 'Macdonald something', 'Macdonald otherthing'],
                   "debit": [1.75, 2.0, 3.5, 4.5, 1.5, 2.0]})

fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})', flags=re.I)
print(df.groupby("found").sum())
Output:
             debit
found
Macdonald     5.25
Whataburger   6.50
pizza hut     3.50
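Extending the same idea to several categories at once, one possible sketch (the category names, keywords, and amounts below are made up for illustration) loops over a dict of keyword lists and labels each row before grouping:

```python
import re
import pandas as pd

# Hypothetical category map; add more categories and keywords as needed.
categories = {
    "Fast Food": ["Macdonald", "Whataburger", "pizza hut"],
    "Groceries": ["Kroger", "HEB"],
}

df = pd.DataFrame({
    "Description": ["Macdonald #123", "Kroger 42", "Whataburger", "HEB store"],
    "Debit": [1.75, 20.0, 4.5, 33.0],
})

for name, keywords in categories.items():
    # re.escape keeps keywords literal; re.I ignores case
    pattern = "|".join(map(re.escape, keywords))
    df.loc[df["Description"].str.contains(pattern, flags=re.I), "Category"] = name

print(df.groupby("Category")["Debit"].sum())
```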
Use dynamic pattern building:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags = re.I, regex = True)]
The \b word boundaries make the pattern match whole words, not partial words.
re.escape protects special characters so they are parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex

regex year format authentication

I have a program where the user is asked for the session year which needs to be in the form of 20XX-20XX. The constraint here is that it needs to be a year followed by its next year. Eg. 2019-2020.
For example,
Valid formats:
2019-2020
2018-2019
2000-2001
Invalid formats:
2019-2021
2000-2000
2019-2018
I am trying to validate this input using regular expressions.
My work:
import re

def add_pages(matchObject):
    return "{0:0=3d}".format(int(matchObject) + 1)

try:
    a = input("Enter Session")
    p = r'2([0-9]{3})-2'
    p1 = re.compile(p)
    x = add_pages(p1.findall(a)[0])
    p2 = r'2([0-9]{3})-2' + x
    p3 = re.compile(p2)
    l = p3.findall(a)
    if not l:
        raise Exception
    else:
        print("Authenticated")
except Exception as e:
    print("Enter session. Eg. 2019-2020")
Question:
So far I have not been able to write a single regex that validates this input. I did have a look at backreferencing in regex, but it only solved half my query. I am looking for ways to improve this validation process. Is there any single regex statement that will check for this constraint? Let me know if you need any more information.
Do you really need to get the session year in one input?
I think it's better to have two inputs (or just automatically set the second year to the first year + 1).
I don't know if you're aiming for something bigger and this is just an example, but using regex just doesn't seem appropriate for this task to me.
For example, you could do this:
print("Enter session year")
first_year = int(input("First year: "))
second_year = int(input("Second year: "))
if second_year != (first_year + 1):
    pass  # some validation
else:
    pass  # program continues
First of all, why regex? Regex is terrible at math. It would be easier to do something like:
def check_years(string):
    # e.g. string = "2011-2012"
    years = string.split("-")
    return int(years[0]) == int(years[1]) - 1
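Combining both suggestions into one sketch: a single pattern can validate the 20XX-20XX shape, but the consecutive-year rule still needs ordinary arithmetic, since regular expressions cannot add numbers:

```python
import re

def valid_session(s):
    # One fullmatch checks the shape; the +1 relationship is plain arithmetic.
    m = re.fullmatch(r'(20\d{2})-(20\d{2})', s)
    return bool(m) and int(m.group(2)) == int(m.group(1)) + 1

print(valid_session("2019-2020"))  # → True
print(valid_session("2019-2021"))  # → False
print(valid_session("2000-2000"))  # → False
```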

Possible to combine two lines of code into one

Searching through a database looking for matches. I need to log the matches as well as the entries that don't match, so I have the full database; for those that match, I specifically need to know the part that matches.
serv = ['6:00am-9:00pm', 'Unavailable', '7:00am-10:00pm', '8:00am-9:00pm', 'Closed']

if self.serv[data] == 'Today':
    clotime.append('')
elif self.serv[data] == 'Tomorrow':
    clotime.append('')
elif self.serv[data] == 'Yesterday':
    clotime.append('')
else:
    clo = re.findall('-(.*?):', self.serv[data])
    clotime.append(clo[0])
The bulk of the data ends up running through re.findall, but some is still left for the initial if/elif checks.
Is there a way to condense this code and do it all with re.findall, maybe even with just one line? I need everything (the entire database) gone through and logged so I can process the database correctly when I go to display the data on a map.
Using anchors you can match a whole string:
clo = re.search('^(?:To(?:day|morrow)|Yesterday)$|-(.*?):', self.serv[data])
if clo is not None:
    # group(1) is None when Today/Tomorrow/Yesterday matched, so fall back to ''
    clotime.append(clo.group(1) or '')
With your example list:
serv = ['6:00am-9:00pm', 'Unavailable', '7:00am-10:00pm', '8:00am-9:00pm', 'Closed']
clotime = []
for data in serv:
    clo = re.search('^(?:To(?:day|morrow)|Yesterday)$|-(.*?):', data)
    if clo is not None:
        clotime.append(clo.group(1) or '')
print(clotime)
I would try something like this:
clo = re.findall(r'-(\d+):', self.serv[data])
clotime.append(clo[0] if clo else '')
If I understood your existing code correctly, you want to append an empty string in the cases where a closing hour couldn't be found in the string. This example extracts the closing hour but uses an empty string whenever the regex doesn't match anything.
Also, if you're only matching digits, it's better to be explicit about that.

re pulls data from one tag and not the other

I am trying to get a program to work that parses HTML-like tags; it's for a TREC collection. I don't program often, except for databases, and I am getting stuck on syntax. Here's my current code:
# Following code: re p worked in Python
def parseTREC(atext):
    atext = open(atext, "r")
    filePath = "testLA.txt"
    docID = []
    docTXT = []
    p = re.compile('<DOCNO>(.*?)</DOCNO>', re.IGNORECASE)
    m = re.compile('<P>(.*?)</P>', re.IGNORECASE)
    for aline in atext:
        values = str(aline)
        if p.findall(values):
            docID.append(p.findall(values))
        if m.findall(values):
            docID.append(p.findall(values))
    print docID
    atext.close()

parseTREC('LA010189.txt')
The p regex pulled the DOCNO as it was supposed to. The m regex, though, would not pull data and would print an empty list. I'm pretty sure there are white spaces and also a newline involved. I tried re.M and that did not help pull the data from the other lines. Ideally I would like to get to the point where I store a dictionary {DOCNO, Count}. Count would be determined by summing up every word that is in the P tags and also in a list []. I would appreciate any suggestions or advice.
You can try removing all the line breaks from the file if you think that is impacting your regex results. Also, make sure you don't have nested <P> tags because your regex may not match as expected. For example:
<p>
<p>
<p>here's some data</p>
And some more data.
</p>
And even more data.
</p>
will capture this section because of the "?":
<p>
<p>here's some data</p>
And some more data.
Also, is this a typo:
if p.findall(values):
    docID.append(p.findall(values))
if m.findall(values):
    docID.append(p.findall(values))
Should that be:
docID.append(m.findall(values))
on the last line?
Add the re.DOTALL flag like so:
m = re.compile('<P>(.*?)</P>', re.IGNORECASE | re.DOTALL)
You may want to add this to the other regex as well.
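A small sketch of what re.DOTALL changes (the sample text is made up): without it, '.' does not match newlines, so a <P> element that spans lines never matches.

```python
import re

text = "<P>line one\nline two</P>"
without_dotall = re.findall(r'<P>(.*?)</P>', text, re.IGNORECASE)
with_dotall = re.findall(r'<P>(.*?)</P>', text, re.IGNORECASE | re.DOTALL)
print(without_dotall)  # → []
print(with_dotall)     # → ['line one\nline two']
```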
from xml.dom.minidom import *
import re

def parseTREC2(atext):
    fc = open(atext, 'r').read()
    fc = '<DOCS>\n' + fc + '\n</DOCS>'
    dom = parseString(fc)
    w_re = re.compile('[a-z]+', re.IGNORECASE)
    doc_nodes = dom.getElementsByTagName('DOC')
    for doc_node in doc_nodes:
        docno = doc_node.getElementsByTagName('DOCNO')[0].firstChild.data
        cnt = 1
        for p_node in doc_node.getElementsByTagName('P'):
            p = p_node.firstChild.data
            words = w_re.findall(p)
            print "\t".join([docno, str(cnt), p])
            print words
            cnt += 1

parseTREC2('LA010189.txt')
The code wraps the file in a DOCS root tag because the collection has no single parent tag; the program then retrieves the information through the XML parser.