I am looking to automate ingesting some data but the format of the time I've scraped for each row is different to what I need.
The data I scrape comes like this
9a.m.–5p.m.
Open 24 hours
It should look like the below:
09:00AM,05:00PM
{empty}
So far I have the code below, which only works for the second case. I'm struggling with the time reformatting and with doing both in the same regex.
[str_replace(array("Open 24 Hours"),array(""),{import_element[1]})]
I would appreciate any help.
Thanks
Assuming your data has just hours (not minutes in the input string), this should work:
import re

def process_match(m):
    # m looks like "9a.m." or "12p.m."; the last four characters are the marker
    n = m[:-4]
    am_pm = m[-4:].replace(".", "").upper()
    # zero-pad the hour to two digits
    return f"{n.zfill(2)}:00{am_pm}"

inpt = "9a.m.–5p.m."
# escape the dots and allow one- or two-digit hours
matches = re.findall(r"\d{1,2}[ap]\.m\.", inpt)
print(f"{process_match(matches[0])},{process_match(matches[1])}")
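If some rows carry minutes, or say "Open 24 hours", a slightly more general version might look like the sketch below. The function name reformat_hours and the "10:30a.m." sample are my own assumptions, not something from the question, and rows without any recognisable time simply come back as an empty string.

```python
import re

# Hour, optional ":minutes", then "a.m." or "p.m." (dots optional)
TIME_RE = re.compile(r"(\d{1,2})(?::(\d{2}))?\s*([ap])\.?m\.?", re.IGNORECASE)

def reformat_hours(raw):
    """Return 'HH:MMAM,HH:MMPM' for a time range, or '' if no times are found."""
    parts = [
        f"{int(hour):02d}:{minutes or '00'}{half.upper()}M"
        for hour, minutes, half in TIME_RE.findall(raw)
    ]
    return ",".join(parts)

print(reformat_hours("9a.m.–5p.m."))           # 09:00AM,05:00PM
print(reformat_hours("10:30a.m.–12p.m."))      # 10:30AM,12:00PM (hypothetical input)
print(repr(reformat_hours("Open 24 hours")))   # ''
```

Because "Open 24 hours" contains no a.m./p.m. marker, it needs no separate replace step.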
Here's the code:
# Scrape table data
alltable = driver.find_elements_by_id("song-table")
date = date.today()
simple_year_list = []
complex_year_list = []
dateformat1 = re.compile(r"\d\d\d\d")
dateformat2 = re.compile(r"\d\d\d\d-\d\d-\d\d")
for term in alltable:
    simple_year = dateformat1.findall(term.text)
    for year in simple_year:
        if 1800 < int(year) < date.year:  # Year can't be above what the current year is or below 1800,
            simple_year_list.append(simple_year)  # Might have to be changed if you have a song from before 1800
        else:
            continue
    complex_year = dateformat2.findall(term.text)
    complex_year_list.append(complex_year)
The code uses regular expressions to find four consecutive digits. Since there are multiple 4 digit numbers, I want to narrow it down to between 1800 and 2021 since that's a reasonable time frame. simple_year_list, however, prints out numbers that don't follow the conditions.
You aren't saving the right value here:
simple_year_list.append(simple_year)
You should be saving the year:
simple_year_list.append(year)
I would need more information to help further though. Maybe give us a sample of the data you're working through, and the output you're seeing?
You can do it all in regex.
Add start ^ and end $ anchors, and range restriction via pattern:
dateformat1 = re.compile(r"^(1[89]\d\d|20([01]\d|2[01]))$")
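One caveat: ^ and $ only fit when the string holds nothing but the year. For findall over longer text, such as a table cell, a variant with \b word boundaries and non-capturing groups may be closer to what you need. This is only a sketch; the sample text is made up, and the pattern covers 1800 through 2021 inclusive.

```python
import re

# Years 1800-2021 as standalone tokens; (?:...) keeps findall returning plain strings
YEAR_RE = re.compile(r"\b(?:1[89]\d\d|20(?:[01]\d|2[01]))\b")

# Hypothetical text standing in for term.text
sample_text = "Released 1999, remastered 2015, catalog no. 4711, track 2022"
years = YEAR_RE.findall(sample_text)
print(years)  # ['1999', '2015']
```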
I have been searching for an answer to this, but cannot seem to get what I need. I would like a Python script that reads my text file from the top, works its way through each line, and then prints all the matches to another text file. The text file contains only 4-digit numbers such as 1234.
example
1234
3214
4567
8963
1532
1234
...and so on.
I would like the output to be something like:
1234 : matches found = 2
I know that there are matches in the file because it has almost 10,000 lines. I appreciate any help. If someone could just point me in the right direction, that would be great. Thank you.
import re

# read the whole file, closing it automatically
with open("filename", "r") as file:
    fileContent = file.read()
pattern = "1234"
print(len(re.findall(pattern, fileContent)))
If I were you, I would open the file, use the split method to create a list of all the numbers, and use Counter from collections to count how many times each number occurs, keeping the ones that are duplicates.
from collections import Counter

filepath = 'original_file'
new_filepath = 'new_file'
with open(filepath, 'r') as file:
    text = file.read()
# split() with no argument also drops blank lines and stray whitespace
numbers_list = text.split()
dupes = [[item, ': matches found =', str(count)] for item, count in Counter(numbers_list).items() if count > 1]
dupes = [' '.join(i) for i in dupes]
with open(new_filepath, 'w') as new_file:
    for i in dupes:
        new_file.write(i + '\n')  # one result per line
Thanks to everyone who helped me on this. Thank you to @csabinho for the code he provided and to @IanAuld for asking me "Why do you think you need recursion here?" It got me thinking that the solution was a simple one. I just wanted to know which 4-digit numbers had duplicates and how many, and also which 4-digit combos were unique. So this is what I came up with, and it worked beautifully!
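import re

# read the file once instead of reopening it for every candidate number
with open("4digits.txt", 'r') as file:
    fileContent = file.read()

for a in range(1000, 10000):
    pattern = str(a)
    result = len(re.findall(pattern, fileContent))
    if result > 1:
        print(a, "matches", result)
    elif result == 1:
        # numbers that never occur in the file are skipped
        print(a, "This number is unique!")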
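For comparison, the same report can probably be produced in one pass over the data rather than by scanning the text once per candidate number. The sketch below uses collections.Counter on an in-memory sample; the helper name summarize is just an illustration, and in practice the text would come from reading 4digits.txt.

```python
from collections import Counter

def summarize(text):
    """Split whitespace-separated numbers into duplicates (with counts) and uniques."""
    counts = Counter(text.split())
    duplicates = {num: n for num, n in counts.items() if n > 1}
    uniques = [num for num, n in counts.items() if n == 1]
    return duplicates, uniques

sample = "1234\n3214\n4567\n8963\n1532\n1234\n"  # sample lines from the question
duplicates, uniques = summarize(sample)
for num, n in duplicates.items():
    print(f"{num} : matches found = {n}")
print("unique:", ", ".join(uniques))
```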
I am trying to change all date values in a spreadsheet's Date column where the year is earlier than 1900, to today's date, so I have a slice.
EDIT: previous lines of code:
df=pd.read_excel(filename)#,usecols=['NAME','DATE','EMAIL']
#regex to remove weird characters
df['DATE'] = df['DATE'].str.replace(r'[^a-zA-Z0-9\._/-]', '', regex=True)
df['DATE'] = pd.to_datetime(df['DATE'])
sample row in dataframe: name, date, email
[u'Public, Jane Q.\xa0' u'01/01/2016\xa0' u'jqpublic#email.com\xa0']
This line of code works.
df["DATE"][df["DATE"].dt.year < 1900] = dt.datetime.today()
Then, all date values are formatted:
df["DATE"] = df["DATE"].map(lambda x: x.strftime("%m/%d/%y"))
But I get an error:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I have read the documentation and other posts, where using .loc is suggested.
The following is the recommended solution:
df.loc[row_indexer,col_indexer] = value
but df["DATE"].loc[df["DATE"].dt.year < 1900] = dt.datetime.today() gives me the same error, except that the line number is actually the line number after the last line in the script.
I just don't understand what the documentation is trying to tell me as it relates to my example.
I started messing around with pulling out the slice and assigning to a separate dataframe, but then I'm going to have to bring them together again.
You are using chained indexing: df["DATE"] selects the column, and the selector [df["DATE"].dt.year < 1900] then indexes that intermediate result, so the assignment may land on a temporary copy rather than on df itself.
df["DATE"][df["DATE"].dt.year < 1900] is the chained expression that pandas is complaining about.
Fix it with loc like this:
df.loc[df.DATE.dt.year < 1900, "DATE"] = pd.Timestamp.today()
My thought would be that you could do
df.loc[df.DATE.dt.year < 1900, "DATE"] = dt.datetime.today()
df.loc[:, "DATE"] = df.DATE.map(lambda x: x.strftime("%m/%d/%y"))
Not at a computer so I can't test but I think that should do it.
from googlefinance import getQuotes
import json
import time as t
import re

List = ["A","AA","AAB"]
Time=t.localtime() # Sets variable Time to retrieve date/time info
Date2= ('%d-%d-%d %dh:%dm:%dsec'%(Time[0],Time[1],Time[2],Time[3],Time[4],Time[5])) #formats time stamp
while True:
    for i in List:
        try: #allows elements to be called and if an error does the next step
            Data = json.dumps(getQuotes(i.lower()),indent=1) #retrieves Data from google finance
            regex = ('"LastTradePrice": "(.+?)",') #sets parse
            pattern = re.compile(regex) #compiles parse
            price = re.findall(pattern,Data) #retrieves parse
            print(i)
            print(price)
        except: #sets Error coding
            Error = (i + ' Failed to load on: ' + Date2)
            print (Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.
Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110
Use the type() function to check the data type; in your case, type(price).
If the data type is list, use print(price[0])
and you will get just the number. If brackets still appear, check the Google data and the regex.
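Another option might be to skip the regex and read the value straight from the parsed data. The sketch below assumes the dumped JSON is a list with one dictionary per symbol and a "LastTradePrice" key, as the regex in the question suggests. The payload is a made-up stand-in, not real output from getQuotes, and last_trade_price is a hypothetical helper.

```python
import json

# Hypothetical payload shaped like the dumped quote data in the question
data = '[{"StockSymbol": "A", "LastTradePrice": "42.14"}]'

def last_trade_price(payload):
    """Return the LastTradePrice of the first quote as a plain string."""
    quotes = json.loads(payload)
    return quotes[0]["LastTradePrice"]

print(last_trade_price(data))  # 42.14
```

Since the value is looked up by key, there is no list wrapper to strip before printing.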
I am using aggregate and Sum to determine the number of hours I have worked in each month. I have it working, but the "hours" variable always has extra content in it!
I should add, I am a newbie at django and I got most of this code from here (Django beginner: How to query in django ORM to calculate fields based on dates).
My code:
hours = ""
work_data = ""
month_data = ""
for month in range(1,13):
    entries_per_month = Mydata.objects.filter(myTimePeriod__month=month).filter(myResource="James")
    hours = str(entries_per_month.aggregate(value=Sum('myHoursLogged')))
    month_data = month_data + "'" + str(month) + "',"
    work_data = work_data + hours + ","
I look at the results of work_data:
work_data
This gives me a result of {'value': Decimal('136.80')},{'value': Decimal('146.40')},
I need it in the format: 136.80, 146.40 (this is the format required by the charting library). I have tried using str() to convert it, but it doesn't seem to work in this case.
str is not useful here, because aggregate() returns a dictionary, not a number. You just need to pull the value out of it by its key:
hours = str(entries_per_month.aggregate(value=Sum('myHoursLogged'))['value'])
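To illustrate the formatting step without a database, here is a small sketch using plain dictionaries shaped like the aggregate() results shown in the question. The None entry is my assumption about a month with no logged rows, and to_chart_series is a hypothetical helper name.

```python
from decimal import Decimal

# Stand-ins for the dictionaries that aggregate() produced in the question
monthly_totals = [
    {'value': Decimal('136.80')},
    {'value': Decimal('146.40')},
    {'value': None},  # assumed result for a month with no entries
]

def to_chart_series(totals):
    """Join the 'value' entries with commas, treating missing sums as 0."""
    return ", ".join(str(t['value'] if t['value'] is not None else 0) for t in totals)

print(to_chart_series(monthly_totals))  # 136.80, 146.40, 0
```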