For Loop and If Statement not performing as expected - regex

Here's the code:
# Scrape table data
alltable = driver.find_elements_by_id("song-table")
date = date.today()
simple_year_list = []
complex_year_list = []
dateformat1 = re.compile(r"\d\d\d\d")
dateformat2 = re.compile(r"\d\d\d\d-\d\d-\d\d")
for term in alltable:
    simple_year = dateformat1.findall(term.text)
    for year in simple_year:
        if 1800 < int(year) < date.year: # Year can't be above what the current year is or below 1800,
            simple_year_list.append(simple_year) # Might have to be changed if you have a song from before 1800
        else:
            continue
    complex_year = dateformat2.findall(term.text)
    complex_year_list.append(complex_year)
The code uses regular expressions to find four consecutive digits. Since there are multiple 4 digit numbers, I want to narrow it down to between 1800 and 2021 since that's a reasonable time frame. simple_year_list, however, prints out numbers that don't follow the conditions.

You aren't saving the right value here:
simple_year_list.append(simple_year)
You should be saving the year:
simple_year_list.append(year)
I would need more information to help further though. Maybe give us a sample of the data you're working through, and the output you're seeing?

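You can do it all in regex.
Restrict the range in the pattern itself. Use start ^ and end $ anchors when you test one candidate string at a time; when searching inside longer text with findall, use \b word boundaries instead. Non-capturing groups (?:...) make findall return the whole match rather than the groups. This pattern covers 1800 to 2021:
dateformat1 = re.compile(r"\b(?:1[89]\d\d|20(?:[01]\d|2[01]))\b")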
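Both suggestions can be sketched side by side. This is only an illustration: the sample text is made up, and the upper bound is taken as "up to and including the current year", which is an assumption based on the comment in the question.

```python
import re
from datetime import date

# Made-up sample text standing in for term.text (assumption, not real scraped data).
text = "Released 1975, catalog 4821, remastered 2011-03-04, length 0312, 1799, 2999"

# Approach 1: match any standalone four-digit number, then filter each year in Python.
four_digits = re.compile(r"\b\d{4}\b")
current_year = date.today().year
simple_year_list = [
    int(year)                      # append the individual year, not the whole findall list
    for year in four_digits.findall(text)
    if 1800 < int(year) <= current_year
]

# Approach 2: encode the 1800-2021 range in the pattern itself.
# Non-capturing groups make findall return the full match.
year_range = re.compile(r"\b(?:1[89]\d\d|20(?:[01]\d|2[01]))\b")
regex_only = year_range.findall(text)

print(simple_year_list)
print(regex_only)
```

The first approach keeps following the calendar, while the second has 2021 hard-coded into the pattern and would need editing later.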

Pandas: SettingWithCopyWarning, trying to understand how to write the code better, not just whether to ignore the warning

I am trying to change all date values in a spreadsheet's Date column where the year is earlier than 1900, to today's date, so I have a slice.
EDIT: previous lines of code:
df=pd.read_excel(filename)#,usecols=['NAME','DATE','EMAIL']
#regex to remove weird characters
df['DATE'] = df['DATE'].str.replace(r'[^a-zA-Z0-9\._/-]', '')
df['DATE'] = pd.to_datetime(df['DATE'])
sample row in dataframe: name, date, email
[u'Public, Jane Q.\xa0' u'01/01/2016\xa0' u'jqpublic#email.com\xa0']
This line of code works.
df["DATE"][df["DATE"].dt.year < 1900] = dt.datetime.today()
Then, all date values are formatted:
df["DATE"] = df["DATE"].map(lambda x: x.strftime("%m/%d/%y"))
But I get an error:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I have read the documentation and other posts, where using .loc is suggested
The following is the recommended solution:
df.loc[row_indexer,col_indexer] = value
but df["DATE"].loc[df["DATE"].dt.year < 1900] = dt.datetime.today() gives me the same error, except that the line number is actually the line number after the last line in the script.
I just don't understand what the documentation is trying to tell me as it relates to my example.
I started messing around with pulling out the slice and assigning to a separate dataframe, but then I'm going to have to bring them together again.
You are producing a view when you select df["DATE"] and then apply the selector [df["DATE"].dt.year < 1900] and try to assign to it.
df["DATE"][df["DATE"].dt.year < 1900] is the chained selection that pandas is complaining about.
Fix it with a single loc call that takes the row mask and the column together:
df.loc[df.DATE.dt.year < 1900, "DATE"] = pd.Timestamp.today()
My thought would be that you could do
df.loc[df.DATE.dt.year < 1900, "DATE"] = dt.datetime.today()
df.loc[:, "DATE"] = df.DATE.map(lambda x: x.strftime("%m/%d/%y"))
Not at a computer so I can't test but I think that should do it.
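For anyone who wants to try it, here is a small self-contained sketch of the loc approach. The frame contents are invented for illustration, and it assumes the DATE column has already been converted with pd.to_datetime as in the question.

```python
import pandas as pd

# Invented sample rows (assumption); the real data comes from read_excel in the question.
df = pd.DataFrame({
    "NAME": ["Public, Jane Q.", "Doe, John", "Roe, Richard"],
    "DATE": pd.to_datetime(["1899-12-31", "2016-01-01", "1850-06-15"]),
})

today = pd.Timestamp.today().normalize()  # midnight today

# One loc call with a boolean row mask and the column label:
# this assigns into df itself instead of into an intermediate selection.
too_old = df["DATE"].dt.year < 1900
df.loc[too_old, "DATE"] = today

# Format every date as a string; .dt.strftime is the vectorised form of the map/lambda.
df["DATE"] = df["DATE"].dt.strftime("%m/%d/%y")

print(df)
```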

Time Series manipulation

So I have a dataframe that I dump a time series into. The index is the date. I need to do calculations based on date.
For eg. I have {
XRT_Close
Date
2010-01-04 35.94
2010-01-05 36.17
2010-01-06 36.50
...
2015-02-07 36.60
2015-02-08 36.52 }
How would I go about doing say... Percentage change of beginning to end of the month? How would I construct a loop to cycle through the months?
Any help will be met with huge appreciation. Thank you.
First create year and month columns:
df['year'] = [x.year for x in df.index]
df['month'] = [x.month for x in df.index]
Group by them:
grouped = df.groupby(['year','month'])
Define the function you want to run on the groups:
def PChange(df):
    begin = df['column_name'].iloc[0]
    end = df['column_name'].iloc[-1]
    return (end - begin) / begin * 100  # percentage change relative to the first value
Apply the function to the groups:
grouped.apply(PChange)
Let me know if it works.
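Here is a compact sketch of the same idea that groups directly on the index instead of adding helper columns. The prices are made-up sample values, and "beginning to end of the month" is taken to mean first row to last row within each calendar month, which is an assumption.

```python
import pandas as pd

# Made-up sample prices (assumption), indexed by date like the frame in the question.
idx = pd.to_datetime(
    ["2010-01-04", "2010-01-05", "2010-01-29", "2010-02-01", "2010-02-26"]
)
df = pd.DataFrame({"XRT_Close": [35.94, 36.17, 37.00, 36.50, 36.60]}, index=idx)

def pct_change_first_to_last(prices):
    """Percentage change from the first to the last value of one group."""
    begin = prices.iloc[0]
    end = prices.iloc[-1]
    return (end - begin) / begin * 100

# Group by (year, month) taken from the DatetimeIndex, so no explicit loop is needed.
monthly = (
    df["XRT_Close"]
    .groupby([df.index.year, df.index.month])
    .apply(pct_change_first_to_last)
)

print(monthly)
```

This assumes the rows are already sorted by date; if they might not be, calling df.sort_index() first would be safer.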

Python sorting timestamp

I am struggling with something that should be relatively straight forward, but I am getting nowhere.
I have a bunch of data that has a timestamp in the format of hh:mm:ss. The data ranges from 00:00:00 all 24 hours of the day through 23:59:59.
I do not know how to go about pulling out the hh part of the data, so that I can just look at data between specific hours of the day.
I read the data in from a CSV file using:
with open(filename) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        time = row['Time']
This gives me time in the hh:mm:ss format, but now I do not know how to do what I want, which is look at the data from 6am until 6pm. 06:00:00 to 18:00:00.
With the times in 24 hour format, this is actually very simple:
'06:00:00' <= row['Time'] <= '18:00:00'
Assuming that you only have valid timestamps, this is true for all times between 6 AM and 6 PM inclusive.
If you want to get a list of all rows that meet this, you can put this into a list comprehension:
relevant_rows = [row for row in reader if '06:00:00' <= row['Time'] <= '18:00:00']
Update:
For handling times with no leading zero (0:00:00, 3:00:00, 15:00:00, etc), use split to get just the part before the first colon:
> row_time = '0:00:00'
> row_time.split(':')
['0', '00', '00']
> int(row_time.split(':')[0])
0
You can then check if the value is at least 6 and less than 18. If you want to include entries that are at 6 PM, then you have to check the minutes and seconds to make sure it is not after 6 PM.
However, you don't even really need to try anything like regex or even a simple split. You have two cases to deal with - either the hour is one digit, or it is two digits. If it is one digit, it needs to be at least six. If it is two digits, it needs to be less than 18. In code:
if row_time[1] == ':': # 1-digit hour
    if row_time > '6': # 6 AM or later
        pass # This is an entry you want
else:
    if row_time < '18:00:00': # Use <= if you want 6 PM to be included
        pass # This is an entry you want
or, compacted to a single line (the two-digit test has to be limited to two-digit hours, otherwise '0:00:00' slips through because '0' sorts before '1'):
if (row_time[1] == ':' and row_time > '6') or (row_time[1] != ':' and row_time < '18:00:00'):
    pass # This is an entry you want
# Parentheses are not actually needed, but help make it clearer
as a list comprehension:
relevant_rows = [row for row in reader if (row['Time'][1] == ':' and row['Time'] > '6') or (row['Time'][1] != ':' and row['Time'] < '18:00:00')]
You can use Python's slicing syntax to pull characters from the string.
For example:
time = '06:05:22'
timestamp_hour = time[0:2] # characters at index 0 and 1 (the end index 2 is excluded)
print(timestamp_hour)
06
This prints the first two digits: 06. Then you can call int() to convert them to an integer:
hour = int(timestamp_hour)
print(hour)
6
Now you have an integer variable that can be checked to see if it is between, say, 6 and 18.
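If you would rather not reason about string ordering at all, another option is to parse each value into a datetime.time and compare those. The sketch below assumes a 'Time' column as in the question. The sample CSV text, the 'Value' column and the in_window helper are made up for illustration.

```python
import csv
import io
from datetime import datetime, time

# Made-up sample data (assumption); in the question this comes from open(filename).
sample = (
    "Time,Value\n"
    "0:15:00,1\n"
    "06:00:00,2\n"
    "9:30:10,3\n"
    "18:00:00,4\n"
    "18:00:01,5\n"
    "23:59:59,6\n"
)

def in_window(text, start=time(6, 0), end=time(18, 0)):
    """True when the hh:mm:ss string falls between start and end, inclusive."""
    # %H accepts hours with or without a leading zero.
    parsed = datetime.strptime(text.strip(), "%H:%M:%S").time()
    return start <= parsed <= end

reader = csv.DictReader(io.StringIO(sample))
relevant_rows = [row for row in reader if in_window(row["Time"])]

for row in relevant_rows:
    print(row["Time"], row["Value"])
```

Parsing costs a little more than a plain string comparison, but it behaves the same for 6:00:00 and 06:00:00 and raises a ValueError on malformed timestamps instead of silently misclassifying them.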

How do you remove seconds and milliseconds from a date time string in python

How can I convert a date in the format "2013-03-15 05:14:51.327" to "2013-03-15 05:14", i.e. remove the seconds and milliseconds? I don't think there is a way to do this in Robot Framework. Please let me know if anyone has a solution for this in Python.
Try this (Thanks Blender!)
>>> date = "2013-03-15 05:14:51.327"
>>> newdate = date.rpartition(':')[0]
>>> print(newdate)
2013-03-15 05:14
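If the string should also be validated as a real timestamp, a possible alternative is to parse it and format it again. This is a sketch that assumes the input always has the "YYYY-MM-DD hh:mm:ss.fff" layout shown in the question.

```python
from datetime import datetime

raw = "2013-03-15 05:14:51.327"

# %f accepts one to six fractional digits, so the millisecond part parses fine.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")

# Format back without seconds and fractional seconds.
trimmed = parsed.strftime("%Y-%m-%d %H:%M")

print(trimmed)  # 2013-03-15 05:14
```

Unlike rpartition, this raises a ValueError when the input does not match the expected format.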
In Robot Framework the most straightforward way would be to use Split String From Right from the String library:
${datestring}=    Set Variable    2019-03-15 05:14:51.327
${parts}=    Split String From Right    ${datestring}    :    max_split=1
# parts is a list of two elements - everything before the last ":", and everything after it
# take the 1st element, it is what we're after
${no seconds}=    Get From List    ${parts}    0
Log    ${no seconds}    # 2019-03-15 05:14

Regular expression for numeric range

Looking for a regular expression to cover a number range. More specifically, consider a numeric format:
NN-NN
where N is a number. So examples are:
04-11
07-12
06-06
I want to be able to specify a range. For example, anything between:
01-27 and 02-03
When I say range, it is as if the - is not there. So the range:
the range 01-27 to 02-03
would cover:
01-28, 01-29, 01-30, 01-31 and 02-01
I want the regular expression so that I can plug in values for the range very easily. Any ideas?
Validating dates is not where regexes' strengths lie.
For example, how would you validate February with respect to leap years?
The solution is to use the date APIs available in your language.
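As a rough sketch of that approach in Python: parse each MM-DD string into a datetime.date and compare dates. The fixed non-leap year 2011, the inclusive bounds and the helper names are all assumptions made for illustration, and a range that wraps around the new year is not handled.

```python
from datetime import date

def parse_mmdd(text, year=2011):
    """Turn 'MM-DD' into a date; raises ValueError for impossible dates like 02-30."""
    month, day = (int(part) for part in text.split("-"))
    return date(year, month, day)

def in_range(text, start="01-27", end="02-03"):
    """True when start <= text <= end (both bounds inclusive)."""
    return parse_mmdd(start) <= parse_mmdd(text) <= parse_mmdd(end)

candidates = ["01-28", "01-31", "02-01", "02-03", "04-11", "01-03"]
matches = [c for c in candidates if in_range(c)]

print(matches)
```

Changing the range only means passing different start and end strings, with no pattern to regenerate.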
'0[12]-[0-3][0-9]' would match all of the required dates; however, it would also match dates like 01-03. If you want to match exactly and only the dates in that range, you'll need to do something a little more advanced.
Here's an easily configurable example in Python:
from calendar import monthrange
import re
startdate = (1,27)
enddate = (2,3)
d = startdate
dateList = []
while d != enddate: # the end date itself is not included
    (month, day) = d
    dateList += ['%02i-%02i' % (month, day)]
    daysInMonth = monthrange(2011,month)[1] # took a random non-leap year
    # but you might want to take the current year
    day += 1
    if day > daysInMonth:
        day = 1
        month+=1
        if month > 12:
            month = 1
    d = (month,day)
dateRegex = '|'.join(dateList)
testDates = ['01-28', '01-29', '01-30', '01-31', '02-01',
             '04-11', '07-12', '06-06']
isMatch = [re.match(dateRegex,x) is not None for x in testDates]
for i, testDate in enumerate(testDates):
    print(testDate, isMatch[i])
dateRegex looks like this:
'01-27|01-28|01-29|01-30|01-31|02-01|02-02'
And the output is:
01-28 True
01-29 True
01-30 True
01-31 True
02-01 True
04-11 False
07-12 False
06-06 False
It's not completely clear to me, and you didn't mention the language either, but in PHP it looks like this:
if (preg_match('~\d{2}-\d{2}~', $input, $matches)) {
    // do something here
}
Do you have any use case so we can adjust code to your needs?