Python create folders structure based on current date - python-2.7

I'm using the following variable in my script to send output to one directory:
output = "/opt/output"
I want to adjust it so that the output path depends on the date on which the script is triggered. It should be structured like this:
output = "/opt/output/year/month/day"
I'm not sure I'm doing this the right way. I used the following approach:
output = "/opt/output/" + today.strftime('%Y%m%d')
Any tips?

I recommend using the full timestamp instead of just the date:
import datetime
import os
mydir = os.path.join(output, datetime.datetime.now().strftime('%Y/%m/%d_%H-%M-%S'))
I recommend this because of what would happen if your script runs more than once a day. If you don't want the full timestamp, you should at least add a counter that is incremented when the folder already exists.
You can read more about os.path.join in the Python documentation.
To create the folder, you can do it like this:
if not os.path.exists(mydir):
    os.makedirs(mydir)
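The counter idea mentioned above could look like the following. This is only a sketch: the function name unique_dated_dir and the _1, _2 suffix scheme are assumptions made for illustration, not something taken from the question.

```python
import datetime
import os


def unique_dated_dir(base_dir, now=None):
    """Create base_dir/YYYY/MM/DD, adding a numeric suffix if it already exists.

    The suffix scheme (_1, _2, ...) is a hypothetical convention.
    """
    now = now or datetime.datetime.now()
    candidate = os.path.join(base_dir, now.strftime('%Y'),
                             now.strftime('%m'), now.strftime('%d'))
    counter = 0
    path = candidate
    # keep incrementing the counter until an unused folder name is found
    while os.path.exists(path):
        counter += 1
        path = '%s_%d' % (candidate, counter)
    os.makedirs(path)
    return path
```

Calling unique_dated_dir("/opt/output") twice on the same day would create the day folder first and a suffixed sibling the second time.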

I figured it out with:
import datetime
today = datetime.datetime.now()
year = today.strftime("%Y")
month = today.strftime("%m")
day = today.strftime("%d")
output = "/opt/output/" + year + "/" + month + "/" + day
That worked fine for me.

I suggest using os.path.join and os.path.sep:
import os
.
.
.
full_dir = os.path.join(base_dir, today.strftime('%Y{0}%m{0}%d').format(os.path.sep))

today.strftime('%Y%m%d') would print today's date as 20170607, but I guess you want it printed as 2017/06/07. You could add the slashes explicitly in the format string:
output = "/opt/output/" + today.strftime('%Y/%m/%d')
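If you prefer to build the path from the date's year, month and day attributes, note that they are integers, so they have to be converted to zero-padded strings before they can be joined. A minimal sketch (the function name dated_dir is hypothetical):

```python
import datetime
import os


def dated_dir(base_dir, day):
    """Return base_dir/YYYY/MM/DD for a date or datetime object."""
    # year, month and day are integers, so format them with zero padding
    return os.path.join(base_dir, '%04d' % day.year,
                        '%02d' % day.month, '%02d' % day.day)


print(dated_dir('/opt/output', datetime.date(2017, 6, 7)))
```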

Related

Python read text file based on partial name and file timestamp

I'm trying to pull two of the same files into python in different dataframes, with the end goal of comparing what was added in the new file and removed from the old. So far, I've got code that looks like this:
In[1] path = r'\\Documents\FileList'
      files = os.listdir(path)
In[2] files_txt = [f for f in files if f[-3:] == 'txt']
In[3] for f in files_txt:
          data = pd.read_excel(path + r'\\' + f)
          df = df.append(data)
I've also set a variable to equal the current date minus a certain number of days, which I want to use to pull the file that has a date equal to that variable:
d7 = dt.datetime.today() - timedelta(7)
As of now, I'm unsure of how to do this, as the first part of the filename always remains the same but they add numbers at the end (eg. file_03232016 then file_03302016). I want to parse through the directory for the beginning part of the filename and add it to a dataframe if it matches the date parameter I set.
EDIT: I forgot to add that sometimes I also need to look at the system date created timestamp, as the text date in the file name isn't always there.
Here are some modifications to your original code to get a list of files containing your target date. You need to use strftime.
import os
import datetime as dt
import pandas as pd
d7 = dt.datetime.today() - dt.timedelta(7)
target_date_str = d7.strftime('_%m%d%Y')
# target_date_str + '.txt' looks like '_03232016.txt'
files_txt = [f for f in files if f.endswith(target_date_str + '.txt')]
data = []
for f in files_txt:
    data.append(pd.read_excel(os.path.join(path, f)))
df = pd.concat(data, ignore_index=True)
Use strftime in order to represent your datetime variable as a string with desired format and glob for searching files by file mask in the directory:
import datetime as dt
import glob
import pandas as pd
fmask = r'\\Documents\FileList\*' + (dt.datetime.today() - dt.timedelta(7)).strftime('%m%d%Y') + '*.txt'
files_txt = glob.glob(fmask)
# concatenate all CSV/txt files into one data frame
df = pd.concat([pd.read_csv(f) for f in files_txt], ignore_index=True)
PS: I guess you want to use read_csv instead of read_excel when working with txt files, unless you really have Excel files with a txt extension.
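The question's edit also mentions falling back on the file's timestamp when the date is missing from the name, which neither answer covers. A minimal sketch of that idea, assuming the modification time is an acceptable stand-in for the creation date (the helper name files_from_day is hypothetical):

```python
import datetime as dt
import os


def files_from_day(directory, target_date, suffix='.txt'):
    """Return paths in directory whose modification date equals target_date."""
    matches = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if not (name.endswith(suffix) and os.path.isfile(full_path)):
            continue
        # getmtime is used here; os.path.getctime is the creation time on
        # Windows only, on Unix it is the last metadata change
        modified = dt.datetime.fromtimestamp(os.path.getmtime(full_path))
        if modified.date() == target_date:
            matches.append(full_path)
    return matches
```

With the variables from the question, files_from_day(path, d7.date()) would return the candidates to read when the name-based match finds nothing.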

Python make script run at specified time daily

I want this script to run automatically once or twice a day at a specified time. What would be the best way to approach this?
def get_data():
    """Reads the currency rates from cinkciarz.pl and prints out, stores the pln/usd
    rate in a variable myRate"""
    sock = urllib.urlopen("https://cinkciarz.pl/kantor/kursy-walut-cinkciarz-pl/usd")
    htmlSource = sock.read()
    sock.close()
    currancyRate = re.findall(r'<td class="cur_down">(.*?)</td>',str(htmlSource))
    for eachTd in currancyRate:
        print(eachTd)
    print currancyRate[0]
    myRate = currancyRate[0]
    print myRate
    return myRate
You can use crontab to run any script at regular intervals. See https://stackoverflow.com/a/8727991/1517864
To run a script once a day (at 12:00), you need an entry like this in your crontab:
0 12 * * * python /path/to/script.py
Alternatively, you can run the command in a bash loop:
while true; do <your_command>; sleep <interval_in_seconds>; done
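If cron is not available, the sleep loop can also be written in Python itself. A minimal sketch, assuming the script is left running in the background; seconds_until and run_daily are hypothetical helper names:

```python
import datetime
import time


def seconds_until(hour, minute, now=None):
    """Return the number of seconds from now until the next hour:minute."""
    now = now or datetime.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # today's slot has already passed, so aim for tomorrow
        target += datetime.timedelta(days=1)
    return (target - now).total_seconds()


def run_daily(job, hour, minute):
    """Call job() every day at hour:minute; runs until interrupted."""
    while True:
        time.sleep(seconds_until(hour, minute))
        job()
```

For the function in the question, run_daily(get_data, 12, 0) would fetch the rate every day at noon. Unlike cron, this stops working as soon as the process exits, so cron remains the more robust choice.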

Paginating with Python 2.7.9 Web Crawler

I am trying to code a program in Python 2.7.9 to crawl and gather the club names, addresses and phone numbers from the website http://tennishub.co.uk/
The following code gets the job done, except that it doesn't move on to the subsequent pages for each location, such as
/Berkshire/1
/Berkshire/2
/Berkshire/3
..and so on.
import requests
from bs4 import BeautifulSoup
def tennis_club():
    url = 'http://tennishub.co.uk/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    for link in soup.select('div.countylist a'):
        href = 'http://tennishub.co.uk' + link.get('href')
        pages_data(href)
def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text)
    g_data = soup.select('table.display-table')
    for item in g_data:
        print item.contents[1].text
        print item.contents[3].findAll('td')[1].text
        try:
            print item.contents[3].find_all('td',{'class':'telrow'})[0].text
        except:
            pass
        try:
            print item.contents[5].findAll('td',{'class':'emailrow'})[0].text
        except:
            pass
        print item_url
tennis_club()
I have tried tweaking the code to the best of my understanding, but it doesn't work at all.
Can someone please advise what I need to do so that the program goes through all the pages of a location, collects the data, then moves on to the next location, and so on?
You are going to need to put another for loop into this code:
for link in soup.select('div.countylist a'):
    href = 'http://tennishub.co.uk' + link.get('href')
    # new for loop goes here #
    pages_data(href)
If you want to brute-force it, you can just have the for loop run as many times as the area with the most clubs (Surrey) needs. However, you would then count the last clubs two, three, four or more times for many of the areas. This is ugly, but you can get away with it if you are using a database where you don't insert duplicates. It is unacceptable, however, if you are writing to a file. In that case you will need to pull the number in parentheses after the area name, e.g. Berkshire (39). To get that number you can do a get_text() on the div.countylist, which would change the above to
import re
for link in soup.select('div.countylist'):
    for endHref in link.find_all('a'):
        numClubs = endHref.next
        # clean up numClubs to remove spaces and parens, e.g. ' (39)' -> 39
        numClubs = int(re.search(r'\d+', numClubs).group())
        endHrefNum = numClubs // 10 + 1  # add one because // gives the floor
        for page in range(1, endHrefNum + 1):
            href = 'http://tennishub.co.uk' + endHref.get('href') + '/' + str(page)
            pages_data(href)
(Disclaimer: I didn't run this through bs4, so there might be syntax errors, and you might need to use something other than .next, but the logic should help you.)
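One detail of the floor-division-plus-one formula above: when the club count is an exact multiple of the page size, it asks for one page too many. A small sketch using ceiling division instead, assuming 10 clubs per page (as the answer implies) and page numbers that start at 1; page_count and page_urls are hypothetical helper names:

```python
def page_count(num_clubs, per_page=10):
    """Number of pages needed for num_clubs entries (ceiling division)."""
    return (num_clubs + per_page - 1) // per_page


def page_urls(county_url, num_clubs, per_page=10):
    """Build the /1, /2, ... page URLs for one county."""
    return ['%s/%d' % (county_url.rstrip('/'), page)
            for page in range(1, page_count(num_clubs, per_page) + 1)]


for url in page_urls('http://tennishub.co.uk/Berkshire', 39):
    print(url)
```

Each URL in the list could then be handed to pages_data() from the question.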

Abbreviate the import of multiple files with loadtxt (Python)

I want to shorten the way I import multiple files with loadtxt. Currently I do the following:
rc1 =loadtxt("20120701_Gp_xr_5m.txt", skiprows=19)
rc2 =loadtxt("20120702_Gp_xr_5m.txt", skiprows=19)
rc3 =loadtxt("20120703_Gp_xr_5m.txt", skiprows=19)
rc4 =loadtxt("20120704_Gp_xr_5m.txt", skiprows=19)
rc5 =loadtxt("20120705_Gp_xr_5m.txt", skiprows=19)
rc6 =loadtxt("20120706_Gp_xr_5m.txt", skiprows=19)
rc7 =loadtxt("20120707_Gp_xr_5m.txt", skiprows=19)
rc8 =loadtxt("20120708_Gp_xr_5m.txt", skiprows=19)
rc9 =loadtxt("20120709_Gp_xr_5m.txt", skiprows=19)
rc10 =loadtxt("20120710_Gp_xr_5m.txt", skiprows=19)
Then I concatenate them using:
GOES =concatenate((rc1,rc2,rc3,rc4,rc5,rc6,rc7,rc8,rc9,
rc10),axis=0)
But my question is: can I reduce all of this, maybe with a for loop or something like that, since the file names are a sequence of dates (strings)?
I was thinking of doing something like this:
day= #### i dont know how define a string going from 01 to 31 for example
data="201207"+day+"_Gp_xr_5m.txt"
Then do this, but I think it is not correct:
GOES=loadtxt(data, skiprows=19)
Yes, you can easily get your sub-arrays with a for-loop, or with an equivalent list comprehension. Use the glob module to get the desired file names:
import numpy as np # you probably don't need this line
from glob import glob
fnames = glob('path/to/dir')
arrays = [np.loadtxt(f, skiprows=19) for f in fnames]
final_array = np.concatenate(arrays)
If memory use becomes a problem, you can also iterate over all files line by line by chaining them and feeding that generator to np.loadtxt.
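A rough sketch of that chaining idea, using only the standard library. It assumes every file has the same number of header rows; data_lines and chained_lines are hypothetical helper names:

```python
import itertools


def data_lines(fname, skiprows=19):
    """Yield the lines of one file, skipping its header rows."""
    with open(fname) as handle:
        for line in itertools.islice(handle, skiprows, None):
            yield line


def chained_lines(fnames, skiprows=19):
    """Chain the data lines of all files into one lazy stream."""
    return itertools.chain.from_iterable(
        data_lines(fname, skiprows) for fname in fnames)
```

The result can be passed to np.loadtxt in place of a file name, e.g. np.loadtxt(chained_lines(fnames)), without a skiprows argument because the headers have already been dropped.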
Edit after OP's comment:
My example with glob wasn't very clear.
You can use "wildcards" (*) to match files, e.g. glob('*') to get a list of all files in the current directory. Part of the code above could therefore be written better as:
fnames = glob('path/to/dir/201207*_Gp_xr_5m.txt')
Or if your program already runs from the right directory:
fnames = glob('201207*_Gp_xr_5m.txt')
I forgot this earlier, but you should also sort the list of filenames, because the list of filenames from glob is not guaranteed to be sorted.
fnames.sort()
A slightly different approach, more in the direction of what you were thinking is the following. When variable day contains the day number you can put it in the filename like so:
daystr = str(day).zfill(2)
fname = '201207' + daystr + '_Gp_xr_5m.txt'
Or using a clever format specifier:
fname = '201207{:02}_Gp_xr_5m.txt'.format(day)
Or the "old" way:
fname = '201207%02i_Gp_xr_5m.txt' % day
Then simply use this in a for-loop:
arrays = []
for day in range(1, 32):
    daystr = str(day).zfill(2)
    fname = '201207' + daystr + '_Gp_xr_5m.txt'
    a = np.loadtxt(fname, skiprows=19)
    arrays.append(a)
final_array = np.concatenate(arrays)
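If the files span more than one month, hard-coding '201207' and range(1, 32) no longer works. A sketch that lets datetime do the date arithmetic instead; daily_filenames is a hypothetical helper, and the suffix is taken from the file names in the question:

```python
import datetime


def daily_filenames(start, end, suffix='_Gp_xr_5m.txt'):
    """Return one file name per day from start to end, inclusive."""
    names = []
    day = start
    while day <= end:
        names.append(day.strftime('%Y%m%d') + suffix)
        day += datetime.timedelta(days=1)
    return names


fnames = daily_filenames(datetime.date(2012, 7, 1), datetime.date(2012, 7, 10))
```

The resulting list is already in date order, so it can be fed straight into the loadtxt loop.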

Convert result of aggregation to string

I am using aggregate and Sum to determine the number of hours I have worked in each month. I have it working, but the "hours" variable always has extra content in it!
I should add, I am a newbie at django and I got most of this code from here (Django beginner: How to query in django ORM to calculate fields based on dates).
My code:
hours = ""
work_data = ""
month_data = ""
for month in range(1,13):
    entries_per_month = Mydata.objects.filter(myTimePeriod__month=month).filter(myResource="James")
    hours = str(entries_per_month.aggregate(value=Sum('myHoursLogged')))
    month_data = month_data + "'" + str(month) + "',"
    work_data = work_data + hours + ","
I look at the results of work_data:
work_data
This gives me a result of {'value': Decimal('136.80')},{'value': Decimal('146.40')},
I need it in the format: 136.80, 146.40 (this is the format required by the charting library). I have tried using str() to convert it, but it doesn't seem to work in this case.
str is not useful here on its own, because aggregate() returns a dictionary, not a number. You need to pull the value out by its key before converting it:
hours = str(entries_per_month.aggregate(value=Sum('myHoursLogged'))['value'])
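To see the conversion in isolation, here is a sketch that works on plain dictionaries shaped like the aggregate() results shown in the question, so it runs without Django. format_hours is a hypothetical helper, and treating a month with no rows (where Sum gives None) as zero is an assumption:

```python
from decimal import Decimal


def format_hours(aggregates, key='value'):
    """Join the aggregated values into a comma-separated string."""
    parts = []
    for result in aggregates:
        value = result[key]
        # Sum() yields None for a month with no rows; treat that as zero
        parts.append(str(value if value is not None else Decimal('0.00')))
    return ', '.join(parts)


monthly = [{'value': Decimal('136.80')}, {'value': Decimal('146.40')}]
print(format_hours(monthly))
```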