Airflow Remote file sensor

I am trying to find out whether any files on a remote server match a provided pattern, similar to the solution below:
Airflow File Sensor for sensing files on my local drive
I used SSHOperator with a bash command, as below:
SSH_Bash = """
echo 'poking for files...'
ls /home/files/test.txt
if [ $? -eq "0" ]; then
    echo 'Found file'
else
    echo 'failed to find'
fi
"""

t1 = SSHOperator(
    ssh_conn_id='ssh_default',
    task_id='test_ssh_operator',
    command=SSH_Bash,
    dag=dag)
It works, but it doesn't look like an optimal solution. Could someone suggest a better approach than a Bash script for sensing files on the remote server?
I tried the SFTP sensor below:
import os
import re
import logging

from paramiko import SFTP_NO_SUCH_FILE
from airflow.contrib.hooks.sftp_hook import SFTPHook
from airflow.operators.sensors import BaseSensorOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults

class SFTPSensor(BaseSensorOperator):
    @apply_defaults
    def __init__(self, filepath, filepattern, sftp_conn_id='sftp_default', *args, **kwargs):
        super(SFTPSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath
        self.filepattern = filepattern
        self.hook = SFTPHook(sftp_conn_id)

    def poke(self, context):
        full_path = self.filepath
        file_pattern = re.compile(self.filepattern)
        try:
            directory = os.listdir(self.hook.full_path)
            for files in directory:
                if not re.match(file_pattern, files):
                    self.log.info(files)
                    self.log.info(file_pattern)
                else:
                    context["task_instance"].xcom_push("file_name", files)
                    return True
            return False
        except IOError as e:
            if e.errno != SFTP_NO_SUCH_FILE:
                raise e
            return False

class SFTPSensorPlugin(AirflowPlugin):
    name = "sftp_sensor"
    sensors = [SFTPSensor]
But this always pokes the local machine instead of the remote machine. Could someone point out where I am making a mistake?

I replaced the line from
directory = os.listdir(self.hook.full_path)
to
directory = self.hook.list_directory(full_path)
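With that change, a minimal sketch of the corrected poke method (assuming the contrib SFTPHook's list_directory, which lists the remote path over the SFTP connection instead of the local filesystem):

def poke(self, context):
    file_pattern = re.compile(self.filepattern)
    # list_directory() talks to the remote host through the SFTP connection;
    # os.listdir() only ever inspects the local filesystem.
    for file_name in self.hook.list_directory(self.filepath):
        if re.match(file_pattern, file_name):
            context["task_instance"].xcom_push("file_name", file_name)
            return True
    return False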

Related

Issue with django-crontab jobs not working

Hello, I'm trying to use django-crontab in my Django project and it's not working; does anyone know anything about this? I'm using Linux CentOS 8. I want to schedule a task that adds some data to my database. Can someone help me?
The steps that I have taken are:
1) pip install django-crontab
2) add it to the installed apps
3) build my cron management command:
from django.core.management.base import BaseCommand
from backups.models import Backups
from devices.models import Devices
from datetime import datetime
from jnpr.junos import Device
from jnpr.junos.exception import ConnectError
from lxml import etree
from django.http import HttpResponse
from django.core.files import File

class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        devices = Devices.objects.all()
        for x in devices:
            devid = Devices.objects.get(pk=x.id)
            ip = x.ip_address
            username = x.username
            password = x.password
            print(devid, ip, username, password)
            dev1 = Device(host=ip, user=username, passwd=password)
            try:
                dev1.open()
                stype = "success"
                dataset = dev1.rpc.get_config(options={'format': 'set'})
                datatext = dev1.rpc.get_config(options={'format': 'text'})
                result = etree.tostring(dataset, encoding='unicode')
                file_name = f'{ip}_{datetime.now().date()}.txt'
                print(file_name)
                with open(f'media/{file_name}', 'w') as f:
                    f.write(etree.tostring(dataset, encoding='unicode'))
                    f.write(etree.tostring(datatext, encoding='unicode'))
                backup = Backups(device_id=devid, host=ip, savetype=stype, time=datetime.now(), backuptext=file_name)
                print(backup)
                backup.save()
            except ConnectError as err:
                print("Cannot connect to device: {0}".format(err))
                print("----- Failed ----------")
                stype = "Cannot connect to device: {0}".format(err)
                backup = Backups(device_id=devid, host=ip, savetype=stype, time=datetime.now())
                backup.save()
4) add my cronjob to my settings.py file:
CRONJOBS = [ ('*/5 * * * *', 'django.core.management.call_command', ['backup-dev']), ]
5) python manage.py crontab add
6) python manage.py crontab show
Currently active jobs in crontab:
0662c1224789b131740fddef54f273c1 -> ('* * * * *', 'django.core.management.call_command', ['backup-dev'])
and it's still not working, any ideas?
When I run the command "python manage.py backup-dev" directly, my task works perfectly.
I also tried adding the management command directly to the CentOS machine's crontab with
crontab -e
and still nothing, any ideas?

Scheduled, timestamped sqlite3 .backup?

Running a small db on pythonanywhere, and am trying to set up a scheduled .backup of my sqlite3 database. Is there any way in the command line to add a time/date stamp to the filename, so that it doesn't overwrite the previous day's backup?
Here's the code I'm using, if it matters:
sqlite3 db.sqlite3
.backup dbbackup.sqlite3
.quit
Running every 24 hours. The previous day's backup gets overwritten, though. I'd love to just be able to save it as dbbackup.timestamp.sqlite3 or something, so I could have multiple backups available.
Thanks!
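One lightweight approach (a sketch, not from the answers below) is to drive the backup from a short Python script so the filename carries a timestamp. It assumes Python 3.7+, where the stdlib sqlite3 module exposes the online backup API, and the db.sqlite3 filename from the question:

import sqlite3
from datetime import datetime

# Timestamped target name, e.g. dbbackup.2024-01-31.sqlite3
stamp = datetime.now().strftime('%Y-%m-%d')
source = sqlite3.connect('db.sqlite3')
target = sqlite3.connect('dbbackup.%s.sqlite3' % stamp)
with target:
    source.backup(target)  # same effect as the CLI .backup command
target.close()
source.close()

Scheduled every 24 hours, this writes a new dbbackup.YYYY-MM-DD.sqlite3 each day instead of overwriting the previous one.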
I suggest you handle this case with a management command and a cron job.
This is an example of how to do it; save this file e.g. in yourapp/management/commands/dbackup.py, and don't forget to add the __init__.py files:
yourapp/management/__init__.py
yourapp/management/commands/__init__.py
yourapp/management/commands/dbackup.py
But first, add these lines below to your settings.py:
USERNAME_SUPERUSER = 'yourname'
PASSWORD_SUPERUSER = 'yourpassword'
EMAIL_SUPERUSER = 'youremail@domain.com'
DATABASE_NAME = 'db.sqlite3'
The important paths in the project tree if you are deploying at pythonanywhere:
/home/yourusername/yourproject/manage.py
/home/yourusername/yourproject/db.sqlite3
/home/yourusername/yourproject/yourproject/settings.py
/home/yourusername/yourproject/yourapp/management/commands/dbackup.py
Add the script below into yourapp/management/commands/dbackup.py; you can also customize this script as you need.
import os
import time

from django.conf import settings
from django.contrib.auth.models import User
from django.core.management.base import (BaseCommand, CommandError)

USERNAME_SUPERUSER = settings.USERNAME_SUPERUSER
PASSWORD_SUPERUSER = settings.PASSWORD_SUPERUSER
EMAIL_SUPERUSER = settings.EMAIL_SUPERUSER
DATABASE_NAME = settings.DATABASE_NAME  # eg: 'db.sqlite3'

class Command(BaseCommand):
    help = 'Command to deploy and backup the latest database.'

    def add_arguments(self, parser):
        parser.add_argument(
            '-b', '--backup', action='store_true',
            help='Just backup command confirmation.'
        )

    def success_info(self, info):
        return self.stdout.write(self.style.SUCCESS(info))

    def error_info(self, info):
        return self.stdout.write(self.style.ERROR(info))

    def handle(self, *args, **options):
        backup = options['backup']
        if not backup:
            return self.print_help('manage.py', 'dbackup')

        # Removing media files, if you need to remove all media files
        # os.system('rm -rf media/images/')
        # self.success_info("[+] Removed media files at `media/images/`")

        # Backing up and removing the database `db.sqlite3`
        if os.path.isfile(DATABASE_NAME):
            # backup the latest database, eg to: `db.2017-02-03.sqlite3`
            backup_database = 'db.%s.sqlite3' % time.strftime('%Y-%m-%d')
            os.rename(DATABASE_NAME, backup_database)
            self.success_info("[+] Backup the database `%s` to %s" % (DATABASE_NAME, backup_database))
            # os.rename() already moved the file, so the old database is gone
            self.success_info("[+] Removed %s" % DATABASE_NAME)

        # Removing all migration files for `yourapp`
        def remove_migrations(path):
            exclude_files = ['__init__.py', '.gitignore']
            path = os.path.join(settings.BASE_DIR, path)
            filelist = [
                f for f in os.listdir(path)
                if f.endswith('.py')
                and f not in exclude_files
            ]
            for f in filelist:
                os.remove(path + f)
            self.success_info('[+] Removed files migrations for {}'.format(path))

        # do remove all files migrations
        remove_migrations('yourapp/migrations/')

        # Removing all `.pyc` files
        os.system('find . -name *.pyc -delete')
        self.success_info('[+] Removed all *.pyc files.')

        # Creating database migrations
        # These commands should re-generate the new database, eg: `db.sqlite3`
        os.system('python manage.py makemigrations')
        os.system('python manage.py migrate')
        self.success_info('[+] Created database migrations.')

        # Creating a superuser
        user = User.objects.create_superuser(
            username=USERNAME_SUPERUSER,
            password=PASSWORD_SUPERUSER,
            email=EMAIL_SUPERUSER
        )
        user.save()
        self.success_info('[+] Created a superuser for `{}`'.format(USERNAME_SUPERUSER))
Set up this command with crontab:
$ sudo crontab -e
And add the following lines:
# [minute] [hour] [day of month] [month] [day of week]
59 23 * * * source ~/path/to/yourenv/bin/activate && cd ~/path/to/yourenv/yourproject/ && ./manage.py dbackup -b
But if you deploy at pythonanywhere, you just need to add a scheduled task:
Daily at [hour] : [minute] UTC, ... fill in hour=23 and minute=59
source /home/yourusername/.virtualenvs/yourenv/bin/activate && cd /home/yourusername/yourproject/ && ./manage.py dbackup -b
Update 1
I suggest you update the commands that execute manage.py, such as os.system('python manage.py makemigrations'), to use the call_command function;
from django.core.management import call_command
call_command('collectstatic', verbosity=3, interactive=False)
call_command('migrate', 'myapp', verbosity=3, interactive=False)
...which is equivalent to typing the following commands in the terminal:
$ ./manage.py collectstatic --noinput -v 3
$ ./manage.py migrate myapp --noinput -v 3
See running management commands from django docs.
Update 2
The previous setup is for when you need to re-deploy your project and use a fresh database. If you only want to back up the database by copying it to a dated name, you can use shutil.copyfile:
import os
import time
import shutil

from django.conf import settings
from django.core.management.base import (BaseCommand, CommandError)

DATABASE_NAME = settings.DATABASE_NAME  # eg: 'db.sqlite3'

class Command(BaseCommand):
    help = 'Command to deploy and backup the latest database.'

    def add_arguments(self, parser):
        parser.add_argument(
            '-b', '--backup', action='store_true',
            help='Just backup command confirmation.'
        )

    def success_info(self, info):
        return self.stdout.write(self.style.SUCCESS(info))

    def error_info(self, info):
        return self.stdout.write(self.style.ERROR(info))

    def handle(self, *args, **options):
        backup = options['backup']
        if not backup:
            return self.print_help('manage.py', 'dbackup')
        if os.path.isfile(DATABASE_NAME):
            # backup the latest database, eg to: `db.2017-02-28.sqlite3`
            backup_database = 'db.%s.sqlite3' % time.strftime('%Y-%m-%d')
            shutil.copyfile(DATABASE_NAME, backup_database)
            self.success_info("[+] Backup the database `%s` to %s" % (DATABASE_NAME, backup_database))

Using python to update a file on google drive

I have the following script to upload a file to Google Drive, using Python 2.7. As it is now it will upload a new copy of the file, but I want the existing file updated/overwritten. I can't find help in the Google Drive API references and guides for Python. Any suggestions?
from __future__ import print_function
import os

from apiclient.discovery import build
from httplib2 import Http
from oauth2client import file, client, tools

try:
    import argparse
    flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args()
except ImportError:
    flags = None

# Gain access to google drive
SCOPES = 'https://www.googleapis.com/auth/drive.file'
store = file.Storage('storage.json')
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('client_secret.json', SCOPES)
    creds = tools.run_flow(flow, store, flags) \
        if flags else tools.run(flow, store)
DRIVE = build('drive', 'v3', http=creds.authorize(Http()))

# The file that is being uploaded
FILES = (
    ('all-gm-keys.txt', 'application/vnd.google-apps.document'),  # in google doc format
)

# Where the file ends up on google drive
for filename, mimeType in FILES:
    folder_id = '0B6V-MONTYPYTHONROCKS-lTcXc'  # Not the real folder id
    metadata = {'name': filename, 'parents': [folder_id]}
    if mimeType:
        metadata['mimeType'] = mimeType
    res = DRIVE.files().create(body=metadata, media_body=filename).execute()
    if res:
        print('Uploaded "%s" (%s)' % (filename, res['mimeType']))
I think you are looking for the update method. Here is a link to the documentation; there is an example of overwriting a file in Python.
Using the official Google client API instead of raw HTTP requests should also make your task easier.
from apiclient import errors
from apiclient.http import MediaFileUpload
# ...

def update_file(service, file_id, new_title, new_description, new_mime_type,
                new_filename, new_revision):
    """Update an existing file's metadata and content.

    Args:
        service: Drive API service instance.
        file_id: ID of the file to update.
        new_title: New title for the file.
        new_description: New description for the file.
        new_mime_type: New MIME type for the file.
        new_filename: Filename of the new content to upload.
        new_revision: Whether or not to create a new revision for this file.

    Returns:
        Updated file metadata if successful, None otherwise.
    """
    try:
        # First retrieve the file from the API.
        file = service.files().get(fileId=file_id).execute()
        # File's new metadata.
        file['title'] = new_title
        file['description'] = new_description
        file['mimeType'] = new_mime_type
        # File's new content.
        media_body = MediaFileUpload(
            new_filename, mimetype=new_mime_type, resumable=True)
        # Send the request to the API.
        updated_file = service.files().update(
            fileId=file_id,
            body=file,
            newRevision=new_revision,
            media_body=media_body).execute()
        return updated_file
    except errors.HttpError, error:
        print 'An error occurred: %s' % error
        return None
Link to the example: https://developers.google.com/drive/api/v2/reference/files/update#examples
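Note that the example above targets the v2 API (title, newRevision), while the question's script builds a v3 service. A rough v3 sketch (an assumption based on the v3 files().update method, where file_id stands for the ID of the previously uploaded file) would be:

from apiclient.http import MediaFileUpload

# v3 renames 'title' to 'name' and drops the newRevision parameter
media = MediaFileUpload('all-gm-keys.txt')
res = DRIVE.files().update(fileId=file_id, media_body=media).execute()
print('Updated "%s" (%s)' % (res['name'], res['mimeType']))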

How to download specific Google Drive folder using Python?

I'm trying to download specific folders from Google Drive.
I tried this example:
http://www.mwclearning.com/?p=1608 but it downloads all the files from G-Drive.
EX: If I have two folders in Google Drive, say..
folder A having files 1 and 2
folder B having files 3, 4, and 5
If I want to download folder A, then only files 1 and 2 should get downloaded.
Any suggestion or help would be very helpful.
Thanks in advance.
Use the credentials.json downloaded from your Drive API:
from __future__ import print_function
import pickle
import os
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from oauth2client import client
from oauth2client import tools
from oauth2client.file import Storage
from apiclient.http import MediaFileUpload, MediaIoBaseDownload
import io
from apiclient import errors
from apiclient import http
import logging
from apiclient import discovery

# If modifying these scopes, delete the file token.pickle.
SCOPES = ['https://www.googleapis.com/auth/drive']

# To list folders
def listfolders(service, filid, des):
    results = service.files().list(
        pageSize=1000, q="\'" + filid + "\'" + " in parents",
        fields="nextPageToken, files(id, name, mimeType)").execute()
    # logging.debug(folder)
    folder = results.get('files', [])
    for item in folder:
        if str(item['mimeType']) == str('application/vnd.google-apps.folder'):
            if not os.path.isdir(des + "/" + item['name']):
                os.mkdir(path=des + "/" + item['name'])
            print(item['name'])
            listfolders(service, item['id'], des + "/" + item['name'])  # loop until the files are found
        else:
            downloadfiles(service, item['id'], item['name'], des)
            print(item['name'])
    return folder

# To download files
def downloadfiles(service, dowid, name, dfilespath):
    request = service.files().get_media(fileId=dowid)
    fh = io.BytesIO()
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))
    with io.open(dfilespath + "/" + name, 'wb') as f:
        fh.seek(0)
        f.write(fh.read())

def main():
    """Shows basic usage of the Drive v3 API.
    Prints the names and ids of the first 10 files the user has access to.
    """
    creds = None
    # The file token.pickle stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)  # credentials.json downloaded from the Drive API
            creds = flow.run_local_server()
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)
    service = build('drive', 'v3', credentials=creds)

    # Call the Drive v3 API
    Folder_id = "'PASTE YOUR SHARED FOLDER ID'"  # Enter the downloadable folder ID from the shared link
    results = service.files().list(
        pageSize=1000, q=Folder_id + " in parents", fields="nextPageToken, files(id, name, mimeType)").execute()
    items = results.get('files', [])
    if not items:
        print('No files found.')
    else:
        print('Files:')
        for item in items:
            if item['mimeType'] == 'application/vnd.google-apps.folder':
                if not os.path.isdir("Folder"):
                    os.mkdir("Folder")
                bfolderpath = os.getcwd() + "/Folder/"
                if not os.path.isdir(bfolderpath + item['name']):
                    os.mkdir(bfolderpath + item['name'])
                folderpath = bfolderpath + item['name']
                listfolders(service, item['id'], folderpath)
            else:
                if not os.path.isdir("Folder"):
                    os.mkdir("Folder")
                bfolderpath = os.getcwd() + "/Folder/"
                if not os.path.isdir(bfolderpath + item['name']):
                    os.mkdir(bfolderpath + item['name'])
                filepath = bfolderpath + item['name']
                downloadfiles(service, item['id'], item['name'], filepath)

if __name__ == '__main__':
    main()
Try checking the Google Drive API documentation; you can see here the sample code used to perform a file download using Python.
file_id = '0BwwA4oUTeiV1UVNwOHItT0xfa2M'
request = drive_service.files().get_media(fileId=file_id)
fh = io.BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
For the folders part, you can check here how to get it.
For more information, you can check this tutorial and YT video.
Here's just the code that deals specifically with downloading a folder recursively.
I've tried to keep it to the point, omitting code that's described in tutorials already. I expect you to already have the ID of the folder that you want to download.
The part elif not itemType.startswith('application/'): has the purpose of skipping any Drive-format documents. However, the check is overly simplistic, so you might want to improve it or remove it; a possible refinement is sketched after the code.
from __future__ import print_function
import pickle
import os.path
import io
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

# If modifying these scopes, delete the file token.pickle.
SCOPES = ['https://www.googleapis.com/auth/drive.readonly']

def main():
    """Based on the quickStart.py example at
    https://developers.google.com/drive/api/v3/quickstart/python
    """
    creds = getCredentials()
    service = build('drive', 'v3', credentials=creds)
    folderId = ""
    destinationFolder = ""
    downloadFolder(service, folderId, destinationFolder)

def downloadFolder(service, fileId, destinationFolder):
    if not os.path.isdir(destinationFolder):
        os.mkdir(path=destinationFolder)
    results = service.files().list(
        pageSize=300,
        q="parents in '{0}'".format(fileId),
        fields="files(id, name, mimeType)"
    ).execute()
    items = results.get('files', [])
    for item in items:
        itemName = item['name']
        itemId = item['id']
        itemType = item['mimeType']
        filePath = destinationFolder + "/" + itemName
        if itemType == 'application/vnd.google-apps.folder':
            print("Stepping into folder: {0}".format(filePath))
            downloadFolder(service, itemId, filePath)  # Recursive call
        elif not itemType.startswith('application/'):
            downloadFile(service, itemId, filePath)
        else:
            print("Unsupported file: {0}".format(itemName))

def downloadFile(service, fileId, filePath):
    # Note: The parent folders in filePath must exist
    print("-> Downloading file with id: {0} name: {1}".format(fileId, filePath))
    request = service.files().get_media(fileId=fileId)
    fh = io.FileIO(filePath, mode='wb')
    try:
        downloader = MediaIoBaseDownload(fh, request, chunksize=1024*1024)
        done = False
        while done is False:
            status, done = downloader.next_chunk(num_retries=2)
            if status:
                print("Download %d%%." % int(status.progress() * 100))
        print("Download Complete!")
    finally:
        fh.close()
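As mentioned above, the application/ check is crude. One possible refinement (a sketch, relying on the fact that Drive-native documents all carry MIME types under the application/vnd.google-apps. prefix and would need files().export() rather than get_media()) is to replace the if/elif/else inside downloadFolder with:

GOOGLE_APPS_PREFIX = 'application/vnd.google-apps.'

if itemType == GOOGLE_APPS_PREFIX + 'folder':
    print("Stepping into folder: {0}".format(filePath))
    downloadFolder(service, itemId, filePath)  # recursive call
elif itemType.startswith(GOOGLE_APPS_PREFIX):
    # Docs, Sheets, Slides etc. have no binary content for get_media()
    print("Skipping Drive-native document: {0}".format(itemName))
else:
    downloadFile(service, itemId, filePath)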
Please download the 'client_id.json' file as specified in the tutorial link; for downloading it, follow steps 5-7.
In the last line of the code, change "folder_id" to the ID of the folder you want to download from Drive (right-click the folder and enable the share link; the ID is the part of the URL after "id="), and change "savepath" to the path where you want the downloaded folder saved on your system.
from __future__ import print_function
from googleapiclient import discovery
from httplib2 import Http
from oauth2client import file, client, tools
import os, io
from apiclient.http import MediaFileUpload, MediaIoBaseDownload

SCOPES = 'https://www.googleapis.com/auth/drive'
store = file.Storage('storage.json')
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('client_id.json', SCOPES)
    creds = tools.run_flow(flow, store)
DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http()))

def retaining_folder_structure(query, filepath):
    results = DRIVE.files().list(fields="nextPageToken, files(id, name, kind, mimeType)", q=query).execute()
    items = results.get('files', [])
    for item in items:
        # print(item['name'])
        if item['mimeType'] == 'application/vnd.google-apps.folder':
            fold = item['name']
            path = filepath + '/' + fold
            if os.path.isdir(path):
                retaining_folder_structure("'%s' in parents" % (item['id']), path)
            else:
                os.mkdir(path)
                retaining_folder_structure("'%s' in parents" % (item['id']), path)
        else:
            request = DRIVE.files().get_media(fileId=item['id'])
            fh = io.BytesIO()
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while done is False:
                status, done = downloader.next_chunk()
                print("Download %d%%." % int(status.progress() * 100))
            path = filepath + '/' + item['name']
            # print(path)
            with io.open(path, 'wb') as f:
                fh.seek(0)
                f.write(fh.read())

retaining_folder_structure("'folder_id' in parents", 'savepath')

How to import or call a Python module/class in other modules with different file paths

I want to follow this structure for a web (WSGI, PEP 3333) API (educational purposes):
/home/
`--app.py
`--API_module/
    `--__init__.py
    `--api.py
    `--exceptions.py
`--modules/
    `--default/
        `--__init__.py
        `--default.py
app.py calls API_module with something like:
app = API_module.api()
Based on HTTP GET requests, api.py will load modules stored in the directory named modules; for now, I am just loading a module named default.
api.py looks like:
import os
import imp

from exceptions import HTTPError, HTTPException

class API(object):
    def __call__(self, env, start_response):
        self.env = env
        self.method = env['REQUEST_METHOD']
        try:
            body = self.router()
            body.dispatch()
        except HTTPError, e:
            print 'HTTP method not valid %s' % e
        except Exception, e:
            print 'something went wrong'
        start_response(status, headers)
        yield body

    def router(self):
        module_path = '/home/modules/default/default.py'
        if not os.access(module_path, os.R_OK):
            raise HTTPException()
        else:
            py_mod = imp.load_source('default', '/home/modules/default/default.py')
            return py_mod.Resource(self)
and default.py contains something like:
class Resource(object):
    def __init__(self, app):
        self.app = app

    def dispatch(self):
        raise HTTPException()
So far I can import the modules dynamically, but if I raise an exception from the default.py module I get:
global name 'HTTPException' is not defined
Therefore I would like to know how to take advantage of the API_module/exceptions and use them in all the dynamically loaded modules.
Any ideas, suggestions, or comments are welcome.
It's a matter of sys.path.
In your setup, since your API_module is imported from app.py, you should have the root of your application in your sys.path, so you can just import from API_module in the usual way:
from API_module import exceptions
or
from API_module.exceptions import *
based on how you use it.
By the way, I suggest you use pymod = __import__('modules.default.default', fromlist=['']) and put an __init__.py file in the modules/ root directory.
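Putting that together, a minimal sketch of what default.py could look like (an illustration, assuming /home is on sys.path so that API_module is importable as a package):

from API_module.exceptions import HTTPException

class Resource(object):
    def __init__(self, app):
        self.app = app

    def dispatch(self):
        # HTTPException now resolves because it is imported explicitly
        # from the API_module package instead of relying on a global name.
        raise HTTPException()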