Regex in spark.read.json

I want to read, from a Hadoop directory, all JSON files whose timestamp is one hour before the current time.
The file names look like test_2020021418553333.
import java.util.Calendar;
import java.text.SimpleDateFormat;
val form = new SimpleDateFormat("yyyyMMddhh");
val c = Calendar.getInstance();
c.add(Calendar.HOUR, -1);
val path = "/Test_" + form.format(c.getTime()) + "*";
val test_df = spark.read.json(path)
When I run this code, I get a "Path does not exist" error.
Can anyone suggest how to read file names like Test_20200214<any combination of digits>?

A quick test shows what the format actually produces:
form.format(c.getTime())
res2: String = 2020021401
So remove the last two characters.
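Two more things stand out in the question itself: Java's "hh" is the 12-hour clock (18:00 formats as "06"), so the pattern should be "yyyyMMddHH", and the glob uses "/Test_" while the files are named test_... in lowercase. A minimal PySpark sketch of the corrected idea, assuming the files sit under a hypothetical HDFS directory /data:

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-last-hour").getOrCreate()

# "%Y%m%d%H" is the 24-hour equivalent of Java's "yyyyMMddHH".
prefix = (datetime.now() - timedelta(hours=1)).strftime("%Y%m%d%H")

# Lowercase "test_" to match the actual file names; /data is a stand-in directory.
path = "/data/test_" + prefix + "*"
test_df = spark.read.json(path)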
Regards

Related

Code is not able to find my function in Python (Spark) class

I need some help with an error in my code. My code retrieves Zomato reviews, stores them in HDFS, then reads them back and performs recommender analytics on them. My function is not being recognized in the PySpark code. I am not pasting the whole code, as it might be confusing, so I am writing a small, similar use case for easier understanding.
I am trying to read a file from local disk, convert it from an RDD to a DataFrame, perform some operations, convert it back to an RDD, and run a map operation that joins the fields with '|' before saving it to HDFS.
When I try to call self.filter_data(y) in the lambda of the check function, it is not recognized and gives me this error:
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
Can anyone help me understand why my filter_data function is not recognized? Do I need to add anything, or is there something wrong with the way I am calling it? Please help me. Thanks in advance.
INPUT VALUE
starting
0|0|ffae4f|0|https://b.zmtcdn.com/data/user_profile_pictures/565/aed32fa2eb18bb4a5a3ba426870fd565.jpg?fit=around%7C100%3A100&crop=100%3A100%3B%2A%2C%2A|https://www.zomato.com/akellaram87?utm_source=api_basic_user&utm_medium=api&utm_campaign=v2.1|2.5|FFBA00|Well...|unknown|16946626|2017-08-01T00-25-43.455182Z|30059877|Have been here for a quick bite for lunch, ambience and everything looked good, food was okay but presentation was not very appealing. We or...|2017-04-15 16:38:38|Big Foodie|6|Venkata Ram Akella|akellaram87|Bad Food|0.969352505662|0|0|0|0|0|0|1|1|0|0|1|0|0|0.782388212399
ending
starting
1|0|ffae4f|0|https://b.zmtcdn.com/data/user_profile_pictures/4d1/d70d7a57e1bfdf296ff4db3d8daf94d1.jpg?fit=around%7C100%3A100&crop=100%3A100%3B%2A%2C%2A|https://www.zomato.com/users/sm4-2011696?utm_source=api_basic_user&utm_medium=api&utm_campaign=v2.1|1|CB202D|Avoid!|unknown|16946626|2017-08-01T00-25-43.455182Z|29123338|Giving a 1.0 rating because one cannot proceed with writing a review, without rating it. This restaurant deserves a 0 star rating. The qual...|2017-01-04 10:54:53|Big Foodie|4|Sm4|unknown|Bad Service|0.964402034541|0|1|0|0|0|0|0|1|0|0|0|1|0|0.814540622345
ending
My code:
if __name__ == '__main__':
    import os, logging, sys, time, pandas, json
    from subprocess import PIPE, Popen, call
    from datetime import datetime, time, timedelta
    from pyspark import SparkContext, SparkConf
    conf = SparkConf().setAppName('test')
    sc = SparkContext(conf=conf, pyFiles=['/bdaas/exe/nlu_project/spark_classifier.py', '/bdaas/exe/spark_zomato/other_files/spark_zipcode.py', '/bdaas/exe/spark_zomato/other_files/spark_zomato.py', '/bdaas/exe/spark_zomato/conf_files/spark_conf.py', '/bdaas/exe/spark_zomato/conf_files/date_comparision.py'])
    from pyspark.sql import Row, SQLContext, HiveContext
    from pyspark.sql.functions import lit
    sqlContext = HiveContext(sc)
    import sys, logging, pandas as pd
    import spark_conf
    n = new()
    n.check()

class new:
    def __init__(self):
        print 'entered into init'

    def check(self):
        data = sc.textFile('file:///bdaas/src/spark_dependencies/classifier_data/final_Output.txt').map(lambda x: x.split('|')).map(lambda z: Row(restaurant_id=z[0], rating=z[1], review_id=z[2], review_text=z[3], rating_color=z[4], rating_time_friendly=z[5], rating_text=z[6], time_stamp=z[7], likes=z[8], comment_count=z[9], user_name=z[10], user_zomatohandle=z[11], user_foodie_level=z[12], user_level_num=z[13], foodie_color=z[14], profile_url=z[15], profile_image=z[16], retrieved_time=z[17]))
        data_r = sqlContext.createDataFrame(data)
        data_r.show()
        d = data_r.rdd.collect()
        print d
        data_r.rdd.map(lambda x: list(x)).map(lambda y: self.filter_data(y)).collect()
        print data_r

    def filter_data(self, y):
        s = str()
        for i in y:
            print i.encode('utf-8')
            if i != '':
                s = s + i.encode('utf-8') + '|'
        print s[0:-1]
        return s[0:-1]
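For what it's worth, the usual workaround for SPARK-5063-style errors is to pass a plain module-level function to map instead of a bound method, so Spark does not have to serialize the driver-side object (and everything it references) to the workers. A minimal sketch under that assumption, with filter_fields as an illustrative name, not the asker's final code:

# Hypothetical standalone version of filter_data: no self, so no
# driver-side object gets pulled into the closure shipped to workers.
def filter_fields(row):
    # Join the non-empty fields with '|', as the class method does.
    return '|'.join(i.encode('utf-8') for i in row if i != '')

# Driver-side call; only the plain function is serialized to workers.
data_r.rdd.map(list).map(filter_fields).collect()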

Web crawling and extracting data using Scrapy

I am new to Python as well as Scrapy.
I am trying to crawl the seed URL https://www.health.com/patients/status/. This seed URL contains many URLs, but I want to fetch only the URLs that contain Faci/Details/#somenumber. The URLs look like this:
https://www.health.com/patients/status/ ->https://www.health.com/Faci/Details/2
-> https://www.health.com/Faci/Details/3
-> https://www.health.com/Faci/Details/4
https://www.health.com/Faci/Details/2 -> https://www.health.com/provi/details/64
-> https://www.health.com/provi/details/65
https://www.health.com/Faci/Details/3 -> https://www.health.com/provi/details/70
-> https://www.health.com/provi/details/71
Inside each https://www.health.com/Faci/Details/2 page there are links such as https://www.health.com/provi/details/64 and https://www.health.com/provi/details/65. Finally, I want to fetch some data from the https://www.health.com/provi/details/#somenumber URLs. How can I achieve this?
As of now I have tried the code below from the Scrapy tutorial, and I am able to crawl only the URLs that contain https://www.health.com/Faci/Details/#somenumber. It is not going on to https://www.health.com/provi/details/#somenumber. I tried to set a depth limit in the settings.py file, but it didn't work.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from news.items import NewsItem
class MySpider(CrawlSpider):
    name = 'provdetails.com'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    rules = (
        Rule(LinkExtractor(allow=('/Faci/Details/\d+', )), follow=True),
        Rule(LinkExtractor(allow=('/provi/details/\d+', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = NewsItem()
        item['id'] = response.xpath("//title/text()").extract()
        item['name'] = response.xpath("//title/text()").extract()
        item['description'] = response.css('p.introduction::text').extract()
        filename = 'details.txt'
        with open(filename, 'wb') as f:
            f.write(item)
        self.log('Saved file %s' % filename)
        return item
Please help me to proceed further.
To be honest, the regex-based and mighty Rule/LinkExtractor combination often gave me a hard time. For a simple project, it may be easier to extract all links on the page and then look at the href attribute. If the href matches your needs, yield a new Request object for it. For instance:
from scrapy.http import Request
from scrapy.selector import Selector
...
    # follow links
    for href in sel.xpath('//div[@class="contentLeft"]//div[@class="pageNavigation nobr"]//a').extract():
        linktext = Selector(text=href).xpath('//a/text()').extract_first()
        if linktext and linktext[0] == "Weiter":
            link = Selector(text=href).xpath('//a/@href').extract()[0]
            url = response.urljoin(link)
            print url
            yield Request(url, callback=self.parse)
Some remarks on your code:
response.xpath(...).extract()
This returns a list; you may want to look at extract_first(), which provides the first item (or None).
with open(filename, 'wb') as f:
This will overwrite the file several times, so you only keep the last item saved. You also open the file in binary mode ('b'); from the filename I guess you want to read it as text, so use 'a' to append? See the open() docs.
An alternative is to use the -o flag to let Scrapy store the items as JSON or CSV.
return item
It is good style to yield items instead of returning them; at the least, if you need to create several items from one page, you have to yield them.
Another good approach is to use one parse() function per type/kind of page, as in the sketch below.
For instance, every page in start_urls ends up in parse(). From there you could extract the links and yield a Request for each /Faci/Details/N page with a callback such as parse_faci_details(). In parse_faci_details() you again extract the links of interest, create Requests, and pass them via callback= to e.g. parse_provi_details(). In that function you create the items you need.
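A minimal sketch of that layered approach, assuming the URL patterns from the question (the callback names and XPaths are illustrative, not something from the asker's project):

import scrapy

class ProviSpider(scrapy.Spider):
    name = 'provdetails'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    def parse(self, response):
        # Seed page: follow only the /Faci/Details/N links.
        for href in response.xpath('//a/@href').extract():
            if '/Faci/Details/' in href:
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_faci_details)

    def parse_faci_details(self, response):
        # Facility page: follow the /provi/details/N links.
        for href in response.xpath('//a/@href').extract():
            if '/provi/details/' in href:
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_provi_details)

    def parse_provi_details(self, response):
        # Provider page: yield the actual item.
        yield {
            'name': response.xpath('//title/text()').extract_first(),
            'description': response.css('p.introduction::text').extract_first(),
        }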

Web-scraping with Python: NoneType error, can't scrape table's data

This is my first attempt at coding, so please forgive my daftness. I'm trying to learn web scraping by practising on this link:
https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0
I've honestly spent hours trying to figure out what's wrong with my code here:
import csv
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('tbody')
list_of_rows = []
for row in table.find('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        list_of_cells.append()
    list_of_rows.append(list_of_cells)

outfile = open("./indarb.csv", "wb")
writer = csv.writer(outfile)
My terminal then spits out 'NoneType' object has no attribute 'find', saying there's an error on line 13. Not sure if it helps, but this is a list of what I've tried:
Different permutations of 'find'/'findAll':
- instead of '.find', used '.findAll'
- instead of '.findAll', used '.find'
Different permutations for line 10:
- tried soup.find('tbody')
- tried soup.find('table')
- opened the source code and tried soup.find('table', attrs={'class':'table table-condensed'})
Different permutations for line 13:
- similarly tried with just the 'tr' tag
- tried adding 'attrs={}' stuff
I've really tried but can't figure out why I can't scrape that simple 10-row table. If anyone could post code that works, that'd be phenomenal. Thank you for your patience!
The URL you request in your code is not HTML but JSON.
You have a few mistakes. The biggest is that you are using BeautifulSoup 3, which has not been developed for years; you should use bs4. You also need to use find_all when you want multiple tags. Finally, you have not passed cell to list_of_cells.append() on line 13, which is the cause of your other error:
import requests
from bs4 import BeautifulSoup

url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table')

list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all('td'):
        list_of_cells.append(cell)
    list_of_rows.append(list_of_cells)
I am not sure exactly what you want, but this appends the tds from the first table on the page. There is also an API you can use, and a downloadable CSV, if you actually want the data. A short continuation below shows writing the collected rows to the CSV the question sets up.
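A hedged sketch of that last step, reusing list_of_rows from the answer's code (Python 2 style, to match the question; cell.get_text() is the bs4 accessor for a tag's text):

import csv

with open('./indarb.csv', 'wb') as outfile:
    writer = csv.writer(outfile)
    for row in list_of_rows:
        # Write each cell's text, not the Tag objects themselves.
        writer.writerow([cell.get_text().strip().encode('utf-8')
                         for cell in row])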

parsing text file using python for fields and their values

I am new to coding in Python, learning it step by step. I have an assignment to parse a text file and update a database. The file has some status fields like:
ticket summary:
Frequency :
Action taken: < something like "restarted server">
Status:
I want to parse this file, fetch the values for fields like "ticket summary" and "frequency", and put them in the database, where columns for them are defined. I have been reading about Python regex and substring parsing, but I can't figure out how to start. I need help.
Since you haven't included an example, I will provide one to use here. I am assuming you have already installed MongoDB and pymongo and have an instance running locally. As an example, I am choosing my file to have the following formatting:
Frequency: restarted server, Action taken- none, Status: active
The code will use regex to extract the fields. For demos, have a look here: regex demo
import pymongo
import re

client = pymongo.MongoClient()
db = client['some-db']

freqgroup = re.compile(r"(?P<Frequency>Frequency[\:\-]\s?).*?(?=,)", flags=re.I)
actgroup = re.compile(r"(?P<ActionTaken>Action\sTaken[\:\-]\s?).*?(?=,)", flags=re.I)
statgroup = re.compile(r"(?P<Status>Status[\:\-]\s?).*", flags=re.I)

with open("some-file.txt", "rb") as f:
    for line in f:
        k = re.search(freqgroup, line)
        db.posts.insert_one({"Frequency": line[k.end("Frequency"):k.end()]})
        k = re.search(actgroup, line)
        db.posts.insert_one({"ActionTaken": line[k.end("ActionTaken"):k.end()]})
        k = re.search(statgroup, line)
        db.posts.insert_one({"Status": line[k.end("Status"):k.end()]})

Python library to access a CalDAV server

I run ownCloud on my webspace for a shared calendar. Now I'm looking for a suitable python library to get read only access to the calendar. I want to put some information of the calendar on an intranet website.
I have tried http://trac.calendarserver.org/wiki/CalDAVClientLibrary but it always returns a NotImplementedError with the query command, so my guess is that the query command doesn't work well with the given library.
What library could I use instead?
I recommend the caldav library.
Read-only access works really well with this library and looks straightforward to me. It does the whole job of fetching calendars and reading events, returning them in iCalendar format. More information about the caldav library can be found in its documentation.
import caldav

client = caldav.DAVClient(<caldav-url>, username=<username>,
                          password=<password>)
principal = client.principal()
for calendar in principal.calendars():
    for event in calendar.events():
        ical_text = event.data
From there on, you can use the icalendar library to read specific fields such as the type (e.g. event, todo, alarm), name, times, etc.; a good starting point may be this question.
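A tiny hedged sketch of that parsing step, feeding the ical_text from the snippet above into icalendar (variable names carried over from that snippet):

from icalendar import Calendar

cal = Calendar.from_ical(ical_text)
for component in cal.walk():
    if component.name == "VEVENT":
        # Pull out a couple of common fields as an example.
        print(component.get('summary'), component.get('dtstart').dt)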
I wrote this code a few months ago to fetch data from CalDAV and present it on my website.
I have converted the data into JSON format, but you can do whatever you want with it.
I have added some prints for you to see the output; you can remove them in production.
from datetime import datetime
import json
from pytz import UTC  # timezone
import caldav
from icalendar import Calendar, Event

# CalDAV info
url = "YOUR CALDAV URL"
userN = "YOUR CALDAV USERNAME"
passW = "YOUR CALDAV PASSWORD"

client = caldav.DAVClient(url=url, username=userN, password=passW)
principal = client.principal()
calendars = principal.calendars()

if len(calendars) > 0:
    calendar = calendars[0]
    print ("Using calendar", calendar)
    results = calendar.events()

    eventSummary = []
    eventDescription = []
    eventDateStart = []
    eventdateEnd = []
    eventTimeStart = []
    eventTimeEnd = []

    for eventraw in results:
        event = Calendar.from_ical(eventraw._data)
        for component in event.walk():
            if component.name == "VEVENT":
                print (component.get('summary'))
                eventSummary.append(component.get('summary'))
                print (component.get('description'))
                eventDescription.append(component.get('description'))
                startDate = component.get('dtstart')
                print (startDate.dt.strftime('%m/%d/%Y %H:%M'))
                eventDateStart.append(startDate.dt.strftime('%m/%d/%Y'))
                eventTimeStart.append(startDate.dt.strftime('%H:%M'))
                endDate = component.get('dtend')
                print (endDate.dt.strftime('%m/%d/%Y %H:%M'))
                eventdateEnd.append(endDate.dt.strftime('%m/%d/%Y'))
                eventTimeEnd.append(endDate.dt.strftime('%H:%M'))
                dateStamp = component.get('dtstamp')
                print (dateStamp.dt.strftime('%m/%d/%Y %H:%M'))
                print ('')

# Modify or change these values based on your CalDAV
# Converting to JSON
data = [{'Events Summary': eventSummary[0],
         'Event Description': eventDescription[0],
         'Event Start date': eventDateStart[0],
         'Event End date': eventdateEnd[0],
         'At:': eventTimeStart[0],
         'Until': eventTimeEnd[0]}]
data_string = json.dumps(data)
print ('JSON:', data_string)
pyOwnCloud could be the right thing for you. I haven't tried it, but it should provide a command-line interface/API for reading the calendars.
You probably want to provide more details about how you are actually using the API, but in case the query command is indeed not implemented, there is a list of other Python libraries at the CalConnect website (archived version; the original link is dead now).