Convert non-strict JSON Amazon SNAP metadata to Pandas DataFrame - python-2.7

I am trying to convert this Amazon SNAP grocery sample JSON metadata to a pandas DataFrame in IBM Bluemix (using Python 2.x) and then analyze it with Apache Spark.
I have unzipped the JSON file and uploaded it to an Object Storage container used by the Apache Spark service.
Here is my container connection:
# In[ ]:
credentials_1 = {
    'auth_uri': '',
    'global_account_auth_uri': '',
    'username': 'myUname',
    'password': "myPw",
    'auth_url': 'https://identity.open.softlayer.com',
    'project': 'object_storage_988dfce6_5b93_48fc_9575_198bbed3abfc',
    'project_id': '2c05de8a36d74d32bdbe0eeec7e5a372',
    'region': 'dallas',
    'user_id': '4976489bab7d489f8d2eba681adacb78',
    'domain_id': '8b6bc3e989d644858d7b74f24119447a',
    'domain_name': '1079761',
    'filename': 'meta_Grocery_and_Gourmet_Food.json',
    'container': 'grocery',
    'tenantId': 's31d-8e24c13d9c36f4-43b43b7b993d'
}
I then used Apache Spark's sample code for importing data from the container into a StringIO object.
# In[ ]:
import requests, StringIO, pandas as pd, json, re
# In[ ]:
def get_file_content(credentials):
    """For given credentials, this function returns a StringIO object containing the file content."""
    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', credentials['filename']])
                    s_subject_token = resp1.headers['x-subject-token']
                    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
                    resp2 = requests.get(url=url2, headers=headers2)
                    return StringIO.StringIO(resp2.content)
I then converted the string content to strict JSON by appending [ and ] at the beginning and end and separating the records with commas.
print('----------------------\n')
import json
myDf = []

def parse(data):
    for l in data:
        yield json.dumps(eval(l))

def getDF(data):
    st = '['
    i = 0
    df = []
    for d in parse(data):
        if i < 100:
            i += 1
            # print(str(d))
            st = st + str(d) + ','
            # print('----------------\n')
    st = st[:-1]
    st = st + ']'
    # js = json.loads(st)
    # print(json.dumps(js))
    return pd.read_json(st)
content_string = get_file_content(credentials_1)
df = getDF(content_string)
df.head()
I am getting exactly the result I want:
[Output of the code]
The problem is that when I remove the i < 100 condition, it never completes; the kernel stays busy for over an hour.
Is there a more elegant way to convert the data into a DataFrame?
Also, ijson is not available with Bluemix Notebook.

Let me answer this in two parts:
You can install ijson in your Bluemix Spark service using the command below, then import ijson and use it as needed.
!pip install --user ijson
You can use sqlContext.jsonFile to read the JSON directly from Object Storage instead of defining the schema yourself.
It will even infer the schema for you, and you can then run Spark SQL queries to do whatever you want with the DataFrame.
df = sqlContext.jsonFile("swift://" + objectStorageCreds['container'] + "." + objectStorageCreds['name'] + "/" + objectStorageCreds['filename'])
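For example, here is a minimal sketch of querying the inferred DataFrame with Spark SQL; the temporary table name and the brand column are illustrative (the Amazon metadata samples do include a brand field):
# Register the DataFrame as a temporary table so it can be queried with Spark SQL.
df.registerTempTable("grocery_meta")  # table name is illustrative
top_brands = sqlContext.sql(
    "SELECT brand, COUNT(*) AS n FROM grocery_meta GROUP BY brand ORDER BY n DESC LIMIT 10")
top_brands.show()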
Here is the link to the complete notebook.
If you have to work with a pandas DataFrame, you can simply convert it:
df.toPandas().head()
But this pulls everything onto the driver node, so use it carefully.
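If you only need a sample in pandas, a small sketch that avoids pulling the full dataset to the driver (the row count here is arbitrary):
sample_pdf = df.limit(1000).toPandas()  # bound the rows before converting
sample_pdf.head()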
Thanks,
Charles.

Related

How can I copy data from CSV into QuestDB using Python?

I'm using the psycopg2 module to make queries against QuestDB from Python. I have had some trouble using the copy_from() cursor method to get CSV data into a table. What's the best way to get this into the database?
I'm trying the following:
import pandas as pd
import numpy as np
import psycopg2
import os
conn = psycopg2.connect(user="admin",
                        password="quest",
                        host="127.0.0.1",
                        port="8812",
                        database="qdb")
cursor = conn.cursor()
dest_table = "eur_fr_bulk"
temp_dataframe = "./temp_dataframe.csv"

# input
df = pd.read_csv("./data/eur_fr.csv")
df.to_csv(temp_dataframe, index_label='id', header=False)

f = open(temp_dataframe, 'r')
cursor = conn.cursor()
try:
    cursor.copy_from(f, dest_table)
    conn.commit()
except (Exception, psycopg2.DatabaseError) as error:
    os.remove(temp_dataframe)
    print("Error: %s" % error)
    conn.rollback()
    cursor.close()
cursor.close()
The copy_from() wrapper in psycopg2 executes some SQL in the background that is not yet supported in QuestDB; specifically, it will run
COPY my_table FROM stdin WITH DELIMITER AS ' ' NULL AS '\\N'
The DELIMITER keyword is not yet implemented. As a workaround, you can make the request via HTTP in Python, which might be the most convenient:
import requests
csv = {'data': ('my_table_import', open('./data/eur_fr.csv', 'r'))}
server = 'http://localhost:9000/imp'
response = requests.post(server, files=csv)
print(response.text)
Alternatively, you can specify a copy directory in the server.conf file, which allows loading CSV files with the COPY statement. This is documented on the COPY documentation page.
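As a rough sketch of that second route (the server.conf property name, file name, and table name below are assumptions; check the COPY docs for your QuestDB version):
# Assumes server.conf points the copy root at the directory holding the CSV,
# e.g. cairo.sql.copy.root=/path/to/csv/dir  (property name is an assumption).
import psycopg2

conn = psycopg2.connect(user="admin", password="quest",
                        host="127.0.0.1", port="8812", database="qdb")
cursor = conn.cursor()
# COPY resolves the file name relative to the configured copy root.
cursor.execute("COPY eur_fr_bulk FROM 'eur_fr.csv' WITH HEADER true;")
conn.commit()
cursor.close()
conn.close()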

Fetch a file (.csv) from an S3 bucket and copy it to an RDS database

I want to connect to an S3 bucket, get the CSV files, and copy the rows to an RDS database. This script uses arcpy; I'm not that familiar with this package. I'm just trying to use the CSV file directly from the S3 bucket as the source without downloading it onto the server. The code is as follows:
import arcpy
from boto.s3.key import Key
import StringIO
import pandas as pd
import boto
import boto.s3.connection
access_key = ''
secret_key = ''
conn = boto.connect_s3(aws_access_key_id = access_key,aws_secret_access_key = secret_key,host = 's3.amazonaws.com')
b = conn.get_bucket('mybucket')
#for key in b.list:
b_key = b.get_key('file1.csv')
arcpy.env.overwriteOutput = True
b_url = b_key.generate_url(0, query_auth=False, force_http=True)
print b_url
##Read file
k = Key(b, 'file1.csv')
content = k.get_contents_as_string()
sourcefile_csv = pd.read_csv(StringIO.StringIO(content))
##CopyRows_management (in_rows, out_table, {config_keyword})
#http://pro.arcgis.com/en/pro-app/tool-reference/data-management/copy-rows.htm
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
print("copy rows done")
Error: CopyRows fails with arcgisscripting.ExecuteError: Failed to execute. Parameters are not valid.
If we use a path on the server as source path like below it works fine:
sourcefile_csv = "D:\\DEV\\file1.csv"
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
Any help would be appreciated.
It looks like you are trying to use the pandas DataFrame as the table to read from with CopyRows_management? I don't think that is a valid input for the function, hence the "Parameters are not valid" error. The documentation says that in_rows should be "The rows from a feature class, layer, table, or table view to be copied." I think the use of pandas is unnecessary here anyway.
So either save the csv somewhere that the script can access it (as you did when you used the path on the server) or, if you don't want to save the file anywhere, just read the contents of the csv and iterate through it using an Insert Cursor to write it to your table/feature class.
See this post on how to read a csv from a string using the csv module. Then just loop through the rows of the csv and use the Insert Cursor to write to your table.
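A rough sketch of that approach, reusing the key object from the question; the field names and target table path are placeholders for illustration, so adjust them to your schema:
import csv
import StringIO
import arcpy

content = k.get_contents_as_string()             # CSV text fetched from S3 as above
reader = csv.reader(StringIO.StringIO(content))
next(reader)                                     # skip the header row if there is one

# "FIELD_A"/"FIELD_B" and the table path are placeholders.
with arcpy.da.InsertCursor("RDSTablePath", ["FIELD_A", "FIELD_B"]) as cursor:
    for row in reader:
        cursor.insertRow(row)                    # convert types here if your fields are not text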
If your RDS happens to be Aurora MySQL, you should look into its Loading Data from S3 feature, which lets you skip the code above and load the file straight into your DB.
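A hedged sketch of that route, assuming pymysql is available, the cluster has an IAM role allowing S3 reads, and the connection details, bucket, and table names are placeholders:
import pymysql

conn = pymysql.connect(host="my-aurora-endpoint.rds.amazonaws.com",
                       user="admin", password="secret", db="mydb")
with conn.cursor() as cur:
    # Aurora MySQL extension: load the CSV straight from S3, no local copy needed.
    cur.execute("""
        LOAD DATA FROM S3 's3://mybucket/file1.csv'
        INTO TABLE my_table
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
    """)
conn.commit()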

Replace whitespace with hyphens, then create a URL

I'm trying to speed up a web-scraping process by sending raw data to Python instead of correctly formatted data.
The current data is received as an Excel file, with records formatted like:
26 EXAMPLE RD EXAMPLEVILLE SA 5000
The data is formatted in Excel via macros to:
Replace all spaces with hyphens
Change all text to lower case
Paste the text onto the end of http://example.com/property/
The formatted data is http://www.example.com/property/26-example-rd-exampleville-sa-5000
What I'm trying to accomplish:
Get Python to read the Excel sheet, apply the formatting rules listed above, and then pass the records to the scraper.
Here is the code I have been trying to put together - please go easy, I am VERY new.
Any advice or reading sources related to Python formatting would be appreciated.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import csv
from lxml import html
import xlrd
# URL_BUILDER
# Source File for UNFORMATTED DATA
file_location = r"C:\Python27\Projects\REA_SCRAPER\NewScraper\ScrapeFile.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('((PythonScraperDNC))')
# REA_SCRAPER
# Pass Data from URL_BUILDER to URL_LIST []
URL_LIST = []
# Search Phrase to capture suitable URL's for Scraping
text2search = \
'''<p class="property-value__title">
RECENTLY SOLD
</p>'''

# Write Sales .CSV file
with open('Results.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for (index, url) in enumerate(URL_LIST):
        page = requests.get(url)
        print '<Scanning Url For Sale>'
        if text2search in page.text:
            tree = html.fromstring(page.content)
            (title, ) = (x.text_content() for x in tree.xpath('//title'))
            (price, ) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold, ) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([title, price, sold])
        else:
            writer.writerow(['No Sale'])
If you're just trying to figure out how to do the formatting in Python:
text = '26 EXAMPLE RD EXAMPLEVILLE SA 5000'
url = 'http://example.com/property/' + text.replace(' ', '-').lower()
print(url)
# Output:
# http://example.com/property/26-example-rd-exampleville-sa-5000
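And here is a rough sketch of applying the same formatting to every row of the spreadsheet before handing the URLs to the scraper; it assumes the raw addresses sit in the first column of the ((PythonScraperDNC)) sheet:
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('((PythonScraperDNC))')

URL_LIST = []
for row_idx in range(sheet.nrows):
    address = sheet.cell_value(row_idx, 0)   # assumes addresses are in column A
    if not address:
        continue
    slug = str(address).strip().replace(' ', '-').lower()
    URL_LIST.append('http://example.com/property/' + slug)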

How can I use the nessrest API (Python) to export Nessus scan reports in XML?

I am trying to automate running and downloading Nessus scans using Python. I have been using the nessrest API for Python and can successfully run a scan, but I have not been able to successfully download the report in Nessus format.
Any ideas how I can do this? I have been using the scan_download method, but it actually executes before my scan even finishes.
Thanks for the help in advance!
Just looking back at this question, here's an example of using the nessrest API to pull down CSV report exports from your Nessus host:
#!/usr/bin/python2.7
import sys
import os
import io
from nessrest import ness6rest

file_format = 'csv'  # options: nessus, csv, db, html
dbpasswd = ''
scan = ness6rest.Scanner(url="https://nessus:8834", login="admin",
                         password="P#ssword123", insecure=True)
scan.action(action='scans', method='get')
folders = scan.res['folders']
scans = scan.res['scans']

if scan:
    scan.action(action='scans', method='get')
    folders = scan.res['folders']
    scans = scan.res['scans']
    for f in folders:
        if not os.path.exists(f['name']):
            if not f['type'] == 'trash':
                os.mkdir(f['name'])
    for s in scans:
        scan.scan_name = s['name']
        scan.scan_id = s['id']
        folder_name = next(f['name'] for f in folders if f['id'] == s['folder_id'])
        folder_type = next(f['type'] for f in folders if f['id'] == s['folder_id'])
        # skip trash items
        if folder_type == 'trash':
            continue
        if s['status'] == 'completed':
            file_name = '%s_%s.%s' % (scan.scan_name, scan.scan_id, file_format)
            file_name = file_name.replace('\\', '_')
            file_name = file_name.replace('/', '_')
            file_name = file_name.strip()
            relative_path_name = folder_name + '/' + file_name
            # PDF not yet supported
            # python API wrapper nessrest returns the PDF as a string object instead of a byte object,
            # making writing and correctly encoding the file a chore...
            # other formats can be written out in text mode
            file_modes = 'wb'
            # DB is binary mode
            # if args.format == "db":
            #     file_modes = 'wb'
            with io.open(relative_path_name, file_modes) as fp:
                if file_format != "db":
                    fp.write(scan.download_scan(export_format=file_format))
                else:
                    fp.write(scan.download_scan(export_format=file_format, dbpasswd=dbpasswd))
You can see more examples here:
https://github.com/tenable/nessrest/tree/master/scripts
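If the original problem is that scan_download runs before the scan has finished, a rough workaround sketch (reusing the same status check as the script above) is to poll until the scan reports completed:
import time

def wait_for_completion(scan, scan_id, poll_seconds=30):
    """Poll the scan list until the scan with the given id reports 'completed'."""
    while True:
        scan.action(action='scans', method='get')
        status = next(s['status'] for s in scan.res['scans'] if s['id'] == scan_id)
        if status == 'completed':
            return
        time.sleep(poll_seconds)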

Creating an Excel file using Python

Is there any way to create Excel files using pywin32 even if the MS Office suite is not installed on a Windows-based system?
openpyxl will do what you want - it does not require Excel to be installed.
from openpyxl import Workbook
wb = Workbook()
# grab the active worksheet
ws = wb.active
# Data can be assigned directly to cells
ws['A1'] = 42
# Rows can also be appended
ws.append([1, 2, 3])
# Python types will automatically be converted
import datetime
ws['A2'] = datetime.datetime.now()
# Save the file
wb.save("C:\\Users\\your_path_here\\Desktop\\sample.xlsx")
See this link for details.
https://pypi.python.org/pypi/openpyxl