Save Neo4j Data to Spark RDD (or) DataFrame

Save Neo4j Data to Spark RDD (or) DataFrame - python-2.7

I am retrieving the data from Neo4j using Bolt Driver in Python Language. The returned result should be stored as RDD(or atleast into CSV). I am able to see the returned results but unable to store it as an RDD or a Data frame or atleast into a csv.
Here is how I am seeing the result:
session = driver.session()
result = session.run('MATCH (n) RETURN n.hobby,id(n)')
session.close()
Here, how can I store this data into RDD or CSV file.

I deleted the old post and reposted the same question. But I haven't received any pointers. So, I am posting my way of approach so that it may help others.
'''
Storing the return result into RDD
'''
session = driver.session()
result = session.run('MATCH (n:Hobby) RETURN n.hobby AS hobby,id(n) As id LIMIT 10')
session.close()
'''
Pulling the keys
'''
keys = result.peek().keys()
'''
Reading all the property values and storing it in a list
'''
values=list()
for record in result:
rec= list()
for key in keys:
rec.append(record[key])
values.append(rec)
'''
Converting list of values into a pandas dataframe
'''
df = DataFrame(values, columns=keys)
print df
'''
Converting the pandas DataFrame to Spark DataFrame
'''
sqlCtx = SQLContext(sc)
spark_df = sqlCtx.createDataFrame(df)
print spark_df.show()
'''
Converting the Pandas DataFrame to SparkRdd (via Spark Dataframes)
'''
rdd = spark_df.rdd.map(tuple)
print rdd.take(10)
Any suggestions to improve the efficiency is highly appreciated.

Instead of going from python to spark, why not use the Neo4j Spark connector? I think this would save python from being a bottle neck if you were moving a lot of data. You can put your cypher query inside of the spark session and save it as an RDD.
There has been talk on the Neo4J slack group about a pyspark implementation, which will hopefully be available later this fall. I know the ability to query neo4j from pyspark and sparkr would be very useful.

Related

Django. Moving a redis query outside of a view. Making the query results available to all views

My Django app has read-only "dashboard" views of data in a Pandas DataFrame. The DataFrame is built from a redis database query.
Code snippet below:
# Part 1. Get values from the redis database and load them into a DataFrame.
r = redis.StrictRedis(**redisconfig)
keys = r.keys(pattern="*")
keys.sort()
values = r.mget(keys)
values = [x for x in vals if x != None]
redisDataFrame = pd.DataFrame(map(json.loads, vals))
# Part 2. Manipulate the DataFrame for display
myViewData = redisDataFrame
#Manipulation of myViewData
#Exact steps vary based from one view to the next.
fig = myViewData.plot()
The code for part 1 (query redis) is inside every single view that displays that data. And the views have an update interval of 1 second. If I have 20 users viewing dashboards, the redis database is getting queried 20 times a second.
Because the query sometimes takes several seconds, Django spawns multiple threads, many of which hang and slow down the whole system.
I want to put part 1 (querying the redis database) into its own codeblock. Django will query redis (and make the redisDataFrame object) once per second. Each view will copy redisDataFrame into its own object, but it won't query the redis database over and over again. I think this will help performance.
I see some options for this, but I'm not sure what's the best option. Can you point me in the right direction?
-Custom context processor. I could put the 'Part 1' code into a custom context processor, using sched to execute once per second.
import time, sched
schedule = sched.scheduler(time.time, time.sleep)
r = redis.StrictRedis(**redisconfig)
def query_redis:
keys = r.keys(pattern="*")
keys.sort()
values = r.mget(keys)
values = [x for x in vals if x != None]
retuen redisDataFrame = pd.DataFrame(map(json.loads, vals))
scheduler.enter(1, 1, redis_data_frame = query_redis())
return redisDataFrame
from mysite.context_processors import redisDataFrame
...
myViewData = redisDataFrame
-Celery. I'm not familiar with this, but it's often recommended. That said, Celery uses redis as a "broker" between Python apps. If Celery writes to a redis database, that doesn't help my issue of improving access to redis.
I feel like this issue (multiple users accessing read-only DataFrames) is a common task that's easily solved. I just don't know how to solve it. Can you help?

from Django forms to pandas DataFrame

I am very new to Django, but facing quite a daunting task already.
I need to create multiple forms like this on the webpage where user would provide input (only floating numbers allowed) and then convert these inputs to pandas DataFrame to do data analysis. I would highly appreciate if you could advise how should I go about doing this?
Form needed:

This is a very broad question and I am assuming you are familiar with pandas and python. There might be a more efficient way but this is how I would do it. It should not be that difficult have the user submit the form then import pandas in your view. Create an initial data frame Then you can get the form data using something like this
if form.is_valid():
field1 = form.cleaned_data['field1']
field2 = form.cleaned_data['field2']
field3 = form.cleaned_data['field3']
field4 = form.cleaned_data['field4']
you can then create a new data frame like so:
df2 = pd.DataFrame([[field1, field2], [field3, field4]], columns=list('AB'))
then append the second data frame to the first like so:
df.append(df2)
Keep iterating over the data in this fashion until you have added all the data. After its all been appended you can do you analysis and whatever else you like. You note you can append more data the 2 by 2 thats just for example.
Pandas append docs:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
Django forms docs:
https://docs.djangoproject.com/en/2.0/topics/forms/
The docs are you friend

ValueError Scikit learn. Number of features of model don't match input

I am pretty new to machine learning in general and scikit-learn in specific.
I am trying to use the example given on the site http://scikit-learn.org/stable/tutorial/basic/tutorial.html
For practicing on my own, I am using my own data-set. My data set is divided into two different CSV files:
Train_data.csv (Contains 32 columns, the last column is the output value).
Test_data.csv (Contains 31 columns the output column is missing - Which should be the case, no?)
Test data is one column less than training data..
I am using the following code to learn (using training data) and then predict (using test data).
The issue I am facing is the error:
*ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time*
Here is my code (sorry if it looks completely wrong :( )
import pandas as pd #import the library
from sklearn import svm
mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"] #my csv has header row, and the output label column is named "Desired"
data = mydata.ix[:,:-3] #select all but the last column as data
clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target) #Code from the URL above
test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column
clf.predict(test_data[-1:]) #Code from the URL above
The training data csv labels looks something like this:
Value1,Value2,Value3,Value4,Output
The test data csv labels looks something like this:
Value1,Value2,Value3,Value4.
Thanks :)

Your problem is a Supervised Problem, you have some data in form of (input,output).
The input are the features describing your example and the output is the prediction that your model should respond given that input.
In your training data, you'll have one more attribute in your csv file because in order to train your model you need to give him the output.
The general workflow in sklearn with a Supervised Problem should look like this
X, Y = read_data(data)
n = len(X)
X_train, X_test = X[:n*0.8], X[n*0.8:]
Y_train, Y_test = Y[:n*0.8], Y[n*0.8:]
model.fit(X_train,Y_train)
model.score(X_test, Y_test)
To split your data, you can use train_test_split and you can use several metrics in order to judge your model's performance.
You should check the shape of your data
data.shape
It seems like you're not taking into the account the last 3 columns instead of only the last. Try instead :
data = mydata.ix[:,:-1]

Importing salesforce report data using python

I am new to sfdc . I have report already created by user . I would like to use python to dump the data of the report into csv/excel file.
I see there are couple of python packages for that. But my code gives an error
from simple_salesforce import Salesforce
sf = Salesforce(instance_url='https://cs1.salesforce.com', session_id='')
sf = Salesforce(password='xxxxxx', username='xxxxx', organizationId='xxxxx')
Can i have the basic steps for setting up the API and some example code

This worked for me:
import requests
import csv
from simple_salesforce import Salesforce
import pandas as pd
sf = Salesforce(username=your_username, password=your_password, security_token = your_token)
login_data = {'username': your_username, 'password': your_password_plus_your_token}
with requests.session() as s:
d = s.get("https://your_instance.salesforce.com/{}?export=1&enc=UTF-8&xf=csv".format(reportid), headers=sf.headers, cookies={'sid': sf.session_id})
d.content will contain a string of comma separated values which you can read with the csv module.
I take the data into pandas from there, hence the function name and import pandas. I removed the rest of the function where it puts the data into a DataFrame, but if you're interested in how that's done let me know.

In case it is helpful, I wanted to write out the steps I used to answer this question now (Aug-2018), based on Obol's comment. For reference, I followed the README instructions at https://github.com/cghall/force-retrieve/blob/master/README.md for the salesforce_reporting package.
To connect to Salesforce:
from salesforce_reporting import Connection, ReportParser
sf = Connection(username='your_username',password='your_password',security_token='your_token')
Then, to get the report I wanted into a Pandas DataFrame:
report = sf.get_report(your_reports_id)
parser = salesforce_reporting.ReportParser(report)
report = parser.records_dict()
report = pd.DataFrame(report)
If you were so inclined, you could also simplify the four lines above into one, like so:
report = pd.DataFrame(salesforce_reporting.ReportParser(sf.get_report(your_reports_id)).records_dict())
One difference I ran into from the README is that sf.get_report('report_id', includeDetails=True) threw an error stating get_report() got an unexpected keyword argument 'includeDetails'. Simply removing it out seemed result in the code working fine.
report can now be exported via report.to_csv('report.csv',index=False), or manipulated directly.
EDIT: parser.records() changed to parser.records_dict(), as this allows the DataFrame to have the columns already listed, rather than indexing them numerically.

The code below is rather long and might be just for our use case but the basic idea is the following:
Find out date interval length and additional needed filtering to never run into the "more the 2'000" limit. In my case I could have weekly date range filter but would need to apply some additional filters
Then run it like this:
report_id = '00O4…'
sf = SalesforceReport(user, pass, token, report_id)
it = sf.iterate_over_dates_and_filters(datetime.date(2020,2,1),
'Invoice__c.InvoiceDate__c', 'Opportunity.CustomField__c',
[('a', 'startswith'), ('b', 'startswith'), …])
for row in it:
# do something with the dict
The iterator goes through every week (if you need daily iterators or monthly then you'd need to change the code, but the change should be minimal) since 2020-02-01 and applies the filter CustomField__c.startswith('a'), then CustomField__c.startswith('b'), … and acts as a generator so you don't need to mess with the filter cycling yourself.
The iterator throws an Exception if there's a query which returns more than 2000 rows, just to be sure that the data is not incomplete.
One warning here: SF has a limit of max 500 queries per hour. Say if you have one year with 52 weeks and 10 additional filters you'd already run into that limit.
Here's the class (relies on simple_salesforce)
import simple_salesforce
import json
import datetime
"""
helper class to iterate over salesforce report data
and manouvering around the 2000 max limit
"""
class SalesforceReport(simple_salesforce.Salesforce):
def __init__(self, username, password, security_token, report_id):
super(SalesforceReport, self).__init__(username=username, password=password, security_token=security_token)
self.report_id = report_id
self._fetch_describe()
def _fetch_describe(self):
url = f'{self.base_url}analytics/reports/{self.report_id}/describe'
result = self._call_salesforce('GET', url)
self.filters = dict(result.json()['reportMetadata'])
def apply_report_filter(self, column, operator, value, replace=True):
"""
adds/replaces filter, example:
apply_report_filter('Opportunity.InsertionId__c', 'startsWith', 'hbob').
For date filters use apply_standard_date_filter.
column: needs to correspond to a column in your report, AND the report
needs to have this filter configured (so in the UI the filter
can be applied)
operator: equals, notEqual, lessThan, greaterThan, lessOrEqual,
greaterOrEqual, contains, notContain, startsWith, includes
see https://sforce.co/2Tb5SrS for up to date list
value: value as a string
replace: if set to True, then if there's already a restriction on column
this restriction will be replaced, otherwise it's added additionally
"""
filters = self.filters['reportFilters']
if replace:
filters = [f for f in filters if not f['column'] == column]
filters.append(dict(
column=column,
isRunPageEditable=True,
operator=operator,
value=value))
self.filters['reportFilters'] = filters
def apply_standard_date_filter(self, column, startDate, endDate):
"""
replace date filter. The date filter needs to be available as a filter in the
UI already
Example: apply_standard_date_filter('Invoice__c.InvoiceDate__c', d_from, d_to)
column: needs to correspond to a column in your report
startDate, endDate: instance of datetime.date
"""
self.filters['standardDateFilter'] = dict(
column=column,
durationValue='CUSTOM',
startDate=startDate.strftime('%Y-%m-%d'),
endDate=endDate.strftime('%Y-%m-%d')
)
def query_report(self):
"""
return generator which yields one report row as dict at a time
"""
url = self.base_url + f"analytics/reports/query"
result = self._call_salesforce('POST', url, data=json.dumps(dict(reportMetadata=self.filters)))
r = result.json()
columns = r['reportMetadata']['detailColumns']
if not r['allData']:
raise Exception('got more than 2000 rows! Quitting as data would be incomplete')
for row in r['factMap']['T!T']['rows']:
values = []
for c in row['dataCells']:
t = type(c['value'])
if t == str or t == type(None) or t == int:
values.append(c['value'])
elif t == dict and 'amount' in c['value']:
values.append(c['value']['amount'])
else:
print(f"don't know how to handle {c}")
values.append(c['value'])
yield dict(zip(columns, values))
def iterate_over_dates_and_filters(self, startDate, date_column, filter_column, filter_tuples):
"""
return generator which iterates over every week and applies the filters
each for column
"""
date_runner = startDate
while True:
print(date_runner)
self.apply_standard_date_filter(date_column, date_runner, date_runner + datetime.timedelta(days=6))
for val, op in filter_tuples:
print(val)
self.apply_report_filter(filter_column, op, val)
for row in self.query_report():
yield row
date_runner += datetime.timedelta(days=7)
if date_runner > datetime.date.today():
break

For anyone just trying to download a report into a DataFrame this is how you do it (I added some notes and links for clarifications):
import pandas as pd
import csv
import requests
from io import StringIO
from simple_salesforce import Salesforce
# Input Salesforce credentials:
sf = Salesforce(
username='johndoe#mail.com',
password='<password>',
security_token='<security_token>') # See below for help with finding token
# Basic report URL structure:
orgParams = 'https://<INSERT_YOUR_COMPANY_NAME_HERE>.my.salesforce.com/' # you can see this in your Salesforce URL
exportParams = '?isdtp=p1&export=1&enc=UTF-8&xf=csv'
# Downloading the report:
reportId = 'reportId' # You find this in the URL of the report in question between "Report/" and "/view"
reportUrl = orgParams + reportId + exportParams
reportReq = requests.get(reportUrl, headers=sf.headers, cookies={'sid': sf.session_id})
reportData = reportReq.content.decode('utf-8')
reportDf = pd.read_csv(StringIO(reportData))
You can get your token by following the instructions at the bottom of this page

Pandas dataframe merge issue

I am learning python and pandas via Wes McKinney's Python for Data Analysis. One of the examples in Chapter 2 is a merge of MovieLens data on movie_id that is not working. I think the issue is that in ratings the movie_id is an int64 and in movies it is an object. The merge returns an empty data frame.
I have read some of the previous posts on pandas and automatic data type assignment and found the dtype in pandas.io.parsers.read_table documentation but cant get the type to change.
The original code:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ch02/movielens/movies.dat', sep='::', header=None, names=mnames)
And what my research indicated what should work:
movies = pd.read_table('ch02/movielens/movies.dat', sep='::', header=None, names=mnames, dtype={'movie_id':np.int64})
Unfortunately, the type isn't changed and the merge still returns an empty set. I am running pandas 0.10.1

(Note I haven't looked up the book code, just your post)
First confirm the dtypes:
print ratings_df.dtypes
print movies_df.dtypes
If you find they're different types you could try (let's assume ratings_df.movie_id is object instead of int):
ratings_df.movie_id = ratings_df.movie_id.astype(int)
See if your merge now works.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Save Neo4j Data to Spark RDD (or) DataFrame - python-2.7

Related

Django. Moving a redis query outside of a view. Making the query results available to all views

from Django forms to pandas DataFrame

ValueError Scikit learn. Number of features of model don't match input

Importing salesforce report data using python

Pandas dataframe merge issue

Categories

Resources