How to Load Data into remote Neo4j AWS instance? - amazon-web-services

I want to import data into a Neo4j instance brought up in AWS (Community Edition from the AWS Marketplace). One option is to convert the data to CSV, run the LOAD CSV command in the Neo4j UI, and point it at a public HTTP address that reads from S3. This, however, means exposing the file publicly, which would expose sensitive data. How else can I import this data?
Thanks!

I would suggest using one of the Neo4j drivers, such as the Python or Java driver. Here is a Python example that I used in my posts:
def store_to_neo4j(distances):
    # `driver` is a Neo4j driver instance created elsewhere;
    # `distances` maps (source, target) tuples to a weight
    data = [{'source': el[0], 'target': el[1], 'weight': distances[el]} for el in distances]
    with driver.session() as session:
        session.run("""
        UNWIND $data as row
        MERGE (c:Character{name:row.source})
        MERGE (t:Character{name:row.target})
        MERGE (c)-[i:INTERACTS]-(t)
        SET i.weight = coalesce(i.weight,0) + row.weight
        """, {'data': data})
You don't want to execute the import row by row. Instead, batch, say, 1000 rows into a single parameter and use the UNWIND operator to import them in one query.
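The batching idea can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the helper names `chunked` and `import_in_batches` are hypothetical, and `session` is assumed to be an open Neo4j driver session (anything with a `run(query, parameters)` method works).

```python
from typing import Any, Dict, Iterable, Iterator, List

# Same Cypher as in the answer; $data is a list of row maps.
IMPORT_QUERY = """
UNWIND $data AS row
MERGE (c:Character {name: row.source})
MERGE (t:Character {name: row.target})
MERGE (c)-[i:INTERACTS]-(t)
SET i.weight = coalesce(i.weight, 0) + row.weight
"""


def chunked(rows: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive lists of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[Dict[str, Any]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def import_in_batches(session, rows, batch_size: int = 1000) -> int:
    """Run IMPORT_QUERY once per batch; return the number of batches sent."""
    sent = 0
    for batch in chunked(rows, batch_size):
        session.run(IMPORT_QUERY, {"data": batch})
        sent += 1
    return sent
```

Each call sends one query per batch, so 2500 rows with the default batch size result in three round trips instead of 2500.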

Related

Cloud function that queries Oracle database

I need some help building a Cloud Function that queries an Oracle database. I wrote the Python code and pushed it to a repo, and the function runs it via an HTTP trigger, so that part works.
Connecting to Oracle requires the Oracle Client library, which I uploaded to a Cloud Storage bucket.
The repo, the bucket, and the function are all set up in the same region, yet the function throws an error saying it can't configure the Oracle Client library.
Here is the code, in case it matters:
import cx_Oracle

def queryOracleDatabase(request):
    # Oracle Database Connection
    username = 'x'
    password = 'y'
    connStr = '00.00.00.00:0000/abcd'
    try:
        conn = cx_Oracle.connect(username, password, connStr)
    except cx_Oracle.DatabaseError as e:
        # e.args holds a single error object; unpack it to read .message
        error, = e.args
        print('Error: ', error.message)
        return
    # Execute the query
    try:
        cursor = conn.cursor()
        cursor.execute('select * from table')  # placeholder query
        data = cursor.fetchall()
    except cx_Oracle.DatabaseError as e:
        error, = e.args
        print('Error: ', error.message)
        return
    # Clean up
    cursor.close()
    conn.close()
    return data
And this is the error it throws
Error: DPI-1047: Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory"
How to connect the function with the bucket?
Given the additional context in the comments, Cloud Functions don't fit this use case well, since they don't provide a persistent disk where you could store your Oracle Client library.
Cloud Functions is a very specialized service. That specialization gives it a gentle learning curve and makes it the best choice when the use case and tech stack fit (e.g., no need for a file system beyond /tmp, and no need to customize the runtime or OS).
When the use case does require some customization of the container in which the function runs, Cloud Run is the better fit. By defining a Docker container you can place the Oracle Client library anywhere you need in its file system and run your function, reusing your current code almost as is.
I presume your tech stack is fairly standard, so I would check Docker Hub for an image based on Python, and maybe even one that includes the Oracle SDK you need. It would be an easy starting point.
About accessing the Oracle Client library hosted in a bucket: the Cloud Function might download it to /tmp storage, but I'm not sure you can actually load the library from there. Storing libraries in buckets is unusual in my experience.

neo4j use Load CSV to read data from Google Cloud Storage

My original data is in BigQuery. I have created a DAG job that extracts the relevant fields, filtered by a WHERE condition, into a CSV file stored in Google Cloud Storage.
As a next step, I want to use "LOAD CSV WITH HEADERS FROM gs://link-to-bucket/file.csv" to read the data from the CSV into a Neo4j database.
It seems, however, that I cannot just give the GCS URI as the CSV link. Is there any way to establish a secure connection to read the file, other than making the bucket public?
My attempt
uri = "gs://link-to-bucket/file.csv"

def create_LP_query(uri):
    query_string = f"""
    LOAD CSV WITH HEADERS FROM '{uri}' AS row
    MERGE (l:Limited_Partner:Company {{id: row.id}})
    SET l.Name = row.Name """
    return query_string
This is not possible out of the box. You would have to create a Neo4j plugin that acts as a new ProtocolHandler.
I wrote one for S3 in the past; you might take it as inspiration, since a GS version would be similar.
https://github.com/ikwattro/neo4j-load-csv-s3-protocol
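A different route, sketched here and not taken from the plugin answer, is to skip LOAD CSV, read the CSV on the client side with authenticated credentials, and send the rows through a Neo4j driver with UNWIND. The sketch below only covers parsing and takes the CSV as text. How the text is fetched (for instance with the google-cloud-storage client's `download_as_text()`) is left out, and the names `rows_from_csv_text` and `LP_QUERY` are hypothetical.

```python
import csv
import io
from typing import Dict, List

# Mirrors the MERGE from the question, but reads rows from a query
# parameter instead of LOAD CSV. Hypothetical constant name.
LP_QUERY = """
UNWIND $rows AS row
MERGE (l:Limited_Partner:Company {id: row.id})
SET l.Name = row.Name
"""


def rows_from_csv_text(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header line into a list of dicts.

    All values stay strings, as with LOAD CSV WITH HEADERS.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Usage sketch (not executed here), assuming an authenticated storage
# bucket object and an open Neo4j session:
#   csv_text = bucket.blob("file.csv").download_as_text()
#   session.run(LP_QUERY, {"rows": rows_from_csv_text(csv_text)})
```

With this approach the bucket stays private, because only the client that runs the script needs read access to the file.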

Creating REST API in GCP to read data from BigQuery

I am very new to Google Cloud Platform, hence this basic question.
I am looking for an API that will be hosted in GCP. An external application will call the API to read data from BigQuery.
Can anyone help me with example code or an approach?
I am looking for an end-to-end, cloud-based solution in Python.
I can't provide you with a complete code example, but:
You can set up your Python API using Flask, for example.
You can then use the Python client to connect to BigQuery: https://cloud.google.com/bigquery/docs/reference/libraries
Deploy your Python API on Google App Engine, Cloud Run, Kubernetes, Compute Engine, etc.
Do not forget to set up CORS and, if needed, authentication.
That's it.
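As a rough illustration of the CORS step, the headers for an allowed origin could be computed with a small helper like the one below. This is only a sketch: the function name `cors_headers`, the allowed methods, and the allowed request headers are assumptions, and a real deployment may prefer a framework extension instead.

```python
from typing import Dict, Iterable


def cors_headers(origin: str, allowed_origins: Iterable[str]) -> Dict[str, str]:
    """Return CORS response headers for `origin`, or {} if it is not allowed."""
    if origin not in set(allowed_origins):
        return {}
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": "GET, OPTIONS",
        "Access-Control-Allow-Headers": "Content-Type, Authorization",
        # Tell caches that the response depends on the request's Origin header.
        "Vary": "Origin",
    }
```

The returned dictionary can be merged into the response headers of whichever framework serves the API.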
You can create a Python program using the BigQuery client, then deploy it as an HTTP Cloud Function or a Cloud Run service:
from google.cloud import bigquery
import functions_framework

@functions_framework.http
def your_http_function(request):
    # HTTP Cloud Function.
    request_json = request.get_json(silent=True)
    request_args = request.args
    # Example of retrieving the 'name' parameter from the HTTP call
    # (not used by the sample query below)
    name = None
    if request_json and 'name' in request_json:
        name = request_json['name']
    elif request_args and 'name' in request_args:
        name = request_args['name']
    # Construct a BigQuery client object.
    client = bigquery.Client()
    query = """
        SELECT name, SUM(number) as total_people
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        WHERE state = 'TX'
        GROUP BY name, state
        ORDER BY total_people DESC
        LIMIT 20
    """
    query_job = client.query(query)  # Make an API request.
    rows = query_job.result()  # Waits for query to finish
    # Collect the rows into a JSON-serializable structure; the row
    # iterator itself cannot be returned as an HTTP response.
    names = []
    for row in rows:
        print(row.name)
        names.append(row.name)
    return {'names': names}
In this example, you deploy your Python code as a Cloud Function.
The function can then be invoked over HTTP with a name parameter:
https://GCP_REGION-PROJECT_ID.cloudfunctions.net/hello_http?name=NAME
You can also use Cloud Run, which gives more flexibility because you deploy a Docker image.

Upload to BigQuery from Cloud Storage

I have ~50k compressed (gzip) JSON files daily that need to be loaded into BQ with some transformation, no API calls. Files may be up to 1 GB in size.
What is the most cost-efficient way to do it?
I'd appreciate any help.
The most efficient way is to use Cloud Data Fusion.
I would suggest the approach below:
Create a Cloud Function triggered on every new file upload to uncompress the file.
Create a Data Fusion job with the GCS file as source and BigQuery as sink, with the desired transformation.
See my YouTube video below.
https://youtu.be/89of33RcaRw
Here is one way, for example: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
... but a quick look shows that it has some specific limitations. So perhaps simplicity, customization, and maintainability of the solution should also be added to your "cost" function.
Without knowing some details (for example, see the "Limitations" section under my link above, what stack you have or are willing and able to use, file names, whether your files have nested fields, etc.), my first thought is the Cloud Functions service ( https://cloud.google.com/functions/pricing ) "listening" (event type = Finalize/Create) to the Cloud Storage bucket where your files land. If you go this route, put your storage and function in the same region if possible, which will make it cheaper.
If you can code in Python, here is some starter code:
main.py
import io

import pandas as pd
from google.cloud import storage

def process(event, context):
    file = event
    # check if it's your file; you can also check for patterns in the name
    if file['name'] == 'YOUR_FILENAME':
        try:
            working_file = file['name']
            storage_client = storage.Client()
            bucket = storage_client.get_bucket('your_bucket_here')
            blob = bucket.blob(working_file)
            # https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
            zipbytes = io.BytesIO(blob.download_as_bytes())
            # print for logging
            print(f"file downloaded, {working_file}")
            # read file as df --- docs: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
            # if nested, you might need to go text --> dictionary and then do some preprocessing
            # (read_json has no low_memory parameter, so it is not passed here)
            df = pd.read_json(zipbytes, compression='gzip')
            # write processed data to BigQuery (requires pandas-gbq)
            df.to_gbq(destination_table='your_dataset.your_table',
                      project_id='your_project_id',
                      if_exists='append')
            print(f"table bq created, {working_file}")
            # to delete the processed file from your storage and save on storage costs, uncomment the 2 lines below
            # blob.delete()
            # print(f"blob delete, {working_file}")
        except Exception as e:
            print(f"exception occurred {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
pandas-gbq
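For the nested-field case mentioned in the code comments (text to dictionary, then preprocessing), a minimal standard-library sketch might look like this. It assumes each file holds newline-delimited JSON objects, which the question does not state, and the helper names `flatten` and `records_from_gzip` are hypothetical.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def flatten(obj: Dict[str, Any], prefix: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts: {"a": {"b": 1}} becomes {"a_b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def records_from_gzip(raw: bytes) -> Iterator[Dict[str, Any]]:
    """Decompress gzip bytes and yield one flattened dict per non-empty line."""
    text = gzip.decompress(raw).decode("utf-8")
    for line in text.splitlines():
        if line.strip():
            yield flatten(json.loads(line))
```

The flattened records could then be passed to `pd.DataFrame(...)` before the upload step. Note that this version decompresses the whole file in memory, so for files near the upper size limit a streaming reader would be kinder to the function's memory.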
PS
Some alternatives include:
1. Starting up a VM, running your script on a schedule, and shutting the VM down once the process is done (for example Cloud Scheduler --> Pub/Sub --> Cloud Function --> which starts up your VM --> which then runs your script)
2. Using App Engine to run your script (similar)
3. Using Cloud Run to run your script (similar)
4. Using Composer/Airflow (not similar to 1, 2 & 3) [could use all types of approaches including data transfers etc.; I'm just not sure what stack you are trying to use or what you already have running]
5. Scheduling a Vertex AI Workbench notebook (not similar to 1, 2 & 3; basically write up a Jupyter notebook and schedule it to run in Vertex AI)
6. Trying to query the files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and scheduling that query (https://cloud.google.com/bigquery/docs/scheduling-queries) to run (but again, not sure about your overall pipeline)
To me, the setup for all of these (except #5 & #6) is not worth the technical debt if you can get away with functions.
Best of luck,

Run Redshift Queries Periodically

I have started researching Redshift. It is described as a "Database" service in AWS. From what I have learnt so far, we can create tables and ingest data from S3 or from external sources like Hive into a Redshift database (cluster). We can also use a JDBC connection to query these tables.
My questions are:
Is there a place within the Redshift cluster where we can store our queries and run them periodically (e.g., daily)?
Can we store our query in an S3 location and use it to write output to another S3 location?
Can we load a DB2 table unload file with a mixture of binary and string fields into Redshift directly, or do we need an intermediate process to convert the data into something like CSV?
I have done some Googling about this. If you have links to resources, that would be very helpful. Thank you.
I used the cursor method from the psycopg2 package in Python. Sample code is given below. You have to set all the Redshift credentials in the env_vars file.
You can set your queries using cursor.execute. Here I show one UPDATE query; put your own query in its place (you can run multiple queries). After that, schedule this Python file with crontab or any other scheduler to run your queries periodically.
import psycopg2

import env_vars

# Build the connection string from the credentials in env_vars
rs = env_vars.RedshiftVariables
conn_string = "dbname=%s port=%s user=%s password=%s host=%s" % (
    rs.REDSHIFT_DW,
    rs.REDSHIFT_PORT,
    rs.REDSHIFT_USERNAME,
    rs.REDSHIFT_PASSWORD,
    rs.REDSHIFT_HOST,
)
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("""UPDATE database.demo_table SET Device_id = '123' where Device = 'IPHONE' or Device = 'Apple'; """)
conn.commit()
conn.close()
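To run several stored queries from one scheduled script, as the answer suggests, a sketch like the following could work. It assumes a DB-API connection such as the one returned by `psycopg2.connect` and a text of semicolon-separated statements. The helper names are hypothetical, and the naive split on `;` would break on semicolons inside string literals.

```python
from typing import List


def split_statements(sql_text: str) -> List[str]:
    """Split SQL text on ';' and drop empty pieces.

    Naive on purpose: it does not handle ';' inside string literals.
    """
    return [part.strip() for part in sql_text.split(";") if part.strip()]


def run_statements(conn, statements: List[str]) -> int:
    """Execute each statement on `conn`, commit once, return the count.

    Rolls back and re-raises if any statement fails.
    """
    cursor = conn.cursor()
    try:
        for statement in statements:
            cursor.execute(statement)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cursor.close()
    return len(statements)
```

Committing once at the end keeps the batch atomic: either every statement takes effect or none does. A crontab entry such as `0 6 * * * python3 /path/to/run_queries.py` (the path is hypothetical) would then run the script daily at 06:00.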