My original data is in BigQuery. I have created a DAG job to extract the relevant fields, based on a "WHERE" condition, into a CSV file stored in Google Cloud Storage.
As a next step, I am aiming to use "LOAD CSV WITH HEADERS FROM gs://link-to-bucket/file.csv" to read the data from the CSV into a Neo4j database.
It seems, however, that I cannot just give the GCS URI as the CSV link. Is there any way to establish a secure connection to read the file, other than making the bucket public?
My attempt
uri = "gs://link-to-bucket/file.csv"
def create_LP_query(uri):
    query_string = f"""
    LOAD CSV WITH HEADERS FROM '{uri}' AS row
    MERGE (l:Limited_Partner:Company {{id: row.id}})
    SET l.Name = row.Name
    """
    return query_string
It is not possible out of the box; you would have to create a Neo4j plugin that acts as a new protocol handler.
I did one in the past for S3; you might take it as inspiration, since it can be done similarly for GCS.
https://github.com/ikwattro/neo4j-load-csv-s3-protocol
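For reference, once the file is reachable through a URL scheme Neo4j does accept (for example via such a protocol-handler plugin, or any HTTPS location), the query built in the attempt above could be executed with the official neo4j Python driver. A minimal sketch, with the Bolt URI and credentials as placeholders:

from neo4j import GraphDatabase

# Placeholders: adjust the Bolt URI, credentials, and CSV location to your setup
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run(create_LP_query("gs://link-to-bucket/file.csv"))

driver.close()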
Related
I'm working with a large CSV (400M+ lines) located in a GCS bucket. I need to get a random sample of this CSV and export it to BigQuery for a preliminary exploration. I've looked all over the web and I just can't seem to find anything that addresses this question.
Is this possible and how do I go about doing it?
You can query your CSV file directly from BigQuery using external tables.
Try it with the TABLESAMPLE clause:
SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT)
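For example, a minimal sketch that runs the sampled query from Python with the google-cloud-bigquery client (the dataset and table names above are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT)"
rows = client.query(sql).result()  # iterator over the sampled rows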
You can create an external table over GCS (to read directly from GCS) and then do something like this:
SELECT * FROM `<project>.<dataset>.<externalTableFromGCS>`
WHERE CAST(10*RAND() AS INT64) = 0
The result of the SELECT can be stored back in GCS with an export, or stored in a table with an INSERT ... SELECT.
Keep in mind that you still need to read the full file (and thus pay for the full file size) and only then query a subset of it; you can't load only 10% of the volume in BigQuery.
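A hedged Python sketch of that flow with the google-cloud-bigquery client; the project, dataset, table names, and bucket path are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Register an external table that reads the CSV straight from GCS
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://link-to-bucket/file.csv"]
external_config.autodetect = True
table = bigquery.Table("my-project.my_dataset.external_from_gcs")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Sample roughly 10% of the rows and store the result in a native table
destination = bigquery.TableReference.from_string("my-project.my_dataset.sampled_rows")
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition="WRITE_TRUNCATE",
)
sql = """
SELECT * FROM `my-project.my_dataset.external_from_gcs`
WHERE CAST(10*RAND() AS INT64) = 0
"""
client.query(sql, job_config=job_config).result()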
There is no direct way to load a sample of records from GCS into BigQuery, but we can achieve it another way: GCS lets you download only a specific byte range of a file, so the following simple Python code can load sample records into BigQuery from a very large GCS file.
from google.cloud import storage
from google.cloud import bigquery
gcs_client = storage.Client()
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format='CSV', autodetect=True, max_bad_records=1)
bucket = gcs_client.get_bucket("your-bucket")
blob = storage.Blob('gcs_path/file.csv', bucket)
with open('local_file.csv', 'wb') as f:  # download a small byte range of the file
    gcs_client.download_blob_to_file(blob, f, start=0, end=2000)
with open('local_file.csv', "rb") as source_file:  # upload the sample to BigQuery
    job = bq_client.load_table_from_file(source_file, 'your-proj.dataset.table_id', job_config=job_config)
job.result()  # wait for the load job to finish
The code above downloads the first 2 KB of the huge GCS file, but the last row in the downloaded CSV may be incomplete, since we cannot align the byte range with row boundaries. The trickier part is "max_bad_records=1" in the BigQuery job config, which makes the load job ignore that incomplete last row.
I have a NiFi flow which fetches data from RDS tables and loads it into S3 as flat files. Now I need to generate another file whose content is the name of the file I am loading into the S3 bucket; this needs to be a separate flow.
Example: if the RDS-extracted flat file is named RDS.txt, then the newly generated file should have rds.txt as its content, and I need to load this file into the same S3 bucket.
The problem I face is that I am using a GenerateFlowFile processor and adding the flat file name as custom text in the flow file, but I cannot set up any upstream connection for the GenerateFlowFile processor, so it keeps generating more files. If I use a MergeContent processor after the GenerateFlowFile processor, I see duplicate values in the flow file.
Can anyone help me out with this?
The easiest path is to chain something after PutS3Object that updates the flow file contents with what you want. It would be really simple to write with ExecuteScript. Something like this:
import org.apache.nifi.processor.io.OutputStreamCallback

def ff = session.get()
if (ff) {
    // overwrite the flow file content with its own filename attribute
    def updated = session.write(ff, {
        it.write(ff.getAttribute("filename").bytes)
    } as OutputStreamCallback)
    updated = session.putAttribute(updated, "is_updated", "true")
    session.transfer(updated, REL_SUCCESS)
}
Then you can put a RouteOnAttribute processor after PutS3Object and have it route to a dead-end relationship if it detects the is_updated attribute, or route back to PutS3Object if the flow file has not been updated yet.
I found a simple solution for this: I added a funnel before PutS3Object. The funnel's upstream receives two files, one with the extract and the other with the file name, and its downstream is connected to PutS3Object, so both files are loaded at the same time.
I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process the data using Glue and there is no need to have a table registered in the Glue Data Catalog, then there is no need to run a Glue crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or the Glue Catalog, as sketched below.
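A minimal PySpark sketch of that job, assuming a standard Glue Python job (getSourceWithFormat() is the Scala API; create_dynamic_frame.from_options is its Python counterpart). The output bucket and output format are placeholders:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read every sales.json under the bucket, recursing through the store/date folders
sales = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/"], "recurse": True},
    format="json",
)

# ...apply any required transformations here...

# Write the combined result out (placeholder target bucket and format)
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/combined/"},
    format="parquet",
)
job.commit()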
Yes, Glue is a great tool for this!
Use a crawler to create a table in the Glue Data Catalog (remember to set "Create a single schema for each S3 path" under "Grouping behavior for S3 data" when creating the crawler).
Read more about it here
Then you can use Relationalize to flatten out your JSON structure (see the sketch below); read more about that here.
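A short Python sketch of that second step, assuming the crawler has already registered a catalog table; the database, table, and staging path names are placeholders:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table the crawler registered in the Glue Data Catalog
sales = glue_context.create_dynamic_frame.from_catalog(
    database="my_glue_database", table_name="my_bucket_table"
)

# Relationalize flattens the nested JSON into a collection of flat frames
flattened = Relationalize.apply(
    frame=sales, staging_path="s3://my-bucket/tmp/", name="root"
)
root = flattened.select("root")  # the flattened top-level records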
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one record per line" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: there does appear to be some tooling now for JSONL, which is the format AWS actually expects, making this less of an automatic win for CSV. I would say that if your data is already in JSON format, it's probably smarter to convert it to JSONL than to CSV.
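For example, a minimal sketch of that pre-processing step, assuming each source file holds a single JSON array (the file names are placeholders):

import json

# Rewrite a JSON array file as JSON Lines: one object per line
with open("sales.json") as src, open("sales.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")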
I need some guidance, as I am new to Power BI and Redshift.
My raw JSON data is stored in an Amazon S3 bucket in the form of .gz files (each .gz file has multiple rows of JSON data).
I want to connect Power BI to the Amazon S3 bucket. Based on my research so far, I see three ways:
Amazon S3 is a web service and supports the REST API. We can try to use the Web data source to get the data.
Question: Is it possible to unzip the .gz file (inside the S3 bucket or inside Power BI), extract the JSON data from S3, and connect it to Power BI?
Import the data from Amazon S3 into Amazon Redshift. Do all data manipulation inside Redshift using SQL Workbench. Use the Amazon Redshift connector to get the data into Power BI.
Question 1: Does Redshift allow loading gzipped JSON data from the S3 bucket? If yes, is it directly possible or do I have to write any code for it?
Question 2: I have the S3 account; do I have to separately purchase a Redshift account/space? What is the cost?
Move the data from the AWS S3 bucket to Azure Data Lake Store via Azure Data Factory, transform the data with Azure Data Lake Analytics (U-SQL), and then output the data to Power BI.
U-SQL recognizes GZip-compressed files with the .gz extension and automatically decompresses them as part of the extraction process. Is this process valid if my gzipped files contain JSON data rows?
Please let me know if there is any other method; your suggestions on this post are also welcome.
Thanks in advance.
About your first question: I've just faced a similar issue recently (but extracting a CSV) and I would like to share my solution.
Power BI still doesn't have a direct plugin to download from S3 buckets, but you can do it using a Python script.
Get data --> Python Script
P.S.: make sure that the boto3 and pandas libraries are installed in the same folder (or subfolders) as the Python home directory you set in Power BI's options,
OR in the Anaconda library folder (c:\users\USERNAME\anaconda3\lib\site-packages).
[Screenshot: Power BI options window for Python scripting]
import boto3
import pandas as pd
bucket_name= 'your_bucket'
folder_name= 'the folder inside your bucket/'
file_name = r'file_name.csv' # or .json in your case
key=folder_name+file_name
s3 = boto3.resource(
    service_name='s3',
    region_name='your_bucket_region',  # ex: 'us-east-2'
    aws_access_key_id=AWS_ACCESS_KEY_ID,            # your AWS access key
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY     # your AWS secret key
)
obj = s3.Bucket(bucket_name).Object(key).get()
df = pd.read_csv(obj['Body']) # or pd.read_json(obj['Body']) in your case
The dataframe will be imported as a new query (named "df" in this example).
Apparently the pandas library can also read a gzipped file (.gz, for example). See the following topic: How can I read tar.gz file using pandas read_csv with gzip compression option?
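For the .gz JSON case from the question, a hedged sketch of that last step: wrap the S3 stream in gzip and parse one JSON object per line (the s3 resource, bucket_name, and pd come from the script above; the object key is a placeholder):

import gzip

obj = s3.Bucket(bucket_name).Object('folder/file.json.gz').get()  # placeholder key

# Decompress the stream, then parse line-delimited JSON into a dataframe
with gzip.GzipFile(fileobj=obj['Body']) as gz:
    df = pd.read_json(gz, lines=True)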
Is there a way to have the DynamoDB rows for each user backed up to S3 as a CSV file?
Then, using streams, when a row is mutated, change that row in the CSV file in S3.
The CSV readers that are currently out there are geared towards parsing the CSV for use within the Lambda.
Instead, I would like to find a specific row, given by the stream, and then replace it with another row without having to load the whole file into memory, as it may be quite big. The reason I would like a backup on S3 is that in the future I will need to do batch processing on it, and reading 300k files from DynamoDB within a short period of time is not preferable.
Read the data from S3, parse it as CSV using your favorite library and update it, then write it back to S3:
import io
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
with io.BytesIO() as data:
    bucket.download_fileobj('my_key', data)
    # parse csv data and update as necessary
    # ...
    # rewind the buffer, then write it back to s3
    data.seek(0)
    bucket.upload_fileobj(data, 'my_key')
Note that S3 does not support object append or in-place update, if that was what you were hoping for (see here). You can only read and overwrite a whole object. You might take this into account when designing your system.
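A rough sketch of the stream-triggered update, assuming a Lambda handler, a string partition key named "id", and a single value column named "payload" (these names, plus the bucket and key, are placeholders); note that it still reads and rewrites the whole object, per the limitation above:

import io
import boto3
import pandas as pd

s3 = boto3.resource('s3')

def handler(event, context):
    obj = s3.Object('mybucket', 'my_key')
    # keep the id column as strings so it matches the DynamoDB key values
    df = pd.read_csv(io.BytesIO(obj.get()['Body'].read()), dtype={'id': str})

    for record in event['Records']:
        row_id = record['dynamodb']['Keys']['id']['S']        # assumed key name
        new_image = record['dynamodb'].get('NewImage', {})
        # replace the matching row's value with the new value from the stream
        df.loc[df['id'] == row_id, 'payload'] = new_image.get('payload', {}).get('S')

    out = io.StringIO()
    df.to_csv(out, index=False)
    obj.put(Body=out.getvalue().encode('utf-8'))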