Connecting Power BI to S3 Bucket - amazon-web-services

I need some guidance as I am new to Power BI and Redshift.
My raw JSON data is stored in an Amazon S3 bucket in the form of .gz files (each .gz file has multiple rows of JSON data).
I want to connect Power BI to the Amazon S3 bucket. Based on my research so far, I have three options:
1. Amazon S3 is a web service and supports the REST API. We can try to use the web data source to get the data.
Question: Is it possible to unzip the .gz file (inside the S3 bucket or inside Power BI), extract the JSON data from S3, and connect it to Power BI?
2. Import the data from Amazon S3 into Amazon Redshift. Do all data manipulation inside Redshift using SQL Workbench. Use the Amazon Redshift connector to get the data into Power BI.
Question 1: Does Redshift allow loading gzipped JSON data from the S3 bucket? If yes, is it directly possible or do I have to write any code for it? (See the COPY sketch after this question.)
Question 2: I already have the S3 account; do I have to purchase Redshift account/space separately? What is the cost?
3. Move the data from the AWS S3 bucket to Azure Data Lake Store via Azure Data Factory, transform it with Azure Data Lake Analytics (U-SQL), and then output it to Power BI.
Question: U-SQL recognizes GZip-compressed files with the file extension .gz and automatically decompresses them as part of the extraction process. Is this process valid if my gzipped files contain JSON data rows?
Please let me know if there is any other method, and share your valuable suggestions on this post.
Thanks in advance.
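
Regarding Question 1 under the Redshift option: Redshift's COPY command can load gzipped, newline-delimited JSON directly from S3 using its GZIP and JSON 'auto' options, so no separate unzip step should be needed. Below is a minimal sketch only; the cluster endpoint, credentials, table name, bucket path, and IAM role ARN are placeholders, and psycopg2 is just one way to send the statement.
import psycopg2

# Placeholders: substitute your own cluster endpoint, database, and credentials
conn = psycopg2.connect(
    host='your-cluster.xxxxxxxx.us-east-2.redshift.amazonaws.com',
    port=5439, dbname='dev', user='awsuser', password='your_password'
)

# COPY pulls the .gz files straight from S3; GZIP decompresses, JSON 'auto' maps fields to columns
copy_sql = """
    COPY staging.raw_events
    FROM 's3://your-bucket/path/to/gz-files/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
    JSON 'auto' GZIP REGION 'us-east-2';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)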

About your first question: I recently faced a similar issue (extracting a CSV rather than JSON) and would like to share my solution.
Power BI still doesn't have a direct connector for S3 buckets, but you can do it using a Python script.
Get data --> Python script
P.S.: make sure the boto3 and pandas libraries are installed in the same folder (or subfolders) as the Python home directory you specified in Power BI's options,
OR in the Anaconda library folder (c:\users\USERNAME\anaconda3\lib\site-packages).
(Screenshot: the Python scripting page in Power BI's options window.)
import boto3
import pandas as pd

# Replace these placeholders with your own credentials
AWS_ACCESS_KEY_ID = 'your_access_key_id'
AWS_SECRET_ACCESS_KEY = 'your_secret_access_key'

bucket_name = 'your_bucket'
folder_name = 'the folder inside your bucket/'
file_name = 'file_name.csv'  # or .json in your case
key = folder_name + file_name

# Connect to S3 with explicit credentials
s3 = boto3.resource(
    service_name='s3',
    region_name='your_bucket_region',  # ex: 'us-east-2'
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

# Fetch the object and read its body into a dataframe
obj = s3.Bucket(bucket_name).Object(key).get()
df = pd.read_csv(obj['Body'])  # or pd.read_json(obj['Body']) in your case
The dataframe will be imported as a new query (named "df" in this example).
Apparently the pandas library can also read a zipped file (.gz, for example). See the following topic: How can I read tar.gz file using pandas read_csv with gzip compression option?
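For the gzipped JSON rows in the original question, a minimal sketch of the same idea (assuming each .gz file contains newline-delimited JSON; bucket, key, and region are placeholders):
import gzip
import io
import boto3
import pandas as pd

s3 = boto3.resource(service_name='s3', region_name='us-east-2')  # plus credentials as above
obj = s3.Bucket('your_bucket').Object('folder/file_name.json.gz').get()

# Decompress the .gz body and parse the newline-delimited JSON rows into a dataframe
buffer = io.BytesIO(obj['Body'].read())
with gzip.open(buffer, mode='rt') as f:
    df = pd.read_json(f, lines=True)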

Related

neo4j use Load CSV to read data from Google Cloud Storage

My original data is from BigQuery. I have created a DAG job to extract the relevant fields, based on a "WHERE" condition, into a CSV file stored in Google Cloud Storage.
As a next step, I am aiming to use LOAD CSV WITH HEADERS FROM "gs://link-to-bucket/file.csv" to read the data from the CSV into a Neo4j database.
It seems, however, that I cannot just give the GCS URI as the CSV link. Is there any way to establish a secure connection to read the file, other than making the bucket public?
My attempt
uri = "gs://link-to-bucket/file.csv"

def create_LP_query(uri):
    query_string = f"""
    LOAD CSV WITH HEADERS FROM '{uri}' AS row
    MERGE (l:Limited_Partner:Company {{id: row.id}})
    SET l.Name = row.Name
    """
    return query_string
It is not possible; you would have to create a Neo4j plugin that acts as a new ProtocolHandler.
I did one in the past for S3; you might take it as inspiration, as it could be similar for GCS.
https://github.com/ikwattro/neo4j-load-csv-s3-protocol

process non csv, json and parquet files from s3 using glue

A little disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue, but from what I saw when I tried to start a new job from a plain graph, the only options I got were the CSV, JSON and Parquet file formats from S3, and my files are not of these types. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command to turn those files into JSON, but it is a command I would need to download onto a machine. Is there any way I can do that and then use Glue on the resulting JSON?
Thanks.
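
One way to sketch the asker's own idea (run the converter, then point Glue at the resulting JSON) is a small script that copies each object down, runs the command, and uploads the output; the bucket names, prefix, and convert-to-json command below are hypothetical placeholders, not a tested recipe.
import subprocess
from pathlib import Path

import boto3

s3 = boto3.client('s3')
SOURCE_BUCKET = 'raw-files-bucket'       # hypothetical
TARGET_BUCKET = 'converted-json-bucket'  # hypothetical
PREFIX = 'incoming/'

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for item in page.get('Contents', []):
        key = item['Key']
        if key.endswith('/'):  # skip "folder" placeholder objects
            continue
        local_in = Path('/tmp') / Path(key).name
        local_out = local_in.with_suffix('.json')

        # Download, convert with the (hypothetical) bash tool, upload the JSON
        s3.download_file(SOURCE_BUCKET, key, str(local_in))
        subprocess.run(['convert-to-json', str(local_in), '-o', str(local_out)], check=True)
        s3.upload_file(str(local_out), TARGET_BUCKET, f'json/{local_out.name}')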

List all forecast CSV files exported to AWS S3 bucket when using AWS Forecast Export Job

I have trained a Predictor on AWS Forecast, and used it to make some forecasts.
I want to get these forecasts as CSV files. To do so, I created a "ForecastExportJob".
After the exportation is done, I can successfully see the CSV files in my S3 bucket.
I would like to download them programmatically, so is there a way to get a list of the S3 keys that correspond to the CSV files created by the "ForecastExportJob" command?
I could list all objects in the destination bucket and filter them, but I am wondering if there is a more elegant solution to my problem.
To put it simply, I would like to know if there is an AWS command that can list the files created by the "ForecastExportJob" command:
electricityforecast_export_job_2021-01-04T06-40-23Z_part0.csv
...
electricityforecast_export_job_2021-01-04T06-40-23Z_part7.csv
Note: I am using boto3
Thank you in advance and happy new year!
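
For reference, a minimal sketch of the "list and filter" fallback the question already mentions, using boto3's list_objects_v2 with the export job name as the key prefix (the bucket name and prefix below are placeholders taken from the file names above):
import boto3

s3 = boto3.client('s3')
bucket = 'your-forecast-export-bucket'
prefix = 'electricityforecast_export_job_2021-01-04T06-40-23Z'

# Collect every exported CSV part under the export job's prefix
keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.csv'):
            keys.append(obj['Key'])

print(keys)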

Splunk migration to S3 DataLake

We're looking at moving away from Splunk as our datastore in favor of an AWS data lake backed by S3.
What would be the process of migrating data from Splunk to S3? I've read lots of documents talking about archiving data from Splunk to S3, but I'm not sure whether this archives the data in a usable format or in some archive format that needs to be restored to Splunk itself.
Check out Splunk's SmartStore feature. It moves your non-hot buckets to S3 so you save storage costs. Running SmartStore on AWS only makes sense, however, if you run Splunk on AWS. Otherwise, the data export charges will bankrupt you. Data export applies when Splunk needs to search a bucket that's stored in S3 and so copies that bucket to an indexer. See https://docs.splunk.com/Documentation/Splunk/8.0.0/Indexer/AboutSmartStore for more information.
From what I've read there are a couple of ways to do it:
Export using the Web UI
Export using REST API Endpoint
Export using CLI
Copy certain files in the filesystem
So far I've tried using the CLI to export, and I've managed to export around 500,000 events at a time using:
splunk search "index=main earliest=11/11/2019:00:00:01 latest=11/15/2019:23:59:59" -output rawdata -maxout 500000 > output2.dmp
However, I'm not sure how I can accurately repeat this step to make sure I include all 100 million+ events, i.e. search from DATE A to DATE B for 500,000 records, then search from DATE B to DATE C for the next 500,000, without missing any events in between.
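
One way to avoid gaps is to drive the same CLI search over contiguous, non-overlapping time windows, where each window's latest becomes the next window's earliest, and to size the windows so no single window exceeds the -maxout limit. A rough sketch only, assuming the splunk binary is on PATH, that earliest is inclusive and latest is exclusive, and that one-day windows are small enough for your data volume:
import subprocess
from datetime import datetime, timedelta

start = datetime(2019, 11, 11)
end = datetime(2019, 11, 16)
window = timedelta(days=1)  # shrink this if a single window could exceed -maxout

current = start
part = 0
while current < end:
    nxt = min(current + window, end)
    # Chaining boundaries (latest of one window == earliest of the next) avoids gaps
    search = (f'index=main earliest={current:%m/%d/%Y:%H:%M:%S} '
              f'latest={nxt:%m/%d/%Y:%H:%M:%S}')
    with open(f'output_part{part}.dmp', 'w') as out:
        subprocess.run(['splunk', 'search', search, '-output', 'rawdata',
                        '-maxout', '500000'], stdout=out, check=True)
    current = nxt
    part += 1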

How to export data from table as CSV from Greenplum database to AWS s3 bucket

I have data in a table
select * from my_table
It contains 10k observations. How do I export the data in the table as a CSV to an S3 bucket?
(I don't want to export the data to my local machine and then push it to S3.)
Please, please, please STOP labeling your questions with both PostgreSQL and Greenplum. The answer to your question is very different if you are using Greenplum versus PostgreSQL. I can't stress this enough.
If you are using Greenplum, you should use the S3 protocol in external tables to read and write data to S3.
So your table:
select * from my_table;
And your external table:
CREATE WRITABLE EXTERNAL TABLE ext_my_table (LIKE my_table)
LOCATION ('s3://s3_endpoint/bucket_name')
FORMAT 'TEXT' (DELIMITER '|' NULL AS '' ESCAPE AS E'\\');
And then writing to your s3 bucket:
INSERT INTO ext_my_table SELECT * FROM my_table;
You will also need to do some configuration on your Greenplum cluster so that you have an s3 configuration file; it goes in every segment data directory:
gpseg_data_dir/gpseg-prefixN/s3/s3.conf
Example of the file contents:
[default]
secret = "secret"
accessid = "user access id"
threadnum = 3
chunksize = 67108864
More information on S3 can be found here: http://gpdb.docs.pivotal.io/5100/admin_guide/external/g-s3-protocol.html#amazon-emr__s3_config_file
I'd suggest first loading the data onto your master node using WinSCP or another file transfer tool.
Then move the file from your master node to S3 storage.
Moving data from the master node to S3 uses Amazon's bandwidth, so it will be much faster than using your local connection's bandwidth to transfer the file from a local machine to S3.
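
If you go that route, the "move from the master node to S3" step can be a one-liner with the AWS CLI or a few lines of boto3 run on the master node; a minimal sketch, where the local path, bucket name, and key are placeholders and credentials are assumed to come from the environment or an instance role:
import boto3

# Credentials are resolved from env vars, ~/.aws/credentials, or an attached IAM role
s3 = boto3.client('s3')
s3.upload_file('/data/my_table.csv', 'your-bucket-name', 'exports/my_table.csv')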