How to create an Apache NiFi template to get data from a URL? - templates

I want to get the data from an FTP link and store it as a Hive table.

ashok,
Use the following processors to achieve your requirement:
GetFTP --> PutFile --> ReplaceText --> PutHiveQL
GetFTP --> Gets a file from the FTP server using the hostname, port, username, and password.
PutFile --> Stores the file fetched from FTP on a local drive.
ReplaceText --> Searches the flowfile content and replaces it with your Hive query, which references the PutFile location so the downloaded file can be loaded into Hive.
PutHiveQL --> Executes the Hive query present in the flowfile.
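As a rough illustration of what this flow does end to end, here is a minimal Python sketch of the same idea outside NiFi; the host, credentials, paths, and table name are placeholders, and the LOAD DATA statement is only an example of the kind of query ReplaceText would substitute into the flowfile for PutHiveQL to run.

# Minimal sketch of the GetFTP --> PutFile --> ReplaceText --> PutHiveQL idea.
# All hostnames, credentials, paths, and table names are placeholders.
from ftplib import FTP

local_path = '/tmp/incoming/data.csv'  # where PutFile would drop the file

# GetFTP + PutFile: download the file from the FTP server to a local drive
with FTP('ftp.example.com') as ftp:
    ftp.login(user='username', passwd='password')
    with open(local_path, 'wb') as f:
        ftp.retrbinary('RETR /remote/path/data.csv', f.write)

# ReplaceText: build the Hive query that points at the downloaded file
hive_query = f"LOAD DATA LOCAL INPATH '{local_path}' INTO TABLE my_hive_table"

# PutHiveQL: this is the statement that would then be executed against Hive
print(hive_query)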
Please let me know if you have any queries.

Related

Neo4j: use LOAD CSV to read data from Google Cloud Storage

My original data is from BigQuery. I have created a DAG job to extract the relevant fields (based on a WHERE condition) into a CSV file stored in Google Cloud Storage.
As a next step, I am aiming to use LOAD CSV WITH HEADERS FROM 'gs://link-to-bucket/file.csv' to read the data from the CSV into a Neo4j database.
It seems, however, that I cannot just give the GCS URI as the CSV link. Is there any way to establish a secure connection to read the file, other than making the bucket public?
My attempt
uri = "gs://link-to-bucket/file.csv"

def create_LP_query(uri):
    query_string = f"""
    LOAD CSV WITH HEADERS FROM '{uri}' AS row
    MERGE (l:Limited_Partner:Company {{id: row.id}})
    SET l.Name = row.Name
    """
    return query_string
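For context, the generated query would typically be sent to Neo4j with the official Python driver, roughly as below (the connection URI and credentials are placeholders); as the answer that follows explains, it is the gs:// URI inside LOAD CSV that fails, not the driver call.

# Rough sketch of running the generated query with the neo4j Python driver.
# Connection details are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(create_LP_query(uri))
driver.close()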
It is not possible; you would have to create a Neo4j plugin that acts as a new ProtocolHandler.
I did one in the past for S3; you might take it as inspiration, since it can be similar for GS.
https://github.com/ikwattro/neo4j-load-csv-s3-protocol

Azure Data Factory HDFS dataset preview error

I'm trying to connect to HDFS from ADF. I created a folder and a sample file (ORC format) and put the file in the newly created folder.
Then in ADF I successfully created a linked service for HDFS using my Windows credentials (the same user that was used to create the sample file):
But when trying to browse the data through the dataset:
I'm getting an error: "The response content from the data store is not expected, and cannot be parsed."
Is there something I'm doing wrong, or is this some kind of permissions issue?
Please advise.
This appears to be a generic issue; you need to point to a file with an appropriate extension rather than to the folder itself. Also make sure you are using an activity that is supported for this data store.
You can follow this official MS doc to use an HDFS server with Azure Data Factory.

Connecting Power BI to S3 Bucket

Need some guidance, as I am new to Power BI and Redshift.
My raw JSON data is stored in an Amazon S3 bucket as .gz files (each .gz file has multiple rows of JSON data).
I want to connect Power BI to the Amazon S3 bucket. Based on my research so far, I have three options:
1. Amazon S3 is a web service and supports a REST API. We can try to use the Web data source to get the data.
Question: Is it possible to unzip the .gz file (inside the S3 bucket or inside Power BI), extract the JSON data from S3, and connect it to Power BI?
2. Import the data from Amazon S3 into Amazon Redshift. Do all data manipulation inside Redshift using SQL Workbench. Use the Amazon Redshift connector to get the data into Power BI.
Question 1: Does Redshift allow loading gzipped JSON data from the S3 bucket? If yes, is it directly possible or do I have to write code for it?
Question 2: I have the S3 account; do I have to separately purchase a Redshift account/space? What is the cost?
3. Move the data from the AWS S3 bucket to Azure Data Lake Store via Azure Data Factory, transform it with Azure Data Lake Analytics (U-SQL), and then output it to Power BI.
U-SQL recognizes GZip-compressed files with the .gz extension and automatically decompresses them as part of the extraction process. Is this process valid if my gzipped files contain rows of JSON data?
Please let me know if there is any other method; any other suggestions on this post are also welcome.
Thanks in advance.
About your first question: I recently faced a similar issue (extracting a CSV rather than JSON) and would like to share my solution.
Power BI still doesn't have a direct connector for S3 buckets, but you can do it with a Python script.
Get data --> Python script
P.S.: make sure that the boto3 and pandas libraries are installed in the same folder (or subfolders) as the Python home directory you set in Power BI's options, or in the Anaconda library folder (c:\users\USERNAME\anaconda3\lib\site-packages).
Power BI options window for Python scripting
import boto3
import pandas as pd

bucket_name = 'your_bucket'
folder_name = 'the folder inside your bucket/'
file_name = 'file_name.csv'  # or .json in your case
key = folder_name + file_name

# Placeholders: use the credentials of an IAM user that can read the bucket
AWS_ACCESS_KEY_ID = 'your_access_key_id'
AWS_SECRET_ACCESS_KEY = 'your_secret_access_key'

s3 = boto3.resource(
    service_name='s3',
    region_name='your_bucket_region',  # ex: 'us-east-2'
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

# Download the object and read its body into a dataframe
obj = s3.Bucket(bucket_name).Object(key).get()
df = pd.read_csv(obj['Body'])  # or pd.read_json(obj['Body']) in your case
The dataframe will be imported as a new query (named "df" in this example).
Apparently the pandas library can also read zipped files (.gz, for example). See the following topic: How can I read tar.gz file using pandas read_csv with gzip compression option?
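For the gzipped JSON-rows case in the original question, one possible variation of the script above is to wrap the S3 object's body in a gzip reader and parse it as JSON lines; the bucket, key, and JSON-lines layout are assumptions here.

# Variation for gzipped JSON-lines files (.gz); names are placeholders
import gzip
import boto3
import pandas as pd

s3 = boto3.resource('s3')  # credentials/region as in the script above
obj = s3.Bucket('your_bucket').Object('folder/file_name.json.gz').get()

# Decompress the streaming body and read one JSON object per line
with gzip.GzipFile(fileobj=obj['Body']) as gz:
    df = pd.read_json(gz, lines=True)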

How to read S3 XML files and query them using Hive

I have XML files stored in an AWS S3 bucket. I want to extract the XML metadata and load it into Hive tables on HDFS. Is there any tool that can help expedite this activity?
Well, you might need to use a Hive XML SerDe to read the XML files, or write/use custom UDFs that can understand XML.
Some references that might help: https://community.hortonworks.com/articles/972/hive-and-xml-pasring.html
https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources
https://community.hortonworks.com/questions/47840/how-do-i-do-xml-string-parsing-in-hive.html
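For illustration, a table definition using the Hive-XML-SerDe linked above might look roughly like the following (shown as a query string, in line with the other examples here); the column names, XPath expressions, record tags, and S3 location are all placeholders that would need to match your actual XML layout, and the SerDe jar must be on the Hive classpath.

# Rough sketch of a Hive DDL for XML files, using the Hive-XML-SerDe linked above.
# Column names, XPaths, record tags, and the S3 location are placeholders.
create_xml_table = """
CREATE EXTERNAL TABLE xml_metadata (
    id   STRING,
    name STRING
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
    "column.xpath.id"   = "/record/@id",
    "column.xpath.name" = "/record/name/text()"
)
STORED AS
    INPUTFORMAT  'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3a://your-bucket/xml-folder/'
TBLPROPERTIES (
    "xmlinput.start" = "<record",
    "xmlinput.end"   = "</record>"
)
"""
# The statement could then be run through beeline, the Hive CLI, or a Hive client library.
print(create_xml_table)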

Upload txt file to server

I want to create a form to upload files (txt, xls) to the server, not to the database.
Does anyone know of any example showing how I can do this?
In order to get the file onto the database server's file system, you would first have to upload the file to the database, which it sounds like you are already familiar with. From there, you can use the UTL_FILE package to write the BLOB to the database server's file system.