How to read, modify, and overwrite parquet files in S3 using Spark? - amazon-web-services

I am trying to read a bunch of parquet files from S3 into a Spark dataframe using df = spark.read.parquet("s3a://my-bucket/path1/path2/*.parquet").
Will this read all the Parquet files present at any level inside path2 (e.g. path2/path3/...file.parquet), or only the files present directly under path2 (e.g. path2/file1.parquet)?
Will df now contain the complete filenames/filepaths (object keys) of all these Parquet files?
While processing the contents of a single Parquet file as a dataframe, I want to modify the dataframe and overwrite it inside the same file. How can I do that? Even if the operation deletes the old version of the file and creates a new file (with a new filename), that's fine, but I don't want any files other than the one currently under consideration to be affected in any way by this operation.
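A minimal sketch of one possible approach, with placeholder paths and a placeholder transformation: input_file_name() is how Spark exposes the object key each row was read from, and the rewrite-via-scratch-prefix step is an assumption (Spark always writes a directory of part files rather than a single named object), not something the question confirms.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rewrite-single-parquet").getOrCreate()

src = "s3a://my-bucket/path1/path2/file1.parquet"   # hypothetical single file to rewrite
tmp = "s3a://my-bucket/path1/_tmp_rewrite/"         # hypothetical scratch prefix

# input_file_name() exposes the full object key each row was read from,
# which is how you can tell which physical file a row belongs to.
df = spark.read.parquet(src).withColumn("source_file", F.input_file_name())

# ... apply whatever modification is needed ...
modified = df.drop("source_file")

# Spark writes a directory of part files, not a single named object, so write
# to a scratch prefix first, then copy the part file over the original key
# (e.g. with boto3) and delete the old object.
modified.coalesce(1).write.mode("overwrite").parquet(tmp)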

Related

Difference between two version of files in S3

I have a bucket in S3 with versioning enabled. A file comes in periodically and updates its contents. There is a unique identifier in that file, and sometimes when the new file comes in, content from the existing file is missing and needs to be retained.
My goal here is to end up with a file that has all the contents of the new file plus everything from the old file that is not in the new one.
I have a small Python script which does the job, and I can schedule it on an S3 trigger as well, but is there any AWS implementation for this? For example, an S3 -> XXXX service that would give the changes between the files (not line by line, though) and maybe create a new file.
my python code looks something like:
import pandas as pd

old_file = 'file1.1.txt'
new_file = 'file1.2.txt'
output_file = 'output_pd.txt'
# Read the old and new files into Pandas dataframes
old_df = pd.read_csv(old_file, sep="\t", header=None)
new_df = pd.read_csv(new_file, sep="\t", header=None)
# Find the values that are present in the old file and missing in the new file
missing_values = old_df[~old_df.iloc[:, 0].isin(new_df.iloc[:, 0])]
# Append the missing values to the new file (DataFrame.append is removed in pandas 2.x)
final_df = pd.concat([new_df, missing_values], ignore_index=True)
# Write the final dataframe to a new file
final_df.to_csv(output_file, sep=' ', index=False, header=False)
But I am looking for a native AWS solution / best practice.
but is there any AWS implementation for this issue?
No, there is no native AWS implementation for comparing files' content. You have to implement that yourself, as you have done. You can host your code as a Lambda function which will be automatically triggered by S3 uploads.
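For reference, a rough sketch of what hosting that comparison logic as an S3-triggered Lambda might look like. The output key, the assumption that at least two object versions exist, and the tab-separated output are all hypothetical, and pandas would need to be packaged as a Lambda layer.
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an S3 ObjectCreated event for the key that was just updated.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # With versioning enabled, the previous version of the same key is the "old" file.
    versions = s3.list_object_versions(Bucket=bucket, Prefix=key)["Versions"]
    versions = sorted(versions, key=lambda v: v["LastModified"], reverse=True)
    new_body = s3.get_object(Bucket=bucket, Key=key, VersionId=versions[0]["VersionId"])["Body"]
    old_body = s3.get_object(Bucket=bucket, Key=key, VersionId=versions[1]["VersionId"])["Body"]

    new_df = pd.read_csv(io.BytesIO(new_body.read()), sep="\t", header=None)
    old_df = pd.read_csv(io.BytesIO(old_body.read()), sep="\t", header=None)

    # Same merge logic as the script in the question.
    missing_values = old_df[~old_df.iloc[:, 0].isin(new_df.iloc[:, 0])]
    final_df = pd.concat([new_df, missing_values], ignore_index=True)

    out = io.StringIO()
    final_df.to_csv(out, sep="\t", index=False, header=False)
    s3.put_object(Bucket=bucket, Key="merged/" + key, Body=out.getvalue())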

Athena - CTAS file name

I used Athena's CTAS and INSERT commands, and Avro files were created at the external_location.
But the file names are very strange and the filename extension has also disappeared. (The files don't have any filename extension; they only have strange, hash-like names.)
How can I define a filename rule for Athena's output files?
Thank you.
As stated on page 20 of the AWS Athena manual: "This location in Amazon S3 comprises all of the files representing your table. For more information, see Using Folders in the Amazon Simple Storage Service Console User Guide."
Reference:
https://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf
So, no, you can't define the name of the file (or files, because more than one may be needed to represent a table). But the right way to think about it is that the bucket/path is what represents the file name, or the output table.
We might get confused because you're generating an Avro file, which really is a single file, like Parquet, but remember that Athena can also output to other formats, which may be multi-file.
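If the goal is simply to see which objects a CTAS query produced, a small sketch of listing them under the external_location prefix with boto3 (bucket and prefix are placeholders):
import boto3

s3 = boto3.client("s3")

# The hash-named files under the CTAS external_location collectively make up
# the table, regardless of what the individual objects are called.
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="athena-ctas-output/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])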

Is it possible to validate the column order when uploading data from flat files using aws copy command

I'm uploading data from zipped flat files to Redshift using the COPY command. I would like to understand whether there is any way to validate that the column order of the files is correct (for example, if the fields are all varchar, the data could be loaded into the wrong columns).
The COPY command documentation shows that you can specify the column order, but not for flat files. I was wondering whether there are any other approaches that would allow me to check how the columns have been supplied (for example, loading only the header row into a dummy table to check, but that doesn't seem to be possible).
You can't really do this inside Redshift. COPY doesn't provide any options to only load a specific number of rows or perform any validation.
Your best option would be to do this in the tool where you schedule the loads. You can get the first line from a compressed file easily enough (zcat < file.z|head -1) but for a file on S3 you may have to download the whole thing first.
FWIW, the process generating the load file should be fully automated in such a way that the column order can't change. If these files are being manually prepared you're asking for all sorts of trouble.
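As a sketch of the "check the header before running the COPY" idea: with a gzip-compressed file you can stream just enough bytes from S3 to read the first line rather than downloading the whole object. The bucket, key, delimiter, and expected column list below are all assumptions.
import gzip
import boto3

def read_header(bucket, key, delimiter="|"):
    """Stream a gzipped flat file from S3 and return its header columns."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    # GzipFile wraps the streaming body, so only the bytes needed for the
    # first line are actually pulled from S3.
    with gzip.GzipFile(fileobj=body) as gz:
        first_line = gz.readline().decode("utf-8").rstrip("\r\n")
    return first_line.split(delimiter)

# Abort the scheduled load if the column order is not what COPY expects.
expected = ["id", "name", "created_at"]
if read_header("my-bucket", "loads/file1.txt.gz") != expected:
    raise ValueError("Column order mismatch; not running COPY")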

What is the best approach to load data into Hive using NiFi?

I have started working with NiFi. I am working on a use case to load data into Hive. I get a CSV file and then I use SplitText to split the incoming flow-file into multiple flow-files (split record by record). Then I use ConvertToAvro to convert each split CSV file into an Avro file. After that, I put the Avro files into a directory in HDFS and I trigger the "LOAD DATA" command using the ReplaceText + PutHiveQL processors.
I'm splitting the file record by record because I need to get the partition value (since LOAD DATA doesn't support dynamic partitioning). The flow looks like this:
GetFile (CSV) --- SplitText (split line count :1 and header line count : 1) --- ExtractText (Use RegEx to get partition fields' values and assign to attribute) --- ConvertToAvro (Specifying the Schema) --- PutHDFS (Writing to a HDFS location) --- ReplaceText (LOAD DATA cmd with partition info) --- PutHiveQL
The thing is, since I'm splitting the CSV file one record at a time, it generates too many Avro files. For example, if the CSV file has 100 records, it creates 100 Avro files. Since I want to get the partition values, I have to split it one record at a time. I want to know whether there is any way to achieve this without splitting record by record, i.e. by batching it. I'm quite new to this, so I haven't been able to crack it yet. Help me with this.
PS: Please suggest if there is any alternative approach for this use case.
Are you looking to group the Avro records based on the partitions' values, one Avro file per unique value? Or do you only need the partitions' values for some number of LOAD DATA commands (and use a single Avro file with all the records)?
If the former, then you'd likely need a custom processor or ExecuteScript, since you'd need to parse, group/aggregate, and convert all in one step (i.e. for one CSV document). If the latter, then you can rearrange your flow into:
GetFile -> ConvertCSVToAvro -> PutHDFS -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> ReplaceText -> PutHiveQL
This flow puts the entire CSV file (as a single Avro file) into HDFS, then afterwards it does the split (after converting to JSON since we don't have an EvaluateAvroPath processor), gets the partition value(s), and generates the Hive DDL statements (LOAD DATA).
If you've placed the file, using the PutHDFS processor, at the location the Hive table reads its data from, then you don't need to call the PutHiveQL processor. I am also new to this, but I think you should leverage the schema-on-read capability of Hive.

Read CSV File in Django and then write it to DB Postgresql

How can I read a CSV file, parse the values, and then output it to a particular database table?
That's the basic problem.
Here is a 'bigger picture' of what I'm trying to do:
I'm trying to either read from multiple CSV files every minute, or read from an ever-updating CSV file (which gains additional rows with every update) every minute.
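For the basic CSV-to-table step, a minimal sketch as a Django management command; the Reading model, its fields, and the CSV layout are hypothetical, and scheduling it every minute (e.g. via cron or Celery beat) is left out.
# app/management/commands/import_readings.py
import csv
from django.core.management.base import BaseCommand
from app.models import Reading  # hypothetical model backed by a Postgres table

class Command(BaseCommand):
    help = "Load rows from a CSV file into the Reading table"

    def add_arguments(self, parser):
        parser.add_argument("csv_path")

    def handle(self, *args, **options):
        with open(options["csv_path"], newline="") as f:
            reader = csv.DictReader(f)  # assumes the CSV has a header row
            rows = [
                Reading(sensor=row["sensor"], value=row["value"])
                for row in reader
            ]
        # bulk_create issues a multi-row INSERT against Postgres
        Reading.objects.bulk_create(rows, ignore_conflicts=True)
        self.stdout.write("Imported %d rows" % len(rows))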