Difference between two versions of files in S3 - amazon-web-services

I have a bucket in S3 with versioning enabled. A file comes in periodically and its contents get updated. Each record in that file has a unique identifier, and sometimes records that exist in the current file are no longer present in the new file; those records need to be retained.
My goal is to produce a file that has all the contents of the new file plus everything from the old file that is missing from the new one.
I have a small Python script that does the job, and I can schedule it on an S3 trigger as well, but is there any native AWS implementation for this? For example, S3 -> XXXX service that would give the differences between the files (not line by line, though) and maybe create a new file.
My Python code looks something like this:
import pandas as pd

old_file = 'file1.1.txt'
new_file = 'file1.2.txt'
output_file = 'output_pd.txt'

# Read both versions into pandas DataFrames (tab-separated, no header row)
old_df = pd.read_csv(old_file, sep="\t", header=None)
new_df = pd.read_csv(new_file, sep="\t", header=None)

# Find the rows whose identifier (first column) is present in the old file
# but missing from the new file
missing_values = old_df[~old_df.iloc[:, 0].isin(new_df.iloc[:, 0])]

# Append the missing rows to the new file's contents
# (DataFrame.append is removed in pandas 2.x, so use pd.concat instead)
final_df = pd.concat([new_df, missing_values], ignore_index=True)

# Write the merged result out, keeping the tab-separated format of the input
final_df.to_csv(output_file, sep="\t", index=False, header=False)
But I am looking for a native AWS solution / best practice.

but is there any AWS implementation for this issue?
No, there is no native AWS implementation for comparing files' contents. You have to implement that yourself, as you have already done. You can host your code as a Lambda function that will be automatically triggered by S3 uploads.
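For illustration, here is a minimal sketch of such a Lambda handler. It is only a sketch under stated assumptions: it reuses the merge logic from the question, fetches the previous version through the bucket's versioning API, and writes to a hypothetical merged/output_pd.txt key so the function does not re-trigger itself; pandas would also need to be packaged as a layer, since it is not in the default Lambda runtime.
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an S3 "ObjectCreated" event for the newly uploaded version
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]   # note: may need URL-decoding for keys with special characters

    # List the object's versions and pick the previous (second newest) one
    versions = s3.list_object_versions(Bucket=bucket, Prefix=key)["Versions"]
    versions = sorted(versions, key=lambda v: v["LastModified"], reverse=True)
    if len(versions) < 2:
        return  # first upload, nothing to merge against

    new_obj = s3.get_object(Bucket=bucket, Key=key)
    old_obj = s3.get_object(Bucket=bucket, Key=key, VersionId=versions[1]["VersionId"])

    new_df = pd.read_csv(io.BytesIO(new_obj["Body"].read()), sep="\t", header=None)
    old_df = pd.read_csv(io.BytesIO(old_obj["Body"].read()), sep="\t", header=None)

    # Same merge as in the script above
    missing = old_df[~old_df.iloc[:, 0].isin(new_df.iloc[:, 0])]
    final_df = pd.concat([new_df, missing], ignore_index=True)

    out = io.StringIO()
    final_df.to_csv(out, sep="\t", index=False, header=False)
    # Hypothetical output key; writing back to the same key would re-trigger the function
    s3.put_object(Bucket=bucket, Key="merged/output_pd.txt", Body=out.getvalue().encode("utf-8"))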

Related

Can you get filename from input_file_name() in aws s3 when using gunzipped files

I've been searching for an answer to this for quite a while now but can't seem to figure it out. I've read Why is input_file_name() empty for S3 catalog sources in pyspark? and tried everything in that question, but none of it worked. I'm trying to get the filename of each record in the source S3 bucket, but a blank value keeps getting returned. I'm thinking it could be because the files are gzipped, as it worked perfectly before they were. I can't find anywhere that says this should be an issue. Does anyone know whether it is, or whether it is something else to do with my code?
Thank you!
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name
from awsglue.context import GlueContext

def main():
    glue_context = GlueContext(SparkContext.getOrCreate())
    # Create a source dynamic frame for the bronze table
    # (DATABASE and TABLE are defined elsewhere)
    dyf_bronze_table = glue_context.create_dynamic_frame.from_catalog(
        database=DATABASE,
        table_name=TABLE,
        groupFiles='none'
    )
    # Add the file location to join the Postgres database on
    bronze_df = dyf_bronze_table.toDF()
    bronze_df = bronze_df.withColumn("s3_location", input_file_name())
    bronze_df.show()
The problem was in my Terraform file. I had set
compressionType = "gzip"
and also
format = gzip
Once I removed these, the filename was populated.
After reading through some of the documentation, though, I wouldn't recommend gzipping the files (maybe use Parquet instead): gzip files can't be split, so instead of working on the data across multiple DPUs, Glue has to work through each file individually.
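To illustrate that suggestion, here is a rough sketch that reuses the dyf_bronze_table dynamic frame from the question and writes it back out as splittable Parquet; the output path and partition count are placeholder assumptions, not part of the original answer.
# Write the bronze table back out as Parquet so downstream jobs can split the work
bronze_df = dyf_bronze_table.toDF()
(bronze_df
    .repartition(8)                                # arbitrary example partition count
    .write
    .mode("overwrite")
    .parquet("s3://my-bucket/bronze-parquet/"))    # placeholder output path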

How to read, modify, and overwrite parquet files in S3 using Spark?

I am trying to read a bunch of parquet files from S3 into a Spark dataframe using df = spark.read.parquet("s3a://my-bucket/path1/path2/*.parquet").
Will this read all the Parquet files present at any level inside path2 (e.g. path2/path3/...file.parquet), or only the files present directly under path2 (e.g. path2/file1.parquet)?
Will df now contain the complete filenames/filepaths (object keys) of all these Parquet files?
While processing the contents of a single Parquet file as a dataframe, I want to modify the dataframe and overwrite it inside the same file. How can I do that? Even if it deletes the old version of the file and creates a new file (with a new filename), that's fine, but I don't want any files other than the one currently under consideration to be affected in any manner by this operation.
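No answer is recorded here, but for the overwrite part, one common pattern can be sketched roughly as follows. This is only a sketch under stated assumptions: Spark cannot reliably overwrite a path it is still reading, so the result is written to a hypothetical staging prefix first and then copied over the original object with boto3; the bucket, paths, and added column are placeholders, and an existing SparkSession named spark is assumed.
import boto3
from pyspark.sql.functions import lit

src = "s3a://my-bucket/path1/path2/file1.parquet"   # the one file being processed (placeholder)
staging = "s3a://my-bucket/path1/_staging/file1"    # placeholder staging prefix

# 1. Read only this one file and apply the modification (example: add a flag column)
df = spark.read.parquet(src)
df = df.withColumn("processed", lit(True))

# 2. Spark writes a directory of part files, so coalesce to a single part
df.coalesce(1).write.mode("overwrite").parquet(staging)

# 3. Copy the new part file over the original object, leaving every other file untouched
s3 = boto3.client("s3")
parts = s3.list_objects_v2(Bucket="my-bucket", Prefix="path1/_staging/file1/")
part_key = next(o["Key"] for o in parts["Contents"] if o["Key"].endswith(".parquet"))
s3.copy_object(Bucket="my-bucket",
               CopySource={"Bucket": "my-bucket", "Key": part_key},
               Key="path1/path2/file1.parquet")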

Copy file from S3 subfolder into another subfolder in the same bucket

I'd like to copy a file from one subfolder into another subfolder in the same S3 bucket. I've read lots of questions on SO and finally came up with this code. It has an issue: when I run it, it works, but it doesn't copy the file only; it copies the folder that contains the file into the destination, so I end up with the file, but inside a folder (the source root). How do I copy only the files inside that subfolder?
XXXBUCKETNAME:
-- XXXX-input/ # I want to copy from here
-- XXXX-archive/ # to here
import boto3
from botocore.config import Config

s3 = boto3.resource('s3', config=Config(proxies={'https': getProperty('Proxy', 'Proxy.Host')}))
bucket_obj = s3.Bucket('XXX')
destbucket = 'XXX'
jsonfiles = []
for obj in bucket_obj.objects.filter(Delimiter='/', Prefix='XXXX-input/'):
    if obj.key.endswith('json'):
        jsonfiles.append(obj.key)
for k in jsonfiles:
    if k.split("_")[-1:][0] == "xxx.txt":
        dest = s3.Bucket(destbucket)
        source = {'Bucket': destbucket, 'Key': k}
        dest.copy(source, "XXXX-archive/" + k)
It gives:
XXXBUCKETNAME:
-- XXXX-input/
-- XXXX-archive/
-- XXXX-input/file.txt
I want:
XXXBUCKETNAME:
-- XXXX-input/
-- XXXX-archive/
-- file.txt
In S3 there really aren't any "folders." There are buckets and objects, as explained in the documentation. The UI may make it seem like there are folders, but the key for an object is the entire path. So if you want to copy one item, you will need to parse its key and build the destination key yourself, keeping only the file name from the end of the source key and prepending the destination prefix.
In Amazon S3, buckets and objects are the primary resources, and
objects are stored in buckets. Amazon S3 has a flat structure instead
of a hierarchy like you would see in a file system. However, for the
sake of organizational simplicity, the Amazon S3 console supports the
folder concept as a means of grouping objects. It does this by using a
shared name prefix for objects (that is, objects have names that begin
with a common string). Object names are also referred to as key names.
In your code you are pulling out each object's key, so that means the key already contains the full "path" even though there isn't really a path. So you will want to split the key on the / character instead and then take the last element in the resulting list and append that as the file name:
dest.copy(source, "XXXX-archive/" + k.split("/")[-1])
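Putting that together with the loop from the question, the copy step would look something like this (bucket and prefix placeholders kept as above):
for k in jsonfiles:
    if k.split("_")[-1:][0] == "xxx.txt":
        dest = s3.Bucket(destbucket)
        source = {'Bucket': destbucket, 'Key': k}
        # Keep only the file name from the source key, so the object lands
        # directly under XXXX-archive/ instead of XXXX-archive/XXXX-input/...
        dest.copy(source, "XXXX-archive/" + k.split("/")[-1])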

DSX: Insert to code link is missing

After uploading some files to my project and creating a catalog, I can see the list of files in the Find and Add Data section. However, there is no Insert to code link. This is true for files of type csv, json, and tar.gz, as well as for a data set from a catalog. What am I doing wrong?
The Insert to code option is only available for data that you upload to the Object Storage service.
I see that you are using the Catalog for storage in DSX.
The Catalog is still in beta, and Insert to code is not yet supported for Catalog data assets.
Feel free to add an enhancement request here:
https://datascix.uservoice.com/forums/387207-general
If you create a project with Object Storage as the storage, you will see Insert to code for CSV files.
For reading from the Catalog, you will need to use ProjectUtil.
A Catalog data asset is considered a resource of the project, so to access it you need an access token.
So the first step is to generate a token to access the catalog resource.
Go to Project Settings and create an access token. Then clear the next cell, click Insert project token from the three dots at the top of the notebook, and you will see code generated like the one below.
The generated code just creates a project context.
import com.ibm.analytics.projectNotebookIntegration._
val pc = ProjectUtil.newProjectContext(sc, "994b03fa-XXXXXX", "p-XXXXXXXXXX")
Let's make a list of the available files.
val fileList = ProjectUtil.listAvailableFilesData(pc)
fileList.indices.foreach( i => println(i + ": " + fileList(i)))
So fileList contains your file names.
You can use an index into that list, or pass the name of the file directly as the second argument.
val df = ProjectUtil.loadDataFrameFromFile(pc, fileList(1))
or
val df1 = ProjectUtil.loadDataFrameFromFile(pc, "co2.csv")
You will see output like this:
"Creating DataFrame, this will take a few moments...
DataFrame created."
Then df.show() will display the content.
Full notebook:
https://github.com/charles2588/bluemixsparknotebooks/blob/master/scala/Read_Write_Catalog_Scala.ipynb
The doc below also has Python and R examples.
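As a rough Python counterpart (an assumption on my part: this uses the project_lib package that DSX notebooks provide rather than ProjectUtil, and the exact constructor signature may differ, so check the linked doc; the IDs, token, and file name are the placeholders from the Scala example):
import pandas as pd
from project_lib import Project   # assumed package; see the linked documentation

# Placeholders taken from the Scala example above
project = Project(sc, "994b03fa-XXXXXX", "p-XXXXXXXXXX")

# get_file is assumed to return a file-like byte stream of the stored asset
co2_stream = project.get_file("co2.csv")
co2_df = pd.read_csv(co2_stream)
co2_df.head()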

Reference for ProjectUtil: https://datascience.ibm.com/docs/content/local/notebookfunctionsload.html
Thanks,
Charles.

How do I remove Header Row when migrating from S3 to Redshift DB?

I have a MySQL table that I'm migrating over to Redshift. The steps are pretty straightforward.
Export MySQL table to CSV
Place CSV into Amazon S3
Create table in Redshift with exact specifications as MySQL table
Copy CSV export into Redshift
I'm having a problem with the last step. I have headers in my MySQL CSV export. I can't currently recreate it, so I'm stuck with the CSV file. Step 4 is giving me an error because of the headers.
Instead of changing the CSV, I would love to add a line to account for headers. I've searched through AWS's documentation for copying tables, which is pretty extensive, but found nothing that accounts for headers. I'm looking for something like header = TRUE to add to the query below.
My COPY statement into Redshift right now looks like:
COPY apples FROM
's3://buckets/apples.csv'
CREDENTIALS 'aws_access_key_id=abc;aws_secret_access_key=def'
csv
;
I found the IGNOREHEADER option, but still couldn't figure out where to put it.
Pretty obvious now, but just add IGNOREHEADER at the bottom. The 1 is the number of rows you want to skip for headers; my CSV had one header row.
COPY apples FROM
's3://buckets/apples.csv'
CREDENTIALS 'aws_access_key_id=abc;aws_secret_access_key=def'
csv
IGNOREHEADER 1
;
There is a parameter the COPY command can use; refer to the documentation.
You can do something like this using the S3ToRedshiftOperator: add 'IGNOREHEADER 1' under copy_options: list[str].
To use it:
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

copy_options_list = ["csv", "timeformat 'auto'", "IGNOREHEADER 1"]

transfer_s3_to_redshift = S3ToRedshiftOperator(
    task_id="music_story_s3_to_redshift",
    redshift_conn_id=redshift_connection_id,
    s3_bucket=s3_bucket_name,
    s3_key=s3_key,
    schema=schema_name,
    table=redshift_table,
    column_list=cols_list,
    copy_options=copy_options_list,
    dag=dag,
)
The copy instruction then becomes:
COPY <schema.table> (column1, column2, column3…)
FROM 's3://<BUCKET_NAME>/<PATH_TO_YOUR_S3_FILE>'
credentials
'aws_access_key_id=<>;aws_secret_access_key=<>;token=<>'
csv
timeformat 'auto'
IGNOREHEADER 1;