I have a scenario where:
1st run: 500 records are loaded from source to target.
2nd run: only 450 records come from the source.
The target and source need to be in sync after each run. How do you sync them, i.e. delete the extra 50 records in the target, in Informatica PC/IICS?
Just truncate the target table at the start of the run
Truncate would be the best option if the delta between source and target runs into millions of records. But if the delta record set between source and target is small, you could add an index on the target and, with the help of a lookup, determine whether each record is an insert or a delete and perform the relevant action.
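In PowerCenter/IICS this is typically built as a Lookup on the target keys feeding an Update Strategy (DD_DELETE for the orphan rows), but the detection logic itself is simple. A minimal Python sketch of that logic only, assuming a single integer key column and the 500/450 record counts from the question:

# Sketch of the delete-detection step, independent of Informatica.
# The key sets would come from the source query and a target lookup/cache.
def plan_sync(source_ids: set, target_ids: set):
    inserts = source_ids - target_ids   # in source but not yet in target
    deletes = target_ids - source_ids   # in target but gone from source
    return inserts, deletes

# Run 1 loaded ids 1..500; run 2 only delivers ids 1..450.
source_ids = set(range(1, 451))
target_ids = set(range(1, 501))
inserts, deletes = plan_sync(source_ids, target_ids)
print(len(inserts), len(deletes))   # 0 inserts, 50 deletes (ids 451..500)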
I have multiple sources sending incremental data and there are no metadata columns at the record level. How can I ensure that Airflow processes the data in the order of receipt? I may end up processing the files out of order.
Does Airflow have an inbuilt method/way to handle files in the order they are received?
Airflow version used: 2.4.3
You can use boto to retrieve the last modified timestamp from files in your S3 bucket within a PythonOperator.
This question has an answer that shows how to pull the last modified timestamp. Then you can sort the keys by that timestamp, process the files in that order, and move the files to an archive folder or bucket so only new files are processed with every DAG run.
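A rough sketch of that callable, assuming boto3 is available on the Airflow workers; the bucket, prefixes and the process_file() helper are placeholders, not anything from the original question:

import boto3

def process_file(key, body):
    # Placeholder for whatever the DAG actually does with each file.
    print(f"processing {key} ({len(body)} bytes)")

def process_in_arrival_order(bucket="my-bucket", prefix="incoming/", archive_prefix="archive/"):
    s3 = boto3.client("s3")
    # For more than 1,000 objects you would paginate this call.
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])

    # Sort by S3's LastModified timestamp so files are handled in order of receipt.
    for obj in sorted(objects, key=lambda o: o["LastModified"]):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip folder placeholder objects
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        process_file(key, body)

        # Move the processed file to an archive prefix so the next DAG run only sees new files.
        s3.copy_object(Bucket=bucket,
                       Key=archive_prefix + key[len(prefix):],
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)

This function could then be wrapped in a PythonOperator (or a @task-decorated task) inside the DAG.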
As a general note, if you have any control over your sources, I would try to add a timestamp at the record level instead; that seems like the easier option.
I'm doing an incremental data load from a relational DB to a dynamically created flat file. If there are no new records in the source, the mapping does not create the target file. I need an empty target file even when no records are fetched from the source.
You can create a command task that kicks off on a condition.
In the command task, just put this command:
touch /location/empty_file.txt
Link the main session to this command task, then double-click the link and add the condition below to it:
$yourMainSessionName.SrcSuccessRows = 0
So, this command task will activate only when your main session pulls 0 rows.
I'm trying to restore data in EFS from recovery points managed by AWS Backup. It seems AWS Backup does not support destructive restores and will always restore to a directory in the target EFS file system, even when creating a new one.
I would like to sync the data extracted from such a recovery point to another volume, but right now I can only do this manually, as I need to look up the directory name that is used by the start-restore-job operation (e.g. aws-backup-restore_2022-05-16T11-01-17-599Z), as stated in the docs:
You can restore those items to either a new or existing file system. Either way, AWS Backup creates a new Amazon EFS directory (aws-backup-restore_datetime) off of the root directory to contain the items.
Looking further through the documentation, I can't find either of:
an option to set the name of the directory used
the value of the directory name returned in any call (either start-restore-job or describe-restore-job)
I have also checked how the datetime portion of the directory name maps to the creationDate and completionDate of the restore job, but it seems neither matches (completionDate is very close, but it's not the exact same timestamp).
Is there any way for me to do one of these two things? With both of them missing, restoring a file system from a recovery point in an automated fashion is very hard.
Is there any way for me to do one of these two things?
As it stands, no.
However, since we know that the directory will always be in the root, doing find . -type d -name "aws-backup-restore_*" should return the directory name to you. You could also further filter this down based on the year, month, day, hour & minute.
You could have something polling the job status on the machine that has the EFS file system mounted, finding the correct directory and then pushing that to AWS Systems Manager Parameter Store for later retrieval. If restoring to a new file system, this of course becomes more difficult but still doable in an automated fashion.
If you're not mounting this on an EC2 instance, running a Lambda function with the EFS file system mounted, for example, will let you obtain the directory and then push it to Parameter Store for retrieval elsewhere. The Lambda service mounts EFS file systems while the execution environment is being prepared, i.e. during the 'cold start', so there is no extra invocation time to pay for, which makes this the cheapest option.
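For the Lambda variant, a hedged sketch: the function is assumed to run with the file system mounted at /mnt/efs (the mount path and parameter name below are arbitrary choices, not anything AWS prescribes), locates the restore directory and records it in Parameter Store:

import glob
import os
import boto3

MOUNT_PATH = "/mnt/efs"                       # wherever the EFS access point is mounted
PARAM_NAME = "/efs/latest-restore-directory"  # hypothetical parameter name

def handler(event, context):
    candidates = glob.glob(os.path.join(MOUNT_PATH, "aws-backup-restore_*"))
    if not candidates:
        return {"restore_dir": None}

    # If several restore directories exist, take the most recently modified one.
    latest = os.path.basename(max(candidates, key=os.path.getmtime))

    boto3.client("ssm").put_parameter(
        Name=PARAM_NAME, Value=latest, Type="String", Overwrite=True
    )
    return {"restore_dir": latest}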
There's no built-in way via the APIs, however, to obtain the directory name or to configure it, so you're stuck there.
It's a failure on AWS's part that they neither return the directory name they use in any way, nor does any of the metadata returned (creationDate/completionDate) exactly match the timestamp used to name the directory.
If you're an enterprise customer, suggest this as a missing feature to your TAM or SA.
I'd like to use EMR and Spark to process an AWS S3 inventory report generated in ORC format that has many ORC files (hundreds) and the total size of all the data is around 250GB.
Is there a specific or best practice way to read all the files in to one Dataset? It seems like I can pass the sqlContext.read().orc() method a list of files, but I wasn't sure if this would scale/parallelize properly if I pass it a large list of hundreds of files.
What is the best practice way of doing this? Ultimately my goal is to have the contents of all the files in one dataset so that I can run a sql query on the dataset and then call .map on the results for subsequent processing on that result set.
Thanks in advance for your suggestions.
Just specify the folder where your ORC files are located. Spark will automatically detect all of them and put them into a single DataFrame.
sparkSession.read.orc("s3://bucket/path/to/folder/with/orc/files")
You shouldn't worry much about scalability, since everything is handled by Spark based on the default configuration EMR provides for the selected EC2 instance type. You can experiment with the number of worker nodes and their instance type, though.
Besides that, I would suggest setting maximizeResourceAllocation to true so that the executors are configured to utilize the maximum resources available on each worker node.
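A minimal PySpark sketch of the whole flow asked about in the question: read everything under the prefix into one DataFrame, run a SQL query over it, then map over the result. The prefix and the column names (key, size) are placeholders; adjust them to your inventory schema.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-inventory-orc").getOrCreate()

# Spark lists every ORC file under the prefix itself and parallelizes the read
# across the cluster, so hundreds of files are not a problem.
inventory = spark.read.orc("s3://bucket/path/to/folder/with/orc/files")
inventory.createOrReplaceTempView("inventory")

# Example query: objects larger than 100 MiB. Adjust columns to your inventory schema.
large_objects = spark.sql("SELECT key, size FROM inventory WHERE size > 104857600")

# Subsequent per-record processing on the query result.
keys = large_objects.rdd.map(lambda row: row["key"]).collect()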
I am trying to load data from a text file that resides in Amazon S3 into a Redshift database. I am using SQL Workbench and loading with the COPY command. The file is large, ~360 GB. After 2 hours, the connection gets closed, throwing the error message shown in the subject. I tried to set the timeout to '0' (limitless).
I found the reason after getting some help.
The COPY into the table I was loading had the "COMPUPDATE" setting turned ON. This basically means that, as part of the COPY command, Redshift will try to analyze the table for appropriate compression encodings and apply them.
That was one of the issues. Setting it to OFF in the COPY command saves time and takes one task off the database.
You can always check compression later using the ANALYZE COMPRESSION command.
Second, for large datasets I assume every column uses Zstandard (ZSTD). So before you load the data, check whether compression is actually needed or not.
Third, it is recommended to GZIP the files and then load the data. More information can be found here.
Fourth and most important, large files should be split into smaller files to make the best use of the cluster resources available to your account. This helps divide the workload among all of your nodes. More here.
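Pulling those points together, a hedged example of what the COPY could look like when run from Python with psycopg2; the table name, S3 prefix of the split/gzipped files, IAM role and connection details are all placeholders:

import psycopg2

COPY_SQL = """
COPY my_schema.my_table
FROM 's3://my-bucket/exports/part_'              -- prefix matching the split, gzipped files
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
GZIP                                             -- files were gzipped before upload
COMPUPDATE OFF;                                  -- skip automatic compression analysis during the load
"""

conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="loader", password="...")
try:
    with conn.cursor() as cur:
        cur.execute(COPY_SQL)
    conn.commit()
finally:
    conn.close()

Compression encodings can still be reviewed afterwards with ANALYZE COMPRESSION, as mentioned above.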
Hope this helps. Please let me know if you need anything else.