I need to validate data between the source and the target: validations like record count match and data match.
I found DMS data validation in AWS; however, it does not perform validation on S3 bucket files (whether S3 is the source or the target) and reports that the validation feature is not available for S3.
Kindly suggest any other tool or AWS service to achieve this. Thanks!
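Since DMS validation is not supported for S3 endpoints, one fallback is a small custom check. Below is a minimal sketch, assuming the files are CSV and using hypothetical bucket and prefix names, that compares record counts between a source export and a target prefix with boto3 and pandas.

import boto3
import pandas as pd

s3 = boto3.client("s3")

def count_csv_records(bucket, prefix):
    """Sum the row counts of every CSV object under the given prefix."""
    total = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".csv"):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
                total += len(pd.read_csv(body))
    return total

# Hypothetical bucket/prefix names, used only for illustration.
source_count = count_csv_records("my-source-bucket", "exports/")
target_count = count_csv_records("my-target-bucket", "dms-output/")
print("record counts match" if source_count == target_count
      else f"mismatch: source={source_count}, target={target_count}")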
I am new to SageMaker, and I have created a pipeline from the SageMaker notebook consisting of training and deployment components.
In the training script, we can upload the model to S3 via SM_MODEL_DIR. But now I want to upload the classification report to S3. I tried the code below, but it says this is not a proper S3 bucket.
import os

import boto3
import pandas as pd

df_classification_report = pd.DataFrame(class_report).transpose()
classification_report_file_name = os.path.join(
    args.output_data_dir,
    f"{args.eval_model_name}_classification_report.csv")
df_classification_report.to_csv(classification_report_file_name)

# instantiate an S3 client and upload the classification report to S3
s3 = boto3.resource('s3')
print(f"classification_report is being uploaded to s3- {args.model_dir}")
s3.meta.client.upload_file(classification_report_file_name, args.model_dir,
                           f"{args.eval_model_name}_classification_report.csv")
And this is the error:
Invalid bucket name "/opt/ml/output/data": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
Can anybody help? I really appreciate any help you can provide.
SageMaker Training Jobs will compress any files located in /opt/ml/model, which is the value of SM_MODEL_DIR, and upload them to S3 automatically. You could look at saving your file to SM_MODEL_DIR (your classification report will thus be uploaded to S3 in the model tarball).
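For example, a minimal sketch of that approach, reusing args and df_classification_report from the question's script:

import os

# SM_MODEL_DIR is /opt/ml/model inside the training container; everything
# written here is tarred up and uploaded to S3 when the job finishes.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
report_path = os.path.join(
    model_dir, f"{args.eval_model_name}_classification_report.csv")
df_classification_report.to_csv(report_path)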
The upload_file() function requires you to pass an S3 bucket.
You could also look at manually specifying an S3 bucket in your code to upload the file to.
s3.meta.client.upload_file(classification_report_file_name, <YourS3Bucket>,
                           f"{args.eval_model_name}_classification_report.csv")
You can save non-model artifacts, such as reports, to output_data_dir. See here.
parser.add_argument("--output_data_dir", type=str,
                    default=os.environ.get('SM_OUTPUT_DATA_DIR'),
                    help="Directory to save output data artifacts.")
If you want the artifacts to be packaged with the model files, then follow @Marc's answer. That may make sense for a report that pertains to a specific model, though capturing it in a model registry makes more sense to me.
Note that these additional artifacts would be carried over if you deploy the model to an endpoint (and might confuse the inference runtime's model-loading code).
We're trying to use AWS Glue for ETL operations in our Node.js project. The workflow will be as below:
user uploads a CSV file
data transformation from XYZ format to ABC format (mapping and changing field names)
download the transformed CSV file to the local system
Note that this flow should happen programmatically (creating crawlers and job triggers should be done programmatically, not through the console). I don't know why the documentation and other articles always show how to create crawlers and jobs from the Glue console.
I believe we have to create Lambda functions and triggers, but I'm not quite sure how to achieve this end-to-end flow. Can anyone please help me? Thanks!
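For reference, crawlers and jobs can be created and started through the Glue API rather than the console. Below is a minimal boto3 sketch with hypothetical resource names and ARNs; the equivalent calls exist in the AWS SDK for JavaScript for a Node.js project.

import boto3

glue = boto3.client("glue")

# Hypothetical names, bucket paths, and role ARN; replace with your own.
glue.create_crawler(
    Name="csv-upload-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="uploads_db",
    Targets={"S3Targets": [{"Path": "s3://my-upload-bucket/incoming/"}]},
)
glue.start_crawler(Name="csv-upload-crawler")

# The ETL script (field mapping/renaming) lives in S3 and is referenced here.
glue.create_job(
    Name="xyz-to-abc-job",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-scripts-bucket/xyz_to_abc.py",
             "PythonVersion": "3"},
    GlueVersion="4.0",
)
glue.start_job_run(JobName="xyz-to-abc-job")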
I'm trying to set up a workflow to back up Account & Contact objects from Salesforce to S3 via AWS AppFlow. So far, I have been able to set up the connection and back up the files on demand.
However, for restoration I would like to import the mapping using a .csv file; below are the first 3 sample lines (comma-separated source and destination fields).
Name, Name
Type, Account Type
AccountNumber, Account Number
But AppFlow is unable to import it, giving "Couldn't parse rows from the file". Am I missing something?
This was a bug on the AWS side and it has been taken up. The workaround is to do the mapping manually instead of importing an external CSV; make sure the source field attributes match the corresponding objects in Salesforce.
Scenario
I have a full-text search requirement that must search inside the documents. I am uploading documents to an S3 bucket and encrypting them using envelope encryption.
Can we do full-text search in encrypted documents (in the S3 bucket)? If yes, what are the REST APIs (Node.js APIs) for the same?
Example => bucket1 => encrypted content in the files
bucket1/abc.pdf
bucket1/def.doc
bucket1/ghi.txt
and I want to search for text like "I am from planet earth" in the above files.
I want the result to be the file name(s) containing the above text.
Solution
I am reading the following article:
aws article here
encryption of data at rest
Problem
Will it work if the S3 bucket data is encrypted?
What will be the best solution for this scenario?
Elasticsearch does not search inside documents stored elsewhere; you need to index the content of the documents in Elasticsearch to be able to perform searches. It also does not support searching encrypted data; the indexed data needs to be stored in clear text.
What you can do is configure SSL/TLS and authentication on Elasticsearch, so requests are only possible with the correct certificate and a username and password.
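As a rough sketch of that approach, assuming the Python elasticsearch client (8.x), a hypothetical "docs" index and endpoint, and that you extract and decrypt the document text yourself before indexing:

from elasticsearch import Elasticsearch

# Hypothetical endpoint and credentials; TLS + basic auth as described above.
es = Elasticsearch("https://my-es-domain:9200",
                   basic_auth=("search_user", "search_password"),
                   ca_certs="ca.pem")

# Index the plaintext content of each file (extracted/decrypted beforehand).
es.index(index="docs", id="bucket1/ghi.txt",
         document={"file": "bucket1/ghi.txt",
                   "content": "... extracted plain text ..."})

# Search for a phrase and return the matching file names.
resp = es.search(index="docs",
                 query={"match_phrase": {"content": "I am from planet earth"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["file"])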
We are using DMS to get data from SQL Server and load it into an S3 bucket, after which the data is loaded into Snowflake using Snowpipe (full load).
Now, for Snowpipe to know there is new data in the S3 bucket, the file name needs to be different from the last one. I have tried all the available task setting options (DROP_AND_CREATE, DO_NOTHING, TRUNCATE) to get a different file name, but it still isn't working. It keeps writing the file as LOAD00000001.csv.
The documentation says the file names will be incremental (e.g. LOAD00000001.csv, LOAD00000002.csv, and so on), but that's not happening, which is why Snowpipe is not able to register the changes.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html
Can someone please help?
For DMS, the incremental counter starts over from 1 each time the task is run; there is no "don't overwrite existing objects" feature.
Your best bet may be to handle the load yourself, either by looking for updated object timestamps in your folder or by setting up S3 event notifications.
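For example, a minimal sketch of the event-notification route (hypothetical bucket and prefix names): a Lambda triggered by S3 ObjectCreated events copies each DMS output file to a timestamped key, so every load produces a file name Snowpipe hasn't seen before.

import time
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by s3:ObjectCreated:* on the DMS output prefix."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # e.g. dms-output/LOAD00000001.csv -> snowpipe-input/1700000000_LOAD00000001.csv
        new_key = f"snowpipe-input/{int(time.time())}_{key.split('/')[-1]}"
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": key},
                       Key=new_key)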