S3 command to download files for analysis - amazon-web-services

Suppose that you work in an e-commerce company, which keeps records of multiple products (more than a thousand) in the S3 bucket ‘records’. The files have the following structure for filename: ‘category-productid.csv’.
Now, you have to analyse the records associated with the ‘Electronics’ category only. You are expected to download the specific category reports and then perform the analysis over your local machine.
Provide the command that helps you perform this task.
I have been reading the s3 help page and I can download the CSV files, but how do I pick out only a particular category for analysis?

aws s3 cp s3://records "local_path" --recursive --exclude "*" --include "Electronics-*.csv"
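If you would rather script the download, the same filter can be sketched with boto3. This is only a minimal sketch under some assumptions: the function names and the local directory are made up for illustration, the files sit at the bucket root, and the category prefix is matched case-sensitively (S3 keys are case-sensitive). The `client` argument is expected to be a boto3 S3 client.

```python
import os


def select_category_keys(keys, category):
    """Return the keys named '<category>-<productid>.csv' (case-sensitive)."""
    prefix = f"{category}-"
    return [k for k in keys if k.startswith(prefix) and k.endswith(".csv")]


def download_category(client, bucket, category, local_dir):
    """Download every '<category>-*.csv' object into local_dir.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    os.makedirs(local_dir, exist_ok=True)
    paginator = client.get_paginator("list_objects_v2")
    downloaded = []
    # Prefix narrows the listing on the server side; the paginator follows
    # the continuation tokens, since one response holds at most 1000 keys.
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{category}-"):
        keys = [obj["Key"] for obj in page.get("Contents", [])]
        for key in select_category_keys(keys, category):
            target = os.path.join(local_dir, os.path.basename(key))
            client.download_file(bucket, key, target)
            downloaded.append(target)
    return downloaded
```

With credentials configured, something like download_category(boto3.client("s3"), "records", "Electronics", "electronics_reports") would fetch the files, after which they can be analysed locally.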

Related

Upload custom file to s3 from training script in training component of AWS SageMaker Pipeline

I am new to SageMaker, and I have created a pipeline from a SageMaker notebook, consisting of training and deployment components.
In the training script, we can upload the model to S3 via SM_MODEL_DIR. But now I want to upload the classification report to S3 as well. I tried the code below, but it complains that this is not a proper S3 bucket.
df_classification_report = pd.DataFrame(class_report).transpose()
classification_report_file_name = os.path.join(args.output_data_dir,
f"{args.eval_model_name}_classification_report.csv")
df_classification_report.to_csv(classification_report_file_name)
# instantiate S3 client and upload to s3
# save classification report to s3
s3 = boto3.resource('s3')
print(f"classification_report is being uploaded to s3- {args.model_dir}")
s3.meta.client.upload_file(classification_report_file_name, args.model_dir,
f"{args.eval_model_name}_classification_report.csv")
And the error
Invalid bucket name "/opt/ml/output/data": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
Can anybody help? I really appreciate any help you can provide.
SageMaker Training Jobs compress any files located in /opt/ml/model, which is the value of SM_MODEL_DIR, and upload them to S3 automatically. You could save your file to SM_MODEL_DIR (your classification report will then be uploaded to S3 inside the model tarball).
The upload_file() function requires an S3 bucket name as its second argument; your code passes a local directory path instead, which is why the call fails.
Alternatively, you could manually specify an S3 bucket in your code to upload the file to.
s3.meta.client.upload_file(classification_report_file_name, <YourS3Bucket>,
f"{args.eval_model_name}_classification_report.csv")
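If the destination is available as an s3:// URI rather than as a separate bucket and key, a small helper can split it before the upload. This is a rough sketch only; split_s3_uri and upload_report are hypothetical names, not part of boto3 or SageMaker, and `client` is assumed to be a boto3 S3 client.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_report(client, local_path, s3_uri):
    """Upload a local file to the bucket/key named by an s3:// URI."""
    bucket, key = split_s3_uri(s3_uri)
    # upload_file takes (Filename, Bucket, Key): the bucket is a bare name,
    # never a local path such as /opt/ml/output/data.
    client.upload_file(local_path, bucket, key)
    return bucket, key
```

Passing a local path such as /opt/ml/output/data to split_s3_uri raises a ValueError right away, instead of failing later inside boto3 with the "Invalid bucket name" message.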
You can save non model artifacts, such as reports, to output_data_dir. See here.
parser.add_argument("--output_data_dir", type=str,
default=os.environ.get('SM_OUTPUT_DATA_DIR'),
help="Directory to save output data artifacts.")
If you want the artifacts to be packaged with the model files, then follow @Marc's answer. That may make sense for a report that pertains to a specific model, though capturing it in a model registry makes more sense to me.
Note that these additional artifacts would be carried over if you deploy the model to an endpoint (which might confuse the model-loading code in the inference runtime).

How can I list unknown folders in an S3 bucket? I have millions of objects in my bucket and only want the folder list

I have a bucket with 3 million objects. I don't know how many folders there are in the bucket, or what they are called. I want to show only the list of folders. Is there any way to get a list of all folders?
I would use AWS CLI for this. To get started - have a look here.
Then it is a matter of almost standard Linux commands (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
The grep command reads what aws s3 ls has returned and keeps only the entries ending in /. With --recursive, aws s3 ls prints full object keys, so this matches only keys that themselves end in / (for example, the zero-byte placeholder objects the console creates for a new folder).
The trailing > folders.txt saves the output to a file.
Note: grep is (if I'm not wrong) a Unix-only utility, but I believe you can achieve the same on Windows as well.
Note 2: depending on the number of files, this operation might (will) take a while.
Note 3: in systems like AWS S3, the term "folder" exists only to give users visual similarity with standard file systems; internally it is just part of the object key. You can see this in the web console when you filter by "prefix".
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.
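For reference, the Delimiter='/' and CommonPrefixes approach mentioned above could look roughly like this with boto3. It is a sketch under assumptions: list_folders and walk_folders are illustrative names, `client` is assumed to be a boto3 S3 client, and on a bucket with millions of objects and a deep hierarchy the repeated calls will still be slow.

```python
def list_folders(client, bucket, prefix=""):
    """Return the immediate 'folder' prefixes under `prefix`."""
    paginator = client.get_paginator("list_objects_v2")
    folders = []
    # With Delimiter="/", S3 groups keys that share the next path segment
    # and reports each group once under CommonPrefixes.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            folders.append(entry["Prefix"])
    return folders


def walk_folders(client, bucket, prefix=""):
    """Yield every folder prefix in the hierarchy, depth-first."""
    for folder in list_folders(client, bucket, prefix):
        yield folder
        yield from walk_folders(client, bucket, folder)
```

Calling list_folders(boto3.client("s3"), "<bucket_name>") would give the top-level folders only; walk_folders descends one listing call (or more) per folder, which is the "repeated calls" cost described above.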

How to get a listing of WARC files using HTTP for Common Crawl News Dataset?

I can obtain listing for Common Crawl by:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz
How can I do this with the Common Crawl News Dataset?
I tried different options, but I always get errors:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS-2017-09/warc.paths.gz
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2017/09/warc.paths.gz
Since every few hours a new WARC file is added to the news dataset, a static file list does not make sense. Instead you can get a list of files using the AWS CLI - for any subset by year or month, e.g.
aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2017/09/
See also the news data release announcement.
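The same listing could also be done from Python. The sketch below assumes the key layout shown in the command above (crawl-data/CC-NEWS/YYYY/MM/) and that WARC files end in .warc.gz; the function names are illustrative, and `client` is assumed to be a boto3 S3 client that you create yourself (whether anonymous requests are accepted depends on the bucket's current access policy).

```python
def news_prefix(year, month):
    """Build the key prefix for one month of the CC-NEWS dataset."""
    return f"crawl-data/CC-NEWS/{year:04d}/{month:02d}/"


def list_news_warcs(client, year, month, bucket="commoncrawl"):
    """Return the keys of all WARC files listed for the given month."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=news_prefix(year, month)):
        keys.extend(
            obj["Key"]
            for obj in page.get("Contents", [])
            if obj["Key"].endswith(".warc.gz")  # assumed file suffix
        )
    return keys
```

Because new files keep arriving, the result for the current month will grow between calls.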

List all forecast CSV files exported to AWS S3 bucket when using AWS Forecast Export Job

I have trained a Predictor on AWS Forecast, and used it to make some forecasts.
I want to get these forecasts as CSV files. To do so, I created a "ForecastExportJob".
After the exportation is done, I can successfully see the CSV files in my S3 bucket.
I would like to download them programmatically, so is there a way to have a list of S3 keys that correspond to the CSV files created with the "ForecastExportJob" command?
I could list all objects in the destination buckets and filter them, but I am wondering if there is a "more elegant" solution to my problem.
Put simply, I would like to know if there is an AWS command that can list the files created by the "ForecastExportJob" command:
electricityforecast_export_job_2021-01-04T06-40-23Z_part0.csv
...
electricityforecast_export_job_2021-01-04T06-40-23Z_part7.csv
Note: I am using boto3
Thank you in advance and happy new year!
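For what it's worth, the list-and-filter fallback mentioned in the question could be sketched as below with boto3. It assumes the file names follow the pattern visible in the examples above (<export job name>_<timestamp>_part<N>.csv); that pattern is inferred from those two names only, and the function names are made up for illustration. `client` is assumed to be a boto3 S3 client.

```python
import re


def export_part_keys(keys, export_job_name):
    """Keep keys whose file name is '<export_job_name>_<timestamp>_part<N>.csv'."""
    pattern = re.compile(
        re.escape(export_job_name) + r"_[0-9TZ:\-]+_part(\d+)\.csv$"
    )
    matches = []
    for key in keys:
        filename = key.rsplit("/", 1)[-1]
        m = pattern.match(filename)
        if m:
            matches.append((int(m.group(1)), key))
    # Sort numerically so that part10 comes after part9.
    return [key for _, key in sorted(matches)]


def list_export_files(client, bucket, prefix, export_job_name):
    """List the export job's CSV parts under the destination prefix."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return export_part_keys(keys, export_job_name)
```

Passing the export destination path as `prefix` keeps the listing small, so the filtering stays cheap even if the bucket holds other data.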

automating file archival from ec2 to s3 based on last modified date

I want to write an automated job that goes through the files stored on my EC2 instance and checks their last modified date. If a file was last modified more than (x) days ago, it should automatically get archived to my S3 bucket.
Also, I don't want to convert the files to a zip file for now.
What I don't understand is how to give the path on the EC2 instance storage and how to put in the condition for the last modified date.
aws s3 sync your-new-dir-name s3://your-s3-bucket-name/folder-name
Please correct me if I have understood this wrong.
Your requirement is to archive the older files.
So you need a script that checks the modified time, and if a file has not been modified for X days, you want to free up space by archiving it to S3 storage. You don't wish to keep the file locally.
Is that correct?
Here is some advice:
1. Please provide OS information. This would help us suggest either a shell script or a PowerShell script.
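Here is a PowerShell script that lists the files not modified in the last X days:
$days = 30   # X: example value, adjust as needed
$cutoff = (Get-Date).AddDays(-$days)
Get-ChildItem -Path "C:\pathtofolder" -File |
Where-Object { $_.LastWriteTime -lt $cutoff } |
Select-Object -Property FullName, LastWriteTime |
Export-Csv 'C:\fileAndDate.csv' -NoTypeInformation
Then copy the listed files to the S3 bucket with aws s3 cp.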
You will do the same with Shell script.
Using aws s3 sync is a great way to back up files to S3. You could use a command like:
aws s3 sync /home/ec2-user/ s3://my-bucket/ec2-backup/
The first parameter (/home/ec2-user/) is where you specify the source of the files. I recommend only backing up user-created files, not the whole operating system.
aws s3 sync has no option for selecting files by age in days. I suggest you just copy all files.
You might choose to activate Versioning to keep copies of all versions of files in S3. This way, if a file gets overwritten you can still go back to a prior version. (Storage charges will apply for all versions kept in S3.)
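If you do want the "older than X days" condition, it has to be applied before the upload, for example in a small Python script run from cron. The following is only a sketch: the function names, the key prefix, and the choice to keep local copies by default are assumptions for illustration, and `client` is assumed to be a boto3 S3 client.

```python
import os
import time


def files_older_than(root, days, now=None):
    """Return paths under `root` last modified more than `days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds per day
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                old.append(path)
    return sorted(old)


def archive_old_files(client, root, days, bucket, key_prefix="", delete_local=False):
    """Upload each old file to S3 unzipped, keeping its relative path as the key."""
    archived = []
    for path in files_older_than(root, days):
        key = key_prefix + os.path.relpath(path, root).replace(os.sep, "/")
        client.upload_file(path, bucket, key)
        if delete_local:
            os.remove(path)  # only reached if the upload did not raise
        archived.append(key)
    return archived
```

For example, archive_old_files(boto3.client("s3"), "/home/ec2-user", 30, "my-bucket", "ec2-backup/") would upload everything untouched for 30 days; pass delete_local=True only once you are sure the uploads are landing where you expect.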