My question is a little bit general: we want to build a solution based on Amazon Rekognition, but we want to make sure that Amazon doesn't keep our data after processing is completed. For example, when I use the detect_text function in boto3 like this:
response = client.detect_text(Image={'Bytes': images_bytes})
After I get the response, what happens to the images_bytes that were uploaded to Amazon for processing? Are they automatically destroyed, or does Amazon keep them?
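For context, here is a minimal, self-contained version of the call I'm making (the region and file name are just placeholders):

import boto3

# Placeholder region and file name -- adjust to your own setup.
client = boto3.client('rekognition', region_name='us-east-1')

with open('receipt.png', 'rb') as f:
    images_bytes = f.read()

# The raw image bytes are sent to the Rekognition API for processing.
response = client.detect_text(Image={'Bytes': images_bytes})

for detection in response['TextDetections']:
    print(detection['DetectedText'], detection['Confidence'])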
It is unlikely that AWS would be able to provide you with specific details of how it implements a service.
AWS does have various security certifications that might address your general question of how customer data is handled.
See: Cloud Compliance - Amazon Web Services (AWS)
I read the documentation on the official website, but it does not give me a clear picture.
Why would we need to use AWS Transfer Family when AWS DataSync can also achieve the same result?
I notice the protocol differences, but I'm not quite sure about the data migration use case.
Why would we pick one over the other?
Why would we need to use AWS Transfer Family when AWS DataSync can also achieve the same result?
It depends on what you mean by achieving the same result.
If it is transferring data to & from AWS, then yes, both achieve the same result.
However, the main difference is that AWS Transfer Family is practically an always-on server endpoint enabled for SFTP, FTPS, and/or FTP.
If you need to maintain compatibility for current users and applications that use SFTP, FTPS, and/or FTP, then AWS Transfer Family is a must: it ensures the contract is not broken and that those clients can continue to be used without any modification. Existing transfer workflows for your end users are preserved & existing client-side configurations are maintained.
On the other hand, AWS DataSync is ideal for transferring data between on-premises & AWS or between AWS storage services. A few use-cases that AWS suggests are migrating active data to AWS, archiving data to free up on-premises storage capacity, replicating data to AWS for business continuity, or transferring data to the cloud for analysis and processing.
At the core, both can be used to transfer data to & from AWS but serve different business purposes.
Your exact question is answered in the AWS DataSync FAQs:
Q: When do I use AWS DataSync and when do I use AWS Transfer Family?
A: If you currently use SFTP to exchange data with third parties, AWS Transfer Family provides a fully managed SFTP, FTPS, and FTP transfer directly into and out of Amazon S3, while reducing your operational burden.
If you want an accelerated and automated data transfer between NFS servers, SMB file shares, self-managed object storage, AWS Snowcone, Amazon S3, Amazon EFS, and Amazon FSx for Windows File Server, you can use AWS DataSync. DataSync is ideal for customers who need online migrations for active data sets, timely transfers for continuously generated data, or replication for business continuity.
Also see: AWS Transfer Family FAQs - Q: Why should I use the AWS Transfer Family?
(Please feel free to mark this question as duplicate and share pointer to duplicates.)
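To make the distinction concrete, here is a minimal boto3 sketch (server options, hostnames, ARNs, and role names are placeholder assumptions, not working values): Transfer Family stands up a managed, always-on SFTP endpoint, while DataSync defines two locations and a task that moves data between them.

import boto3

# --- AWS Transfer Family: a managed, always-on SFTP endpoint into S3 ---
transfer = boto3.client('transfer')
server = transfer.create_server(
    Protocols=['SFTP'],                      # keep existing SFTP clients working
    IdentityProviderType='SERVICE_MANAGED',  # users managed by the service itself
    EndpointType='PUBLIC',
)
print('SFTP server id:', server['ServerId'])

# --- AWS DataSync: a task that moves data between two locations ---
datasync = boto3.client('datasync')
source = datasync.create_location_smb(
    ServerHostname='fileserver.example.com',  # placeholder on-premises share
    Subdirectory='/exports',
    User='backup',
    Password='********',
    AgentArns=['arn:aws:datasync:us-east-1:123456789012:agent/agent-EXAMPLE'],
)
destination = datasync.create_location_s3(
    S3BucketArn='arn:aws:s3:::my-archive-bucket',
    S3Config={'BucketAccessRoleArn': 'arn:aws:iam::123456789012:role/DataSyncS3Role'},
)
task = datasync.create_task(
    SourceLocationArn=source['LocationArn'],
    DestinationLocationArn=destination['LocationArn'],
    Name='onprem-to-s3-migration',
)
datasync.start_task_execution(TaskArn=task['TaskArn'])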
Hi,
We are developing a Spring Boot based application and will be using Docker in production.
Currently it uses MongoDB (Atlas) for storing its logs, but MongoDB Cloud looks like an expensive option for storing logs/audit trails.
Since we are going to use AWS, which AWS service should we use to store Log4j logs and audit messages?
Usually people store logs in S3, where you can archive them for reasonable money with a combination of the Infrequent Access and Glacier storage classes, and you can also apply a lifecycle policy so the logs are automatically removed after a defined amount of time.
If you are looking for some kind of streaming/logging over the network, you may start with AWS Lambda functions or SQS, or you may want to go with a service like https://aws.amazon.com/kinesis/data-firehose/ if you believe your volume is really big.
The other advantage of S3 (besides the low price) is that most other services support reading data from S3, so if you decide later that you want to analyze the data with Elasticsearch or an Elastic MapReduce cluster, there will probably be a way to do it.
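For example, a lifecycle policy along these lines (the bucket name, prefix, and day counts are just illustrative assumptions) transitions logs to cheaper storage classes and expires them automatically:

import boto3

s3 = boto3.client('s3')

# Hypothetical bucket and retention periods -- adjust to your own requirements.
s3.put_bucket_lifecycle_configuration(
    Bucket='my-app-logs',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-and-expire-logs',
                'Filter': {'Prefix': 'logs/'},
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': 30, 'StorageClass': 'STANDARD_IA'},  # infrequent access
                    {'Days': 90, 'StorageClass': 'GLACIER'},      # archival
                ],
                'Expiration': {'Days': 365},  # delete logs after a year
            }
        ]
    },
)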
What is the best way to scan data in S3 (for auditing purposes, possibly)? I was asked to do some research on this, and utilizing AWS Athena was the first idea I could think of, but if you can provide more knowledge/ideas, I'd appreciate it.
Thanks!
You want to use Amazon Macie:
Amazon Macie is a security service that uses machine learning to automatically discover, classify, and protect sensitive data in AWS. Amazon Macie recognizes sensitive data such as personally identifiable information (PII) or intellectual property, and provides you with dashboards and alerts that give visibility into how this data is being accessed or moved.
Video: AWS Summit Series 2017 - New York: Introducing Amazon Macie
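As a rough sketch (the account ID and bucket name are placeholders, and the Macie v2 API is assumed), you would enable Macie for the account and then start a one-time classification job over the bucket:

import uuid
import boto3

macie = boto3.client('macie2')

# Enable Macie for the account (this raises an error if it is already enabled).
macie.enable_macie()

# Hypothetical one-time job that scans a single bucket for sensitive data.
macie.create_classification_job(
    clientToken=str(uuid.uuid4()),
    jobType='ONE_TIME',
    name='audit-s3-bucket',
    s3JobDefinition={
        'bucketDefinitions': [
            {'accountId': '123456789012', 'buckets': ['my-data-bucket']}
        ]
    },
)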
After reading about AWS Lambda I've taken quite an interest in it, although there is one thing that I couldn't really find any info on. So my question is: is it possible to have Lambda work outside Amazon services? Say I have a database from some other provider, would it be possible to perform operations on it through AWS Lambda?
Yes.
AWS Lambda functions run code just like any other application. So, you can write a Lambda function that calls any service on the Internet, just like any computer. Of course, you'd need access to the external service across the Internet and access permissions.
There are some limitations to Lambda functions, such as a maximum execution time (now 15 minutes) and limited local disk space in /tmp (512 MB by default).
So when should you use Lambda? Typically, it's when you wish some code to execute in response to some event, such as a file being uploaded to Amazon S3, data being received via Amazon Kinesis, or a skill being activated on an Alexa device. If this fits your use-case, go for it!
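As a trivial illustration (the URL is a placeholder and no external libraries are assumed), a handler can call any service that is reachable over the Internet:

import json
import urllib.request

def lambda_handler(event, context):
    # Call an arbitrary external service -- here just a placeholder URL.
    with urllib.request.urlopen('https://api.example.com/status', timeout=10) as resp:
        body = resp.read().decode('utf-8')

    # Return whatever the external service said.
    return {'statusCode': 200, 'body': json.dumps({'external_response': body})}

The same idea applies to an external database: as long as the function can reach the database host over the network and has the right credentials, it can use the provider's normal client library.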
If I create a function to get webpages, will it execute on a different IP per execution so that my scraping requests don't get blocked?
I would use an AWS pipeline like this: at the source you have an EC2 instance running Jaunt, which feeds the URLs or HTML pages into a Kinesis stream. A Lambda then does your HTML parsing and, via Firehose, pushes everything into S3 or Redshift.
Jaunt can run via a standard web proxy service with a rotating IP.
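A rough sketch of the Lambda in the middle of that pipeline (the delivery stream name is a placeholder and the "parsing" is just a stand-in) might look like this: it decodes the Kinesis records, parses the HTML, and forwards the result to Firehose:

import base64
import json
import boto3

firehose = boto3.client('firehose')

def lambda_handler(event, context):
    for record in event['Records']:
        # Kinesis payloads arrive base64-encoded.
        html = base64.b64decode(record['kinesis']['data']).decode('utf-8')

        # Placeholder parsing step -- replace with your real HTML extraction.
        parsed = {'length': len(html)}

        # Forward the parsed result to Firehose, which lands it in S3/Redshift.
        firehose.put_record(
            DeliveryStreamName='scraper-delivery-stream',
            Record={'Data': (json.dumps(parsed) + '\n').encode('utf-8')},
        )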
Yes, Lambda by default executes from varying IPs. You can trigger it using something like EventBridge, so you can have a schedule that executes the script every hour or similar. Others may recommend using API Gateway; however, exposing an API endpoint that anyone can trigger is highly insecure, so you would have to write additional logic to protect it, either with hard-coded headers or, say, OAuth.
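For the scheduling part, a sketch along these lines (the rule name, function name, and ARN are placeholders) creates an hourly EventBridge rule, points it at the function, and allows EventBridge to invoke it:

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Hypothetical function ARN -- substitute your own.
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:scraper'

# Run the function once an hour.
rule = events.put_rule(Name='scrape-hourly', ScheduleExpression='rate(1 hour)', State='ENABLED')
events.put_targets(Rule='scrape-hourly', Targets=[{'Id': 'scraper', 'Arn': function_arn}])

# Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName='scraper',
    StatementId='allow-eventbridge-scrape-hourly',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)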
AWS Lambda doesn't have a fixed source IP, as mentioned here;
however, I suspect the IP only changes when the execution environment is recycled after cooling down, not on every invocation.
Your internal private IP address isn't what is seen by the web server; it sees the public IP address from AWS. Make sure you are setting the headers and signaling to the web server that your application is a bot. See this link for some best practices on creating a web scraper:
https://www.blog.datahut.co/post/web-scraping-best-practices-tips
The main thing, in my opinion, is just to be respectful of how much data you pull and how often.
Lambda is triggered when a file is placed in S3, or when data is added to Kinesis or DynamoDB. That is often backwards from what a web scraper needs, though certainly things like S3 could act as a queue/job runner.
Scraping on different IPs? Certainly Lambda is deployed across many machines, though that doesn't actually help you, since you can't control the machines or their IPs.