Automating CSV analysis? - amazon-web-services

My e-commerce company generates lots of CSV data. To track order status, the team must download a number of tracker files, join them together, and then analyse them, which is a time-consuming process. Which AWS low-code solution can be used to automate this workflow?

Depending on what 'workflow' you require, a few options are:
Amazon Honeycode, which is a low-code application builder
You can filter and retrieve data using Amazon S3 Select, which works on individual CSV files. This can be scripted via the AWS CLI or an AWS SDK (a minimal sketch follows this list)
If you want to run SQL and create JOINs between multiple files, then Amazon Athena is fantastic. This, too, can be scripted.
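For the S3 Select route, a minimal boto3 sketch might look like this; the bucket name, object key, and column names are placeholders for illustration, not anything from your setup:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, key, and columns for illustration only.
response = s3.select_object_content(
    Bucket="my-ecommerce-data",
    Key="trackers/2023-10-01/orders.csv",
    ExpressionType="SQL",
    # S3 Select runs SQL against a single object at a time.
    Expression="SELECT s.order_id, s.status FROM S3Object s WHERE s.status = 'SHIPPED'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; collect the record payloads as they arrive.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```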

While not exactly low-code, Amazon Athena uses standard SQL queries to analyze CSV files, among many other formats.
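When scripting Athena from Python, the flow is: submit the query, poll for completion, then read the results. A rough sketch, assuming your tracker CSVs are already registered as tables (orders, tracking) in a Glue/Athena database called orders_db and that you have an S3 results location; all of those names are made up for illustration:

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical database, tables, and output bucket for illustration.
query = """
    SELECT o.order_id, o.status, t.carrier, t.last_scan
    FROM orders o
    JOIN tracking t ON o.order_id = t.order_id
    WHERE o.order_date >= DATE '2023-10-01'
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "orders_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```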

Related

What is the best way to transfer BigQuery data to SFTP?

The data is a few hundred MB up to a few GB. The job could run some BQ procedures and, at the end, a SELECT; the resulting values need to be transferred as a valid CSV to an SFTP server.
Cloud Functions could be problematic because of the 9-minute timeout limit and the 2 GB RAM limit.
Is there a serverless solution, or do I have to run instances manually?
There are two scenarios I would consider:
Export the table with the standard BQ export options (here) into a GCS bucket. Then you can pick it up and upload it to SFTP with Cloud Run; there are containers built for this, e.g. this one. A sketch of this approach follows below.
Run a pipelining project. Given that you want a simple export, I would suggest Dataflow: you can write a small piece of Python or Java code to pick up a file and upload it to SFTP. If you would like more complex processing logic, have a look at Dataproc.
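A rough Python sketch of the first scenario (export to GCS, then push to SFTP). The project, dataset, bucket, and SFTP host are placeholders, and paramiko is just one of several SFTP client options:

```python
from google.cloud import bigquery, storage
import paramiko

# Hypothetical identifiers for illustration only.
PROJECT = "my-project"
TABLE = "my-project.my_dataset.export_table"
BUCKET = "my-export-bucket"
BLOB = "exports/export_table.csv"

# 1. Export the BigQuery table to a CSV file in GCS.
#    Tables over 1 GB must be exported with a wildcard URI (gs://.../part-*.csv),
#    which produces multiple files.
bq = bigquery.Client(project=PROJECT)
extract_job = bq.extract_table(TABLE, f"gs://{BUCKET}/{BLOB}")
extract_job.result()  # wait for the export to finish

# 2. Download the exported file locally.
#    On Cloud Run the writable filesystem is in-memory, so file size counts
#    against the container's memory allocation.
gcs = storage.Client(project=PROJECT)
gcs.bucket(BUCKET).blob(BLOB).download_to_filename("/tmp/export_table.csv")

# 3. Upload the CSV to the SFTP server.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put("/tmp/export_table.csv", "/upload/export_table.csv")
sftp.close()
transport.close()
```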

What is the efficient way of pulling data from S3 among boto3, Athena and the AWS command line utils?

Can someone please let me know what is the most efficient way of pulling data from S3? Basically, I want to pull out data for a given time range, apply some filters over the data (JSON), and store it in a DB. I am new to AWS and, after a little research, found that I can do it via the boto3 API, Athena queries, or the AWS CLI, but I need some advice on which one to go with.
If you are looking for the simplest and most straightforward solution, I would recommend the AWS CLI. It's perfect for running commands to download a file, list a bucket, etc. from the command line or a shell script.
If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature rich IMO and much cleaner than running shell commands in your application.
If your application that is pulling the data is written in python, then I definitely recommend boto3. Make sure to read the difference between a boto3 client vs resource.
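As a rough boto3 sketch under assumed bucket/prefix names, listing objects modified within a time window and filtering their JSON contents before loading them into your DB:

```python
import json
from datetime import datetime, timezone

import boto3

# Hypothetical bucket, prefix, time range, and filter for illustration.
BUCKET = "my-data-bucket"
PREFIX = "events/2023/10/"
START = datetime(2023, 10, 1, tzinfo=timezone.utc)
END = datetime(2023, 10, 2, tzinfo=timezone.utc)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

matched = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Filter objects by their LastModified timestamp.
        if START <= obj["LastModified"] < END:
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            record = json.loads(body)
            # Apply whatever filter you need before writing to your DB.
            if record.get("status") == "ERROR":
                matched.append(record)

print(f"{len(matched)} matching records")
```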
Some options:
Download and process: Launch a temporary EC2 instance, have a script download the files of interest (eg one day's files?), use a Python program to process the data. This gives you full control over what is happening.
Amazon S3 Select: This is a simple way to extract data from CSV files, but it only operates on a single file at a time.
Amazon Athena: Provides an SQL interface to query across multiple files using Presto. Serverless, fast. Charged based on the amount of data read from disk (so it is cheaper on compressed data).
Amazon EMR: Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.
Based on your description (10 files, 300MB, 200k records) I would recommend starting with Amazon Athena since it provides a friendly SQL interface across many data files. Start by running queries across one file (this makes it faster for testing) and once you have the desired results, run it across all the data files.
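If you go the Athena route with JSON data, the one-time setup is a DDL statement that points an external table at your S3 prefix; after that, your time-range and filter logic is ordinary SQL. A hedged sketch, where the database, table, columns, and S3 locations are all assumptions:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical schema and locations; Athena's JSON SerDe expects one
# JSON object per line in the objects under LOCATION.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events (
    event_time string,
    user_id    string,
    status     string,
    payload    string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-bucket/events/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```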

Handling a very large volume (500TB) of data using Spark

I have a large volume of data, nearly 500TB, and I have to do some ETL on it.
The data is in AWS S3, so I am planning to use an AWS EMR setup to process it, but I am not sure what configuration I should select.
What kind of cluster do I need (master, and how many slaves)?
Do I need to process it chunk by chunk (10GB), or can I process all the data at once?
What should the master and slave (executor) memory be, both RAM and storage?
What kind of processor (speed) do I need?
Based on this I want to calculate the cost of AWS EMR and start processing the data.
Based upon your question, you have little or no experience with Hadoop. Get some training first so that you understand how the Hadoop ecosystem works. Plan on spending three months to get to a starter level.
You have a lot of choices to make, some of which are fundamental to a project's success. For example: which language (Scala, Java or Python)? Which tools (Spark, Hive, Pig, etc.)? What format is your data in (CSV, XML, JSON, Parquet, etc.)? Do you only need batch processing, or do you require near real-time analysis? And so on.
You may find other AWS services more applicable such as Athena or Redshift depending on what format your data is in and what information you are trying to extract / process.
With 500 TB in AWS, open a ticket with support. Explain what you have, what you want and your time frame. An SA will be available to direct you on a path.
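For orientation, the Spark side of such an ETL job can be quite small once the cluster questions are settled. A minimal PySpark sketch, where the bucket paths, column names, and filter are assumptions rather than your actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On EMR, spark-submit provides the S3 connectors; the paths below are hypothetical.
spark = SparkSession.builder.appName("s3-etl-sketch").getOrCreate()

# Read the raw CSV data directly from S3; Spark parallelizes across the cluster,
# so there is no need to chunk the input manually.
raw = spark.read.csv("s3://my-raw-bucket/input/", header=True, inferSchema=True)

# Example transformation: keep completed orders and add a load date.
cleaned = (
    raw.filter(F.col("status") == "COMPLETED")
       .withColumn("load_date", F.current_date())
)

# Write the result back to S3 as compressed, partitioned Parquet.
cleaned.write.mode("overwrite").partitionBy("load_date").parquet(
    "s3://my-curated-bucket/output/"
)

spark.stop()
```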

How can I implement Amazon EMR to read data from my API calls?

All the examples I've seen are with Java programs.
I want to be able to track a user's behaviour while navigating my website by looking at all the API calls made by that user. All the API calls are based on data stored in a SQL database.
I also want, for example, to check all the keywords passed to my search API to get a list of the most-searched terms.
I thought about using Oozie, but does anyone have any other suggestions?
There are several options for analyzing the data in your database.
Normal SQL experimentation
I'd suggest starting with normal SQL statements against your database to experiment with finding what data is of interest. This might be a little slow if you have millions of records, but gives you full flexibility to play around with the data.
Amazon EMR
Once you have identified the types of analysis you'd like to run on a regular basis (eg daily or weekly), you could launch an EMR cluster to perform analysis. Please note that this is a powerful but rather complex toolset and the time required to fully utilize it might not be worthwhile.
You can launch a transient cluster, which means that the cluster terminates once it has finished the jobs it has been given. Thus, the cluster can be triggered via a scheduled API call and will automatically terminate.
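A hedged boto3 sketch of launching such a transient cluster that runs one Spark step and then terminates itself; the release label, instance types, roles, and S3 paths are placeholder assumptions:

```python
import boto3

emr = boto3.client("emr")

# Hypothetical S3 locations and sizing for illustration.
response = emr.run_job_flow(
    Name="nightly-analysis",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-emr-logs/",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Terminate automatically once all steps have completed.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "analyze-api-calls",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-scripts/analyze_api_calls.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```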
Amazon Athena
Amazon Athena provides an SQL interface to data stored in Amazon S3. The common use-case is to analyze log files that are in S3 without having to load them into a database. Athena is powerful and processes data in parallel to give results back very quickly.
Bottom line: Start simple. Play with the existing data to figure out what you'd like to discover. Then optimize.

How to replicate Salesforce data to AWS S3/RDS?

I am trying to build a data warehouse using Redshift in AWS. I want to preprocess Salesforce data by moving it to RDS or S3 (using them as a staging area) before finally moving it to Redshift. I am trying to find out the different ways I can replicate Salesforce data to S3/RDS for this purpose. I have seen many third-party tools that are able to do this, but I am looking for something that can be built in-house. I would like to use this data for dimensional modeling.
Thanks for your help!