Is there a way to store the results from Pig directly to a table on Redshift?
Yes, but you probably won't like it - it's not efficient.
Download the JDBC driver (http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html)
Use PiggyBank's DBStorage; there is a MySQL example here.
A better way is to prepare a CSV and then import it, e.g. with Redshift's COPY command.
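If you go the CSV route, here is a minimal sketch of the import step, assuming a psycopg2 connection to the cluster; the endpoint, bucket, table, and IAM role are placeholders and not part of the original answer.

```python
# Sketch: load a CSV that Pig wrote to S3 into Redshift with COPY.
# Cluster endpoint, table, bucket, and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
copy_sql = """
    COPY my_schema.pig_results
    FROM 's3://my-bucket/pig-output/part-'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV;
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift loads all matching S3 objects in parallel
```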
Using CSV upload in Apache Superset works as expected. I can use it to add data from a CSV to a database, e.g. Postgres. Now I want to append data from a different CSV to this table/dataset. But how?
The CSVs all have the same format, but there is a new one every day. In the end I want a dashboard that updates daily, taking the new data into account.
Generally, I agree with Ana that if you want to repeatedly upload new CSV data, you're better off operationalizing this into some type of process, pipeline, etc. that runs on a schedule.
But if you need to stick with the uploading CSV route through the Superset UI, then you can set the Table Exists field to Append instead of Replace.
You can find a helpful GIF in the Preset docs: https://docs.preset.io/docs/tips-tricks#append-csv-to-a-database
You'll probably be better served by creating a simple process to load the CSV into a table in the database and then querying that table in Superset.
Superset is a tool for visualizing data. It allows uploading a CSV for quick-and-dirty, one-off charts, but if this is going to be a recurrent, structured, periodic load of data, it's better to use a proper integration tool to load it. There are zillions of ETL (Extract-Transform-Load) tools out there (or scripting programs to do it); ask whether your company is already using one, or choose the one that is simplest for you.
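If you do operationalize it, a minimal sketch of such a daily load, assuming a Postgres table that Superset already queries; the file path, connection string, and table name are illustrative, and pandas/SQLAlchemy is just one way to do it.

```python
# Sketch: append today's CSV to an existing Postgres table that Superset reads.
# File path, connection string, and table name are placeholders.
from datetime import date

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/reports")
csv_path = f"/data/exports/metrics_{date.today():%Y-%m-%d}.csv"

df = pd.read_csv(csv_path)
# if_exists="append" adds rows instead of replacing the table, mirroring
# Superset's "Table Exists: Append" upload option.
df.to_sql("daily_metrics", engine, if_exists="append", index=False)
```

Run on a daily schedule (cron, Airflow, or whatever your team uses), and the dashboard picks up the new rows automatically.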
We want to use AWS Athena for analytics and segmentation. Our problem is that our data is schemaless: rows differ, with only some columns in common.
Is it possible to create table without defining all the columns?
When we query, we know the type (string/int) of each column, so if there is a way to define it in the query, that would be great.
We can structure the data in any way needed to support schemaless use, and in any format: CSV or JSON.
Is Athena an option for schemaless uses?
There are many ways to use Athena for schemaless data, and you need to give specific examples of the scenarios you want to support. In Athena you pay based on the data that you scan, so optimizing your data to minimize the scan is critical to making it a useful tool at scale.
The simplest way to get started, while you are learning the tool and the types of queries you can run on your data, is to define a table with a single column ("line") and then parse the data you want using string functions, or JSON functions if the lines are in JSON format.
You will get good time performance if you have multiple files, but it will be expensive, as you need to scan all your data for every query. I suggest that you start with these queries as a good way to define your requirements. As you see usage grow, start optimizing the use cases with CTAS (Create Table As Select) commands that generate Parquet versions of the original raw data to support the more popular (and expensive) use cases.
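As an illustration of the single-column approach, here is a sketch submitted through boto3; the database, bucket, JSON field names, and the tab delimiter (assumed not to appear in the data) are all placeholders, not anything from the original answer.

```python
# Sketch: a one-column ("line") table over raw JSON lines, queried with JSON
# functions. Database, S3 locations, and field names are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (line string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://my-bucket/raw-events/'
"""

query = """
SELECT json_extract_scalar(line, '$.user_id')                 AS user_id,
       CAST(json_extract_scalar(line, '$.amount') AS integer) AS amount
FROM raw_events
WHERE json_extract_scalar(line, '$.event_type') = 'purchase'
"""

for sql in (ddl, query):
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_db"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )
```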
You are welcome to read my blog post describing the strategy and tactics of a cloud environment using Athena and the other AWS tools around it.
I have a .sql file filled with Athena queries.
Is there a way I can tell Athena to run the sql queries saved in s3://my-bucket/path/to/queries.sql?
In MySQL I can do something like this (based on an SO answer), but I'm curious whether it's possible in Athena:
mysql> source \home\user\Desktop\test.sql;
Is there a way I can tell Athena to run the sql queries saved in s3://my-bucket/path/to/queries.sql?
I think there is no direct way to tell Athena to run a query stored in S3.
In MySQL I can do something like this (based on an SO answer), but I'm curious whether it's possible in Athena.
If you want to do it at all, then yes, you should be able to run the query using the AWS CLI.
Your steps would look like this:
Get the query from S3 using the CLI and store it in a temp variable.
Pass the query stored in the temp variable to the Athena query CLI (see the sketch below).
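The steps above use the CLI; for a self-contained illustration, here is the same idea sketched with boto3 (the Python SDK) instead. Bucket, key, database, and result location are placeholders, and since Athena executes one statement per call, the naive split on ";" assumes the file contains simple statements.

```python
# Sketch: fetch a .sql file from S3 and submit its statements to Athena.
# Bucket/key, database, and output location are placeholders; the split on ";"
# is naive and assumes plain statements without embedded semicolons.
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

obj = s3.get_object(Bucket="my-bucket", Key="path/to/queries.sql")
sql_text = obj["Body"].read().decode("utf-8")

for statement in (s.strip() for s in sql_text.split(";")):
    if not statement:
        continue
    athena.start_query_execution(
        QueryString=statement,
        QueryExecutionContext={"Database": "my_db"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )
```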
Hope this will help.
I have been looking at options to load (basically empty and restore) a Parquet file from S3 to DynamoDB. The Parquet file itself is created via a Spark job that runs on an EMR cluster. Here are a few things to keep in mind:
I cannot use AWS Data Pipeline.
The file is going to contain millions of rows (say 10 million), so I would need an efficient solution. I believe the boto API (even with batch write) might not be that efficient?
Are there any other alternatives?
Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries into DynamoDB? Ignoring the challenge of caching the DynamoDB client in each worker for reuse across rows, some bit of Scala that takes a row, builds an entry for DynamoDB, and PUTs it should be enough.
BTW: use DynamoDB on-demand here, as it handles peak loads well without you having to commit to some SLA.
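The answer mentions Scala; here is the same idea sketched in PySpark instead, with one DynamoDB client per partition so it is reused across that partition's rows. The table name, path, and region are placeholders, and numeric columns may need converting to Decimal before DynamoDB accepts them.

```python
# Sketch: read the Parquet export from S3 and have each Spark worker write its
# partition's rows to DynamoDB, creating one client per partition for reuse.
# Table name, path, and region are placeholders.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-dynamodb").getOrCreate()
df = spark.read.parquet("s3://my-bucket/exports/table.parquet")

def write_partition(rows):
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("my_table")
    with table.batch_writer() as batch:        # buffers and retries BatchWriteItem
        for row in rows:
            batch.put_item(Item=row.asDict())  # floats may need Decimal conversion

df.foreachPartition(write_partition)
```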
Look at the answer below:
https://stackoverflow.com/a/59519234/4253760
To explain the process:
Create the desired dataframe.
Use .withColumn to create a new column and use psf.collect_list to convert it to the desired collection/JSON format, in the new column in the same dataframe (see the sketch below).
Drop all unnecessary (tabular) columns and keep only the JSON-format DataFrame columns in Spark.
Load the JSON data into DynamoDB as explained in the answer.
My personal suggestion: whatever you do, do NOT use RDDs. The RDD interface, even in Scala, is 2-3 times slower than the DataFrame API of any language.
The DataFrame API's performance is programming-language agnostic, as long as you don't use UDFs.
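One possible reading of steps 2-3, sketched with to_json/struct; the linked answer builds its collection with psf.collect_list, so treat this as an illustrative variant with placeholder paths and column handling.

```python
# Sketch: collapse the tabular columns into a single JSON-formatted column and
# drop the originals (steps 2-3 above). Uses to_json(struct(...)); the linked
# answer uses psf.collect_list for its particular collection format.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/exports/table.parquet")  # placeholder path

json_df = (
    df.withColumn("item_json", F.to_json(F.struct(*df.columns)))  # one JSON doc per row
      .select("item_json")                                        # keep only the JSON column
)
```

From there, each row's item_json can be written to DynamoDB as explained in the linked answer (or with a per-partition batch writer as sketched above).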
I have a CSV table in S3 with hundreds of attributes/features, and I don't want to create a table in Redshift with all of these attributes before importing the data. Is there any way to select only the columns I need while copying data from S3 into Redshift?
You cannot achieve the above using just a COPY command; it is doable using a Python script. Please go through this:
Read specific columns from a csv file with csv module?
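A minimal sketch of that idea with the csv module, using placeholder file and column names: trim the CSV down to the columns you need, then COPY the trimmed file into the narrower Redshift table.

```python
# Sketch: keep only the needed columns from a wide CSV so the trimmed file can
# be COPYed into a Redshift table that defines just those columns.
# File names and column names are placeholders.
import csv

wanted = ["user_id", "event_date", "revenue"]  # the columns the Redshift table defines

with open("wide_input.csv", newline="") as src, \
     open("trimmed_output.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=wanted)
    writer.writeheader()
    for row in reader:
        writer.writerow({col: row[col] for col in wanted})
```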
There are a couple of options listed in the AWS forum for this problem; take a look at https://forums.aws.amazon.com/message.jspa?messageID=432590 to see if they may work for you.