I have a csv with approx. 108,000 rows, each of which is a unique long/lat combination. The file is uploaded into S3 and visualised in Quicksight.
My problem is that Quicksight is only showing the first 10,000 points. The points that it shows are in the correct place, the map works perfectly, it's just missing 90%+ of the points I wish to show. I don't know if it makes a difference but I am using an admin enabled role for both S3 and Quicksight as this is a Dev environment.
Is there a way to increase this limit so that I can show all of my data points?
I have looked in the visualisation settings (the drop doen in the viz) and explored the tab on the left as much as I can. I am quite new to AWS so this may be a really easy one.
Thanks in advance!
You could consider combining lat/lng that are near each other based on some rule you come up when preparing your data.
There appears to be limitations on how many rows and columns you can serve to QuickSight:
https://docs.aws.amazon.com/quicksight/latest/user/data-source-limits.html
Related
what I have seen so far is that the aws glue crawler creates the table based on the latest changes in the s3 files.
let's say crawler creates a table and then I upload a CSV with updated values in one column. the crawler is run again and it updates the table's column with the updated values. I want to be able to show a comparison of the old and new data in quick sight eventually, is this scenario possible?
for example,
right now my csv file is set up as details of one aws service, like RDS is the csv file name and the columns are account id, account name, what region is it in, etc etc
there was one column of percentage with a value 50%, it gets updated with 70%. would I be able to somehow get the old value as well to show in quicksight, to say like previously it was 50% and now its 70%
Maybe this scenerio is not even valid? because I want to be able to show like what account has what cost in xyz month and show how the cost is different in other months. If I make separate tables on each update of csv then there would be 1000+ tables at one point.
If I have understood your question correctly, you are aiming to track data over time. Above you suggest creating a table for each time series, why not instead maintain a record in a table for each time series, you can then create various Analysis over the data, comparing specific months or tracking month-by-month values.
I've gone through some forensic analysis and implemented iCloud Account access programmatically (with authentication).
The problem I'm facing is that, when getting responses for Notes Table attachment, the table cells are out of order. E.g. sometimes the cells from the 2nd row appears first, and cells from the 1st row comes next. I couldn't find any meta data from which I can know the ordering of these cells.
FYI, I have decoded the attachment responses with Versioned-Document and Crframework protocall-buffer objects. If someone has worked with this technology or followed similar forensic analysis, can you help me getting order of the table cell data?
Thanks in advance.
Pretty new to the dynamoDb and the whole AWS, it's very exciting but I feel the learning curve is a bit steep. Anyway, here is my situation and my problem.
We have a mobile react native app which stores into a dynamoDb table one row each time the users are doing a search. (the database is a search history with a UUID and then the search criteria). On average we have a few thousands new searches into the table every day. The table has just a primary key which is the search id.
The app is quite new but we are reaching the few hundred thousand rows in the table already and can expect having a million in the following months. The data is plain simple data with unique id and string and numbers in the other attributes. No connection, no relationship, etc... That's already when I felt maybe DynamoDb may not have been the best choice but still, I read everywhere it can be suitable for anything if properly managed.
Next to this there is a webapp dashboard which -thanks to a rest api using nodejs lambdas- queries the dynamoDB to make statistics about the searches: how many searches per day, list of last searches... the problem is DynamoDb is not really suitable to query hundred thousands of data (the 1mb limit, query limitations, credits...).
When I do a scan I get only 3000 searches. I tried to make a loop on the scan using the last index requested but after a few test I did not get data and I blocked the maximum throughput. It seems really clear that I don't have the right approach to bring all these searches to my web app. So now what would be the right approach? My ideas are the following but I am open to more experienced one:
Switching to a SQL database (using the aws migration ?). Will it really be easier then?
creating lambdas to execute scheduled jobs every night to make statistics every day so that I don't have to query the full database all the time but just some of the most recent searches and the statistics rows? Is it doable? any node.js / lambdas tutorial you may know regarding this?
better management of indexes? I am still very lost regarding those.
Looking forward to your opinions.
Add another layer to take care for full text search.
For example, with Elasticsearch, or Algolia or other similars.
Notes:
Elasticsearch may be cost you a lot if compare the cost on dynamodb
Reference:
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/
I have a pipe delimited text file that is 360GB, compressed (gzip).
It has over 1,620 columns. I can't show the exact field names, but here's basically what it is:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
Seriously, there are over 800 of these property name/value fields.
There are roughly 280 million rows.
The file is in an S3 bucket.
I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to pivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
What is a good way to pivot the file in the aws environment? The data is in a single file, but I'm planning on splitting the data into many different files to allow for parallel processing.
I've considered using Athena. I couldn't find anything that states the maximum number of columns allowed by Athena. But, I found a page about Presto (on which Athena is based) that says “there is no exact hard limit, but we've seen stuff break with more than few thousand.” (https://groups.google.com/forum/#!topic/presto-users/7tv8l6MsbzI).
Thanks.
First, pivot your data, then load to Redshift.
In more detail, the steps are:
Run a spark job (using EMR or possibly AWS Glue) which reads in your
source S3 data and writes out (to a different s3 folder) a pivoted
version. by this i mean if you have 800 value pairs, then you would
write out 800 rows. At the same time, you can split the file into multiple parts to enable parallel load.
"COPY" this pivoted data into Redshift
What I learnt from most of the time from AWS is, if you are reaching a limit, you are doing it in a wrong way or not in a scalable way. Most of the time architects designed with scalability, performance in mind.
We had similar problems, having 2000 columns. Here is how we solved it.
Split the file across 20 different tables, 100+1 (primary key) column each.
Do a select across all those tables in a single query to return all the data you want.
If you say you want to see all the 1600 columns in a select, then the business user is looking at wrong columns for their analysis or even for machine learning.
To load 10TB+ of data we had split the data into multiple files and load them in parallel, that way loading was faster.
Between Athena and Redshift, performance is the only difference. Rest of them are same. Redshift performs better than Athena. Initial Load time and Scan Time is higher than Redshift.
Hope it helps.
We are looking at Amazon Redshift to implement our Data Warehouse and I would like some suggestions on how to properly design Schemas in Redshift, please.
I am completely new to Redshift. In the past when I worked with "traditional" data warehouses, I was used to creating schemas such as "Source", "Stage", "Final", etc. to group all the database objects according to what stage the data was in.
By default, a database in Redshift has a single schema, which is named PUBLIC. So, my question to those who have worked with Redshift, does the approach that I have outlined above apply here? If not, I would love some suggestions.
Thanks.
With my experience in working with Redshift, I can assert the following points with confidence:
Multiple schema: You should create multiple schema and create tables accordingly. When you'll scale, it'll be easier for you to pin-point where exactly the table is supposed to be. Let us say, you have 3 schema, named production, aggregates and rough. Now, you know that the table production will contain the tables that are not supposed to be changed (mostly OLTP data) - such as user, order, transactions tables. Table aggregates will have aggregated data built over raw tables - such as number of orders placed per user per day per category. Finally, rough will contain any table that doesn't hold a business logic but is required for some temporary work - let us say to check the genre of movies for a list of 1 lakh users, which is shared with you in an excel file. Simply create a table in rough schema, perform your operations and drop the table. Now you very clearly know where you'll find the tables based on whether they are raw, aggregated or simply temporary tables.
Public schema: Forget it exists. Any table that is not preceded with a schema name, gets created there. A lot of clutter - no point in storing any important data there.
Cross schema joins: There's no stopping here. You may join as many tables from as many schema as required. In fact, it is desirable you create dimension tables and join on a PK later, rather than to keep all the information in a single table.
Spend some quality time in designing the schema and underlying table structure. When you expand, it'll be easier for you to classify things better in terms of access control. Do let me know if I've missed some obvious points.
You can have multiple databases in a Redshift cluster but I would stick with one. You are correct that schemas (essentially namespaces) are a good way to divide things up. You can query across schemas but not databases.
I would avoid using the public schema as managing certain permissions there can be difficult (easier to deny someone access to public than prevent them from being able to create a table for example).
For best results if you have the time, learn about the permissions system up front. You want to create groups that have access to schemas or tables and add/remove users from groups to control what they can do. Once you have that going it becomes pretty easy to manage.
In addition to the other responses, here are some suggestions for improving schema performance.
First: Automatic compression encodings using COPY command
Improve the performance of Amazon Redshift using the COPY command. It will get data into Redshift database. The COPY command is clever enough. It automatically chooses the most appropriate encoding settings for the data it uploads. You don’t have to think about it. However, it does so only for the first data upload into an empty table.
So, make sure to use a significant data set while uploading data for the first time, which Redshift can assess to set the column encodings in the best way. Uploading a few lines of test data will confuse Redshift to know how best to optimize the compression to handle the real workload.
Second: Use Best Distribution Style and Key
Distribution-style decides how data is distributed across the nodes. Applying a distribution style at table level tells Redshift how you want to distribute the table and the key. So, how you specify distribution style is important for good query performance with Redshift. The style you choose may affect requirements for data storage and cluster. It also affects the time taken by the COPY command to execute.
I recommend setting the distribution style to all tables with a smaller dimension. For large dimension, distribute both the dimension and associated fact on their join column. To optimize the second large dimension, take the storage-hit and distribute ALL. You can even design the dimension columns into the fact.
Third: Use the Best Sort Key
A Redshift database maintains data in a table with an arrangement of a sort-key-column if specified. Since it’s sorted in each partition; each cluster node upholds its partition in predefined order. (While designing your Redshift schema, also consider the impact on your budget. Redshift is priced by amount of stored data and by the number of nodes.)
Sort key optimizes Amazon Redshift performance significantly. You can do it in many ways. First, use data filtering. If where-clause filters on a sort-key-column, it skips the entire data blocks. It’s because Redshift saves data in blocks. Each block header records the minimum and maximum sort key value. Filter outside of that range, the entire block may get skipped.
Alternatively, when joining two tables, sorted on their joint keys, the data is read in matching order. Also, you can merge-join without separate sort-steps. Joining large dimension to a large fact table will be easy with this method because neither will fit into a hash table.