I'm getting a HIVE_PARTITION_SCHEMA_MISMATCH error and I'm not quite sure what to do about it. When I look at the two different schemas, the only thing that's different is the order of the keys in one of my structs (created by a Glue crawler). I really don't care about the order of the data, and since I'm receiving the data as a JSON blob, I cannot guarantee the order of the keys.
struct<device_id:string,user_id:string,payload:array<struct<channel:string,sensor_id:string,type:string,unit:string,value:double,name:string>>,topic:string,channel:string,client_id:string,hardware_id:string,timestamp:bigint,application_id:string>
struct<device_id:string,user_id:string,payload:array<struct<channel:string,name:string,sensor_id:string,type:string,unit:string,value:double>>,topic:string,channel:string,client_id:string,hardware_id:string,timestamp:bigint,application_id:string>
I suggest you stop using Glue crawlers. It's probably not the response you had hoped for, but crawlers are really bad at their job. They can sometimes be useful as a way to get a schema from a random heap of data that someone else produced and that you don't want to spend time inspecting just to figure out its schema – but once you have a schema, and you know that new data will follow that schema, Glue crawlers are just in the way, and they produce unnecessary problems like the one you have encountered.
What to do instead depends on how new data is added to S3.
If you are in control of the code that produces the data, you can add code that adds partitions after the data has been uploaded. The benefit of this solution is that partitions are added immediately after new data has been produced so tables are always up to date. However, it might tightly couple the data producing code with Glue (or Athena if you prefer to add partitions through SQL) in a way that is not desirable.
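For example, if the producer uploads to s3://my-bucket/events/dt=2020-01-01/ and the table is partitioned on a dt column (the bucket name, table, and partition scheme here are all just assumptions), it could register the new partition through Athena with a couple of lines of boto3, roughly like this:

    import boto3

    athena = boto3.client("athena")

    def register_partition(dt):
        # All names here are placeholders: adjust the table, database, bucket and output location.
        athena.start_query_execution(
            QueryString=(
                "ALTER TABLE events ADD IF NOT EXISTS "
                f"PARTITION (dt = '{dt}') "
                f"LOCATION 's3://my-bucket/events/dt={dt}/'"
            ),
            QueryExecutionContext={"Database": "my_database"},
            ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
        )

    # Called right after the producer has finished uploading a day's data.
    register_partition("2020-01-01")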
If it doesn't make sense to add the partitions from the code that produces the data, you can create a Lambda function that does it. You can either schedule it to run at a fixed time every day (if you know where the new data will land you don't have to wait until it exists; partitions can point to empty locations), or you can trigger it from S3 notifications (if there are multiple files you can either figure out a way to debounce the notifications through SQS, or just create the partition over and over again and swallow the error if it already exists).
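A rough sketch of what such a Lambda could look like, assuming the same hypothetical my_database.events table partitioned on dt and an S3 key layout of events/dt=.../file.json (both are my assumptions, not anything from your setup):

    import boto3

    glue = boto3.client("glue")

    DATABASE = "my_database"  # placeholder names, as above
    TABLE = "events"

    def handler(event, context):
        table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
        storage = table["StorageDescriptor"]

        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]        # e.g. "events/dt=2020-01-01/part-0000.json"
            dt = key.split("dt=")[1].split("/")[0]     # pull the partition value out of the key

            # Reuse the table's serde/format settings, just point at the new location.
            partition_storage = dict(storage, Location=f"s3://{bucket}/events/dt={dt}/")
            try:
                glue.create_partition(
                    DatabaseName=DATABASE,
                    TableName=TABLE,
                    PartitionInput={"Values": [dt], "StorageDescriptor": partition_storage},
                )
            except glue.exceptions.AlreadyExistsException:
                pass  # another notification already created this partition; nothing to do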
You may also have heard of MSCK REPAIR TABLE …. It's better than Glue crawlers in some ways, but just as bad in other ways. It will only add new partitions, never change the schema, which is usually what you want, but it's extremely inefficient, and runs slower and slower the more files there are. Kind of like Glue crawlers.
I have been working with AWS Redshift and am kind of curious about which of the data loading (full reload) methods is more performant.
Approach 1 (Using Truncate):
Truncate the existing table
Load the data using Insert Into Select statement
Approach 2 (Using Drop and Create):
Drop the existing table
Load the data using Create Table As Select statement
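For concreteness, the two variants look roughly like this (the table names are placeholders, and driving them through the Redshift Data API is just one way to run them):

    import boto3

    rsd = boto3.client("redshift-data")
    common = dict(ClusterIdentifier="my-cluster", Database="analytics", DbUser="etl_user")  # placeholders

    # Approach 1: truncate, then reload in place
    rsd.batch_execute_statement(
        Sqls=[
            "TRUNCATE TABLE reporting.daily_sales;",
            "INSERT INTO reporting.daily_sales SELECT * FROM staging.daily_sales;",
        ],
        **common,
    )

    # Approach 2: drop, then recreate from the select
    rsd.batch_execute_statement(
        Sqls=[
            "DROP TABLE IF EXISTS reporting.daily_sales;",
            "CREATE TABLE reporting.daily_sales AS SELECT * FROM staging.daily_sales;",
        ],
        **common,
    )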
We have been using both in our ETL, but I am interested in understanding what's happening behind the scenes on the AWS side.
In my opinion, Drop and Create Table As should be more performant, as it avoids the overhead of scanning/handling the data blocks associated with the table that an Insert Into statement has to deal with.
Moreover, TRUNCATE in AWS Redshift does not reseed identity columns (see: Redshift Truncate table and reset Identity?).
Please share your thoughts.
Redshift operates on 1MB blocks as the base unit of storage and coherency. When changes are made to a table, it is these blocks that are "published" for all to see when the changes are committed. A table is just a list (data structure) of the block ids that compose it, and there can be many versions of a table in flight at any time (if it is being changed while others are viewing it).
For the sake of this question let's assume that the table in question is large (contains a lot of data), which I expect is true. These two statements end up performing a common action - unlinking and freeing all the blocks in the table. The blocks are where all the data lives, so you'd think that the speed of the two would be the same, and on idle systems they are close. Both automatically commit their results, so the command doesn't complete until the work is done. In this idle-system comparison I've seen DROP run faster, but then you need to CREATE the table again, so there is time needed to recreate the data structure of the table - though this can be in a transaction block, so do we need to include the COMMIT? The bottom line is that on an idle system these two approaches are quite close in runtime, and when I last measured them for a client the DROP approach was a bit faster. I would advise you to read on before making your decision.
However, in the real world Redshift clusters are rarely idle, and under load these two statements can behave quite differently. DROP requires exclusive control over the table since it does not run inside of a transaction block. All other uses of the table must be closed (committed or rolled back) before DROP can execute. So if you are performing this DROP/recreate procedure on a table others are using, the DROP statement will be blocked until all those uses complete. This can take an indeterminate amount of time. For ETL processing on "hidden" or "unpublished" tables the DROP/recreate method can work, but you need to be really careful about what other sessions are accessing the table in question.
TRUNCATE does run inside of a transaction but performs a commit upon completion. This means that it won't be blocked by others working with the table. It's just that one version of the table is full (for those who were looking at it before TRUNCATE ran) and one version is completely empty. The data structure of the table has a version for each session that has it open, and each sees the blocks (or lack of blocks) that correspond to its version. I suspect that it is managing these data structures and propagating the changes through the commit queue that slows TRUNCATE down slightly - bookkeeping. The upside of this bookkeeping is that TRUNCATE will not be blocked by other sessions reading the table.
The deciding factor in choosing between these approaches is often not performance; it is which one has the locking and coherency behavior that will work in your solution.
I'm starting on a new project and am going to be using DynamoDB as the main data source. A lot of what it does works perfectly for my needs, with a couple of exceptions.
Those exceptions are the sorting and paginating needs of the UI. Users can sort the data by anywhere from 8-10 different columns, and a result set of 20-30k+ rows needs to be paginated.
From what I can tell of DynamoDB, the only way to sort by all of those columns would be to expose that many sort keys through a variety of additional indexes, and that seems like a misuse of those concepts. And if I'm not going to sort the data with DynamoDB queries, I can't paginate there either.
So my question is, what's the quickest way once I have the data to paginate & sort? Should I move the result set into Aurora and then sort & page with SQL? I've thought about exporting to S3 and then utilizing something like Athena to page & sort, but that tool really seems to be geared to much larger datasets than this. What are other options?
One option is to duplicate the data and store it once for each sorting option, with each version of the record having different data in the sort key. If you are okay with eventual consistency that might be a little more delayed, you can accomplish this by having a Lambda that reads from a DynamoDB stream and inserts/updates/deletes the sorted records as the main records are inserted/updated/deleted.
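A minimal sketch of such a stream-triggered Lambda, assuming a hypothetical copy table named items_by_created_at keyed on (accountId, createdAt) and a stream configured with NEW_AND_OLD_IMAGES (all of those names and settings are my assumptions):

    import boto3
    from boto3.dynamodb.types import TypeDeserializer

    deserializer = TypeDeserializer()
    # Hypothetical copy table keyed on (accountId, createdAt); you'd have one copy per sort order.
    copy_table = boto3.resource("dynamodb").Table("items_by_created_at")

    def handler(event, context):
        for record in event["Records"]:
            if record["eventName"] == "REMOVE":
                old = {k: deserializer.deserialize(v) for k, v in record["dynamodb"]["OldImage"].items()}
                copy_table.delete_item(Key={"accountId": old["accountId"], "createdAt": old["createdAt"]})
            else:  # INSERT or MODIFY
                # If the attribute used as the sort key can itself change, delete the old copy here too.
                new = {k: deserializer.deserialize(v) for k, v in record["dynamodb"]["NewImage"].items()}
                copy_table.put_item(Item=new)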
Sorting, pagination and returning 20-30K records are not Dynamo's strong suit...
Why not just store the data in Aurora in the first place?
Depending on the data, Elasticsearch may be a better choice. Might even look at Redshift.
EDIT
If you've not seen this before...
Here is my basic data structure (or the relevant portions anyway) in DynamoDB: I have a files table that holds file data and has an id for the file. I also have a 'Definitions' table that holds items defined in the file. Definitions also have an ID (as the primary key) as well as a field called 'SourceFile' that references the file id in order to tie the definition to its source file.
Most of the time I want to just get the definition by its id and optionally get the file later, which works just fine. However, in some cases I need to get all definitions for a set of files. I can do this with a scan, but it's slow, I know it will only get slower as the table grows, and it isn't recommended. However, I'm not sure how to do this with a query.
I can create a GSI that uses the SourceFile field as the primary key and use that to query against. This sounds like an answer (and it may be), but I'm not sure. The problem is that some libraries may have 5k or 10k files (maybe more in rare cases). With a GSI I can only query against one file ID per query, so I would have to issue a new query for each file, and I can't imagine it's going to be very efficient to throw 10k queries at DynamoDB...
Is it better to create a tight loop (or multiple threads) and hit it with a ton of queries or to scan the table? Is there another way to do this that I'm not thinking of?
This is during an indexing and analysis process that is expected to take a bit of time, so it's OK that it's not instant, but I'd like it to be as efficient as possible...
Scans are the most efficient approach if you expect to be reading a majority of the data in your table. You can retrieve up to 1MB per scan request, and each unit of read capacity lets you read 4KB, so assuming you have enough capacity provisioned, you can retrieve thousands of items in a single request (assuming the items are pretty small).
The only alternative I can think of is to add more metadata that can help you index the files & definitions at a higher level - for instance, the library name/id. With that you can create a GSI on library name/id and query that way, as in the sketch below.
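Something like this boto3 sketch, where the LibraryId attribute and the LibraryId-index GSI are assumptions about how you'd model it, not anything that exists in your table today:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("Definitions")

    def definitions_for_library(library_id):
        # Page through every definition in a library via the (hypothetical) LibraryId-index GSI.
        items, start_key = [], None
        while True:
            kwargs = dict(
                IndexName="LibraryId-index",
                KeyConditionExpression=Key("LibraryId").eq(library_id),
            )
            if start_key:
                kwargs["ExclusiveStartKey"] = start_key
            response = table.query(**kwargs)
            items.extend(response["Items"])
            start_key = response.get("LastEvaluatedKey")
            if not start_key:
                return items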
Running thousands of queries is going to be less efficient than scanning, assuming you are storing on the order of tens or hundreds of thousands of items.
I know there is a way to place the results of a query into a table; there is a way to copy a whole table into another table; and there is a way to list a table piecemeal (tabledata:list using startIndex, maxResults and pageToken).
However, what I want to do is go over an existing table with tabledata:list and output the results piecemeal into other tables. I want to use this as an efficient way to shard a table.
I cannot find a reference to such a functionality, or any workaround to it for that matter.
Important to realize: the Tabledata.List API is not part of BQL (BigQuery SQL) but rather of the BigQuery API, which you can use from the client of your choice.
That said, the logic you outlined in your question can be implemented in many ways; below is one example (high-level steps):
Call Tabledata.List within a loop, using pageToken for the next iteration or for exiting the loop.
In each iteration, process the response from Tabledata.List, extract the actual data, and insert it into the destination table via streaming with the Tabledata.InsertAll API. You can also have an inner loop that goes through the rows extracted in a given iteration and decides which row should go to which table/shard.
This is very generic logic and the particular implementation depends on the client you use.
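As one concrete illustration, here is a rough Python sketch using the google-cloud-bigquery client library (which calls Tabledata.List for reading and the streaming insert API for writing); the table names, the sharding rule, and the assumption that the shard tables already exist with the same schema are all mine:

    from google.cloud import bigquery

    client = bigquery.Client()
    source = client.get_table("my-project.my_dataset.big_table")  # made-up names
    shards = ["my-project.my_dataset.big_table_{}".format(i) for i in range(4)]

    def shard_for(row):
        # Made-up routing rule: shard by a hypothetical integer user_id column.
        return shards[row["user_id"] % len(shards)]

    batches = {shard: [] for shard in shards}
    for row in client.list_rows(source, page_size=1000):  # pages through the table via tabledata.list
        target = shard_for(row)
        batches[target].append(dict(row))                 # assumes JSON-friendly column types
        if len(batches[target]) >= 500:
            errors = client.insert_rows_json(target, batches[target])  # streaming insert
            assert not errors, errors
            batches[target] = []

    for target, rows in batches.items():                  # flush the leftovers
        if rows:
            errors = client.insert_rows_json(target, rows)
            assert not errors, errors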
Hope this helps
For what you describe, I'd suggest you use the batch version of Cloud Dataflow:
https://cloud.google.com/dataflow/
Dataflow already supports BigQuery tables as sources and sinks, and will keep all data within Google's network. This approach also scales to arbitrarily large tables.
TableData.list-ing your entire table might work fine for small tables, but network overhead aside, it is definitely not recommended for anything of moderate size.
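If it helps, a minimal Apache Beam (Python SDK) sketch of such a pipeline might look roughly like this; the table names, the sharding rule, and the assumption that the shard tables already exist are mine, and the usual Dataflow pipeline options (project, region, temp_location, runner) are assumed to be supplied separately:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def shard_table(row):
        # Made-up routing rule: send each row to one of 4 pre-created shard tables.
        return "my-project:my_dataset.big_table_shard_{}".format(row["user_id"] % 4)

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "Read" >> beam.io.ReadFromBigQuery(table="my-project:my_dataset.big_table")
         | "Write" >> beam.io.WriteToBigQuery(
               table=shard_table,
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))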
I'm really confused about how to do this and which AWS services to use for my case.
I have a web application which stores user interaction events. Currently these events are stored in an RDS table. Each event contains about 6 fields like timestamp, event type, userID, pageID, etc. Currently I have millions of event records in each account schema. When I try to generate reports out of this raw data, the reports are extremely slow, since I run complex aggregation queries over long time periods. A report covering a 30-day period might take 4 minutes to generate on RDS.
Is there any way to make these reports run MUCH faster? I was thinking about storing the events in DynamoDB, but I cannot run such complex queries on the data there, or do any attribute-based sorting.
Is there a good service combination to achieve this? Maybe using Redshift, EMR, Kinesis?
I think Redshift is your solution.
I'm working with a dataset that generates about 2,000,000 new rows each day and I run really complex operations on it. You could take advantage of Redshift sort keys and order your data by date.
Also, if you run complex aggregate functions, I really recommend denormalizing all the information and putting it in one single table with all the data. Redshift uses very efficient, automatic column compression, so you won't have problems with the size of the dataset.
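For instance, a denormalized events table sorted by time could be declared roughly like this (every name here is a placeholder, and issuing the DDL through the Redshift Data API from Python is just one way to do it):

    import boto3

    rsd = boto3.client("redshift-data")

    ddl = """
    CREATE TABLE events_denormalized (
        event_time  TIMESTAMP,
        event_type  VARCHAR(64),
        user_id     BIGINT,
        page_id     BIGINT,
        account_id  BIGINT,
        page_name   VARCHAR(256)  -- denormalized attributes live right on the event row
    )
    DISTKEY (user_id)     -- placeholder choice; pick whatever you join/group on most
    SORTKEY (event_time)  -- date-restricted scans then only touch the relevant blocks
    """

    rsd.execute_statement(
        ClusterIdentifier="my-cluster", Database="analytics", DbUser="etl_user", Sql=ddl
    )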
My usual solution to problems like this is to have a set of routines that roll up and store the aggregated results, at various levels, in additional RDS tables. The transactional information you are storing isn't likely to change once logged, so, for example, if you find yourself running daily/weekly/monthly rollups of various slices of data, run the query once and store those results - not necessarily at the final level that you will need, but at a level that significantly reduces the number of rows that go into those eventual rollups. For example, have a daily table that summarizes event type, userId and pageId with one row per day instead of one row per event (or one row per hour instead of per day). You'll need to figure out the most logical rollups to make, but you get the idea: the goal is to pre-summarize at levels that reduce the amount of raw data but still give you plenty of flexibility to serve your reports.
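As an illustration of the kind of rollup I mean (the table and column names are invented, and I'm assuming a PostgreSQL-flavored RDS instance reachable with psycopg2):

    import psycopg2

    conn = psycopg2.connect(host="my-rds-host", dbname="analytics", user="etl", password="...")

    ROLLUP_SQL = """
    INSERT INTO events_daily_rollup (day, event_type, user_id, page_id, event_count)
    SELECT date_trunc('day', event_timestamp) AS day,
           event_type, user_id, page_id,
           COUNT(*) AS event_count
    FROM events
    WHERE event_timestamp >= %s AND event_timestamp < %s
    GROUP BY 1, 2, 3, 4;
    """

    def rollup_day(day_start, day_end):
        # Pre-summarize one day of raw events; reports then read events_daily_rollup instead.
        with conn, conn.cursor() as cur:
            cur.execute(ROLLUP_SQL, (day_start, day_end))

    rollup_day("2020-01-01", "2020-01-02")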
You can always go back to the granular/transactional data as long as you keep it around, but there is not much to be gained by constantly calculating the same results every time you want to use the data.