Reflecting changes on big tables in hdfs - hdfs

I have an order table in the OLTP system.
Each order record has a OrderStatus field.
When end users created an order, OrderStatus field set as "Open".
When somebody cancels the order, OrderStatus field set as "Canceled".
When an order process finished(transformed into invoice), OrderStatus field set to "Close".
There are more than one hundred million record in the table in the Oltp system.
I want to design and populate data warehouse and data marts on hdfs layer.
In order to design data marts, I need to import whole order table to hdfs and then I need to reflect changes on the table continuously.
First, I can import whole table into hdfs in the initial load process by using sqoop. I may take long time but I will do this once.
When an order record is updated or a new order record entered, I need to reflect changes in hdfs. How can I achieve this in hdfs for such a big transaction table?
Thanks

One of the easier ways is to work with database triggers in your OLTP source db and every change an update happens use that trigger to push an update event to your Hadoop environment.
On the other hand (this depends on the requirements for your data users) it might be enough to reload the whole data dump every night.
Also, if there is some kind of last changed timestamp, it might be a possible way to load only the newest data and do some kind of delta check.
This all depends on your data structure, your requirements and your ressources at hand.
There are several other ways to do this but usually those involve messaging, development and new servers and I suppose in your case this infrastructure or those ressources are not available.
EDIT
Since you have a last changed date, you might be able to pull the data with a statement like
SELECT columns FROM table WHERE lastchangedate < (now - 24 hours)
or whatever your interval for loading might be.
Then process the data with sqoop or ETL tools or the like. If the records are already available in your Hadoop environment, you want to UPDATE it. If the records are not available, INSERT them with your appropriate mechanism. This is also called UPSERTING sometimes.

Related

Creating a BigQuery Transfer Service with more complex reges

I have a bucket that stores files based on a transaction time into a filename structure like
gs://my-bucket/YYYY/MM/DD/[autogeneratedID].parquet
Lets assume this structure dates back to 2015/01/01
Some of the files might arrive late, so in theory a new file could be written to the 2020/07/27 structure tomorrow.
I now want to create a BigQuery table that inserts all files with transaction date 2019-07-01 and newer.
My current strategy is to slice the past into small enough chunks to just run batch loads, e.g. by month. Then I want to create a transfer service that listens for all new files coming in.
I cannot just point it to gs://my-bucket/* as this would try to load the date prior to 2019-07-01.
So basically I was thinking about encoding the "future looking file name structures" into a suitable regex, but it seems like the wildcard names https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames only allow for very limited syntax which is not as flexible as awk regex for instance.
I know there are streaming inserts into BQ but still hoping to avoid that extra complexity and just make a smart configuration of the transfer config following the batch load.
You can use scheduled queries with external table. When you query your external table, you can use the pseudo column _FILE_NAME in the where condition of your request and filter on this.

AWS Athena Query Partitioning

I am trying to use AWS Athena to provide analytics for an existing platform. Currently the flow looks like this:
Data is pumped into a Kinesis Firehose as JSON events.
The Firehose converts the data to parquet using a table in AWS Glue and writes to S3 either every 15 mins or when the stream reaches 128 MB (max supported values).
When the data is written to S3 it is partitioned with a path /year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/...
An AWS Glue crawler update a table with the latest partition data every 24 hours and makes it available for queries.
The basic flow works. However, there are a couple of problems with this...
The first (and most important) is that this data is part of a multi-tenancy application. There is a property inside each event called account_id. Every query that will ever be issued will be issued by a specific account and I don't want to be scanning all account data for every query. I need to find a scalable way query only the relevant data. I did look into trying to us Kinesis to extract the account_id and use it as a partition. However, this currently isn't supported and with > 10,000 accounts the AWS 20k partition limit quickly becomes a problem.
The second problem is file size! AWS recommend that files not be < 128 MB as this has a detrimental effect on query times as the execution engine might be spending additional time with the overhead of opening Amazon S3 files. Given the nature of the Firehose I can only ever reach a maximum size of 128 MB per file.
With that many accounts you probably don't want to use account_id as partition key for many reasons. I think you're fine limits-wise, the partition limit per table is 1M, but that doesn't mean it's a good idea.
You can decrease the amount of data scanned significantly by partitioning on parts of the account ID, though. If your account IDs are uniformly distributed (like AWS account IDs) you can partition on a prefix. If your account IDs are numeric partitioning on the first digit would decrease the amount of data each query would scan by 90%, and with two digits 99% – while still keeping the number of partitions at very reasonable levels.
Unfortunately I don't know either how to do that with Glue. I've found Glue very unhelpful in general when it comes to doing ETL. Even simple things are hard in my experience. I've had much more success using Athena's CTAS feature combined with some simple S3 operation for adding the data produced by a CTAS operation as a partition in an existing table.
If you figure out a way to extract the account ID you can also experiment with separate tables per account, you can have 100K tables in a database. It wouldn't be very different from partitions in a table, but could be faster depending on how Athena determines which partitions to query.
Don't worry too much about the 128 MB file size rule of thumb. It's absolutely true that having lots of small files is worse than having few large files – but it's also true that scanning through a lot of data to filter out just a tiny portion is very bad for performance, and cost. Athena can deliver results in a second even for queries over hundreds of files that are just a few KB in size. I would worry about making sure Athena was reading the right data first, and about ideal file sizes later.
If you tell me more about the amount of data per account and expected life time of accounts I can give more detailed suggestions on what to aim for.
Update: Given that Firehose doesn't let you change the directory structure of the input data, and that Glue is generally pretty bad, and the additional context you provided in a comment, I would do something like this:
Create an Athena table with columns for all properties in the data, and date as partition key. This is your input table, only ETL queries will be run against this table. Don't worry that the input data has separate directories for year, month, and date, you only need one partition key. It just complicates things to have these as separate partition keys, and having one means that it can be of type DATE, instead of three separate STRING columns that you have to assemble into a date every time you want to do a date calculation.
Create another Athena table with the same columns, but partitioned by account_id_prefix and either date or month. This will be the table you run queries against. account_id_prefix will be one or two characters from your account ID – you'll have to test what works best. You'll also have to decide whether to partition on date or a longer time span. Dates will make ETL easier and cheaper, but longer time spans will produce fewer and larger files, which can make queries more efficient (but possibly more expensive).
Create a Step Functions state machine that does the following (in Lambda functions):
Add new partitions to the input table. If you schedule your state machine to run once per day it can just add the partition that correspond to the current date. Use the Glue CreatePartition API call to create the partition (unfortunately this needs a lot of information to work, you can run a GetTable call to get it, though. Use for example ["2019-04-29"] as Values and "s3://some-bucket/firehose/year=2019/month=04/day=29" as StorageDescriptor.Location. This is the equivalent of running ALTER TABLE some_table ADD PARTITION (date = '2019-04-29) LOCATION 's3://some-bucket/firehose/year=2019/month=04/day=29' – but doing it through Glue is faster than running queries in Athena and more suitable for Lambda.
Start a CTAS query over the input table with a filter on the current date, partitioned by the first character(s) or the account ID and the current date. Use a location for the CTAS output that is below your query table's location. Generate a random name for the table created by the CTAS operation, this table will be dropped in a later step. Use Parquet as the format.
Look at the Poll for Job Status example state machine for inspiration on how to wait for the CTAS operation to complete.
When the CTAS operation has completed list the partitions created in the temporary table created with Glue GetPartitions and create the same partitions in the query table with BatchCreatePartitions.
Finally delete all files that belong to the partitions of the query table you deleted and drop the temporary table created by the CTAS operation.
If you decide on a partitioning on something longer than date you can still use the process above, but you also need to delete partitions in the query table and the corresponding data on S3, because each update will replace existing data (e.g. with partitioning by month, which I would recommend you try, every day you would create new files for the whole month, which means that the old files need to be removed). If you want to update your query table multiple times per day it would be the same.
This looks like a lot, and looks like what Glue Crawlers and Glue ETL does – but in my experience they don't make it this easy.
In your case the data is partitioned using Hive style partitioning, which Glue Crawlers understand, but in many cases you don't get Hive style partitions but just Y/M/D (and I didn't actually know that Firehose could deliver data this way, I thought it only did Y/M/D). A Glue Crawler will also do a lot of extra work every time it runs because it can't know where data has been added, but you know that the only partition that has been added since yesterday is the one for yesterday, so crawling is reduced to a one-step-deal.
Glue ETL is also makes things very hard, and it's an expensive service compared to Lambda and Step Functions. All you want to do is to convert your raw data form JSON to Parquet and re-partition it. As far as I know it's not possible to do that with less code than an Athena CTAS query. Even if you could make the conversion operation with Glue ETL in less code, you'd still have to write a lot of code to replace partitions in your destination table – because that's something that Glue ETL and Spark simply doesn't support.
Athena CTAS wasn't really made to do ETL, and I think the method I've outlined above is much more complex than it should be, but I'm confident that it's less complex than trying to do the same thing (i.e. continuously update and potentially replace partitions in a table based on the data in another table without rebuilding the whole table every time).
What you get with this ETL process is that your ingestion doesn't have to worry about partitioning more than by time, but you still get tables that are optimised for querying.

Tracking changes in Django with PostgreSQL

I have a Django project with PostgreSQL as database.
There are few tables that describe state (let's call them "state tables")
There are several servers that can modify state (Each one modifies its own table)
There is few servers that read the state (let's call them "readers") and modifies internal stuff based on current state of the tables.
What I'd like to do is to give the readers ability to know what row in state tables was changed, so that it won't have to scan all the tables all the time.
Currently I have a special tracking table and a post_save() trigger on all state tables. The post_save trigger saves the table name and the ID.
Initially the plan was to define sequence ID on the tracking table and to check whether "last known tracking ID" is the largest. If it's not - I would scan all of the tracked entries and know what states were changed.
However, it seems that PostgreSQL's indexes are not promised to be sequential. I don't mind the gaps between them, but I do rely on tracking record N+1 to have ID bigger than record N.
Any advice?

Amazon Redshift schema design

We are looking at Amazon Redshift to implement our Data Warehouse and I would like some suggestions on how to properly design Schemas in Redshift, please.
I am completely new to Redshift. In the past when I worked with "traditional" data warehouses, I was used to creating schemas such as "Source", "Stage", "Final", etc. to group all the database objects according to what stage the data was in.
By default, a database in Redshift has a single schema, which is named PUBLIC. So, my question to those who have worked with Redshift, does the approach that I have outlined above apply here? If not, I would love some suggestions.
Thanks.
With my experience in working with Redshift, I can assert the following points with confidence:
Multiple schema: You should create multiple schema and create tables accordingly. When you'll scale, it'll be easier for you to pin-point where exactly the table is supposed to be. Let us say, you have 3 schema, named production, aggregates and rough. Now, you know that the table production will contain the tables that are not supposed to be changed (mostly OLTP data) - such as user, order, transactions tables. Table aggregates will have aggregated data built over raw tables - such as number of orders placed per user per day per category. Finally, rough will contain any table that doesn't hold a business logic but is required for some temporary work - let us say to check the genre of movies for a list of 1 lakh users, which is shared with you in an excel file. Simply create a table in rough schema, perform your operations and drop the table. Now you very clearly know where you'll find the tables based on whether they are raw, aggregated or simply temporary tables.
Public schema: Forget it exists. Any table that is not preceded with a schema name, gets created there. A lot of clutter - no point in storing any important data there.
Cross schema joins: There's no stopping here. You may join as many tables from as many schema as required. In fact, it is desirable you create dimension tables and join on a PK later, rather than to keep all the information in a single table.
Spend some quality time in designing the schema and underlying table structure. When you expand, it'll be easier for you to classify things better in terms of access control. Do let me know if I've missed some obvious points.
You can have multiple databases in a Redshift cluster but I would stick with one. You are correct that schemas (essentially namespaces) are a good way to divide things up. You can query across schemas but not databases.
I would avoid using the public schema as managing certain permissions there can be difficult (easier to deny someone access to public than prevent them from being able to create a table for example).
For best results if you have the time, learn about the permissions system up front. You want to create groups that have access to schemas or tables and add/remove users from groups to control what they can do. Once you have that going it becomes pretty easy to manage.
In addition to the other responses, here are some suggestions for improving schema performance.
First: Automatic compression encodings using COPY command
Improve the performance of Amazon Redshift using the COPY command. It will get data into Redshift database. The COPY command is clever enough. It automatically chooses the most appropriate encoding settings for the data it uploads. You don’t have to think about it. However, it does so only for the first data upload into an empty table.
So, make sure to use a significant data set while uploading data for the first time, which Redshift can assess to set the column encodings in the best way. Uploading a few lines of test data will confuse Redshift to know how best to optimize the compression to handle the real workload.
Second: Use Best Distribution Style and Key
Distribution-style decides how data is distributed across the nodes. Applying a distribution style at table level tells Redshift how you want to distribute the table and the key. So, how you specify distribution style is important for good query performance with Redshift. The style you choose may affect requirements for data storage and cluster. It also affects the time taken by the COPY command to execute.
I recommend setting the distribution style to all tables with a smaller dimension. For large dimension, distribute both the dimension and associated fact on their join column. To optimize the second large dimension, take the storage-hit and distribute ALL. You can even design the dimension columns into the fact.
Third: Use the Best Sort Key
A Redshift database maintains data in a table with an arrangement of a sort-key-column if specified. Since it’s sorted in each partition; each cluster node upholds its partition in predefined order. (While designing your Redshift schema, also consider the impact on your budget. Redshift is priced by amount of stored data and by the number of nodes.)
Sort key optimizes Amazon Redshift performance significantly. You can do it in many ways. First, use data filtering. If where-clause filters on a sort-key-column, it skips the entire data blocks. It’s because Redshift saves data in blocks. Each block header records the minimum and maximum sort key value. Filter outside of that range, the entire block may get skipped.
Alternatively, when joining two tables, sorted on their joint keys, the data is read in matching order. Also, you can merge-join without separate sort-steps. Joining large dimension to a large fact table will be easy with this method because neither will fit into a hash table.

DynamoDb table design: Single table or multiple tables

I’m quite new to NoSQL and DynamoDB and I used to RDBMS. I’m designing database for a game and we're using DynamoDB and AWS Lambda for our backend. I created a table name “Users” for player profile that contains the user information and resources. Because the game has inventory system I also created a table name “UserItems”.
It’s all good until I realized DynamoDB don’t have transaction and any operation that is executed on both table (for example using an item that increase resource) has a chance of failure on one table while success on other and will cause missing data which affect our customers.
So I was thinking maybe my multiple tables design is not good since it’s a habit of me to design multiple table when I’m working with RDBMS. Which let me to think of storing the entire “UserItems” as hash in “Users” but I’m not sure this is a good practice because the size of a single row in Users table will be really big (we may have 500 unique items per users) and each time I pull or put data from/to “Users” (most of the time don’t need “UserItems” data) the read/write throughput will be also really large.
What should I do, keep the multiple tables design and handle transaction manually or switch to single table design? Or maybe there is a 3rd option?
Updated: more information about my use case
Currently I have 2 tables
Users: UserId (key), Username, Gold
UserItems: UserId (partition key), ItemId (sort key), Name, GoldValue
Scenarios:
User buy an item: Users.Gold will be deduced, new UserItem will be add to UserItems table.
User sell an item: Users.Gold will be increased, the Item will be deleted from UserItems table.
In both scenarios above I will have to do 2 update operation for 2 tables which without transaction there is a chance one of them failed.
To solve that I consider using single table solution which is a single Users table with 4 columns UserId(key), Username, Gold, UserItems. However there are two things I'm worried about:
Data in UserItems might be come to big for a single cell because one user could have up to 500 items.
To add/delete item I have to pull the UserItems from dynamodb, add/delete item and then put it back into Users. So I have to do 1 read and 1 write operation for 1 action. And because of issue (1) the read/write data size could become really big.
FWIW, the AWS documentation on NoSQL Design for DynamoDB suggests to use a single table:
As a general rule, you should maintain as few tables as possible in a
DynamoDB application. As emphasized earlier, most well designed
applications require only one table, unless there is a specific reason
for using multiple tables.
Exceptions are cases where high-volume time series data are involved,
or datasets that have very different access patterns—but these are
exceptions. A single table with inverted indexes can usually enable
simple queries to create and retrieve the complex hierarchical data
structures required by your application.
NoSql database is best suited for non-trasactional data. If you bring normalization(splitting your data into multiple tables) into noSQL, then you are beating the whole purpose of it. If performance is what matters most, then you should consider only having a single table for your use case. DynamoDB supports Range Keys, and also supports Secondary Indices. For your usecase, it would be better to redesign your table to use Range Keys.
If you can share more details about your current table, maybe i can help you with more inputs.