I want to back up my Datomic DB. I am familiar with the steps described at https://docs.datomic.com/on-prem/backup.html but since the data size is huge (in TBs), I only want to back up a small preview of the entire DB, such as one or two attribute values from every entity.
Also, if I already had that backup at some location, such as S3, would it be possible to copy only a certain part of it so that it becomes a small preview of the entire DB as described above?
We are considering using AWS Neptune as a graph DB solution.
I am coming from the Django world, so I am used to relying on DB migrations a lot.
I could not find any information about how AWS Neptune handles change management on the DB.
For example, what happens if I want to reload a backup from a month ago and there have been schema changes since then? How do we track these changes?
Should we write custom scripts?
Unlike an RDBMS and some other data stores, Amazon Neptune, like many other graph databases, is "schemaless", meaning there is no need to explicitly define or maintain a schema. The schema is implicitly defined by the data stored in the database. In the case you mentioned, restoring a backup, there is no need to run a migration or change script. When you restore the backup, the schema will be defined by the restored data.
This "schemaless" nature of the database allows applications to begin adding new entity types and data properties without any sort of ETL process. However, it also means that the application needs to manage some sort of schema internally to maintain sanity over the data being stored (e.g. first_name and firstName could both be used and would be separate properties).
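As a rough illustration of managing a schema inside the application, the sketch below normalizes property names before anything is written to the graph. The `SCHEMA` and `ALIASES` tables and the `normalize_properties` function are hypothetical names made up for this example; they are not part of Neptune or of any graph client library.

```python
# Hypothetical application-side schema: allowed property keys per vertex label.
SCHEMA = {
    "person": {"first_name", "last_name", "email"},
}

# Hypothetical aliases that should be folded into one canonical key.
ALIASES = {"firstName": "first_name", "lastName": "last_name"}


def normalize_properties(label, properties):
    """Return properties with canonical keys, rejecting keys outside the schema."""
    allowed = SCHEMA.get(label)
    if allowed is None:
        raise ValueError(f"unknown label: {label!r}")
    normalized = {}
    for key, value in properties.items():
        canonical = ALIASES.get(key, key)
        if canonical not in allowed:
            raise ValueError(f"unexpected property {key!r} for label {label!r}")
        normalized[canonical] = value
    return normalized
```

Running every write through a check like this keeps first_name and firstName from drifting apart, and the schema table itself can be versioned alongside the application code.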
I have looked into this post on S3 vs database, but I have a different use case and want to know whether S3 is enough. The primary reason for using S3 instead of a cloud database is cost.
I have multiple scrapers that download data from websites and APIs every day. Most of them return data in JSON format. Currently, I insert the data into MongoDB. I then run analyses by querying the data for a specific date, for specific fields, or for records that match certain criteria. After querying, I usually load the results into a dataframe and do what is needed.
The data will not be updated. It needs to be stored and ready for retrieval according to some criteria. I am aware of S3 Select, which may be able to handle the retrieval task.
Any recommendations?
From the use cases you have mentioned, it seems that you are not using MongoDB's capabilities (or those of any database, for that matter) to any great degree.
I think S3 suits your use cases well. In fact, you should consider S3 Infrequent Access with a lifecycle policy that archives the data and eventually purges it, to be cost efficient.
I hope this helps!
I think your code will be more efficient if you use DynamoDB with all its features. Using S3 as a database or data store will make your code more complex, since you need to retrieve a file from S3 every time and then iterate through it. With DynamoDB you can easily query and filter the data that is required. In the end, S3 is file storage and DynamoDB is a database.
I use Ionic to create a mobile app that can take photos and upload images from the phone to S3. I wonder how to add a prefix or tag to each uploaded image so that I can query for it quickly and uniquely. I am thinking about using a prefix that creates a folder structure:
year/month/day/filename (e.g. 2018/11/27/image.png)
If there are a lot of images in the 2018/11/27/ folder, I think queries will be slow, and sometimes the image filename is not unique. Any suggestions? Thanks a lot.
Amazon S3 is an excellent storage service, but it is not a database.
You can store objects in Amazon S3 with whatever name you wish, but if you wish to list/sort/find objects quickly you should store the name of the object, together with its metadata, in a database. Then you can query the database to find the object of interest.
DynamoDB would be a good choice because it can be configured for guaranteed speed. You could also put DAX in front of DynamoDB for even greater performance.
With information about the objects stored in a database, you can quite frankly name each individual object anything you wish. Many people just use a UUID since it just needs to be a unique identifier. The object name itself does not need to convey any meaning - it is simply a Key to identify the object when it needs to be accessed later.
If, however, objects are typically processed in groups (such as having daily files grouped together into months for processing with Hadoop clusters), then locating objects in a particular path is useful. It allows the objects to be processed together without having to consult the database.
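To make the naming advice concrete, here is a minimal sketch that combines the date prefix from the question with a UUID, so keys are unique and can still be grouped by day. The function name `make_object_key` is hypothetical, and the sketch only builds the key string; uploading the object and recording its metadata in a database are left out.

```python
import uuid
from datetime import date


def make_object_key(original_filename, day):
    """Build a key like 2018/11/27/<32 hex chars>.png for an uploaded image."""
    # Keep the original extension (lowercased) so the file type stays visible.
    extension = ""
    if "." in original_filename:
        extension = "." + original_filename.rsplit(".", 1)[1].lower()
    # uuid4 is random, so two uploads with the same filename get different keys.
    return f"{day:%Y/%m/%d}/{uuid.uuid4().hex}{extension}"


key = make_object_key("image.png", date(2018, 11, 27))
```

The original filename, the uploading user, and any tags would then go into the database row that points at this key, which is where lookups happen.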
I have an order table in the OLTP system.
Each order record has an OrderStatus field.
When an end user creates an order, the OrderStatus field is set to "Open".
When somebody cancels the order, the OrderStatus field is set to "Canceled".
When an order process is finished (transformed into an invoice), the OrderStatus field is set to "Close".
There are more than one hundred million records in the table in the OLTP system.
I want to design and populate a data warehouse and data marts on the HDFS layer.
In order to design the data marts, I need to import the whole order table into HDFS and then reflect changes to the table continuously.
First, I can import the whole table into HDFS in the initial load process by using Sqoop. It may take a long time, but I will only do this once.
When an order record is updated or a new order record is entered, I need to reflect the changes in HDFS. How can I achieve this in HDFS for such a big transaction table?
Thanks
One of the easier ways is to work with database triggers in your OLTP source DB: every time a change or an update happens, use the trigger to push an update event to your Hadoop environment.
On the other hand (this depends on the requirements for your data users) it might be enough to reload the whole data dump every night.
Also, if there is some kind of last changed timestamp, it might be a possible way to load only the newest data and do some kind of delta check.
This all depends on your data structure, your requirements and the resources you have at hand.
There are several other ways to do this, but they usually involve messaging, development and new servers, and I suppose in your case this infrastructure or those resources are not available.
EDIT
Since you have a last changed date, you might be able to pull the data with a statement like
SELECT columns FROM table WHERE lastchangedate > (now - 24 hours)
or whatever your interval for loading might be.
Then process the data with Sqoop, ETL tools or the like. If the records are already available in your Hadoop environment, you want to UPDATE them. If the records are not available, INSERT them with your appropriate mechanism. This is sometimes called UPSERTING.
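The delta-plus-upsert idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory dict stands in for whatever keyed store you use on the Hadoop side, and the names `select_changed`, `upsert`, `order_id` and `last_change_date` are hypothetical.

```python
from datetime import datetime, timedelta


def select_changed(rows, since):
    """Keep only the rows changed after the given point in time (the delta)."""
    return [row for row in rows if row["last_change_date"] > since]


def upsert(target, changed_rows):
    """Update rows already present in target, insert the ones that are not."""
    inserted = updated = 0
    for row in changed_rows:
        if row["order_id"] in target:
            updated += 1
        else:
            inserted += 1
        target[row["order_id"]] = row
    return inserted, updated


now = datetime(2024, 1, 2, 12, 0)
source = [
    {"order_id": 1, "status": "Close", "last_change_date": datetime(2024, 1, 2, 9, 0)},
    {"order_id": 2, "status": "Open", "last_change_date": datetime(2023, 12, 30, 8, 0)},
    {"order_id": 3, "status": "Open", "last_change_date": datetime(2024, 1, 2, 10, 0)},
]
# Copy already loaded earlier; order 1 was still "Open" at that time.
target = {1: {"order_id": 1, "status": "Open", "last_change_date": datetime(2023, 12, 1)}}

delta = select_changed(source, now - timedelta(hours=24))
counts = upsert(target, delta)
```

Only orders 1 and 3 fall inside the 24-hour window here, so order 1 is updated in place and order 3 is inserted, while the unchanged order 2 is never read again.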
I am using Microsoft Sync Framework 4.0 to sync SQL Server database tables with a SQLite database on the iPad side.
Before making any schema changes in the SQL Server database, we have to deprovision the database tables. After making the schema changes, we reprovision the tables.
In this process, the tracking tables (i.e. the sync information) get deleted.
I want the tracking table information to be restored after reprovisioning.
How can this be done? Is it possible to make DB changes without deprovisioning?
For example, the application is at version 2.0 and syncing is working fine. In the next version, 3.0, I want to make some DB changes. In the process of deprovisioning and provisioning, the tracking information gets deleted, so all the tracking information from the previous version is lost. I do not want to lose the tracking information. How can I restore it from the previous version?
I believe we will have to write custom code or a trigger to store the tracking information before deprovisioning. Could anyone suggest a suitable method or provide some useful links regarding this issue?
The provisioning process should automatically populate the tracking table for you. You don't have to copy and reload it yourself.
Now, if you think the tracking table is where the framework stores what was previously synced, the answer is no.
The tracking table simply stores what was inserted, updated or deleted. It is used for change enumeration. The information on what was previously synced is stored in the scope_info table.
When you deprovision, you wipe out this sync metadata. When you sync afterwards, it is as if the two replicas have never synced before. Thus you will encounter conflicts as the framework tries to apply rows that already exist on the destination.
You can find information here on how to "hack" the Sync Fx created objects to effect some types of schema changes.
Modifying Sync Framework Scope Definition – Part 1 – Introduction
Modifying Sync Framework Scope Definition – Part 2 – Workarounds
Modifying Sync Framework Scope Definition – Part 3 – Workarounds – Adding/Removing Columns
Modifying Sync Framework Scope Definition – Part 4 – Workarounds – Adding a Table to an existing scope
Let's say I have one table, "User", that I want to sync.
A tracking table "User_tracking" will be created, and some sync information will be present in it after syncing.
When I make any DB changes, the tracking table "User_tracking" will be deleted and the tracking information will be lost during the deprovisioning and provisioning process.
My workaround:
Before deprovisioning, I will write a script to copy all the "User_tracking" data into another temporary table, "User_tracking_1", so all the existing tracking information will be stored in "User_tracking_1". When I reprovision the table, a new tracking table "User_tracking" will be created.
After reprovisioning, I will copy the data from "User_tracking_1" to "User_tracking" and then delete the contents of "User_tracking_1".
The "User_tracking" information will be restored.
Is this the right approach?