Fetch desired ID contents from multiple servers - mapreduce

Imagine I have a distributed system with 500 servers. I have a main database server that stores some metadata, and each entry's primary key is the content ID. The actual content related to each content ID is spread across the 500 servers. But not every contentID's content is on the 500 servers yet; say only half of them are.
How could I find out which contentIDs have not been deployed to any of the 500 servers yet?
I'm thinking of solving this with a MapReduce-style approach, but I'm not sure what the process would look like.

Given the context in the question:
You can build a table in your database that maps each contentID to the instance(s) holding its content.
Whenever an instance has the data for a given contentID, it makes a call to register that contentID.
If your instances can crash and you need to remove their content, you can implement a health check that updates your database every 30 seconds to 1 minute.
Now, whenever you need to find the instanceID for a given contentID (and whether it has been loaded at all), you can look it up in that table and check whether the contentID has an instanceID whose health-check time is within the last minute.
Note: You could also consider using ZooKeeper or an in-memory datastore like Redis to store this data.
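A minimal sketch of this mapping-table approach (the table names content/content_instance and the helper functions are hypothetical; SQLite is used here only to keep the example self-contained, and a regular database, ZooKeeper, or Redis would work the same way):

# Hypothetical schema: "content" is the main metadata table, "content_instance"
# maps contentID -> instanceID with the time of the last health check.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE content (content_id TEXT PRIMARY KEY);
    CREATE TABLE content_instance (
        content_id  TEXT,
        instance_id TEXT,
        last_seen   REAL,                          -- unix time of last health check
        PRIMARY KEY (content_id, instance_id)
    );
""")

def register(content_id, instance_id):
    """Called by an instance whenever it stores the content for content_id."""
    db.execute(
        "INSERT OR REPLACE INTO content_instance VALUES (?, ?, ?)",
        (content_id, instance_id, time.time()),
    )

def undeployed_content_ids(max_age_seconds=60):
    """contentIDs with no instance that reported a health check recently."""
    cutoff = time.time() - max_age_seconds
    rows = db.execute("""
        SELECT c.content_id
        FROM content c
        LEFT JOIN content_instance ci
               ON ci.content_id = c.content_id AND ci.last_seen >= ?
        WHERE ci.instance_id IS NULL
    """, (cutoff,))
    return [r[0] for r in rows]

# Example usage:
db.executemany("INSERT INTO content VALUES (?)", [("c1",), ("c2",), ("c3",)])
register("c1", "server-042")
print(undeployed_content_ids())                    # -> ['c2', 'c3']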

Related

How to get the AWS RDS maximum connections from the AWS API?

Is there a way to get the maximum number of connections for an RDS database from the AWS API?
I know that you can get the current number of connections from the DatabaseConnections Cloudwatch metric, but I'm looking to get the maximum/limit of connections possible for the database.
I also know that you can get it from within the database. For example, in Postgres, you can run:
postgres=> show max_connections;
However, I would like to get the value from outside the database.
I read in this documentation about the max_connections DB instance parameter.
In most cases, the max_connections instance parameter is a value like this LEAST({DBInstanceClassMemory/9531392},5000) which depends on the DBInstanceClassMemory formula variable.
I read in this documentation that DBInstanceClassMemory can depend on several factors and is lower than the memory figures shown in the instance class tables.
Is there a way to get the DBInstanceClassMemory value from the API?
It looks like the AWS Console is able to get the value from outside of the database. See the red line in the graph below:
Edit: I found the JavaScript function that calculates the maximum number of connections in the AWS Console. It's called getEstimatedDefaultMaxConnections and it basically just divides the instance class' memory (from the instance class table) by the default memory-per-maximum-connections value (i.e. the default formula listed in the documentation). So, it ignores the fact that DBInstanceClassMemory will be less than the instance class' memory and it also ignores any changes you make to the max_connections DB instance parameter.
Is there a way for me to get that value using the API or to calculate it based on the DBInstanceClassMemory value (if it is available via the API)?
I ended up calculating an estimate of the maximum number of connections by fetching the max_connections DB parameter of the database, parsing it, and evaluating it.
To get an estimate of the DBInstanceClassMemory value, I first fetched all of the available instance types using describe-instance-types and saved them to a file. I then set DBInstanceClassMemory to 90% of the instance class's memory to account for the memory lost to the OS and RDS processes.
Then I:
Iterated through all of my RDS instances using DescribeDBInstances,
Fetched the DB parameters for each database using DescribeDBParameters and filtered for the max_connections parameter, and
Parsed the max_connections formula and evaluated it using my estimated DBInstanceClassMemory for the database.
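A rough sketch of this approach with boto3. The helper names and the 90% memory heuristic are my own choices; describe_db_instances, describe_db_parameters and describe_instance_types are the actual API calls, but double-check the formula parsing against your own parameter groups before relying on the numbers:

import re
import boto3

rds = boto3.client("rds")
ec2 = boto3.client("ec2")

def instance_class_memory_bytes(db_instance_class):
    """Look up the instance class memory, e.g. 'db.r5.large' -> r5.large memory, minus ~10%."""
    ec2_type = db_instance_class.replace("db.", "", 1)
    info = ec2.describe_instance_types(InstanceTypes=[ec2_type])
    mib = info["InstanceTypes"][0]["MemoryInfo"]["SizeInMiB"]
    return int(mib * 1024 * 1024 * 0.9)            # assume ~10% lost to OS/RDS processes

def max_connections_formula(parameter_group_name):
    """Fetch the raw max_connections parameter, e.g. 'LEAST({DBInstanceClassMemory/9531392},5000)'."""
    paginator = rds.get_paginator("describe_db_parameters")
    for page in paginator.paginate(DBParameterGroupName=parameter_group_name):
        for param in page["Parameters"]:
            if param["ParameterName"] == "max_connections":
                return param.get("ParameterValue")
    return None

def estimate_max_connections(formula, memory_bytes):
    """Evaluate the default LEAST({DBInstanceClassMemory/divisor},cap) pattern."""
    if formula is None:
        return None
    if formula.isdigit():                          # overridden with a literal value
        return int(formula)
    m = re.match(r"LEAST\(\{DBInstanceClassMemory/(\d+)\},(\d+)\)", formula)
    if not m:
        return None                                # unexpected format; handle manually
    divisor, cap = int(m.group(1)), int(m.group(2))
    return min(memory_bytes // divisor, cap)

for instance in rds.describe_db_instances()["DBInstances"]:
    memory = instance_class_memory_bytes(instance["DBInstanceClass"])
    formula = max_connections_formula(instance["DBParameterGroups"][0]["DBParameterGroupName"])
    print(instance["DBInstanceIdentifier"], estimate_max_connections(formula, memory))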

Elasticache / Redis Timestamp of new data

Does ElastiCache store the time when data is added to the cache? I want to filter data in my cache based on the time it was added, but I can't find a clear answer on whether this information is stored in ElastiCache automatically or whether I have to add the timestamp manually for each item inserted in the cache.
Thanks!
Neither Redis nor ElastiCache's Redis-compatible service stores the timestamp automatically.
This would be inefficient, as many use cases don't require it, so it's a client application implementation detail.
You can use a sorted set to store this information so that you can query by date range, and you can get the Redis server time automatically if you use a Lua script. See How to store in Redis sorted set with server-side timestamp as score?.
This is particularly important if you have multiple nodes connecting, as they may have clock differences.
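A small sketch of that approach with redis-py, where a Lua script takes the timestamp from the Redis server clock rather than the client's (the key and member names are made up for the example):

import redis

r = redis.Redis(host="localhost", port=6379)

# TIME returns [seconds, microseconds]; combine them into a fractional score.
# redis.replicate_commands() is needed on older Redis versions (< 5) to allow
# a write after a non-deterministic command like TIME.
ADD_WITH_SERVER_TIME = r.register_script("""
    redis.replicate_commands()
    local t = redis.call('TIME')
    local score = t[1] .. '.' .. string.format('%06d', t[2])
    return redis.call('ZADD', KEYS[1], score, ARGV[1])
""")

# Add an item; the insertion time is recorded as the member's score.
ADD_WITH_SERVER_TIME(keys=["items:by-insert-time"], args=["item:42"])

# Later, fetch everything added in a given time window (unix timestamps).
print(r.zrangebyscore("items:by-insert-time", 1700000000, 1800000000))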

DynamoDB local db limits - use for initial beta-go-live

Given DynamoDB's pricing, the thought came to mind to use DynamoDB Local on an EC2 instance for the go-live of our startup SaaS solution. I've been trying to find something like a data sheet for the local DB, specifying limits on the number of tables, the number of records, or the general size of the database file. Possibly, we could even run a few local DB instances on dedicated EC2 servers, as we know at login which user needs to be connected to which DB.
Does anybody have any information on the local DB limits or on this approach? Also, does anybody know of any legal/licensing issues with using DynamoDB Local in that way?
Every item in DynamoDB Local will end up as a row in the SQLite database file. So the limits are based on SQLite's limitations.
The maximum number of rows in a table is 2^64, but the database file size limit (140 terabytes) will likely be reached first.
Note: because of the above, the number of items you can store in DynamoDB Local will be smaller with the preview version of DynamoDB Local with Streams support. This is because, to support Streams, the update records for items are also stored. For example, if you are only doing inserts, each item will effectively be stored twice: once in a table containing the item data and once in a table containing the INSERT UpdateRecord for that item (more records are generated if the item is updated over time).
Be aware that DynamoDB Local was not designed for the same performance, availability, and durability as the production service.
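If you want to sanity-check the "items end up as SQLite rows" claim on your own data, a quick read-only peek into the shared-local-instance.db file that DynamoDB Local writes (when started with -sharedDb) looks roughly like this; the internal layout is an implementation detail, so treat it as inspection only:

import sqlite3

# Adjust the path to wherever you pointed -dbPath.
db = sqlite3.connect("shared-local-instance.db")

# DynamoDB Local keeps one SQLite table per DynamoDB table, plus internal bookkeeping tables.
for (name,) in db.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    (count,) = db.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()
    print(f"{name}: {count} rows")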

API Gateway generating 11 sql queries per second on REG_LOG

We have sysdig running on our WSO2 API Gateway machine, and we notice that it fires a large number of SQL queries at the database for a minute, then waits a minute and repeats.
Every minute it goes wild with requests of the following format, then waits a minute and goes wild again:
SELECT REG_PATH, REG_USER_ID, REG_LOGGED_TIME, REG_ACTION, REG_ACTION_DATA
FROM REG_LOG
WHERE REG_LOGGED_TIME>'2016-02-29 09:57:54'
AND REG_LOGGED_TIME<'2016-03-02 11:43:59.959' AND REG_TENANT_ID=-1234
There is no load on the server. What is causing this? What can we do to avoid this?
[Screenshot: sysdig view of the API Gateway process]
This particular query is the result of the registry indexing task that runs in the background. The REG_LOG table is queried periodically to retrieve the latest registry actions. The indexing task cannot be stopped; however, one can configure its frequency through the following parameter in registry.xml. See [1] for more information.
indexingFrequencyInSeconds
If this table fills up, one can clean the data using a simple SQL query. However, when deleting the records, one must be careful not to delete all the data. The latest record of each resource path should be left in the REG_LOG table, since reindexing requires at least one reference to each resource path (see the sketch below).
Also, if required, before clearing up the REG_LOG table, you can take a dump of the data in case you do not want to lose old records. Hope this answer provides the information you require.
[1] - https://docs.wso2.com/display/Governance510/Configuration+for+Indexing
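As a hedged illustration of such a cleanup, assuming a MySQL registry database and the mysql-connector-python driver (take a dump first and verify the column names against your own schema before running anything like this), it could look roughly like the following. It keeps only the most recent REG_LOG entry per resource path:

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="wso2carbon", password="wso2carbon", database="regdb"
)
cur = conn.cursor()

# Find the latest logged time for every resource path (the rows we must keep).
cur.execute("""
    SELECT REG_PATH, MAX(REG_LOGGED_TIME)
    FROM REG_LOG
    GROUP BY REG_PATH
""")
latest_per_path = cur.fetchall()

# Delete everything older than the latest record for each path.
cur.executemany(
    "DELETE FROM REG_LOG WHERE REG_PATH = %s AND REG_LOGGED_TIME < %s",
    latest_per_path,
)
conn.commit()
print(f"Deleted {cur.rowcount} old REG_LOG rows")
conn.close()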

Django transaction management among different http requests

I have been working on a Django back end whose main use case is the ability to store a given set of pictures with their related tags.
The current design foresees dedicated RESTful APIs for creating a new set, adding a picture to a given set, and associating tags with a given set: this results in distinct client calls.
For instance:
BEGIN the "create new set" transaction
create a new set and receive the set ID
upload the first picture of the set
upload the second picture of the set (and so on, depending on the total number of pictures...)
Add the tags related to this newly added set
END the transaction
How can I commit/roll back such a transaction, knowing that it is split across different HTTP requests?
Am I having a design issue here? Shall I favor a single cumulative HTTP request approach?
Please take into account that such a back-end is to be used with mobile devices which might suffer from temporary signal loss.
Any advice is welcome.
UPDATE:
Would it be convenient to use a model versioning package such as django-revisions to solve the issue?
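To make the flow in the question concrete, here is a minimal sketch with hypothetical model and view names (PictureSet, Picture, create_set, add_picture, add_tags). Each view runs in its own transaction.atomic block, which is exactly why a single database transaction cannot span the whole create/upload/tag sequence: every HTTP request commits or rolls back on its own.

from django.db import models, transaction
from django.http import JsonResponse
from django.views.decorators.http import require_POST

class PictureSet(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    tags = models.TextField(blank=True)            # simplified; could be a M2M to a Tag model

class Picture(models.Model):
    picture_set = models.ForeignKey(PictureSet, on_delete=models.CASCADE)
    image = models.ImageField(upload_to="pictures/")

@require_POST
def create_set(request):
    with transaction.atomic():                     # commits as soon as this request ends
        picture_set = PictureSet.objects.create()
    return JsonResponse({"set_id": picture_set.pk})

@require_POST
def add_picture(request, set_id):
    with transaction.atomic():
        picture_set = PictureSet.objects.get(pk=set_id)
        Picture.objects.create(picture_set=picture_set, image=request.FILES["image"])
    return JsonResponse({"status": "ok"})

@require_POST
def add_tags(request, set_id):
    with transaction.atomic():
        PictureSet.objects.filter(pk=set_id).update(tags=request.POST.get("tags", ""))
    return JsonResponse({"status": "ok"})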