How can I ensure a read-only transaction in SQLite? - C++

I have a public interface which allows people to interact with the database by typing in SQL commands. However, I do not want them to change the database in any way (and, if possible, not access certain tables). As I understand it, though, SQLite has no concept of users, so how do I accomplish this?

If the query contains no application-defined SQL functions that indirectly modify the database (e.g. SELECT eval('DELETE FROM t1') FROM t2;), then use sqlite3_stmt_readonly to determine whether the prepared SQL statement writes the database. Otherwise, you can open another, read-only database connection handle (SQLITE_OPEN_READONLY) and use it for read-only access.
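For example, a minimal sketch of that check with the SQLite C API (error handling trimmed; the helper name is my own):

#include <sqlite3.h>
#include <string>

// Returns true only if the statement compiles and does not write the database.
// Note: this cannot see writes hidden inside application-defined SQL functions.
bool is_readonly_statement(sqlite3* db, const std::string& sql) {
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql.c_str(), -1, &stmt, nullptr) != SQLITE_OK)
        return false;                       // refuse statements that do not compile
    bool readonly = sqlite3_stmt_readonly(stmt) != 0;
    sqlite3_finalize(stmt);
    return readonly;
}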

Copy the "master" database file first and open that :-) No, really, this is a serious suggestion.
Otherwise, depending on how SQLite is accessed, there is the SQLITE_OPEN_READONLY flag that can be passed to sqlite3_open_v2. This applies to the entire connection -- and to all transactions on that connection.
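For instance, a small sketch of opening such a connection (the function name is just for illustration):

#include <sqlite3.h>

// Open a handle that can never modify the database file; every statement and
// transaction on this connection is read-only.
sqlite3* open_readonly(const char* filename) {
    sqlite3* db = nullptr;
    if (sqlite3_open_v2(filename, &db, SQLITE_OPEN_READONLY, nullptr) != SQLITE_OK) {
        sqlite3_close(db);   // a handle may be allocated even on failure
        return nullptr;
    }
    return db;
}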
Another option is to limit the SQL entry, but this is very very hard to do correctly and thus I don't recommend this route.
Happy coding.

Related

Automate sequential integer IDs without using Identity Specification?

Are there any tried-and-true methods of managing your own sequential integer field w/o using SQL Server's built-in Identity Specification? I'm thinking this has to have been done many times over and my Google skills are just failing me tonight.
My first thought is to use a separate table to manage the IDs and use a trigger on the target table to manage setting the ID. Concurrency issues are obviously important, but insert performance is not critical in this case.
And here are some gotchas I know I need to look out for:
1. Need to make sure the same ID isn't doled out more than once when multiple processes run simultaneously.
2. Need to make sure any solution to (1) doesn't cause deadlocks.
3. Need to make sure the trigger works properly when multiple records are inserted in a single statement, not only for one record at a time.
4. Need to make sure the trigger only sets the ID when it is not already specified.
The reason for the last point (and the whole reason I want to do this without an Identity Specification field in the first place) is that I want to seed multiple environments at different starting points, and I want to be able to copy data between them so that the ID for a given record remains the same between environments (and I have to use integers; I cannot use GUIDs).
(Also, yes, I could set IDENTITY_INSERT on/off to copy data and still use a regular Identity Specification field, but then it reseeds after every insert. I could then use DBCC CHECKIDENT to reseed it back to where it was, but I feel the risk with this solution is too great. It only takes one mistake, and by the time we realize it, it would be a real pain to repair the data... probably enough pain that it would have made more sense just to do what I'm doing now in the first place).
SQL Server 2012 introduced the concept of a SEQUENCE database object - something like an "identity" column, but separate from a table.
You can create and use a sequence from your code, use its values in various places, and more.
See these links for more information:
Sequence numbers (MS Docs)
CREATE SEQUENCE statement (MS Docs)
SQL Server SEQUENCE basics (Red Gate - Joe Celko)

Why are direct writes to Amazon S3 eliminated in EMR 5.x versions?

After reading this page:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
"Operational Differences and Considerations" -> "Direct writes to Amazon S3 eliminated" section.
I wonder - does this mean that writing to S3 from Hive in EMR 4.x versions will be faster than in 5.x versions?
If so, isn't that a kind of regression? Why would AWS want to eliminate this optimization?
Writing to a Hive table which is located in S3 is a very common scenario.
Can someone clear up that issue?
This optimization originally was developed by Qubole and pushed to Apache Hive.
See here.
This feature is rather dangerous because it bypasses Hive's fault-tolerance mechanism and also forces developers to use normally unnecessary intermediate tables, which in turn degrades performance and increases cost.
A very common use case is merging incremental data into a partitioned target table, as described here. The query is an INSERT OVERWRITE of the table from itself; without an intermediate table (in a single query) it is rather efficient. The query can be much more complex, with many tables joined. This is what happens with direct writes enabled in this use case:
1. The partition folder is deleted before the query finishes, which causes a FileNotFoundException: a mapper reading the same table that is being written fails because the partition folder was deleted before the mapper executed.
2. If the target table is initially empty, the first run succeeds because Hive knows there are no partitions and does not read the folder. The second run fails because (see 1) the folder is deleted before the mappers finish.
3. The known workaround has a performance impact. Loading data incrementally is quite a common use case, and the direct-write-to-S3 feature forces developers to use a temporary table in this scenario to avoid the FileNotFoundException and table corruption. As a result we end up doing the task much more slowly and at a higher cost than if the feature were disabled and we wrote the target table from itself.
4. After the first failure, a successful restart is impossible: the table is neither selectable nor writable, because the Hive partition exists in the metadata but the folder does not, and this causes a FileNotFoundException in other queries against this table, even ones that do not overwrite it.
The same thing is described with less detail on the Amazon page you are referring to: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
Another possible issue is described on the Qubole page; an existing fix using prefixes is mentioned, though it does not work with the use case above, because writing new files into a folder that is being read will still create a problem.
Also, mappers and reducers may fail and restart, and the whole session may fail and restart; writing files directly, even with postponed deletion of the old ones, does not seem like a good idea because it increases the chance of unrecoverable failure or data corruption.
To disable direct writes, set this configuration property:
set hive.allow.move.on.s3=true; --this disables direct write
You can use this feature for small tasks and when you are not reading the same table that is being written, though for small tasks it will not gain you much. This optimization is most efficient when you are rewriting many partitions in a very big table and the move task at the end is extremely slow; then you may want to enable it at the risk of data corruption.

Should I store a list in memory or in a database and should I build a class to connect to DB?

I am writing a C++ program. I have a class that provides services for the rest of the classes in the program.
I am now writing the classes and the UML.
1) The class I am referring to has a task list that changes over time, and conditions are checked against this list. I am thinking of keeping it in a table in a database, where every row in the table would represent a task; this way, if the program crashes or stops working, I can restore the last state. The other option is to keep the task list in memory and keep a copy in the database.
The task list needs to be searched every second.
Which approach is recommended?
2) In order to write to and read from the database, I can either call the database directly from the class or build a database communication class. If I write a data communication class, I need to provide specific operations and build a mini-server for it,
e.g. write a row to the database, read a row from the database, update only the first column, etc.
What is the recommended approach for this?
Thanks.
First, if the database is obvious and easy, and there are no performance problems, just do that. You're talking about running a query once per second, and maybe marking a task done or adding a new one every so often; even SQLite on a slow SMB share should be able to handle that just fine.
If you do need to optimize it, then there are two approaches: Either still with the database and cache it in-memory, or use memory as your primary storage and come up with a persistence mechanism that uses the database. But until you need to optimize it, don't.
Next, how should you do it? Your question makes it sound like you're thinking in terms of a whole three-tier system, with a "mini-server" sitting between the database server and your task list. There's really no need for that. What you want is a bespoke ORM, but that makes it sound more complicated than it is. All you're doing is writing a class that wraps a database connection and provides a handful of methods—get_due, mark_done, add, get_next_id—each of which maps SQL parameters to Task members. For example (with no error handling):
void mark_done(const Task& task) {
    // 'db' is the wrapped connection; '?' is a parameter bound by the hypothetical execute() helper.
    db.execute("UPDATE Task SET done = 1 WHERE id = ?", task.id);
}
Three more methods like that, plus a constructor to connect to the database (including creating the Task table if it didn't already exist), and your class is done.
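If SQLite happens to be the store (only an example; the same shape works for any database API), a minimal sketch of such a wrapper against the SQLite C API might look like the following - the Task struct, table schema, and method set are assumptions for illustration:

#include <sqlite3.h>
#include <stdexcept>
#include <string>

struct Task {
    long long id;
    std::string description;
    bool done;
};

class TaskStore {
public:
    explicit TaskStore(const std::string& path) {
        if (sqlite3_open(path.c_str(), &db_) != SQLITE_OK) {
            std::string msg = sqlite3_errmsg(db_);
            sqlite3_close(db_);
            throw std::runtime_error(msg);
        }
        // Create the table on first use so the constructor fully "connects".
        exec("CREATE TABLE IF NOT EXISTS Task ("
             "id INTEGER PRIMARY KEY, description TEXT, done INTEGER)");
    }
    ~TaskStore() { sqlite3_close(db_); }

    void mark_done(const Task& task) {
        sqlite3_stmt* stmt = nullptr;
        sqlite3_prepare_v2(db_, "UPDATE Task SET done = 1 WHERE id = ?", -1, &stmt, nullptr);
        sqlite3_bind_int64(stmt, 1, task.id);
        sqlite3_step(stmt);            // no error handling, as in the snippet above
        sqlite3_finalize(stmt);
    }

private:
    void exec(const char* sql) {
        char* err = nullptr;
        if (sqlite3_exec(db_, sql, nullptr, nullptr, &err) != SQLITE_OK) {
            std::string msg = err ? err : "sqlite3_exec failed";
            sqlite3_free(err);
            throw std::runtime_error(msg);
        }
    }
    sqlite3* db_ = nullptr;
};

The other methods (get_due, add, get_next_id) follow the same prepare/bind/step pattern against the same connection.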
The reason you don't want to write the database stuff directly into Task is that you don't really have anywhere to store shared information like the database connection object; either you need globals (or class attributes, which are effectively globals), or you need copies in every single Task instance (or, really, weak references—which you're going to fake with either a reference or a raw pointer, either way leading to shutdown problems somewhere down the line).
Finally, your whole reason for doing this is error recovery, and databases do a great job of journaling so nothing ever gets inconsistent, but you do have to make sure to structure your app to take advantage of that. For example, you may want to mark all the now-due tasks "in process", then process them, then mark them all "done"; that way, at recovery time, you know exactly which tasks may or may not have been done, and can act appropriately. The more steps you can commit to the database, the less data loss you have to deal with—but of course the more code you have to write, and the slower it gets. So, do as much as necessary, but no more.
Saving information in a database just to recover from crashes may be a bit of an overkill.
Ideally you want to serialize the list and save it - as binary, XML, or CSV values. This can be done on a timer or on certain events in your application.
A database may also be used if you can come up with a structure that maps naturally to tables - so that you can do a one-to-one mapping between the objects and the rows and write SQL queries easily. But keep that in a separate layer for abstraction.
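As an illustration of the serialization idea, here is a minimal sketch that snapshots the list to a naive CSV file (the Task fields are assumptions, and descriptions are assumed to contain no commas or newlines):

#include <cstdio>    // std::rename
#include <fstream>
#include <string>
#include <vector>

struct Task {                      // illustrative only; use your real task type
    long long id;
    std::string description;       // assumed free of commas and newlines
    bool done;
};

// Write the whole list to <path>.tmp and then swap it in, so a crash mid-write
// never leaves a half-written snapshot behind (the rename is atomic on POSIX;
// on Windows you may need to remove the old file first).
void save_tasks(const std::vector<Task>& tasks, const std::string& path) {
    const std::string tmp = path + ".tmp";
    std::ofstream out(tmp, std::ios::trunc);
    for (const Task& t : tasks)
        out << t.id << ',' << t.done << ',' << t.description << '\n';
    out.close();
    std::rename(tmp.c_str(), path.c_str());
}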

Updating a field in all records in elasticsearch

I'm new to ElasticSearch, so this is probably something quite trivial, but I haven't figured out anything better than fetching everything, processing it with a script, and updating the records one by one.
I want to make something like a simple SQL update:
UPDATE RECORD SET SOMEFIELD = SOMEXPRESSION
My intent is to replace the current bogus data with some data that makes more sense (so the expression is basically choosing randomly from a pool of valid values).
There are a couple of open issues about making it possible to update documents by query.
The technical challenge is that Lucene (the text search engine library that Elasticsearch uses under the hood) segments are read-only. You can never modify an existing document. What you need to do is delete the old version of the document (which, by the way, will only be marked as deleted until a segment merge happens) and index the new one. That's what the existing update API does. Therefore, an update by query might take a long time and lead to issues, which is why it's not released yet. A mechanism that allows running queries to be interrupted would be a nice-to-have for this case, too.
But there's the update by query plugin that exposes exactly that feature. Just beware of the potential risks before using it.

RDBMS name and version

I am writing software in C++ that accesses various RDBMSs through the Qt library.
The software is for PC only.
The software needs to know the name and version of the RDBMS, because the program has to choose which SQL queries to execute. Is there any way to retrieve this data on any RDBMS?
Is there any way to retrieve this data on any RDBMS?
This is specific to each RDBMS - for example, in Oracle, you can do
select *
from v$version;
but in other RDBMS this view does not exist.
Some possible approaches:
You need to define connection parameters somewhere anyway. Even those are often RDBMS-specific, so you could simply add another parameter like SQLFlavor=Oracle or SQLFlavor=MySQL and then use the value of SQLFlavor in your code to determine which SQL statement to use.
You can use some heuristics to find out the RDBMS. For example, query the v$version view - if it does not exist, you will get an error, so you know it is not Oracle and you can continue with the next try (such as SELECT VERSION() to see if it is MySQL). Otherwise, you can use the result to find out the concrete Oracle version. A sketch of this probing approach follows below.
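A minimal sketch of that probing heuristic with Qt's QSqlQuery (the function name and the probe list are my own; extend it for the RDBMSs you actually need to support):

#include <QSqlDatabase>
#include <QSqlQuery>
#include <QString>

// Try RDBMS-specific version queries until one succeeds; each probe that fails
// simply returns false from exec(), so we fall through to the next candidate.
QString detectFlavor(const QSqlDatabase& db) {
    QSqlQuery query(db);
    if (query.exec("select banner from v$version") && query.next())
        return QString("Oracle: ") + query.value(0).toString();
    if (query.exec("select version()") && query.next())
        return QString("MySQL/PostgreSQL: ") + query.value(0).toString();
    if (query.exec("select @@version") && query.next())
        return QString("SQL Server/Sybase: ") + query.value(0).toString();
    return "unknown";
}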