Sqlite DB File is too huge....?

I've been creating a dictionary file that contains 85,000 records with SQLite and Qt. The resulting SQLite file is huge: 134 MB. Another dictionary, MDic DB, was also created with SQLite, holds the same data and the same number of records, and is only 10 MB!
query.exec("PRAGMA journal_mode = MEMORY");
query.exec("PRAGMA synchronous = OFF");
Dic_entry DictionaryEntry = DictionaryInstance.readEntry();
QString definition, headword, displayedHeadWord;
query.exec("BEGIN Transaction");
int Count = 0;
while(!DictionaryEntry.headword.empty())
{
    definition = QString::fromStdString(DictionaryEntry.definition);
    definition.replace("'", "''");
    headword = QString::fromStdString(DictionaryEntry.headword);
    headword.replace("'", "''");
    displayedHeadWord = QString::fromStdString(DictionaryEntry.displayedHeadword);
    displayedHeadWord.replace("'", "''");
    string strQuery = "INSERT INTO Dictionary_Words([Definition], [HeadWord], [DisplayedHeadWord]) "
        "values('"
        + definition.toStdString() + "', '"
        + headword.toStdString() + "', '"
        + displayedHeadWord.toStdString()
        + "')";
    query.exec(QString::fromStdString(strQuery));
    if(Count == 200)
    {
        query.exec("COMMIT TRANSACTION");
        Count = 0;
    }
    Count++;
    DictionaryEntry = DictionaryInstance.readEntry();
}
query.exec("End Transaction");
query.exec("CREATE INDEX HW_idx ON [Dictionary_Words](HeadWord)");
query.exec("CREATE INDEX Def_idx ON [Dictionary_Words](Definition)");
query.exec("CREATE INDEX DHW_idx ON [Dictionary_Words](DisplayedHeadword)");
query.exec("PRAGMA auto_vacuum=FULL");
db.close();
How can I reduce the size of my SQLite DB file?

I have no way of proving it, but I suspect the indexes are the cause of the trouble. Indexes can take up a huge amount of space, and you've got three of them. Try it without them and see if the access is still acceptably fast; the database size should be much smaller that way.
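As a rough sanity check on the figures in the question, the sketch below computes the average number of bytes stored per record in each file. The function name bytesPerRecord is made up for illustration, and 1 MB is assumed to mean 1024 * 1024 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Average bytes stored per record for a file of sizeMb megabytes.
// Assumption: 1 MB = 1024 * 1024 bytes.
double bytesPerRecord(double sizeMb, std::uint64_t records) {
    assert(records > 0);  // avoid dividing by zero
    return sizeMb * 1024.0 * 1024.0 / static_cast<double>(records);
}
```

For 85,000 records, bytesPerRecord(134.0, 85000) comes to about 1,653 bytes per record, against about 123 bytes for the 10 MB file. An SQLite index stores its own copy of the indexed column values, so Def_idx holds a second copy of every definition string and is likely the largest of the three indexes.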

Related

Does QT have reflections for C++?

I want to create a SQL table in Qt C++, so I wrote the code below.
It creates the table for me. The first argument, tableName, is the name of the table I want to create. The next argument is trickier.
The columns argument specifies each column's name and its data type. I think this is a bad way to do it. Example:
QVector<QPair<QString, QString>> myColumns{{"ID", "BIGINT"}, {"logging_id", "INT"}};
If I have, for example, 50 columns, myColumns is going to be very long.
My question is whether Qt C++ has some kind of reflection, so that I can:
Get the name of every field
Get the data type of every field
If a field is an array, find out how many elements it contains
I was planning to write an entity class and use it to get the information needed to create each column in the database.
void Database::createTable(QString tableName, const QVector<QPair<QString, QString>> columns){
    QSqlQuery query;
    for (int i = 0; i < columns.length(); i++){
        /* Get the QPair values */
        QString columnName = columns.at(i).first;
        QString dataType = columns.at(i).second;
        /* If key is ID, then try to create a new table */
        if(columnName == "ID"){
            query.exec("CREATE TABLE " + tableName + "(" + columnName + " " + dataType + " NOT NULL AUTO_INCREMENT PRIMARY KEY)");
            continue;
        }
        /* If not, then try to append new columns */
        query.exec("ALTER TABLE " + tableName + " ADD " + columnName + " " + dataType);
    }
}
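Independently of the reflection question, one way to avoid a separate ALTER TABLE per column is to build a single CREATE TABLE statement from the column list. The sketch below uses std::string and std::vector instead of QString and QVector so that it compiles without Qt. The function name buildCreateTable is hypothetical, and the NOT NULL AUTO_INCREMENT PRIMARY KEY suffix for the ID column is copied from the code above. It does not quote or validate identifiers, so it assumes the names come from trusted code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: builds one CREATE TABLE statement from (name, type) pairs.
// Identifiers are neither quoted nor validated; assumes they come from trusted code.
std::string buildCreateTable(
    const std::string& tableName,
    const std::vector<std::pair<std::string, std::string>>& columns) {
    assert(!columns.empty());  // a table needs at least one column
    std::string sql = "CREATE TABLE " + tableName + " (";
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (i > 0) sql += ", ";
        sql += columns[i].first + " " + columns[i].second;
        // Same rule as the original code: the ID column becomes the primary key.
        if (columns[i].first == "ID") sql += " NOT NULL AUTO_INCREMENT PRIMARY KEY";
    }
    sql += ")";
    return sql;
}
```

The resulting string could then be passed to query.exec() after converting it with QString::fromStdString.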

Can't check if a table exist in QT MYSQL

I'm trying to check whether a table exists in a schema using QMYSQL in the Qt framework.
I have connected to the MySQL server and can create a table, but I cannot check whether a table exists.
This is the code for checking whether a table exists:
query.exec("CREATE TABLE " + table_name + "(ID BIGINT PRIMARY KEY)");
QStringList tables = this->qSqlDatabase.tables();
qDebug() << "Table name: " + table_name;
for(int i = 0; i < tables.length(); i++)
    qDebug() << tables[i];
qDebug() << tables.length();
if(tables.contains(table_name))
The if-statement does not run and the output is:
"Table name: table0"
0
In this case table_name = "table0". Why is this happening?

Try this line:
query.exec("CREATE TABLE " + table_name + " (ID BIGINT, PRIMARY KEY (ID));");

How to do multiple parallel readers for data export using Google Spanner?

External Backups/Snapshots for Google Cloud Spanner recommends using queries with timestamp bounds to create snapshots for export. At the bottom of the Timestamp Bounds documentation, it states:
Cloud Spanner continuously garbage collects deleted and overwritten data in the background to reclaim storage space. This process is known as version GC. By default, version GC reclaims versions after they are one hour old. Because of this, Cloud Spanner cannot perform reads at a read timestamp more than one hour in the past.
So any export would need to complete within an hour. A single reader (i.e. select * from table; using timestamp X) would not be able to export the entire table within an hour.
How can multiple parallel readers be implemented in spanner?
Note: It is mentioned in one of the comments that support for Apache Beam is coming, but it looks like that uses a single reader:
/** A simplest read function implementation. Parallelism support is coming. */
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/NaiveSpannerReadFn.java#L26
Is there a way to do the parallel reader that Beam requires today using existing APIs? Or will Beam need to use something that isn't released yet on Google Spanner?

It is possible to read data in parallel from Cloud Spanner with the BatchClient class. See read_data_in_parallel for more information.
If you are looking to export data from Cloud Spanner, I'd recommend using Cloud Dataflow (see the integration details here), as it provides higher-level abstractions and takes care of data processing details such as scaling and failure handling.

Edit 2018-03-30 - The example project has been updated to use the BatchClient offered by Google Cloud Spanner
After the release of the BatchClient for reading/downloading large amounts of data, the example project below has been updated to use the new batch client instead of the standard database client. The basic idea behind the project is still the same: Copy data to/from Cloud Spanner and any other database using standard jdbc functionality. The following code snippet sets the jdbc connection in batch read mode:
if (source.isWrapperFor(ICloudSpannerConnection.class))
{
    ICloudSpannerConnection con = source.unwrap(ICloudSpannerConnection.class);
    // Make sure no transaction is running
    if (!con.isBatchReadOnly())
    {
        if (con.getAutoCommit())
        {
            con.setAutoCommit(false);
        }
        else
        {
            con.commit();
        }
        con.setBatchReadOnly(true);
    }
}
When the connection is in 'batch read only mode', it uses the BatchClient of Google Cloud Spanner instead of the standard database client. When one of the Statement#execute(String) or PreparedStatement#execute() methods is called (as these allow multiple result sets to be returned), the JDBC driver creates a partitioned query instead of a normal query. The result of this partitioned query is a number of result sets (one per partition) that can be fetched with the Statement#getResultSet() and Statement#getMoreResults(int) methods.
Statement statement = source.createStatement();
boolean hasResults = statement.execute(select);
int workerNumber = 0;
while (hasResults)
{
    ResultSet rs = statement.getResultSet();
    PartitionWorker worker = new PartitionWorker("PartionWorker-" + workerNumber, config, rs, tableSpec, table, insertCols);
    workers.add(worker);
    hasResults = statement.getMoreResults(Statement.KEEP_CURRENT_RESULT);
    workerNumber++;
}
The result sets that are returned by the Statement#execute(String) are not executed directly, but only after the first call to ResultSet#next(). Passing these result sets to separate worker threads ensures parallel download and copying of the data.
Original answer:
This project was initially created for conversion in the other direction (from a local database to Cloud Spanner), but as it uses JDBC for both source and destination it can also be used the other way around: Converting a Cloud Spanner database to a local PostgreSQL database. Large tables are converted in parallel using a thread pool.
The project uses this open source JDBC driver instead of the JDBC driver supplied by Google. The source Cloud Spanner JDBC connection is set to read-only mode and autocommit=false. This ensures that the connection automatically creates a read-only transaction using the current time as timestamp the first time you execute a query. All subsequent queries within the same (read-only) transaction will use the same timestamp giving you a consistent snapshot of your Google Cloud Spanner database.
It works as follows:
Set the source database to read-only transactional mode.
The convert(String catalog, String schema) method iterates over all tables in the source database (Cloud Spanner).
For each table the number of records is determined, and depending on the size of the table, the table is copied using either the main thread of the application or by a worker pool.
The class UploadWorker is responsible for the parallel copying. Each worker is assigned a range of records from the table (for example rows 1 to 2,400). The range is selected by a select statement in this format: 'SELECT * FROM $TABLE ORDER BY $PK_COLUMNS LIMIT $BATCH_SIZE OFFSET $CURRENT_OFFSET'
Commit the read-only transaction on the source database after ALL tables have been converted.
Below is a code snippet of the most important parts.
public void convert(String catalog, String schema) throws SQLException
{
    int batchSize = config.getBatchSize();
    destination.setAutoCommit(false);
    // Set the source connection to transaction mode (no autocommit) and read-only
    source.setAutoCommit(false);
    source.setReadOnly(true);
    try (ResultSet tables = destination.getMetaData().getTables(catalog, schema, null, new String[] { "TABLE" }))
    {
        while (tables.next())
        {
            String tableSchema = tables.getString("TABLE_SCHEM");
            if (!config.getDestinationDatabaseType().isSystemSchema(tableSchema))
            {
                String table = tables.getString("TABLE_NAME");
                // Check whether the destination table is empty.
                int destinationRecordCount = getDestinationRecordCount(table);
                if (destinationRecordCount == 0 || config.getDataConvertMode() == ConvertMode.DropAndRecreate)
                {
                    if (destinationRecordCount > 0)
                    {
                        deleteAll(table);
                    }
                    int sourceRecordCount = getSourceRecordCount(getTableSpec(catalog, tableSchema, table));
                    if (sourceRecordCount > batchSize)
                    {
                        convertTableWithWorkers(catalog, tableSchema, table);
                    }
                    else
                    {
                        convertTable(catalog, tableSchema, table);
                    }
                }
                else
                {
                    if (config.getDataConvertMode() == ConvertMode.ThrowExceptionIfExists)
                        throw new IllegalStateException("Table " + table + " is not empty");
                    else if (config.getDataConvertMode() == ConvertMode.SkipExisting)
                        log.info("Skipping data copy for table " + table);
                }
            }
        }
    }
    source.commit();
}
private void convertTableWithWorkers(String catalog, String schema, String table) throws SQLException
{
    String tableSpec = getTableSpec(catalog, schema, table);
    Columns insertCols = getColumns(catalog, schema, table, false);
    Columns selectCols = getColumns(catalog, schema, table, true);
    if (insertCols.primaryKeyCols.isEmpty())
    {
        log.warning("Table " + tableSpec + " does not have a primary key. No data will be copied.");
        return;
    }
    log.info("About to copy data from table " + tableSpec);
    int batchSize = config.getBatchSize();
    int totalRecordCount = getSourceRecordCount(tableSpec);
    int numberOfWorkers = calculateNumberOfWorkers(totalRecordCount);
    int numberOfRecordsPerWorker = totalRecordCount / numberOfWorkers;
    if (totalRecordCount % numberOfWorkers > 0)
        numberOfRecordsPerWorker++;
    int currentOffset = 0;
    ExecutorService service = Executors.newFixedThreadPool(numberOfWorkers);
    for (int workerNumber = 0; workerNumber < numberOfWorkers; workerNumber++)
    {
        int workerRecordCount = Math.min(numberOfRecordsPerWorker, totalRecordCount - currentOffset);
        UploadWorker worker = new UploadWorker("UploadWorker-" + workerNumber, selectFormat, tableSpec, table,
                insertCols, selectCols, currentOffset, workerRecordCount, batchSize, source,
                config.getUrlDestination(), config.isUseJdbcBatching());
        service.submit(worker);
        currentOffset = currentOffset + numberOfRecordsPerWorker;
    }
    service.shutdown();
    try
    {
        service.awaitTermination(config.getUploadWorkerMaxWaitInMinutes(), TimeUnit.MINUTES);
    }
    catch (InterruptedException e)
    {
        log.severe("Error while waiting for workers to finish: " + e.getMessage());
        throw new RuntimeException(e);
    }
}
public class UploadWorker implements Runnable
{
    private static final Logger log = Logger.getLogger(UploadWorker.class.getName());
    private final String name;
    private String selectFormat;
    private String sourceTable;
    private String destinationTable;
    private Columns insertCols;
    private Columns selectCols;
    private int beginOffset;
    private int numberOfRecordsToCopy;
    private int batchSize;
    private Connection source;
    private String urlDestination;
    private boolean useJdbcBatching;

    UploadWorker(String name, String selectFormat, String sourceTable, String destinationTable, Columns insertCols,
            Columns selectCols, int beginOffset, int numberOfRecordsToCopy, int batchSize, Connection source,
            String urlDestination, boolean useJdbcBatching)
    {
        this.name = name;
        this.selectFormat = selectFormat;
        this.sourceTable = sourceTable;
        this.destinationTable = destinationTable;
        this.insertCols = insertCols;
        this.selectCols = selectCols;
        this.beginOffset = beginOffset;
        this.numberOfRecordsToCopy = numberOfRecordsToCopy;
        this.batchSize = batchSize;
        this.source = source;
        this.urlDestination = urlDestination;
        this.useJdbcBatching = useJdbcBatching;
    }

    @Override
    public void run()
    {
        // Connection source = DriverManager.getConnection(urlSource);
        try (Connection destination = DriverManager.getConnection(urlDestination))
        {
            log.info(name + ": " + sourceTable + ": Starting copying " + numberOfRecordsToCopy + " records");
            destination.setAutoCommit(false);
            String sql = "INSERT INTO " + destinationTable + " (" + insertCols.getColumnNames() + ") VALUES \n";
            sql = sql + "(" + insertCols.getColumnParameters() + ")";
            PreparedStatement statement = destination.prepareStatement(sql);
            int lastRecord = beginOffset + numberOfRecordsToCopy;
            int recordCount = 0;
            int currentOffset = beginOffset;
            while (true)
            {
                int limit = Math.min(batchSize, lastRecord - currentOffset);
                String select = selectFormat.replace("$COLUMNS", selectCols.getColumnNames());
                select = select.replace("$TABLE", sourceTable);
                select = select.replace("$PRIMARY_KEY", selectCols.getPrimaryKeyColumns());
                select = select.replace("$BATCH_SIZE", String.valueOf(limit));
                select = select.replace("$OFFSET", String.valueOf(currentOffset));
                try (ResultSet rs = source.createStatement().executeQuery(select))
                {
                    while (rs.next())
                    {
                        int index = 1;
                        for (Integer type : insertCols.columnTypes)
                        {
                            Object object = rs.getObject(index);
                            statement.setObject(index, object, type);
                            index++;
                        }
                        if (useJdbcBatching)
                            statement.addBatch();
                        else
                            statement.executeUpdate();
                        recordCount++;
                    }
                    if (useJdbcBatching)
                        statement.executeBatch();
                }
                destination.commit();
                log.info(name + ": " + sourceTable + ": Records copied so far: " + recordCount + " of "
                        + numberOfRecordsToCopy);
                currentOffset = currentOffset + batchSize;
                if (recordCount >= numberOfRecordsToCopy)
                    break;
            }
        }
        catch (SQLException e)
        {
            log.severe("Error during data copy: " + e.getMessage());
            throw new RuntimeException(e);
        }
        log.info(name + ": Finished copying");
    }
}

How to delete all the items from all tables of Amazon's dynamo db?

Just as we back up all the DynamoDB tables, I also want to clear all the tables in my test environment after testing, without deleting the tables themselves.
We wrote the backup service so that it does not need the schema structure or a Java object for the table schema, as shown below:
Map<String, AttributeValue> exclusiveStartKey = null;
do {
    // Let the rate limiter wait until our desired throughput "recharges"
    rateLimiter.acquire(permitsToConsume);
    ScanSpec scanSpec = new ScanSpec().withReturnConsumedCapacity(ReturnConsumedCapacity.TOTAL)
            .withMaxResultSize(25);
    if(exclusiveStartKey!=null){
        KeyAttribute haskKey = getExclusiveStartHashKey(exclusiveStartKey, keySchema);
        KeyAttribute rangeKey = getExclusiveStartRangeKey(exclusiveStartKey, keySchema);
        if(rangeKey!=null){
            scanSpec.withExclusiveStartKey(haskKey, rangeKey);
        }else{
            scanSpec.withExclusiveStartKey(haskKey);
        }
    }
    Table table = dynamoDBInstance.getTable(tableName);
    ItemCollection<ScanOutcome> response = table.scan(scanSpec);
    StringBuffer data = new StringBuffer();
    Iterator<Item> iterator = response.iterator();
    while (iterator.hasNext()) {
        Item item = iterator.next();
        data.append(item.toJSON());
        data.append("\n");
    }
    logger.debug("Data read from table: {} ", data.toString());
    if(response.getLastLowLevelResult()!=null){
        exclusiveStartKey = response.getLastLowLevelResult().getScanResult().getLastEvaluatedKey();
    }else{
        exclusiveStartKey = null;
    }
    // Account for the rest of the throughput we consumed,
    // now that we know how much that scan request cost
    if(response.getTotalConsumedCapacity()!=null){
        double consumedCapacity = response.getTotalConsumedCapacity().getCapacityUnits();
        if(logger.isDebugEnabled()){
            logger.debug("Consumed capacity : " + consumedCapacity);
        }
        permitsToConsume = (int)(consumedCapacity - 1.0);
        if(permitsToConsume <= 0) {
            permitsToConsume = 1;
        }
    }
} while (exclusiveStartKey != null);
Is it possible to delete all items without knowing the table schema? Can we do it using DeleteItemSpec?

PQexecParams in C++, query error

I'm using libpq with PostgreSQL version 9.1.11.
I have the following code
const char *spid = std::to_string(pid).c_str();
PGresult *res;
const char *paramValues[2] = {u->getID().c_str(), spid};
std::string table;
table = table.append("public.\"").append(Constants::USER_PATTERNS_TABLE).append("\"");
std::string param_name_pid = Constants::RELATION_TABLE_PATTERN_ID;
std::string param_name_uid = Constants::RELATION_TABLE_USER_ID;
std::string command = Constants::INSERT_COMMAND + table + " (" + param_name_uid + ", " + param_name_pid + ") VALUES ($1, $2::int)";
std::cout << "command: " << command << std::endl;
res = PQexecParams(conn, command.c_str(), 2, NULL, paramValues, NULL, NULL,0);
Where
INSERT_COMMAND = "INSERT INTO " (string)
USER_PATTERNS_TABLE = "User_Patterns" (string)
RELATION_TABLE_PATTERN_ID = "pattern_id" (string)
RELATION_TABLE_USER_ID = "user_id" (string)
pid = an int
u->getID() = a string
conn = the connection to the db
The table "User_Patterns" is defined as
CREATE TABLE "User_Patterns"(
user_id TEXT references public."User" (id) ON UPDATE CASCADE ON DELETE CASCADE
,pattern_id BIGSERIAL references public."Pattern" (id) ON UPDATE CASCADE
,CONSTRAINT user_patterns_pkey PRIMARY KEY (user_id,pattern_id) -- explicit pk
)WITH (
OIDS=FALSE
);
I already have a user and a pattern loaded into their respective tables.
The command generated is :
INSERT INTO public."User_Patterns" (user_id, pattern_id) VALUES ($1, $2::int)
I also tried with $2, $2::bigint, $2::int4
The problem is:
I receive the error :
ERROR: invalid input syntax for integer: "public.""
I already use PQexecParams to store users and patterns. The only difference is that those tables have only text/XML fields (the only int field on patterns is a serial one, and I don't store that value myself), but because User_Patterns is a relation table, I need to store an int for the pattern_id.
I have already read the libpq docs and looked at the examples; neither helped.

The problem is in the lines:
const char *spid = std::to_string(pid).c_str();
const char *paramValues[2] = {u->getID().c_str(), spid};
std::to_string(pid) creates a temporary string, and .c_str() returns a pointer to that string's internal buffer. The temporary is destroyed at the end of the full expression (that is, at the end of that line), which leaves spid as a dangling pointer. See also the answer to the question "stringstream::str copy lifetime".
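A minimal sketch of one way to avoid the dangling pointer: keep the std::string objects alive in named variables for as long as the char pointers are in use. The names InsertParams and makeParams are hypothetical, and userId stands in for the result of u->getID(). The PQexecParams call itself is left out so the block compiles without libpq.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <utility>

// Hypothetical holder: owns the strings so the pointers handed to libpq stay valid.
struct InsertParams {
    std::string userId;     // stands in for u->getID()
    std::string patternId;  // text form of pid

    // The returned pointers are valid only while this object is alive and unmodified.
    std::array<const char*, 2> values() const {
        return {userId.c_str(), patternId.c_str()};
    }
};

// Builds the parameter holder; std::to_string's result is moved into a member
// instead of being left as a temporary.
InsertParams makeParams(std::string userId, int pid) {
    assert(!userId.empty());
    return InsertParams{std::move(userId), std::to_string(pid)};
}
```

With this in place, the call could look like: InsertParams p = makeParams(u->getID(), pid); auto v = p.values(); res = PQexecParams(conn, command.c_str(), 2, NULL, v.data(), NULL, NULL, 0); with p kept in scope until the call returns. If u->getID() returns a std::string by value, the original u->getID().c_str() has the same lifetime problem as spid.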
