Testing neo4j with indexes - unit-testing

I want to test my neo4j project with nosql unit. This works fine as long as I don't need a lucene index. Is there a way to create a test database with an index?
I think graphml offers no possibility for indexes, so I try to use the auto-index like this:
#Before
public void startAutoIndex(){
AutoIndexer<Node> nodeAutoIndexer = graphDb.index().getNodeAutoIndexer();
nodeAutoIndexer.startAutoIndexingProperty( "id" );
nodeAutoIndexer.startAutoIndexingProperty( "refname" );
nodeAutoIndexer.setEnabled(true);
}
this doesn't work for me.
Is there another way to implement the auto-index?
Best regards
Jan

generally , two ways.
either you use the geoff xml export format
or use your graphml, but set up autoindexing on the server side using the conf/server.properties file. there, set up these rows:
node_auto_indexing=true
node_keys_indexable=id,refname
restart the db and do the graphml import (assuming the imported nodes have id and refname as their properties - in case you need a general id of the neo4j db and not your unique one, there is no need to specify the id as an index.).

Related

Great Expectations - Run Validation over specific subset of a PostgreSQL table

I am fairly new to Great Expectations - and have a question. Essentially I have a PostgreSQL database, and every time I run my data pipeline, i want to validate a specific subset of the PostgreSQL table based off some key. Eg: If the data pipeline is run every day, the would be a field called current_batch. And the validation would occur for the below query:
SELECT * FROM jobs WHERE current_batch = <input_batch>.
I am unsure the best way to complete this. I am a using v3-api of great expectations and am a bit confused as to whether to use a checkpoint, or a validator. I assume I want to use a checkpoint but I can't seem to figure out how to create a checkpoint, but then only validate a specific subset of the PostgreSQL datasource.
Any help or guidance would be much appreciated.
Thanks,
I completely understand your confusion because I am working with GE too and the documentation is not really clear.
First of all "Validators" are now called "Checkpoints", so they are not a different entity, as you can read here.
I am working on an Oracle database and the only way I found to apply a query before testing my data with expectations is to put the query inside the checkpoint.
To create a checkpoint you should run the great_expectations checkpoint new command from your terminal. After creating it, you should add the "query" field inside the .yml file that is your checkpoint.
Below you can see a snippet of a checkpoint I am working with. When I want to validate my data, I run the command great_expectations checkpoint run check1
name: check1
module_name: great_expectations.checkpoint
class_name: LegacyCheckpoint
batches:
- batch_kwargs:
table: pso
schema: test
query: SELECT p AS c,
[ ... ]
AND lsr = c)
datasource: my_database
data_asset_name: test.pso
expectation_suite_names:
- exp_suite1
Hope this helps! Feel free to ask if you have any doubts :)
I managed this using Views (in Postgres). Before running GE, I create (or replace the existing) view as a query with all necessary joins, filtering, aggregations, etc. And then specify the name of this view in GE checkpoints.
Yes, it is not the ideal solution. I would rather use a query in checkpoints too. But as a workaround, it covers all my cases.
Let's have view like this:
CREATE OR REPLACE VIEW table_to_check_1_today AS
SELECT * FROM initial_table
WHERE dt = current_date;
And checkpoint be configured something like this:
name: my_task.my_check
config_version: 1.0
validations:
- expectation_suite_name: my_task.my_suite
batch_request:
datasource_name: my_datasource
data_connector_name: default_inferred_data_connector_name
data_asset_name: table_to_check_1_today
Yes, a view can be created using the "current_date" - and the checkpoint can simply run the view. However, this would mean that the variable (current_date) is stored in the database - which may not be desirable; you might want to run the query in the checkpoint for a different date - which could be coming from a environment variable or elsewhere - to the CLI or python/notebook
Yet to find a solution where we can substitute a string in the checkpoint query; using a config variable from the file is a very static way - there may be different checkpoints running for different dates.

Doctrine Migration from string to Entity

I've an apparently simple task to perform, i have to convert several tables column from a string to a new entity (integer FOREIGN KEY) value.
I have DB 10 tables with a column called "app_version" which atm are VARCHAR columns type. Since i'm going to have a little project refactor i'd like to convert those VARCHAR columns to a new column which contains an ID representing the newly mapped value so:
V1 -> ID: 1
V2 -> ID: 2
and so on
I've prepared a Doctrine Migration (i'm using symfony 3.4) which performs the conversion by DROPPING the old column and adding the new id column for the AppVersion table.
Of course i need to preserve my current existing data.
I know about preUp and postUp but i can't figure how to do it w/o hitting the DB performance too much. I can collect the data via SELECT in the preUp, store them in some PHP vars to use later on inside postUp to write new values to DB but since i have 10 tables with many rows this become a disaster real fast.
Do you guys have any suggestion i could apply to make this smooth and easy?
Please do not ask why i have to do this refactor now and i didn't setup the DB correctly in the first time. :D
Keywords for ideas: transaction? bulk query? avoid php vars storage? write sql file? everything can be good
I feel dumb but the solution was much more simple, i created a custom migration with all the "ALTER TABLE [table_name] DROP app_version" to be executed AFTER one that simply does:
UPDATE [table_name] SET app_version_id = 1 WHERE app_version = "V1"

Simple data update with talend

I have a job in talend which migrates some data from one database to another... At the end of data migration, I should update the date of last extraction in the source database with SYSDATE so it could be used as a criteria for the next extraction. The SQL query would be something like :
UPDATE MIGR_FOLLOWUP SET LAST_EXTR = SYSDATE WHERE SYSTEM = 'TARGET3'
I'd like to do that update in talend, and I guess it should be a component triggered by OnSubjobOK, but I just can't seem to understand how to do this in a simple manner... The only way I could possibly think of is using both tOracleInput and tOracleOutput components, in order to first extract the wanted row and then update it, but it really doesn't sound like a good manner to do this...
Can anyone point me out on how to do this?
Thanks!!
You can run arbitrary SQL by using the database row components such as tOracleRow.
If you linked this with an on subjob ok link from your main migration then once your main migration completes successfully it would update the LAST_EXTR field with the current time.
Alternatively you could update this using a tOracleOutput component but you would need to have Talend define the date-time stamp using something like Talend.getCurrentDate in a tMap or tFixedFlowInput.

Verify the structure of a database? (SQLite in C++ / Qt)

I was wondering what the "best" way to verify the structure of my database is with SQLite in Qt / C++. I'm using SQLite so there is a file which contains my database, and I want to make sure that, when launching the program, the database is structured the way it should be- i.e., it has X tables each with their own Y columns, appropriately named, etc. Could someone point my in the right direction? Thanks so much!
You can get a list of all the tables in the database with this query:
select tbl_name from sqlite_master;
And then for each table returned, run this query to get column information
pragma table_info(my_table);
For the pragma, each row of the result set will contain: a column index, the column name, the column's type affinity, whether the column may be NULL, and the column's default value.
(I'm assuming here that you know how to run SQL queries against your database in the SQLite C interface.)
If you have QT and thus QtSql at hand, you can also use the QSqlDatabase::tables() (API doc) method to get the tables and QSqlDatabase::record(tablename) to get the field names. It can also give you the primary key(s), but for further details you will have to follow pkh's advice to use the table_info pragma.

How can I persist a single value in Django?

My Django application retrieves an RSS feed every day. I would like to persist the time the feed was last updated somewhere in the app. I'm only retrieving one feed, it will never grow to be multiple feeds. How can I persist the last updated time?
My ideas so far
Create a model and add a datetime field to it. This seems like overkill as it adds another table to the database, in which there will only ever be one row. Other than that, it's the most obvious and straight-forward solution.
Create a settings object which just stores key/value mappings. The last updated date would just be row in this database. This is essentially a generic version of the previous solution.
Use dbsettings/django-values, which allows you to store settings in the database. The last updated date would just be a 'setting'.
Any other ideas that I'm missing?
In spite of the fact databases regularly store many rows in any given table, having a table with only one row is not especially costly, so long as you don't have (m)any indexes, which would waste space. In fact most databases create many single row tables to implement some features, like monotonic sequences used for generating primary keys. I encourage you to create a regular model for this.
RAM is volatile, thus not persistent: memcached is not what you asked for.
XML it is not the right technology to store a single value.
RDMS is not the right technology to store a single value.
Django cache framework will answer your question if CACHE_BACKEND is set to anything else than file://...
The filesystem is the right technology to "persist a single value".
In settings.py:
RSS_FETCH_DATETIME_PATH=os.path.join(
os.path.abspath(os.path.dirname(__file__)),
'rss_fetch_datetime'
)
In your rss fetch script:
from django.conf import settings
handler = open(RSS_FETCH_DATETIME_PATH, 'w+')
handler.write(int(time.time()))
handler.close()
Wherever you need to read it:
from django.conf import settings
handler = open(RSS_FETCH_DATETIME_PATH, 'r+')
timestamp = int(handler.read())
handler.close()
But cron is the right tool if you want to "run a command every day", for example at 5AM:
0 5 * * * /path/to/manage.py runscript /path/to/retreive/script
Of course, you can still write the last update timestamp in a file at the end of the retreive script, and use it somewhere else, if that makes sense to you.
Concluding by quoting Ken Thompson:
One of my most productive days was
throwing away 1000 lines of code.
One solution I've used in the past is to use Django's cache feature. You set a value to True with an expiration time of one day (in your case.) If the value is not set, you fetch the feed, otherwise you don't do anything.
You can see my solution here: Importing your Flickr photos with Django
If you need it only for caching purposes, why not store it in the memcached?
On the other hand, if you use this data for other purposes (e.g. display it on the page, or to make some calculation, etc.), then I would store it in a new model - in Django, all persistence is built on top of the database, via models, and I would not try to use other "clever" solutions.
One thing I used to do when I was deving with PHP, was to store the xml somewhere, but with a new tag inserted to hold the timestamp of the latest retrieval. It wasn't great, but it was quick and simple.
Keeping it simple would lead to the idea of just storing it in the file system ... why can't you do that? You could, for example, have a siteconfig module in one of your apps which held these sorts of data. This could load up data from a specific file, which could be text, JSON, ConfigParser, pickle or any suitable format. Just import siteconfig somewhere, and it can load the data and make it available to the other modules in your site. You could easily extend this to hold a dict-like object with a number of settings (e.g., if you ever have multiple feeds, but don't want to have a model just for 2-3 rows, you could easily hold the last-retrieved time for each feed in a dict keyed by feed URL).
Create a session key, which persists forever and update the feed timestamp every time you access it.