How can I feed data from Hibernate to the Weka Java API?

I am developing a data mining application with the Weka API, Java and MySQL DB connectivity. I want to feed data from the database to the algorithm. I used http://weka.wikispaces.com/Use+Weka+in+your+Java+code#Instances-Database.
Since I use Hibernate and the hibernate.cfg.xml file has the database connection information, can't I just write a normal method in the DAO class to retrieve data and then pass that to the algorithm?

The Weka API is, unfortunately, quite constrained in some places. In particular, you will need Instances objects: IIRC it is not an interface you could implement yourself, but a concrete class you have to instantiate.
Therefore, you will likely need to query your whole database and build Instances from the results. Using raw database access instead of Hibernate saves you from materializing the data twice, and thus from needing twice as much memory.
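For the raw-access route, Weka ships a helper that turns a SQL query straight into Instances: weka.experiment.InstanceQuery (this is what the linked wiki page describes). A minimal sketch; the JDBC URL, credentials, and query are placeholders, and it assumes the MySQL driver is on the classpath and DatabaseUtils.props is configured for your database:

import weka.core.Instances;
import weka.experiment.InstanceQuery;

InstanceQuery query = new InstanceQuery();
query.setDatabaseURL("jdbc:mysql://localhost:3306/yourdb"); // placeholder URL
query.setUsername("user");                                  // placeholder credentials
query.setPassword("password");
query.setQuery("SELECT attribute1, attribute2 FROM your_table"); // placeholder query
Instances data = query.retrieveInstances();
query.disconnectFromDatabase();

This loads the result set once and hands it to Weka directly, with no intermediate entity objects.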

I've recently done this with Hibernate, but there is no way a Hibernate entity can simply be handed to Weka. I did it this way:
1. Generate a table in the database that has the model information available exactly as you need it. (I did this because I would otherwise have needed very complex, time-consuming queries for every row; this way the heavy work is done once and afterwards I just read from a simple table.)
2. Create your POJO, DAO and so on.
3. Then just set up your Weka model.
Sample code (Weka 3.7):

import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

ArrayList<Attribute> atts = new ArrayList<Attribute>();
atts.add(new Attribute("attribute1"));
atts.add(new Attribute("attribute2"));
// a string attribute (the null list marks it as type string, not nominal)
atts.add(new Attribute("id", (ArrayList<String>) null));
Instances data = new Instances("yourData", atts, 0);

DAOModel dao = getYourDaoModelHereFromHibernateHoweverYouWantIt();
for (Model m : dao.findAll()) {
    double[] vals = new double[data.numAttributes()];
    vals[0] = m.getAttribute1();
    vals[1] = m.getAttribute2();
    // addStringValue stores the string and returns its index
    vals[2] = data.attribute(2).addStringValue(m.getId());
    data.add(new DenseInstance(1.0, vals));
}
data now has the proper format, and the algorithms can work with it (you could also save it to an .arff file if you want to work with the GUI).
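Saving to .arff is straightforward with weka.core.converters.ArffSaver; a minimal sketch (the output file name is just an example):

import java.io.File;
import weka.core.converters.ArffSaver;

ArffSaver saver = new ArffSaver();
saver.setInstances(data);                  // the Instances built above
saver.setFile(new File("yourData.arff"));  // example output path
saver.writeBatch();

You can then open the file in the Weka Explorer GUI.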

Related

Django: Specifying Dynamic Database at Runtime

I'm attempting to set up a system in Django whereby I specify the database connection to use at runtime. I feel I may need to go as low-level as possible, but I want to work within Django's idioms where possible, perhaps stretching them as far as they will go.
The general premise is that I have a centralised database that stores meta-information about datasets, but the actual datasets are created as dynamic models at runtime, in the database in question. I need to be able to specify which database to connect to at runtime in order to extract the data back out...
I have kind of the following idea:
from django.db import connections

db = {}
db['ENGINE'] = 'django.db.backends.postgresql'
db['OPTIONS'] = {'autocommit': True}
db['NAME'] = my_model_db['database']
db['PASSWORD'] = my_model_db['password']
db['USER'] = my_model_db['user']
db['HOST'] = my_model_db['host']
logger.info("Connecting to database {db} on {host}".format(db=db['NAME'], host=db['HOST']))
# register the connection under an alias, then route queries to it
connections.databases['my_model_dynamic_db'] = db
DynamicObj.objects.using('my_model_dynamic_db').all()
Has anyone achieved this? And how?

Use DynamoDBVersionAttribute when creating a new DynamoDB Table in Java

I'm trying to add a DynamoDBVersionAttribute to incorporate optimistic locking when accessing/updating items in a DynamoDB table. However, I'm unable to figure out how exactly to add the version attribute.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBMapper.OptimisticLocking.html seems to state that using it as an annotation in the class that creates the table is the way to go. However, our codebase is creating new tables in a format similar to this:
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
DynamoDB dynamoDB = new DynamoDB(client);

List<AttributeDefinition> attributeDefinitions = new ArrayList<AttributeDefinition>();
attributeDefinitions.add(new AttributeDefinition().withAttributeName("Id").withAttributeType("N"));

List<KeySchemaElement> keySchema = new ArrayList<KeySchemaElement>();
keySchema.add(new KeySchemaElement().withAttributeName("Id").withKeyType(KeyType.HASH));

CreateTableRequest request = new CreateTableRequest()
    .withTableName(tableName)
    .withKeySchema(keySchema)
    .withAttributeDefinitions(attributeDefinitions)
    .withProvisionedThroughput(new ProvisionedThroughput()
        .withReadCapacityUnits(5L)
        .withWriteCapacityUnits(6L));
Table table = dynamoDB.createTable(request);
I'm not able to figure out how to add the version attribute through the Java code above. It's not an attribute definition, so I'm unsure where it goes. Any guidance as to where I can add this VersionAttribute in the CreateTableRequest?
As far as I'm aware, the @DynamoDBVersionAttribute annotation for optimistic locking is only available for tables modeled specifically for DynamoDBMapper queries. Using DynamoDBMapper is not a terrible approach, since it effectively gives you an ORM for CRUD operations on DynamoDB items. (Also note that a version field is not part of the table schema at all: a CreateTableRequest only declares key attributes, so there is nothing to add at table-creation time.)
But if your existing codebase can't make use of it, your next best bet is probably to use conditional writes that increment a version number only if it equals what you expect it to be (i.e. roll your own optimistic locking). Unfortunately, you would need to add the increment and the condition to every write you want optimistically locked.
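A minimal sketch of such a conditional write, using the Document API from the question; the key value, attribute names, and the newValue/expectedVersion variables are illustrative assumptions:

import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.UpdateItemSpec;
import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;

try {
    // succeeds only if the stored version still matches what we read earlier
    table.updateItem(new UpdateItemSpec()
        .withPrimaryKey("Id", 42)
        .withUpdateExpression("SET #p = :newValue, #v = #v + :one")
        .withConditionExpression("#v = :expected")
        .withNameMap(new NameMap().with("#p", "payload").with("#v", "version"))
        .withValueMap(new ValueMap()
            .withString(":newValue", newValue)
            .withNumber(":one", 1)
            .withNumber(":expected", expectedVersion)));
} catch (ConditionalCheckFailedException e) {
    // someone else wrote first: re-read the item and retry (or give up)
}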
Your code just creates a table, but in order to use DynamoDBMapper to access that table, you need to create a class that represents it. For example, if your table is called Users, you would create a class called Users and use annotations to link it to the table.
You can keep your table-creation code, but you need to create the DynamoDBMapper class as well. You can then do all of your loading, saving and querying through DynamoDBMapper.
When you have created the class, just give it a field called version and put the annotation on it; DynamoDBMapper will take care of the rest.
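For illustration, a minimal mapper class; the table name and fields are hypothetical, modeled on the question's example:

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBVersionAttribute;

@DynamoDBTable(tableName = "Users")  // hypothetical table name
public class Users {
    private Long id;
    private Long version;

    @DynamoDBHashKey(attributeName = "Id")
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }

    @DynamoDBVersionAttribute  // DynamoDBMapper manages this field for you
    public Long getVersion() { return version; }
    public void setVersion(Long version) { this.version = version; }
}

With this in place, new DynamoDBMapper(client).save(user) throws a ConditionalCheckFailedException whenever the item was modified between your load and your save.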

Memory management of large data from oracle database

I am pulling large amounts of data out of an Oracle database with cx_Oracle, using the sample script below:
from cx_Oracle import connect

TABLEDATA = []
con = connect("user/password@host")
curs = con.cursor()
curs.execute("select * from TABLE where rownum < 100000")
for row in curs:
    TABLEDATA.append([str(col) for col in row])
curs.close()
con.close()
The problem with storing this in a list is that it ends up using about 800-900 MB of RAM.
I know I could save the data to a file instead of keeping it in a list, but I am using the list to display the table with a QTableView and a QAbstractTableModel.
Is there an alternative or more efficient way to minimise the memory usage of storing this data while still using it to display my table?
I have tried multiple possibilities. I don't think QSqlTableModel works for me: although it loads data directly from the database, it loads more and more rows as you scroll down, so memory usage keeps increasing.
What I think would ideally work is being able to load a set number of rows into the model. As you scroll down it loads new rows, but at the same time unloads what is already there, so at any point in time only a set number of rows is loaded.
If you don't want to store all the data in RAM, then you need a model for your table view that gets information from the database as needed. Fortunately, Qt natively supports this, and can connect to Oracle databases.
You will want to look into:
http://qt-project.org/doc/qt-4.8/sql-driver.html
http://qt-project.org/doc/qt-4.8/sql-model.html
http://qt-project.org/doc/qt-4.8/qsqltablemodel.html
http://qt-project.org/doc/qt-4.8/qsqldatabase.html
Note this is C++ documentation, but it is fairly easy to translate to PyQt (I always use the C++ documentation despite never coding in C++). You may also want to subclass QSqlTableModel to provide slightly different behaviour from the standard interface!

Container for in-memory representation of a DB table

Let's say I have a (MySQL) DB. I want to automate the update of this database via an application that will:
1. Import from DB
2. Calculate updated data
3. Export back updated data
The timing is important: I don't want to import while calculating; in fact I don't want any queries running then. I want to import the table(s) as a whole, then calculate. So my question is: if a row is represented by an instance of a class, what container do I put these objects into?
A vector? A set? Ordered or unordered? Should I just use whatever seems best for my case according to big-O times? Are there any special traps to fall into here? Or is this case no different from data "born in memory", so the only things to consider besides size overhead are "do I want lookup or insertion to be faster"?
Probably the best route is to use some ORM, but let's say I don't want to.
I've seen some apps use boost::unordered_set, and I wondered if there is a particular reason for its use...
I use a JDBC-like interface as the connector (libmysqlcpp).
I do not think the right container can be guessed from so little information. It mainly depends on the size and type of the data and on the algorithm you will run.
But my main concern with such a design is that it will quickly choke your network and your database. With a big table you will:
select all the data from the table
retrieve all the data over the network
process part of the data (some columns?) or all of it on your machine
push the data back over the network
update your rows (or delete/replace them, maybe)
Why not work directly on the MySQL server instead? Create a stored procedure or user-defined function that works on the data in place (see the sketch below), saving the network round-trips and taking advantage of the fact that MySQL is built to handle gigantic amounts of data, quantities that an in-memory container is not built to handle.
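To make the idea concrete, here is a sketch through a JDBC-style connector (the asker mentions a JDBC-like C++ library; the Java below is only illustrative, and the procedure name, SQL, and connection details are made up):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

// Assumes a server-side procedure, defined once, that does the calculation
// in place, e.g.:
//   CREATE PROCEDURE recalculate_all()
//   BEGIN
//     UPDATE my_table SET derived_col = base_col * 2;  -- your calculation
//   END
Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/mydb", "user", "password");
try (CallableStatement cs = con.prepareCall("{call recalculate_all()}")) {
    cs.execute();  // all reading and writing happens inside the server
}
con.close();

No row data crosses the network; the client only triggers the work.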

Django create/alter tables on demand

I've been looking for a way to define database tables and alter them via a Django API.
For example, I'd like to be able to write code that directly manipulates table DDL, letting me define tables or add columns to a table on demand, programmatically (without running a syncdb). I realize that django-south and django-evolution may come to mind, but I don't really think of those as tools meant to be integrated into an application and used by an end user; rather, they are utilities for upgrading your database tables. I'm looking for something where I can do something like:
class MyModel(models.Model):  # wouldn't run syncdb; instead do something like below
    a = models.CharField()
    b = models.CharField()

model = MyModel()
model.create()                          # runs the CREATE TABLE (instead of a syncdb)
model.add_column(c=models.CharField())  # marks a column to be added
model.alter()                           # applies the ALTER statement
model.del_column('a')                   # marks column 'a' for removal
model.alter()                           # applies the removal
This is just a toy example of how such an API might work, but the point is that I'd be very interested in finding out whether there is a way to programmatically create and change tables like this. This could be useful for content management systems, where one might want to dynamically create a new table. Another example would be a site that stores datasets of arbitrary width, for which tables need to be generated dynamically by the interface or by data imports. Does anyone know any good ways to dynamically create and alter tables like this?
(Granted, I know one can do direct SQL statements against the database, but that solution lacks the ability to treat the databases as objects)
Just curious as to if people have any suggestions or approaches to this...
You can try to interface with Django's code that manages changes in the database. It is a bit limited (no ALTER, for example, as far as I can see), but you may be able to extend it. Here's a snippet from django.core.management.commands.syncdb.
for app in models.get_apps():
    app_name = app.__name__.split('.')[-2]
    model_list = models.get_models(app)
    for model in model_list:
        # Create the model's database table, if it doesn't already exist.
        if verbosity >= 2:
            print "Processing %s.%s model" % (app_name, model._meta.object_name)
        if connection.introspection.table_name_converter(model._meta.db_table) in tables:
            continue
        sql, references = connection.creation.sql_create_model(model, self.style, seen_models)
        seen_models.add(model)
        created_models.add(model)
        for refto, refs in references.items():
            pending_references.setdefault(refto, []).extend(refs)
            if refto in seen_models:
                sql.extend(connection.creation.sql_for_pending_references(refto, self.style, pending_references))
        sql.extend(connection.creation.sql_for_pending_references(model, self.style, pending_references))
        if verbosity >= 1 and sql:
            print "Creating table %s" % model._meta.db_table
        for statement in sql:
            cursor.execute(statement)
        tables.append(connection.introspection.table_name_converter(model._meta.db_table))
Take a look at connection.creation.sql_create_model. The creation object is created in the database backend relevant to the database you are using in your settings.py. All of them are under django.db.backends.
If you must have ALTER TABLE, I think you can create your own custom backend that extends an existing one and adds this functionality. Then you can interface with it directly through an ExtendedModelManager you create.
Quickly, off the top of my head:
Create a custom Manager with the create/alter methods.