So I've got a task to build that will archive a large amount of data from our DB into JSON.
To give you a better idea of what is happening: X has hundreds of Ys, Y has hundreds of Zs, and so on. I'm creating a JSON file for every X, Y, and Z, but every X JSON file has an array of ids for the child Ys of X, and likewise the Ys store an array of ids for their child Zs.
It's more complicated than that in many cases, but that example should give you an idea of the complexity involved.
I was using ColdFusion, but it seems to be a bad choice for this task because it keeps crashing with memory errors. It seems to me that if queries that are no longer referenced were removed from memory while the task runs (i.e. garbage collected), the task should have enough memory; but as far as I can tell, ColdFusion isn't doing any garbage collection during the request, and must be doing it only after the request completes.
So I'm looking either for advice on how to better achieve my task in CF, or for recommendations on other languages to use.
Thanks.
1) If you have debugging enabled, ColdFusion will hold on to your queries until the page is done. Turn it off!
2) You may need to structDelete() the query variable to allow it to be garbage collected; otherwise it may persist as long as the scope that has a reference to it exists, e.g.:
<cfset structDelete(variables,'myQuery') />
3) A cfquery pulls the entire ResultSet into memory. Most of the time this is fine, but for reporting on a large result set you don't want this. Some JDBC drivers support setting the fetchSize, which, in a forward-only, read-only fashion, will let you fetch a few results at a time. This way you can deal with thousands and thousands of rows without swamping memory. I just generated a 1GB CSV file in ~80 seconds, using less than 100MB of heap. This requires dropping out to Java, but it kills two birds with one stone: it reduces the amount of data brought in at a time by the JDBC driver, and since you're working directly with the ResultSet, you don't hit the cfloop problem #orangepips mentioned. Granted, it's not for those without some Java chops.
You can do it with something like this (you need cfusion.jar in your build path):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import au.com.bytecode.opencsv.CSVWriter;
import coldfusion.server.ServiceFactory;
import coldfusion.sql.DataSource;

public class CSVExport {
    public static void export(String dsn, String query, String fileName) {
        Connection conn = null;
        Statement stmt = null;
        ResultSet rs = null;
        FileWriter fw = null;
        BufferedWriter bw = null;
        try {
            DataSource ds = ServiceFactory.getDataSourceService().getDatasource(dsn);
            conn = ds.getConnection();
            // we want a forward-only, read-only result.
            // you may need to use a PreparedStatement instead.
            stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY
            );
            // we only want to go forward!
            stmt.setFetchDirection(ResultSet.FETCH_FORWARD);
            // how many records to pull back at a time.
            // the hard part is balancing memory usage and round trips to the database.
            // basically sacrificing speed for a lower memory hit.
            stmt.setFetchSize(256);
            rs = stmt.executeQuery(query);
            // do something with the ResultSet, for example write to CSV using opencsv.
            // the key is to stream it; you don't want it stored in memory.
            // so Excel spreadsheets and PDF files are out, but text formats like
            // CSV, JSON, and HTML, and some binary formats like MDB (via Jackcess)
            // that support streaming are in.
            fw = new FileWriter(fileName);
            bw = new BufferedWriter(fw);
            CSVWriter writer = new CSVWriter(bw);
            writer.writeAll(rs, true);
        }
        catch (Exception e) {
            // handle your exception.
            // maybe try ServiceFactory.getLoggingService() if you want to do a cflog.
            e.printStackTrace();
        }
        finally {
            // close everything quietly; closing the BufferedWriter flushes and closes the FileWriter too.
            try { rs.close(); } catch (Exception e) {}
            try { stmt.close(); } catch (Exception e) {}
            try { conn.close(); } catch (Exception e) {}
            try { bw.close(); } catch (Exception e) {}
            try { fw.close(); } catch (Exception e) {}
        }
    }
}
Figuring out how to pass parameters, logging, turning this into a background process (hint: extend Thread) etc. are separate issues, but if you grok this code, it shouldn't be too difficult.
4) Perhaps look at Jackson for generating your JSON. It supports streaming, and combined with the fetchSize and a BufferedOutputStream, you should be able to keep the memory usage way down.
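For what that could look like, here is a minimal sketch of Jackson's streaming JsonGenerator (assuming Jackson 2.x; the file name, field names, and ids are made up for illustration, and in the real task they would come from the streamed ResultSet):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

public class JsonStreamSketch {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        // write straight to disk through a buffered stream; nothing is accumulated in memory
        try (JsonGenerator gen = factory.createGenerator(
                new BufferedOutputStream(new FileOutputStream("x-42.json")))) {
            gen.writeStartObject();
            gen.writeNumberField("id", 42);
            gen.writeArrayFieldStart("childIds");
            for (int childId : new int[] {1, 2, 3}) { // placeholder child ids
                gen.writeNumber(childId);
            }
            gen.writeEndArray();
            gen.writeEndObject();
        }
    }
}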
Eric, you are absolutely correct about ColdFusion garbage collection not removing query information from memory until the request ends, and I've documented it fairly extensively in another SO question. In short, you hit OoM exceptions when you loop over queries. You can prove it using a tool like VisualVM to generate a heap dump while the process is running and then running the resulting dump through Eclipse Memory Analyzer Tool (MAT). What MAT would show you is a large hierarchy, starting with an object named (I'm not making this up) CFDummyContent that holds, among other things, references to cfquery and cfqueryparam tags. Note, attempting to change it up to stored procs or even doing the database interaction via JDBC does not make a difference.
So. What. To. Do?
This took me a while to figure out, but you've got 3 options in increasing order of complexity:
cfthread
asynchronous CFML gateway
daisy chain http requests
Using cfthread looks like this:
<cfloop ...>
    <cfset threadName = "thread" & createUuid()>
    <cfthread name="#threadName#" input="#value#">
        <!--- do query stuff --->
        <!--- code has access to passed attributes (e.g. #attributes.input#) --->
        <cfset thread.passOutOfThread = somethingGeneratedInTheThread>
    </cfthread>
    <cfthread action="join" name="#threadName#" />
    <cfset passedOutOfThread = cfthread["#threadName#"].passOutOfThread>
</cfloop>
Note, this code is not taking advantage of asynchronous processing (hence the immediate join after each thread call), but rather of the side effect that cfthread runs in its own request-like scope, independent of the page.
I'll not cover ColdFusion gateways here. HTTP daisy chaining means executing an increment of the work, and at the end of the increment launching a request to the same algorithm telling it to execute the next increment.
Basically, all three approaches allow those memory references to be collected mid process.
And yes, for whoever asks, bugs have been raised with Adobe; see the question referenced. Also, I believe this issue is specific to Adobe ColdFusion, but I have not tested Railo or OpenBD.
Finally, I have to rant. I've spent a lot of time tracking this one down, fixing it in my own large code base, and several others listed in the referenced question have as well. AFAIK Adobe has not acknowledged the issue, much less committed to fixing it. And yes, it's a bug, plain and simple.
Related
I am not really seeking code examples, but I'm hoping someone can review my program design and provide feedback. I am trying to figure out how to ensure that only one instance of my "workflow" is running at a time.
I am working in C++.
This is my workflow:
I read rows off of a Postgres database.
If the table has any records, I want to do the following:
Read the records and transform them to JSON
Send the JSON document to a remote Web service
Parse the response from the service. The service tells me which records were saved or not saved, based on their primary key.
I delete the successfully saved records
I log the unsuccessful records (there's another process that consumes the logs and so my work is done).
I want to perform all of this on a separate thread (or "task", whatever higher-level abstraction is available in C++), and I want to make sure that if my function for step 1 gets called multiple times, the additional calls basically get "dropped" if step 1 is already in flight.
In C++, I believe I can use a flag and a mutex. I use something like std::lock_guard<std::mutex> at the top of my method. Then the next line checks a flag.
// MyWorkflow.cpp
std::mutex myMutex;
int inFlight = 0;

void process() {
    std::lock_guard<std::mutex> guard(myMutex);
    if (inFlight) {
        return;
    }
    inFlight = 1;
    std::vector<Widget> widgets = readFromMyTable();
    std::string json = getJson(&widgets);
    ... // Send the json to the remote service and handle the response
}
Okay, let me explain my confusion. I want to use Curl to perform the HTTP request, but Curl works asynchronously. So if I make the asynchronous HTTP call via Curl, my process() function will just return and myMutex will be released, right?
I think that in my asynchronous response handler, I need to call a second function that's in MyWorkflow.cpp:
void markCompletion() {
    std::lock_guard<std::mutex> guard(myMutex);
    inFlight = 0; // Reset the inflight flag here
}
Is this the right approach? I am worried that if an exception is thrown anywhere before I call markCompletion(), I will block all future callers. I think I need to ensure I have proper exception handling and always call markCompletion().
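For what it's worth, the "drop if already in flight, always reset" shape of this design can be sketched as follows. This is only an illustration in Java with hypothetical names (in C++ the flag would typically be a std::atomic<bool> and the reset would be guaranteed with RAII or try/catch), not a statement of how it must be done:

import java.util.concurrent.atomic.AtomicBoolean;

public class SingleFlightWorkflow {
    private final AtomicBoolean inFlight = new AtomicBoolean(false);

    // Starts the workflow unless a previous run is still in flight.
    public void process() {
        if (!inFlight.compareAndSet(false, true)) {
            return; // a run is already in flight; drop this call
        }
        try {
            startAsyncWork(); // hypothetical: kicks off the read/transform/send pipeline
        } catch (RuntimeException e) {
            inFlight.set(false); // reset if we failed to even start, so later calls aren't blocked forever
            throw e;
        }
    }

    // Must be called from the async completion handler, on success or failure.
    public void markCompletion() {
        inFlight.set(false);
    }

    private void startAsyncWork() {
        // placeholder for the asynchronous HTTP call and response handling
    }
}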
I am terribly sorry for asking such a noob question, but I really want to learn to do this the right way.
We have a deadlock situation that occurred because of heavy load on a microservice (say A), causing multiple requests from different client services (B and C). These calls from B and C come in for the same clientId (the key), are served by different instances of A, and try to update the same clientId's data in the database at the same time, causing the error below.
CannotAcquireLockException is thrown,
(SQL Error: 60, SQLState: 61000..
ORA-00060: deadlock detected while waiting for resource
We have decided to implement sharding at the load balancer (HAProxy) level, which will ensure that the same instance of A always serves the requests from B and C for a specific key (clientId), so we don't have multiple instances processing requests for the same key.
Now everything happens in a single JVM, since we have made sure that requests from B and C for a specific clientId always reach the same instance of A.
Even so, it's still possible for requests from B and C to arrive for the same clientId nanoseconds apart, and then multiple threads will again try to update the same clientId's data in the database at the same time, causing the same error.
To improve this we are looking at possible solutions, and one of them is a ReentrantReadWriteLock, which should take care of this in principle.
We are using Spring Data JPA and have a save that looks like:
clientJpaRepository.save(clientObject);
Now, is it possible to use something like the code below?
public void save(Client clientObject) {
    String clientId = clientObject.getClientId();
    // writeLock is obtained elsewhere, e.g. from a shared ReentrantReadWriteLock
    boolean isLockAcquired = false;
    try {
        isLockAcquired = writeLock.tryLock(100, TimeUnit.MILLISECONDS);
        if (isLockAcquired) {
            clientJpaRepository.save(clientObject);
        }
    } catch (InterruptedException e) {
        log.error("exception occurred trying to acquire lock for clientId={}", clientId);
    } finally {
        if (isLockAcquired) {
            writeLock.unlock(); // only unlock if this thread actually acquired the lock
        }
    }
}
I am not very sure how it's going to deal with the keys, though: I don't want any threads to block if they want to update a different key (e.g. clientId 2).
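For illustration only, one way to express "one lock per clientId" is a map of locks, so threads working on different keys never contend. This is just a sketch with hypothetical names, not necessarily the right fix:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class PerClientLocks {
    // one lock per clientId, created on demand
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // runs the action only if the per-clientId lock is acquired within the timeout
    public boolean withLock(String clientId, Runnable action) throws InterruptedException {
        ReentrantLock lock = locks.computeIfAbsent(clientId, id -> new ReentrantLock());
        if (lock.tryLock(100, TimeUnit.MILLISECONDS)) {
            try {
                action.run();
                return true;
            } finally {
                lock.unlock();
            }
        }
        return false; // could not acquire the lock for this clientId in time
    }
}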
Also, another thing to note is that there could be reads of this data from the database happening as part of other API calls. Hopefully they would not wait too long, and I hope I don't need to make any changes there for the reads.
Sorry for the long question; I hope to hear from someone soon.
Thanks.
For the past couple of years, I've been maintaining a large C++ application (v100) that uses some form of non-ADO database connection, and it works great.
In that time, getting a result set from the database has been quite simple: I instantiate the return class with the database object, then Open a query.
CUpdates cUpdates(GetDatabase());
CString strQuery = "SELECT * FROM Updates";
cUpdates.Open(-1, strQuery);
Just that simple: cUpdates is filled with records.
Now, however, we want to execute a stored procedure and return the results from it. But no matter what I try, even changing 'EXEC' to 'CALL', the call fails. Is there a similarly simple method for executing a stored procedure and returning the results, without having to totally rewrite how the application handles the database connection and the returned data?
strQuery.Format("EXEC dbo.[GetUpdates_ComputerName] '%s', %d, %d", GetWorkstationName(), m_bRetainUpdates, m_bScheduleUpdate);
cUpdates.Open(-1, strQuery); //FAILS ON EXEC
(I have tested the EXEC statement in SSMS, and it works fine)
We also use another SQL command for strictly executing statements, but I see no way of returning data with it. Maybe there is a similar command I don't know of?
GetDatabase()->ExecuteSQL(strQuery);
Note: for the record, I am a C# developer (since 1.0 beta). My only experience in C++ has been learning on the fly over the past two years, occasionally maintaining a few of our massive systems.
It would seem that CRecordset cannot handle an EXEC statement. So we converted the new stored procedure to a table-valued function so I can use a SELECT instead, which works properly (though I'd rather use the stored procedure).
I want to bulk-import Doctrine entities from an XML file.
The XML file can be very large (up to 1 million entities), so I can't persist all my entities the traditional way:
$em->beginTransaction();
while ($entity = $xmlReader->readNextEntity()) {
    $em->persist($entity);
}
$em->flush();
$em->commit();
I would soon exceed my memory limit, and Doctrine is not really designed to handle that many managed entities.
I don't need to track changes to the persisted entities, just to persist them; therefore I don't want them to be managed by the EntityManager.
Is it possible to persist entities without getting them managed by the EntityManager?
The first option that comes to my mind is to detach it immediately after persisting it:
$em->beginTransaction();
while ($entity = $xmlReader->readNextEntity()) {
    $em->persist($entity);
    $em->flush($entity);
    $em->detach($entity);
}
$em->commit();
But this is quite expensive in Doctrine, and would slow down the import.
The other option would be to directly insert the data into the database using the Connection object and a prepared statement, but I like the abstraction of the entity and would ideally like to store the object directly.
Instead of using detach and flush after each insert, you can call clear (which detaches all entities from the manager) and flush in batches, which should be significantly faster:
Bulk inserts in Doctrine are best performed in batches, taking advantage of the transactional write-behind behavior of an EntityManager. The following code shows an example for inserting 10000 objects with a batch size of 20. You may need to experiment with the batch size to find the size that works best for you. Larger batch sizes mean more prepared statement reuse internally but also mean more work during flush.
https://doctrine-orm.readthedocs.org/projects/doctrine-orm/en/latest/reference/batch-processing.html
If possible, I recommend avoiding transactions for bulk operations as they tend to slow things down:
//$em->beginTransaction();
$i = 0;
while ($entity = $xmlReader->readNextEntity()) {
    $em->persist($entity);
    if (++$i % 20 == 0) {
        $em->flush();
        $em->clear(); // detaches all entities
    }
}
$em->flush(); // persist objects that did not make up an entire batch
$em->clear();
//$em->commit();
Background
I have a 2-tier web service - just my app server and an RDBMS. I want to move to a pool of identical app servers behind a load balancer. I currently cache a bunch of objects in-process. I hope to move them to a shared Redis.
I have a dozen or so caches of simple, small-sized business objects. For example, I have a set of Foos. Each Foo has a unique FooId and an OwnerId.
One "owner" may own multiple Foos.
In a traditional RDBMS this is just a table with an index on the PK FooId and one on OwnerId. I'm caching this in one process simply:
Dictionary<int,Foo> _cacheFooById;
Dictionary<int,HashSet<int>> _indexFooIdsByOwnerId;
Reads come straight from here, and writes go here and to the RDBMS.
I usually have this invariant:
"For a given group [say by OwnerId], the whole group is in cache or none of it is."
So when I cache miss on a Foo, I pull that Foo and all the owner's other Foos from the RDBMS. Updates make sure to keep the index up to date and respect the invariant. When an owner calls GetMyFoos I never have to worry that some are cached and some aren't.
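For illustration, the cache-miss handling described above might look roughly like this. It is a sketch only (the original caches are C# dictionaries; the sketch is in Java and all names are made up), just to make the "load the whole group on a miss" idea concrete:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FooCacheSketch {
    private final Map<Integer, Foo> fooById = new HashMap<>();
    private final Map<Integer, Set<Integer>> fooIdsByOwnerId = new HashMap<>();

    // On a miss, load the owner's entire group so the invariant holds.
    public List<Foo> getMyFoos(int ownerId) {
        Set<Integer> ids = fooIdsByOwnerId.get(ownerId);
        if (ids == null) {
            ids = new HashSet<>();
            for (Foo foo : loadFoosForOwnerFromDb(ownerId)) { // hypothetical RDBMS call
                fooById.put(foo.fooId, foo);
                ids.add(foo.fooId);
            }
            fooIdsByOwnerId.put(ownerId, ids);
        }
        List<Foo> result = new ArrayList<>();
        for (int id : ids) {
            result.add(fooById.get(id));
        }
        return result;
    }

    private List<Foo> loadFoosForOwnerFromDb(int ownerId) {
        return new ArrayList<>(); // placeholder for the real query
    }

    static class Foo {
        int fooId;
        int ownerId;
    }
}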
What I did already
The first/simplest answer seems to be to use plain ol' SET and GET with a composite key and json value:
SET( "ServiceCache:Foo:" + theFoo.Id, JsonSerialize(theFoo));
I later decided I liked:
HSET( "ServiceCache:Foo", theFoo.FooId, JsonSerialize(theFoo));
That lets me get all the values in one cache as HVALS. It also felt right - I'm literally moving hashtables to Redis, so perhaps my top-level items should be hashes.
This works to first order. If my high-level code is like:
UpdateCache(theFoo);
AddToIndex(theFoo);
That translates into:
HSET ("ServiceCache:Foo", theFoo.FooId, JsonSerialize(theFoo));
var myFoos = JsonDeserialize( HGET ("ServiceCache:FooIndex", theFoo.OwnerId) );
myFoos.Add(theFoo.FooId);
HSET ("ServiceCache:FooIndex", theFoo.OwnerId, JsonSerialize(myFoos));
However, this is broken in two ways.
Two concurrent operations can read/modify/write at the same time. The latter "wins" the final HSET and the former's index update is lost.
Another operation could read the index in between the first and second lines. It would miss a Foo that it should find.
So how do I index properly?
I think I could use a Redis set instead of a json-encoded value for the index.
That would solve part of the problem since the "add-to-index-if-not-already-present" would be atomic.
I also read about using MULTI as a "transaction" but it doesn't seem like it does what I want. Am I right that I can't really MULTI; HGET; {update}; HSET; EXEC since it doesn't even do the HGET before I issue the EXEC?
I also read about using WATCH and MULTI for optimistic concurrency, then retrying on failure. But WATCH only works on top-level keys. So it's back to SET/GET instead of HSET/HGET. And now I need a new index-like-thing to support getting all the values in a given cache.
If I understand it right, I can combine all these things to do the job. Something like:
while (!succeeded)
{
    WATCH( "ServiceCache:Foo:" + theFoo.FooId );
    WATCH( "ServiceCache:FooIndexByOwner:" + theFoo.OwnerId );
    WATCH( "ServiceCache:FooIndexAll" );
    MULTI();
    SET ("ServiceCache:Foo:" + theFoo.FooId, JsonSerialize(theFoo));
    SADD ("ServiceCache:FooIndexByOwner:" + theFoo.OwnerId, theFoo.FooId);
    SADD ("ServiceCache:FooIndexAll", theFoo.FooId);
    EXEC();
    //TODO somehow set succeeded properly
}
Finally, I'd have to translate this pseudocode into real code depending on how my client library exposes WATCH/MULTI/EXEC; it looks like they need some sort of context to hook them together.
All in all this seems like a lot of complexity for what has to be a very common case;
I can't help but think there's a better, smarter, Redis-ish way to do things that I'm just not seeing.
How do I lock properly?
Even if I had no indexes, there's still a (probably rare) race condition.
A: HGET - cache miss
B: HGET - cache miss
A: SELECT
B: SELECT
A: HSET
C: HGET - cache hit
C: UPDATE
C: HSET
B: HSET ** this is stale data that's clobbering C's update.
Note that C could just be a really-fast A.
Again I think WATCH, MULTI, retry would work, but... ick.
I know in some places people use special Redis keys as locks for other objects. Is that a reasonable approach here?
Should those be top-level keys like ServiceCache:FooLocks:{Id} or ServiceCache:Locks:Foo:{Id}?
Or make a separate hash for them - ServiceCache:Locks with subkeys Foo:{Id}, or ServiceCache:Locks:Foo with subkeys {Id} ?
How would I work around abandoned locks, say if a transaction (or a whole server) crashes while "holding" the lock?
For your use case, you don't need to use WATCH. Simply use a MULTI + EXEC block and you'll have eliminated the race condition.
In pseudocode:
MULTI();
SET ("ServiceCache:Foo:" + theFoo.FooId, JsonSerialize(theFoo));
SADD ("ServiceCache:FooIndexByOwner:" + theFoo.OwnerId, theFoo.FooId);
SADD ("ServiceCache:FooIndexAll", theFoo.FooId);
EXEC();
This is sufficient because MULTI makes the following promise:
"It can never happen that a request issued by another client is served in the middle of the execution of a Redis transaction"
You don't need the WATCH-and-retry mechanism because you are not reading and writing in the same transaction.
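To show how the MULTI/EXEC block hooks together in a real client, here is a sketch using the Jedis Java client purely as an illustration (the question's code looks like C#, and other clients expose the same idea differently); the transaction object is the "context" that queues the commands until exec():

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class FooCacheWriter {
    // writes the Foo value and both index entries as one atomic unit
    public void writeFoo(Jedis jedis, int fooId, int ownerId, String fooJson) {
        Transaction t = jedis.multi(); // MULTI: commands are queued, not executed yet
        t.set("ServiceCache:Foo:" + fooId, fooJson);
        t.sadd("ServiceCache:FooIndexByOwner:" + ownerId, String.valueOf(fooId));
        t.sadd("ServiceCache:FooIndexAll", String.valueOf(fooId));
        t.exec(); // EXEC: all queued commands run without another client interleaving
    }
}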