I want to bulk-import Doctrine entities from an XML file.
The XML file can be very large (up to 1 million entities), so I can't persist all my entities the traditional way:
$em->beginTransaction();
while ($entity = $xmlReader->readNextEntity()) {
$em->persist($entity);
}
$em->flush();
$em->commit();
I would soon exceed my memory limit, and Doctrine is not really designed to handle that many managed entities.
I don't need to track changes to the persisted entities, just to persist them; therefore I don't want them to be managed by the EntityManager.
Is it possible to persist entities without getting them managed by the EntityManager?
The first option that comes to my mind is to detach it immediately after persisting it:
$em->beginTransaction();
while ($entity = $xmlReader->readNextEntity()) {
$em->persist($entity);
$em->flush($entity);
$em->detach($entity);
}
$em->commit();
But this is quite expensive in Doctrine, and would slow down the import.
The other option would be to directly insert the data into the database using the Connection object and a prepared statement, but I like the abstraction of the entity and would ideally like to store the object directly.
Instead of using detach and flush after each insert, you can call clear (which detaches all entities from the manager) and flush in batches, which should be significantly faster:
Bulk inserts in Doctrine are best performed in batches, taking
advantage of the transactional write-behind behavior of an
EntityManager. The following code shows an example for inserting 10000
objects with a batch size of 20. You may need to experiment with the
batch size to find the size that works best for you. Larger batch
sizes mean more prepared statement reuse internally but also mean more
work during flush.
https://doctrine-orm.readthedocs.org/projects/doctrine-orm/en/latest/reference/batch-processing.html
If possible, I recommend avoiding transactions for bulk operations as they tend to slow things down:
//$em->beginTransaction();
$i = 0;
while ($entity = $xmlReader->readNextEntity()) {
$em->persist($entity);
if(++$i % 20 == 0) {
$em->flush();
$em->clear(); // detaches all entities
}
}
$em->flush(); //Persist objects that did not make up an entire batch
$em->clear();
//$em->commit();
Related
Google Cloud Datastore documents that if an entity id needs to be pre-allocated, then one should use the allocateIds method:
https://cloud.google.com/datastore/docs/best-practices#keys
That method seems to make a REST or RPC call which has latency. I'd like to avoid that latency by using a PRNG in my Kubernetes Engine application. Here's the scala code:
import java.security.SecureRandom
class RandomFactory {
protected val r = new SecureRandom
def randomLong: Long = r.nextLong
def randomLong(min: Long, max: Long): Long =
// Unfortunately, Java didn't make Random.internalNextLong public,
// so we have to get to it in an indirect way.
r.longs(1, min, max).toArray.head
// id may be any value in the range (1, MAX_SAFE_INTEGER),
// so that it can be represented in Javascript.
// TODO: randomId is used in production, and might be susceptible to
// TODO: blocking if /dev/random does not contain entropy.
// TODO: Keep an eye on this concern.
def randomId: Long =
randomLong(1, RandomFactory.MAX_SAFE_INTEGER)
}
object RandomFactory extends RandomFactory {
// MAX_SAFE_INTEGER is es6 Number.MAX_SAFE_INTEGER
val MAX_SAFE_INTEGER = 9007199254740991L
}
I also plan to install haveged in the pod to help with entropy.
I understand allocateIds ensures that an ID is not already in use. But in my particular use case, there are two mitigating factors to overlooking that concern:
Based on entity count, the chance of a conflict is 1 in 100 million.
This particular entity type is non-essential, and can afford a "once in a blue moon" conflict.
I am more concerned about even distribution in keyspace, because that is normal use case concern.
Will this approach work, particularly with even distribution in keyspace? Is the allocatedIds method essential, or does it just help developers avoid simple mistakes?
To get rid of collisions use more bits -- for all practical purposes 128 [See statistics behind UUID V4] will never generate a collision.
Another technique is to insert new entities with a shorter random number and handle the error Cloud Datastore returns if they already exist by trying again with a new ID (until you happen upon one that isn't currently in use).
As far as the key distribution goes: the keys will be randomly distributed within the key space will keep Cloud Datastore happy.
Given that you don't want the entity identifier to be based on an external value, you should allow Cloud Datastore to allocate IDs for you. This way you won't have any conflicts. The IDs allocated by Cloud Datastore will be appropriately scattered through the key space.
We use H2 database to execute tests. To isolate each test from another one, the database schema and basic data-setup is dropped and re-created before each test.
Is it possible to create a restore-point after the first setup of the database and restore before each test the data of this point?
SCRIPT just creates a sql-file with all tables and datas. Not a big difference to our own initialization.
Question database restore to particular state for testing is the same, just for Oracle and Postgres.
An old question, but I find it is still relevant. AFAIK there is no restore-point support.
Here is a simple, yet fast approach to backup/restore.
Create a backup prior to running the first test:
Connection conn = DriverManager.getConnection("jdbc:h2:mem:myDatabase;DB_CLOSE_DELAY=-1;LOG=0");
Statement stat = conn.createStatement();
stat.execute("SCRIPT TO 'memFS:myDatabase.sql'");
stat.close();
conn.close();
Restore after each test:
Connection conn = DriverManager.getConnection("jdbc:h2:mem:myDatabase;DB_CLOSE_DELAY=-1;LOG=0");
Statement stat = conn.createStatement();
stat.execute("DROP ALL OBJECTS");
stat.close();
conn.close();
conn = DriverManager.getConnection("jdbc:h2:mem:myDatabase;DB_CLOSE_DELAY=-1;INIT=runscript from 'memFS:myDatabase.sql';LOG=0");
conn.close();
Note that SHUTDOWN command turned out to be faster than DROP ALL OBJECTS, but it caused some issues (connection pool unable to reestablish connection).
I would not say the above approach is slow, far from it. But with a large database and thousands of tests there is still room for improvement as the method above takes some time. I managed to achieve a few times faster backup/restore (~15ms for a DB with ~350 tables) manually composing a script performing TRUNCATE TABLE, ALTER SEQUENCE and do the INSERT of all initial data (needs SET REFERENTIAL_INTEGRITY FALSE for cleanup/restore procedure to be really fast). The code is cumbersome but was worth the effort.
We have a data set that grows while the application is processing the data set. After a long discussion we have come to the decision that we do not want blocking or asynchronous APIs at this time, and we will periodically query our data store.
We thought of two options to design an API for querying our storage:
A query method returns a snapshot of the data and a flag indicating weather we might have more data. When we finish iterating over the last returned snapshot, we query again to get another snapshot for the rest of the data.
A query method returns a "live" iterator over the data, and when this iterator advances it returns one of the following options: Data is available, No more data, Might have more data.
We are using C++ and we borrowed the .NET style enumerator API for reasons which are out of scope for this question. Here is some code to demonstrate the two options. Which option would you prefer?
/* ======== FIRST OPTION ============== */
// similar to the familier .NET enumerator.
class IFooEnumerator
{
// true --> A data element may be accessed using the Current() method
// false --> End of sequence. Calling Current() is an invalid operation.
virtual bool MoveNext() = 0;
virtual Foo Current() const = 0;
virtual ~IFooEnumerator() {}
};
enum class Availability
{
EndOfData,
MightHaveMoreData,
};
class IDataProvider
{
// Query params allow specifying the ID of the starting element. Here is the intended usage pattern:
// 1. Call GetFoo() without specifying a starting point.
// 2. Process all elements returned by IFooEnumerator until it ends.
// 3. Check the availability.
// 3.1 MightHaveMoreDataLater --> Invoke GetFoo() again after some time by specifying the last processed element as the starting point
// and repeat steps (2) and (3)
// 3.2 EndOfData --> The data set will not grow any more and we know that we have finished processing.
virtual std::tuple<std::unique_ptr<IFooEnumerator>, Availability> GetFoo(query-params) = 0;
};
/* ====== SECOND OPTION ====== */
enum class Availability
{
HasData,
MightHaveMoreData,
EndOfData,
};
class IGrowingFooEnumerator
{
// HasData:
// We might access the current data element by invoking Current()
// EndOfData:
// The data set has finished growing and no more data elements will arrive later
// MightHaveMoreData:
// The data set will grow and we need to continue calling MoveNext() periodically (preferably after a short delay)
// until we get a "HasData" or "EndOfData" result.
virtual Availability MoveNext() = 0;
virtual Foo Current() const = 0;
virtual ~IFooEnumerator() {}
};
class IDataProvider
{
std::unique_ptr<IGrowingFooEnumerator> GetFoo(query-params) = 0;
};
Update
Given the current answers, I have some clarification. The debate is mainly over the interface - its expressiveness and intuitiveness in representing queries for a growing data-set that at some point in time will stop growing. The implementation of both interfaces is possible without race conditions (at-least we believe so) because of the following properties:
The 1st option can be implemented correctly if the pair of the iterator + the flag represent a snapshot of the system at the time of querying. Getting snapshot semantics is a non-issue, as we use database transactions.
The 2nd option can be implemented given a correct implementation of the 1st option. The "MoveNext()" of the 2nd option will, internally, use something like the 1st option and re-issue the query if needed.
The data-set can change from "Might have more data" to "End of data", but not vice versa. So if we, wrongly, return "Might have more data" because of a race condition, we just get a small performance overhead because we need to query again, and the next time we will receive "End of data".
"Invoke GetFoo() again after some time by specifying the last processed element as the starting point"
How are you planning to do that? If it's using the earlier-returned IFooEnumerator, then functionally the two options are equivalent. Otherwise, letting the caller destroy the "enumerator" then however-long afterwards call GetFoo() to continue iteration means you're losing your ability to monitor the client's ongoing interest in the query results. It might be that right now you have no need for that, but I think it's poor design to exclude the ability to track state throughout the overall result processing.
It really depends on many things whether the overall system will at all work (not going into details about your actual implementation):
No matter how you twist it, there will be a race condition between checking for "Is there more data" and more data being added to the system. Which means that it's possibly pointless to try to capture the last few data items?
You probably need to limit the number of repeated runs for "is there more data", or you could end up in an endless loop of "new data came in while processing the last lot".
How easy it is to know if data has been updated - if all the updates are "new items" with new ID's that are sequentially higher, you can simply query "Is there data above X", where X is your last ID. But if you are, for example, counting how many items in the data has property Y set to value A, and data may be updated anywhere in the database at the time (e.g. a database of where taxis are at present, that gets updated via GPS every few seconds and has thousands of cars, it may be hard to determine which cars have had updates since last time you read the database).
As to your implementation, in option 2, I'm not sure what you mean by the MightHaveMoreData state - either it has, or it hasn't, right? Repeated polling for more data is a bad design in this case - given that you will never be able to say 100% certain that there hasn't been "new data" provided in the time it took from fetching the last data until it was processed and acted on (displayed, used to buy shares on the stock market, stopped the train or whatever it is that you want to do once you have processed your new data).
Read-write lock could help. Many readers have simultaneous access to data set, and only one writer.
The idea is simple:
-when you need read-only access, reader uses "read-block", which could be shared with other reads and exclusive with writers;
-when you need write access, writer uses write-lock which is exclusive for both readers and writers;
So I've got a task to build which is going to archive a ton of data in our DB into JSON.
To give you a better idea of what is happening; X has 100s of Ys, and Y has 100s of Zs and so on. I'm creating a json file for every X, Y, and Z. But every X json file has an array of ids for the child Ys of X, and likewise the Ys store an array of child Zs..
It more complicated than that in many cases, but you should get an idea of the complexity involved from that example I think.
I was using ColdFusion but it seems to be a bad choice for this task because it is crashing due to memory errors. It seems to me that if it were removing queries from memory that are no longer referenced while running the task (ie: garbage collecting) then the task should have enough memory, but afaict ColdFusion isn't doing any garbage collection at all, and must be doing it after a request is complete.
So I'm looking either for advice on how to better achieve my task in CF, or for recommendations on other languages to use..
Thanks.
1) If you have debugging enabled, coldfusion will hold on to your queries until the page is done. Turn it off!
2) You may need to structDelete() the query variable to allow it to be garbage collected, otherwise it may persist as long as the scope that has a reference to it exists. eg.,
<cfset structDelete(variables,'myQuery') />
3) A cfquery pulls the entire ResultSet into memory. Most of the time this is fine. But for reporting on a large result set, you don't want this. Some JDBC drivers support setting the fetchSize, which in a forward, read only fashion, will let you get a few results at a time. This way you can deal with thousands and thousands of rows, without swamping memory. I just generated a 1GB csv file in ~80 seconds, using less than 100mb of heap. This requires dropping out to Java. But it kills two birds with one stone. It reduces the amount of data brought in at a time by the JDBC driver, and since you're working directly with the ResultSet, you don't hit the cfloop problem #orangepips mentioned. Granted, it's not for those without some Java chops.
You can do it something like this (you need cfusion.jar in your build path):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.ResultSet;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import au.com.bytecode.opencsv.CSVWriter;
import coldfusion.server.ServiceFactory;
public class CSVExport {
public static void export(String dsn,String query,String fileName) {
Connection conn = null;
Statement stmt = null;
ResultSet rs = null;
FileWriter fw = null;
BufferedWriter bw = null;
try {
DataSource ds = ServiceFactory.getDataSourceService().getDatasource(dsn);
conn = ds.getConnection();
// we want a forward-only, read-only result.
// you may want need to use a PreparedStatement instead.
stmt = conn.createStatement(
ResultSet.TYPE_FORWARD_ONLY,
ResultSet.CONCUR_READ_ONLY
);
// we only want to go forward!
stmt.setFetchDirect(ResultSet.FETCH_FORWARD);
// how many records to pull back at a time.
// the hard part is balancing memory usage, and round trips to the database.
// basically sacrificing speed for a lower memory hit.
stmt.setFetchSize(256);
rs = stmt.executeQuery(query);
// do something with the ResultSet, for example write to csv using opencsv
// the key is to stream it. you don't want it stored in memory.
// so excel spreadsheets and pdf files are out, but text formats like
// like csv, json, html, and some binary formats like MDB (via jackcess)
// that support streaming are in.
fw = new FileWriter(fileName);
bw = new BufferedWriter(fw);
CSVWriter writer = new CSVWriter(bw);
writer.writeAll(rs,true);
}
catch (Exception e) {
// handle your exception.
// maybe try ServiceFactory.getLoggingService() if you want to do a cflog.
e.printStackTrace();
}
finally() {
try {rs.close()} catch (Exception e) {}
try {stmt.close()} catch (Exception e) {}
try {conn.close()} catch (Exception e) {}
try {bw.close()} catch (Exception e) {}
try {fw.close()} catch (Exception e) {}
}
}
}
Figuring out how to pass parameters, logging, turning this into a background process (hint: extend Thread) etc. are separate issues, but if you grok this code, it shouldn't be too difficult.
4) Perhaps look at Jackson for generating your json. It supports streaming, and combined with the fetchSize, and a BufferedOutputStream, you should be able to keep the memory usage way down.
Eric, you are absolutely correct about ColdFusion garbage collection not removing query information from memory until request end and I've documented it fairly extensively in another SO question. In short, you hit OoM Exceptions when you loop over queries. You can prove it using a tool like VisualVM to generate a heap dump while the process is running and then running the resulting dump through Eclipse Memory Analyzer Tool (MAT). What MAT would show you is a large hierarchy, starting with an object named (I'm not making this up) CFDummyContent that holds, among other things, references to cfquery and cfqueryparam tags. Note, attempting to change it up to stored procs or even doing the database interaction via JDBC does not make difference.
So. What. To. Do?
This took me a while to figure out, but you've got 3 options in increasing order of complexity:
<cthread/>
asynchronous CFML gateway
daisy chain http requests
Using cfthread looks like this:
<cfloop ...>
<cfset threadName = "thread" & createUuid()>
<cfthread name="#threadName#" input="#value#">
<!--- do query stuff --->
<!--- code has access to passed attributes (e.g. #attributes.input#) --->
<cfset thread.passOutOfThread = somethingGeneratedInTheThread>
</cfthread>
<cfthread action="join" name="#threadName#">
<cfset passedOutOfThread = cfthread["#threadName#"].passOutOfThread>
</cfloop>
Note, this code is not taking advantage of asynchronous processing, thus the immediate join after each thread call, but rather the side effect that cfthread runs in its own request-like scope independent of the page.
I'll not cover ColdFusion gateways here. HTTP daisy chaining means executing an increment of the work, and at the end of the increment launching a request to the same algorithm telling it to execute the next increment.
Basically, all three approaches allow those memory references to be collected mid process.
And yes, for whoever asks, bugs have been raised with Adobe, see the question referenced. Also, I believe this issue is specific to Adobe ColdFusion, but have not tested Railo or OpenDB.
Finally, have to rant. I've spent a lot of time tracking this one down, fixing it in my own large code base, and several others listed in the question referenced have as well. AFAIK Adobe has not acknowledge the issue much-the-less committed to fixing it. And, yes it's a bug, plain and simple.
I am using the parallel data structures in my .NET 4 application and I have a ConcurrentQueue that gets added to while I am processing through it.
I want to do something like:
personqueue.AsParallel().WithDegreeOfParallelism(20).ForAll(i => ... );
as I make database calls to save the data, so I am limiting the number of concurrent threads.
But, I expect that the ForAll isn't going to dequeue, and I am concerned about just doing
ForAll(i => {
personqueue.personqueue.TryDequeue(...);
...
});
as there is no guarantee that I am popping off the correct one.
So, how can I iterate through the collection and dequeue, in a parallel fashion.
Or, would it be better to use PLINQ to do this processing, in parallel?
Well I'm not 100% sure what you try to archive here. Are you trying to just dequeue all items until nothing is left? Or just dequeue lots of items in one go?
The first probably unexpected behavior starts with this statement:
theQueue.AsParallel()
For a ConcurrentQueue, you get a 'Snapshot'-Enumerator. So when you iterate over a concurrent stack, you only iterate over the snapshot, no the 'live' queue.
In general I think it's not a good idea to iterate over something you're changing during the iteration.
So another solution would look like this:
// this way it's more clear, that we only deque for theQueue.Count items
// However after this, the queue is probably not empty
// or maybe the queue is also empty earlier
Parallel.For(0, theQueue.Count,
new ParallelOptions() {MaxDegreeOfParallelism = 20},
() => {
theQueue.TryDequeue(); //and stuff
});
This avoids manipulation something while iterating over it. However, after that statement, the queue can still contain data, which was added during the for-loop.
To get the queue empty for moment in time you probably need a little more work. Here's an really ugly solution. While the queue has still items, create new tasks. Each task start do dequeue from the queue as long as it can. At the end, we wait for all tasks to end. To limit the parallelism, we never create more than 20-tasks.
// Probably a kitty died because of this ugly code ;)
// However, this code tries to get the queue empty in a very aggressive way
Action consumeFromQueue = () =>
{
while (tt.TryDequeue())
{
; // do your stuff
}
};
var allRunningTasks = new Task[MaxParallism];
for(int i=0;i<MaxParallism && tt.Count>0;i++)
{
allRunningTasks[i] = Task.Factory.StartNew(consumeFromQueue);
}
Task.WaitAll(allRunningTasks);
If you are aiming at a high throughout real site and you don't have to do immediate DB updates , you'll be much better of going for very conservative solution rather than extra layers libraries.
Make fixed size array (guestimate size - say 1000 items or N seconds worth of requests) and interlocked index so that requests just put data into slots and return. When one block gets filled (keep checking the count), make another one and spawn async delegate to process and send to SQL the block that just got filled. Depending on the structure of your data that delegate can pack all data into comma-separated arrays, maybe even a simple XML (got to test perf of that one of course) and send them to SQL sproc which should give it's best to process them record by record - never holding a big lock. It if gets heavy, you can split your block into several smaller blocks. The key thing is that you minimized the number of requests to SQL, always kept one degree of separation and didn't even have to pay the price for a thread pool - you probably won't need to use more that 2 async threads at all.
That's going to be a lot faster that fiddling with Parallel-s.