I have an Apache Apex application consuming Kafka logs and writing them to HDFS.
The DAG is simple enough: a Kafka consumer (20 partitions, with 2 GB of memory per operator) is connected by a stream to a "MyWriter extends AbstractFileOutputOperator".
Issue:
1. I have been seeing the Writer repeatedly writing .tmp files with the same size and the same data many times. I have tried increasing the Writer operator's memory, increasing the number of Writer partitions, etc., but the issue keeps happening.
I also tried adding/removing requestFinalize in MyWriter. Still the same issue.
@Override
public void endWindow()
{
  if (null != fileName) {
    requestFinalize(fileName);
  }
  super.endWindow();
}
This is a subset of my properties.xml
<property>
<name>dt.attr.STREAMING_WINDOW_SIZE_MILLIS</name>
<value>1000</value>
</property>
<property>
<name>dt.application.myapp.operator.*.attr.APPLICATION_WINDOW_COUNT</name>
<value>60</value>
</property>
<property>
<name>dt.application.myapp.operator.*.attr.CHECKPOINT_WINDOW_COUNT</name>
<value>60</value>
</property>
<property>
<name>dt.application.myapp.operator.myWriter.attr.PARTITIONER</name>
<value>com.datatorrent.common.partitioner.StatelessPartitioner:20</value>
</property>
<property>
<name>dt.application.myapp.operator.myWriter.prop.maxLength</name>
<value>1000000000</value> <!-- 1 GB File -->
</property>
This is the stack trace I was able to get from dt.log for the operator:
The operator gets redeployed, probably in different containers, throws this exception, and keeps writing the duplicate files.
java.lang.RuntimeException: java.io.FileNotFoundException: File does not exist: /kafkaconsumetest/inventoryCount/nested/trial2/1471489200000_1471489786800_161.0.1471489802786.tmp
at com.datatorrent.lib.io.fs.AbstractFileOutputOperator.setup(AbstractFileOutputOperator.java:418)
at com.datatorrent.lib.io.fs.AbstractFileOutputOperator.setup(AbstractFileOutputOperator.java:112)
at com.datatorrent.stram.engine.Node.setup(Node.java:187)
at com.datatorrent.stram.engine.StreamingContainer.setupNode(StreamingContainer.java:1309)
at com.datatorrent.stram.engine.StreamingContainer.access$100(StreamingContainer.java:130)
at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1388)
Caused by: java.io.FileNotFoundException: File does not exist: /kafkaconsumetest/inventoryCount/nested/trial2/1471489200000_1471489786800_161.0.1471489802786.tmp
at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1219)
at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1211)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1211)
at com.datatorrent.lib.io.fs.AbstractFileOutputOperator.setup(AbstractFileOutputOperator.java:411)
... 5 more
2016-08-17 22:17:01,108 INFO com.datatorrent.stram.engine.StreamingContainer: Undeploy request: [161, 177]
2016-08-17 22:17:01,116 INFO com.datatorrent.stram.engine.StreamingContainer: Undeploy complete.
2016-08-17 22:17:02,121 INFO com.datatorrent.stram.engine.StreamingContainer: Waiting for pending request.
2016-08-17 22:17:02,625 INFO com.datatorrent.stram.engine.StreamingContainer: Waiting for pending request.
2016-08-17 22:17:03,129 INFO com.datatorrent.stram.engine.StreamingContainer: Waiting for pending request.
The code for the base operator is at the following link and is referenced in the
comments below:
https://github.com/apache/apex-malhar/blob/master/library/src/main/java/com/datatorrent/lib/io/fs/AbstractFileOutputOperator.java
By setting the max file size to 1GB, you automatically enable rolling files; the relevant fields are:
protected Long maxLength = Long.MAX_VALUE;
protected transient boolean rollingFile = false;
The latter is set to true in the setup() method if the former has a value less than the default value of Long.MAX_VALUE.
When rolling files are enabled, file finalization is done automatically, so you should not call requestFinalize().
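Roughly, the check in setup() looks like the following (a paraphrased sketch of the linked source, not an exact copy):
// rolling is enabled implicitly when a finite maxLength is configured
if (maxLength < Long.MAX_VALUE) {
  rollingFile = true;
}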
Secondly, in your MyWriter class, remove the endWindow() override. In the setup() method, create a file name that includes the operator id, and return that name from the getFileName() override; this ensures that multiple partitions don't step on one another. For example:
@NotNull
private String fileName;          // current base file name

private transient String fName;   // per-partition file name

@Override
public void setup(Context.OperatorContext context)
{
  // create file name for this partition by appending the operator id to the base name
  long id = context.getId();
  fName = fileName + "_p" + id;
  super.setup(context);
  LOG.debug("Leaving setup, fName = {}, id = {}", fName, id);
}

@Override
protected String getFileName(Long[] tuple)
{
  return fName;
}
The file base name (fileName in the code above) can be set directly in the code or initialized from a property in an XML file (you'll need to add a getter and setter for it as well).
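A minimal sketch of those accessors (the property path in the comment reuses the application and operator names from the properties.xml above; adjust it if yours differ):
// Bean-style accessors so the base name can be injected from properties.xml,
// e.g. dt.application.myapp.operator.myWriter.prop.fileName
public void setFileName(String fileName)
{
  this.fileName = fileName;
}

public String getFileName()
{
  return fileName;
}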
You can see an example of this type of usage at:
https://github.com/DataTorrent/examples/tree/master/tutorials/fileOutput
A couple of additional suggestions:
Set the partition count to 1 (or comment out the XML that sets the PARTITIONER attribute) and make sure everything works as expected; this will eliminate any issues that are not related to partitioning. If possible, also reduce the max file size to, say, 2K or 4K so testing is easier.
Once the single-partition case works, increase the number of partitions to 2. If that works, arbitrarily larger numbers (within reason) should also work.
I am trying to pass a "to" URI value dynamically from a property value. That property value is already configured in the cfg file.
When the file name is extracted using the CamelFileNameOnly header, it has to be passed to the "to" URI endpoint so that the same name is referred to in the code.
Please find my code below:
I have dropped a file named KevinFile.txt in my server location D:\Servers\jboss-fuse-6.2.0.redhat-133\data\myLocalFTP (file://data/myLocalFTP).
Config File
local.folder.url=file://data/myLocalFTP
KevinFile=file://data/KevinFileDirectory
Camel Route
<route id="awsRoute">
<from uri="{{local.folder.url}}"/>
<bean ref="processorClass" method="process"/>
<log message="myProperty value is ${exchangeProperty.myProperty}"/> <---Gives the fileName
<to uri="{{${exchangeProperty.myProperty}}}"/> <--This is the spot i am getting error :(
</route>
ProcessorClass.java
public class ProcessorClass implements Processor {

  @Override
  public void process(Exchange exchange) throws Exception {
    String fileName = (String) exchange.getIn().getHeader("CamelFileNameOnly");
    exchange.setProperty("myPropertyNew", fileName);
  }
}
If I understand correctly, you need to specify a "dynamic" value for the producer endpoint, which is normally constant. Instead of <to uri="{{${exchangeProperty.myProperty}}}"/> you can use recipientList or routingSlip:
<recipientList>
<simple>${exchangeProperty.myProperty}</simple>
</recipientList>
or
<routingSlip>
<simple>${exchangeProperty.myProperty}</simple>
</routingSlip>
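For reference, a roughly equivalent route in the Java DSL might look like this (a sketch reusing the endpoints and processor from the question; recipientList evaluates the expression per exchange, so the target URI comes from the property set in the processor):
import org.apache.camel.builder.RouteBuilder;

public class DynamicTargetRoute extends RouteBuilder {

  @Override
  public void configure() throws Exception {
    from("{{local.folder.url}}")
        .bean(ProcessorClass.class, "process")
        .log("myProperty value is ${exchangeProperty.myProperty}")
        // the property name must match the one set in the processor
        .recipientList(simple("${exchangeProperty.myProperty}"));
  }
}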
Ah, what you're looking for is simply setting the header from a property. You can do that like this:
from("direct:start")
.setHeader("CamelFileNameOnly").simple("{{myPropertyName}}")
.to("file://data/myLocalDisk");
You can also simplify this by using the URI syntax available on the file component in this case (thanks to Sergii for the recommendation). Just make sure you check the Camel documentation for each component: certain components rely on exchange headers, while others can leverage URI properties.
from("direct:start")
.to("file://data/myLocalDisk?fileName={{myPropertyName}}");
It's also worth noting that if you have logic you want to apply before setting the header, you can have setHeader call a bean.
from("direct:start")
.setHeader("CamelFileNameOnly").bean(MyPropertyLogicBean.class, "someMethod({{myPropertyName}})")
.to("file://data/myLocalDisk");
Use the camel properties component to get this property to resolve.
Reference: http://camel.apache.org/properties.html
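A minimal sketch of registering the properties component programmatically (the file name app.properties is illustrative; with it in place, the {{...}} placeholders in the routes above resolve from that file):
import org.apache.camel.CamelContext;
import org.apache.camel.component.properties.PropertiesComponent;
import org.apache.camel.impl.DefaultCamelContext;

public class PropertiesSetup {

  public static CamelContext createContext() {
    CamelContext context = new DefaultCamelContext();
    // register the properties component so {{myPropertyName}} style
    // placeholders resolve from the given file
    PropertiesComponent pc = new PropertiesComponent();
    pc.setLocation("classpath:app.properties");
    context.addComponent("properties", pc);
    return context;
  }
}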
In FlushModeType.AUTO mode, the persistence context is synchronized with the database at the following times:
before each SELECT operation
at the end of a transaction
after a flush or close operation on the persistence context
In FlushModeType.COMMIT mode, the provider does not have to flush the persistence context before executing a query, because you have indicated that there is no changed data in memory that would affect the results of the database query.
I have made an example in JBoss AS 6.0:
@Stateless
public class SessionBeanTwoA implements SessionBeanTwoALocal {

  @PersistenceContext(unitName = "entity_manager_trans_unit")
  protected EntityManager em;

  @EJB
  private SessionBeanTwoBLocal repo;

  @Override
  @TransactionAttribute(TransactionAttributeType.REQUIRED)
  public void findPersonByEmail(String email) {
    /* 1 */ List<Person> persons = repo.retrievePersonByEmail(email);
    /* 2 */ Person person = persons.get(0);
    /* 3 */ System.out.println(person.getAge());
    /* 4 */ person.setAge(2);
    /* 5 */ persons = repo.retrievePersonByEmail(email);
    /* 6 */ person = persons.get(0);
    /* 7 */ System.out.println(person.getAge());
  }
}
@Stateless
public class SessionBeanTwoB extends GenericCrud implements SessionBeanTwoBLocal {

  @Override
  public List<Person> retrievePersonByEmail(String email) {
    Query query = em.createNamedQuery("Person.findAllPersonByEmail");
    query.setFlushMode(FlushModeType.COMMIT);
    query.setParameter("email", email);
    List<Person> persons;
    persons = query.getResultList();
    return persons;
  }
}
FlushModeType.COMMIT does not seem to work. At line 1, the person's age is taken from the database, and 35 is printed at line 3. At line 4, the person is updated within the persistence context, but at line 7 the person's age is printed as 2.
The JPA 2.0 spec says:
If FlushModeType.COMMIT is set, the effect of updates made to entities in the persistence context upon queries is unspecified.
But many books explain what I wrote at the beginning of this post.
So what does FlushModeType.COMMIT really do?
Thanks in advance for your help.
The javadoc mentions this for FlushModeType.COMMIT:
Flushing to occur at transaction commit. The provider may flush at
other times, but is not required to.
So if the provider thinks it should, it can flush even though it is configured to flush on commit. With the AUTO setting, the provider typically flushes at various times, which requires an expensive traversal of all managed entities (especially if their number is huge) to check whether any database updates or deletes need to be scheduled. If we are sure that no database changes are happening, we can use the COMMIT setting to cut down on those frequent checks and save some CPU cycles.
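For reference, a minimal sketch of the two places FlushModeType.COMMIT can be requested (it reuses the Person entity and named query from the question; either way, per the javadoc above, the provider is allowed but not required to flush before the query runs):
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.FlushModeType;
import javax.persistence.TypedQuery;

public class FlushModeExamples {

  // COMMIT requested for a single query, as in SessionBeanTwoB above
  public List<Person> perQuery(EntityManager em, String email) {
    TypedQuery<Person> query =
        em.createNamedQuery("Person.findAllPersonByEmail", Person.class);
    query.setFlushMode(FlushModeType.COMMIT); // affects only this query
    query.setParameter("email", email);
    return query.getResultList();
  }

  // COMMIT requested for every query issued through this EntityManager
  public void perEntityManager(EntityManager em) {
    em.setFlushMode(FlushModeType.COMMIT);
  }
}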
I am using JPA 3, with annotations (no mapping file) and with the provider org.hibernate.ejb.HibernatePersistence.
I need to have optimistic concurrency.
1) I tried to rely on the <version> tag; it did not work.
2) So I decided to do it with Java code. I have a mergeServiceRequest method and an object of type Request, as follows: I start a transaction, lock the request object,
then try to get a Request object newRequest from the database and compare its timestamp with that of the current request. If they do not match, I throw an exception; if they match, I update the current request with the current time and save it to the database.
I need to lock the object manually, because starting a transaction from the session does not put a lock on the row in the database. I wrote some Java code which shows that a transaction does not lock the record in the database automatically.
The problem with this approach is that the query
Request newRequest=entityManager.createQuery("select r from Request r where serviceRequestId = " + request.getServiceRequestId());
always returns the same object as request. "request" is in the session's entityManager, and the query always returns what is cached in the session. I tried all five of the query.setHint lines and I still get the same result: no database query is performed, and the result comes from the session cache directly.
@Transactional
public void mergeServiceRequest(Request request) {
  System.out.println("ServiceRequestDao.java line 209");
  EntityTransaction transaction = entityManager.getTransaction();
  transaction.begin();
  entityManager.lock(request, LockModeType.WRITE); // used to lock the database row
  Query query = entityManager.createQuery("select r from Request r where serviceRequestId = " + request.getServiceRequestId());
  //query.setHint("javax.persistence.cache.retrieveMode", "BYPASS");
  //query.setHint("org.hibernate.cacheMode", CacheMode.REFRESH);
  //query.setHint("javax.persistence.cache.retrieveMode", CacheMode.REFRESH);
  //query.setHint("javax.persistence.retrieveMode", CacheMode.REFRESH);
  //query.setHint(QueryHints.CACHE_USAGE, CacheUsage.DoNotCheckCache);
  Request newRequest = (Request) query.getSingleResult();
  if (!newRequest.getLastUpdatedOn().equals(request.getLastUpdatedOn())) {
    throw new StaleObjectStateException(request.getClass().toString(), request.getServiceRequestId());
  }
  request.setLastUpdatedOn(new Date(System.currentTimeMillis()));
  entityManager.persist(request);
  entityManager.flush();
  transaction.commit();
}
3) So I also tried to use another session to query the newRequest; if I do that, newRequest is different from request. But for some reason, the lock on the request object is then never released, even after the transaction commits. The code looks like this:
@Transactional
public void mergeServiceRequest(Request request) {
  System.out.println("ServiceRequestDao.java line 209");
  EntityTransaction transaction = entityManager.getTransaction();
  transaction.begin();
  entityManager.lock(request, LockModeType.WRITE); // used to lock the database row
  Request newRequest = findRequest(request.getServiceRequestId()); // get it from another session
  if (!newRequest.getLastUpdatedOn().equals(request.getLastUpdatedOn())) {
    throw new StaleObjectStateException(request.getClass().toString(), request.getServiceRequestId());
  }
  request.setLastUpdatedOn(new Date(System.currentTimeMillis()));
  entityManager.persist(request);
  entityManager.flush();
  transaction.commit();
  // the lock on the database record is not released after this, even after entityManager is closed
}
Could anyone help me on this?
Thanks.
Daniel
I'm researching a way to build an n-tiered sync solution. From the WebSharingAppDemo-CEProviderEndToEnd sample it seems almost feasible; however, for some reason the app will only sync if the client has a live SQL db connection. Can someone explain what I'm missing and how to sync without exposing SQL to the internet?
The problem I'm experiencing is that when I provide a relational sync provider that has an open SQL connection from the client, it works fine; but when I provide a relational sync provider that has a closed but configured connection string, as in the example, I get an error from WCF stating that the server did not receive the batch file. So what am I doing wrong?
SqlConnectionStringBuilder builder = new SqlConnectionStringBuilder();
builder.DataSource = hostName;
builder.IntegratedSecurity = true;
builder.InitialCatalog = "mydbname";
builder.ConnectTimeout = 1;
provider.Connection = new SqlConnection(builder.ToString());
// provider.Connection.Open(); // un-commenting this causes the code to work
//create a new scope description and add the appropriate tables to this scope
DbSyncScopeDescription scopeDesc = new DbSyncScopeDescription(SyncUtils.ScopeName);
//class to be used to provision the scope defined above
SqlSyncScopeProvisioning serverConfig = new SqlSyncScopeProvisioning();
....
The error I get occurs in this part of the WCF code:
public SyncSessionStatistics ApplyChanges(ConflictResolutionPolicy resolutionPolicy, ChangeBatch sourceChanges, object changeData)
{
Log("ProcessChangeBatch: {0}", this.peerProvider.Connection.ConnectionString);
DbSyncContext dataRetriever = changeData as DbSyncContext;
if (dataRetriever != null && dataRetriever.IsDataBatched)
{
string remotePeerId = dataRetriever.MadeWithKnowledge.ReplicaId.ToString();
//Data is batched. The client should have uploaded this file to us prior to calling ApplyChanges.
//So look for it.
//The Id would be the DbSyncContext.BatchFileName which is just the batch file name without the complete path
string localBatchFileName = null;
if (!this.batchIdToFileMapper.TryGetValue(dataRetriever.BatchFileName, out localBatchFileName))
{
//Service has not received this file. Throw exception
throw new FaultException<WebSyncFaultException>(new WebSyncFaultException("No batch file uploaded for id " + dataRetriever.BatchFileName, null));
}
dataRetriever.BatchFileName = localBatchFileName;
}
Any ideas?
For the "batch file not available" issue, remove the IsOneWay=true setting from IRelationalSyncContract.UploadBatchFile. When the batch file is big, ApplyChanges can be called even before the previous UploadBatchFile has fully completed.
// Replace
[OperationContract(IsOneWay = true)]
// with
[OperationContract]
void UploadBatchFile(string batchFileid, byte[] batchFile, string remotePeerId);
I suppose it's simply a stupid example: it exposes "some" technique but assumes you have to arrange it in the proper order yourself.
http://msdn.microsoft.com/en-us/library/cc807255.aspx
I've got a hibernate-based application which uses DBUnit for unit testing. We have an XML test database, which gets loaded up with dummy data in the setUp() of each test and deleted during the tearDown(). The problem is that I can no longer run the entire suite in an IDE (in this case, Intellij), because after about 300 tests, the heap memory gets all used up. The tests go from taking ~0.3 seconds to 30+ seconds to execute, until the JVM eventually gives up and dies.
When I run the test suite via ant's junit task, it's no problem, nor is running the test suite for an individual class. However, I like being able to run the whole suite locally before I check in large refactoring changes to the codebase, rather than breaking the build on the CI server.
I am running the test suite with -Xmx512m as my only argument to the JVM, which is the same amount I pass to ant when running the task on the CI server. My hibernate-test.cfg.xml looks like this:
<hibernate-configuration>
<session-factory>
<!-- Database connection settings -->
<property name="connection.driver_class">org.hsqldb.jdbcDriver</property>
<property name="connection.url">jdbc:hsqldb:mem:mydatabase</property>
<property name="connection.username">sa</property>
<property name="connection.password"/>
<!-- Other configuration properties -->
<property name="connection.pool_size">1</property>
<property name="jdbc.batch_size">20</property>
<property name="connection.autocommit">true</property>
<property name="dialect">org.hibernate.dialect.HSQLDialect</property>
<property name="current_session_context_class">thread</property>
<property name="cache.provider_class">org.hibernate.cache.HashtableCacheProvider</property>
<property name="bytecode.use_reflection_optimizer">false</property>
<property name="show_sql">true</property>
<property name="hibernate.hbm2ddl.auto">create-drop</property>
<!-- Mappings (omitted for brevity) -->
<mapping resource="hbm/blah.hbm.xml"/>
</session-factory>
</hibernate-configuration>
We have written a class which all of the test classes extend from; it looks something like this:
package com.mycompany.test;

// imports omitted for brevity

public abstract class DBTestCase extends TestCase {

  private final String XML_DATA_SET = "test/resources/mytestdata.xml";
  private Session _session;
  private Configuration _config;

  public DBTestCase(String name) {
    super(name);
  }

  @Override
  protected void setUp() throws Exception {
    super.setUp();
    _config = new Configuration().configure();
    SessionFactory sf = _config.buildSessionFactory();
    // This is a singleton which is used by the DAOs to acquire a session.
    // The session factory must be set manually from the test's setUp so that
    // any calls to the singleton return this session factory, otherwise an
    // NPE will result, since the session factory is normally built during
    // webapp initialization.
    HibernateUtil.setSessionFactory(sf);
    _session = sf.openSession();
    _session.beginTransaction();
    IDataSet dataSet = new FlatXmlDataSet(new File(XML_DATA_SET));
    DatabaseOperation.CLEAN_INSERT.execute(getConnection(), dataSet);
  }

  protected void tearDown() throws Exception {
    super.tearDown();
    _session.close();
  }

  protected IDatabaseConnection getConnection() throws Exception {
    ConnectionProvider connProvider = ConnectionProviderFactory
        .newConnectionProvider(_config.getProperties());
    Connection jdbcConnection = connProvider.getConnection();
    DatabaseConnection dbConnection = new DatabaseConnection(jdbcConnection);
    DatabaseConfig dbConfig = dbConnection.getConfig();
    dbConfig.setProperty(DatabaseConfig.PROPERTY_DATATYPE_FACTORY, new HsqldbDataTypeFactory());
    return dbConnection;
  }
}
It is clear that some memory leak is going on here, but I'm not sure where. How might I go about diagnosing this?
You are using an in-memory database here:
<property name="connection.driver_class">org.hsqldb.jdbcDriver</property>
<property name="connection.url">jdbc:hsqldb:mem:mydatabase</property>
That means everything in the database is kept in memory. Either use an on-disk database with cached tables, or make sure you drop everything after each test.
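A minimal sketch of the "drop everything after each test" option, assuming the jdbc:hsqldb:mem:mydatabase URL from the configuration above (HSQLDB's SHUTDOWN statement discards an in-memory database entirely):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public final class HsqldbTestCleanup {

  // Call this from the test base class's tearDown(), after closing the
  // Hibernate session. SHUTDOWN closes the database; for a mem: URL this
  // discards all data so nothing accumulates across tests.
  public static void dropInMemoryDatabase() throws Exception {
    Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:mydatabase", "sa", "");
    Statement st = conn.createStatement();
    try {
      st.execute("SHUTDOWN");
    } finally {
      st.close();
      conn.close();
    }
  }
}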
J-16 SDiZ's answer pointed me in the right direction, but I thought I would provide a bit more detail about how I was able to solve this. The root of the problem was indeed that the database kept being stored in memory, but the solution was to inherit from DBUnit's DBTestCase class rather than trying to roll my own by inheriting from the JUnit TestCase. My test case base class now looks something like this:
public class MyTestCase extends DBTestCase {

  private static Configuration _config = null;

  public MyTestCase(String name) {
    super(name);
    if (_config == null) {
      _config = new Configuration().configure();
      SessionFactory sf = _config.buildSessionFactory();
      HibernateUtil.setSessionFactory(sf);
    }
    System.setProperty(PropertiesBasedJdbcDatabaseTester.DBUNIT_DRIVER_CLASS, "org.hsqldb.jdbcDriver");
    System.setProperty(PropertiesBasedJdbcDatabaseTester.DBUNIT_CONNECTION_URL, "jdbc:hsqldb:mem:mydbname");
    System.setProperty(PropertiesBasedJdbcDatabaseTester.DBUNIT_USERNAME, "sa");
    System.setProperty(PropertiesBasedJdbcDatabaseTester.DBUNIT_PASSWORD, "");
  }

  @Override
  protected IDataSet getDataSet() throws Exception {
    return new FlatXmlDataSet(new FileReader(MY_XML_DATA_FILE_NAME), false, true, false);
  }

  @Override
  protected void setUpDatabaseConfig(DatabaseConfig config) {
    config.setProperty(DatabaseConfig.PROPERTY_DATATYPE_FACTORY, new HsqldbDataTypeFactory());
  }
}
This class works quite well, and my test suite runs have gone down from several minutes to a mere 30 seconds.