Setting Custom Coders & Handling Parameterized types - google-cloud-platform

I have two questions related to coder issues I am facing with my Dataflow pipeline.
How do I go about setting a coder for my custom data types? The class consists of just three items - two doubles and another parameterized property. I tried annotating the type with SerializableCoder but I still end up with the error "com.google.cloud.dataflow.sdk.coders.CannotProvideCoderException: Cannot provide coder based on value with class interface java.util.Set: No CoderFactory has been registered for the class." The Set actually contains the parameterized custom data-type - so I am assuming that the custom datatype is the problem. I could not find enough documentation/examples on the right way to do this. Please point me to the right place if its available.
Even without the custom datatype, whenever I try switching to a parameterized version of Transform functions, it results in coder errors. Specifically, inside a complex transform which is parameterized, a ParDo works with parameterized types but when I apply a Combine.PerKey on the resulting PCollection after the ParDo, it results in the CoderNotFoundException.
Any help regarding these two items would be helpful as I am kind of stuck on this for sometime now.

It looks like you have been bitten by two issues. Thanks for bringing them to our attention! Fortunately, there are easy workarounds for both while we improve things.
The first issue is that the default coder registry does not have an entry for mapping Set.class to SetCoder. We have filed GitHub issue #56 to track its resolution. In the meantime, you can use the following code to perform the needed registration:
pipeline.getCoderRegistry().registerCoder(Set.class, SetCoder.class);
The second issue is that parameterized types currently require advanced treatment in the coder registry, so the #DefaultCoder will not be honored. We have filed Github issue #57 to track this. The best way to ensure that SerializableCoder is used everywhere for CustomType is to register a CoderFactory for your type that will return a SerializableCoder. Supposing your type is something like this:
public class CustomType<T extends Serializable> implements Serializable {
T field;
}
Then the following code registers a CoderFactory that produces appropriate SerializableCoder instances:
pipeline.getCoderRegistry().registerCoder(CustomType.class, new CoderFactory() {
#Override
public Coder<?> create(List<? extends Coder<?>>) {
// No matter what the T is, return SerializableCoder
return SerializableCoder.of(CustomType.class);
}
#Override
public List<Object> getInstanceComponents(Object value) {
// Return the T inside your CustomType<T> to enable coder inference for Create
return Collections.singletonList(((CustomType<Object>) value).field);
}
});
Now, whenever you use CustomType in your pipeline, the coder registry will produce a SerializableCoder.
Note that SerializableCoder is not deterministic (the bytes of encoded objects are not necessarily equal for objects that are equals()) so values encoded using this coder cannot be used as keys in a GroupByKey operation.

Related

Generating Random instances of Avro's SpecificClass Records in avro via AvroTools - is there a better way than this?

I'm trying to see if this is already part of Avro Tools or a better way to automatically create random data in Avro or a SpecificRecord (not GenericRecord) on a java generated class.
Here's an example:
I have a record generated by avro - let's call it SomeAvroGeneratedRecord. I was able to create a static method that does something like this:
SomeAvroGeneratedRecord record = SomeHelper.generateRandomData(SomeAvroGeneratedRecord.class);
So I was able to get this working - but I feel like somehow there's a better way to code this. I'm reaching out to see if someone can improve my answer.
I've been working on creating dummy data for my avro classes. I went ahead and looked at AvroTools and found a holy grail: a wonderful class called org.apache.avro.util.RandomData. It works great!
However, it seems to only output GenericData. What I'm looking for is to take a generated java class and just create all the new instances that I'd like. And I'd like to do it on-the-fly without having to write a line of code for every type I create.
The closest I found to this was a post here on StackOverflow:
convert generic to specific record
I also found this random generator by confluent
However, the above code assumes you have the schema already.
This would allow me to take a random data generic record and convert it to a specific record - but this would still require me to have the schema. So I noticed that the schema is in the generated class as a field SCHEMA$ for all the generated classes. So I used some reflection to get that data:
public static <T extends SpecificRecordBase> T specificAvroRecordGenerator(Class<T> avroClassType) {
try {
Field field = avroClassType.getDeclaredField("SCHEMA$");
return specificAvroRecordGenerator((Schema)field.get(null));
} catch (IllegalAccessException | NoSuchFieldException e) {
throw new RuntimeException(e);
}
}
#SuppressWarnings("unchecked")
public static <T extends SpecificRecordBase> T specificAvroRecordGenerator(Schema schema) {
GenericRecord test =
(GenericRecord)new RandomData(schema, 1)
.iterator().next();
return (T) SpecificData.get().deepCopy(test.getSchema(), test);
}
Now, the code above DOES work. However, it feels sooo wonky. I'm surprised it works to be honest. I was able to test it and they're passing:
#Test
void testGenericAvroGenerator() {
assertInstanceOf(PipelineDocument.class, SampleAvroData.specificAvroRecordGenerator(PipelineDocument.class));
assertInstanceOf(ParsedArticle.class, SampleAvroData.specificAvroRecordGenerator(ParsedArticle.class));
}
#Test
void testGenericAvroGenerator() {
assertInstanceOf(PipelineDocument.class, SampleAvroData.specificAvroRecordGenerator(PipelineDocument.getClassSchema()));
assertInstanceOf(ParsedArticle.class, SampleAvroData.specificAvroRecordGenerator(ParsedArticle.getClassSchema()));
}
So I ask: is there a better way to do this? I'm using JDK 17/19 for now. So I'm willing to use any new classes available. I'm happy with the results, but just want to make sure I'm not coding some monstrous hack.
I tried the above but I'm using a lot of parts of reflection and the Avro generated classes that I'm not particularly used to. I feel like I can tighten it up a bit and make it a little more clear - or if there's another method that already does something similar to this.
Any advice would be appreciated.

Please explain the Repository-, Mapping-, and Business-Layer relations and responsibilities

I've been reading a lot about this stuff and I am currently in the middle of the development of a larger web-application and its corresponding back-end.
However, I've started with a design where I ask a Repository to fetch data from the database and map it into a DTO. Why DTO? Simply because until now basically everything was simple stuff and no more complexity was necessary. If it got a bit more complex then I started to map e.g. 1-to-n relations directly in the service layer. Something like:
// This is Service-Layer
public List<CarDTO> getCarsFromOwner(Long carOwnerId) {
// Entering Repository-Layer
List<CarDTO> cars = this.carRepository = this.carRepository.getCars(carOwnerId);
Map<Long, List<WheelDTO>> wheelMap = this.wheelRepository.getWheels(carId);
for(CarDTO car : cars) {
List<WheelDTO> wheels = wheelMap.get(car.getId());
car.setWheels(wheels);
}
return cars;
}
This works of course but it turns out that sometimes things are getting more complex than this and I'm starting to realize that the code might look quite ugly if I don't do anything about this.
Of course, I could load wheelMap in the CarRepository, do the wheel-mapping there and only return complete objects, but since SQL queries can sometimes look quite complex I don't want to fetch all cars and their wheels plus taking care of the mapping in getCars(Long ownerId).
I'm clearly missing a Business-Layer, right? But I'm simply not able to get my head around its best practice.
Let's assume I have Car and a Owner business-objects. Would my code look something like this:
// This is Service-Layer
public List<CarDTO> getCarsFromOwner(Long carOwnerId) {
// The new Business-Layer
CarOwner carOwner = new CarOwner(carOwnerId);
List<Car> cars = carOwner.getAllCars();
return cars;
}
which looks as simple as it can be, but what would happen on the inside? The question is aiming especially at CarOwner#getAllCars().
I imagine that this function would use Mappers and Repositories in order to load the data and that especially the relational mapping part is taken care of:
List<CarDTO> cars = this.carRepository = this.carRepository.getCars(carOwnerId);
Map<Long, List<WheelDTO>> wheelMap = this.wheelRepository.getWheels(carId);
for(CarDTO car : cars) {
List<WheelDTO> wheels = wheelMap.get(car.getId());
car.setWheels(wheels);
}
But how? Is the CarMapper providing functions getAllCarsWithWheels() and getAllCarsWithoutWheels()? This would also move the CarRepository and the WheelRepository into CarMapper but is this the right place for a repository?
I'd be happy if somebody could show me a good practical example for the code above.
Additional Information
I'm not using an ORM - instead I'm going with jOOQ. It's essentially just a type-safe way to write SQL (and it makes quite fun using it btw).
Here is an example how that looks like:
public List<CompanyDTO> getCompanies(Long adminId) {
LOGGER.debug("Loading companies for user ..");
Table<?> companyEmployee = this.ctx.select(COMPANY_EMPLOYEE.COMPANY_ID)
.from(COMPANY_EMPLOYEE)
.where(COMPANY_EMPLOYEE.ADMIN_ID.eq(adminId))
.asTable("companyEmployee");
List<CompanyDTO> fetchInto = this.ctx.select(COMPANY.ID, COMPANY.NAME)
.from(COMPANY)
.join(companyEmployee)
.on(companyEmployee.field(COMPANY_EMPLOYEE.COMPANY_ID).eq(COMPANY.ID))
.fetchInto(CompanyDTO.class);
return fetchInto;
}
Pattern Repository belongs to the group of patterns for data access objects and usually means an abstraction of storage for objects of the same type. Think of Java collection that you can use to store your objects - which methods does it have? How it operates?
By this defintion, Repository cannot work with DTOs - it's a storage of domain entities. If you have only DTOs, then you need more generic DAO or, probably, CQRS pattern. It is common to have separate interface and implementation of a Repository, as it's done, for example, in Spring Data (it generates implementation automatically, so you have only to specify the interface, probably, inheriting basic CRUD operations from common superinterface CrudRepository).
Example:
class Car {
private long ownerId;
private List<Wheel> wheels;
}
#Repository
interface CarRepository extends CrudRepository<Car,Long> {
List<Car> findByOwnerId(long id);
}
Things get complicated, when your domain model is a tree of objects and you store them in relational database. By definition of this problem you need an ORM. Every piece of code that loads relational content into object model is an ORM, so your repository will have an ORM as an implementation. Typically, JPA ORMs do the wiring of objects behind the scene, simpler solutions like custom mappers based on JOOQ or plain JDBC have to do it manually. There's no silver bullet that will solve all ORM problems efficiently and correctly: if you have chosen to write custom mapping, it's still better to keep the wiring inside the repository, so business layer (services) will operate with true object models.
In your example, CarRepository knows about Cars. Car knows about Wheels, so CarRepository already has transitive dependency on Wheels. In CarRepository#findByOwnerId() method you can either fetch the Wheels for a Car directly in the same query by adding a join, or delegate this task to WheelRepository and then only do the wiring. User of this method will receive fully-initialized object tree. Example:
class CarRepositoryImpl implements CarRepository {
public List<Car> findByOwnerId(long id) {
// pseudocode for some database query interface
String sql = select(CARS).join(WHEELS);
final Map<Long, Car> carIndex = new HashMap<>();
execute(sql, record -> {
long carId = record.get(CAR_ID);
Car car = carIndex.putIfAbsent(carId, Car::new);
... // map the car if necessary
Wheel wheel = ...; // map the wheel
car.addWheel(wheel);
});
return carIndex.values().stream().collect(toList());
}
}
What's the role of business layer (sometimes also called service layer)? Business layer performs business-specific operations on objects and, if these operations are required to be atomic, manages transactions. Basically, it knows, when to signal transaction start, transaction commit and rollback, but has no underlying knowledge about what these messages will actually trigger in transaction manager implementation. From business layer perspective, there are only operations on objects, boundaries and isolation of transactions and nothing else. It does not have to be aware of mappers or whatever sits behind Repository interface.
In my opinion, there is no correct answer.
This really depends upon the chosen design decision, which itself depends upon your stack and your/team's comfort.
Case 1:
I disagree with below highlighted sections in your statement:
"Of course, I could load wheelMap in the CarRepository, do the wheel-mapping there and only return complete objects, but since SQL queries can sometimes look quite complex I don't want to fetch all cars and their wheels plus taking care of the mapping in getCars(Long ownerId)"
sql join for the above will be simple. Additionally, it might be faster as databases are optimized for joins and fetching data.
Now, i called this approach as Case1 because this could be followed if you decide to pull data via your repository using sql joins. Instead of just using sql for simple CRUD and then manipulating objects in java. (below)
Case2: Repository is just used to fetch data for "each" domain object which corresponds to "one" table.
In this case what you are doing is already correct.
If you will never use WheelDTO separately, then there is no need to make a separate service for it. You can prepare everything in the Car service.
But, if you need WheelDTO separately, then make different services for each. In this case, there can be a helper layer on top of service layer to perform object creation. I do not suggest implementing an orm from scratch by making repositories and loading all joins for every repo (use hibernate or mybatis directly instead)
Again IMHO, whichever approach you take out of the above, the service or business layers just complement it. So, there is no hard and fast rule, try to be flexible according to your requirements. If you decide to use ORM, some of the above will again change.

Unable to return POCO object with navigation property from a web service?

I'm using POCO to auto generate my entities from DAL project to Entities project. I currently have no need in creating view classes manually.
However I have one problem - When I try to return a poco object that has navigation properties from a [WebMethod] I get the following error:
Cannot serialize member Entities.City.Customers of type System.Collections.Generic.ICollection1[[Entities.Customer, Entities, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]] because it is an interface.
I tried writing context.ContextOptions.LazyLoadingEnabled = false; and
context.ContextOptions.ProxyCreationEnabled = false; to no avail.
if I add [System.Xml.Serialization.XmlIgnore] before the properties, I get no error, but then I lose those properties?
The message is clear: serialization fails because your Entities.City.Customers member is declared as an interface (ICollection).
The interface does not say anything about the implementing type, it only defines the contract that the implementation should follow. As such, the serializer does not know how to represent the implementation in a serialized format.
You might think that it's not that hard to reflect the type and serialize based on the information you get from introspection, but the problem will be when you try to deserialize from this representation. The same representation could possibly correspond to all implementation types, in which case what should the serializer choose as the concrete type?
There are a few steps to work around this limitation, as you can find in this post: XML serialization of interface property. In your particular case, the simplest way would be to make the Entities.City.Customers member of a concrete type like List<Customer> instead of ICollection<Customer>.

unit testing a factory method

suppose I have several OrderProcessors, each of them handles an order a little differently.
The decision about which OrderProcessor to use is done according to the properties of the Order object, and is done by a factory method, like so:
public IOrderProcessor CreateOrderProcessor(IOrdersRepository repository, Order order, DiscountPercentages discountPercentages)
{
if (order.Amount > 5 && order.Unit.Price < 8)
{
return new DiscountOrderProcessor(repository, order, discountPercentages.FullDiscountPercentage);
}
if (order.Amount < 5)
{
// Offer a more modest discount
return new DiscountOrderProcessor(repository, order, discountPercentages.ModestDiscountPercentage);
}
return new OutrageousPriceOrderProcessor(repository, order);
}
Now, my problem is that I want to verify that the returned OrderProcessor has received the correct parameters (for example- the correct discount percentage).
However, those properties are not public on the OrderProcessor entities.
How would you suggest I handle this scenario?
The only solution I was able to come up with is making the discount percentage property of the OrderProcessors public, but it seems like an overkill to do that just for the purpose of unit testing...
One way around this is to change the fields you want to test to internal instead of private and then set the project's internals visible to the testing project. You can read about this here: http://msdn.microsoft.com/en-us/library/system.runtime.compilerservices.internalsvisibletoattribute.aspx
You would do something like this in your AssemblyInfo.cs file:
[assembly:InternalsVisibleTo("Orders.Tests")]
Although you could argue that your unit tests should not necessarily care about the private fields of you class. Maybe it's better to pass in the values to the factory method and write unit tests for the expected result when some method (assuming Calculate() or something similar) is called on the interface.
Or another approach would be to unit test the concrete types (DiscountOrderProcessor, etc.) and confirm their return values from the public methods/properties. Then write unit tests for the factory method that it correctly returns the correct type of interface implementation.
These are the approaches I usually take when writing similar code, however there are many different ways to tackle a problem like this. I would recommend figuring out where you would get the most value in unit tests and write according to that.
If discount percentage is not public, then it's not part of the IOrderProcessor contract and therefore doesn't need to be verified. Just have a set of unit tests for the DiscountOrderProcessor to verify it's properly computing your discounts based on the discount percent passed in via the constructor.
You have a couple of choices as I see it. you could create specializations of DiscountOrderProcessor :
public class FullDiscountOrderProcessor : DiscountOrderProcessor
{
public FullDiscountOrderProcessor(IOrdersRepository repository, Order order):base(repository,order,discountPercentages.FullDiscountPercentage)
{}
}
public class ModestDiscountOrderProcessor : DiscountOrderProcessor
{
public ModestDiscountOrderProcessor (IOrdersRepository repository, Order order):base(repository,order,discountPercentages.ModestDiscountPercentage)
{}
}
and check for the correct type returned.
you could pass in a factory for creating the DiscountOrderProcessor which just takes an amount, then you could check this was called with the correct params.
You could provide a virtual method to create the DiscountOrderProcessor and check that is called with the correct params.
I quite like the first option personally, but all of these approaches suffer from the same problem that in the end you can't check the actual value and so someone could change your discount amounts and you wouldn't know. Even wioth the first approach you'd end up not being able to test what the value applied to FullDiscountOrderProcessor was.
You need to have someway to check the actual values which leaves you with:
you could make the properties public (or internal - using InternalsVisibleTo) so you can interrogate them.
you could take the returned object and check that it correctly applies the discount to some object which you pass in to it.
Personally I'd go for making the properties internal, but it depends on how the objects interact and if passing a mock object in to the discount order processor and verifying that it is acted on correctly is simple then this might be a better solution.

Repository Pattern - POCOs or IQueryable?

I'm new to the Repository Pattern and after doing a lot of reading on the web I have a rough understanding of what is going on, but there seems to be a conflict of ideas.
One is what the IRepository should return.
I would like to deal in ONLY Pocos so I would have an IRepository implementation for every aggregate root, like so:
public class OrangeRepository: IOrangeRepository
{
public Orange GetOrange(IOrangeCriteria criteria);
}
where IOrangeCriteria takes a number of arguments specific to finding an Orange.
The other thing I have is a number of data back-ends - this is why I got into this pattern in the first place. I imagine I will have an implementation for each, e.g
OrangeRepositoryOracle, OrangeRepositorySQL, OrangeRepositoryMock etc
I would like to keep it open so that I could use EF or NHibernate - again if my IOrangeRepository deals in POCOs then I would encapsulate this within the Repository itself, by implementing a OrangeRepositoryNHibernate etc.
Am I on the right lines?
Thanks
EDIT: Thanks for the feedback, I don't have anyone else to bounce these ideas off at the moment so it is appreciated!
Yes, your version is the safest / most compatible one. You can still use it with about any resources, not only data access ones, but with web services, files, whatever.
Note that with the IQueryable version you still get to work based on your POCOs classes, but you are tied to the IQueryable. Also consider that you could be having code that uses the IQueryable and then turns out it you hit a case where one of the repository's ORM doesn't handle it well.
I use the same pattern as you do. I like it a lot. You can get your data from any resources.
But the advantage of using IQuerable is that you do not have to code your own criteria API like the OrangeCriteria.
When NHibernate gets full Linq support then I may switch to the IQueryable.
Then you get
public class OrangeRepository: IOrangeRepository {
public IQueryable<Orange> GetOranges();
}