Cascading Text file to Parquet - hdfs

I am trying to convert a file into Parquet using Cascading. But I am getting the below error.
Error
Exception in thread "main" cascading.flow.planner.PlannerException: tap named: 'Copy', cannot be used as a sink: Hfs["ParquetTupleScheme[['A', 'B']->[ALL]]"]["/user/cloudera/htcountp"]
at cascading.flow.planner.FlowPlanner.verifyTaps(FlowPlanner.java:240)
at cascading.flow.planner.FlowPlanner.verifyAllTaps(FlowPlanner.java:174)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:242)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
at first.Copy.main(Copy.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Code
Scheme sourceScheme = new TextDelimited(new Fields("A","B"), ", ");
Scheme sinkScheme = new ParquetTupleScheme(new Fields("A", "B"));
// create the source tap
Tap inTap = new Hfs(sourceScheme, inPath );
// create the sink tap
Tap outTap = new Hfs( sinkScheme, outPath );
// specify a pipe to connect the taps
Pipe copyPipe = new Pipe("Copy");
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
.addSource( copyPipe, inTap )
.addTailSink( copyPipe, outTap );
// run the flow
flowConnector.connect( flowDef ).complete();

Ran into the same problem. Looking at the source code, you must pass a Parquet schema into ParquetTupleScheme's constructor so that it can serialize the data to HDFS. The class has a method isSink(), which checks to ensure it's there. Otherwise, it's not a sink and the code throws the error you identified.

A bit late, but I also ran into the same issue some time back. ParquetTupleScheme has another constructor:
new ParquetTupleScheme(Fields sourceFields, Fields sinkFields, String parquetSchema)
Change the declaration of the sink scheme as follows:
Scheme sinkScheme = new ParquetTupleScheme(new Fields("A", "B"), new Fields("A", "B"), "message FileName { required binary A; required binary B; }" );
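Writing the schema string by hand is easy to get wrong once there are more than a couple of columns. Below is a minimal sketch of a helper that generates it, assuming every column should be a required binary field (as in the answer above) and using the usual `required binary name;` field syntax. The class and method names are hypothetical; they are not part of Cascading or Parquet.

```java
import java.util.StringJoiner;

// Hypothetical helper (not part of Cascading or Parquet): builds a Parquet
// message-type string that declares every field as a required binary column.
final class ParquetSchemaText {
    private ParquetSchemaText() {}

    static String requiredBinary(String messageName, String... fieldNames) {
        // Prefix and suffix wrap the field declarations in "message <name> { ... }".
        StringJoiner fields = new StringJoiner(" ", "message " + messageName + " { ", " }");
        for (String name : fieldNames) {
            fields.add("required binary " + name + ";");
        }
        return fields.toString();
    }

    public static void main(String[] args) {
        // The resulting string would be passed as the third constructor argument.
        System.out.println(requiredBinary("FileName", "A", "B"));
    }
}
```

The same field names can then be reused for the Fields arguments, so the schema and the tuple fields cannot drift apart.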

How to use Parquet files as a source in Apache Calcite?

I want to read from a Parquet file as if it were a table using Apache Calcite. There are a bunch of adapters listed in the docs, but no explicit one for Parquet. On the other hand, there is an adapter for Spark, which can deal with Parquet perfectly. But for some reason I can't find any example of how to use this Spark adapter. Even after reading its code, I can't say I understand how I need to define Spark's schemas. There is no factory for it... I've tried the following code without really understanding how it should work, and it obviously doesn't work:
Class.forName("org.apache.calcite.jdbc.Driver");
Properties info = new Properties();
info.setProperty("lex", "JAVA");
info.setProperty("spark", "true");
Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);
SchemaPlus rootSchema = calciteConnection.getRootSchema();
SparkSession spark = SparkSession.builder()
.appName("test")
.master("local[1]")
.getOrCreate();
StopWatch w = StopWatch.createStarted();
Dataset<Row> ds = spark.read().parquet("/tmp/test.parquet");
ds.select("issue_desc", "valid_from_dttm").show(15);
ds.printSchema();
ds.createTempView("sparkTable");
System.out.println(w.getTime(TimeUnit.MILLISECONDS));
FrameworkConfig calciteConfig = Frameworks.newConfigBuilder()
.parserConfig(SqlParser.Config.DEFAULT)
.defaultSchema(rootSchema)
.programs()
.traitDefs(ConventionTraitDef.INSTANCE, RelDistributionTraitDef.INSTANCE)
.build();
RelBuilder builder = RelBuilder.create(calciteConfig);
RelRunner relRunner = calciteConnection.unwrap(RelRunner.class);
RelNode test1 = builder
.scan("sparkTable")
.build();
executeNode(relRunner, test1);
It simply fails with the exception:
Exception in thread "main" org.apache.calcite.runtime.CalciteException: Table 'sparkTable' not found
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:506)
at org.apache.calcite.runtime.Resources$ExInst.ex(Resources.java:600)
at org.apache.calcite.tools.RelBuilder.scan(RelBuilder.java:1238)
at org.apache.calcite.tools.RelBuilder.scan(RelBuilder.java:1265)
at ru.tinkoff.dwh.hercule.demo.SparkTest.main(SparkTest.java:64)
Could somebody please explain how to use it, share an example, or explain why I can't use the Spark adapter like this?

Testing camel-sql route with in-memory database not fetching results

I have written code using camel-sql, and it works fine. Now I have to write test cases for it. I am using the in-memory database H2. I have initialized the database and assigned the data source to the SQL component.
// Setup code
@Override
protected JndiRegistry createRegistry() throws Exception {
JndiRegistry jndi = super.createRegistry();
// this is the database we create with some initial data for our unit test
database = new EmbeddedDatabaseBuilder()
.setType(EmbeddedDatabaseType.H2).addScript("createTableAndInsert.sql").build();
jndi.bind("myDataSource", database);
return jndi;
}
// Testcase code
@SuppressWarnings("unchecked")
@Test
public void testRoute() throws Exception {
Exchange receivedExchange = template.send("direct:myRoute", ExchangePattern.InOut ,exchange -> {
exchange.getIn().setHeader("ID", new Integer(1));
});
camelContext.start();
MyClass updatedEntity = (MyClass)jdbcTemplate.queryForObject("select * from MY_TABLE where id=?", new Long[] { 1l } ,
new RouteTest.CustomerRowMapper() );
// Here I can get the updatedEntity from jdbcTemplate
assertNotNull(receivedExchange);
assertNotNull(updatedEntity);
}
// Main code
from("direct:myRoute")
.routeId("pollDbRoute")
.transacted()
.to("sql:select * from MY_TABLE msg where msg.id = :#"+ID+"?dataSource=#myDataSource&outputType=SelectOne&outputClass=com.entity.MyClass")
.log(LoggingLevel.INFO,"Polled message from DB");
The problem is, as soon as the test case starts, it is saying
No bean could be found in the registry for: myDataSource of type: javax.sql.DataSource
I looked at the camel-sql component's test cases and I am doing the same thing, but the code is not able to find the dataSource. Please help. Thanks in advance.
After spending a lot of time on this issue, I found that the H2 database was using JDBCUtils to fetch records, and it was throwing a ClassNotFoundException. It did not show up anywhere in the Camel exception hierarchy because the exception was being suppressed, so all I got was a generic exception message. Here is the exception:
ClassNotFoundException: com.vividsolutions.jts.geom.Geometry
After searching for the issue, I found out that it requires one more dependency. I added it, and that resolved the issue.
Issue URL: https://github.com/elastic/elasticsearch/issues/9891
Dependency: https://mvnrepository.com/artifact/com.vividsolutions/jts-core/1.14.0
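When a framework only reports a generic message, two quick checks can help narrow things down. This is a minimal sketch, assuming the original exception is still attached somewhere in the cause chain (which may not hold if it was swallowed entirely). The helper names are hypothetical and are not Camel or H2 API.

```java
// Hypothetical diagnostic helpers; the names are illustrative, not Camel or H2 API.
final class RootCauseProbe {
    private RootCauseProbe() {}

    // Follows getCause() links until the innermost exception is reached.
    static Throwable rootCause(Throwable error) {
        Throwable current = error;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    // Returns true if the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // 'false' avoids running static initializers; we only test for presence.
            Class.forName(className, false, RootCauseProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulated wrapping, similar in shape to what a test run might surface.
        Exception wrapped = new RuntimeException("generic failure",
                new IllegalStateException("query failed",
                        new ClassNotFoundException("com.vividsolutions.jts.geom.Geometry")));
        System.out.println(rootCause(wrapped));
        System.out.println(isOnClasspath("com.vividsolutions.jts.geom.Geometry"));
    }
}
```

Calling isOnClasspath from the test's setup would show up front whether the JTS dependency is missing, before any route is exercised.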

Unparsable MOF Query When Trying to Register Event

Update 2
I accepted an answer and asked a different question elsewhere, where I am still trying to get to the bottom of this.
I don't think that one-lining this query is the answer, as I am still not getting the required results (and multi-line queries are allowed in .mof files, as shown in the URLs in the comments on the answer).
Update
I rewrote the query as a one-liner as suggested, but still got the same error! As it was still talking about lines 11-19 I knew there must be another issue. After saving a new file with the change, I reran mofcomp and it appears to have loaded, but the event which I have subscribed to simply does not work.
I really feel that there is not enough documentation on this topic and it is hard to work out how I am meant to debug this - any help on this would be much appreciated, even if this means using a different more appropriate method.
I have the following .mof file, which I would like to use to register an event on my system :
#pragma namespace("\\\\.\\root\\subscription")
instance of __EventFilter as $EventFilter
{
Name = "Event Filter Instance Name";
Query = "Select * from __InstanceCreationEvent within 1 "
"where targetInstance isa \"Cim_DirectoryContainsFile\" "
"and targetInstance.GroupComponent = \"Win32_Directory.Name=\"c:\\\\test\"\"";
QueryLanguage = "WQL";
EventNamespace = "Root\\Cimv2";
};
instance of ActiveScriptEventConsumer as $Consumer
{
Name = "TestConsumer";
ScriptingEngine = "VBScript";
ScriptText =
"Set objFSO = CreateObject(\"Scripting.FileSystemObject\")\n"
"Set objFile = objFSO.OpenTextFile(\"c:\\test\\Log.txt\", 8, True)\n"
"objFile.WriteLine Time & \" \" & \" File Created\"\n"
"objFile.Close\n";
// Specify any other relevant properties.
};
instance of __FilterToConsumerBinding
{
Filter = $EventFilter;
Consumer = $Consumer;
};
But whenever I run the command mofcomp myfile.mof I get this error:
Parsing MOF file: myfile.mof
MOF file has been successfully parsed
Storing data in the repository...
An error occurred while processing item 1 defined on lines 11 - 19 in file myfile.mof:
Error Number: 0x80041058, Facility: WMI
Description: Unparsable query.
Compiler returned error 0x80041058
This error appears to be caused by incorrect syntax in the query, but I don't understand where I have gone wrong with this - is anyone able to advise?
There are no string concatenation or line continuation characters being used in building "Query". To keep it simple, you could put the entire query on one line.
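If you do collapse the query to one line, it is easy to lose track of the backslash and quote escaping. Below is a minimal sketch of a helper that produces a single-line MOF string literal from a raw query, assuming only backslashes and double quotes need escaping. The class is hypothetical and not part of any WMI tooling, and the sample query uses only the first two clauses from the question.

```java
// Hypothetical helper, not part of any WMI tooling: wraps a raw query in
// double quotes and escapes backslashes and double quotes for a MOF file.
final class MofStringLiteral {
    private MofStringLiteral() {}

    static String quote(String raw) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : raw.toCharArray()) {
            if (c == '\\' || c == '"') {
                out.append('\\'); // prefix the character with an escaping backslash
            }
            out.append(c);
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        String wql = "Select * from __InstanceCreationEvent within 1 "
                + "where targetInstance isa \"Cim_DirectoryContainsFile\"";
        // Prints one line that can be pasted after "Query = " in the .mof file.
        System.out.println("Query = " + quote(wql) + ";");
    }
}
```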

ArrayOfAnyType issues when calling the GetRangeA1 method of Excel Web Services in Silverlight 4.0

I created a simple Silverlight 4.0 application to read Excel file data from a SharePoint 2010 server. I am trying to use "Excel Web Services", but I get an error when calling the GetRangeA1 method:
An unhandled exception of type 'System.ServiceModel.Dispatcher.NetDispatcherFaultException' occurred in mscorlib.dll
Additional information: The formatter threw an exception while trying to deserialize the message: There was an error while trying to deserialize parameter http://schemas.microsoft.com/office/excel/server/webservices:GetRangeA1Response. The InnerException message was 'Error in line 1 position 361. Element 'http://schemas.microsoft.com/office/excel/server/webservices:anyType' contains data from a type that maps to the name 'http://schemas.microsoft.com/office/excel/server/webservices:ArrayOfAnyType'. The deserializer has no knowledge of any type that maps to this name. Consider using a DataContractResolver or add the type corresponding to 'ArrayOfAnyType' to the list of known types - for example, by using the KnownTypeAttribute attribute or by adding it to the list of known types passed to DataContractSerializer.'. Please see InnerException for more details.
The source code looks like this:
namespace SampleApplication
{
class Program
{
static void Main(string[] args)
{
ExcelServiceSoapClient xlservice = new ExcelServiceSoapClient();
xlservice.ClientCredentials.Windows.AllowedImpersonationLevel = System.Security.Principal.TokenImpersonationLevel.Impersonation;
Status[] outStatus;
string targetWorkbookPath = "http://phc/Shared%20Documents/sample.xlsx";
try
{
// Call open workbook, and point to the trusted location of the workbook to open.
string sessionId = xlservice.OpenWorkbook(targetWorkbookPath, "en-US", "en-US", out outStatus);
Console.WriteLine("sessionID : {0}", sessionId);
//1. works fine.
object res = xlservice.GetCellA1(sessionId, "CER by Feature", "B1", true, out outStatus);
//2. exception
xlservice.GetRangeA1(sessionId, "CER by Feature", "H19:H21", true, out outStatus);
// Close workbook. This also closes session.
xlservice.CloseWorkbook(sessionId);
}
catch (SoapException e)
{
Console.WriteLine("SOAP Exception Message: {0}", e.Message);
}
}
}
}
I am totally new to Silverlight and SharePoint development. I searched around without any luck and only found another post here. Could anyone help me?
This appears to be an outstanding issue, but I have found two workarounds so far:
1) Requiring a change in App.config.
http://social.technet.microsoft.com/Forums/en-US/sharepoint2010programming/thread/ab2a08d5-2e91-4dc1-bd80-6fc29b5f14eb
2) Indicating to rebuild service reference with svcutil instead of using Add Service Reference:
http://social.msdn.microsoft.com/Forums/en-GB/sharepointexcel/thread/2fd36e6b-5fa7-47a4-9d79-b11493d18107

Why does WebSharingAppDemo-CEProviderEndToEnd sample still need a client db connection after scope creation to perform sync

I'm researching a way to build an n-tier sync solution. From the WebSharingAppDemo-CEProviderEndToEnd sample it seems almost feasible; however, for some reason the app will only sync if the client has a live SQL db connection. Can someone explain what I'm missing and how to sync without exposing SQL to the internet?
The problem I'm experiencing is that when I provide a relational sync provider that has an open SQL connection from the client, it works fine, but when I provide a relational sync provider that has a closed but configured connection string, as in the example, I get an error from the WCF service stating that the server did not receive the batch file. So what am I doing wrong?
SqlConnectionStringBuilder builder = new SqlConnectionStringBuilder();
builder.DataSource = hostName;
builder.IntegratedSecurity = true;
builder.InitialCatalog = "mydbname";
builder.ConnectTimeout = 1;
provider.Connection = new SqlConnection(builder.ToString());
// provider.Connection.Open(); **** un-commenting this causes the code to work**
//create a new scope description and add the appropriate tables to this scope
DbSyncScopeDescription scopeDesc = new DbSyncScopeDescription(SyncUtils.ScopeName);
//class to be used to provision the scope defined above
SqlSyncScopeProvisioning serverConfig = new SqlSyncScopeProvisioning();
....
The error I get occurs in this part of the WCF code:
public SyncSessionStatistics ApplyChanges(ConflictResolutionPolicy resolutionPolicy, ChangeBatch sourceChanges, object changeData)
{
Log("ProcessChangeBatch: {0}", this.peerProvider.Connection.ConnectionString);
DbSyncContext dataRetriever = changeData as DbSyncContext;
if (dataRetriever != null && dataRetriever.IsDataBatched)
{
string remotePeerId = dataRetriever.MadeWithKnowledge.ReplicaId.ToString();
//Data is batched. The client should have uploaded this file to us prior to calling ApplyChanges.
//So look for it.
//The Id would be the DbSyncContext.BatchFileName which is just the batch file name without the complete path
string localBatchFileName = null;
if (!this.batchIdToFileMapper.TryGetValue(dataRetriever.BatchFileName, out localBatchFileName))
{
//Service has not received this file. Throw exception
throw new FaultException<WebSyncFaultException>(new WebSyncFaultException("No batch file uploaded for id " + dataRetriever.BatchFileName, null));
}
dataRetriever.BatchFileName = localBatchFileName;
}
Any ideas?
For the "batch file not available" issue, remove the IsOneWay = true setting from IRelationalSyncContract.UploadBatchFile. When the batch file is large, ApplyChanges can be called before the previous UploadBatchFile call has fully completed.
// Replace
[OperationContract(IsOneWay = true)]
// with
[OperationContract]
void UploadBatchFile(string batchFileid, byte[] batchFile, string remotePeer1
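The ordering problem behind this fix can be illustrated outside WCF. The Java sketch below is an analogy only, not Sync Framework code: a toy service registers a batch file when a (simulated) slow upload finishes, and the lookup fails if the caller does not wait for the upload. All class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Analogy only: a toy service, not WCF or Sync Framework code. All names are hypothetical.
final class BatchUploadOrdering {
    private final Map<String, String> batchIdToFile = new ConcurrentHashMap<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Starts an upload that registers the batch file only once the transfer finishes.
    Future<Object> uploadBatchFile(String batchId, CountDownLatch transferFinished) {
        return worker.submit(() -> {
            transferFinished.await(); // simulated slow transfer
            batchIdToFile.put(batchId, "/tmp/" + batchId);
            return null;
        });
    }

    // Mirrors the lookup in ApplyChanges: succeeds only if the file is already registered.
    boolean applyChanges(String batchId) {
        return batchIdToFile.containsKey(batchId);
    }

    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        BatchUploadOrdering service = new BatchUploadOrdering();
        CountDownLatch transferFinished = new CountDownLatch(1);
        Future<Object> upload = service.uploadBatchFile("batch-1", transferFinished);

        // One-way style: the caller does not wait, so the lookup runs too early.
        System.out.println("before upload completes: " + service.applyChanges("batch-1"));

        // Request/reply style: the caller blocks until the upload has completed.
        transferFinished.countDown();
        upload.get();
        System.out.println("after upload completes: " + service.applyChanges("batch-1"));
        service.shutdown();
    }
}
```

Dropping IsOneWay corresponds to the second half of main: the caller gets a reply only after the upload has been processed, so the subsequent lookup finds the file.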
I suppose it's simply a stupid example. It exposes "some" technique but assumes you have to arrange it in proper order by yourself.
http://msdn.microsoft.com/en-us/library/cc807255.aspx