The flume event was truncated - hdfs

I receive messages from a Kafka source and wrote an interceptor to extract two fields (dataSoure and businessType) from each Kafka message (JSON format) using gson.fromJson(), but I get the error below.
Does Flume truncate an event when it exceeds a size limit? If so, how can I raise the limit? My Kafka messages are always long, about 60K bytes.
Looking forward to a reply. Thanks in advance!
2015-12-09 11:48:05,665 (PollableSourceRunner-KafkaSource-apply) [ERROR - org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:153)] KafkaSource EXCEPTION, {}
com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Unterminated string at line 1 column 4096
at com.google.gson.Gson.fromJson(Gson.java:809)
at com.google.gson.Gson.fromJson(Gson.java:761)
at com.google.gson.Gson.fromJson(Gson.java:710)
at com.xxx.flume.interceptor.JsonLogTypeInterceptor.intercept(JsonLogTypeInterceptor.java:43)
at com.xxx.flume.interceptor.JsonLogTypeInterceptor.intercept(JsonLogTypeInterceptor.java:61)
at org.apache.flume.interceptor.InterceptorChain.intercept(InterceptorChain.java:62)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:146)
at org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:130)

I finally found the root cause by debugging the source code.
I was passing event.getBody() to Gson directly to convert it to a map, which is incorrect: event.getBody() returns a byte[], not a String, so it has to be decoded into a String first. The correct code is below:
String body = new String(event.getBody(), "UTF-8");
Map<String, Object> map = gson.fromJson(body, new TypeToken<Map<String, Object>>() {}.getType());

Related

Send metadata along with Akka stream

Here is my previous question: Send data from InputStream over Akka/Spring stream
I have managed to send a compressed and encrypted file over an Akka stream. Now I am looking for a way to transport metadata along with the data, mainly the filename and a hash (checksum).
My current idea is to use the Flow.prepend function and insert the metadata before the data, like this:
filename, that can vary in size but always ends with null byte
fixed size hash (checksum)
data
Then, on the receiving end, I would have to use Flow.takeWhile twice: once to read the filename and a second time to read the hash, and then just read the data. It doesn't look like an elegant solution, and if I want to add more metadata in the future it will get even worse.
I have noticed the method Flow.named; however, the documentation says only:
Add a ``name`` attribute to this Flow.
and I do not know how to use it (or whether it is possible to transport the filename with it).
The question is: is there a better way to transport metadata along with data over an Akka stream than the above?
EDIT: Attaching my drawing with idea.
I think prepending the metadata makes sense. A simple approach could be to prepend the metadata using the same framing you use to send the data.
The receiving end will need to know how many metadata blocks there are, and use this information to split them off. See the example below.
// client end
filenameSrc
  .concat(hashSrc)
  .concat(dataSrc)
  .via(Framing.delimiter(ByteString("\n"), Int.MaxValue, allowTruncation = true))
  .via(Tcp().outgoingConnection(???, ???))
  .runForeach { ??? }
// server end
val printMetadata =
  Flow.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
    import GraphDSL.Implicits._
    val metadataSink = Sink.foreach(println)
    val bcast = builder.add(Broadcast[ByteString](2))
    bcast.out(0).take(2) ~> metadataSink
    FlowShape(bcast.in, bcast.out(1).drop(2).outlet)
  })
val handler =
  Framing.delimiter(ByteString("\n"), Int.MaxValue)
    .via(printMetadata)
    .via(???)
This is only one of the many possible approaches to solve this. But whatever solution you choose, the receiver will need to have knowledge of how to extract the metadata from the raw stream of bytes it reads over TCP.
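The framing idea is independent of Akka, so here is a minimal language-neutral sketch in Python of the same scheme: metadata frames first, data frames after, all separated by one delimiter. The names build_stream, split_stream and METADATA_FRAMES are hypothetical, and the sketch assumes no frame contains the delimiter (binary data such as an encrypted file would have to be encoded, for example as base64, or framed by length instead).

```python
DELIMITER = b"\n"
METADATA_FRAMES = 2  # filename and hash; both ends must agree on this count


def build_stream(filename: str, digest_hex: str, chunks: list[bytes]) -> bytes:
    """Prepend the metadata frames, then append the data frames."""
    frames = [filename.encode("utf-8"), digest_hex.encode("ascii"), *chunks]
    for frame in frames:
        if DELIMITER in frame:
            # Assumption of this sketch: frames never contain the delimiter.
            raise ValueError("frame contains the delimiter; encode it first")
    return b"".join(frame + DELIMITER for frame in frames)


def split_stream(raw: bytes) -> tuple[list[bytes], list[bytes]]:
    """Return (metadata frames, data frames) from a complete delimited stream."""
    frames = raw.split(DELIMITER)[:-1]  # drop the empty piece after the last delimiter
    return frames[:METADATA_FRAMES], frames[METADATA_FRAMES:]
```

Adding another metadata field later means changing METADATA_FRAMES on both ends, which is the "receiver must know the layout" constraint described above.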

C++ - Decrypting without encryption size

I've looked for a while and have not found a solution to this problem. I am using BCryptDecrypt to decrypt my encrypted data, but it requires the size of the EncryptedData. How can I decrypt without knowing the size?
I know BCryptEncrypt gives you the length after it has successfully encrypted the data. The only way I can think of is to send the length along with the encrypted data and IV.
For example, say I encrypt data and send it over a socket, together with the IV, to my WinSock server, which then decrypts it. How would that server be able to decrypt the data without knowing the size, even though it knows the key and IV?
Thanks
If the size is required, I see two ways to get it:
Send it explicitly together with the encrypted data.
Buffer all data on server side until it is received completely. Keep track of how many bytes you received.
With the first, you could try something like this:
<number of bytes to follow><separator symbol><message data>
The second requires that you can reliably detect the end of the message. You could do this with a specific end-of-message sequence. Then, however, you need to escape that sequence wherever it appears within the message, similar to how characters are escaped in C/C++/Java/C#. If not choosing the first approach, which appears the simplest to me, this is what I would probably prefer over the variant below.
An alternative might be closing the connection after a message is complete. Then, however, you need to detect whether the connection was closed normally or was broken, because in the latter case you must not try to decode.
You might even combine both approaches:
<message start sequence>
<number of bytes to follow>
<separator symbol>
<encrypted data>
<message end sequence>
Both the message start sequence and the message end sequence would have to be escaped. If you then detect a message start sequence within the encrypted data, or a message end sequence before the announced number of bytes has been read, you know on the server side that something has gone badly wrong.
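As an illustration of the first approach (length, separator, data), here is a minimal sketch in Python rather than C++, just to keep it short. The names encode_frame and FrameDecoder, and the choice of ":" as the separator, are assumptions for this example. Because the length is read before the payload, the payload itself may contain the separator byte without any escaping.

```python
SEPARATOR = b":"


def encode_frame(ciphertext: bytes) -> bytes:
    """Build <number of bytes to follow><separator symbol><message data>."""
    return str(len(ciphertext)).encode("ascii") + SEPARATOR + ciphertext


class FrameDecoder:
    """Buffers received chunks and yields each message once it is complete."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list[bytes]:
        self._buffer.extend(chunk)
        messages = []
        while True:
            sep_at = self._buffer.find(SEPARATOR)
            if sep_at < 0:
                break  # the length prefix has not fully arrived yet
            # Raises ValueError if the prefix is not a decimal number.
            length = int(bytes(self._buffer[:sep_at]).decode("ascii"))
            start = sep_at + len(SEPARATOR)
            end = start + length
            if len(self._buffer) < end:
                break  # wait for the rest of the payload
            messages.append(bytes(self._buffer[start:end]))
            del self._buffer[:end]
        return messages
```

The server would call feed() with whatever each recv() returns and pass every returned message, whose size is now known, to the decryption routine.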

Sending an image in base64 via Telnet

I am currently working on a project for school and have run into an issue with a large amount of data being sent via Telnet. If I send a message smaller than 10KB, it is fine. However, if I send a message larger than 10KB, I receive the error "501 Syntax error - line too long" after it has been running for a few minutes.
Does anyone know of a better way to implement what I am trying to accomplish, preferably one that works with send()? The data being sent is an image in base64, about 5 pages long when pasted into Word.
Thank you, any help is greatly appreciated.
Here are the code portions that I am currently using, which work with small amounts of data.
char *MailContents = new char[20000000];
std::ifstream in("C:\\test.txt");
// Stream the contents of the .txt file into MailData.
std::string MailData((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
// Copy the data into MailContents, including the terminating '\0' that strcat and strlen rely on.
memcpy(MailContents, MailData.c_str(), MailData.length() + 1);
strcat(MailContents, "\r\n");
send(Connection, MailContents, strlen(MailContents), 0); // Takes the data in MailContents and echoes it to the Telnet data section to be sent.
send(Connection, ".\r\n", strlen(".\r\n"), 0); // This line terminates the data entry and sends it.
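The error text points at the length of a single line rather than the total size. SMTP limits a text line to 1000 octets including the CRLF (RFC 5321), and MIME wraps base64 at 76 characters per line (RFC 2045). Assuming test.txt holds the base64 as one unbroken line, a possible fix is to wrap it before sending. The sketch below shows the wrapping logic in Python for brevity; wrapped_base64 is a hypothetical helper, and the same loop translates directly to C++.

```python
import base64

MAX_LINE = 76  # base64 line length used by MIME (RFC 2045)


def wrapped_base64(data: bytes) -> bytes:
    """Base64-encode data and break it into CRLF-terminated lines."""
    encoded = base64.b64encode(data)
    lines = [encoded[i:i + MAX_LINE] for i in range(0, len(encoded), MAX_LINE)]
    return b"\r\n".join(lines) + b"\r\n"
```

Line breaks inside base64 are ignored by MIME decoders, so the receiving side can still reassemble the image.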

com.ctc.wstx.exc.WstxParsingException: Text size limit

I am sending a big attachment to a CXF webservice and I get the following exception:
Caused by: javax.xml.bind.UnmarshalException
- with linked exception:
[com.ctc.wstx.exc.WstxParsingException: Text size limit (134217728) exceeded
at [row,col {unknown-source}]: [1,134855131]]
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.handleStreamException(UnmarshallerImpl.java:426)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:362)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:339)
at org.apache.cxf.jaxb.JAXBEncoderDecoder.doUnmarshal(JAXBEncoderDecoder.java:769)
at org.apache.cxf.jaxb.JAXBEncoderDecoder.access$100(JAXBEncoderDecoder.java:94)
at org.apache.cxf.jaxb.JAXBEncoderDecoder$1.run(JAXBEncoderDecoder.java:797)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.cxf.jaxb.JAXBEncoderDecoder.unmarshall(JAXBEncoderDecoder.java:795)
... 25 more
The issue seems to come from the Woodstox library, which says:
Text size limit (134217728) exceeded
Does anyone know whether it is possible to increase this limit? If so, how?
If it's coming from Woodstox like that, then you aren't sending it as an MTOM attachment. My first suggestion would be to switch to MTOM so the attachment is handled outside the XML parsing. That is much more efficient, as we can keep it as an InputStream or similar and not hold it in memory.
If you want to keep it in the XML, you can set the property "org.apache.cxf.stax.maxTextLength" to a larger value. Keep in mind that data coming in from the StAX parser like this is held in memory as either a String or a byte[] and will consume memory accordingly.

Changing this protocol to work with TCP streaming

I made a simple protocol for my game:
b = bool
i = int
sINT: = string whose length is INT followed by a : then the string
m = int message id.
Example:
m133s11:Hello Worldi-57989b0b1b0
This would be:
Message ID 133
String 'Hello World' length 11
int -57989
bool false
bool true
bool false
I did not know, however, that TCP could potentially deliver only PART of a message. I'm not sure exactly how I could modify this so that I can do the following:
on receive data from client:
use client's chunk parser
process data
if has partial message then try to find matching END
if no partial messages then try to read a whole message
for each complete message in queue, dispatch it
I could do this by adding B at the beginning of a message and E at the end, then checking that the first char is B and the last is E.
The only problem is: what if I receive something silly in the middle that does not follow the protocol? Or what if I was supposed to receive something that is not a message and is just a string? If I were somehow meant to receive the string HelloB, I would parse this as Hello plus the beginning of a message, but I would never receive the rest of that message because it is not a message.
How could I modify my protocol to solve these potential issues? As much as I anticipate only ever receiving correctly formed messages, it would be a nightmare if one were poorly encoded and threw everything out of whack.
Thanks
I decided to add the length at the beginning and keep track of whether I'm working on a message or not:
so:
p32m133s11:Hello Worldi-57989b0b1b0
I then have 3 states: reading to find 'p', reading to find the length after 'p', or reading bytes until length bytes have been read.
What do you think?
It seems to work great.
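For reference, the three states could be sketched like this in Python. PacketParser and the State names are hypothetical, and the sketch assumes the length ends at the first non-digit character, which holds here because every body starts with 'm'. It also counts characters, which equals bytes only for ASCII input.

```python
from enum import Enum, auto


class State(Enum):
    SEEK_START = auto()   # skipping input until 'p'
    READ_LENGTH = auto()  # collecting decimal digits after 'p'
    READ_BODY = auto()    # collecting exactly `remaining` body characters


class PacketParser:
    def __init__(self) -> None:
        self.state = State.SEEK_START
        self.digits = ""
        self.remaining = 0
        self.body: list[str] = []

    def _resync(self, ch: str) -> None:
        """Start a new packet if ch is 'p', otherwise keep skipping."""
        self.digits = ""
        self.state = State.READ_LENGTH if ch == "p" else State.SEEK_START

    def feed(self, chunk: str) -> list[str]:
        """Consume one received chunk and return every body completed by it."""
        complete = []
        for ch in chunk:
            if self.state is State.SEEK_START:
                self._resync(ch)
                continue
            if self.state is State.READ_LENGTH:
                if ch in "0123456789":
                    self.digits += ch
                    continue
                if not self.digits or int(self.digits) == 0:
                    self._resync(ch)  # malformed or empty length: look for the next 'p'
                    continue
                self.remaining = int(self.digits)
                self.body = []
                self.state = State.READ_BODY
                # fall through: ch is already the first body character
            self.body.append(ch)
            self.remaining -= 1
            if self.remaining == 0:
                complete.append("".join(self.body))
                self.state = State.SEEK_START
        return complete
```

Because the state lives on the parser object, a message split across several TCP reads is completed by a later feed() call.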
What you are doing is pretty old-school, magnetic tape stuff. Nice.
The issue you might have is that if a part of the message is received, you cannot tell if you are partway through a token.
E.g. if you receive:
m12
Is this Message 12, or is it the first part of message 122?
If you receive:
i-12
Is this an integer -12 or is it the first part of an integer -124354?
So I think you need to change it so that the message numbers are fixed width (e.g. four digits), the string length is fixed width (e.g. six digits) and the integer is fixed width at 10 digits.
So your example would be:
m_133s____11:Hello Worldi____-57989b0b1b0
That way if you get the first part of a message you can store it and wait for the remainder to be received before you process it.
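A minimal sketch of a decoder for that fixed-width layout, in Python. The name parse_tokens is hypothetical, and the sketch assumes '_' is the padding character as written in the example above. It stops at the first incomplete token and hands back the unparsed tail so it can be retried once more data arrives.

```python
PAD = "_"
WIDTHS = {"m": 4, "i": 10, "b": 1}  # fixed field width per tag
STRING_LENGTH_WIDTH = 6


def parse_tokens(buffer: str) -> tuple[list[tuple[str, object]], str]:
    """Return (complete tokens, unparsed remainder)."""
    tokens = []
    pos = 0
    while pos < len(buffer):
        kind = buffer[pos]
        if kind == "s":
            header_end = pos + 1 + STRING_LENGTH_WIDTH + 1  # tag, length, ':'
            if len(buffer) < header_end:
                break  # length field incomplete
            if buffer[header_end - 1] != ":":
                raise ValueError(f"expected ':' at offset {header_end - 1}")
            length = int(buffer[pos + 1:header_end - 1].lstrip(PAD))
            end = header_end + length
            if len(buffer) < end:
                break  # string body incomplete
            value = buffer[header_end:end]
        elif kind in WIDTHS:
            end = pos + 1 + WIDTHS[kind]
            if len(buffer) < end:
                break  # numeric field incomplete
            field = buffer[pos + 1:end].lstrip(PAD)
            value = (field == "1") if kind == "b" else int(field)
        else:
            raise ValueError(f"unknown tag {kind!r} at offset {pos}")
        tokens.append((kind, value))
        pos = end
    return tokens, buffer[pos:]
```

The caller keeps the returned remainder, appends the next received chunk to it, and calls parse_tokens again.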
You might also consider using control characters to separate message parts. There are ASCII control codes often used for this purpose: RS, FS, GS and US. So a message could be
[RS]FieldName[US]FieldValue[RS]fieldName[US]FieldValue[GS].
You know when you have a complete message because the [GS] marks the end. You can then divide it up into fields using the [RS] as a separator, and split each into name/value using the [US].
See http://en.wikipedia.org/wiki/C0_and_C1_control_codes for a brief bit of information.
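A short sketch of that layout in Python. The function names are hypothetical, and the sketch assumes field names and values never contain the control characters themselves.

```python
RS = "\x1e"  # record separator: starts each field
GS = "\x1d"  # group separator: ends each message
US = "\x1f"  # unit separator: splits a field name from its value


def encode_message(fields: dict[str, str]) -> str:
    """Build [RS]name[US]value ... [GS] from a dict of fields."""
    return "".join(f"{RS}{name}{US}{value}" for name, value in fields.items()) + GS


def extract_messages(buffer: str) -> tuple[list[dict[str, str]], str]:
    """Return (complete messages, leftover text after the last GS)."""
    *complete, rest = buffer.split(GS)
    messages = []
    for raw in complete:
        fields = {}
        for part in raw.split(RS)[1:]:  # the piece before the first RS is empty
            name, value = part.split(US, 1)
            fields[name] = value
        messages.append(fields)
    return messages, rest
```

Anything after the last [GS] is an unfinished message and stays in the buffer until more data arrives.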