Binary File read in Google Dataflow - google-cloud-platform

I need to read a binary file in Google Dataflow. I just need to read the file, treat every 64 bytes as one record, and apply some logic to each byte of every 64-byte record.
I tried the same thing in Spark; the code sample is below:
def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder()
    .appName("RecordSplit")
    .master("local[*]")
    .getOrCreate()

  // read the file as fixed-length 64-byte records
  val df = spark.sparkContext.binaryRecords("<binary-file-path>", 64)

  val table = df.map(rec => {
    val c1 = convertHexToString(rec(0))
    val c2 = convertBinaryToInt16(rec, 48)
    val c3 = rec(59)
    val c4 = convertHexToString(rec(50)) match {
      case str if str.startsWith("c") => 2020 + str.substring(1).toInt
      case str if str.startsWith("b") => 2010 + str.substring(1).toInt
      case str if str.startsWith("a") => 2000 + str.substring(1).toInt // prefix "a" assumed for the 2000s range
      case _ => 1920
    }
    // ...
  })
}

I would recommend the following:
If you are not limited to Python/Scala, OffsetBasedSource (FileBasedSource is a subclass) can address your needs, because it uses offsets to define the starting and ending positions.
TikaIO can process metadata; however, per the documentation it can also read binary data.
The dataflow-opinion-analysis example contains information on reading from an arbitrary byte position.
There are additional docs on creating a custom Read implementation. You may want to look at these Beam examples for guidance on how to implement a custom source, like this Python example.
A different approach would be building the 64-byte arrays outside the pipeline (in memory) and then creating a PCollection from them; just keep in mind that the documentation recommends this mainly for unit tests.
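To illustrate the last suggestion, here is a minimal sketch (Scala 2.13 against the Beam Java SDK; the file path and transform names are placeholders) that splits the file into 64-byte records in memory and seeds a PCollection with Create.of:
import java.nio.file.{Files, Paths}
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.coders.ByteArrayCoder
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import scala.jdk.CollectionConverters._

object InMemoryRecords {
  def main(args: Array[String]): Unit = {
    // Read the whole file up front and split it into fixed 64-byte records.
    val bytes = Files.readAllBytes(Paths.get("/path/to/binary-file")) // placeholder path
    val records: java.util.List[Array[Byte]] =
      bytes.grouped(64).filter(_.length == 64).toList.asJava

    val pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

    // Seed a PCollection[Array[Byte]] from the in-memory records.
    val records64 = pipeline.apply("InMemoryRecords",
      Create.of[Array[Byte]](records).withCoder(ByteArrayCoder.of()))

    // ParDo / MapElements transforms can now decode each 64-byte record here.
    pipeline.run().waitUntilFinish()
  }
}
Since the whole file is loaded on the machine that builds the pipeline, this only suits files that comfortably fit in memory.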

Related

Use Scala Iterator to break up large stream (from string) into chunks using a RegEx match, and then operate on those chunks?

I'm currently using a not-very-Scala-like approach to parse large Unix mailbox files. I'm still learning the language and would like to challenge myself to find a better way; however, I don't believe I have a solid grasp of what can be done with an Iterator and how to use it effectively.
I'm currently using org.apache.james.mime4j, and I use the org.apache.james.mime4j.mboxiterator.MboxIterator to get a java.util.Iterator from a file, like so:
// registers an implementation of a ContentHandler that
// allows me to construct an object representing an email
// using callbacks
val handler: ContentHandler = new MyHandler();
// creates a parser that parses a SINGLE email from a given InputStream
val parser: MimeStreamParser = new MimeStreamParser(configBuilder.build());
// register my handler
parser.setContentHandler(handler);
// Get a java.util.Iterator
val iterator = MboxIterator.fromFile(fileName).build();
// For each email, process it using above Handler
iterator.forEach(p => parser.parse(p.asInputStream(Charsets.UTF_8)))
From my understanding, the Scala Iterator is much more robust and probably a lot more capable of handling something like this, especially because I won't always be able to fit the full file in memory.
I need to construct my own version of the MboxIterator. I dug through the source for MboxIterator and was able to find a good RegEx pattern to determine the beginning of individual email messages with; however, I'm drawing a blank from here on.
I created the RegEx like so:
val MESSAGE_START = Pattern.compile(FromLinePatterns.DEFAULT, Pattern.MULTILINE);
What I want to do (based on what I know so far):
Build a FileInputStream from an MBOX file.
Use Iterator.continually(stream.read()) to read through the stream
Use .takeWhile() to continue to read until the end of the stream
Chunk the Stream using something like MESSAGE_START.matcher(someString).find(), or use it to find the indexes that separate the messages
Read the chunks created, or read the bits in between the indexes created
I feel like I should be able to use map(), find(), filter() and collect() to accomplish this, but I'm getting thrown off by the fact that they only give me Ints to work with.
How would I accomplish this?
EDIT:
After doing some more thinking on the subject, I thought of another way to describe what I think I need to do:
I need to keep reading from the stream until I get a string that matches my RegEx
Maybe group the previously read bytes?
Send it off to be processed somewhere
Remove it from the scope somehow so it doesn't get grouped the next time I run into a match
Continue to read the stream until I find the next match.
Profit???
EDIT 2:
I think I'm getting closer. Using a method like this gets me an iterator of iterators. However, there are two issues: 1. Is this a waste of memory? Does this mean everything gets read into memory? 2. I still need to figure out a way to split by the match, but still include it in the iterator returned.
def split[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[Iterator[T]] =
  new Iterator[Iterator[T]] {
    def hasNext = iter.hasNext
    def next = {
      val cur = iter.takeWhile(!breakOn(_))
      iter.dropWhile(breakOn)
      cur
    }
  }.withFilter(l => l.nonEmpty)
If I understand correctly, you want to lazily chunk a large file delimited by a regex recognizable pattern.
You could try to return an Iterator for each request but the correct iterator management would not be trivial.
I'd be inclined to hide all file and iterator management from the client.
class MBox(filePath: String) {
  private val file = io.Source.fromFile(filePath)
  private val itr = file.getLines().buffered
  private val header = "From .+ \\d{4}".r  //adjust to taste

  def next(): Option[String] =
    if (itr.hasNext) {
      val sb = new StringBuilder()
      sb.append(itr.next() + "\n")
      while (itr.hasNext && !header.matches(itr.head))
        sb.append(itr.next() + "\n")
      Some(sb.mkString)
    } else {
      file.close()
      None
    }
}
testing:
val mbox = new MBox("so.txt")
mbox.next()
//res0: Option[String] =
//Some(From MAILER-DAEMON Fri Jul 8 12:08:34 2011
//some text AAA
//some text BBB
//)
mbox.next()
//res1: Option[String] =
//Some(From MAILER-DAEMON Mon Jun 8 12:18:34 2012
//small text
//)
mbox.next()
//res2: Option[String] =
//Some(From MAILER-DAEMON Tue Jan 8 11:18:14 2013
//some text CCC
//some text DDD
//)
mbox.next() //res3: Option[String] = None
There is only one Iterator per open file and only the safe methods are invoked on it. The file text is realized (loaded) only on request and the client gets just what's requested, if available. Instead of all lines in one long String you could return each line as part of a collection, Seq[String], if that's more applicable.
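For instance, a hypothetical variant of next() (reusing the itr, header and file members above) could return the lines of one message as a Seq[String]:
def nextLines(): Option[Seq[String]] =
  if (itr.hasNext) {
    // collect the lines of one message instead of concatenating them
    val lines = scala.collection.mutable.ArrayBuffer(itr.next())
    while (itr.hasNext && !header.matches(itr.head))
      lines += itr.next()
    Some(lines.toSeq)
  } else {
    file.close()
    None
  }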
UPDATE: This can be modified for easy iteration.
class MBox(filePath: String) extends Iterator[String] {
  private val file = io.Source.fromFile(filePath)
  private val itr = file.getLines().buffered
  private val header = "From .+ \\d{4}".r  //adjust to taste

  def next(): String = {
    val sb = new StringBuilder()
    sb.append(itr.next() + "\n")
    while (itr.hasNext && !header.matches(itr.head))
      sb.append(itr.next() + "\n")
    sb.mkString
  }

  def hasNext: Boolean =
    if (itr.hasNext) true else { file.close(); false }
}
Now you can .foreach(), .map(), .flatMap(), etc. But you can also do dangerous things like .toList which will load the entire file.
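For example, assuming the MimeStreamParser set up in the question, the whole mailbox can be streamed one message at a time:
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

val mbox = new MBox("so.txt")
mbox.foreach { message =>
  // only one message is held in memory at a time
  parser.parse(new ByteArrayInputStream(message.getBytes(StandardCharsets.UTF_8)))
}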

Google Dataflow template job not scaling when writing records to Google datastore

I have a small Dataflow job triggered from a Cloud Function using a Dataflow template. The job basically reads from a table in BigQuery, converts the resulting TableRow to a key-value, and writes the key-value to Datastore.
This is what my code looks like:
PCollection<TableRow> bigqueryResult = p.apply("BigQueryRead",
    BigQueryIO.readTableRows().withTemplateCompatibility()
        .fromQuery(options.getQuery()).usingStandardSql()
        .withoutValidation());

bigqueryResult.apply("WriteFromBigqueryToDatastore", ParDo.of(new DoFn<TableRow, String>() {
    @ProcessElement
    public void processElement(ProcessContext pc) {
        TableRow row = pc.element();
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        KeyFactory keyFactoryCounts = datastore.newKeyFactory().setNamespace("MyNamespace")
            .setKind("MyKind");
        Key key = keyFactoryCounts.newKey("Key");
        Builder builder = Entity.newBuilder(key);
        builder.set("Key", BooleanValue.newBuilder("Value").setExcludeFromIndexes(true).build());
        Entity entity = builder.build();
        datastore.put(entity);
    }
}));
This pipeline runs fine when the number of records I try to process is anywhere in the range of 1 to 100. However, when I put more load on the pipeline, i.e. ~10000 records, the pipeline does not scale (even though autoscaling is set to THROUGHPUT_BASED and maximumWorkers is set as high as 50 with an n1-standard-1 machine type). The job keeps processing 3 or 4 elements per second with one or two workers. This is impacting the performance of my system.
Any advice on how to scale up the performance is very welcome.
Thanks in advance.
I found a solution by using DatastoreIO instead of the Datastore client.
The following is the snippet I used:
PCollection<TableRow> row = p.apply("BigQueryRead",
    BigQueryIO.readTableRows().withTemplateCompatibility()
        .fromQuery(options.getQueryForSegmentedUsers()).usingStandardSql()
        .withoutValidation());

PCollection<com.google.datastore.v1.Entity> userEntity = row.apply("ConvertTablerowToEntity",
    ParDo.of(new DoFn<TableRow, com.google.datastore.v1.Entity>() {
        @SuppressWarnings("deprecation")
        @ProcessElement
        public void processElement(ProcessContext pc) {
            final String namespace = "MyNamespace";
            final String kind = "MyKind";

            com.google.datastore.v1.Key.Builder keyBuilder = DatastoreHelper.makeKey(kind, "root");
            if (namespace != null) {
                keyBuilder.getPartitionIdBuilder().setNamespaceId(namespace);
            }
            final com.google.datastore.v1.Key ancestorKey = keyBuilder.build();

            TableRow row = pc.element();
            String entityProperty = "sample";
            String key = "key";

            com.google.datastore.v1.Entity.Builder entityBuilder = com.google.datastore.v1.Entity.newBuilder();
            com.google.datastore.v1.Key.Builder keyBuilder1 = DatastoreHelper.makeKey(ancestorKey, kind, key);
            if (namespace != null) {
                keyBuilder1.getPartitionIdBuilder().setNamespaceId(namespace);
            }
            entityBuilder.setKey(keyBuilder1.build());
            entityBuilder.getMutableProperties().put(entityProperty, DatastoreHelper.makeValue("sampleValue").build());
            pc.output(entityBuilder.build());
        }
    }));
userEntity.apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId(options.getProject()));
This solution was able to scale from 3 elements per second with 1 worker to ~1500 elements per second with 20 workers.
At least with Python's ndb client library it's possible to write up to 500 entities at a time in a single .put_multi() datastore call - a whole lot faster than calling .put() for one entity at a time (the calls are blocking on the underlying RPCs).
I'm not a Java user, but a similar technique appears to be available for it as well. From Using batch operations:
You can use the batch operations if you want to operate on multiple entities in a single Cloud Datastore call.
Here is an example of a batch call:
Entity employee1 = new Entity("Employee");
Entity employee2 = new Entity("Employee");
Entity employee3 = new Entity("Employee");
// ...
List<Entity> employees = Arrays.asList(employee1, employee2, employee3);
datastore.put(employees);

Stream records from DataBase using Akka Stream

I have a system using Akka which currently handles incoming streaming data over message queues. When a record arrives, it is processed, the MQ message is acked, and the record is passed on for further handling within the system.
Now I would like to add support for using DBs as input.
What would be a good way to implement the input source so it can handle a DB (it should stream in >100M records at the pace that the receiver can handle, so I presume reactive / akka-streams)?
Slick Library
Slick streaming is how this is usually done.
Extending the slick documentation a bit to include akka streams:
//SELECT Name from Coffees
val q = for (c <- coffees) yield c.name
val action = q.result
type Name = String
val databasePublisher : DatabasePublisher[Name] = db stream action
import akka.stream.scaladsl.Source
val akkaSourceFromSlick : Source[Name, _] = Source fromPublisher databasePublisher
Now akkaSourceFromSlick is like any other akka stream Source.
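For example, it can be run with an ordinary Sink (a sketch; it assumes Akka 2.6+, where an implicit ActorSystem provides the materializer):
import akka.actor.ActorSystem
import akka.stream.scaladsl.Sink
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("slick-stream")

// print each name as it is streamed out of the database
val done: Future[akka.Done] =
  akkaSourceFromSlick.runWith(Sink.foreach(name => println(name)))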
"Old School" ResultSet
It is also possible to use a plain ResultSet, without slick, as the "engine" for an akka stream. We will utilize the fact that a stream Source can be instantiated from an Iterator.
First create the ResultSet using standard jdbc techniques:
import java.sql._
import scala.util.Try

val resultSetGenerator: () => Try[ResultSet] = () => Try {
  val statement: Statement = ???
  statement executeQuery "SELECT Name from Coffees"
}
Of course all ResultSet instances have to move the cursor before the first row:
val adjustResultSetBeforeFirst : (ResultSet) => Try[ResultSet] =
(resultSet) => Try(resultSet.beforeFirst()) map (_ => resultSet)
Once we start iterating through rows we'll have to pull the value from the correct column:
val getNameFromResultSet : ResultSet => Name = _ getString "Name"
And now we can implement the Iterator interface to create an Iterator[Name] from a ResultSet:
val convertResultSetToNameIterator : ResultSet => Iterator[Name] =
(resultSet) => new Iterator[Try[Name]] {
override def hasNext : Boolean = resultSet.next
override def next() : Try[Name] = Try(getNameFromResultSet(resultSet))
} flatMap (_.toOption)
And finally, glue all the pieces together to create the function we'll need to pass to Source.fromIterator:
val resultSetGenToNameIterator : (() => Try[ResultSet]) => () => Iterator[Name] =
(_ : () => Try[ResultSet])
.andThen(_ flatMap adjustResultSetBeforeFirst)
.andThen(_ map convertResultSetToNameIterator)
.andThen(_ getOrElse Iterator.empty)
This Iterator can now feed a Source:
val akkaSourceFromResultSet : Source[Name, _] =
Source fromIterator resultSetGenToNameIterator(resultSetGenerator)
This implementation is reactive all the way down to the database. Since the ResultSet pre-fetches a limited number of rows at a time, data will only come off the hard drive through the database as the stream Sink signals demand.
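If the driver's default prefetch is not a good fit, it can usually be tuned with JDBC's setFetchSize on the Statement created earlier, before executing the query (this is only a hint and is driver-dependent):
// ask the driver to fetch roughly 1000 rows per round trip
statement.setFetchSize(1000)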
I find the Alpakka documentation to be excellent and a much easier way to work with reactive streams than the Java Publisher interface.
The Alpakka project is an open source initiative to implement stream-aware, reactive, integration pipelines for Java and Scala. It is built on top of Akka Streams, and has been designed from the ground up to understand streaming natively and provide a DSL for reactive and stream-oriented programming, with built-in support for backpressure
Document for Alpakka with Slick: https://doc.akka.io/docs/alpakka/current/slick.html
Alpakka Github: https://github.com/akka/alpakka
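A minimal sketch of the Alpakka Slick connector (the config key, table and column names below are assumptions; it also assumes Akka 2.6+ so the implicit ActorSystem provides the materializer):
import akka.actor.ActorSystem
import akka.stream.alpakka.slick.scaladsl.{Slick, SlickSession}
import slick.jdbc.GetResult

implicit val system: ActorSystem = ActorSystem("alpakka-slick")
implicit val session: SlickSession = SlickSession.forConfig("slick-h2") // config key assumed
import session.profile.api._

// map each row of the result set to a plain String
implicit val getName: GetResult[String] = GetResult(_.nextString())

// stream the rows with backpressure, like the DatabasePublisher example above
Slick
  .source(sql"SELECT name FROM coffees".as[String])
  .runForeach(name => println(name))
// remember to close the session and terminate the actor system when done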

Ionic 2 / cordova-plugin-file File.writeFile() refuses to create binary file correctly (png image)

In summary
File.writeFile() creates a PNG file of 0 bytes when trying to write a Blob made from base64 data.
In my application, I am trying to create a file that consists of base64 data stored in the DB. The rendered equivalent of the data is a small anti-aliased graph curve in black on a transparent background (never more than 300 x 320 pixels) that has previously been created and stored from a canvas element. I have independently verified that the stored base64 data is indeed correct by rendering it with one of the various base64 encoders/decoders available online.
Output from "Ionic Info"
--------------------------------
Your system information:
Cordova CLI: 6.3.1
Gulp version: CLI version 3.9.1
Gulp local:
Ionic Framework Version: 2.0.0-rc.2
Ionic CLI Version: 2.1.1
Ionic App Lib Version: 2.1.1
Ionic App Scripts Version: 0.0.39
OS:
Node Version: v6.7.0
--------------------------------
The development platform is Windows 10, and I've been testing directly on a Samsung Galaxy S7 and S4 so far.
I know that the base64 data has to be converted into binary data (as a Blob) first, as File does not yet support writing base64 directly into an image file. I found various techniques to do this, and the code which seems to suit my needs the most (and reflects how I would have done it in Java) is illustrated below:
Main code from constructor:
this.platform.ready().then(() => {
  this.graphDataService.getDataItem(this.job.id).then((data) => {
    console.log("getpic:");
    let imgWithMeta = data.split(",")
    // base64 data
    let imgData = imgWithMeta[1].trim();
    // content type
    let imgType = imgWithMeta[0].trim().split(";")[0].split(":")[1];
    console.log("imgData:", imgData);
    console.log("imgMeta:", imgType);
    console.log("aftergetpic:");
    // this.fs is correctly set to cordova.file.externalDataDirectory
    let folderpath = this.fs;
    let filename = "dotd_test.png";
    File.resolveLocalFilesystemUrl(this.fs).then((dirEntry) => {
      console.log("resolved dir with:", dirEntry);
      this.savebase64AsImageFile(dirEntry.nativeURL, filename, imgData, imgType);
    });
  });
});
Helper to convert base64 to Blob:
// convert base64 to Blob
b64toBlob(b64Data, contentType, sliceSize) {
  //console.log("data packet:", b64Data);
  //console.log("content type:", contentType);
  //console.log("slice size:", sliceSize);
  let byteCharacters = atob(b64Data);
  let byteArrays = [];
  for (let offset = 0; offset < byteCharacters.length; offset += sliceSize) {
    let slice = byteCharacters.slice(offset, offset + sliceSize);
    let byteNumbers = new Array(slice.length);
    for (let i = 0; i < slice.length; i++) {
      byteNumbers[i] = slice.charCodeAt(i);
    }
    let byteArray = new Uint8Array(byteNumbers);
    byteArrays.push(byteArray);
  }
  console.log("size of bytearray before blobbing:", byteArrays.length);
  console.log("blob content type:", contentType);
  let blob = new Blob(byteArrays, {type: contentType});
  // alternative way WITHOUT chunking the base64 data
  // let blob = new Blob([atob(b64Data)], {type: contentType});
  return blob;
}
save the image with File.writeFile()
// save the image with File.writeFile()
savebase64AsImageFile(folderpath, filename, content, contentType) {
  // Convert the base64 string in a Blob
  let data: Blob = this.b64toBlob(content, contentType, 512);
  console.log("file location attempt is:", folderpath + filename);
  File.writeFile(
    folderpath,
    filename,
    data,
    {replace: true}
  ).then(
    _ => console.log("write complete")
  ).catch(
    err => console.log("file create failed:", err)
  );
}
I have tried dozens of different decoding techniques, but the effect is the same. However, if I hardcode simple text data into the writeFile() section, like so:
File.writeFile(
folderpath,
"test.txt",
"the quick brown fox jumps over the lazy dog",
{replace: true}
)
A text file IS created correctly in the expected location with the text string above in it.
However, I've noticed that whether the file is the 0 bytes PNG, or the working text file above, in both cases the ".then()" consequence clause of the File Promise never fires.
Additionally, I swapped the above method and used the Ionic 2 native Base64-To-Gallery library to create the images, which worked without a problem. However, having the images in the user's picture gallery or camera roll is not an option for me as I do not wish to risk a user's own pictures while marshalling / packing / transmitting / deleting the data-rendered images. The images should be created and managed as part of the app.
User marcus-robinson seems to have experienced a similar issue outlined here, but it was across all file types, and not just binary types as seems to be the case here. Also, the issue seems to have been closed:
https://github.com/driftyco/ionic/issues/5638
Anybody experiencing something similar, or possibly spot some error I might have caused? I've tried dozens of alternatives but none seem to work.
I had similar behaviour saving media files, which worked perfectly on iOS. Nonetheless, I had the issue of 0-byte file creation on some Android devices in the release build (the dev build works perfectly). After a very long search, I arrived at the following solution.
I moved the polyfills.js script tag to the top of the index.html in the Ionic project, before the cordova.js tag. This re-ordering somehow resolves the issue.
So the order should look like:
<script src="build/polyfills.js"></script>
<script type="text/javascript" src="cordova.js"></script>
Works on ionic 3 and ionic 4.
The credits go to 1
I got that working with most of your code:
this.file.writeFile(this.file.cacheDirectory, "currentCached.jpeg", this.b64toBlob(src, "image/jpg", 512) ,{replace: true})
The only difference I had was:
let byteCharacters = atob(b64Data.replace(/^data:image\/(png|jpeg|jpg);base64,/, ''));
instead of your
let byteCharacters = atob(b64Data);
Note: I did not use other trimming etc. like those techniques you used in your constructor class.

Parametrising encoded web service requests

I am trying to parametrise web service requests in a web performance test. Using Fiddler2 I have recorded a sequence of over 60 web service requests for a transaction performed by my desktop application and saved them as a .webtest file. This web test runs without any errors and the responses that I have checked look correct.
When the web service requests are viewed in Visual Studio 2012 they appear in plain text, so I should be able to edit them to parametrise the values in the SOAP requests. For example, most of the requests contain the text <Database>db1a</Database> and I want to change them to get the database name from a context parameter. There are several other items to replace with parameters. For this one transaction there are over 60 web service requests and I have other transactions to record. The .webtest file contains XML and the requests look like:
<Request Method="POST" Version="1.1" Url="http://example.com/somewhere.asmx" ThinkTime="83" Timeout="60" ParseDependentRequests="True" FollowRedirects="True" RecordResult="True" Cache="False" ResponseTimeGoal="0" Encoding="utf-8">
<Headers>
<Header Name="Content-Type" Value="text/xml; charset=utf-8" />
<Header Name="SOAPAction" Value=""http://example.com/webservices/VariousActionNamesHere"" />
</Headers>
<StringHttpBody ContentType="text/xml; charset=utf-8">PAA/AHgAbQBsACAAdg
... lots more characters not shown
+AA==</StringHttpBody>
</Request>
The StringHttpBody field contains an encoded version of the SOAP request. Visual Studio shows it as plain text. What is the encoding of this field and how can I decode and encode it?
I have installed Release 3.0 of the “Web and Load Test Plugins for Visual Studio Team Test” from http://teamtestplugins.codeplex.com/ . They provide a slightly better interface for editing the SOAP requests one at a time. But they do not allow mass changes.
Converting the web test to a coded web test (ie into C#) shows the SOAP requests as simple text and they could be edited there but I would prefer to keep the flexibility of a .webtest file.
Update: I have posted a partial answer to the question. Whilst it works, it feels the wrong way to do the work because it feels too complicated. So I am looking for a better overall approach.
StringHttpBody is base64 encoded. The raw body of the request is converted to a UTF-16 byte array and then base64 encoded, like so:
Convert.ToBase64String(Encoding.Unicode.GetBytes(oSession.GetRequestBodyAsString()));
For a quick view, you can copy/paste this string into Fiddler's Tools > TextWizard screen then use the From Base64 option to decode.
Here is part of an answer to working with the StringHttpBody fields. It covers decoding and encoding the fields to allow easier understanding and modification.
Read the input XML and find the contents of the StringHttpBody fields. Replace each field's contents with the result of calling the following routine on the original contents. Write all the lines to a new intermediate file. The byte array contains the UTF-16 characters as pairs of low and high bytes. (All the characters I have seen so far have high byte zero.)
private string DecodeBody(string source) {
    byte[] outBytes = Convert.FromBase64String(source);
    StringBuilder sb = new StringBuilder();
    Assert((outBytes.Length % 2) == 0);
    for (int ix = 0; ix < outBytes.Length; ix += 2) {
        Assert(outBytes[ix + 1] == 0);   // high byte of each UTF-16LE pair is expected to be zero
        sb.Append((char)outBytes[ix]);   // the low byte holds the character
    }
    return sb.ToString();
}
We now have a file containing a simple text version of the .webtest file, which can easily be edited to parameterise fields of the requests. Use a routine similar to the one above, writing to another intermediate file. The routine has statements such as:
source = source.Replace("<Database>db1a</Database>", "<Database>{{DatabaseName}}</Database>");
The final intermediate file is then re-encoded to create a new .webtest file. Just as before, the contents of the StringHttpBody fields are found and replaced with the result of calling a routine. The encoding routine is:
private string EncodeBody(string source) {
    StringBuilder sb = new StringBuilder();
    byte[] outBytes = new byte[2 * source.Length];
    for (int ix = 0; ix < source.Length; ix++) {
        char ch = source[ix];
        outBytes[2 * ix] = (byte)(((int)ch) & 0xFF);
        outBytes[2 * ix + 1] = (byte)((((int)ch) / 256) & 0xFF);
    }
    sb.Append(Convert.ToBase64String(outBytes));
    return sb.ToString();
}
The flow of files is thus:
decode original.webtest > intermediate1
parameterise intermediate1 > intermediate2
encode intermediate2 > final.webtest
On the small number of .webtest files I have tried, the encode operation is the inverse of the decode operation: the original file from before decoding is identical to the file after encoding. Having the two intermediate files allows easy checking and searching of the contents of the unencoded file and the effect of the parameterise step.
For StringHttpBody with ContentType="application/json":
To decode the body:
var encodedString = childNode.InnerText;
var encodedStringBytes = Convert.FromBase64String(encodedString);
var decodedString = Encoding.Unicode.GetString(encodedStringBytes);
JObject jsonString = (JObject)JsonConvert.DeserializeObject(decodedString);
To encode the body:
childNode.InnerText =
Convert.ToBase64String(Encoding.Unicode.GetBytes(JsonConvert.SerializeObject(jsonString)));
I think this might help.