How to write real-time data to HDFS with Avro/Parquet?

I have the following working in a unit test to write a single object in Avro/Parquet to a file in my Cloudera/HDFS cluster.
That said, given that Parquet is a columnar format, it seems it can only write out an entire file in batch mode (updates are not supported).
So, what are the best practices for writing files for data ingested in real time (via ActiveMQ/Camel, small messages at roughly 1,000 msg/sec)?
I suppose I could aggregate my messages (buffering in memory or other temporary storage) and write them out in batch mode using a dynamic filename, but I feel like I'm missing something with hand-rolling the partitioning and file naming...
Configuration conf = new Configuration(false);
conf.set("fs.defaultFS", "hdfs://cloudera-test:8020/cm/user/hive/warehouse");
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false);
AvroReadSupport.setAvroDataSupplier(conf, ReflectDataSupplier.class);

Path path = new Path("/cm/user/hive/warehouse/test1.data");
MyObject object = new MyObject("test");
Schema schema = ReflectData.get().getSchema(object.getClass());

ParquetWriter<MyObject> parquetWriter = AvroParquetWriter.<MyObject>builder(path)
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
        .withDataModel(ReflectData.get())
        .withDictionaryEncoding(false)
        .withConf(conf)
        .withWriteMode(ParquetFileWriter.Mode.OVERWRITE) // required because the filename doesn't change for this test
        .build();

parquetWriter.write(object);
parquetWriter.close();

Based on my (limited) research, I'm assuming that files can't be appended to (by design), so I simply must batch real-time data (in memory or otherwise) before writing out Parquet files.
How to append data to an existing parquet file
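For what it's worth, here is a minimal sketch of the batching approach described above: buffer incoming messages and roll a new Parquet file per batch under a unique name. The MyObject class, the batch size, the path pattern, and the reuse of the conf object from the test are assumptions for illustration, not a tested implementation.

// Sketch only: buffer messages from the ingest route and flush a new Parquet file per batch.
// MyObject, BATCH_SIZE, the path pattern, and conf are illustrative assumptions.
private final List<MyObject> buffer = new ArrayList<>();
private static final int BATCH_SIZE = 10_000;

public synchronized void onMessage(MyObject msg) throws IOException {
    buffer.add(msg);
    if (buffer.size() >= BATCH_SIZE) {
        flush();
    }
}

private void flush() throws IOException {
    // A unique filename per batch sidesteps the lack of append support.
    Path path = new Path("/cm/user/hive/warehouse/ingest-" + System.currentTimeMillis() + ".parquet");
    Schema schema = ReflectData.get().getSchema(MyObject.class);
    try (ParquetWriter<MyObject> writer = AvroParquetWriter.<MyObject>builder(path)
            .withSchema(schema)
            .withDataModel(ReflectData.get())
            .withConf(conf)
            .build()) {
        for (MyObject m : buffer) {
            writer.write(m);
        }
    }
    buffer.clear();
}

A time-based flush (e.g. every N seconds) in addition to the size threshold would keep latency bounded when traffic is slow; the tradeoff is more, smaller files.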

Related

How to read a huge CSV file from Google Cloud Storage line by line using Java?

I'm new to Google Cloud Platform. I'm trying to read a CSV file of around 1 GB, present in Google Cloud Storage (a non-public bucket accessed via a service account key), line by line.
I couldn't find any option to read the file in Google Cloud Storage (GCS) line by line; I only see read-by-chunk-size/byte-size options. Since I'm reading a CSV, I don't want to read by chunk size because it may split a record mid-read.
Solutions tried so far:
I tried copying the contents of the CSV file in GCS to a temporary local file and reading the temp file with the code below. The code works as expected, but I don't want to copy a huge file to my local instance; instead, I want to read line by line from GCS.
StorageOptions options = StorageOptions.newBuilder()
        .setProjectId(GCP_PROJECT_ID)
        .setCredentials(gcsConfig.getCredentials())
        .build();
Storage storage = options.getService();
Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
ReadChannel readChannel = blob.reader();
FileOutputStream fileOutputStream = new FileOutputStream(TEMP_FILE_NAME);
fileOutputStream.getChannel().transferFrom(readChannel, 0, Long.MAX_VALUE);
fileOutputStream.close();
Please suggest an approach.
Since I'm doing batch processing, I'm using the code below in my ItemReader's init() method, which is annotated with @PostConstruct. In my ItemReader's read(), I build a List whose size is the same as the chunk size. This way I can read lines based on my chunkSize instead of reading all the lines at once.
StorageOptions options = StorageOptions.newBuilder()
        .setProjectId(GCP_PROJECT_ID)
        .setCredentials(gcsConfig.getCredentials())
        .build();
Storage storage = options.getService();
Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
ReadChannel readChannel = blob.reader();
BufferedReader br = new BufferedReader(Channels.newReader(readChannel, "UTF-8"));
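To make the chunked reading concrete, here is a rough sketch of what the read() side described above might look like; the field names br and chunkSize mirror the init() snippet and are assumptions, not the poster's actual code.

// Sketch only: pull up to chunkSize lines per read() call from the shared BufferedReader.
public List<String> read() throws IOException {
    List<String> lines = new ArrayList<>();
    String line;
    while (lines.size() < chunkSize && (line = br.readLine()) != null) {
        lines.add(line);
    }
    return lines.isEmpty() ? null : lines; // null signals the end of the input
}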
One of the easiest ways might be to use the google-cloud-nio package, part of the google-cloud-java library that you're already using: https://github.com/googleapis/google-cloud-java/tree/v0.30.0/google-cloud-contrib/google-cloud-nio
It incorporates Google Cloud Storage into Java's NIO, and so once it's up and running, you can refer to GCS resources just like you'd do for a file or URI. For example:
Path path = Paths.get(URI.create("gs://bucket/lolcat.csv"));
try (Stream<String> lines = Files.lines(path)) {
    lines.forEach(s -> System.out.println(s));
} catch (IOException ex) {
    // do something or re-throw...
}
Brandon Yarbrough is right, and to add to his answer:
If you use gcloud to log in with your credentials, then Brandon's code will work: google-cloud-nio will use your login to access the files (and that will work even if they are not public).
If you prefer to do it all in software, you can use this code to read credentials from a local file and then access your file from Google Cloud:
String myCredentials = "/path/to/my/key.json";
CloudStorageFileSystem fs = CloudStorageFileSystem.forBucket(
        "bucket",
        CloudStorageConfiguration.DEFAULT,
        StorageOptions.newBuilder()
                .setCredentials(ServiceAccountCredentials.fromStream(
                        new FileInputStream(myCredentials)))
                .build());
Path path = fs.getPath("/lolcat.csv");
List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
Edit: you don't want to read all the lines at once, so don't use readAllLines; but once you have the Path you can use any of the other techniques discussed above to read just the part of the file you need: you can read one line at a time or get a Channel object.
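For instance, a line-at-a-time loop over that same Path might look like the following sketch (assuming the google-cloud-nio provider is on the classpath as described above):

// Sketch only: stream the object line by line through the NIO Path.
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        // process one CSV record at a time
    }
}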

Is it possible to write to s3 via a stream using s3 java sdk

Normally when a file has to be uploaded to S3, it first has to be written to disk before using something like the TransferManager API to upload it to the cloud. This can cause data loss if the upload does not finish in time (the application goes down and restarts on a different server, etc.). So I was wondering if it's possible to write to a stream directly across the network, with the required cloud location as the sink.
You don't say what language you're using, but I'll assume Java based on your capitalization. In which case the answer is yes: TransferManager has an upload() method that takes a PutObjectRequest, and you can construct that object around a stream.
However, there are two important caveats. The first is in the documentation for PutObjectRequest:
When uploading directly from an input stream, content length must be specified before data can be uploaded to Amazon S3
So you have to know how much data you're uploading before you start. If you're receiving an upload from the web and have a Content-Length header, then you can get the size from it. If you're just reading a stream of data that's arbitrarily long, then you have to write it to a file first (or the SDK will).
The second caveat is that this really doesn't prevent data loss: your program can still crash in the middle of reading data. One thing that it will prevent is returning a success code to the user before storing the data in S3, but you could do that anyway with a file.
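As a rough illustration of that first caveat (not part of the original answer), a stream upload with a known length might look like this; the bucket name, key, inputStream, knownContentLength, and transferManager variables are assumptions:

// Sketch only: upload from a stream when the content length is known up front.
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(knownContentLength); // required for direct stream uploads
PutObjectRequest request = new PutObjectRequest("my-bucket", "my-key", inputStream, metadata);
transferManager.upload(request).waitForCompletion();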
Surprisingly this is not possible (at the time of writing this post) with the standard Java SDK. Anyhow, thanks to this third-party library you can at least avoid buffering huge amounts of data in memory or on disk, since it internally buffers parts of ~5 MB and uploads them automatically for you as a multipart upload.
There is also a GitHub issue open in the SDK repository that one can follow to get updates.
It is possible:
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .build();
s3Client.putObject("bucket", "key", yourInputStream, objectMetadata);
AmazonS3.putObject
public void saveS3Object(String key, InputStream inputStream) throws Exception {
    List<PartETag> partETags = new ArrayList<>();
    InitiateMultipartUploadRequest initRequest =
            new InitiateMultipartUploadRequest(bucketName, key);
    InitiateMultipartUploadResult initResponse = s3.initiateMultipartUpload(initRequest);

    int partSize = 5242880; // 5 MB: the minimum S3 allows for every part except the last
    try {
        byte[] b = new byte[partSize];
        int i = 1;
        boolean endOfStream = false;
        while (!endOfStream) {
            // Fill the buffer completely if possible: InputStream.read() may return fewer
            // bytes than requested, and only the last part may be smaller than 5 MB.
            int len = 0;
            while (len < partSize) {
                int n = inputStream.read(b, len, partSize - len);
                if (n < 0) {
                    endOfStream = true;
                    break;
                }
                len += n;
            }
            if (len == 0) {
                break; // nothing left to upload
            }
            ByteArrayInputStream partInputStream = new ByteArrayInputStream(b, 0, len);
            UploadPartRequest uploadRequest = new UploadPartRequest()
                    .withBucketName(bucketName).withKey(key)
                    .withUploadId(initResponse.getUploadId()).withPartNumber(i)
                    .withInputStream(partInputStream)
                    .withPartSize(len);
            partETags.add(s3.uploadPart(uploadRequest).getPartETag());
            i++;
        }
        CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
                bucketName,
                key,
                initResponse.getUploadId(),
                partETags);
        s3.completeMultipartUpload(compRequest);
    } catch (Exception e) {
        s3.abortMultipartUpload(new AbortMultipartUploadRequest(
                bucketName, key, initResponse.getUploadId()));
        throw e; // rethrow so the caller knows the upload failed
    }
}
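A hypothetical call site, purely for illustration (the uploader instance and the source of the stream are assumptions, not part of the answer above):

// Sketch only: stream an incoming payload straight to S3 without touching local disk.
try (InputStream body = openIncomingStream()) { // assumed data source, e.g. an HTTP request body
    uploader.saveS3Object("uploads/report.csv", body);
}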

Pass Binary string/file content from c++ to node js

I'm trying to pass the content of a binary file from C++ to Node using a native add-on built with node-gyp. I have a process that creates a binary file in the .fit format, and I need to pass the content of the file to JS to process it. So my first approach was to extract the content of the file into a string and try to pass it to Node like this.
char c;
std::string content = "";
while (file.get(c)) {
    content += c;
}
I'm using the following code to pass it to Node
v8::Local<v8::ArrayBuffer> ab = v8::ArrayBuffer::New(args.GetIsolate(), (void*)content.data(), content.size());
args.GetReturnValue().Set(ab);
In Node I get an ArrayBuffer, but when I print its content to a file, it is different from what the C++ cout shows.
How can I pass the binary data successfully?
Thanks.
Probably the best approach is to write your data to a binary disk file. Write to disk in C++; read from disk in NodeJS.
Very importantly, make sure you specify BINARY MODE.
For example:
myFile.open ("data2.bin", ios::out | ios::binary);
Do not use "strings" (at least not unless you want to uuencode). Use buffers. Here is a good example:
How to read binary files byte by byte in Node.js
var fs = require('fs');
fs.open('file.txt', 'r', function(status, fd) {
    if (status) {
        console.log(status.message);
        return;
    }
    var buffer = new Buffer(100);
    fs.read(fd, buffer, 0, 100, 0, function(err, num) {
        ...
    });
});
You might also find these links helpful:
https://nodejs.org/api/buffer.html <= has good examples for specific Node APIs
http://blog.paracode.com/2013/04/24/parsing-binary-data-with-node-dot-js/ <= a good discussion of some of the issues you might face, including "endianness" and "interpreting numbers"
ADDENDUM:
The OP clarified that he's considering using C++ as a NodeJS add-on (not a standalone C++ program).
Consequently, using buffers is definitely an option. Here is a good tutorial:
https://community.risingstack.com/using-buffers-node-js-c-plus-plus/
If you choose to go this route, I would DEFINITELY download the example code and play with it first, before implementing buffers in your own application.
It depends, but you could, for example, use Redis:
Values can be strings (including binary data) of every kind, for
instance you can store a jpeg image inside a value. A value can't be
bigger than 512 MB.
If the file is bigger than 512 MB, you can store it in chunks.
But I wouldn't suggest it, since Redis is an in-memory data store.
It's easy to implement in both C++ and Node.js.

How to perform unit test on the append function of Azure Data Lake written in .Net Framework?

I have created an Azure WebJob that contains methods for creating a file on Data Lake Store and appending data to that file. I am done with all of its development, publishing the WebJob, etc. Now I am going to write unit tests to check whether the data I am sending is successfully appended to the file. All I need to know is how to perform such a unit test. Any ideas?
What I currently plan to do is clean all the data from my Data Lake file and then send test data to it; then, based on one column of the data I sent, I will check whether it got appended or not. Is there any way to get a quick status of whether my test data was written or not?
Note: What I actually want to know is how to delete a particular row of a CSV file on Data Lake, but I don't want to use U-SQL to search for the required row. (I am not sending data to Data Lake directly; it is written via an Azure Service Bus queue, which then triggers the WebJob to append the data to a file on Data Lake.)
Aside from looking at the file, I can see a few other choices. If only your unit test is writing to the file, then you can send appends of variable lengths and check whether the size of the file is updated appropriately as a result of the successful appends. You can always read the file and see whether your data made it as well.
I solved my problem by getting the length of my file on Data Lake Store using:
var fileoffset = _adlsFileSystemClient.FileSystem.GetFileStatus(_dlAccountName, "/MyFile.csv").FileStatus.Length;
After getting the length, I sent my test data to Data Lake and then got the length of the file again with the same code. The first length (before sending the test data) was my offset, and the length obtained after sending the test data was my destination length; from the offset to the destination length I read my Data Lake file using:
Stream Stream1 = _adlsFileSystemClient.FileSystem.Open(_dlAccountName, "/MyFile.csv", totalfileLength, fileoffset);
After getting my data in a stream, I searched for the test data I had sent using the following code.
Note: I had a column of GUIDs in the file, on the basis of which I searched for the GUID I sent in the file stream. Make sure to convert your search data to bytes and then pass it to the function ReadOneSrch(..).
static bool ReadOneSrch(Stream fileStream, byte[] mydata)
{
    int b;
    long i = 0;
    while ((b = fileStream.ReadByte()) != -1)
    {
        if (b == mydata[i++])
        {
            if (i == mydata.Length)
                return true;
        }
        else
            i = b == mydata[0] ? 1 : 0;
    }
    return false;
}

Why IStream::Commit failed to write data into a file?

I have a binary file. When I opened it, I used ::StgOpenStorage with STGM_READWRITE | STGM_SHARE_DENY_WRITE | STGM_TRANSACTED mode to get a root storage named rootStorage. Then I used rootStorage.OpenStream with STGM_READWRITE | STGM_SHARE_EXCLUSIVE mode to get a substream named subStream.
Next, I wrote some data with subStream.Write(...) and called subStream.Commit(STGC_DEFAULT), but it just wouldn't write the data to the file.
When I also called rootStorage.Commit(STGC_DEFAULT), the data was written.
But when I used UltraCompare Professional - Binary Compare to compare the original file with the file I opened, a lot of extra data had been written at the end of the file. The extra data seems to come from the beginning of the file.
I just want to write a little data to the file while it is open. What should I do?
Binary file comparison will probably not work for structured storage files. The issue is that structured storage files often have extra space allocated in them--to handle transacted mode and to grow the file. If you want to do a file comparison, it will take more work. You will have to open the root storage in each file, then open the stream, and do a binary comparison on the streams.
I found out why there is extra data in my file.
1. Why should I use IStorage.Commit()?
I opened the root storage with the STGM_TRANSACTED flag, i.e. in transacted mode. In transacted mode, changes are accumulated and are not reflected in the storage object until an explicit commit operation is done, so I need to call rootStorage.Commit().
2. Why is there extra data after calling IStorage.Commit(STGC_DEFAULT)?
According to this website:
The OLE-provided compound files use a two phase commit process unless STGC_OVERWRITE is specified in the grfCommitFlags parameter. This two-phase process ensures the robustness of data in case the commit operation fails. First, all new data is written to unused space in the underlying file. If necessary, new space is allocated to the file. Once this step has been successfully completed, a table in the file is updated using a single sector write to indicate that the new data is to be used in place of the old. The old data becomes free space to be used at the next commit. Thus, the old data is available and can be restored in case an error occurs when committing changes. If STGC_OVERWRITE is specified, a single phase commit operation is used.