Facing performance issue while reading files from GCS using Apache Beam

I was trying to read data from a GCS path using a wildcard. The files are in bzip2 format, and around 300k files in the GCS path match the same wildcard expression. I'm using the code snippet below to read the files.
PCollection<String> val = p
    .apply(FileIO.match()
        .filepattern("gcsPath"))
    .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
    .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
        try {
            return f.readFullyAsUTF8String();
        } catch (IOException e) {
            return null;
        }
    }));
But the performance is very bad, and at the current speed it would take around 3 days to read all the files with the above code. Is there any alternative API I can use in Cloud Dataflow to read this number of files from GCS with, of course, good performance? I used TextIO earlier, but it failed because of the template serialization limit, which is 20 MB.

The TextIO code below solved the issue.
PCollection<String> input = p.apply("Read file from GCS",
    TextIO.read().from(options.getInputFile())
        .withCompression(Compression.AUTO)
        .withHintMatchesManyFiles());
withHintMatchesManyFiles() solved the issue, but I still don't know why the FileIO performance is so bad.
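If you need to stay on FileIO (for example, to keep per-file handling), one possible variant is to let TextIO split each matched file into lines instead of loading whole files with readFullyAsUTF8String(). This is just a sketch, assuming your Beam version ships TextIO.readFiles() and that it honors the compression recorded by readMatches():

PCollection<String> lines = p
    .apply(FileIO.match().filepattern("gcsPath"))
    .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
    .apply(TextIO.readFiles());

Each bzip2 file is still decompressed end to end, but emitting lines avoids holding each file as a single String element; withHintMatchesManyFiles() on plain TextIO.read(), as shown above, remains the simpler option.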

Related

7Zip CLI Compress file with LZMA

I'm trying to compress a file from the console with LZMA.
7z a -t7z output input
or
7z a -t7z -m0=lzma output input
However, I cannot open it on the client.
How can I compress a file as an LZMA archive from the console?
The problem may be that the above commands add the file to an archive, whereas I want to compress the data in a file directly, without any archive structure.
Is there an option to compress a data file into a compressed data file with LZMA?
Edit
I see downvotes, which means the question is "not correct" in some way.
So I'll try to explain what I want to achieve.
I compress data server-side and use it in a client application. I do this successfully in Node.js like this:
const lzma = require('lzma');

lzma.compress(inputBuffer, 1, callback);

function callback(data, err) {
    writefile(outputPath, Buffer.from(data));
}
However, it is very slow. So I want to call 7Zip for the compression.
My .NET server also compresses it in a similar way.
byte[] barData;
using (var barStream = dukasDataHelper.SerializeLightBars(lightBars.ToArray()))
using (var zippedStream = zipLzma.Zip(barStream))
{
    barData = zippedStream.ToArray();
}
My problem is that I cannot find the right CLI options to produce a file that the client can read.
My client code (C#) is:
using (var blobStream = new MemoryStream(blobBytes))
using (var barStream = new ZipLzma().Unzip(blobStream))
{
SaveDataSet(barStream, localPath);
}
I get this error message when compressing via the CLI:
$exception {"Data Error"}
Data: {System.Collections.ListDictionaryInternal}
   at SevenZipLzma.LZMA.Decoder.Code(Stream inStream, Stream outStream, Int64 inSize, Int64 outSize, ICodeProgress progress)
   at SevenZipLzma.ZipLzma.Unzip(Stream stream)
Since the code works when I compress with Node.js but fails when I compress via the CLI, something about the CLI output format must be different.
7zip makes an archive of files and directories, whereas LZMA generates a single stream of compressed data. They are not the same format. LZMA can be used inside a 7zip archive to compress an entry (or LZMA2 or Deflate or several other compression methods).
You can try the xz command to generate LZMA streams with xz --format=lzma.

How to read a huge CSV file from Google Cloud Storage line by line using Java?

I'm new to Google Cloud Platform. I'm trying to read a CSV file of around 1 GB, stored in Google Cloud Storage (a non-public bucket accessed via a service account key), line by line.
I couldn't find any option to read a file in Google Cloud Storage (GCS) line by line; I only see options to read by chunk/byte size. Since I'm reading a CSV, I don't want to read by chunk size, as that may split a record mid-read.
Solutions tried so far:
I tried copying the contents of the CSV file in GCS to a temporary local file and reading the temp file with the code below. The code works as expected, but I don't want to copy such a huge file to my local instance. Instead, I want to read it line by line from GCS.
StorageOptions options = StorageOptions.newBuilder()
    .setProjectId(GCP_PROJECT_ID)
    .setCredentials(gcsConfig.getCredentials())
    .build();
Storage storage = options.getService();
Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
ReadChannel readChannel = blob.reader();
FileOutputStream fileOutputStream = new FileOutputStream(TEMP_FILE_NAME);
fileOutputStream.getChannel().transferFrom(readChannel, 0, Long.MAX_VALUE);
fileOutputStream.close();
Please suggest an approach.
Since I'm doing batch processing, I'm using the code below in my ItemReader's init() method, which is annotated with @PostConstruct. In my ItemReader's read(), I build a List whose size equals the chunk size; this way I can read lines chunk by chunk instead of reading them all at once (a sketch of the read() side follows the snippet).
StorageOptions options = StorageOptions.newBuilder()
    .setProjectId(GCP_PROJECT_ID)
    .setCredentials(gcsConfig.getCredentials())
    .build();
Storage storage = options.getService();
Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
ReadChannel readChannel = blob.reader();
BufferedReader br = new BufferedReader(Channels.newReader(readChannel, "UTF-8"));
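For illustration, a minimal sketch of that read() idea, assuming chunkSize and br are fields set up in the @PostConstruct init() method (the method signature is a placeholder for whatever your batch framework expects):

public List<String> read() throws IOException {
    List<String> chunk = new ArrayList<>(chunkSize);
    String line;
    while (chunk.size() < chunkSize && (line = br.readLine()) != null) {
        chunk.add(line);
    }
    // Returning null signals the end of the input to the batch framework.
    return chunk.isEmpty() ? null : chunk;
}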
One of the easiest ways might be to use the google-cloud-nio package, part of the google-cloud-java library that you're already using: https://github.com/googleapis/google-cloud-java/tree/v0.30.0/google-cloud-contrib/google-cloud-nio
It incorporates Google Cloud Storage into Java's NIO, and so once it's up and running, you can refer to GCS resources just like you'd do for a file or URI. For example:
Path path = Paths.get(URI.create("gs://bucket/lolcat.csv"));
try (Stream<String> lines = Files.lines(path)) {
    lines.forEach(s -> System.out.println(s));
} catch (IOException ex) {
    // do something or re-throw...
}
Brandon Yarbrough is right, and to add to his answer:
If you use gcloud to log in with your credentials, then Brandon's code will work: google-cloud-nio will use your login to access the files (and that will work even if they are not public).
If you prefer to do it all in software, you can use this code to read credentials from a local file and then access your file from Google Cloud:
String myCredentials = "/path/to/my/key.json";
CloudStorageFileSystem fs = CloudStorageFileSystem.forBucket(
    "bucket",
    CloudStorageConfiguration.DEFAULT,
    StorageOptions.newBuilder()
        .setCredentials(ServiceAccountCredentials.fromStream(
            new FileInputStream(myCredentials)))
        .build());
Path path = fs.getPath("/lolcat.csv");
List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
Edit: you don't want to read all the lines at once, so don't use readAllLines; but once you have the Path, you can use any of the other techniques discussed above to read just the part of the file you need: you can read one line at a time or get a Channel object.
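For example, a line-at-a-time sketch using the standard java.nio API once you have the Path (same file name as in the example above):

Path path = fs.getPath("/lolcat.csv");
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        // process one CSV record at a time
    }
}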

Is it possible to write to s3 via a stream using s3 java sdk

Normally, when a file has to be uploaded to S3, it first has to be written to disk before something like the TransferManager API uploads it to the cloud. This can cause data loss if the upload does not finish in time (the application goes down and restarts on a different server, etc.). So I was wondering if it's possible to write directly to a stream across the network, with the required cloud location as the sink.
You don't say what language you're using, but I'll assume Java based on your capitalization. In which case the answer is yes: TransferManager has an upload() method that takes a PutObjectRequest, and you can construct that object around a stream.
However, there are two important caveats. The first is in the documentation for PutObjectRequest:
When uploading directly from an input stream, content length must be specified before data can be uploaded to Amazon S3
So you have to know how much data you're uploading before you start. If you're receiving an upload from the web and have a Content-Length header, then you can get the size from it. If you're just reading a stream of data that's arbitrarily long, then you have to write it to a file first (or the SDK will).
The second caveat is that this really doesn't prevent data loss: your program can still crash in the middle of reading data. One thing that it will prevent is returning a success code to the user before storing the data in S3, but you could do that anyway with a file.
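As a rough sketch of that path (AWS SDK for Java v1; bucketName, objectKey, inputStream and knownLength are placeholders, and the length must be known before the upload starts):

ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(knownLength); // required when uploading from a stream

PutObjectRequest request = new PutObjectRequest(bucketName, objectKey, inputStream, metadata);

TransferManager tm = TransferManagerBuilder.standard().build();
Upload upload = tm.upload(request);
upload.waitForCompletion(); // blocks; throws if the transfer fails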
Surprisingly, this is not possible (at the time of writing this post) with the standard Java SDK. However, thanks to this third-party library you can at least avoid buffering huge amounts of data in memory or on disk, since it buffers ~5 MB parts internally and uploads them automatically as a multipart upload for you.
There is also an open GitHub issue in the SDK repository that one can follow for updates.
It is possible:
AmazonS3 s3Client = AmazonS3ClientBuilder.standard().build();
s3Client.putObject("bucket", "key", yourInputStream, s3Metadata);
AmazonS3.putObject
public void saveS3Object(String key, InputStream inputStream) throws Exception {
    List<PartETag> partETags = new ArrayList<>();
    InitiateMultipartUploadRequest initRequest =
        new InitiateMultipartUploadRequest(bucketName, key);
    InitiateMultipartUploadResult initResponse = s3.initiateMultipartUpload(initRequest);
    int partSize = 5242880; // 5 MB: the minimum S3 allows for every part except the last
    try {
        byte[] b = new byte[partSize];
        int i = 1;
        while (true) {
            // Fill the buffer completely; a single read() may return fewer bytes than
            // requested, and every part except the last must be at least 5 MB.
            int len = 0;
            int n;
            while (len < partSize && (n = inputStream.read(b, len, partSize - len)) != -1) {
                len += n;
            }
            if (len == 0) {
                break; // end of stream
            }
            ByteArrayInputStream partInputStream = new ByteArrayInputStream(b, 0, len);
            UploadPartRequest uploadRequest = new UploadPartRequest()
                .withBucketName(bucketName).withKey(key)
                .withUploadId(initResponse.getUploadId()).withPartNumber(i)
                .withInputStream(partInputStream)
                .withPartSize(len);
            partETags.add(s3.uploadPart(uploadRequest).getPartETag());
            i++;
        }
        CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
            bucketName, key, initResponse.getUploadId(), partETags);
        s3.completeMultipartUpload(compRequest);
    } catch (Exception e) {
        // Abort so partially uploaded parts don't linger (and keep incurring storage costs).
        s3.abortMultipartUpload(new AbortMultipartUploadRequest(
            bucketName, key, initResponse.getUploadId()));
    }
}

How to write real-time data to HDFS with Avro/Parquet?

I have the following working in a unit test to write a single object in Avro/Parquet to a file in my Cloudera/HDFS cluster.
That said, given that Parquet is a columnar format, it seems like it can only write out an entire file in a batch mode (updates not supported).
So, what are the best practices for writing files for data ingested (via ActiveMQ/Camel) in real-time (small msgs at 1k msg/sec, etc)?
I suppose I could aggregate my messages (buffer in memory or other temp storage) and write them out in batch mode using a dynamic filename, but I feel like I'm missing something with the partitioning/file naming by hand, etc...
Configuration conf = new Configuration(false);
conf.set("fs.defaultFS", "hdfs://cloudera-test:8020/cm/user/hive/warehouse");
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false);
AvroReadSupport.setAvroDataSupplier(conf, ReflectDataSupplier.class);

Path path = new Path("/cm/user/hive/warehouse/test1.data");
MyObject object = new MyObject("test");
Schema schema = ReflectData.get().getSchema(object.getClass());

ParquetWriter<MyObject> parquetWriter = AvroParquetWriter.<MyObject>builder(path)
    .withSchema(schema)
    .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
    .withDataModel(ReflectData.get())
    .withDictionaryEncoding(false)
    .withConf(conf)
    .withWriteMode(ParquetFileWriter.Mode.OVERWRITE) // required because the filename doesn't change for this test
    .build();

parquetWriter.write(object);
parquetWriter.close();
Based on my (limited) research, I'm assuming that Parquet files can't be appended to (by design), so I simply must batch real-time data (in memory or otherwise) before writing out Parquet files; a rough sketch of that batching approach follows the link below.
How to append data to an existing parquet file
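For what it's worth, here is a rough sketch of that batch-and-roll idea; the batch size, the timestamp-based file naming and the onMessage/flush method names are arbitrary placeholders, and conf is the same Configuration as in the unit test above:

// Buffer incoming messages and roll to a new Parquet file per batch.
private static final int BATCH_SIZE = 10_000;            // arbitrary; tune to your message rate
private final List<MyObject> buffer = new ArrayList<>();

public synchronized void onMessage(MyObject msg) throws IOException {
    buffer.add(msg);
    if (buffer.size() >= BATCH_SIZE) {
        flush();
    }
}

private void flush() throws IOException {
    // A new file per batch, since Parquet files can't be appended to.
    Path path = new Path("/cm/user/hive/warehouse/test1-" + System.currentTimeMillis() + ".parquet");
    Schema schema = ReflectData.get().getSchema(MyObject.class);
    try (ParquetWriter<MyObject> writer = AvroParquetWriter.<MyObject>builder(path)
            .withSchema(schema)
            .withDataModel(ReflectData.get())
            .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
            .withConf(conf)
            .build()) {
        for (MyObject o : buffer) {
            writer.write(o);
        }
    }
    buffer.clear();
}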

Any seekable compression library?

I'm looking for a general compression library that supports random access during decompression. I want to compress Wikipedia into a single compressed file, and at the same time I want to decompress/extract individual articles from it.
Of course, I could compress each article individually, but that won't give much of a compression ratio. I've heard that an LZO-compressed file consists of many chunks which can be decompressed separately, but I haven't found the API and documentation for that. I can also use the Z_FULL_FLUSH mode in zlib, but is there a better alternative?
xz-format files support an index, though by default the index is not useful. My compressor, pixz, creates files that do contain a useful index. You can use the functions in the liblzma library to find which block of xz data corresponds to which location in the uncompressed data.
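If you end up consuming such files from Java, the XZ for Java library (org.tukaani.xz) provides, as far as I know, a SeekableXZInputStream that uses the block index for random access. A rough sketch (the file name and offset are placeholders):

try (SeekableFileInputStream file = new SeekableFileInputStream("wikipedia.xz")) {
    SeekableXZInputStream xz = new SeekableXZInputStream(file);
    xz.seek(123_456_789L);       // jump to an uncompressed byte offset
    byte[] buf = new byte[8192];
    int n = xz.read(buf);        // read article data from that position
}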
For seekable compression built on gzip, there is dictzip from the dict server and sgzip from The Sleuth Kit.
Note that you can't write to either of these; seeking only applies to reading anyway.
DotNetZip is a zip archive library for .NET.
Using DotNetZip, you can reference particular entries in the zip randomly, and can decompress them out of order, and can return a stream that decompresses as it extracts an entry.
With the benefit of those features, DotNetZip has been used within the implementation of a Virtual Path Provider for ASP.NET that does exactly what you describe: it serves all the content for a particular website from a compressed ZIP file. You can also serve websites with dynamic (ASP.NET) pages.
ASP.NET ZIP Virtual Path Provider, based on DotNetZip
The important code looks like this:
namespace Ionic.Zip.Web.VirtualPathProvider
{
    public class ZipFileVirtualPathProvider : System.Web.Hosting.VirtualPathProvider
    {
        ZipFile _zipFile;

        public ZipFileVirtualPathProvider(string zipFilename) : base()
        {
            _zipFile = ZipFile.Read(zipFilename);
        }

        ~ZipFileVirtualPathProvider() { _zipFile.Dispose(); }

        public override bool FileExists(string virtualPath)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath(virtualPath, true);
            ZipEntry zipEntry = _zipFile[zipPath];
            if (zipEntry == null)
                return false;
            return !zipEntry.IsDirectory;
        }

        public override bool DirectoryExists(string virtualDir)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath(virtualDir, false);
            ZipEntry zipEntry = _zipFile[zipPath];
            if (zipEntry == null)
                return false;
            return zipEntry.IsDirectory;
        }

        public override VirtualFile GetFile(string virtualPath)
        {
            return new ZipVirtualFile(virtualPath, _zipFile);
        }

        public override VirtualDirectory GetDirectory(string virtualDir)
        {
            return new ZipVirtualDirectory(virtualDir, _zipFile);
        }

        public override string GetFileHash(string virtualPath, System.Collections.IEnumerable virtualPathDependencies)
        {
            return null;
        }

        public override System.Web.Caching.CacheDependency GetCacheDependency(String virtualPath, System.Collections.IEnumerable virtualPathDependencies, DateTime utcStart)
        {
            return null;
        }
    }
}
And VirtualFile is defined like this:
namespace Ionic.Zip.Web.VirtualPathProvider
{
    class ZipVirtualFile : VirtualFile
    {
        ZipFile _zipFile;

        public ZipVirtualFile(String virtualPath, ZipFile zipFile) : base(virtualPath)
        {
            _zipFile = zipFile;
        }

        public override System.IO.Stream Open()
        {
            ZipEntry entry = _zipFile[Util.ConvertVirtualPathToZipPath(base.VirtualPath, true)];
            return entry.OpenReader();
        }
    }
}
bgzf is the format used in genomics.
http://biopython.org/DIST/docs/api/Bio.bgzf-module.html
It is part of the samtools C library and really just a simple hack around gzip. You can probably re-write it yourself if you don't want to use the samtools C implementation or the picard java implementation. Biopython implements a python variant.
You haven't specified your OS. Would it be possible to store your file in a compressed directory managed by the OS? Then you would have the "seekable" portion as well as the compression. The CPU overhead will be handled for you with unpredictable access times.
I'm using MS Windows Vista, unfortunately, and I can browse into zip files in Explorer as if they were normal folders. Presumably that still works on Windows 7 (which I'd like to be on). I think I've done the same with the corresponding utility on Ubuntu, but I'm not sure. I could also test it on Mac OS X, I suppose.
If individual articles are too short to get a decent compression ratio, the next-simplest approach is to tar up a batch of Wikipedia articles -- say, 12 articles at a time, or however many articles it takes to fill up a megabyte.
Then compress each batch independently.
In principle, that gives better compression than compressing each article individually, but worse compression than solid compression of all the articles together.
Extracting article #12 from a compressed batch requires decompressing the entire batch (and then throwing the first 11 articles away), but that's still much, much faster than decompressing half of Wikipedia.
Many compression programs break up the input stream into a sequence of "blocks", and compress each block from scratch, independently of the other blocks.
You might as well pick a batch size about the size of a block -- larger batches won't get any better compression ratio, and will take longer to decompress.
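As an illustration only (the language, the batch size, and the index format here are arbitrary choices, not part of any particular tool), a batch scheme can be as simple as gzipping each batch independently and writing down where each batch starts:

import java.io.*;
import java.util.*;
import java.util.zip.GZIPOutputStream;

// Write articles in independently compressed batches and record
// (file offset, id of first article) for each batch, so a single batch
// can later be located, decompressed and scanned on its own.
static void writeBatches(List<byte[]> articles, int articlesPerBatch,
                         File dataFile, File indexFile) throws IOException {
    try (RandomAccessFile out = new RandomAccessFile(dataFile, "rw");
         PrintWriter index = new PrintWriter(new FileWriter(indexFile))) {
        for (int i = 0; i < articles.size(); i += articlesPerBatch) {
            index.println(out.getFilePointer() + "\t" + i);
            ByteArrayOutputStream batch = new ByteArrayOutputStream();
            try (DataOutputStream gz = new DataOutputStream(new GZIPOutputStream(batch))) {
                for (int j = i; j < Math.min(i + articlesPerBatch, articles.size()); j++) {
                    gz.writeInt(articles.get(j).length); // length prefix, so articles can be split apart again
                    gz.write(articles.get(j));
                }
            }
            out.write(batch.toByteArray());
        }
    }
}

To extract article #12, look up its batch's offset in the index, seek to it, and decompress just that one batch.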
I have experimented with several ways to make it easier to start decoding a compressed database in the middle.
Alas, so far the "clever" techniques I've applied still have worse compression ratio and take more operations to produce a decoded section than the much simpler "batch" approach.
For more sophisticated techniques, you might look at
MG4J: Managing Gigabytes for Java
"Managing Gigabytes: Compressing and Indexing Documents and Images" by Ian H. Witten, Alistair Moffat, and Timothy C. Bell