Apache Crunch: How to set multiple input paths? - mapreduce

I have a problem: I can't figure out how to set multiple input paths when using Apache Crunch. How can I solve this?

You can add multiple input files in Crunch by specifying all the input paths in a List and passing that list to the source:
// Imports assumed: java.util.ArrayList, java.util.List, org.apache.crunch.PCollection,
// org.apache.crunch.Pipeline, org.apache.crunch.impl.mr.MRPipeline, org.apache.crunch.io.From,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path, org.apache.hadoop.io.Text
public class Name {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Pipeline pipeline = new MRPipeline(Name.class, "multiple-inputs", conf);
        List<Path> inputPathList = new ArrayList<>(); // Add your input paths here
        // From.sequenceFile accepts a List<Path>, so every path in the list becomes an input
        PCollection<Text> source = pipeline.read(From.sequenceFile(inputPathList, Text.class));
    }
}
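If the inputs are plain text files rather than sequence files, the same pattern should also work with From.textFile, which likewise accepts a list of paths. A minimal sketch, reusing the pipeline from the snippet above (the paths are only placeholders, not from the original question):
List<Path> textPaths = new ArrayList<>();
textPaths.add(new Path("hdfs:///data/input1")); // illustrative path
textPaths.add(new Path("hdfs:///data/input2")); // illustrative path
PCollection<String> lines = pipeline.read(From.textFile(textPaths));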

Related

How to count the number of rows in the input file of the Google Dataflow file processing?

I am trying to count the number of rows in an input file, and I am using the Cloud Dataflow runner for creating the template. In the code below, I am reading the file from a GCS bucket, processing it, and then storing the output in a Redis instance.
But I am unable to count the number of lines in the input file.
Main Class
public static void main(String[] args) {
    /**
     * Constructed StorageToRedisOptions object using the method PipelineOptionsFactory.fromArgs to read options from command-line
     */
    StorageToRedisOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(StorageToRedisOptions.class);
    Pipeline p = Pipeline.create(options);
    p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()))
            .apply("Transforming data...",
                    ParDo.of(new DoFn<String, String[]>() {
                        @ProcessElement
                        public void TransformData(@Element String line, OutputReceiver<String[]> out) {
                            String[] fields = line.split("\\|");
                            out.output(fields);
                        }
                    }))
            .apply("Processing data...",
                    ParDo.of(new DoFn<String[], KV<String, String>>() {
                        @ProcessElement
                        public void ProcessData(@Element String[] fields, OutputReceiver<KV<String, String>> out) {
                            if (fields[RedisIndex.GUID.getValue()] != null) {
                                out.output(KV.of("firstname:"
                                        .concat(fields[RedisIndex.FIRSTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("lastname:"
                                        .concat(fields[RedisIndex.LASTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("dob:"
                                        .concat(fields[RedisIndex.DOB.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("postalcode:"
                                        .concat(fields[RedisIndex.POSTAL_CODE.getValue()]), fields[RedisIndex.GUID.getValue()]));
                            }
                        }
                    }))
            .apply("Writing field indexes into redis",
                    RedisIO.write().withMethod(RedisIO.Write.Method.SADD)
                            .withEndpoint(options.getRedisHost(), options.getRedisPort()));
    p.run();
}
Sample Input File
xxxxxxxxxxxxxxxx|bruce|wayne|31051989|444444444444
yyyyyyyyyyyyyyyy|selina|thomas|01051989|222222222222
aaaaaaaaaaaaaaaa|clark|kent|31051990|666666666666
Command to execute the pipeline
mvn compile exec:java \
-Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
-Dexec.args="--project=my-project-id \
--jobName=dataflow-job \
--inputFile=gs://my-input-bucket/*.txt \
--redisHost=127.0.0.1 \
--stagingLocation=gs://pipeline-bucket/stage/ \
--dataflowJobFile=gs://pipeline-bucket/templates/dataflow-template \
--runner=DataflowRunner"
I have tried to use the below code from a Stack Overflow solution, but it doesn't work for me.
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);
I have gone through the Apache Beam documentation as well but didn't find anything helpful. Any help on this will be really appreciated.
I resolved this issue by applying Count.globally() to the PCollection<String> obtained after the pipeline reads the file.
I have added the below code:
PCollection<String> lines = p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()));
lines.apply(Count.globally()).apply("Count the total records", ParDo.of(new RecordCount()));
where I have created a new class (RecordCount.java) that extends DoFn<Long, Void> and simply logs the count.
RecordCount.java
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RecordCount extends DoFn<Long, Void> {

    private static final Logger LOGGER = LoggerFactory.getLogger(RecordCount.class);

    @ProcessElement
    public void processElement(@Element Long count) {
        // Use a {} placeholder so SLF4J actually interpolates the count into the message
        LOGGER.info("The total number of records in the input file is: {}", count);
    }
}
The proper way to do this is to write the count to a storage system using a Beam connector (or a Beam ParDo). The pipeline result is not directly available to the main program, because a Beam runner may parallelize the computation and execution may not happen on the same machine.
For example (pseudocode):
p.apply(TextIO.Read.from("gs://..."))
.apply(Count.<String>globally())
.apply(ParDo(MyLongToStringParDo()))
.apply(TextIO.Write.to("gs://..."));
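A concrete, hedged version of that pseudocode, assuming the same Beam Java SDK style used in the question; the inline DoFn stands in for MyLongToStringParDo, and the output prefix is only illustrative:
p.apply(TextIO.read().from("gs://my-input-bucket/*.txt"))
        .apply(Count.<String>globally())
        .apply("Convert count to text", ParDo.of(new DoFn<Long, String>() {
            @ProcessElement
            public void processElement(@Element Long count, OutputReceiver<String> out) {
                out.output(String.valueOf(count)); // the single global count becomes one output line
            }
        }))
        .apply(TextIO.write().to("gs://pipeline-bucket/counts/count")); // illustrative output prefix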
If you need to handle the output directly in the main program, you can read it from GCS using a client library after the Beam program ends (make sure to specify p.run().waitUntilFinish() in this case). Alternatively, you can move the computation that needs the count into a Beam PTransform and make it part of your pipeline.
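If you go the client-library route, a minimal sketch (not from the original answer) using the google-cloud-storage Java client could look like the following; the bucket and object names are illustrative and depend on how TextIO sharded its output:
import java.nio.charset.StandardCharsets;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class CountReader {
    // Assumes the pipeline wrote a single shard; call this after p.run().waitUntilFinish().
    public static long readCount() {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        Blob blob = storage.get(BlobId.of("pipeline-bucket", "counts/count-00000-of-00001"));
        return Long.parseLong(new String(blob.getContent(), StandardCharsets.UTF_8).trim());
    }
}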

how to pass dynamic parameters in google cloud dataflow pipeline

I have written code to ingest a CSV file from GCS into BigQuery with a hardcoded ProjectID, Dataset, Table name, and GCS Temp & Staging location.
I am looking for code that reads the following
ProjectID
Dataset
Table name
GCS Temp & Staging location parameters
from a BigQuery table (dynamic parameters).
Code:
public class DemoPipeline {

    public static TableReference getGCDSTableReference() {
        TableReference ref = new TableReference();
        ref.setProjectId("myprojectbq");
        ref.setDatasetId("DS_Emp");
        ref.setTableId("emp");
        return ref;
    }

    static class TransformToTable extends DoFn<String, TableRow> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String input = c.element();
            String[] s = input.split(",");
            TableRow row = new TableRow();
            row.set("id", s[0]);
            row.set("name", s[1]);
            c.output(row);
        }
    }

    public interface MyOptions extends PipelineOptions {
        /*
         * Param
         *
         */
    }

    public static void main(String[] args) {
        MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
        options.setTempLocation("gs://demo-xxxxxx/temp");
        Pipeline p = Pipeline.create(options);

        PCollection<String> lines = p.apply("Read From Storage", TextIO.read().from("gs://demo-xxxxxx/student.csv"));
        PCollection<TableRow> rows = lines.apply("Transform To Table", ParDo.of(new TransformToTable()));
        rows.apply("Write To Table", BigQueryIO.writeTableRows().to(getGCDSTableReference())
                //.withSchema(BQTableSemantics.getGCDSTableSchema())
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

        p.run();
    }
}
Even to read from an initial table (project ID / dataset / table names) where the other data is contained, you need to hardcode that information somewhere. A properties file, as Haris recommended, is a good approach; consider the following suggestions:
Java properties file. Used when parameters have to be changed or tuned; in general, changes that don't require recompilation. It's a file that has to live alongside or be attached to your Java classes. Reading this file from GCS is feasible, but an odd option.
Pipeline execution parameters. Custom parameters can be a workaround for your question; please check Creating Custom Options to understand how this can be accomplished. A small example is shown below.
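A small hedged sketch, filling in the question's empty MyOptions interface with illustrative option names (not from the original code):
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

public interface MyOptions extends PipelineOptions {
    @Description("BigQuery project id")
    @Validation.Required
    String getBqProject();
    void setBqProject(String value);

    @Description("BigQuery dataset name")
    @Validation.Required
    String getBqDataset();
    void setBqDataset(String value);

    @Description("BigQuery table name")
    @Validation.Required
    String getBqTable();
    void setBqTable(String value);
}
These values can then be supplied on the command line (for example --bqProject=..., --bqDataset=..., --bqTable=...) and read via options.getBqProject() and friends, instead of the values hardcoded in getGCDSTableReference().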

Java: return a LinkedHashSet

Basically, I'm trying to return a collection of strings in Java.
But...
each string must be unique because they're all the names of ".db" files in the current folder, so I thought this collection should be a LinkedHashSet.
The elements (filenames) must maintain the exact same order, so I can choose one of them by its order number in the collection.
My main routine will show this collection in a GUI component (maybe a JList) for the user to choose one of them (without the .db extension).
I'm a total newbie (as you can see), so if you think there are better options than LinkedHashSet, please tell me.
Also, how can I grab this collection in the main class?
What I've got so far:
public Set GetDBFilesList() {
    //returns ORDERED collection of UNIQUE strings with db filenames
    LinkedHashSet a = new LinkedHashSet();
    FilenameFilter dbFilter = (File file, String name) -> {
        return name.toLowerCase().endsWith(".db");
    };
    String dirPath = "";
    File dir = new File(dirPath);
    File[] files = dir.listFiles(dbFilter);
    if (files.length > 0) {
        for (File aFile : files) {
            a.add(aFile.getName());
        }
    }
    return a;
}
You want an ordered and unique collection - LinkedHashSet is a good choice.
Some comments on your method:
You should use generics, e.g. LinkedHashSet<File> or LinkedHashSet<String>
The check for files.length is unnecessary, but you could check for null in case the path is not a directory or an I/O error occurred
You should name your variables properly: a is not a good name
Your method can be static - maybe in a static helper class?
The Set.add method returns true or false depending on whether the item was added; you should check that just in case
Putting it all together:
//Your Main class
public class Main
{
    public static void main(String[] args)
    {
        File dir = new File("");
        Collection<File> dbFiles = DbFileManager.getDatabaseFiles(dir);
    }
}
//Your DB File Reader Logic (imports: java.io.*, java.util.*)
public class DbFileManager
{
    public static Collection<File> getDatabaseFiles(File directory)
    {
        Collection<File> dbFiles = new LinkedHashSet<>();
        FilenameFilter filter = (dir, name) -> name.toLowerCase().endsWith(".db");
        File[] found = directory.listFiles(filter);
        if (found != null) //null if the path is not a directory or an I/O error occurred
        {
            boolean success = dbFiles.addAll(Arrays.asList(found));
            //Check if everything was added
        }
        return dbFiles;
    }
}
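To cover the ordering part of the question (picking an entry by its position, for example from a JList selection), one option is to copy the set into a List. A minimal sketch; DbFilePicker and pickDisplayName are illustrative names, not part of the answer above:
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class DbFilePicker {
    // LinkedHashSet keeps insertion order but has no index-based access,
    // so copy the collection into a List when selecting by position.
    public static String pickDisplayName(Collection<File> dbFiles, int selectedIndex) {
        List<File> ordered = new ArrayList<>(dbFiles);
        File chosen = ordered.get(selectedIndex);
        return chosen.getName().replaceFirst("\\.db$", ""); // drop the .db extension for display
    }
}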

Sitecore FileUtil.ZipFiles creating empty zip file

I am trying to use the ZipFiles() utility method and it's producing an empty zip file. I am using Sitecore 6.5. There are no errors, permissions or otherwise.
Any thoughts? Here is the code.
public void CreateZipFile(string zipfileName, List<string> files)
{
    var zipfile = string.Format("{0}/{1}/{2}", TempFolder.Folder, "myfolder", zipfileName);
    var fileArray = files.ToArray();
    var x = FileUtil.ZipFiles(zipfile, fileArray);
}
EDIT:
I am passing the files like this
var files = new List<string> { FileUtil.MapPath("/temp/sample.xlf") };
The proper usage of FileUtil.ZipFiles method is:
FileUtil.ZipFiles("/test.zip", new []{"/web.config", "/otherfile.txt"})
Sitecore automatically maps paths. The zip file will be created in your web app root.
EDIT AFTER COMMENT
If you want to create a zip file outside the web root and with a flat structure inside, you can use the Sitecore ZipWriter class like this:
public static string ZipFiles(string absolutePathToZipfile, string[] files)
{
    using (ZipWriter zipWriter = new ZipWriter(absolutePathToZipfile))
    {
        foreach (string path in files)
        {
            using (FileStream fileStream = System.IO.File.OpenRead(path.StartsWith("/") ? FileUtil.MapPath(path) : path))
                zipWriter.AddEntry(FileUtil.GetFileName(path), fileStream);
        }
    }
    return absolutePathToZipfile;
}

SharpLibZip: Add file without path

I'm using the following code, using the SharpZipLib library, to add files to a .zip file, but each file is being stored with its full path. I need to store only the file name, in the 'root' of the .zip file.
string[] files = Directory.GetFiles(folderPath);
using (ZipFile zipFile = ZipFile.Create(zipFilePath))
{
    zipFile.BeginUpdate();
    foreach (string file in files)
    {
        zipFile.Add(file);
    }
    zipFile.CommitUpdate();
}
I can't find anything about an option for this in the supplied documentation. As this is a very popular library, I hope someone reading this may know something.
My solution was to set the NameTransform object property of the ZipFile to a ZipNameTransform with its TrimPrefix set to the directory of the file. This causes the directory part of the entry names, which are full file paths, to be removed.
public static void ZipFolderContents(string folderPath, string zipFilePath)
{
    string[] files = Directory.GetFiles(folderPath);
    using (ZipFile zipFile = ZipFile.Create(zipFilePath))
    {
        zipFile.NameTransform = new ZipNameTransform(folderPath);
        foreach (string file in files)
        {
            zipFile.BeginUpdate();
            zipFile.Add(file);
            zipFile.CommitUpdate();
        }
    }
}
What's cool is that the NameTransform property is of type INameTransform, allowing customisation of the name transforms.
How about using System.IO.Path.GetFileName() combined with the entryName parameter of ZipFile.Add()?
string[] files = Directory.GetFiles(folderPath);
using (ZipFile zipFile = ZipFile.Create(zipFilePath))
{
    zipFile.BeginUpdate();
    foreach (string file in files)
    {
        zipFile.Add(file, System.IO.Path.GetFileName(file));
    }
    zipFile.CommitUpdate();
}
The MSDN entry for Directory.GetFiles() states that "The returned file names are appended to the supplied path parameter" (http://msdn.microsoft.com/en-us/library/07wt70x2.aspx), so the strings you are passing to zipFile.Add() contain the path.
According to the SharpZipLib documentation, there is an overload of the Add method,
public void Add(string fileName, string entryName)
Parameters:
fileName (String): The name of the file to add.
entryName (String): The name to use for the ZipEntry in the created zip file.
Try this approach:
string[] files = Directory.GetFiles(folderPath);
using (ZipFile zipFile = ZipFile.Create(zipFilePath))
{
    zipFile.BeginUpdate();
    foreach (string file in files)
    {
        zipFile.Add(file, Path.GetFileName(file));
    }
    zipFile.CommitUpdate();
}