mock input dstream apache spark - unit-testing

I am trying to mock the input DStream while writing a Spark Streaming unit test. I am able to mock the RDDs, but when I try to convert them into a DStream, the DStream comes up empty. I have used the following code:
val lines = mutable.Queue[RDD[String]]()
val dstream = streamingContext.queueStream(lines)
// append data to DStream
lines += sparkContext.makeRDD(Seq("To be or not to be.", "That is the question."))
Any help would be highly appreciated.
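For reference, here is a minimal, self-contained sketch of the queueStream pattern that does produce data (the object name QueueStreamSketch is illustrative; the key points are registering an output operation before ssc.start() and giving the context at least one batch interval to drain the queue):
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("queue-stream-test")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = mutable.Queue[RDD[String]]()
    val dstream = ssc.queueStream(lines)

    // Register an output operation *before* starting the context,
    // otherwise the DStream is never materialized.
    val collected = mutable.ArrayBuffer[String]()
    dstream.foreachRDD { rdd => collected ++= rdd.collect() }

    ssc.start()
    // RDDs pushed onto the queue are consumed one per batch interval.
    lines += ssc.sparkContext.makeRDD(Seq("To be or not to be.", "That is the question."))
    ssc.awaitTerminationOrTimeout(3000)
    ssc.stop(stopSparkContext = true, stopGracefully = false)

    assert(collected.nonEmpty, "DStream produced no records")
  }
}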

Write unit tests for all of DataFrameWriter, DataFrameReader, DataStreamReader and DataStreamWriter.
The sample test cases below follow the same three steps: mock, define the behavior, assert.
Maven dependencies:
<dependency>
    <groupId>org.scalatestplus</groupId>
    <artifactId>mockito-3-4_2.11</artifactId>
    <version>3.2.3.0</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.mockito</groupId>
    <artifactId>mockito-inline</artifactId>
    <version>2.13.0</version>
    <scope>test</scope>
</dependency>
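The test cases below assume a ScalaTest FunSuite mixing in MockitoSugar (for mock[...]) and Matchers, with Mockito's when/doNothing imported statically. A minimal suite skeleton under that assumption (the class name MockingSparkTest is illustrative):
import java.util.Properties

import org.apache.spark.sql._
import org.mockito.Mockito.{doNothing, when}
import org.scalatest.funsuite.AnyFunSuite
import org.scalatest.matchers.should.Matchers
import org.scalatestplus.mockito.MockitoSugar

// Illustrative skeleton: the "Spark read table" and "Spark write" tests below live inside this class.
class MockingSparkTest extends AnyFunSuite with MockitoSugar with Matchers {
  // test("...") { ... }
}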
Let's use an example of a Spark pipeline class where the source is a Hive table and the sink is JDBC:
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SparkSession}

class DummySource extends SparkPipeline {

  /**
   * Method to read the source and create a DataFrame
   *
   * @param spark : SparkSession
   * @return : DataFrame
   */
  override def read(spark: SparkSession): DataFrame = {
    spark.read.table("Table_Name").filter("_2 > 1")
  }

  /**
   * Method to transform the DataFrame
   *
   * @param df : DataFrame
   * @return : DataFrame
   */
  override def transform(df: DataFrame): DataFrame = ???

  /**
   * Method to write/save the DataFrame to a target
   *
   * @param df : DataFrame
   */
  override def write(df: DataFrame): Unit =
    df.write.jdbc("url", "targetTableName", new Properties())
}
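DummySource extends a SparkPipeline trait that the reference does not show; a plausible minimal definition, inferred from the three overridden methods, looks like this:
// Assumed trait, reconstructed from the methods DummySource overrides.
trait SparkPipeline {
  def read(spark: SparkSession): DataFrame
  def transform(df: DataFrame): DataFrame
  def write(df: DataFrame): Unit
}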
Mocking Read
test("Spark read table") {
val dummySource = new DummySource()
val sparkSession = SparkSession
.builder()
.master("local[*]")
.appName("mocking spark test")
.getOrCreate()
val testData = Seq(("one", 1), ("two", 2))
val df = sparkSession.createDataFrame(testData)
df.show()
val mockDataFrameReader = mock[DataFrameReader]
val mockSpark = mock[SparkSession]
when(mockSpark.read).thenReturn(mockDataFrameReader)
when(mockDataFrameReader.table("Table_Name")).thenReturn(df)
dummySource.read(mockSpark).count() should be(1)
}
Mocking Write
test("Spark write") {
val dummySource = new DummySource()
val mockDf = mock[DataFrame]
val mockDataFrameWriter = mock[DataFrameWriter[Row]]
when(mockDf.write).thenReturn(mockDataFrameWriter)
when(mockDataFrameWriter.mode(SaveMode.Append)).thenReturn(mockDataFrameWriter)
doNothing().when(mockDataFrameWriter).jdbc("url", "targetTableName", new Properties())
dummySource.write(df = mockDf)
}
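The write test above only proves that no exception is thrown. If you also want to assert that the JDBC sink was actually invoked, a Mockito verify call can be appended at the end of the test (same argument values as in the stub):
import org.mockito.Mockito.verify

// Confirms DummySource.write delegated to DataFrameWriter.jdbc with the expected arguments.
verify(mockDataFrameWriter).jdbc("url", "targetTableName", new Properties())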
The streaming (readStream/writeStream) code is covered in the reference below.
Ref : https://medium.com/walmartglobaltech/spark-mocking-read-readstream-write-and-writestream-b6fe70761242
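As a rough sketch of how the same pattern extends to the streaming reader (DataStreamReader is stubbed fluently, just like DataFrameReader above; the "kafka" source and "some_topic" values are placeholders):
test("Spark readStream") {
  import org.apache.spark.sql.streaming.DataStreamReader

  val mockSpark = mock[SparkSession]
  val mockStreamReader = mock[DataStreamReader]
  when(mockSpark.readStream).thenReturn(mockStreamReader)
  when(mockStreamReader.format("kafka")).thenReturn(mockStreamReader)
  when(mockStreamReader.option("subscribe", "some_topic")).thenReturn(mockStreamReader)
  when(mockStreamReader.load()).thenReturn(df) // df built from test data, as in the read test

  // Exercise code under test that calls spark.readStream.format(...).option(...).load()
}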

Related

Running BeamSql WithoutCoder or Making Coder Dynamic

I am reading data from a file and converting it to BeamRecord, but when I run a query on it, it shows this error:
Exception in thread "main" java.lang.ClassCastException: org.apache.beam.sdk.coders.SerializableCoder cannot be cast to org.apache.beam.sdk.coders.BeamRecordCoder
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.registerTables(BeamSql.java:173)
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.expand(BeamSql.java:153)
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.expand(BeamSql.java:116)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:533)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:465)
at org.apache.beam.sdk.values.PCollectionTuple.apply(PCollectionTuple.java:160)
at TestingClass.main(TestingClass.java:75)
When I provide a coder, it runs perfectly.
I am a little confused: since I am reading the data from a file and the file's schema changes on every run (because I am using templates), is there any way I can run the query with a default coder, or without a coder at all?
The code is below for reference.
PCollection<String> ReadFile1 = PBegin.in(p).apply(TextIO.read().from("gs://Bucket_Name/FileName.csv"));
PCollection<BeamRecord> File1_BeamRecord = ReadFile1.apply(new StringToBeamRecord()).setCoder(new Temp().test().getRecordCoder());
PCollection<String> ReadFile2= p.apply(TextIO.read().from("gs://Bucket_Name/FileName.csv"));
PCollection<BeamRecord> File2_Beam_Record = ReadFile2.apply(new StringToBeamRecord()).setCoder(new Temp().test1().getRecordCoder());
new Temp().test1().getRecordCoder() returns a hard-coded BeamRecordCoder, whose schema I actually need to determine at runtime.
The conversion from PCollection<String> to PCollection<BeamRecord> is below:
public class StringToBeamRecord extends PTransform<PCollection<String>, PCollection<BeamRecord>> {

    private static final Logger LOG = LoggerFactory.getLogger(StringToBeamRecord.class);

    @Override
    public PCollection<BeamRecord> expand(PCollection<String> arg0) {
        return arg0.apply("Conversion", ParDo.of(new ConversionOfData()));
    }

    static class ConversionOfData extends DoFn<String, BeamRecord> implements Serializable {

        @ProcessElement
        public void processElement(ProcessContext c) {
            String Data = c.element().replaceAll(",,", ",blank,");
            String[] array = Data.split(",");
            List<String> fieldNames = new ArrayList<>();
            List<Integer> fieldTypes = new ArrayList<>();
            List<Object> Data_Conversion = new ArrayList<>();
            int Count = 0;
            for (int i = 0; i < array.length; i++) {
                fieldNames.add("R" + Count);
                Count++;
                fieldTypes.add(Types.VARCHAR); // Using a schema I can set it
                Data_Conversion.add(array[i]);
            }
            LOG.info("The size is : " + Data_Conversion.size());
            BeamRecordSqlType type = BeamRecordSqlType.create(fieldNames, fieldTypes);
            c.output(new BeamRecord(type, Data_Conversion));
        }
    }
}
The query is:
PCollectionTuple test = PCollectionTuple.of(
new TupleTag<BeamRecord>("File1_BeamRecord"),File1_BeamRecord)
.and(new TupleTag<BeamRecord>("File2_BeamRecord"), File2_BeamRecord);
PCollection<BeamRecord> output = test.apply(BeamSql.queryMulti(
"Select * From File1_BeamRecord JOIN File2_BeamRecord "));
Is there any way I can make the coder dynamic, or run the query with a default coder?

HBase Get/Scan in a Scalding job

I'm using Scalding with Spyglass to read from/write to HBase.
I'm doing a left outer join of table1 and table2 and write back to table1 after transforming a column.
Both table1 and table2 are declared as Spyglass HBaseSource.
This works fine. But I need to access a different row in table1, using its row key, to compute the transformed value.
I tried the following for HBase get:
val hTable = new HTable(conf, TABLE_NAME)
val result = hTable.get(new Get(rowKey.getBytes()))
I'm getting access to Configuration in Scalding job as mentioned in this link:
https://github.com/twitter/scalding/wiki/Frequently-asked-questions#how-do-i-access-the-jobconf
This works when I run the Scalding job locally.
But when I run it on the cluster, conf is null when this code is executed in the reducer.
Is there a better way to do HBase get/scan in a Scalding/Cascading job for cases like this?
Ways to do this...
1) You can use a managed resource
class SomeJob(args: Args) extends Job(args) {
  val someConfig = HBaseConfiguration.create().addResource(new Path(pathtoyourxmlfile))
  lazy val hPool = new HTablePool(someConfig, 3)

  def getConf = {
    implicitly[Mode] match {
      case Hdfs(_, conf) => conf
      case _ => // ... whatever you are doing for a local conf
    }
  }

  ... somePipe.someOperation.... {
    val gets = key.map { key => new Get(key) }
    managed(hPool.getTable("myTableName")) acquireAndGet { table =>
      val results = table.get(gets)
      // ... do something with these results
    }
  }
}
2) You can use some more specific Cascading code: write a custom Scheme and, inside it, override the source method (and possibly others, depending on your needs). There you can access the JobConf like this:
class MyScheme extends Scheme[JobConf, SomeRecordReader, SomeOutputCollector, ..] {
  @transient var jobConf: Configuration = super.jobConfiguration

  override def source(flowProcess: FlowProcess[JobConf], ...): Boolean = {
    jobConf = flowProcess match {
      case h: HadoopFlowProcess => h.getJobConf
      case _ => jobConf
    }
    // ... do something with the jobConf here
  }
}

Missing invocation to mocked type at this point;

I am new to JMockit. I am trying to mock multiple instances of the java.io.File type in a method. There are some places where I shouldn't mock the File object; for that reason, I am using @Injectable. It is throwing the exception below.
I don't want to mock all instances of java.io.File; I want the instances returned from the methods to be actual Files.
The test class is below.
/**
*
*/
package org.iis.uafdataloader.tasklet;
import static org.junit.Assert.fail;
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.util.regex.Pattern;
import mockit.Expectations;
import mockit.Injectable;
import mockit.Mocked;
import mockit.NonStrictExpectations;
import mockit.VerificationsInOrder;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.RegexFileFilter;
import org.iis.uafdataloader.tasklet.validation.FileNotFoundException;
import org.junit.Test;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.repeat.RepeatStatus;
/**
* @author K23883
*
*/
public class FileMovingTaskletTest {
private FileMovingTasklet fileMovingTasklet;
@Mocked
private StepContribution contribution;
@Mocked
private ChunkContext chunkContext;
/**
* Test method for
* {@link org.iis.uafdataloader.tasklet.FileMovingTasklet#execute(org.springframework.batch.core.StepContribution, org.springframework.batch.core.scope.context.ChunkContext)}
* .
*
* @throws Exception
*/
@Test
public void testExecuteWhenWorkingDirDoesNotExist(
// @Mocked final File file,
@Injectable final File sourceDirectory,
@Injectable final File workingDirectory,
@Injectable final File archiveDirectory,
@Mocked final RegexFileFilter regexFileFilter,
@Mocked final FileUtils fileUtils) throws Exception {
fileMovingTasklet = new FileMovingTasklet();
fileMovingTasklet.setSourceDirectoryPath("sourceDirectoryPath");
fileMovingTasklet.setInFileRegexPattern("inFileRegexPattern");
fileMovingTasklet.setArchiveDirectoryPath("archiveDirectoryPath");
fileMovingTasklet.setWorkingDirectoryPath("workingDirectoryPath");
final File[] sourceDirectoryFiles = new File[] {
new File("sourceDirectoryPath/ISGUAFFILE.D140728.C00"),
new File("sourceDirectoryPath/ISGUAFFILE.D140729.C00") };
final File[] workingDirectoryFiles = new File[] {
new File("workingDirectoryPath/ISGUAFFILE.D140728.C00"),
new File("workingDirectoryPath/ISGUAFFILE.D140729.C00") };
new NonStrictExpectations(){{
new File("sourceDirectoryPath");
result = sourceDirectory;
sourceDirectory.exists();
result = true;
sourceDirectory.isDirectory();
result = true;
// workingDirectory =
new File("workingDirectoryPath");
result = workingDirectory;
workingDirectory.exists();
result = false;
workingDirectory.mkdirs();
FileUtils.cleanDirectory(onInstance(workingDirectory));
FilenameFilter fileNameFilter = new RegexFileFilter(anyString,
Pattern.CASE_INSENSITIVE);
sourceDirectory.listFiles(fileNameFilter);
result = sourceDirectoryFiles;
System.out.println("sourceDirectoryFile :"
+ ((File[]) sourceDirectoryFiles).length);
// for (int i = 0; i < sourceDirectoryFiles.length; i++) {
// FileUtils.moveFileToDirectory(sourceDirectoryFiles[i],
// workingDirectory, true);
// }
// archiveDirectory =
new File("archiveDirectoryPath");
result = archiveDirectory;
workingDirectory.listFiles();
result = workingDirectoryFiles;
// for (int i = 0; i < workingDirectoryFiles.length; i++) {
// FileUtils.copyFileToDirectory(workingDirectoryFiles[i],
// archiveDirectory);
// }
}};
RepeatStatus status = fileMovingTasklet.execute(contribution,
chunkContext);
assert (status == RepeatStatus.FINISHED);
new VerificationsInOrder() {{
sourceDirectory.exists();
onInstance(sourceDirectory).isDirectory();
onInstance(workingDirectory).exists();
onInstance(workingDirectory).mkdirs();
onInstance(sourceDirectory).listFiles((FilenameFilter)any);
FileUtils.moveFileToDirectory((File)any, onInstance(workingDirectory), true);
times = 2;
FileUtils.copyFileToDirectory((File)any, onInstance(archiveDirectory));
times= 2;
}};
}
}
The below is actual implementation method
/*
* (non-Javadoc)
*
* @see org.springframework.batch.core.step.tasklet.Tasklet#execute(org.
* springframework.batch.core.StepContribution,
* org.springframework.batch.core.scope.context.ChunkContext)
*/
@Override
public RepeatStatus execute(StepContribution contribution,
ChunkContext chunkContext) throws Exception {
File sourceDirectory = new File(sourceDirectoryPath);
if (sourceDirectory == null || !sourceDirectory.exists()
|| !sourceDirectory.isDirectory()) {
throw new FileNotFoundException("The source directory '"
+ sourceDirectoryPath
+ "' doesn't exist or can't be read or not a directory");
}
File workingDirectory = new File(workingDirectoryPath);
if (workingDirectory != null && !workingDirectory.exists() ) {
workingDirectory.mkdirs();
}
FileUtils.cleanDirectory(workingDirectory);
FilenameFilter fileFilter = new RegexFileFilter(inFileRegexPattern,
Pattern.CASE_INSENSITIVE);
File[] sourceDirectoryFiles = sourceDirectory.listFiles(fileFilter);
System.out.println("sourceDirectoryFiles : " + sourceDirectoryFiles.length);
for (File file : sourceDirectoryFiles) {
FileUtils.moveFileToDirectory(file, workingDirectory, true);
}
File archiveDirectory = new File(archiveDirectoryPath);
for (File file : workingDirectory.listFiles()) {
FileUtils.copyFileToDirectory(file, archiveDirectory);
}
return RepeatStatus.FINISHED;
}
The stack trace is below.
java.lang.IllegalStateException: Missing invocation to mocked type at this point; please make sure such invocations appear only after the declaration of a suitable mock field or parameter
at org.iis.uafdataloader.tasklet.FileMovingTaskletTest$1.<init>(FileMovingTaskletTest.java:75)
at org.iis.uafdataloader.tasklet.FileMovingTaskletTest.testExecuteWhenWorkingDirDoesNotExist(FileMovingTaskletTest.java:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Please help me solve this problem.
@Injectable gives you a single mocked instance; it won't affect other instances of the mocked type. So, when the test attempts to record new File("sourceDirectoryPath"), it says "missing invocation to mocked type at this point" precisely because the File(String) constructor is not mocked.
To mock the entire File class (including its constructors) so that all instances are affected, you need to use @Mocked instead, as the following example shows:
@Test
public void mockFutureFileObjects(@Mocked File anyFile) throws Exception
{
final String srcDirPath = "sourceDir";
final String wrkDirPath = "workingDir";
new NonStrictExpectations() {{
File srcDir = new File(srcDirPath);
srcDir.exists(); result = true;
srcDir.isDirectory(); result = true;
File wrkDir = new File(wrkDirPath);
wrkDir.exists(); result = true;
}};
sut.execute(srcDirPath, wrkDirPath);
}
The JMockit Tutorial describes the same mechanism, although with a slightly different syntax.
This said, I would suggest instead to write the test with real files and directories.
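A minimal sketch of that real-files approach (written here in Scala for consistency with the other test examples in this collection, and assuming the FileMovingTasklet setters and execute signature shown above; the implementation never touches the StepContribution/ChunkContext arguments, so nulls are passed):
import java.nio.file.Files
import org.springframework.batch.repeat.RepeatStatus

// Build real source/working/archive directories under a temp root,
// drop a file matching the regex into the source, then run the tasklet.
val root = Files.createTempDirectory("fileMovingTaskletTest")
val source = Files.createDirectory(root.resolve("source"))
val working = root.resolve("working") // intentionally not created: exercises mkdirs()
val archive = Files.createDirectory(root.resolve("archive"))
Files.write(source.resolve("ISGUAFFILE.D140728.C00"), "payload".getBytes)

val tasklet = new FileMovingTasklet()
tasklet.setSourceDirectoryPath(source.toString)
tasklet.setWorkingDirectoryPath(working.toString)
tasklet.setArchiveDirectoryPath(archive.toString)
tasklet.setInFileRegexPattern("ISGUAFFILE.*")

val status = tasklet.execute(null, null)
assert(status == RepeatStatus.FINISHED)
assert(Files.exists(working.resolve("ISGUAFFILE.D140728.C00"))) // moved from source
assert(Files.exists(archive.resolve("ISGUAFFILE.D140728.C00"))) // copied to archive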

MapReduce job with mixed data sources: HBase table and HDFS files

I need to implement an MR job which accesses data from both an HBase table and HDFS files. E.g., the mappers read data from the HBase table and from HDFS files; these records share the same primary key but have different schemas. A reducer then joins all columns (from the HBase table and the HDFS files) together.
I tried looking online and could not find a way to run an MR job with such mixed data sources. MultipleInputs seems to work only for multiple HDFS data sources. Please let me know if you have some ideas; sample code would be great.
After a few days of investigation (and help from the HBase user mailing list), I finally figured out how to do it. Here is the source code:
public class MixMR {
public static class Map extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String s = value.toString();
String[] sa = s.split(",");
if (sa.length == 2) {
context.write(new Text(sa[0]), new Text(sa[1]));
}
}
}
public static class TableMap extends TableMapper<Text, Text> {
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR1 = "c1".getBytes();
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
String key = Bytes.toString(row.get());
String val = new String(value.getValue(CF, ATTR1));
context.write(new Text(key), new Text(val));
}
}
public static class Reduce extends Reducer <Object, Text, Object, Text> {
public void reduce(Object key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String ks = key.toString();
for (Text val : values){
context.write(new Text(ks), val);
}
}
}
public static void main(String[] args) throws Exception {
Path inputPath1 = new Path(args[0]);
Path inputPath2 = new Path(args[1]);
Path outputPath = new Path(args[2]);
String tableName = "test";
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MixMR.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
scan.addFamily(Bytes.toBytes("cf"));
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
TableMap.class, // mapper
Text.class, // mapper output key
Text.class, // mapper output value
job);
job.setReducerClass(Reduce.class); // reducer class
job.setOutputFormatClass(TextOutputFormat.class);
// inputPath1 here has no effect for HBase table
MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Map.class);
MultipleInputs.addInputPath(job, inputPath2, TableInputFormat.class, TableMap.class);
FileOutputFormat.setOutputPath(job, outputPath);
job.waitForCompletion(true);
}
}
There is no OOTB feature that supports this. A possible workaround could be to scan your HBase table and write the Results to an HDFS file first, and then do the reduce-side join using MultipleInputs. But this will incur some additional I/O overhead.
A Pig script or Hive query can do that easily.
Sample Pig script:
tbl = LOAD 'hbase://SampleTable'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'info:* ...', '-loadKey true -limit 5')
      AS (id:bytearray, info_map:map[], ...);
fle = LOAD '/somefile' USING PigStorage(',') AS (id:bytearray, ...);
Joined = JOIN tbl BY id, fle BY id;
STORE Joined INTO '...';

Get Folder size of HDFS from java

I have to get the size of an HDFS folder (one that has subdirectories) from Java.
From the command line we can use the -dus option, but can anyone help me with how to get the same using Java?
The getSpaceConsumed() function in the ContentSummary class will return the actual space the file/directory occupies in the cluster i.e. it takes into account the replication factor set for the cluster.
For instance, if the replication factor in the hadoop cluster is set to 3 and the directory size is 1.5GB, the getSpaceConsumed() function will return the value as 4.5GB.
The getLength() function in the ContentSummary class will return the actual (logical) file/directory size.
You could use getContentSummary(Path f) method provided by the class FileSystem. It returns a ContentSummary object on which the getSpaceConsumed() method can be called which will give you the size of directory in bytes.
Usage :
package org.myorg.hdfsdemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class GetDirSize {
/**
* @param args
* @throws IOException
*/
public static void main(String[] args) throws IOException {
Configuration config = new Configuration();
config.addResource(new Path(
"/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
config.addResource(new Path(
"/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
FileSystem fs = FileSystem.get(config);
Path filenamePath = new Path("/inputdir");
System.out.println("SIZE OF THE HDFS DIRECTORY : " + fs.getContentSummary(filenamePath).getSpaceConsumed());
}
}
HTH
Thank you guys.
Scala version
package com.beloblotskiy.hdfsstats.model.hdfs
import java.nio.file.{Files => NioFiles, Paths => NioPaths}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.commons.io.IOUtils
import com.beloblotskiy.hdfsstats.common.Settings
/**
 * HDFS utilities
 * @author v-abelablotski
 */
object HdfsOps {
private val conf = new Configuration()
conf.addResource(new Path(Settings.pathToCoreSiteXml))
conf.addResource(new Path(Settings.pathToHdfsSiteXml))
private val fs = FileSystem.get(conf)
/**
 * Calculates disk usage with the replication factor taken into account.
 * If this function returns 3G for a folder with replication factor 3, HDFS stores 1G of file data, occupying 3G of raw space across the three copies.
 */
def duWithReplication(path: String): Long = {
val fsPath = new Path(path);
fs.getContentSummary(fsPath).getSpaceConsumed()
}
/**
 * Calculates disk usage without taking the replication factor into account.
 * The result will be the same as hadoop fs -du /hdfs/path/to/directory
 */
def du(path: String): Long = {
val fsPath = new Path(path);
fs.getContentSummary(fsPath).getLength()
}
//...
}
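A quick usage sketch (assuming Settings points at valid core-site.xml/hdfs-site.xml paths; /inputdir is a placeholder):
// Logical size vs. raw space consumed (roughly logical size times the replication factor).
val logicalBytes = HdfsOps.du("/inputdir")
val rawBytes = HdfsOps.duWithReplication("/inputdir")
println(s"/inputdir: logical=$logicalBytes bytes, with replication=$rawBytes bytes")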
Spark-shell tool to show all tables and their consumption
A typical, illustrative spark-shell script that loops over all databases, tables and partitions to get their sizes and report them to a CSV file:
// sshell -i script.scala > ls.csv
import org.apache.hadoop.fs.{FileSystem, Path}
def cutPath (thePath: String, toCut: Boolean = true) : String =
if (toCut) thePath.replaceAll("^.+/", "") else thePath
val warehouse = "/apps/hive/warehouse" // the Hive default location for all databases
val fs = FileSystem.get( sc.hadoopConfiguration )
println(s"base,table,partitions,bytes")
fs.listStatus( new Path(warehouse) ).foreach( x => {
val b = x.getPath.toString
fs.listStatus( new Path(b) ).foreach( x => {
val t = x.getPath.toString
var parts = 0; var size = 0L; // var size3 = 0L
fs.listStatus( new Path(t) ).foreach( x => {
// partition path is x.getPath.toString
val p_cont = fs.getContentSummary(x.getPath)
parts = parts + 1
size = size + p_cont.getLength
//size3 = size3 + p_cont.getSpaceConsumed
}) // t loop
println(s"${cutPath(b)},${cutPath(t)},${parts},${size}")
// display opt org.apache.commons.io.FileUtils.byteCountToDisplaySize(size)
}) // b loop
}) // warehouse loop
System.exit(0) // get out from spark-shell
PS: I checked; size3 is always 3*size, so it carries no extra information.