How to write the column names with Hadoop MapReduce? - python-2.7

I am running a MapReduce job using the Hadoop streaming API, writing the mapper and reducer in Python. My questions are about formatting the final (reducer) output, plus a few related things. Can I write column names (is that what a column qualifier is?)? Also, how do I keep the column widths constant when each row entry has a different width?
As each line of the input file is processed, is it possible to set a counter and increment it - how? Does the mapper have to send a key, or can it emit just the value?
I hear log4j is used to log errors - what needs to be done in the reducer or in log4j to get errors logged there, and will that also work from Python? If not, how do I get them logged?
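As a rough sketch only (the column names, widths and counter group below are made-up placeholders, not from the question): a streaming reducer can print a fixed-width header line once, pad every field to its column width so rows of different lengths stay aligned, and bump a Hadoop counter by writing a reporter:counter line to stderr.
#!/usr/bin/env python
# reducer.py - hedged sketch of a Hadoop Streaming reducer (Python 2.7)
import sys

COLUMNS = ("word", "count")   # hypothetical column names
WIDTHS = (20, 10)             # fixed widths keep the columns aligned

def emit(fields):
    # pad each field to its column width so the output lines up
    print("".join(str(f).ljust(w) for f, w in zip(fields, WIDTHS)))

emit(COLUMNS)  # header row, written once at the top of the reducer output
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    emit((key, value))
    # Hadoop Streaming counters are incremented by writing a reporter line to stderr
    sys.stderr.write("reporter:counter:MyJob,LinesProcessed,1\n")
The partition("\t") above follows the streaming convention that everything before the first tab is the key; a mapper line with no tab at all is treated as a key with an empty value.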

Related

Spark Optimisation Not working as expected

I am trying to run a set of SQL operations in PySpark. However, the optimization is not happening as expected.
Ideally, Spark optimizes the whole execution plan before doing the actual work.
The logic is written as below -
df = spark.read.csv("s3://data/")
df.createOrReplaceTempView("data")
df2=spark.sql("select * from data where p_report_date=20200901")
df2.createOrReplaceTempView("data2")
df3=spark.sql("select * from data2")
df3.write.mode("overwrite").format("parquet").save("s3_path_2")
Data files are stored with partitions -
s3://data/p_report_date=20200901
s3://data/p_report_date=20200902
s3://data/p_report_date=20200903
So basically the action is called on df3; before that, everything is just transformations.
I am assuming Spark will read only the folder mentioned in the WHERE clause of df2 (s3://data/p_report_date=20200901), but it is reading all the partitions under s3://data first and then filtering on p_report_date.
We are using EMR 6.5.0 and Spark 3.1.2.
Question - why is Spark not optimizing the query to read only the mentioned folder instead of reading everything?
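One way to check whether partition pruning is actually happening (a hedged sketch, not part of the original post) is to print the physical plan of the filtered DataFrame and look at the FileScan node:
# Sketch: inspect the physical plan to see whether the partition filter is pushed down
df = spark.read.csv("s3://data/")
df.createOrReplaceTempView("data")
df2 = spark.sql("select * from data where p_report_date=20200901")
# In the printed plan, a PartitionFilters entry mentioning p_report_date on the
# FileScan node means only the matching folder should be read.
df2.explain(True)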

Reading large result set from Mongo - performance issue

We faced a performance issue when reading several thousand documents. We have a RoR application where we read data stored in Mongo, using Mongoid. Each stored document contains 17 fields (15 floats and 2 integers). We execute a query which is supported by an index. The cursor returns very fast (<50ms) but reading all the documents takes more than 500ms.
To find the bottleneck we ran the same query in the Mongo shell: it took <50ms to complete and iterate over all rows in the result set. We tested the Mongo Ruby driver and the query took 250ms to complete. We got the same result with Moped. Finally we wrote a C++ app which uses the Mongo C++ driver, and the time to iterate over the whole result set was <20ms. But if we unpack the received BSON objects (to print them to the console), the time rises to 120ms.
Does extraction from BSON really take that much time?
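To separate the query/fetch time from the document-decoding time on the driver side, a rough timing sketch can help. The one below is in Python with pymongo purely as an illustration (the original setup is Ruby/Mongoid, and the database, collection and field names are assumptions):
import time
from pymongo import MongoClient

coll = MongoClient()["mydb"]["measurements"]     # hypothetical database/collection

t0 = time.time()
cursor = coll.find({"indexed_field": 42})        # hypothetical indexed query
first = next(cursor, None)                       # query + first batch fetched and decoded
t1 = time.time()
rest = list(cursor)                              # remaining documents fetched and decoded
t2 = time.time()

print("time to first document: %.1f ms" % ((t1 - t0) * 1000))
print("time for full iteration: %.1f ms" % ((t2 - t1) * 1000))
If the second number dominates, most of the cost is in fetching and decoding the documents rather than in executing the query itself.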

Informatica character sequence

I am trying to create a character sequence like AAA, AAB, AAC, AAD, ..., BAA, BAB, BAC, ... and so on in a flat file using Informatica. I have the formula to create the character sequence.
Here I need sequence numbers generated in Informatica, but I don't have any source file or database table to use as the source.
Is there any method in Informatica to create a sequence using the Sequence Generator when there are no source records to read?
This is a bit tricky, as Informatica does row-by-row processing and your mapping won't initialize until you give it source rows as input (from a file or a database). So to generate a sequence of length n with Informatica transformations you need n input rows.
Another solution is to use a dummy source (i.e. a source with one row), pass the loop parameters from this source, and then use a Java transformation with Java code to generate the sequence.
There is no way to generate rows without a source in a mapping.
When I need to do that, I use one of these methods:
Generating a file with as many lines as I need, with the seq command under Unix. It could also be used as a direct pipeline source without creating the file.
Getting lines from a database.
For example, Oracle can generate as many lines as you want with a hierarchical query:
SELECT LEVEL just_a_column
FROM dual
CONNECT BY LEVEL <= 26*26*26
DB2 can do that with a recursive query :
WITH DUMMY(ID) AS (
SELECT 1 FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT ID + 1 FROM DUMMY WHERE ID < 26*26*26
)
SELECT ID FROM DUMMY
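Once the numbered rows exist, mapping each number to the AAA, AAB, ... sequence is just a base-26 conversion over A-Z. The asker's actual Informatica formula isn't shown, so the Python snippet below is only an illustration of the idea (the queries above generate 1 to 26*26*26, hence the minus one):
def to_letters(n, width=3):
    # base-26 over A..Z: 0 -> AAA, 1 -> AAB, ..., 26**3 - 1 -> ZZZ
    letters = []
    for _ in range(width):
        n, r = divmod(n, 26)
        letters.append(chr(ord('A') + r))
    return "".join(reversed(letters))

print(to_letters(1 - 1))       # AAA (first generated ID)
print(to_letters(28 - 1))      # ABB
print(to_letters(26**3 - 1))   # ZZZ (last generated ID)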
You can generate rows using a Java transformation, but even to use that you need a source. I suggest you put the formula in the Java transformation and use a dummy source - e.g. a database source with a select getdate() statement - so that one record is returned to invoke the Java transformation. You can then generate the sequence in the Java transformation itself, or connect a Sequence Generator to the output of the Java transformation to number the rows.
There is an option to create a sequence number even when it is not available in the source.
Create a Sequence Generator transformation. You will get NEXTVAL and CURRVAL ports.
In the Properties tab you have the options to configure the sequence numbers:
Start Value - the value from which it should start
Increment By - the increment value
End Value - the value at which it should end
Current Value - your current value
Cycle - in case you need the sequence to cycle
Number of Cached Values
Reset
Tracing Level
Connect the NEXTVAL port to your target column.

Hadoop MapReduce using 2 mappers and 1 reducer in C++

Following the instructions in this link, I implemented a wordcount program in C++ using a single mapper and a single reducer. Now I need to use two mappers and one reducer for the same problem.
Can someone please help me in this regard?
The number of mappers depends on the number of input splits created. The number of input splits depends on the size of the input, the block size, the number of input files (each input file creates at least one input split), whether the input files are splittable or not, etc. See also this post on SO.
You can set the number of reducers to as many as you wish. I guess in Hadoop Pipes you can do this by passing -D mapred.reduce.tasks=... when running hadoop. See this post on SO.
If you want to quickly test how your program works with more than one mapper, you can simply put another file in your input path. This will make Hadoop create another input split and thus another map task.
PS: The link that you provided is not reachable.

HBase MapReduce output to HDFS & HBase

I have a MapReduce program that first scans an HBase table.
I want some reducer output to go to HDFS and some reducer output to be written to an HBase table. Can a reducer be configured to output to two different locations/formats like this?
A reducer can be configured to write to multiple files using the MultipleOutputs class. The documentation at the top of that class provides a clear example of writing to multiple files. However, since there is no built-in OutputFormat for writing to HBase, you might consider writing the second stream to a specific place on HDFS and then using another job to insert it into HBase.
If you don't want to write too much code, just open a Table in your mapper's or reducer's setup() method and issue Put statements against your HBase table there. At the same time, write your job so that its output file is an HDFS file. This way you get to write to both HBase and HDFS.
To be more elaborate: a context.write() writes to the HDFS file, while the table.put() writes to HBase whenever you do a Put.
Also, don't forget to close the table and anything else in your cleanup() method. The only drawback is that if there are, say, 1000 mappers, your table connection would be opened 1000 times; but at any given point only the maximum number of concurrent mappers actually run, so that would probably be around 50, depending on your setup. Works for me at least!
I think MultipleOutputs can do the job.
Check this out:
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html