MapReduce program to read data from Hive

I am new to Hadoop MapReduce and Hive.
I would like to read data from Hive using a MapReduce program (in Java) and compute the average.
I am not sure how to implement this in MapReduce. Please help me with a sample program.
I am using IBM BigInsights 64-bit to work with the Hadoop framework.
Also, I am unable to open the link below; I get a "page cannot be found" error.
https://cwiki.apache.org/Hive/tutorial.html#Tutorial-Custommap%252Freducescripts

Is there a reason you are not simply using HQL and
select avg(my_col) from my_table?
If you really need to do it in Java, then you can use HiveClient and access Hive via the JDBC API.
Here is a sample code snippet (elaborated from the HiveClient docs):
Connection con = null;   // java.sql.Connection
Statement stmt = null;   // java.sql.Statement
ResultSet rs = null;     // java.sql.ResultSet
try {
    // register the Hive JDBC driver (use org.apache.hive.jdbc.HiveDriver
    // and a jdbc:hive2:// URL if you are on HiveServer2)
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    stmt = con.createStatement();
    rs = stmt.executeQuery("select avg(my_col) as my_avg from my_table");
    if (rs.next()) {
        double avg = rs.getDouble("my_avg");
        // do something with it..
    }
} finally {
    // close rs, stmt, con in reverse order
}
For further info: https://cwiki.apache.org/confluence/display/Hive/HiveClient
Note: you do NOT need to put this code into your own map/reduce program; Hive takes care of creating the map/reduce job (and the associated benefits of parallelization) itself.

Related

Facing performance issue while reading files from GCS using Apache Beam

I was trying to read data using a wildcard from a GCS path. My files are in bzip2 format and there are around 300k files in the GCS path matching the same wildcard expression. I'm using the below code snippet to read the files.
PCollection<String> val = p
    .apply(FileIO.match().filepattern("gcsPath"))
    .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
    .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
        try {
            return f.readFullyAsUTF8String();
        } catch (IOException e) {
            return null;
        }
    }));
But the performance is very bad, and at the current speed it would take around 3 days to read all the files with the above code. Is there any alternative API I can use in Cloud Dataflow to read this number of files from GCS with, of course, good performance? I used TextIO earlier, but it was failing because of the template serialisation limit, which is 20 MB.
The TextIO code below solved the issue.
PCollection<String> input = p.apply("Read file from GCS",
    TextIO.read().from(options.getInputFile())
        .withCompression(Compression.AUTO)
        .withHintMatchesManyFiles());
withHintMatchesManyFiles() solved the issue. But I still don't know why the FileIO performance is so bad.

How to perform a unit test on the append function of Azure Data Lake written in .NET Framework?

I have created Azure WebJobs that contain methods for file creation and for appending data to that file on the Data Lake Store. I am done with all the development work, publishing the WebJobs, etc. Now I am going to write unit tests to check whether the data I am sending is successfully appended to the file or not. All I need to know is how to perform such a unit test; any ideas?
What I currently thought of doing is cleaning all the data from my Data Lake file and then sending test data to it, so that on the basis of one of the columns of the data I sent, I can check whether it got appended or not. Is there any way that can give a quick status of whether my test data was written or not?
Note: Actually I want to know how to delete a particular row of a CSV file on Data Lake, but I don't want to use U-SQL to search for the required row. (I am not sending data directly to Data Lake; it is written via an Azure Service Bus queue, which then triggers the WebJobs to append data to a file on Data Lake.)
Aside from looking at the file, I can see a few other choices. If only your unit test is writing to the file, then you can send appends of variable lengths and check whether the size of the file is updated appropriately as a result of the successful appends. You can always read the file back and check whether your data made it in as well.
I solved my problem as follows. I got the length of my file on the Data Lake Store using:
var fileoffset = _adlsFileSystemClient.FileSystem.GetFileStatus(_dlAccountName, "/MyFile.csv").FileStatus.Length;
After getting the length, I sent my test data to the Data Lake and then got the length of the file again using the same code. The first length, i.e. the one before sending the test data, was my offset, and the length obtained after sending the test data was my destination length. From the offset to the destination length I read my Data Lake file using:
Stream Stream1 = _adlsFileSystemClient.FileSystem.Open(_dlAccountName, "/MyFile.csv", totalfileLength, fileoffset);
After getting my data in a stream, I searched for the test data I had sent using the following code.
Note: I had a column of GUIDs in the file, on the basis of which I search for the GUID I sent in the file stream. Make sure to convert your search data to bytes and then pass it to the function ReadOneSrch(..).
static bool ReadOneSrch(Stream fileStream, byte[] mydata)
{
    int b;
    long i = 0;
    while ((b = fileStream.ReadByte()) != -1)
    {
        if (b == mydata[i++])
        {
            if (i == mydata.Length)
                return true;
        }
        else
            i = b == mydata[0] ? 1 : 0;
    }
    return false;
}

Use MySQL embedded and --local-infile=1 with C++?

I am connecting to a MySQL database using the embedded server (linking against libmysqld) in C++. I have the following code:
static char *server_options[] = {
    (char *)"mysql_test",
    (char *)"--datadir=/home/cquiros/temp/mysql/db2",
    (char *)"--default-storage-engine=MyISAM",
    (char *)"--loose-innodb=0",
    (char *)"--local-infile=1",
    (char *)"--skip-grant-tables=1",
    (char *)"--myisam-recover=FORCE",
    (char *)"--key_buffer_size=16777216",
    (char *)"--character-set-server=utf8",
    (char *)"--collation-server=utf8_bin",
    NULL };

int num_elements = (sizeof(server_options) / sizeof(char *)) - 1;
mysql_library_init(num_elements, server_options, NULL);

m_mysql = mysql_init(NULL);

char enable_load_infile = 1;
if (mysql_options(m_mysql, MYSQL_OPT_LOCAL_INFILE, (const char *)&(enable_load_infile)))
    qDebug() << "Error setting option";

mysql_real_connect(m_mysql, NULL, NULL, NULL, "database1", 0, NULL, 0);
The connection works, and I can query and create tables. However, when I try to execute "load data local infile ...", I always get "The used command is not allowed with this MySQL version", even though I am setting --local-infile=1 in the server options and also setting it in code:
char enable_load_infile = 1;
if (mysql_options(m_mysql,MYSQL_OPT_LOCAL_INFILE, (const char *)&(enable_load_infile)))
qDebug() << "Error setting option";
Any idea what I am doing wrong and how to fix it?
Many thanks for your help.
Carlos.
@QLands I realize it's over a year since you asked this question, but I figured I'd reply just for posterity in case others like me are googling for solutions.
I'm having the same issue: I can get LOAD DATA LOCAL INFILE statements to work from the Linux mysql CLI after I explicitly enabled it in the /etc/mysql/my.cnf file. However, I CANNOT get it to work with the MySQL C++ connector -- I also get the error "The used command is not allowed with this MySQL version" when I try to run a LOAD DATA LOCAL INFILE command through the MySQL C++ connector. WTF, right?
After much diligent googling and finding some back-alley tech support posts, I've come to the conclusion that the MySQL C++ connector (for whatever reason) does not expose the ability for developers to allow the local-infile=1 option. Apparently some people have been able to hack/fork the MySQL C++ connector to expose the functionality, but no one posted their source code -- they only said it worked. Apparently there is a workaround in the MySQL C API: after you initialize the connection you would use this:
mysql_options( &mysql, MYSQL_OPT_LOCAL_INFILE, 1 );
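Expanded into a rough, self-contained sketch against the plain MySQL C API (not Connector/C++). Note that MYSQL_OPT_LOCAL_INFILE actually takes a pointer to an unsigned int rather than a literal 1, and it must be set before connecting; the table name and CSV path below are placeholders:
// Sketch only: enable client-side LOCAL INFILE on a freshly mysql_init()'d
// handle, connect, and run LOAD DATA LOCAL INFILE.
// "my_table" and "/tmp/data.csv" are placeholder names.
#include <mysql.h>
#include <stdio.h>

bool load_local_csv(MYSQL *mysql)
{
    // must be called after mysql_init() and before mysql_real_connect()
    unsigned int enable_local_infile = 1;
    mysql_options(mysql, MYSQL_OPT_LOCAL_INFILE,
                  (const char *)&enable_local_infile);

    if (!mysql_real_connect(mysql, NULL, NULL, NULL, "database1", 0, NULL, 0)) {
        fprintf(stderr, "connect failed: %s\n", mysql_error(mysql));
        return false;
    }

    if (mysql_query(mysql,
                    "LOAD DATA LOCAL INFILE '/tmp/data.csv' "
                    "INTO TABLE my_table FIELDS TERMINATED BY ','")) {
        fprintf(stderr, "load failed: %s\n", mysql_error(mysql));
        return false;
    }
    return true;
}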
Here are some reference articles that lead me to this conclusion:
1.)
How can I get the native C API connection structure from MySQL Connector/C++?
2.)
Mysql 5.5 LOAD DATA INFILE Permissions
3.)
http://osdir.com/ml/db.mysql.c++/2004-04/msg00097.html
Essentially, if you want the ability to use the LOAD DATA LOCAL INFILE functionality, you have to use the MySQL C API, execute it from the command line, or hack/fork the existing MySQL C++ API to expose the connection structure.
:(

RWDBReader Cannot read more than 255 characters

We're using Rogue Wave tools for our database operations, writing in C++. When we try to read the results of a simple SQL query, like:
RWDBResult resParam = VimerParamTblSlc.execute(pConn);
RWDBTable resultParam = resParam.table();
RWDBReader rdrParam = resultParam.reader();
if (rdrParam())
{
    // getting the resulting row from the reader
}
If the result contains more than 255 characters, the reader (rdrParam) doesn't load the row at all; that is, it never passes the if condition.
Is there a way to set this character limit for reading? Thanks.
We learned that it was a version problem with the Sybase Adaptive Server and not RogueWave's fault. You need both Adaptive Server and Open Client at version 12.5 (or later).

Writing BLOB data to a SQL Server Database using ADO

I need to write a BLOB to a varbinary column in a SQL Server database. Sounds easy, except that I have to do it in C++. I've been using ADO for the database operations (first question: is this the best technology to use?). So I've got the _Stream object and a Recordset object created, and the rest of the operation falls apart from there. If someone could provide a sample of how exactly to perform this seemingly simple operation, that would be great! My binary data is stored in an unsigned char array. Here is the codenstein that I've stitched together from what little I found on the internet:
_RecordsetPtr updSet;
updSet.CreateInstance(__uuidof(Recordset));
updSet->Open("SELECT TOP 1 * FROM [BShldPackets] Order by ChunkId desc",
             _conPtr.GetInterfacePtr(), adOpenDynamic, adLockOptimistic, adCmdText);

_StreamPtr pStream;                       // declare one first
pStream.CreateInstance(__uuidof(Stream)); // create it after

_variant_t varRecordset(updSet);
//pStream->Open(varRecordset, adModeReadWrite, adOpenStreamFromRecord, _bstr_t("n"), _bstr_t("n"));

_variant_t varOptional(DISP_E_PARAMNOTFOUND, VT_ERROR);
pStream->Open(varOptional, adModeUnknown, adOpenStreamUnspecified, _bstr_t(""), _bstr_t(""));

_variant_t bytes(_compressStreamBuffer);
pStream->Write(_compressStreamBuffer);

updSet.GetInterfacePtr()->Fields->GetItem("Chunk")->Value = pStream->Read(1000);
updSet.GetInterfacePtr()->Update();
pStream->Close();
As far as ADO being the best technology in this case ... I'm not really sure. I personally think using ADO from C++ is a painful process. But it is pretty generic if you need that. I don't have a working example of using streams to write data at that level (although, somewhat ironically, I have code that I wrote using streams at the OLE DB level. However, that increases the pain level many times).
If, though, your data is always going to be loaded entirely in memory, I think using AppendChunk would be a simpler route:
ret = updSet.GetInterfacePtr()->Fields->Item["Chunk"]->AppendChunk( L"some data" );
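Since the question's payload is an unsigned char buffer rather than a string, the value handed to AppendChunk would normally be a VT_ARRAY | VT_UI1 variant. A rough sketch of building one, assuming the same #import'ed ADO smart pointers as above (MakeBlobVariant, buf and bufLen are made-up names; error handling omitted):
#include <windows.h>
#include <comdef.h>   // _variant_t
#include <cstring>    // memcpy

// Wrap a raw byte buffer in a SAFEARRAY of VT_UI1 so it can be passed
// to Field::AppendChunk for a varbinary column.
_variant_t MakeBlobVariant(const unsigned char *buf, long bufLen)
{
    SAFEARRAY *psa = SafeArrayCreateVector(VT_UI1, 0, bufLen);
    void *pData = NULL;
    SafeArrayAccessData(psa, &pData);
    memcpy(pData, buf, bufLen);
    SafeArrayUnaccessData(psa);

    _variant_t var;
    var.vt = VT_ARRAY | VT_UI1;
    var.parray = psa;   // the variant now owns the SAFEARRAY
    return var;
}

// usage with the recordset from the question:
//   updSet->Fields->GetItem("Chunk")->AppendChunk(MakeBlobVariant(buf, bufLen));
//   updSet->Update();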