Jar containing org.apache.hadoop.hive.dynamodb (MapReduce)

I was trying to programmatically load a DynamoDB table into HDFS (via Java, not Hive). I couldn't find examples online on how to do it, so I thought I'd download the jar containing org.apache.hadoop.hive.dynamodb and reverse engineer the process.
Unfortunately, I couldn't find that jar either :(.
Could someone answer the following questions for me (listed in order of priority)?
1. A Java example that loads a DynamoDB table into HDFS (that can be passed to a mapper as a table input format).
2. The jar containing org.apache.hadoop.hive.dynamodb.
Thanks!

It's in hive-bigbird-handler.jar. Unfortunately, AWS doesn't provide any source or even Javadoc for it, but you can find the jar on any node of an EMR cluster:
/home/hadoop/.versions/hive-0.8.1/auxlib/hive-bigbird-handler-0.8.1.jar
You might want to check out this article:
Amazon DynamoDB Part III: MapReducin’ Logs
Unfortunately, Amazon haven’t released the sources for
hive-bigbird-handler.jar, which is a shame considering its usefulness.
Of particular note, it seems it also includes built-in support for
Hadoop's Input and Output formats, so one can write straight
MapReduce jobs, writing directly into DynamoDB.
Tip: search for hive-bigbird-handler.jar to get to the interesting parts... ;-)

1- I am not aware of any such example, but you might find this library useful. It provides InputFormats, OutputFormats, and Writable classes for reading and writing data to Amazon DynamoDB tables (see the sketch below for how such an InputFormat might be wired into a job).
2- I don't think they have made it available publicly.
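To give a rough idea of what question 1 might look like in code, here is a minimal, hedged sketch of a MapReduce job that reads from a DynamoDB-backed InputFormat and dumps items to HDFS as text. Only the org.apache.hadoop.* classes are standard Hadoop API; the DynamoDB-specific class names (DynamoDBInputFormat, DynamoDBItemWritable) and the configuration keys are placeholders for whatever the handler jar or library you end up using actually provides.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class DynamoDbToHdfs {

      // Mapper that simply serializes each DynamoDB item to one line of text.
      // DynamoDBItemWritable is a placeholder for the value class the jar provides.
      public static class DumpMapper
          extends Mapper<Text, DynamoDBItemWritable, NullWritable, Text> {
        @Override
        protected void map(Text key, DynamoDBItemWritable item, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(NullWritable.get(), new Text(item.toString()));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical keys: the real property names depend on the handler/library jar.
        conf.set("dynamodb.table.name", "MyTable");
        conf.set("dynamodb.region", "us-east-1");

        Job job = Job.getInstance(conf, "dynamodb-to-hdfs");
        job.setJarByClass(DynamoDbToHdfs.class);
        job.setInputFormatClass(DynamoDBInputFormat.class); // placeholder from the jar
        job.setMapperClass(DumpMapper.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/mytable-dump"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

You would run this with hadoop jar, with the handler/library jar on the job classpath (for example via -libjars).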

Related

GKE usage metering template requests only one data source although the documentation says three data sources are required

I'm trying to visualize GKE usage metering data using a Data Studio dashboard following the official document.
https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-usage-metering#view_in_data-studio
It says
We created a dashboard you can copy into your project. When you copy the dashboard, you are prompted to select three data sources you just created.
I guess the three data sources are a data source created from the cost breakdown table, gke_cluster_resource_consumption, and gke_cluster_resource_usage.
However, when creating a copy from the Data Studio template, I can choose only one data source.
Am I missing something?
I think the Google documentation is outdated.
If you follow the documentation, at the point where you copy the dashboard it only asks you to select one data source (the cost breakdown table). What happens when you select this source? Maybe the error is only in the word "three".
The other two data sources you mentioned are cited in another part of the document, so I guess they serve another purpose. Google's dashboard has other linked data sources, but they're not used.
So, apparently, you only need one data source to make this dashboard work.
If that doesn't work, I'd say you're out of luck. Try asking Google in a community forum to fix their documentation.

How to write Parquet files on HDFS using C++?

I need to write in-memory data records to an HDFS file in Parquet format using C++. I know there is a parquet-cpp library on GitHub, but I can't find example code.
Could anybody share a copy of, or a link to, some example code? Thanks.
There are examples for parquet-cpp in the GitHub repo, in the examples directory. They only deal with Parquet, though, and do not involve HDFS access.
For HDFS access from C++, you will need libhdfs from Apache Hadoop. Or you may use Apache Arrow, which has HDFS integration, as described here.

Foswiki: Uploading and downloading topics without FTP

I have a Foswiki wiki on a server. Is it possible to script the following without FTP access (for various reasons I can't use it):
Download a topic's wikitext, modify it locally, then upload it again (overwriting the topic)
Upload wikitext to a new topic
I've been doing these tasks manually, but I'd like to automate them. I've looked into the Foswiki API and a few plugins, but nothing seems capable of doing this.
Is there a way? (any programming language)
If you have web access, you could drive the bin/view and bin/save scripts remotely from a script.
Take a look at our BuildContrib upload target for an example. It gets a strikeone key and downloads the original topic to recover any form data. It then uploads the topic text, creating a new version. It's written in Perl and uses LWP.
https://github.com/foswiki/distro/blob/master/BuildContrib/lib/Foswiki/Contrib/BuildContrib/Targets/upload.pm
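Since the question allows any language, here is a minimal sketch in Java (11+) of the same flow over plain HTTP: fetch the topic's wikitext via the view script, edit it, and POST it back to the save script. The base URL is made up, and a real script also has to log in and supply the strikeone/validation key exactly as the upload.pm target above does; the raw=text query and the text form field are assumptions to verify against your Foswiki version.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class FoswikiTopicSync {
      // Hypothetical base URL of the wiki's script directory.
      private static final String BASE = "https://wiki.example.com/bin";

      public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Download the raw wikitext of a topic via the view script.
        HttpRequest get = HttpRequest.newBuilder(
            URI.create(BASE + "/view/Sandbox/MyTopic?raw=text")).GET().build();
        String wikitext = client.send(get, HttpResponse.BodyHandlers.ofString()).body();

        // 2. Modify it locally.
        String updated = wikitext + "\n   * Set EDITED = yes\n";

        // 3. Post it back to the save script (creates a new revision).
        //    A real script must also authenticate and pass the validation key.
        String form = "text=" + URLEncoder.encode(updated, StandardCharsets.UTF_8);
        HttpRequest post = HttpRequest.newBuilder(URI.create(BASE + "/save/Sandbox/MyTopic"))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(form))
            .build();
        HttpResponse<String> resp = client.send(post, HttpResponse.BodyHandlers.ofString());
        System.out.println("Save returned HTTP " + resp.statusCode());
      }
    }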
The following isn't(!) the right solution (surely a nicer Foswiki-way approach exists), but if you know Perl, you can do almost anything with the following:
Install Firefox
Install the MozRepl add-on into it
Install the WWW::Mechanize::Firefox Perl module
Now you can script anything you could do directly from the browser, e.g. logging into Foswiki, clicking buttons, saving topics, etc. Drawback: it isn't an easy approach, and you need to know many details.
I use this technique myself for testing.

Installing the Kmeans PostgreSQL extension on Amazon RDS

I'm working on a Django project, and we use geo data (with GeoDjango).
I have installed PostGIS as described in the AWS docs.
We have a lot of points (markers) on the map, and we need to cluster them.
I found a library, anycluster. It requires the PostgreSQL extension kmeans-postgresql to be installed on the database.
But my database is located on Amazon RDS, and I can't connect to it by SSH in order to install an extension...
Does anybody know how I can install the kmeans-postgresql extension on my Amazon RDS database?
Or maybe you can suggest other ways of clustering?
K-means is a computationally expensive algorithm that is useful for data mining and cluster analysis (you can read more about it on the Wikipedia page: https://en.wikipedia.org/wiki/K-means_clustering), and it gets costly when it has to deal with many points. The K-means extension for PostgreSQL (http://pgxn.org/dist/kmeans/doc/kmeans.html) is written in C and compiled on the database machine, which gives it better performance than a procedure in PL/pgSQL. Unfortunately, as @estevao_lucas answered, this extension is not enabled on Amazon RDS.
If you really need the k-means result, I translated an implementation of it created by Joni Salonen (http://jonisalonen.com/2012/k-means-clustering-in-mysql/) into PL/pgSQL: https://gist.github.com/thiagomata/a9737c3455d6248bef9f. This function uses a temporary table; it is possible to change it to use only arrays of pins, if you want to.
But if you only need to show some pins on a map, you will probably be happy with a much faster and simpler function that groups the results into an [x,y] grid. I created such a function because the k-means function was taking too much time to process my database (well over 400K elements). This implementation is much faster, but it does not have all the features you would expect from the k-means module. That said, this grid function (https://gist.github.com/thiagomata/18ea14853998468c1a1d) returns very good results when the goal is to show a large number of pins on a map.
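To make the grid idea concrete outside the database (the gists above implement it in PL/pgSQL), here is an illustrative sketch: split the map's bounding box into an x-by-y grid, drop each pin into a cell, and return one centroid plus a count per non-empty cell. All names and types here are made up for the example.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class GridCluster {
      public record Point(double lon, double lat) {}
      public record Cluster(double lon, double lat, int count) {}

      // Bucket points into a cellsX-by-cellsY grid over the bounding box and
      // return one centroid + count per non-empty cell.
      public static List<Cluster> cluster(List<Point> points,
          double minLon, double minLat, double maxLon, double maxLat,
          int cellsX, int cellsY) {
        double cellW = (maxLon - minLon) / cellsX;
        double cellH = (maxLat - minLat) / cellsY;

        // key -> {sumLon, sumLat, count}
        Map<Long, double[]> cells = new HashMap<>();
        for (Point p : points) {
          int cx = Math.min(cellsX - 1, (int) ((p.lon() - minLon) / cellW));
          int cy = Math.min(cellsY - 1, (int) ((p.lat() - minLat) / cellH));
          double[] acc = cells.computeIfAbsent((long) cx * cellsY + cy, k -> new double[3]);
          acc[0] += p.lon();
          acc[1] += p.lat();
          acc[2] += 1;
        }

        List<Cluster> result = new ArrayList<>();
        for (double[] acc : cells.values()) {
          result.add(new Cluster(acc[0] / acc[2], acc[1] / acc[2], (int) acc[2]));
        }
        return result;
      }
    }

The pass is a single loop over the pins, which is why this kind of grid grouping stays fast even for hundreds of thousands of points.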
You can only install supported extensions on Amazon RDS, and kmeans isn't one of them:
ERROR: Extension "kmeans" is not supported by Amazon RDS
DETAIL: Installing the extension "kmeans" failed, because it is not on the list of extensions supported by Amazon RDS.
HINT: Amazon RDS allows users with rds_superuser role to install supported extensions. See: SHOW rds.extensions;
alexandria_development=> SHOW rds.extensions
RDS extensions:
btree_gin,
btree_gist,
chkpass,
citext,
cube,
dblink,
dict_int,
dict_xsyn,
earthdistance,
fuzzystrmatch,
hstore,
intagg,
intarray,
isn,
ltree,
pgcrypto,
pgrowlocks,
pg_prewarm,
pg_stat_statements,
pg_trgm,
plcoffee,
plls,
plperl,
plpgsql,
pltcl,
plv8,
postgis,
postgis_tiger_geocoder,
postgis_topology,
postgres_fdw,
sslinfo,
tablefunc,
test_parser,
tsearch2,
unaccent,
uuid-ossp

How to monitor an FTP upload directory in coldfusion without using event gateways?

Having spent a couple of hours coding an event gateway solution, I discovered that they are not supported by CF Standard Edition. Buggerit! So back to the drawing board.
I can see how to check the folder's dateLastModified attribute using cfdirectory, and so I can run a scheduled task to see when a new file has been uploaded, but what's the best way of storing/comparing the file list so as to get a list of just the ones added since the last check?
General hints/links appreciated.
Assuming that, for whatever reason, you can't use a gateway, the simplest solution that springs to mind is to move files you've processed to a separate directory. Then your scheduled task only has to deal with files in the FTP directory itself.
"they are not supported by CF standard edition"
Are you still using CF7? Event gateways have been supported in CF Standard Edition since CF8.
As @Henry pointed out, you can use an Event Gateway.
If you decide not to use that approach, I'd suggest a ColdFusion scheduled task. The most foolproof algorithm for that task is storing the results of the last <cfdirectory/> call, either in a persistent scope (application or server) or written out to a database or file (e.g. WDDX). The reason to hold on to all this information, rather than just a timestamp, is to handle situations where newly added or changed files do not take on the correct timestamp for whatever reason (a skewed system clock comes to mind).
If you use a database to capture the data, you could use a MINUS/EXCEPT query in Oracle or SQL Server, respectively, to determine what's new. Otherwise you'll need to perform some nested looping in ColdFusion over the old and new queries to generate the list of new files.
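To illustrate the snapshot-diff idea in a language-neutral way (in CFML you would do the same looping over the old and new <cfdirectory/> queries, or the equivalent SQL), here is a small sketch; the directory path is a placeholder.

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    public class FtpDirWatcher {
      // Previous snapshot: file name -> lastModified timestamp.
      private final Map<String, Long> lastSnapshot = new HashMap<>();

      // Returns the files that are new or changed since the previous call,
      // then replaces the stored snapshot with the current listing.
      public Map<String, Long> findNewOrChanged(String dir) {
        Map<String, Long> current = new HashMap<>();
        File[] files = new File(dir).listFiles();
        if (files != null) {
          for (File f : files) {
            if (f.isFile()) {
              current.put(f.getName(), f.lastModified());
            }
          }
        }

        Map<String, Long> added = new HashMap<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
          Long previous = lastSnapshot.get(e.getKey());
          if (previous == null || !previous.equals(e.getValue())) {
            added.put(e.getKey(), e.getValue());
          }
        }

        lastSnapshot.clear();
        lastSnapshot.putAll(current);
        return added;
      }

      public static void main(String[] args) {
        FtpDirWatcher watcher = new FtpDirWatcher();
        System.out.println(watcher.findNewOrChanged("/ftp/uploads")); // placeholder path
      }
    }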