How does the Hadoop History Server work?

There are 2 properties within the configuration files that I am confused about:
The property yarn.nodemanager.remote-app-log-dir in yarn-site.xml:
a.) Does this property control where the logs of map/reduce tasks will be stored?
b.) Is this the responsibility of the Node Manager (NM)?
The property mapreduce.jobhistory.done-dir from mapred-site.xml:
a.) Are job-related files, like configurations etc., stored in this location?
b.) Is this the responsibility of the Application Master (AM)?
Does the History Server (HS) combine both of these pieces of information and show a consolidated view in the UI?

Assuming you have enabled log-aggregation,
1.a. This is the log-aggregation directory, usually on HDFS, where the NMs aggregate container logs.
1.b. Yes.
2.a. Yes.
2.b. No. The MR JobHistory Server does that, by deleting the JobSummary file and moving the other files from ${mapreduce.jobhistory.intermediate-done-dir} to ${mapreduce.jobhistory.done-dir}.
3. Yes. The MR JobHistory Server web UI includes job info (from ${mapreduce.jobhistory.done-dir}) and container logs (from ${yarn.nodemanager.remote-app-log-dir}).
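To make the two locations concrete, here is a minimal sketch of how you could inspect both sets of files on HDFS. The paths shown are only the common defaults (an assumption; they will differ if your cluster overrides these properties), and <user>/<application_id> are placeholders:
# Aggregated container logs written by the NMs (default yarn.nodemanager.remote-app-log-dir is /tmp/logs)
hdfs dfs -ls /tmp/logs/<user>/logs/<application_id>
# Job history files that the JobHistory Server has moved into its done-dir
hdfs dfs -ls /tmp/hadoop-yarn/staging/history/done
# Aggregated container logs can also be fetched directly with
yarn logs -applicationId <application_id>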


The Apache NiFi state is not persisted to ZooKeeper

According to the official Nifi documentation, the state allows Nifi processors to "resume from the place where it left off after NiFi is restarted. Additionally, it allows for a Processor to store some piece of information so that the Processor can access that information from all of the different nodes in the cluster".
If my understanding is correct, when we configure a ZooKeeper provider, the state will not be persisted locally; instead, the data will be sent to ZooKeeper.
I've explored the ZooKeeper znodes and could not find any data related to the state; all I can find is information about the Coordinator and Primary nodes. However, the local state directory is still being filled.
The configuration is very simple: I have 3 external ZK nodes and 3 NiFi instances.
Here is an excerpt of the nifi.properties file:
nifi.cluster.is.node=true
nifi.zookeeper.connect.string=zk-node1:2181,zk-node2:2181,zk-node3:2181
nifi.state.management.embedded.zookeeper.start=false
nifi.state.management.provider.cluster=zk-provider
And here is an excerpt of the state-management.xml file:
<cluster-provider>
<id>zk-provider</id>
<class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
<property name="Connect String">zk-node1:2181,zk-node2:2181,zk-node3:2181</property>
<property name="Root Node">/nifi</property>
<property name="Session Timeout">10 seconds</property>
<property name="Access Control">Open</property>
</cluster-provider>
When I ls ZooKeeper, I can see only 2 znodes: "components", but this znode is empty, and the "leaders" znode, which contains some data about the NiFi Coordinator and Primary nodes.
Also, when I explore the transaction logs, even after using some load-balanced connections, I cannot find anything related to the NiFi state.
Could somebody explain what data goes to ZooKeeper and why the local state directory is still filled even if we configure the zk provider?
Thanks.
It depends on the processor; in some cases it would never make sense to store cluster-wide state because it could never be picked up by another node. For example, with ListFile tracking a local directory, another node cannot access the same directory, so storing this state in ZK is not helpful.
There is always a local state provider backed by a write-ahead log in the state directory, and it is up to the processor to say whether the state should be cluster or local scoped when storing it.
The documentation for each processor should say how the state is stored. For example, from ListFile:
@Stateful(scopes = {Scope.LOCAL, Scope.CLUSTER}, description = "After performing a listing of files, the timestamp of the newest file is stored. "
+ "This allows the Processor to list only files that have been added or modified after "
+ "this date the next time that the Processor is run. Whether the state is stored with a Local or Cluster scope depends on the value of the "
+ "<Input Directory Location> property.")
If Input Directory Location is "remote" then it will use cluster state, otherwise local state.
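As a quick sanity check, and assuming a processor that actually stores CLUSTER-scoped state (for example ListFile with Input Directory Location set to "Remote"), its state should appear under the configured Root Node in ZooKeeper. A rough sketch with zkCli, where the component UUID is a placeholder:
./zkCli.sh -server zk-node1:2181
ls /nifi/components
get /nifi/components/<processor-uuid>
If that znode stays empty, the processors in your flow are most likely only writing LOCAL-scoped state, which ends up in the node's local state directory as described above.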

Where is input temporarily stored during "bq load .. localfile.csv"?

The gcloud-sdk command "bq load" can take a local file as input.
From the output of the command, it looks like the file is first uploaded to Google Cloud Storage somewhere before the BigQuery load job is scheduled. Given that the BigQuery REST API endpoint for scheduling a load job also takes only "gs://" URLs, and that the load job needs the data to be reachable, I am pretty sure that such an upload to Cloud Storage is taking place (though I can't find any documentation that explicitly describes "bq load" with local files).
My question then is: can someone tell me where the local file is temporarily uploaded to? Is it one of the gcloud project cloud-storage buckets, or somewhere else? Is it guaranteed to be deleted after the load-job completes?
I have a requirement for data to be kept only in a specific geographical region, thus the location of the (presumed) temporary storage is significant.
I could upload the data explicitly to Cloud Storage, then use "bq load" with a reference to the Cloud Storage object, but then I need to arrange deletion of the data afterwards, which is a minor inconvenience. A dedicated bucket with a "lifecycle rule" could at least delete it after 1 day, but the "bq load .. localfile" approach is cleaner.
If you run bq --help you can see how one of the global bq_flags is --location. It is defined as follows:
--location: “Default geographic location to use when creating datasets or determining where jobs should run (Ignored when not
applicable.)”
If you run:
bq load --location=eu {your-table} {your-source}
for a dataset located in the EU, the job should succeed and all related jobs should run in the EU.
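As for the explicit-upload workaround mentioned in the question, a minimal sketch could look like this, assuming an EU-located staging bucket; the bucket, dataset and table names are placeholders:
gsutil mb -l EU gs://my-bq-staging-bucket
gsutil cp localfile.csv gs://my-bq-staging-bucket/
bq load --location=eu --autodetect mydataset.mytable gs://my-bq-staging-bucket/localfile.csv
gsutil rm gs://my-bq-staging-bucket/localfile.csv   # or let a bucket lifecycle rule delete it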

Pointing multiple projects' log sinks to one bucket

I have a few GCP projects with log sinks to different storage buckets. I'd like to combine them into a single bucket. But the stackdriver export doesn't add any distinguishing information to the object names it creates; they all look like cloudaudit.googleapis.com/activity/2017/11/14/00:00:00_00:59:59_S0.json
What will happen if I start pushing them all to a single bucket? Will the different project sinks overwrite each other's objects? Is there any way to distinguish which project created the logs just from the object?
If not, I guess I should switch to pubsub sinks, and then write some code that produces objects with more desirable names. Are there any established patterns or examples for doing this?
Update: I filed https://issuetracker.google.com/issues/69371200 for this issue.
To enable this, just select custom destination on the sink and point to the bucket with this format: storage.googleapis.com/[BUCKET_ID].
I've just enabled this in a couple of my projects, as I'm curious to see the results when exporting to a bucket. However, I have been using a single BQ sink for all my projects, and the tables created have all the logs mixed, so no logs are lost when using a single BQ sink.
I'm assuming a GCS sink will work in the same way, but I'll tell you in a couple of days.
If a single bucket sink does not work, you can always use a single BQ sink (that will help in analyzing the logs), and when you no longer want to have them in BQ, export them and store the files wherever you want.
Also, since you'll be writing to your sink constantly, you can't use nearline or coldline, so the storage pricing is better in BQ than a regional bucket (0.02 USD/GB in BQ vs somewhere between 0.02 and 0.35 USD/GB for regional storage, depending on the region; BQ has 10GB free monthly, GCS 5GB).
I would generally recommend using a BQ sink, but I'll tell you what happens with my bucket logs.
Update:
A few hours later, and I've verified that shared bucket sinks work pretty much as you would expect. It concatenates logs chronologically regardless of the project origin, and only creates a single file for each time window. Hope this helps! (I still prefer BQ as a log sink...)
Update 2:
For the behavior you seek in the feature request, I would use BQ, but you could just as easily grep the project ID and separate the logs:
grep '"logName":"projects/<your-project-id>/' mixed-log.json > single-project-log.json
Or just get a cloud function triggered by bucket updates (so, every time you receive a log file in the sink) to run this for you.
Or namespace your buckets and have a Cloud Function move them to wherever you need as soon as they are written.
The possibilities are endless!
If you have an organization or folder which includes all the projects that you want to collect logs from, then you can create a sink that collects from all projects in that org/folder.
Unfortunately, you cannot do this from the Cloud Console. Instead you must use gcloud with the --organization or --folder option, or the API.
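For example, an aggregated sink created with gcloud might look roughly like this; the sink name, bucket, organization ID and filter are placeholders to adapt to your setup:
gcloud logging sinks create my-aggregated-sink \
    storage.googleapis.com/my-central-log-bucket \
    --organization=123456789012 --include-children \
    --log-filter='logName:"cloudaudit.googleapis.com"'
After creating the sink, remember to grant its writer identity permission to write objects to the destination bucket.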

Is it possible to edit configuration nodes in a Node-Red flow?

In Node-Red, I'm using some Amazon Web Services nodes (from module node-red-node-aws), and I would like to read some configuration settings from a file (e.g. the access key ID & the secret key for the S3 nodes), but I can't find a way to set everything up dynamically, as this configuration has to be made in a config node, which can't be used in a flow.
Is there a way to do this in Node-Red?
Thanks!
Unless a node implementation specifically allows for dynamic configuration, this is not something that Node-RED does generically.
One approach I have seen is to have a flow update itself using the admin REST API exposed by the runtime - see https://nodered.org/docs/api/admin/methods/post/flows/
That requires you to first GET the current flow configuration, modify the flow definition with the desired values and then POST it back.
That approach is not suitable in all cases; the config node still only has a single active configuration.
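As a rough sketch of that round-trip with curl (the host, port and absence of authentication are assumptions; adjust them to your admin API settings):
curl -s http://localhost:1880/flows > flows.json
# edit flows.json here to set the desired config-node properties
curl -s -X POST -H "Content-Type: application/json" \
    -H "Node-RED-Deployment-Type: full" \
    --data @flows.json http://localhost:1880/flows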
Another approach, if the configuration is statically held in a file, is to insert it into your flow configuration before starting Node-RED - i.e., have a placeholder config node in the flow that you insert the credentials into.
Finally, you can use environment variables: if you set the configuration node's property to be something like $(MY_AWS_CREDS), then the runtime will substitute that environment variable on start-up.
You can update your package.json start script to start Node-RED with your desired credentials as environment variables:
"scripts": {
"start": "AWS_SECRET_ACCESS_KEY=<SECRET_KEY> AWS_ACCESS_KEY_ID=<KEY_ID> ./node_modules/.bin/node-red -s ./settings.js"
}
This worked perfectly for me when using the node-red-contrib-aws-dynamodb node. Just leave the credentials in the node blank and they get picked up from your environment variables.

Syncing secondary user store in WSO2 Identity Server cluster

I have setup the cluster for WSO2-IS (2 instances on different machines) based on the information provided here - https://docs.wso2.com/display/CLUSTER44x/WSO2+Clustering+and+Deployment+Guide
Setup DB with a user store, shared registry, 2 local registries
Copied the DB driver jar to component lib
Updated the master-datasource.xml
Updated the registry.xml (made sure the master is read-only false and worker is read-only true)
Updated the AXIS2.xml and used WKA for membership scheme
Performed other changes as suggested in the link
Started the master with -Dsetup option and the worker without -Dsetup option.
Verified that the governance folder is shown as a symlink
I can see the interaction between both nodes; there are Hazelcast messages related to the node joining when the worker is started.
A user created in one node is able to log in to the other instance, and service providers are also automatically available when viewed through the UI.
The problem is that when I create a secondary user store (JDBC) in the first node and go to the list in the second node, the secondary user store is not present and I cannot view its users in the user list either.
Am I missing something, or is this the way the cluster is supposed to behave, i.e. do secondary user stores have to be shared in some other way?
Thanks,
Vikas
Secondary user store configurations are not synced between the two nodes by default. Once you create a secondary user store from the UI, it will create a file in the following location:
[WSO2_IS]/repository/deployment/server/userstores/
This configuration file needs to be copied manually, or you have to use some synchronization mechanism to copy it to the other node. Since this is not a frequent task, it is better to just copy the file, for example as sketched below.
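A minimal sketch of that manual copy, assuming the secondary store was named MYSTORE and the second node is reachable over SSH (the host, user and paths are placeholders):
scp [WSO2_IS]/repository/deployment/server/userstores/MYSTORE.xml \
    wso2user@node2:[WSO2_IS]/repository/deployment/server/userstores/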
For more information:
https://docs.wso2.com/display/IS500/Configuring+Secondary+User+Stores