change ACL policy to XACML - mapreduce

I'm trying to test a security method in MapReduce and I'm wondering if my approach makes sense.
I would like to transform the access control list (ACL) policy that exists in MapReduce into an XACML policy. To do that, I take the file where the ACL is defined, copy the name and value of each property, and put them in a policy following the XACML format.
This is the ACL definition:
<property>
<name>mapreduce.job.acl-modify-job</name>
<value>user </value>
</property>
<property>
<name>mapreduce.job.acl-view-job</name>
<value>user </value>
</property>
This is the policy in XACML:
<Policy PolicyId="GeneratedPolicy" RuleCombiningAlgId="urn:oasis:names:tc:xacml:1.0:rule-combining-algorithm:ordered-permit-overrides">
<Target>
<Subjects>
<Subject>
<SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">user </AttributeValue>
<SubjectAttributeDesignator AttributeId="urn:oasis:names:tc:xacml:1.0:subject:subject-id" DataType="http://www.w3.org/2001/XMLSchema#string"/>
</SubjectMatch>
</Subject>
</Subjects>
<Resources>
<AnyResource/>
</Resources>
</Target>
<Rule RuleId="rule1" Effect="Permit">
<Target>
<Actions>
<Action>
<ActionMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">mapreduce.job.acl-view-job</AttributeValue>
<ActionAttributeDesignator AttributeId="urn:oasis:names:tc:xacml:1.0:action:action-id" DataType="http://www.w3.org/2001/XMLSchema#string"/>
</ActionMatch>
</Action>
</Actions>
</Target>
</Rule>
<Rule RuleId="rule2" Effect="Deny"/>
</Policy>
Is this considered correct?

A couple of comments on your policy:
It uses XACML 2.0. That's old! Switch to XACML 3.0.
You have a whitespace in the value <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">user </AttributeValue>. Get rid of it (unless you really mean to test on 'user ').
Your policy contains two rules:
the first one grants access if urn:oasis:names:tc:xacml:1.0:action:action-id == mapreduce.job.acl-view-job
the second one always denies access. I assume the intent is to deny access if no action matched. That's fine; I often call that a "catch-all" or safety harness. There is another way of achieving this by using a combining algorithm on the policy called deny-unless-permit: if none of the rules apply, the policy yields Deny. This only exists in XACML 3.0 (see the sketch after these comments).
Your policy uses a combining algorithm called permit-overrides (urn:oasis:names:tc:xacml:1.0:rule-combining-algorithm:ordered-permit-overrides). Generally I avoid using it because it means that in the case of a Deny and a Permit, Permit wins. That's too permissive for my liking. Use first-applicable (urn:oasis:names:tc:xacml:1.0:rule-combining-algorithm:first-applicable) instead. You can read up on combining algorithms here.
Ultimately, to make your policy scale, you may want to externalize the list of users rather than have a value for each user inside the policy. So rather than comparing your username to Alice or Bob or Carol, you would compare to an attribute called allowedUsers which you'd maintain inside a database for instance.
Another tip: you could make your policy easier to understand and more scalable if you split the value mapreduce.job.acl-view-job into its different parts (appName="mapreduce"; objectType="job"; action="view"). That would let you write policies about viewing, editing, and deleting jobs more easily.
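To make the deny-unless-permit suggestion concrete, here is a minimal XACML 3.0 sketch of the view-job rule. The policy and rule identifiers are illustrative, and I dropped the trailing whitespace in the subject value as suggested above:
<Policy xmlns="urn:oasis:names:tc:xacml:3.0:core:schema:wd-17"
        PolicyId="GeneratedPolicy"
        Version="1.0"
        RuleCombiningAlgId="urn:oasis:names:tc:xacml:3.0:rule-combining-algorithm:deny-unless-permit">
  <Target/>
  <!-- Permit when subject-id == "user" and action-id == "mapreduce.job.acl-view-job" -->
  <Rule RuleId="permit-view-job" Effect="Permit">
    <Target>
      <AnyOf>
        <AllOf>
          <Match MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
            <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">user</AttributeValue>
            <AttributeDesignator Category="urn:oasis:names:tc:xacml:1.0:subject-category:access-subject"
                                 AttributeId="urn:oasis:names:tc:xacml:1.0:subject:subject-id"
                                 DataType="http://www.w3.org/2001/XMLSchema#string"
                                 MustBePresent="false"/>
          </Match>
          <Match MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
            <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">mapreduce.job.acl-view-job</AttributeValue>
            <AttributeDesignator Category="urn:oasis:names:tc:xacml:3.0:attribute-category:action"
                                 AttributeId="urn:oasis:names:tc:xacml:1.0:action:action-id"
                                 DataType="http://www.w3.org/2001/XMLSchema#string"
                                 MustBePresent="false"/>
          </Match>
        </AllOf>
      </AnyOf>
    </Target>
  </Rule>
</Policy>
With deny-unless-permit as the rule-combining algorithm, the empty catch-all Deny rule is no longer needed: any request the Permit rule does not match is denied.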

Related

S3 Lifecycle Policy Delete All Objects WITHOUT A Certain Tag Value

While reading over this S3 Lifecycle Policy document, I see that it's possible to delete an S3 object containing a particular key=value tag pair, e.g.:
<LifecycleConfiguration>
<Rule>
<Filter>
<Tag>
<Key>key</Key>
<Value>value</Value>
</Tag>
</Filter>
transition/expiration actions.
...
</Rule>
</LifecycleConfiguration>
But is it possible to create a similar rule that deletes any object NOT in the key=value pair? For example, any time my object is accessed I could update its tag with the current date, e.g., object-last-accessed=07-26-2019. Then I could create a Lambda function that deletes the current S3 lifecycle policy each day and creates a new lifecycle policy that has a tag for each of the last 30 days. My lifecycle policy would then automatically delete any object that has not been accessed in the last 30 days; anything that was accessed more than 30 days ago would have a date value older than any value inside the lifecycle policy and hence would get deleted.
Here's an example of what I desire (note that I added the desired field <exclude>):
<LifecycleConfiguration>
<Rule>
<Filter>
<exclude>
<Tag>
<Key>last-accessed</Key>
<Value>07-30-2019</Value>
</Tag>
...
<Tag>
<Key>last-accessed</Key>
<Value>07-01-2019</Value>
</Tag>
</exclude>
</Filter>
transition/expiration actions.
...
</Rule>
</LifecycleConfiguration>
Is something like my made-up <exclude> element possible? I want to delete any S3 object that has not been accessed in the last 30 days (that's different from an object that is older than 30 days).
From what I understand, this is possible but via a different mechanism.
My solution is to take a slightly different approach and set a tag on every object and then alter that tag as you need.
So in your instance, when the object is created, set object-last-accessed to "default"; do that through an S3 trigger to a piece of Lambda, or when the object is written to S3.
When the object is accessed, then update the tag value to the current date.
If you already have a bucket full of objects, you can use S3 Batch to set the tag to the current date and use that as a delta reference point from which to assume the files were last accessed:
https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObjectTagging.html
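For reference, the request body that PutObjectTagging expects looks roughly like this (the tag key and date value are just the ones used in this example; note that PutObjectTagging replaces the object's whole tag set, so include any other tags you want to keep):
<Tagging xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <TagSet>
    <!-- full replacement tag set for the object -->
    <Tag>
      <Key>object-last-accessed</Key>
      <Value>2020-05-19</Value>
    </Tag>
  </TagSet>
</Tagging>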
Now set the lifecycle rule to remove objects with a tag of "default" after 10 days (or whatever you want).
Add additional rules to remove files whose tag is a specific date, 10 days after that date. You will need to update the lifecycle rules periodically, but you can create 1000 at a time.
This doc gives details of the format for a rule:
https://docs.aws.amazon.com/AmazonS3/latest/API/API_LifecycleRule.html
I'd suggest something like this:
<LifecycleConfiguration>
<Rule>
<ID>LastAccessed Default Rule</ID>
<Filter>
<Tag>
<Key>object-last-accessed</Key>
<Value>default</Value>
</Tag>
</Filter>
<Status>Enabled</Status>
<Expiration>
<Days>10</Days>
</Expiration>
</Rule>
<Rule>
<ID>Last Accessed 2020-05-19 Rule</ID>
<Filter>
<Tag>
<Key>object-last-accessed</Key>
<Value>2020-05-19</Value>
</Tag>
</Filter>
<Status>Enabled</Status>
<Expiration>
<Date>2020-05-29</Date>
</Expiration>
</Rule>
</LifecycleConfiguration>
Reading further on this, as I'm faced with this problem, an alternative is to just use Object Lock retention mode, which allows you to set a default retention on a bucket and then change that retention period as the file is accessed. This works at a version level, i.e. each version is retained for a period, not the whole file, so it may not be suitable for everyone. More details at https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lock-overview.html#object-lock-retention-modes
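If you go the Object Lock route, extending the retention as a file is accessed means re-issuing PutObjectRetention with a later date; a rough sketch of that request body (the mode and date here are only illustrative) is:
<Retention xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <!-- GOVERNANCE or COMPLIANCE; the date must be an ISO 8601 timestamp -->
  <Mode>GOVERNANCE</Mode>
  <RetainUntilDate>2020-06-30T00:00:00Z</RetainUntilDate>
</Retention>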

Accessing GCS with Hadoop client from outside of Cloud

I want to access Google Cloud Storage via the Hadoop client, and I want to use it on a machine outside of Google Cloud.
I followed instructions from here.
I created a service account and generated a key file. I also created a core-site.xml file and downloaded the necessary library.
However, when I try to run a simple hdfs dfs -ls gs://bucket-name command, all I get is this:
Error getting access token from metadata server at: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
When I do this inside Google Cloud it works, but when I try to connect to GCS from outside, it shows the error above.
How do I connect to GCS with the Hadoop client in this way? Is it even possible? I have no route to the 169.254.169.254 address.
Here is my core-site.xml (I changed the key path and email in this example):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>spark.hadoop.google.cloud.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>spark.hadoop.google.cloud.auth.service.account.json.keyfile</name>
<value>path/to/key.json</value>
</property>
<property>
<name>fs.gs.project.id</name>
<value>ringgit-research</value>
<description>
Optional. Google Cloud Project ID with access to GCS buckets.
Required only for list buckets and create bucket operations.
</description>
</property>
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The AbstractFileSystem for gs: uris.</description>
</property>
<property>
<name>fs.gs.auth.service.account.email</name>
<value>myserviceaccountaddress#google</value>
<description>
The email address is associated with the service account used for GCS
access when fs.gs.auth.service.account.enable is true. Required
when authentication key specified in the Configuration file (Method 1)
or a PKCS12 certificate (Method 3) is being used.
</description>
</property>
</configuration>
It could be that the Hadoop services haven't picked up the updates made in your core-site.xml file yet, so my suggestion is to restart the Hadoop services. Another action that you can take is to check the access control options [1].
If you are still having the same issue after taking the suggested actions, please post the complete error message.
[1] https://cloud.google.com/storage/docs/access-control/
The problem was that I tried the wrong authentication method. The method I used assumes that it's running inside Google Cloud and tries to connect to the Google metadata servers. When running outside of Google Cloud, it doesn't work for obvious reasons.
The answer to this is here: Migrating 50TB data from local Hadoop cluster to Google Cloud Storage, with the proper core-site.xml in the selected answer.
The property fs.gs.auth.service.account.keyfile should be used instead of spark.hadoop.google.cloud.auth.service.account.json.keyfile. The only difference is that this property needs a P12 key file instead of a JSON one.
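For what it's worth, a minimal core-site.xml along the lines of that answer might look like the sketch below. Property names can vary between GCS connector versions, so treat this as an outline rather than a definitive config; the email and key path are placeholders:
<configuration>
  <!-- register the GCS connector for gs:// URIs -->
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
  <!-- service account authentication with a P12 key, no metadata server needed -->
  <property>
    <name>fs.gs.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.email</name>
    <value>myserviceaccountaddress#google</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.keyfile</name>
    <value>path/to/key.p12</value>
  </property>
</configuration>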

Multiple lifecycles s3cmd

I want to have multiple lifecycles for many folders in my bucket.
This seems easy if I use the web interface, but this should be an automated process, so (at least in my case) it must use s3cmd.
It works fine when I use:
s3cmd expire ...
But, somehow, every time I run this my last lifecycle gets overwritten.
There's an issue on GitHub:
https://github.com/s3tools/s3cmd/issues/863
My question is: is there another way?
You made me notice I had the exact same problem as you. Another way to access the expire rules with s3cmd is to show the lifecycle configuration of the bucket.
s3cmd getlifecycle s3://bucketname
This way you get some XML-formatted text:
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Rule>
<ID>RULEIDENTIFIER</ID>
<Prefix>PREFIX</Prefix>
<Status>Enabled</Status>
<Expiration>
<Days>NUMBEROFDAYS</Days>
</Expiration>
</Rule>
<Rule>
<ID>RULEIDENTIFIER2</ID>
<Prefix>PREFIX2</Prefix>
<Status>Enabled</Status>
<Expiration>
<Days>NUMBEROFDAYS2</Days>
</Expiration>
</Rule>
</LifecycleConfiguration>
If you put that text in a file, changing the appropriate fields (put in identifiers of your choice, set the prefixes you want and the number of days until expiration), you can now use the following command (changing FILE to the path where you put the rules):
s3cmd setlifecycle FILE s3://bucketname
That should work (in my case, now I see several rules when I execute the getlifecycle command, although I do not know yet if the objects actually expire or not).

WSO2 ESB EI611 VFS ActionAfterProcess & ActionAfterFailure - options

The requirement is not to move or delete the files after copying them to a different folder: leave them as they are after copying, and pick up only the latest files.
<parameter name="transport.vfs.ActionAfterFailure">MOVE</parameter>
<parameter name="transport.vfs.ActionAfterProcess">MOVE</parameter>
<parameter name="transport.vfs.ActionAfterFailure">DELETE</parameter>
<parameter name="transport.vfs.ActionAfterProcess">DELETE</parameter>
For an inbound endpoint with protocol="file", the above parameters and the options MOVE & DELETE are allowed. How do I add the option of NO ACTION?
If this option (NO ACTION) is not possible with an inbound endpoint, can we use a proxy service with transports="vfs" and use the no-action option? What's the syntax?
The WSO2 documentation says that no action is possible as a third option, but there's no syntax or format for it. The inbound endpoint IDE properties support only MOVE or DELETE. A proxy-service parameter is a name-value pair.
Not sure if it works in EI6, but in ESB 4.8.1 you can do it like the following:
<parameter name="transport.vfs.ActionAfterProcess">NONE</parameter>
Thanks, I kinda expected that WSO2 purposely kept MOVE & DELETE as the only options, to avoid redundancy. Otherwise, the behavior of file polling would be erroneous. That's why they dropped "NONE", maybe to avoid picking up old files or files already existing in the folder. But this should have been clear in the documentation. God, their docs are killing me.

Submitting the same Oozie workflow job multiple times at the same time

I wonder how Oozie handles conflicts (if they really exist) when I submit two identical workflow jobs (just the Oozie sample example) at the same time.
I can submit the same job twice successfully, and the Oozie server returns two different job IDs. In the Oozie Web Console, I saw the status of both jobs as RUNNING, then SUCCEEDED after some time.
My workflow.xml is as follows:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
<start to="mr-node"/>
<action name="mr-node">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/mapreduce_test/output-data"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>mapred.mapper.class</name>
<value>org.apache.oozie.example.SampleMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.apache.oozie.example.SampleReducer</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/${wf:user()}/mapreduce_test/input</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/${wf:user()}/mapreduce_test/output-data/</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
I know that deleting the output directory in the "prepare" element helps make the action repeatable and enables retries after failure, and I also understand the basic action run model.
So, my questions are:
Are the two identical jobs really running concurrently? (I saw both in the RUNNING state in the Oozie Web Console.)
Is there a write conflict? (The two identical jobs point to the same output directory.)
Oozie does not detect job duplication or anything like it. It accepts the workflow jobs, schedules them on the cluster for execution, and monitors them until completion or failure.
Are the two identical jobs really running concurrently? (I saw both in the RUNNING state in the Oozie Web Console.)
Yes. Both jobs will be running concurrently.
Is there a write conflict? (The two identical jobs point to the same output directory.)
Oozie does not have any checks related to write conflicts. I guess these are taken care of by either the MapReduce or HDFS framework.
As per your question:
1. Oozie schedules jobs on the cluster for execution until the end, with a status like SUCCEEDED/FAILED.
2. Both jobs will be running at the same time and will execute the same actions that have been defined.
To avoid this, you can perform the steps below, which may be helpful.
Oozie jobs start executing when job.properties or coordinator.properties is triggered, and the workflow runs according to the settings passed through job.xml/coordinator.xml.
So whenever a request is submitted, it makes a fresh entry in the
COORD_JOBS table for coordinators or the WF_JOBS table for workflows
in Oozie's metastore DB (which could be Oracle/MySQL/Postgres/Derby).
So even though the job has already been triggered, it can be started again and again, because a new ID is set for the job every time (the coordinator job ID is assigned on an incremental basis).
One way to avoid duplicate processing of the same job is to add a validation check on the metastore DB side.
Create a trigger on the COORD_JOBS table in the metastore DB that checks for an existing entry with the same job name; a MySQL-style check in the trigger body might look like:
IF (SELECT COUNT(*) FROM COORD_JOBS
    WHERE app_name = NEW.app_name AND status = 'RUNNING') > 0 THEN
  SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Cannot insert: a job with this app_name is already RUNNING';
END IF;
These DB triggers on the COORD_JOBS/WF_JOBS tables will run every time Oozie tries to insert a new job.
The COORD_JOBS table can be replaced with the WF_JOBS table, which stores the details of workflow jobs started by coordinator.properties.