Accessing GCS with Hadoop client from outside of Cloud - google-cloud-platform

I want to access Google Cloud Storage via the Hadoop client, from a machine outside of Google Cloud.
I followed the instructions from here.
I created a service account and generated a key file. I also created a core-site.xml file and downloaded the necessary library.
However, when I try to run a simple hdfs dfs -ls gs://bucket-name command, all I get is this:
Error getting access token from metadata server at: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
When I do this inside Google Cloud it works, but when I try to connect to GCS from outside, it shows the error above.
How can I connect to GCS with the Hadoop client in this way? Is it even possible? I have no route to the 169.254.169.254 address.
Here is my core-site.xml (I changed the key path and email in this example):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>spark.hadoop.google.cloud.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>spark.hadoop.google.cloud.auth.service.account.json.keyfile</name>
<value>path/to/key.json</value>
</property>
<property>
<name>fs.gs.project.id</name>
<value>ringgit-research</value>
<description>
Optional. Google Cloud Project ID with access to GCS buckets.
Required only for list buckets and create bucket operations.
</description>
</property>
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The AbstractFileSystem for gs: uris.</description>
</property>
<property>
<name>fs.gs.auth.service.account.email</name>
<value>myserviceaccountaddress#google</value>
<description>
The email address is associated with the service account used for GCS
access when fs.gs.auth.service.account.enable is true. Required
when authentication key specified in the Configuration file (Method 1)
or a PKCS12 certificate (Method 3) is being used.
</description>
</property>
</configuration>

It could be that the Hadoop services haven't picked up the updates made to your core-site.xml file yet, so my suggestion is to restart Hadoop's services. Another action you can take is to check the access control options [1].
If you are still having the same issue after taking the suggested actions, please post the complete error message.
[1] https://cloud.google.com/storage/docs/access-control/

The problem is that I tried the wrong authentication method. The method I used assumes that it's running inside Google Cloud and tries to connect to the Google metadata servers. When running outside of Google Cloud, that doesn't work for obvious reasons.
The answer to this is here: Migrating 50TB data from local Hadoop cluster to Google Cloud Storage, with the proper core-site.xml in the selected answer.
The property fs.gs.auth.service.account.keyfile should be used instead of spark.hadoop.google.cloud.auth.service.account.json.keyfile. The only difference is that this property needs a p12 key file instead of JSON.
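For reference, here is a minimal sketch of what the authentication part of core-site.xml could look like with that property. This assumes an older GCS connector release that accepts the fs.gs.auth.service.account.* keys; the email and key path below are placeholders, not values from the question.
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem implementation for gs: uris.</description>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>my-service-account@my-project.iam.gserviceaccount.com</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/key.p12</value>
  <description>Local path to the PKCS12 key file generated for the service account.</description>
</property>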

Related

Google-Cloud-Storage | Artifactory does not delete objects in bucket

We are currently in the process of setting up an Artifactory Pro instance on GCP and want to use GCS as its Filestore. The connection to the bucket is successful, uploads and downloads to and from the bucket via Artifactory are successful (using a generic repo).
However, Artifactory does not delete an artifact when we tell it to via the GUI. The artifact gets deleted and disappears in the GUI (the Trash Can is disabled in the System Settings), but continues to exist in the bucket in GCS.
This is our binarystore.xml:
<?xml version="1.0" encoding="UTF-8"?>
<config version="v1">
<chain>
<provider id="cache-fs" type="cache-fs">
<provider id="eventual" type="eventual">
<provider id="retry" type="retry">
<provider id="google-storage" type="google-storage"/>
</provider>
</provider>
</provider>
</chain>
<provider id="google-storage" type="google-storage">
<endpoint>commondatastorage.googleapis.com</endpoint>
<bucketName>rtfdev</bucketName>
<identity>xxx</identity>
<credential>xxx</credential>
<bucketExists>false</bucketExists>
<httpsOnly>true</httpsOnly>
<httpsPort>443</httpsPort>
</provider>
</config>
Our setup:
Artifactory 7.12.6
OS: Debian 10 (buster)
Machine Type: e2-highcpu-4 (4 vCPUs, 4 GB memory)
Disk: 200 GB SSD
The questions are:
Is this working as intended? Does Artifactory never ever delete artifacts in a bucket?
On a related note: How can we convince Artifactory to be more verbose with its interactions with GCS? (the artifactory-binarystore.log is suspiciously empty, console.log is quiet as well...)
The reason you are not seeing the artifact being deleted immediately from storage is that Artifactory uses checksum-based storage.
TL;DR: you will see the artifact deleted from storage once the garbage collection process deletes it.
Artifactory stores any binary file only once. This is what we call "once and once only storage". The first time a file is uploaded, Artifactory runs the required checksum calculations when storing the file. However, if the file is uploaded again (to a different location, for example), the upload is implemented as a simple database transaction that creates another record mapping the file's checksum to its new location. There is no need to actually store the file again in storage. No matter how many times a file is uploaded, the filestore only hosts a single copy of the file.
Deleting a file is also a simple database transaction in which the corresponding database record is deleted. The file itself is not directly deleted, even if the last database entry pointing to it is removed. So-called "orphaned" files are removed in the background by Artifactory's garbage collection processes.
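As an illustration only (this is not Artifactory's code; the class and field names below are made up), here is a small sketch of the idea: uploads deduplicate by checksum, deletes only remove the path-to-checksum record, and a later garbage-collection pass removes blobs that nothing references anymore.

import java.util.*;

// Toy model of checksum-based ("once and once only") storage.
class ChecksumStore {
    private final Map<String, byte[]> blobsByChecksum = new HashMap<>(); // the filestore
    private final Map<String, String> checksumByPath = new HashMap<>();  // the database records

    void upload(String path, byte[] content, String checksum) {
        // Uploading the same content twice only adds a database record, not a second blob.
        blobsByChecksum.putIfAbsent(checksum, content);
        checksumByPath.put(path, checksum);
    }

    void delete(String path) {
        // A delete is only a database transaction; the blob stays until garbage collection runs.
        checksumByPath.remove(path);
    }

    void garbageCollect() {
        // Remove "orphaned" blobs that no database record points to anymore.
        Set<String> referenced = new HashSet<>(checksumByPath.values());
        blobsByChecksum.keySet().removeIf(checksum -> !referenced.contains(checksum));
    }
}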

Spring Boot Logging and Google Cloud Platform Log Viewer

I'm running a Spring Boot application within Google Cloud Platform and viewing the log files in the Google Cloud Platform Logs Viewer. Before using Spring Boot, when just using simple servlets, the log entries were displayed as follows:
Each request was grouped, and all the logging information for that request could be seen by expanding the row. However, when using Spring Boot, the requests are no longer grouped and the log entries are just shown line by line. When there are multiple requests, the log entries become very confusing because it isn't possible to view them in a grouped way. I have my logging.properties set up in the same way:
.level = INFO
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=FINEST
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format = [%1$tc] %4$s: %2$s - %5$s %6$s%n
The Logger is initialised in each class as:
private static final java.util.logging.Logger LOG = java.util.logging.Logger.getLogger(MyClass.class.getName());
And then the logging API is used as:
LOG.info("My Message");
I don't understand why the statements are being logged differently and are no longer grouped, but it must have something to do with the way Spring Boot handles logging.
With the recent runtimes, App Engine has been evolving toward a behaviour that converges more and more with a container-based approach, more "open", like newer products (Cloud Run, for example).
This changes the way we develop with GAE a little: specific legacy libraries aren't available (SearchAPI, ...), and it also changes how logs are managed.
We can reproduce this "nested log" feature with the new Java 11 runtime, but we need to manage it ourselves.
As the official docs mention:
In the Logs Viewer, log entries correlated by the same trace can be
viewed in a "parent-child" format.
This means that if we retrieve the trace identifier received in the X-Cloud-Trace-Context HTTP header of our request, we can then use it when adding a new LogEntry, passing it as the trace identifier attribute.
This can be done by using the Stackdriver Logging client libraries.
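Here is a rough sketch of that approach with the Stackdriver (Cloud) Logging Java client library. The project ID, log name, resource type and the way the X-Cloud-Trace-Context header is parsed are assumptions made for illustration.

import com.google.cloud.MonitoredResource;
import com.google.cloud.logging.LogEntry;
import com.google.cloud.logging.Logging;
import com.google.cloud.logging.Logging.WriteOption;
import com.google.cloud.logging.LoggingOptions;
import com.google.cloud.logging.Payload.StringPayload;
import com.google.cloud.logging.Severity;
import java.util.Collections;

public class TraceCorrelatedLogger {

    // traceHeader is the raw value of X-Cloud-Trace-Context, e.g. "TRACE_ID/SPAN_ID;o=1".
    public static void log(String projectId, String traceHeader, String message) {
        String traceId = traceHeader.split("/")[0];

        LogEntry entry = LogEntry.newBuilder(StringPayload.of(message))
                .setSeverity(Severity.INFO)
                // The trace attribute is what lets the Logs Viewer nest this entry
                // under the corresponding request log in a parent-child view.
                .setTrace("projects/" + projectId + "/traces/" + traceId)
                .build();

        Logging logging = LoggingOptions.getDefaultInstance().getService();
        logging.write(Collections.singleton(entry),
                WriteOption.logName("my-app-log"), // assumed log name
                // "global" keeps the sketch simple; a real GAE entry would use a more specific resource.
                WriteOption.resource(MonitoredResource.newBuilder("global").build()));
    }
}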
With Spring GCP
Fortunately, Spring Cloud GCP is there to make our lives easier.
You can find a sample project which implements it. Be careful: it's an App Engine Flexible example, but it will also work fine with the Standard runtime.
It uses Logback.
Starting from a working Spring Boot project on GAE Java 11, the steps to follow are:
Add the spring-cloud-gcp-starter-logging dependency:
<!-- Starter for Stackdriver Logging -->
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-gcp-starter-logging</artifactId>
<version>1.2.1.RELEASE</version>
</dependency>
Add a logback-spring.xml file inside the src/main/resources folder:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<include resource="org/springframework/cloud/gcp/autoconfigure/logging/logback-appender.xml" />
<include resource="org/springframework/boot/logging/logback/defaults.xml"/>
<include resource="org/springframework/boot/logging/logback/console-appender.xml" />
<root level="INFO">
<!-- If running in GCP, remove the CONSOLE appender otherwise logs will be duplicated. -->
<appender-ref ref="CONSOLE"/>
<appender-ref ref="STACKDRIVER" />
</root>
</configuration>
Enable the Spring GCP logging feature inside src/main/resources/application.properties:
spring.cloud.gcp.logging.enabled=true
And use LOGGER inside your code:
@SpringBootApplication
@RestController
public class DemoApplication {
private static final Log LOGGER = LogFactory.getLog(DemoApplication.class);
public static void main(String[] args) {
SpringApplication.run(DemoApplication.class, args);
}
@GetMapping()
public SomeData get() {
LOGGER.info("My info message");
LOGGER.warn("My warning message");
LOGGER.error("My error message");
return new SomeData("Hello from Spring boot !");
}
}
The result will be in the Stackdriver Logging viewer, under appengine.googleapis.com/request_log:

Trash config for hdfs not working

After adding the following config to core-site.xml for HDFS, the /user/X/.Trash folder is not created when using the WebHDFS delete API.
<property>
<name>fs.trash.interval</name>
<value>10080</value>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>1440</value>
</property>
The expectation is that all deleted files end up in the .Trash folder for the duration of the trash interval. However, it works when using the hadoop command (hadoop fs -rm /test/1). Does anyone have any idea?
As specified here
The trash feature works by default only for files and directories deleted using the Hadoop shell. Files or directories deleted programmatically using other interfaces (WebHDFS or the Java APIs, for example) are not moved to trash, even if trash is enabled, unless the program has implemented a call to the trash functionality. (Hue, for example, implements trash as of CDH 4.4.)
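So a client that deletes through the Java API has to opt into trash itself. Below is a minimal sketch of what that could look like, assuming the cluster's core-site.xml (with fs.trash.interval set) is on the classpath; the path is just an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashAwareDelete {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up fs.trash.interval from core-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/test/1"); // example path

        // Moves the path into /user/<user>/.Trash instead of deleting it outright,
        // which is what "hadoop fs -rm" does when trash is enabled.
        boolean movedToTrash = Trash.moveToAppropriateTrash(fs, target, conf);
        if (!movedToTrash) {
            fs.delete(target, true); // fall back to a permanent delete if trash is disabled
        }
    }
}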

Submitting the same Oozie workflow job multiple times at the same time

I wonder how Oozie handles conflicts (if there really are any) when I submit the same workflow job (just the Oozie sample example) twice at the same time.
I can submit the same job twice successfully, and the Oozie server returns two different job IDs. In the Oozie Web Console, I saw the status of both jobs as RUNNING, then both SUCCEEDED after some time.
My workflow.xml is as follows:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
<start to="mr-node"/>
<action name="mr-node">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/mapreduce_test/output-data"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>mapred.mapper.class</name>
<value>org.apache.oozie.example.SampleMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.apache.oozie.example.SampleReducer</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/${wf:user()}/mapreduce_test/input</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/${wf:user()}/mapreduce_test/output-data/</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
I know that deleting the output directory in the "prepare" element helps make the action repeatable and enables retries after failure, and I also understand the basic action run model.
So, my questions are:
Are the two identical jobs really running concurrently? (I saw both in the running state in the Oozie Web Console.)
Is there a write conflict? (The two identical jobs point to the same output directory.)
Oozie does not detect any job duplication or anything like it. It accepts the workflow jobs, schedules them on the cluster for execution, and monitors them until completion or failure.
Are the two identical jobs really running concurrently?
Yes. Both jobs will be running concurrently.
Is there a write conflict?
Oozie does not have any checks related to write conflicts. I guess these are taken care of by either the MapReduce or HDFS framework.
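On the MapReduce side, the main guard is that the stock FileOutputFormat refuses a job at submission time if its output directory already exists. Here is a small illustrative sketch (the paths and class name are made up, and it uses the newer mapreduce API rather than the mapred.* properties from the workflow above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputDirCheckDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sample-wf-action");

        FileInputFormat.addInputPath(job, new Path("/user/someone/mapreduce_test/input"));
        // FileOutputFormat.checkOutputSpecs() runs at submission time and fails the job with
        // FileAlreadyExistsException if this directory already exists. It does not fully protect
        // two jobs submitted at the same instant, before either has created the directory.
        FileOutputFormat.setOutputPath(job, new Path("/user/someone/mapreduce_test/output-data"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}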
As per your question:
1. Oozie schedules jobs on the cluster for execution until they end with a status such as SUCCEEDED or FAILED.
2. Both jobs will be running at the same time and will execute the same actions that have been defined.
To avoid this, you can perform the steps below, which may be helpful.
Oozie jobs start executing when triggered via job.properties or coordinator.properties, and the workflow runs at the interval specified in job.xml/coordinator.xml.
So whenever a request is submitted, it makes a fresh entry in the COORD_JOBS table (for coordinators) or the WF_JOBS table (for workflows) in Oozie's metastore DB (which could be Oracle/MySQL/Postgres/Derby).
So even though the job has already been triggered, the same job can be started repeatedly, because a new ID is assigned each time (the coordinator job ID is assigned on an incremental basis).
One way to avoid duplicate processing of the same job is to add a validation check on the metastore DB side.
Create a trigger on the COORD_JOBS table in the metastore DB, which checks the table for an entry with the same job name, with a query like:
-- assuming a MySQL metastore; inside a BEFORE INSERT trigger on COORD_JOBS
IF (SELECT COUNT(*) FROM COORD_JOBS WHERE app_name = NEW.app_name AND status = 'RUNNING') > 0 THEN
  SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Error: cannot insert duplicate running job.';
END IF;
This DB trigger on the COORD_JOBS/WF_JOBS table will run every time Oozie tries to insert a new job.
The COORD_JOBS table can be replaced with the WF_JOBS table, which stores the details of workflow jobs started by coordinator.properties.

Sitecore allow role to publish content in specific areas only

I am trying to create a role within Sitecore which can publish content, but only within specific areas of the site. I've added the standard Sitecore\Client Publishing role to my role, but I can't see how to prevent the role from being able to publish all areas of the site. I've looked at the Security editor and the Access viewer, but setting the write access of the sections only seems to affect the ability to edit those sections and has no effect on the ability to publish them.
Workflow is the typical way this is handled. Giving roles access to approve (this could be called 'publish') content of certain sections of the content tree will be the best way to achieve what you are describing. Combine this with an auto-publish action to make it more user friendly.
One thing to keep in mind when using this method, though, is referenced items (for example, images from the media library that the content may be using). Take a look at the 'Publishing Spider' module in the shared source library: http://trac.sitecore.net/PublishingSpider
EDIT: Update
I recently discovered this setting in the web.config: "Publishing.CheckSecurity". If set to true, this setting will only publish items if the user has read + write on the item and will only remove items from the web DB if the user has delete permissions.
I had a similar situation once: I created a role per section which only had read and write access to that section and nowhere else (say, 'editor section 1'), and another role which only had publishing permission for that section (say, 'publisher section 1'). Then I added the 'editor section 1' role to the 'publisher section 1' role, which gives you a role for publishing only that specific section.
You do not need multiple workflows; the same workflow with multiple roles can also achieve this goal.
The answer to this is to set Publishing.CheckSecurity to true.
You need to find this setting inside web.config:
<!-- PUBLISHING SECURITY
Check security rights when publishing?
When CheckSecurity=true, Read rights are required for all source items. When it is
determined that an item should be updated or created in the target database,
Write right is required on the source item. If it is determined that the item
should be deleted from target database, Delete right is required on the target item.
In summary, only the Read, Write and Delete rights are used. All other rights are ignored.
Default value: false
-->
<setting name="Publishing.CheckSecurity" value="false" />
Set value="true".
But again, you have to govern security tightly and assign user roles properly. Failing to do so, you will experience buggy publishing.
Hope that will help