After adding following config in core-site.xml for hdfs, it does not create /user/X/.Trash folder when using webhdfs delete API.
<property>
<name>fs.trash.interval</name>
<value>10080</value>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>1440</value>
</property>
Expectation is that, all the deleted files must be in .Trash folder for the trash interval. However it works when using hadoop command (hadoop fs -rm /test/1). Does anyone has any idea ?
As specified here
The trash feature works by default only for files and directories deleted using the Hadoop shell. Files or directories deleted programmatically using other interfaces (WebHDFS or the Java APIs, for example) are not moved to trash, even if trash is enabled, unless the program has implemented a call to the trash functionality. (Hue, for example, implements trash as of CDH 4.4.)
Related
I want to access Google Cloud Storage via Hadoop client. I want to use it on machine outside of Google Cloud.
I followed instructions from here.
I created service account and generated key file. I also created core-site.xml file and downloaded the necessary library.
However, when I am trying to run simple hdfs dfs -ls gs://bucket-name command, all I get is this:
Error getting access token from metadata server at: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
When I am doing this inside the Google Cloud it works, but trying to connect to GCS from outside, it shows error above.
How to connect to GCS with Hadoop Client in this way? Is it even possible? I have no route to 169.254.169.254 address.
Here is my core-site.xml(I changed the key path and email in this example):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>spark.hadoop.google.cloud.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>spark.hadoop.google.cloud.auth.service.account.json.keyfile</name>
<value>path/to/key.json</value>
</property>
<property>
<name>fs.gs.project.id</name>
<value>ringgit-research</value>
<description>
Optional. Google Cloud Project ID with access to GCS buckets.
Required only for list buckets and create bucket operations.
</description>
</property>
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The AbstractFileSystem for gs: uris.</description>
</property>
<property>
<name>fs.gs.auth.service.account.email</name>
<value>myserviceaccountaddress#google</value>
<description>
The email address is associated with the service account used for GCS
access when fs.gs.auth.service.account.enable is true. Required
when authentication key specified in the Configuration file (Method 1)
or a PKCS12 certificate (Method 3) is being used.
</description>
</property>
</configuration>
could be that the hadoop services haven’t taken the updates made in your core-site.xml file yet, so my suggestion is restart the hadoop’s services,another action that you can take is check the Access control options[1].
If You still having the same issue after having taken the action suggested, please post the complete error message.
[1]https://cloud.google.com/storage/docs/access-control/
The problem is with the fact that I've tried wrong authentication method. Used method assumes that it's running inside google cloud and it's trying to connect to google metadata servers. When running outside of google it doesn't work for obvious reasons.
The answer to this is here: Migrating 50TB data from local Hadoop cluster to Google Cloud Storage with the proper core-site.xml in the selected answer.
Property fs.gs.auth.service.account.keyfile should be used instead of spark.hadoop.google.cloud.auth.service.account.json.keyfile. The only difference is that this property needs p12 key file instead of json.
When I use spark locally, writing data on my local filesystem, it creates some usefull .crc file.
Using the same job on Aws EMR and writing on S3, the .crc files are not written.
Is this normal? Is there a way to force the writing of .crc files on S3?
those .crc files are just created by the the low level bits of the Hadoop FS binding so that it can identify when a block is corrupt, and, on HDFS, switch to another datanode's copy of the data for the read and kick off a re-replication of one of the good copies.
On S3, stopping corruption is left to AWS.
What you can get off S3 is the etag of a file, which is the md5sum on a small upload; on a multipart upload it is some other string, which again, changes when you upload it.
you can get at this value with the Hadoop 3.1+ version of the S3A connector, though it's off by default as distcp gets very confused when uploading from HDFS. For earlier versions, you can't get at it, nor does the aws s3 command show it. You'd have to try some other S3 libraries (it's just a HEAD request, after all)
We have one AWS S3 bucket in which we get new CSV files at 10 minute interval. Goal is to ingest these files into Hive.
So the obvious way for me is to use Apache Flume for this and use Spooling Directory source which will keep looking for new files in landing directory and ingest them in Hive.
We have read-only permissions for S3 bucket and for landing directory in which files will be copied and Flume suffixes ingested files with .COMPLETED suffix. So in our case Flume won't be able to mark completed files because of permission issue.
Now questions are:
What will happen if Flume is not able to add suffix to completed
files? Will it give any error or it will silently fail? (I am actually testing this but if anyone has already tried this then I don't have to reinvent the wheel)
Whether
Flume will be able to ingest files without marking them with
.COMPLETED?
Is there any other Big Data tool/technology better
suited for this use case?
Flume Spooling Directory Source needs to have write permission either to rename or delete the processed/read log file.
check 'fileSuffix', 'deletePolicy' settings.
If it doesnt rename/delete the completed files, it can't figure out which files are already processed.
You might want to write a 'script' that reads from read-only S3 bucket to a 'staging' folder with write permissions and provide this staging folder as source to flume.
We have two s3 buckets, and we have a sync cron job that should copy bucket1 changes to bucket2.
aws s3 sync s3://bucket1/images/ s3://bucket2/images/
When a new image is added to bucket1, it correctly gets copied over to bucket2.
However, if we upload a new version of that image to bucket2, when the sync job next runs it actually copies the older version from bucket1 over to bucket2, replacing the newer version we just put there.
This is part of a migration process, and in time the only place images will be uploaded to will be bucket2, but for the time being sometimes they may be uploaded to either, and we only want changes form bucket1 to be copied up to bucket2, NOT the other way round.
Why does the aws sync job seem to think that the file on bucket1 has changed? Does it not know that the file in bucket2 is newer, so it should be left alone?
The AWS Command-Line Interface (CLI) aws s3 sync command copies content from the Source location to the Destination location. It only copies files that have been added or changed since the last sync.
It is designed as a one-way sync, not a two-way sync. Your file is being overwritten because the file in the Source is not present in the Destination. This is correct behavior.
There is limited range to tweak these controls, such as (from the sync command documentation):
--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
However, there does not appear to be an option that stops overwriting of files merely because a file with the same name exists, or something with a preference to keep newer files.
If you want a two-way sync with more specific rules, you will need to code it yourself.
I'm using EMR to move a folder from local file system to S3 in Spark using fs.moveFromLocalFile API. Everything works fine except a 0-byte file created by EMRFS with name _$folder$ for EVERY folder that is uploaded.
Is there any way to move folders without this dummy file creation for every folder? (other than manually deleting this file). Also, why is this dummy file created? I'm currently using s3:// protocol recommended by EMR team.
My experience is that the mkdir() function usually called for local file systems or hdfs will result in an s3 empty file being created with the name of the mkdir folder and appended by _$folder$. In S3, there is no concept of an "empty folder" because you cannot have a key (pathname) with a null value (the file).
In a perfect world, mkdir(s3://bucket/path) should be a noop.
Don't know about EMR fs; this sounds like the same extension used by The S3n client. These files are stripped in the client when listing/stat-ing paths.
ASF's S3a creates one with a "/" suffix.