Hive / S3 error: "No FileSystem for scheme: s3"

I am running Hive from a container (this image: https://hub.docker.com/r/bde2020/hive/) on my local computer.
I am trying to create a Hive table stored as a CSV in S3 with the following command:
CREATE EXTERNAL TABLE local_test (name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 's3://mybucket/local_test/';
However, I am getting the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException No FileSystem for scheme: s3)
What is causing it?
Do I need to set up something else?
Note:
I am able to run aws s3 ls mybucket and also to create Hive tables in another directory, like /tmp/.

This problem is discussed here: https://github.com/ramhiser/spark-kubernetes/issues/3
You need to add a reference to the AWS SDK jars to the Hive library path. That way it can recognize the s3, s3n, and s3a file schemes.
Hope it helps.
EDIT1:
hadoop-aws-2.7.4 contains the implementations for interacting with those file systems. Inspecting the jar shows it has all the implementations needed to handle those schemes.
The org.apache.hadoop.fs configuration tells Hadoop which file system implementation it needs to look up. The classes below are implemented in that jar:
org.apache.hadoop.fs.[s3|s3a|s3native]
The only thing still missing is that the library is not being added to the Hive library path. Is there any way you can verify that the path is added to the Hive library path?
EDIT2:
For a reference on setting the library path, see:
How can I access S3/S3n from a local Hadoop 2.6 installation?
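As a concrete sketch of what adding those jars can look like (the jar versions and locations below are assumptions based on a Hadoop 2.7.x distribution and may differ inside the bde2020 image):
# Copy the S3 filesystem implementations and the matching AWS SDK onto Hive's classpath.
# hadoop-aws-2.7.4 is built against aws-java-sdk-1.7.4; adjust both to your Hadoop version.
cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-2.7.4.jar $HIVE_HOME/lib/
cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar $HIVE_HOME/lib/
# (Alternatively, the same jars can be exposed through hive-env.sh's HIVE_AUX_JARS_PATH
# instead of copying them.)
# Quick check that the jars are now visible to Hive:
ls $HIVE_HOME/lib | grep -E 'hadoop-aws|aws-java-sdk'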

Related

Create Athena table using s3 source data

Below is the S3 path where I have stored the files obtained at the end of a process. The path is dynamic, that is, the values of the following fields will vary: partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file present under output/ as well as under intermediate_results/ directories, for each val1-val2.
Each file is a CSV.
But I am not very familiar with AWS Athena, so I'm unable to figure out how to implement this. I would really appreciate any kind of help. Thanks!
Use CREATE TABLE - Amazon Athena. You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena will automatically use all files in that path, including subdirectories. This means that a table created with a Location of output/ will include all subdirectories, including intermediate_results. Therefore, your data storage format is not compatible with your desired use for Amazon Athena. You would need to put the data into separate paths for each table.
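As a hedged illustration of that advice, each table would point at one dedicated path. The bucket, database, table and column names below are made up for the example; the Athena DDL is submitted here through the AWS CLI:
# Hypothetical example: a table over the intermediate_results/ leaf directory only.
# A table whose LOCATION were .../output/ would also sweep up intermediate_results/,
# which is why each table needs its own path.
aws athena start-query-execution \
  --query-execution-context Database=mydb \
  --result-configuration OutputLocation=s3://bucket/athena-query-results/ \
  --query-string "
    CREATE EXTERNAL TABLE IF NOT EXISTS partner1_customer1_intermediate (
      col1 string,
      col2 string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://bucket/partner1/data/customer1/output/intermediate_results/';"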

Reading from Amazon S3 in Flink: UnsupportedFileSystemSchemeException

I am trying to read from an S3 bucket in a Java Flink (version 1.11) application, reading an S3 file into a DataSet object using the following code:
ExecutionEnvironment executionEnv = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> dataSet = executionEnv.readTextFile("s3a://<bucket>/<key>");
Every time I try to run this I run into this error:
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3a'
...
When I look at other posts about this error, they say it is because I need to configure flink-s3-fs-hadoop in my environment. I have tried configuring it for a while now, but the same error keeps occurring: I have the .jar for flink-s3-fs-hadoop in my plugins directory, but it is not even recognized.
What are the steps for making this work within an IDE? All documentation online is very vague.
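Not a definitive answer, but for reference this is the plugin layout a standalone Flink 1.11 distribution expects (the jar version below is an assumption). When running from an IDE, the plugins/ directory of a distribution is normally not consulted at all, so the flink-s3-fs-hadoop dependency typically has to be added to the project's build instead:
# Sketch for a standalone Flink 1.11 distribution; the version number is an assumption.
mkdir -p $FLINK_HOME/plugins/s3-fs-hadoop
cp $FLINK_HOME/opt/flink-s3-fs-hadoop-1.11.2.jar $FLINK_HOME/plugins/s3-fs-hadoop/
# The jar must sit in its own subdirectory under plugins/, not directly in plugins/ or lib/.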

Google BigQuery cannot read some ORC data

I'm trying to load ORC data files stored in GCS into BigQuery via bq load/bq mk and am facing the error below. The data files were copied via the hadoop distcp command from an on-prem cluster's Hive instance, version 1.2. Most of the ORC files are loaded successfully, but a few are not. There is no problem when I read this data from Hive.
Command I used:
$ bq load --source_format ORC hadoop_migration.pm hive/part-v006-o000-r-00000_a_17
Upload complete.
Waiting on bqjob_r7233761202886bd8_00000175f4b18a74_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '<project>-af9bd5f6:bqjob_r7233761202886bd8_00000175f4b18a74_1': Error while reading data, error message:
The Apache Orc library failed to parse metadata of stripes with error: failed to open /usr/share/zoneinfo/GMT-00:00 - No such file or directory
Indeed, there is no such file and I believe it shouldn't be.
Google doesn't know about this error message, but I've found a similar problem here: https://issues.apache.org/jira/browse/ARROW-4966. There is a workaround for on-prem servers, which is to create a symlink to /usr/share/zoneinfo/GMT-00:00. But I'm in the cloud.
Additionally, I found that if I extract the data from the ORC file into JSON format via orc-tools, I'm able to load that JSON file into BigQuery. So I suspect that the problem is not in the data itself.
Has anybody come across such a problem?
The official Google support position is below. In short, BigQuery doesn't understand some time zone descriptions, and it was suggested that we change them in the data. Our workaround for this was to convert the ORC data to Parquet and then load it into the table.
Indeed this error can happen. Also when you try to execute a query from the BigQuery Cloud Console such as:
select timestamp('2020-01-01 00:00:00 GMT-00:00')
you’ll get the same error. It is not just related to the ORC import, it’s how BigQuery understands timestamps. BigQuery supports a wide range of representations as described in [1]. So:
“2020-01-01 00:00:00 GMT-00:00” -- incorrect timestamp string literal
“2020-01-01 00:00:00 abcdef” -- incorrect timestamp string literal
“2020-01-01 00:00:00-00:00” -- correct timestamp string literal
In your case the problem is with the representation of the time zone within the ORC file. I suppose it was generated that way by some external system. If you could replace the “GMT-00:00” string, together with the space preceding it, with just “-00:00”, that would be a correct time zone representation. Can you change the configuration of the system which generated the file to use a proper time zone string?
Creating a symlink only masks the problem and does not solve it properly, and in the case of BigQuery it is not possible.
Best regards,
Google Cloud Support
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time_zones
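A rough sketch of the ORC-to-Parquet workaround mentioned above, assuming the problematic files back a Hive table (the table and bucket names are hypothetical; only the hadoop_migration.pm dataset/table comes from the question):
# Rewrite the ORC table as Parquet on the Hive side (table names are hypothetical):
hive -e "CREATE TABLE pm_parquet STORED AS PARQUET AS SELECT * FROM pm_orc;"
# Copy the resulting Parquet files to GCS (e.g. with distcp or gsutil), then load them:
bq load --source_format=PARQUET hadoop_migration.pm "gs://<bucket>/pm_parquet/*"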

Is there a way to see files stored in localstack's mocked S3 environment

I've set up a localstack install based on the article How to fake AWS locally with LocalStack. I've tested copying a file up to the mocked S3 service and it works great.
I started looking for the test file I uploaded. I see there's an encoded version of the file I uploaded inside .localstack/data/s3_api_calls.json, but I can't find it anywhere else.
Given DATA_DIR=/tmp/localstack/data, I was expecting to find it there, but it's not.
It's not critical that I have access to it directly on the file system, but it would be nice.
My question is: Is there anywhere/way to see files that are uploaded to the localstack's mock S3 service?
After the latest update there is now only one port, 4566.
Yes, you can see your file.
Open http://localhost:4566/your-funny-bucket-name/your-weird-file-name in Chrome.
You should be able to see the content of your file now.
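The same check can also be done from the command line; with localstack's default setup (no auth enforced) a plain HTTP GET against the edge port is usually enough:
curl http://localhost:4566/your-funny-bucket-name/your-weird-file-name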
I went back and re-read the original article which states:
"Once we start uploading, we won't see new files appear in this directory. Instead, our uploads will be recorded in this file (s3_api_calls.json) as raw data."
So, it appears there isn't a direct way to see the files.
However, the Commandeer app provides a view into localstack that includes a directory listing of the mocked S3 buckets. There isn't currently a way to see the contents of the files, but the directory structure is enough for what I'm doing. UPDATE: According to #WallMobile it's now possible to see the contents of files too.
You could use the following command
aws --endpoint-url=http://localhost:4572 s3 ls s3://<your-bucket-name>
In order to list the exact folder in s3 bucket you could use this command:
aws --endpoint-url=http://localhost:4566 s3 ls s3://<bucket-name>/<folder-in-bucket>/
The image is saved as a base64-encoded string in the file recorded_api_calls.json.
I have passed DATA_DIR=/tmp/localstack/data, and the file is saved at /tmp/localstack/data/recorded_api_calls.json.
Open the file and copy the data field (d) from any API call that looks like this:
"a": "s3", "m": "PUT", "p": "/bucket-name/foo.png"
Extract this data using base64 decoding.
You can use this script to extract data from localstack S3:
https://github.com/nkalra0123/extract-data-from-localstack-s3/blob/main/script.sh
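A minimal sketch of that extraction, assuming the recorded file holds one JSON object per line like the PUT entry above, that jq is installed, and that /bucket-name/foo.png is the hypothetical object path:
# Select the PUT call for the object and base64-decode its data field (d):
jq -r 'select(.a == "s3" and .m == "PUT" and .p == "/bucket-name/foo.png") | .d' \
  /tmp/localstack/data/recorded_api_calls.json | base64 -d > foo.png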
From my understanding, localstack saves the data in memory by default. This is what happens unless you specify a data directory. Obviously, if it's in memory, you won't see any files anywhere.
To create a data directory you can run a command such as:
mkdir ~/.localstack
Then you have to instruct localstack to save the data at that location. You can do so by adding the DATA_DIR=... path and a volume in your docker-compose.yml file like so:
localstack:
  image: localstack/localstack:latest
  ports:
    - 4566:4566
    - 8055:8080
  environment:
    - SERVICES=s3
    - DATA_DIR=/tmp/localstack/data
    - DOCKER_HOST=unix:///var/run/docker.sock
  volumes:
    - /home/alexis/.localstack:/tmp/localstack
Then rebuild and start the Docker container.
Once the localstack process has started, you'll see a JSON database under ~/.localstack/data/....
WARNING: if you are dealing with very large files (GB), then it is going to be DEAD SLOW. The issue is that all the data is going to be saved inside that one JSON file in base64. In other words, it's going to generate a file much bigger than what you sent, and re-reading it is also going to be costly. It may be possible to fix this issue by setting the legacy storage mechanism to false:
LEGACY_PERSISTENCE=false
I have not tested that flag (yet).

AWS EMR Spark: Error writing to S3 - IllegalArgumentException - Cannot create a path from an empty string

I have been trying to fix this for a long time now ... no idea why I get this. FYI, I'm running Spark on an AWS EMR cluster. I debugged and can clearly see the destination path provided ... something like s3://my-bucket-name/. The Spark job creates ORC files and writes them after creating a partition like so: date=2017-06-10. Any ideas?
17/07/08 22:48:31 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
at org.apache.hadoop.fs.Path.<init>(Path.java:134)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Path.suffix(Path.java:361)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:138)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:82)
Code that writes the ORC files:
dataframe.write
  .partitionBy(partition)
  .option("compression", ZLIB.toString)
  .mode(SaveMode.Overwrite)
  .orc(destination)
I have seen a similar problem when writing parquet files to S3. The problem is the SaveMode.Overwrite. This mode doesn't seem to work correctly in combination with S3. Try to delete all the data in your S3 bucket my-bucket-name before writing into it. Then your code should run successfully.
To delete all files from your bucket my-bucket-name, you can use the following PySpark code:
# see https://www.quora.com/How-do-you-overwrite-the-output-directory-when-using-PySpark
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
# see http://crazyslate.com/how-to-rename-hadoop-files-using-wildcards-while-patterns/
fs = FileSystem.get(URI("s3a://my-bucket-name"), sc._jsc.hadoopConfiguration())
file_status = fs.globStatus(Path("/*"))
for status in file_status:
    fs.delete(status.getPath(), True)
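Alternatively, the same cleanup can be done outside Spark with the AWS CLI; note that this removes everything under the bucket, so use it with care:
aws s3 rm s3://my-bucket-name/ --recursive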