Enterprise Architecture & Integration, SOA, ESB, Web Services & Cloud Integration

Friday, 27 December 2019

Accessing AWS S3 buckets from Apache Spark throwing Bad Request 400 error

Apache Spark (http://spark.apache.org) needs no introduction: it is widely used for processing large data sets quickly. Spark can read data from many sources, and Amazon S3 is a popular store for big data sets. Sooner or later you will need to read a large data set from S3 and process it with Spark to implement a business use case.

Along the way, I spent a lot of time troubleshooting one issue: "com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;". I had to go through several blogs, the official documentation (AWS, Spark, Hadoop), Stack Overflow questions and answers, etc. to finally get it working. So I thought of writing a brief write-up that might help someone.

Please follow these three steps:

1. Set the following key/value pairs on the Hadoop configuration:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
sc.hadoopConfiguration.set("fs.s3a.endpoint", AWS_REGION)
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")

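Replace ACCESS_KEY and SECRET_KEY with your own credentials. Despite its name, AWS_REGION must hold the S3 endpoint of your bucket's region (for example, "s3.eu-central-1.amazonaws.com"), because "fs.s3a.endpoint" expects a host name rather than a bare region code.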

For the official documentation, refer to https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
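
Since the endpoint value is easy to get wrong, here is a minimal sketch of deriving it from a region code. It assumes the standard regional pattern "s3.<region>.amazonaws.com" (with ".cn" appended for China regions); the names S3Endpoints and forRegion are hypothetical helpers, not part of Hadoop or the AWS SDK.

```scala
object S3Endpoints {
  // Builds the regional S3 endpoint host for a region code such as "eu-central-1".
  // Assumption: the standard pattern s3.<region>.amazonaws.com (China regions add ".cn").
  def forRegion(region: String): String = {
    val r = region.trim.toLowerCase
    require(r.nonEmpty, "region must not be empty")
    val suffix = if (r.startsWith("cn-")) "amazonaws.com.cn" else "amazonaws.com"
    s"s3.$r.$suffix"
  }
}

// Usage inside a Spark job (sc is an existing SparkContext):
// sc.hadoopConfiguration.set("fs.s3a.endpoint", S3Endpoints.forRegion("eu-central-1"))
```

If your bucket uses a custom or VPC endpoint, pass that host name directly instead of deriving it.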

2. If you still get the error because of V4 signing (https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html), passing "com.amazonaws.services.s3.enableV4" as in step 1 should ideally work. If it doesn't, note that the AWS SDK reads this switch as a Java system property rather than from the Hadoop configuration, so set it on the JVM directly. This is the option that FINALLY did the magic for me.

import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

For the official documentation, refer to https://docs.aws.amazon.com/sdk-for-java/
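
One caveat: System.setProperty only affects the JVM it runs in. With master "local" that is enough, but on a cluster the executors are separate JVMs, so the same flag has to reach them too. A minimal sketch, assuming you pass it through Spark's "spark.executor.extraJavaOptions" setting; SigV4Options and executorJavaOptions are hypothetical names, and I have only run the local variant myself.

```scala
object SigV4Options {
  // String value behind SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY
  val SigV4Property = "com.amazonaws.services.s3.enableV4"
  val JvmFlag = s"-D$SigV4Property=true"

  // Returns the Spark setting that passes the flag to executor JVMs,
  // keeping any JVM options that are already configured.
  def executorJavaOptions(existing: Option[String] = None): (String, String) =
    "spark.executor.extraJavaOptions" -> (existing.toList :+ JvmFlag).mkString(" ")
}

// Usage (conf is a SparkConf, before the SparkContext is created):
// val (key, value) = SigV4Options.executorJavaOptions()
// conf.set(key, value)
```

The helper only builds the key/value pair, so it can be checked without a running cluster.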


3. Check the versions. Spark, Hadoop and the AWS SDK depend on each other, and many version combinations do not work together. hadoop-aws 2.7.x is built against aws-java-sdk 1.7.4, so keep that pair together. I used the versions below in my "build.sbt":
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.4" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "2.7.4" % "provided",
  "com.amazonaws" % "aws-java-sdk" % "1.7.4" % "provided"
)

Here is the complete program for your reference:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Logger
import org.apache.log4j.Level
import com.amazonaws.SDKGlobalConfiguration

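object S3BigDataAnalysis {
  // Placeholders: replace with your own values before running.
  // AWS_REGION holds the S3 endpoint, e.g. "s3.eu-central-1.amazonaws.com".
  val ACCESS_KEY = "<your-access-key>"
  val SECRET_KEY = "<your-secret-key>"
  val AWS_REGION = "<your-s3-endpoint>"

  def main(args: Array[String]): Unit = {
    // Keep Spark's own logging quiet so the output stays readable
    Logger.getLogger("org").setLevel(Level.ERROR)

    val conf = new SparkConf()
    conf.setAppName("S3BigDataAnalysis")
    conf.setMaster("local")

    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
    sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
    sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
    sc.hadoopConfiguration.set("fs.s3a.endpoint", AWS_REGION)
    //sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
    // Enable V4 signing as a JVM system property before the first S3 access
    System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

    // Read every JSON file under the 2019 prefix and print each line
    val iot_devices = sc.textFile("s3a://iot-devices/2019/*.json")
    iot_devices.foreach(println)

    sc.stop()
  }
}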


To follow the research several people have done on this issue, refer to the links below:

https://stackoverflow.com/questions/34209196/amazon-s3a-returns-400-bad-request-with-spark

https://stackoverflow.com/questions/57477385/read-files-from-s3-bucket-to-spark-dataframe-using-scala-in-datastax-spark-submi

https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark?noredirect=1&lq=1

https://stackoverflow.com/questions/55119337/read-files-from-s3-pyspark


I hope this tip is useful and saves you several hours. Please leave a comment here if you have any questions. I will try to answer at the earliest opportunity!

🙏

Thursday, 28 November 2019

Installing Java 11 in Amazon Linux

Many old Java applications are still stuck on JDK 1.8 for various reasons. When possible, you should migrate to JDK 13 or at least JDK 11.

The steps below will help you install JDK 11 and check the Java version.

Step 1:
Ensure you have internet connectivity from your Amazon EC2 instance, then type the command below

sudo amazon-linux-extras install java-openjdk11

The above command will install JDK11.

Step 2:
Type the below command to check the version

java -version

The result will be something like this

openjdk version "11.0.5" 2019-10-15 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.5+10-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.5+10-LTS, mixed mode, sharing)

Please note that from Java 9 onwards, the version numbering scheme has changed. For JDK 1.8, the version number looks like "1.8.0_192". For later versions, you won't see 1.x.y_zzz. For example, for JDK 11 it looks like "11.x.y".
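
If a script or application has to cope with both schemes, the major release can be extracted as in the sketch below. This is only an illustration: JavaVersions and major are hypothetical names, and the parsing assumes version strings shaped like the two examples above.

```scala
object JavaVersions {
  // Extracts the major release from a java.version style string.
  // Old scheme: "1.8.0_192" -> 8. Scheme used since Java 9: "11.0.5" -> 11.
  def major(version: String): Int = {
    val parts = version.trim.split("[._+-]")
    // In the old scheme the first field is always "1" and the second is the release
    if (parts(0) == "1" && parts.length > 1) parts(1).toInt else parts(0).toInt
  }
}

// Usage: JavaVersions.major(System.getProperty("java.version"))
```

A string that does not start with a number makes toInt throw a NumberFormatException, so validate the input first if it comes from an untrusted source.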

Have fun!