Enterprise Architecture & Integration, SOA, ESB, Web Services & Cloud Integration

Friday, 27 December 2019

Accessing AWS S3 buckets from Apache Spark throws a Bad Request 400 error

Apache Spark (http://spark.apache.org) needs no introduction: it is very popular for processing large data sets quickly. Spark can read data from various sources, and S3 is a popular one for storing big data sets. So sooner or later you will end up reading a large data set from S3 and processing it with Spark to implement a useful business use case.

Along the way, I spent a lot of time troubleshooting one issue: "com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;". I had to go through several blogs, the official documentation (AWS, Spark, Hadoop), Stack Overflow questions and answers, etc. to finally get it working. So I thought I would write a brief write-up that might help someone.

Please follow these three steps:

1. Populate the following key/value pairs in the Hadoop configuration:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
sc.hadoopConfiguration.set("fs.s3a.endpoint", AWS_REGION)
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")

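Please change AWS_REGION, ACCESS_KEY and SECRET_KEY as appropriate. Note that "fs.s3a.endpoint" expects the S3 endpoint of the bucket's region (for example s3.ap-south-1.amazonaws.com), not just the bare region name.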

For the official documentation, refer to https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
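
As a minimal sketch, the settings from step 1 can be gathered in one place and applied in a loop. The object S3ASettings and its method names are hypothetical (they are not part of Hadoop or Spark), and the endpoint pattern s3.<region>.amazonaws.com is an assumption that holds for the standard AWS regions only.

```scala
// Hypothetical helper: collects the S3A settings from step 1 in a single map.
object S3ASettings {
  // Assumption: standard AWS regions expose S3 at s3.<region>.amazonaws.com
  def endpointFor(region: String): String = s"s3.$region.amazonaws.com"

  // Returns the Hadoop configuration keys and values without touching Spark
  def build(region: String, accessKey: String, secretKey: String): Map[String, String] = Map(
    "fs.s3a.impl" -> "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.aws.credentials.provider" -> "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider",
    "fs.s3a.access.key" -> accessKey,
    "fs.s3a.secret.key" -> secretKey,
    "fs.s3a.endpoint" -> endpointFor(region)
  )
}
```

With a SparkContext named sc in scope, the map could then be applied with S3ASettings.build("ap-south-1", ACCESS_KEY, SECRET_KEY).foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }.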

2. If you still face the V4 signature issue (https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html), passing "com.amazonaws.services.s3.enableV4" as in step 1 should ideally work. If it doesn't (the AWS SDK reads this key as a JVM system property, so setting it on the Hadoop configuration may have no effect), you can set the system property directly, which is what FINALLY did the magic for me:

import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

For the official documentation, refer to https://docs.aws.amazon.com/sdk-for-java/
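
System.setProperty only affects the JVM it runs in, which is enough for a local master. On a cluster the executors run in separate JVMs, so the same property would presumably have to reach them as a JVM option. Here is a minimal sketch, assuming the key behind ENABLE_S3_SIGV4_SYSTEM_PROPERTY is "com.amazonaws.services.s3.enableV4" (the same key used in step 1). The object SigV4Options is hypothetical; "spark.executor.extraJavaOptions" is a standard Spark setting.

```scala
// Hypothetical helper: builds the JVM option that enables V4 signing.
object SigV4Options {
  // Assumption: this is the key the AWS SDK constant resolves to
  val EnableV4Property = "com.amazonaws.services.s3.enableV4"

  // The same flag expressed as a -D option for a JVM command line
  val javaOption: String = s"-D$EnableV4Property=true"

  // Equivalent of the System.setProperty call in step 2, for the current JVM only
  def enableOnThisJvm(): Unit = {
    System.setProperty(EnableV4Property, "true")
    ()
  }

  // Key/value pair to put on a SparkConf before the SparkContext is created
  val executorSetting: (String, String) = "spark.executor.extraJavaOptions" -> javaOption
}
```

For example, conf.set(SigV4Options.executorSetting._1, SigV4Options.executorSetting._2) on the SparkConf, followed by SigV4Options.enableOnThisJvm() in the driver. I have only verified the local master setup described in this post, so treat the cluster part as untested.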


3. Version issues. There are many versions of, and dependencies between, Spark, Hadoop and the AWS SDK. I used the following versions in my "build.sbt":
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.4" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "2.7.4" % "provided",
  "com.amazonaws" % "aws-java-sdk" % "1.7.4" % "provided"
)

Here is the complete program for your reference:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Logger
import org.apache.log4j.Level
import com.amazonaws.SDKGlobalConfiguration

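object S3BigDataAnalysis {
  // Placeholders, not part of the fix itself: supply your own values.
  // Reading the keys from these environment variables is an assumption.
  val ACCESS_KEY: String = sys.env("AWS_ACCESS_KEY_ID")
  val SECRET_KEY: String = sys.env("AWS_SECRET_ACCESS_KEY")
  val AWS_REGION: String = "s3.ap-south-1.amazonaws.com" // example endpoint of the bucket's region

  def main(args: Array[String]): Unit = {
    // Keep Spark's own logging quiet so the output stays readable
    Logger.getLogger("org").setLevel(Level.ERROR)

    val conf = new SparkConf()
    conf.setAppName("S3BigDataAnalysis")
    conf.setMaster("local")

    val sc = new SparkContext(conf)
    // Step 1: S3A file system, credentials and endpoint
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
    sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
    sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
    sc.hadoopConfiguration.set("fs.s3a.endpoint", AWS_REGION)
    // Step 2: enable V4 signing through the AWS SDK system property
    //sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
    System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

    // Read every JSON file under the 2019/ prefix and print each line
    val iot_devices = sc.textFile("s3a://iot-devices/2019/*.json")
    iot_devices.foreach(println)
  }
}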


To follow the research done by several people on this issue, you can refer to the links below:

https://stackoverflow.com/questions/34209196/amazon-s3a-returns-400-bad-request-with-spark

https://stackoverflow.com/questions/57477385/read-files-from-s3-bucket-to-spark-dataframe-using-scala-in-datastax-spark-submi

https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark?noredirect=1&lq=1

https://stackoverflow.com/questions/55119337/read-files-from-s3-pyspark


I hope this tip is useful and saves you several hours. Please leave a comment here if you have any questions. I will try to answer at the earliest opportunity!

🙏