Enterprise Architecture & Integration, SOA, ESB, Web Services & Cloud Integration

Friday, 27 December 2019

Accessing AWS S3 buckets from Apache Spark throwing Bad Request 400 error

Apache Spark (http://spark.apache.org) needs no introduction: it is hugely popular for processing large data sets very quickly. Spark can read data from a variety of sources, and S3 is one of the most popular places to store big data sets. So sooner or later you will find yourself reading a large data set from S3 and processing it with Spark to implement a useful business application use case.

In my own journey, I spent a lot of time troubleshooting one issue: "com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;". I had to go through several blogs, the official documentation (AWS, Spark, Hadoop), Stack Overflow questions and answers, etc. before I finally got it working. So I thought a brief write-up might help someone else.

Please follow these three steps:

1. You would need to populate the below key and value pairs:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
sc.hadoopConfiguration.set("fs.s3a.endpoint", AWS_REGION)
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")

Please change ACCESS_KEY, SECRET_KEY and AWS_REGION as appropriate. Note that fs.s3a.endpoint expects the region-specific S3 endpoint (for example, s3.eu-west-2.amazonaws.com), not just the region name.

For the official documentation, you can refer to https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
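
If you would rather not set these values in code, the same keys can be supplied at submit time through Spark's spark.hadoop.* property prefix, which copies them into the Hadoop configuration. A minimal sketch (the access key, secret key, endpoint and jar name below are placeholders):

spark-submit \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  --conf spark.hadoop.fs.s3a.endpoint=s3.eu-west-2.amazonaws.com \
  --class S3BigDataAnalysis your-app.jar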

2. If you are still facing the V4 signature issue (https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html): ideally, setting "com.amazonaws.services.s3.enableV4" as above should work. If it doesn't, you can try another option, which is what FINALLY did the magic for me.

import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

For the official documentation, you can refer to https://docs.aws.amazon.com/sdk-for-java/
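
Note that System.setProperty only affects the JVM it is called in, i.e. the driver. If the executors also need to sign requests with V4 (for example when running on a cluster), one commonly used alternative, sketched below with an illustrative command line, is to pass the equivalent system property to both driver and executor JVMs at submit time:

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true" \
  --conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true" \
  --class S3BigDataAnalysis your-app.jar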


3. Version compatibility. There are many versions of, and dependencies between, Spark, Hadoop and the AWS SDK. I have used the below versions in my "build.sbt":
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.4" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "2.7.4" % "provided",
  "com.amazonaws" % "aws-java-sdk" % "1.7.4" % "provided"
)
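
Because these dependencies are marked "provided", they still need to be on the classpath at run time. One alternative to shipping them yourself (a sketch; the jar name is a placeholder) is to let spark-submit resolve them with --packages, using the same coordinates as above:

spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.4,com.amazonaws:aws-java-sdk:1.7.4 \
  --class S3BigDataAnalysis your-app.jar
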
Here is the complete program for your reference (the credential and endpoint values are placeholders for you to replace):

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Logger
import org.apache.log4j.Level
import com.amazonaws.SDKGlobalConfiguration

object S3BigDataAnalysis {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Placeholders: replace with your own credentials and region-specific S3 endpoint
    val ACCESS_KEY = "YOUR_ACCESS_KEY"
    val SECRET_KEY = "YOUR_SECRET_KEY"
    val AWS_REGION = "s3.eu-west-2.amazonaws.com"

    val conf = new SparkConf()
    conf.setAppName("S3BigDataAnalysis")
    conf.setMaster("local")

    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
    sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
    sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
    sc.hadoopConfiguration.set("fs.s3a.endpoint", AWS_REGION)
    //sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
    System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

    // Read all JSON files for 2019 from the bucket and print every line
    val iot_devices = sc.textFile("s3a://iot-devices/2019/*.json")
    iot_devices.foreach(println)
  }
}


To follow the research done by several people on this issue, you can refer to the links below:

https://stackoverflow.com/questions/34209196/amazon-s3a-returns-400-bad-request-with-spark

https://stackoverflow.com/questions/57477385/read-files-from-s3-bucket-to-spark-dataframe-using-scala-in-datastax-spark-submi

https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark?noredirect=1&lq=1

https://stackoverflow.com/questions/55119337/read-files-from-s3-pyspark


I hope this tip proves useful and saves you several hours. Please leave a comment here if you have any query; I will try to answer at the earliest opportunity!

🙏

Thursday, 28 November 2019

Installing Java 11 in Amazon Linux

Many old Java applications are still stuck on JDK 1.8 for various reasons. When possible, you should migrate to JDK 13, or at least to JDK 11.

The steps below will help you install JDK 11 and check the Java version.

Step 1: 
Ensure you have internet connectivity from your Amazon EC2 instance, and then type the below command:

sudo amazon-linux-extras install java-openjdk11

The above command will install JDK 11.
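
If the instance already has another JDK installed (for example JDK 1.8), you may also want to make JDK 11 the default. On Amazon Linux this is usually done with the alternatives command; a quick sketch:

sudo alternatives --config java

This shows a numbered list of the installed JVMs; enter the number of the OpenJDK 11 entry to select it.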

Step 2:
Type the below command to check the version

java -version

The result will be something like this:

openjdk version "11.0.5" 2019-10-15 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.5+10-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.5+10-LTS, mixed mode, sharing)

Please note that from Java 9 onwards, the version numbering scheme has changed. For JDK 1.8, the version number looks like "1.8.0_192". For newer releases you won't see 1.x.y_zzz; for JDK 11, for example, it will look like "11.x.y".
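
If you need to check the version from code (for instance in a Scala application like the Spark example above), a minimal sketch using the standard java.version system property:

object JavaVersionCheck {
  def main(args: Array[String]): Unit = {
    // Prints something like "1.8.0_192" on JDK 8 or "11.0.5" on JDK 11
    val version = System.getProperty("java.version")
    println(s"Running on Java $version")
  }
}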

Have fun!


Tuesday, 4 July 2017

SSL 3.0 / TLS 1.0 vulnerability issue and solution

Since SSL 3.0 and TLS 1.0 have known vulnerabilities, you are strongly advised to start using TLS 1.1 or TLS 1.2 to secure your corporate applications.

To force an application server or standalone application to use TLS 1.2, for example, you can pass the following JVM argument:

-Dhttps.protocols=TLSv1.2
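
For example, a standalone application could be launched like this (the jar name is just a placeholder):

java -Dhttps.protocols=TLSv1.2 -jar myapp.jar

Note that the https.protocols property is honoured by clients built on HttpsURLConnection; libraries that create their own SSLContext may need to be configured separately.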

Saturday, 1 July 2017

Maven Dependency Management - Reduce the war file size

A complex web application project might use a large number of third-party libraries in addition to your own application libraries. Together, these jar files increase the war file size, which creates issues while transferring and deploying the files to the UAT and production servers.

This can be sorted out in two steps:
1. In your Maven pom.xml, change the scope of the artifact to "provided" instead of "compile".

<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <scope>provided</scope>
</dependency>

What Maven does is use these libraries to compile the source code without bundling them into the war file. Now look at the size of the war file: it will be a few KBs, not MBs.

2. Run the command "mvn dependency:copy-dependencies" in your project home directory, where the pom.xml is located. Maven will analyse the pom.xml and copy all required dependent libraries into the target/dependency folder. You can then copy these dependent jar files into the designated lib directory of your favorite application server, and the job is done!
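
If you would like this copy step to run on every build instead of invoking it manually, one option (a sketch using the standard maven-dependency-plugin; the execution id is arbitrary) is to bind the copy-dependencies goal to the package phase in your pom.xml:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-dependency-plugin</artifactId>
    <executions>
        <execution>
            <id>copy-dependencies</id>
            <phase>package</phase>
            <goals>
                <goal>copy-dependencies</goal>
            </goals>
            <configuration>
                <outputDirectory>${project.build.directory}/dependency</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>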

Thursday, 29 June 2017

Maven Dependency Management - Listing Dependency Jar files

Here is a small tip if you've started using Maven very recently.

How would you get the list of dependencies used in your Maven project? Run mvn dependency:list, as shown below:

D:\work\messaging>mvn dependency:list
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Messaging 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.8:list (default-cli) @ messaging ---
[INFO]
[INFO] The following files have been resolved:
[INFO]    com.rabbitmq:amqp-client:jar:4.1.1:compile
[INFO]    org.slf4j:slf4j-api:jar:1.7.21:compile

[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.265 s
[INFO] Finished at: 2017-06-29T10:48:58+05:30
[INFO] Final Memory: 12M/114M
[INFO] ------------------------------------------------------------------------