Apache Spark (http://spark.apache.org) needs no introduction: it is one of the most popular engines for processing large data sets quickly. Spark can read data from many sources, and S3 is one of the most popular places to store big data sets. So sooner or later you will find yourself reading large data sets from S3 into Spark to implement business use cases.
In my own work I spent a lot of time troubleshooting one issue: "com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;". I had to go through several blogs, official documentation (AWS, Spark, Hadoop), Stack Overflow questions and answers, etc. to finally make it work. So I thought I would put together a brief write-up that might help someone else.
Please follow these three steps:
1. Set the following key/value pairs on the Hadoop configuration of your SparkContext:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
sc.hadoopConfiguration.set("fs.s3a.endpoint", AWS_REGION)
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
Replace ACCESS_KEY and SECRET_KEY with your own credentials, and set AWS_REGION to the region-specific S3 endpoint (for example "s3.eu-central-1.amazonaws.com", not just the region name).
For the official documentation, refer to https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
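As an alternative to calling sc.hadoopConfiguration.set(...) after the context is created, you can pass the same settings through SparkConf using Spark's "spark.hadoop." prefix, which Spark copies into the Hadoop configuration. A minimal sketch, assuming credentials are available in the standard AWS environment variables (adjust the endpoint to your region):
import org.apache.spark.{SparkConf, SparkContext}
// Any "spark.hadoop.*" key is copied into the Hadoop configuration,
// so this is equivalent to the sc.hadoopConfiguration.set(...) calls above.
val conf = new SparkConf()
  .setAppName("S3ConfigViaSparkConf")
  .setMaster("local")
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))     // assumes env var is set
  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY")) // assumes env var is set
  .set("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")    // adjust to your region
val sc = new SparkContext(conf)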
2. If you are still facing a V4 signature issue (https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html): ideally, setting "com.amazonaws.services.s3.enableV4" as above should work. If it doesn't, there is another option, which FINALLY did the magic for me:
import com.amazonaws.SDKGlobalConfiguration
// ENABLE_S3_SIGV4_SYSTEM_PROPERTY resolves to "com.amazonaws.services.s3.enableV4"
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
For the official documentation, refer to https://docs.aws.amazon.com/sdk-for-java/
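Note that System.setProperty only affects the JVM it runs in, i.e. the driver. If you run on a cluster (rather than local mode) and the executors also need V4 signing, you can pass the equivalent system property to the executor JVMs via Spark's extraJavaOptions setting. A minimal sketch of this idea (set before the SparkContext is created):
import com.amazonaws.SDKGlobalConfiguration
import org.apache.spark.SparkConf
// Driver JVM: set the SDK property directly
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
// Executor JVMs: pass the same property as a JVM option
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")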
3. Version issues. There are many versions of, and dependencies among, Spark, Hadoop, and the AWS SDK. I used the versions below in my "build.sbt":
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.4" % "provided",
"org.apache.hadoop" % "hadoop-aws" % "2.7.4" % "provided",
"com.amazonaws" % "aws-java-sdk" % "1.7.4" % "provided",
)
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Logger
import org.apache.log4j.Level
import com.amazonaws.SDKGlobalConfiguration

object S3BigDataAnalysis {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val conf = new SparkConf()
    conf.setAppName("S3BigDataAnalysis")
    conf.setMaster("local")
    val sc = new SparkContext(conf)

    // Replace these placeholders with your own credentials and region endpoint
    val ACCESS_KEY = "your-access-key"
    val SECRET_KEY = "your-secret-key"
    val AWS_REGION = "s3.eu-central-1.amazonaws.com" // region-specific S3 endpoint

    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
    sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
    sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
    sc.hadoopConfiguration.set("fs.s3a.endpoint", AWS_REGION)
    //sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
    System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

    val iot_devices = sc.textFile("s3a://iot-devices/2019/*.json")
    iot_devices.foreach(println)
  }
}
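If you prefer the DataFrame API, the same configuration works through a SparkSession as well. A minimal sketch under the same assumptions (placeholder credentials and endpoint, the same hypothetical iot-devices bucket; this also assumes you add "org.apache.spark" %% "spark-sql" % "2.4.4" to build.sbt):
import org.apache.spark.sql.SparkSession
import com.amazonaws.SDKGlobalConfiguration

object S3DataFrameExample {
  def main(args: Array[String]) {
    System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

    val spark = SparkSession.builder()
      .appName("S3DataFrameExample")
      .master("local")
      .getOrCreate()

    // Same s3a settings as above, via the underlying Hadoop configuration
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hc.set("fs.s3a.access.key", "your-access-key")             // placeholder
    hc.set("fs.s3a.secret.key", "your-secret-key")             // placeholder
    hc.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com") // adjust to your region

    // Reads JSON-lines files directly into a DataFrame
    val df = spark.read.json("s3a://iot-devices/2019/*.json")
    df.printSchema()
    df.show(5)
  }
}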
To follow the research done by several people on this problem, refer to the links below:
https://stackoverflow.com/questions/34209196/amazon-s3a-returns-400-bad-request-with-spark
https://stackoverflow.com/questions/57477385/read-files-from-s3-bucket-to-spark-dataframe-using-scala-in-datastax-spark-submi
https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark?noredirect=1&lq=1
https://stackoverflow.com/questions/55119337/read-files-from-s3-pyspark
I hope this tip proves useful and saves you several hours. Please leave a comment here if you have any questions. I will try to answer at the earliest opportunity!
🙏