Monica Shiralkar wrote: Yes, I had initially wanted to do that, but mocking S3 didn't work for me. I got the error "No FileSystem for scheme "s3"" while using unittest.mock to mock the spark.read.csv call to S3.
Well, you were not really testing anything related to S3. All you verified (assuming that test ran) is that the Spark mock was called, but that's not the full story. I don't know all the subtleties there, but what you potentially would have wanted, at least, is to test whether spark.read.csv was called, assuming you want to ensure the reader is fixed to the CSV file type.
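For illustration only, such a mock-based assertion might look roughly like the sketch below (load_input is a hypothetical function standing in for your reading code; I haven't run this against your project):

```python
from unittest import mock

# Hypothetical function under test: it only calls spark.read.csv on a path.
def load_input(spark, path):
    return spark.read.csv(path, header=True)

def test_load_input_calls_csv_reader():
    spark = mock.MagicMock()  # stand-in for a SparkSession, no S3 involved at all
    load_input(spark, "s3://some-bucket/some-key.csv")
    # All this really proves is that the csv reader was invoked with that path.
    spark.read.csv.assert_called_once_with("s3://some-bucket/some-key.csv", header=True)
```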
But again, do you see? This type of discussion is not only confusing but also pretty much useless for the application you are building - hence I said that such a test is just extra maintenance without much value.
Monica Shiralkar wrote: If I correlate this with the statement above, does it mean that a unit test should not be written for it, and rather an integration test should be written?
Well, correct. Unit tests shouldn't communicate with external blob storage. That's why I said that, for what you need to test, you can mock the read blob (which would be a DataFrame, wouldn't it) and test your application's business logic, assuming the blob was successfully read from the S3 bucket and loaded into a DataFrame.
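As a rough sketch of what I mean, assuming a hypothetical enrich_orders function as the business logic and a local SparkSession (your actual logic and schema will differ):

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical business logic under test: keep rows with amount > 0, flag large ones.
def enrich_orders(df):
    return df.filter(F.col("amount") > 0).withColumn("large", F.col("amount") > 100)

@pytest.fixture(scope="session")
def spark():
    # Local Spark session; nothing here talks to S3.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_enrich_orders(spark):
    # Build the "blob" in memory instead of reading it from S3.
    df = spark.createDataFrame(
        [("a", 50), ("b", 150), ("c", -5)],
        ["order_id", "amount"],
    )
    result = enrich_orders(df).collect()
    assert [r["order_id"] for r in result] == ["a", "b"]
    assert [r["large"] for r in result] == [False, True]
```

The point is that the unit test exercises what your application does with the DataFrame, not whether Spark can reach the bucket.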
The integration test, in contrast, would read an actual file from the S3 bucket (and not just that) and test some more elaborate behaviour, which by proxy would verify that reading from the S3 bucket succeeds, meaning the application has access to it, and so on. I'm assuming access to the S3 bucket wouldn't be needed from where the application gets built, but rather from where it runs; that's another reason why this should be part of a bigger integration test, perhaps running once a day or so, which submits the Spark job to a cluster (e.g. AWS EMR, GCP Dataproc), depending on where you'd run it in practice.
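If and when you get there, such an integration test might look something like this sketch (the bucket path and the pytest marker name are placeholders, and it assumes the environment running it has S3 credentials and the s3a filesystem jars available, i.e. it runs where the job runs, not where it builds):

```python
import pytest
from pyspark.sql import SparkSession

S3_PATH = "s3a://your-bucket/path/to/input.csv"  # placeholder path

@pytest.mark.integration  # run separately from unit tests, e.g. on a daily schedule
def test_reads_real_file_from_s3():
    spark = SparkSession.builder.appName("integration-test").getOrCreate()
    # This only succeeds where the cluster has S3 access and the s3a connector
    # on the classpath; verifying that is part of what this test is for.
    df = spark.read.csv(S3_PATH, header=True)
    assert df.count() > 0  # plus whatever more elaborate behaviour you care about
```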
I would perhaps leave integration tests for later, until your application matures a bit more (along with your CI/CD pipeline).