The processes discussed here were introduced in Chapter 6. A reason for introducing them there and following up here is that many of the concepts, such as logging, monitoring, and error handling, had not yet been covered at that point. Now, however, it is just a matter of connecting the dots and providing more detail within the context of each of these three sections.
Design and Configure Exception Handling
Refer to Figure 6.72, which represents a pipeline that contains an Azure Batch Custom activity and two failure dependencies. The first is from a Validation activity named Validate Brainwaves, which checks for files in the folder the batch job is expected to transform. If the folder exists and contains files, the flow continues to the Custom activity. If there are no files to process, the Lookup activity performs its operation and passes the flow on to the batch job; if the lookup fails, the pipeline execution stops and the status is set to Failed. The other failure dependency is from the Custom activity named Calculate Frequency Median. If you have not already read that section in Chapter 6, consider doing so now, as it covers error handling in the batch source code in more detail. If the batch process returns an error for any reason, the Fail activity named Batch Load Fail is triggered. The Fail activity writes the error message and error code into the pipeline run details, which expedites troubleshooting because a relevant explanation of the exception is easy to find. Keep in mind that a Custom activity has Retry and Retry Interval configurations, just as a Data Flow activity does. If failures are handled well in your batch code, rerunning the job after a short pause might result in a successful outcome.
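How the batch code signals failure to the pipeline is worth a closer look. A Custom activity reports failure when the executable it runs exits with a nonzero exit code, which is what routes the pipeline down the failure dependency to the Batch Load Fail activity. The following is a minimal Python sketch of that pattern, not the actual Calculate Frequency Median source; the function name and structure are placeholders.

```python
# Sketch of a console-style batch program whose unhandled exceptions are
# written to stderr and surfaced to the pipeline as a nonzero exit code.
import sys
import traceback


def calculate_frequency_median():
    # Placeholder for the actual brainwave transformation logic.
    raise NotImplementedError("transformation logic goes here")


if __name__ == "__main__":
    try:
        calculate_frequency_median()
    except Exception:
        # stderr is captured by Azure Batch, so the details remain available
        # alongside the error shown in the pipeline run output.
        traceback.print_exc(file=sys.stderr)
        sys.exit(1)  # a nonzero exit code marks the Custom activity as Failed
```

Exiting cleanly with a nonzero code, rather than hanging or swallowing the exception, is also what makes the Retry and Retry Interval settings meaningful, because each retry reruns the same executable and can succeed once a transient problem has passed.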
Debug Spark Jobs by Using the Spark UI
When it comes to debugging applications of any kind, much of the time and effort is spent identifying the component(s) that caused the issue. Once a component, such as a Spark job, has been identified as a key contributor to the problem, the next step is to search for logs from that specific component. Those logs might hold information that explains in detail what took place. To get an overview of Spark jobs in Azure Databricks, refer to Figure 6.42, which illustrates the workflow dashboard and is useful for assessing overall job health. Drilling down into the job specifics renders more information to guide you further along the debugging process. Consider, for example, Figure 9.21, which renders resource utilization during the job run. Figure 9.25 is an illustration of the Spark UI, which shows very low-level information about the Spark job. Expanding the DAG Visualization group renders Figure 9.26, which is a method-by-method execution path of the job. This is about as granular as you can get, and it should show you precisely where the issue occurred. After finding the specific location, discuss the issue with a developer to determine the course of action. Lastly, if the Spark job is run from an Azure Synapse Analytics Spark pool, then in addition to the information just described, there is also an illustration of the Spark application (see Figure 9.16) and direct access to the stderr and stdout logs.
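One practical step before opening the Spark UI is to label the work you want to inspect, so the relevant jobs are easy to pick out of the Jobs tab and its DAG visualization. The following is a minimal PySpark sketch of that idea; the application name, storage path, and column names are hypothetical placeholders, not values from the exercises.

```python
# Minimal PySpark sketch: label jobs so they are easy to locate in the Spark UI.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("brainwave-debug").getOrCreate()
sc = spark.sparkContext

# The group ID and description appear in the Jobs tab of the Spark UI,
# making the corresponding DAG visualization and stage details easy to find.
sc.setJobGroup("frequency-median", "Calculate frequency median per electrode")

df = spark.read.parquet(
    "abfss://data@example.dfs.core.windows.net/brainwaves"  # hypothetical path
)
counts = df.groupBy("ELECTRODE", "FREQUENCY").count()

# Triggering an action launches the jobs that show up under the group above.
print(counts.count())
```

When the job reruns, the labeled jobs stand out in the Jobs tab, and drilling into one of them opens the same stage- and task-level detail discussed earlier, which narrows the search for the failing step considerably.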