This chapter covered many aspects of optimizing data storage, data processing, and pipelines. From a data storage perspective, you learned how data skew, data spills, and shuffling degrade the storage and usability of your data. Running a command like DBCC PDW_SHOWSPACEUSED, which shows how a table's data is stored across distributions, is one way to determine whether your data is skewed. An explain plan (i.e., query plan) can expose data shuffling, which adds latency because of the time required to move data between the nodes where it is stored. You also learned that optimizing your partitions, tuning table indexes, and making sure you are reading from cache all improve query performance.
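
As a rough illustration of what "skewed" means in practice, the per-distribution row counts reported by a command such as DBCC PDW_SHOWSPACEUSED can be reduced to a single ratio. The sketch below is only an assumption about how you might do that: the function name `skew_ratio` and the sample row counts are hypothetical, and the counts are hard-coded rather than read from a real pool.

```python
from statistics import mean


def skew_ratio(rows_per_distribution):
    """Return the largest row count divided by the average row count.

    A value of 1.0 means rows are spread perfectly evenly; larger values
    mean one distribution holds far more rows than the others.
    """
    counts = list(rows_per_distribution)
    if not counts:
        raise ValueError("need at least one distribution")
    average = mean(counts)
    if average == 0:
        return 1.0  # an empty table is not skewed
    return max(counts) / average


# Hypothetical row counts for four distributions
# (a dedicated SQL pool actually has 60).
even = [1000, 1000, 1000, 1000]
skewed = [3400, 200, 200, 200]

print(skew_ratio(even))
print(skew_ratio(skewed))
```

How large a ratio is acceptable depends on the workload, so treat any threshold you pick as a starting point rather than a rule.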

Many dynamic management views (DMVs) are very useful for troubleshooting inefficient data storage and poor performance. There are DMVs specialized for indexes, transactions, queries, partitions, and most other things you need to manage and optimize a database. For example, the sys.dm_pdw_request_steps DMV lists all the steps taken for a given query. That means once you have narrowed down the latency to a specific query, you can break it down even further into the individual steps the query must perform to complete.
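
To make the idea of drilling into query steps concrete, the sketch below ranks step rows by elapsed time. It assumes the rows have already been fetched from sys.dm_pdw_request_steps into Python dictionaries; the function name `slowest_steps` and the sample values are hypothetical, while `step_index`, `operation_type`, and `total_elapsed_time` are columns of that DMV.

```python
def slowest_steps(steps, top=1):
    """Return the `top` steps with the largest total_elapsed_time (milliseconds)."""
    return sorted(steps, key=lambda s: s["total_elapsed_time"], reverse=True)[:top]


# Hypothetical rows for one request, shaped like the DMV output.
steps = [
    {"step_index": 0, "operation_type": "RandomIDOperation", "total_elapsed_time": 0},
    {"step_index": 1, "operation_type": "OnOperation", "total_elapsed_time": 15},
    {"step_index": 2, "operation_type": "ShuffleMoveOperation", "total_elapsed_time": 4200},
    {"step_index": 3, "operation_type": "ReturnOperation", "total_elapsed_time": 310},
]

for step in slowest_steps(steps, top=2):
    print(step["step_index"], step["operation_type"], step["total_elapsed_time"])
```

In this made-up data the shuffle step dominates, which is the kind of signal that would send you back to the table's distribution design.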

Error handling and troubleshooting features like retry capabilities, Fail activities, and failure dependencies improve the resiliency of your data analytics pipeline by automating the recovery from expected errors. The retry options available in both Data flow and Custom Pipeline activities help to manage transient issues that prevent successful execution. Setting a maximum number of retries and a wait period between those retries helps you avoid the retry storm antipattern. A Fail activity lets you capture custom error information when an activity fails. This is a more efficient approach than simply rendering a Failed pipeline status and having someone troubleshoot from scratch. An activity bound to a failure dependency provides the means to recover or repair data from a previously failed activity. If the activity bound to the failure dependency completes successfully, then the pipeline status is set to Succeeded.
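
The effect of capping retries and waiting between them can be sketched outside of any pipeline tool. The example below is a generic illustration, not how the pipeline service implements its retry settings: `run_with_retry`, `TransientError`, and the `flaky` action are all hypothetical names, and the `sleep` parameter exists only so the wait can be replaced when trying the code out.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def run_with_retry(action, max_retries=3, wait_seconds=1.0, sleep=time.sleep):
    """Call action(); on TransientError, wait and retry up to max_retries times."""
    retries_used = 0
    while True:
        try:
            return action()
        except TransientError:
            if retries_used >= max_retries:
                raise  # give up so callers are not retried forever
            retries_used += 1
            sleep(wait_seconds)  # pause so retries do not pile up on the target


# A hypothetical action that fails twice and then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("temporary outage")
    return "done"


# Pass a no-op sleep so the demonstration does not actually wait.
print(run_with_retry(flaky, max_retries=3, wait_seconds=30, sleep=lambda s: None))
```

The cap bounds the total load a failing dependency receives, and the wait spreads that load out; both are needed to avoid a retry storm.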

This concludes the written portion of the book. I asked my daughter Lea to give me the last sentence for the book, and she provided a quote from Eleanor Roosevelt: “If life were predictable, it would cease to be life and be without flavor.” The quote is paradoxical, as data analytics has a lot to do with predictions. As you build your data analytics solutions, consider building in some flavor.
