The “Analytics Types” section in Chapter 2 described the five categories of data analytics: descriptive, diagnostic, predictive, preemptive, and prescriptive. Each of these is an analytical workload, which ties directly into what you learned in the previous section: OLTP operations are transactional, and OLAP operations are analytical. Reviewing those five data analytics types, you should recognize that none of them is transactional. Therefore, all of the previously discussed optimization techniques that apply to OLAP operations apply here: the parallel execution and runtime scaling capabilities of data flows, the plethora of DMVs, indexing, caching, columnstores, execution plans, query execution statistics, partitions, distributions, and compacting files, to name some of the more impactful examples.
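For example, the DMVs on a dedicated SQL pool expose query execution statistics directly. The following is a minimal sketch, assuming a dedicated SQL pool, of how you might list the longest-running recent requests; the column selection is illustrative:

-- List the ten slowest recent requests on a dedicated SQL pool
SELECT TOP 10 request_id, [status], total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC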
Running data analytics pipelines in the analytical context means you will be ingesting data far more often, and from numerous data sources. That statement holds in most cases, the exception being HTAP or combined OLTP/OLAP scenarios, where the data remains in the same datastore for both transactional and analytical purposes. It is therefore prudent to discuss PolyBase, external tables, and the Copy activity. The “External Table” section in Chapter 2 described the relationship between PolyBase and external tables (refer to Figure 2.12). You also created external tables on a serverless SQL pool in Exercise 4.11, with a SQL command like the following:
CREATE EXTERNAL TABLE SampleBrainwaves
(
    [Timestamp] NVARCHAR(50),
    [AF3theta] NVARCHAR(50),
    [AF3alpha] NVARCHAR(50),
    [AF3betaL] NVARCHAR(50),…
)
WITH
(
    LOCATION = 'EMEA/brainjammer/out/2022/04/03/*/*.parquet/*',
    DATA_SOURCE = SampleBrainwavesSource,
    FILE_FORMAT = SampleBrainwavesParquet
)
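The DATA_SOURCE and FILE_FORMAT referenced in that statement point to objects created in the same exercise. As a rough sketch, and assuming a placeholder ADLS container URL rather than the real one, their definitions would resemble the following:

-- Placeholder ADLS URL; the actual value comes from your storage account
CREATE EXTERNAL DATA SOURCE SampleBrainwavesSource
WITH (LOCATION = 'abfss://<container>@<account>.dfs.core.windows.net')

CREATE EXTERNAL FILE FORMAT SampleBrainwavesParquet
WITH (FORMAT_TYPE = PARQUET)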
These two technologies, PolyBase and external tables, work together so that SQL queries like the preceding one can be run against data stored in files, commonly hosted in an ADLS container. Figure 4.33 shows the result of running the following query against the SampleBrainwaves external table created by the preceding CREATE EXTERNAL TABLE statement:
SELECT TOP 10 * FROM SampleBrainwaves
There are two optimization points to call out in the context of PolyBase and external tables. The first has to do with the number and size of the files behind the external table, and the second concerns querying the data through it. Your first thought when asked about number thresholds in the context of Azure Synapse Analytics should be the number 60, the law of 60 in particular. The optimal number of files to load through an external table is 60 or more, which maximizes the parallelism of your data load. To review why, remember that in Exercise 5.6 you split a single 391 MB file into 60 files. Because a dedicated SQL pool spreads data across 60 distributions, those files can be loaded on multiple nodes in parallel, which is considerably more performant than loading a single large file.
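To illustrate the load side, the following is a minimal sketch of a CREATE TABLE AS SELECT (CTAS) load from the external table into a dedicated SQL pool staging table. It assumes the SampleBrainwaves external table is defined on the dedicated SQL pool, and the table name and the ROUND_ROBIN/HEAP options are illustrative choices, not taken from the exercises:

CREATE TABLE [dbo].[SampleBrainwavesStage]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,  -- spread the incoming rows evenly across the 60 distributions
    HEAP                         -- a heap is typically the fastest target for a staging load
)
AS
SELECT * FROM SampleBrainwaves   -- the 60 underlying Parquet files are read in parallel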
The other technique that can optimize performance has to do with querying the data on external tables. The magic of PolyBase is delivered through APIs that provide the capability to query files with SQL statements. Those files remain hosted on an Azure storage account, which is not backed by significant compute power. This means that each time a query is executed against an external table, the entire file is loaded into tempdb; the query runs against the data in tempdb, not against the file hosted in the ADLS container. You can imagine the impact on query performance if the file is large. If you find that the same query is run against an external table many times, consider storing the dataset in a local table instead.
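One way to do that on a dedicated SQL pool, sketched here with an illustrative table name, mirrors the staging load above but targets a columnstore table: materialize the external data once with CTAS, then point the repeated queries at the local copy.

CREATE TABLE [dbo].[SampleBrainwavesLocal]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX  -- suited to repeated analytical queries
)
AS
SELECT * FROM SampleBrainwaves;  -- the files are pulled into tempdb only this once

SELECT TOP 10 * FROM [dbo].[SampleBrainwavesLocal]  -- subsequent queries never touch the files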
From a Copy activity perspective, remember that the data being ingested into the Azure Synapse Analytics workspace must traverse the network. Therefore, the most important consideration is the proximity of the physical location of the data to be ingested to the physical location of the nodes that will process it. You do not want data to be transmitted across an ocean; ideally it travels across the internal network inside a single datacenter. So, keep the datastore and the nodes that will ingest and transform your data as close to each other as possible.
The other major influencer on copy (aka ingestion) performance is the size or amount of data being copied. If the data is in the form of a file, then by all means use compression. Compression reduces the size of the file being transferred over the network, which greatly reduces the time required to do so. When ingesting from a relational database, copy data in bulk as opposed to line by line. If you move data line by line, each row is read one at a time from the source datastore and then inserted at the same frequency into the destination, which is very slow. When you instead use BULK INSERT or the bulk copy program (bcp), as discussed in numerous chapters but introduced in Chapter 2, many rows are retrieved at once and inserted concurrently, which drastically decreases the time required to copy the data.
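To make the bulk-plus-compression point concrete, here is a minimal sketch using the COPY statement on a dedicated SQL pool, which bulk loads compressed files directly from storage. The storage URL, table name, and file layout are placeholders, and a CREDENTIAL clause may be required depending on how authentication is configured:

COPY INTO [dbo].[SampleBrainwaves]
FROM 'https://<account>.dfs.core.windows.net/<container>/brainwaves/*.csv.gz'
WITH
(
    FILE_TYPE = 'CSV',
    COMPRESSION = 'GZIP',   -- gzip-compressed files shrink the bytes sent over the network
    FIELDTERMINATOR = ',',
    FIRSTROW = 2            -- skip the header row
)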