Why microbenchmarks are not suitable for performance analysis

What are microbenchmarks, and how do they differ from real-world programs?

Microbenchmarks are very small programs that test a small number of operations in a loop, often as few as a single computation. They differ from real-world programs in that the results of the computations aren’t used for any purpose in the program. Microbenchmarks are used to measure the performance of a computation, and the computation is put in a loop because a single computation is generally too short to be measured. Microbenchmarks are not real-world programs because they are not meaningful programs written to meet business needs or perform a useful task.

Ex. 1: A microbenchmark that measures the performance of the ADD statement. There is no purpose in incrementing Y, as it isn’t used anywhere else in the program.

01 X PIC 9(6) COMP-3.
01 Y PIC 9(9) COMP-3 VALUE 0.

PERFORM VARYING X FROM 1 BY 1 UNTIL X > 99999
    ADD 1 TO Y
END-PERFORM.

Ex. 2: Real-world loop computing data that is used elsewhere in the program.

01 X PIC 9(6) COMP-3.
01 RECEIPTS-GRP.
    02 RECEIPTS OCCURS 99999 TIMES.
        03 TOTAL PIC 9(18) COMP-3.
        03 SUBTOTAL PIC 9(18) COMP-3.
01 MAX-ITEMS PIC 9(5) COMP-3.
01 TAX-RATE PIC 9(2)V9(2) COMP-3.

<Code that computes subtotals>
…
PERFORM VARYING X FROM 1 BY 1 UNTIL X > MAX-ITEMS
    COMPUTE TOTAL(X) = SUBTOTAL(X) * TAX-RATE
END-PERFORM
…
<Other code that uses TOTAL>

Why is it not appropriate to do performance testing on microbenchmarks?

Tight loops (loops made up of a small number of machine instructions, usually coming from a small number of COBOL statements within the loop, as in Ex. 1) can expose interactions between instructions that are uncommon in real-world programs. Because of these interactions, microbenchmarks are not reliable indicators of performance: optimization done on microbenchmarks may be optimizing for interactions that don’t show up in real-world code.

Spending time optimizing microbenchmarks may make those microbenchmarks run faster, but the effort may have minimal impact on actual applications. Since it’s the real-world applications that run in production on a regular basis, not the microbenchmarks, there is no use in analyzing and optimizing microbenchmarks unless similar issues can be found in real-world code. There is little value in improving test code when doing so has minimal impact on real-world applications.

Here are some specific things to consider:

Interactions Between Instructions

Instructions in a program are not executed in isolation. Instructions issued earlier can affect instructions issued later. For example, in the tight loop in Ex. 1, the loop counter (X) and the data item being incremented (Y) are read from and written to repeatedly in close succession. This strains the hardware, which must always have the correct data ready when it’s needed without introducing stalls (where subsequent instructions are delayed because they’re waiting on a prior result). While both IBM Z hardware and the Enterprise COBOL for z/OS compiler have improved in recent releases to handle this better, this interaction is likely to occur more frequently in a microbenchmark than in real-world code. There are many other types of interactions between instructions as well. Optimizing a microbenchmark may result in optimizing interactions between instructions that don’t frequently occur in real-world code, and thus the work may have minimal real-world impact.

Variation

Another problem is that some microbenchmarks have a short execution time, which means that any variation in execution time (“noise” in the system due to system load and other factors) is exaggerated. For example, if an application takes 2 seconds to run and then takes 3 seconds when it is run again, does that indicate a 1.5x increase, or is it due to noise? By contrast, if a long-running program takes 100 seconds to run and then takes 101 seconds when run again, the same 1 second of “noise” is just a small fraction of the total.

Run 1	Run 2	Increase (seconds)	Increase (multiplication factor)
2 seconds	3 seconds	1 second of noise	1.5x
100 seconds	101 seconds	1 second of noise	1.01x
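The arithmetic behind this table can be sketched in a few lines of Python. This is only an illustration using the run times from the table; the function names slowdown_factor and difference_share are hypothetical and are not part of any IBM tool.

```python
def slowdown_factor(first_run_s: float, second_run_s: float) -> float:
    """Return how many times longer the second run took than the first."""
    return second_run_s / first_run_s


def difference_share(first_run_s: float, second_run_s: float) -> float:
    """Return the run-to-run difference as a fraction of the first run."""
    return abs(second_run_s - first_run_s) / first_run_s


# Run times (in seconds) taken from the table above.
for first, second in [(2.0, 3.0), (100.0, 101.0)]:
    factor = slowdown_factor(first, second)
    share = difference_share(first, second)
    print(f"{first:g}s -> {second:g}s: {factor:.2f}x, "
          f"difference is {share:.0%} of the first run")
```

The same 1 second of difference is half of the short run but only about one hundredth of the long run, which is why a short measurement cannot distinguish a real slowdown from noise.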

Running a short program many times over can help eliminate some of this variance, but there is also overhead associated with running any program. This overhead includes invoking the COBOL runtime, bringing the program into memory, and initializing data items, and the compiler cannot improve the performance of this overhead. If the running time of the program itself is small, that overhead becomes a large portion of the measured application performance. Whether you run the program once or thousands of times, the ratio of overhead to program running time stays the same.

Run 1 (run once)	Run 2 (run 100x)
2 seconds program runtime + 1 second overhead = 3 seconds	(2 seconds program runtime * 100 runs) + (1 second overhead * 100 runs) = 200 + 100 = 300 seconds
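As a rough check of the figures in this table, the following Python sketch computes the share of total time spent in overhead for 1 run and for 100 runs. The 2-second runtime and 1-second overhead are the illustrative values from the table, not measurements, and the function names are hypothetical.

```python
def total_seconds(program_s: float, overhead_s: float, runs: int) -> float:
    """Total elapsed time for `runs` back-to-back executions."""
    return (program_s + overhead_s) * runs


def overhead_fraction(program_s: float, overhead_s: float, runs: int) -> float:
    """Share of the total elapsed time spent in per-run overhead."""
    return (overhead_s * runs) / total_seconds(program_s, overhead_s, runs)


# Illustrative values from the table: 2 s of program time, 1 s of overhead.
for runs in (1, 100):
    total = total_seconds(2.0, 1.0, runs)
    share = overhead_fraction(2.0, 1.0, runs)
    print(f"{runs} run(s): {total:g} seconds total, {share:.1%} overhead")
```

In both cases about one third of the elapsed time is overhead. Repeating the run multiplies the program time and the overhead by the same factor, so the ratio does not change.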

When is it appropriate to do performance testing on a microbenchmark?

It may be that a microbenchmark-style loop actually appears in real-world code. If so, and if performance measurements indicate that it is a bottleneck, then it is worth looking at. Unlike a standalone microbenchmark, this is a case where the loop affects the overall performance of the application, so a performance improvement will actually be beneficial.

Are all real-world programs appropriate for performance testing?

A short-running real-world program is subject to the same concerns as a short-running microbenchmark. In that case, it’s more useful to see performance measurements for the whole application, perhaps run with more data, so that the actual bottlenecks in the application are exposed.

Recommended approach to performance testing

Performance testing should focus on long-running, real-world applications. In cases where applications are not performing as well as they should, or perform worse on newer hardware or when compiled with a newer compiler version, a detailed performance report can indicate which programs, and which instructions in those programs, are performing worse between versions. This information does not need to be gathered in production, but it should be gathered with the actual application and ideally with real-world data. A performance report and measurements give the compiler developers a targeted place to look, enabling IBM to fix situations that have a direct impact on the performance of client applications running in production. Time spent optimizing microbenchmarks may not have any impact on a client application’s performance in production, and so it takes away from more useful work that could benefit our clients.

IBM offers COBOL performance tuning webinars. Register to join a live webinar or find a pre-recorded webinar video here.

Originally published on the IBM Z and Linux Community Blog.

By Mike Chase
