The Cost of a Major Batch Window Failure

July 31, 2012

By: Steve Woodard, SoftBase

A recent software glitch during a systems upgrade at a prominent financial institution resulted in nearly a week’s worth of unprocessed batch job transactions and left millions of customers without access to their accounts. This has significantly affected the organization’s reputation and will cost an estimated tens of millions of euros in compensation.

A glitch like this is an IT manager’s worst nightmare, and can cost hundreds of millions in lost transactions, fines, legal fees and customer confidence. This particular incident occurred as the result of a major batch window failure, meaning that the organization’s nightly batch jobs did not complete in time. Batch jobs are programs that process large amounts of critical business data, typically from the previous day, such as sales transactions, inventory levels and account balance updates. The glitch underscores the vital importance of ensuring that your batch DB2 applications complete on time.

No one wants to see any organization experience this kind of disaster. But it raises the questions: What would happen if your batch jobs completed a week late? What would it cost your company? What could it mean for your job security?

Estimates of what this could cost an organization vary. One estimate suggests that the average financial impact of just one hour of downtime for a financial institution is nearly $1.5M. Another suggests that the median cost of downtime for organizations using DB2 z/OS is around $6K per minute, or $360K per hour. So, if your company falls into the median category and you experience a full week (168 hours) of downtime, it will cost your company more than $60M.
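The arithmetic behind the median figure can be checked with a short Python sketch. It assumes the $6K-per-minute rate quoted above applies around the clock for the whole week, which is a simplification; the function and constant names are illustrative only.

```python
# Median downtime cost quoted above for organizations using DB2 z/OS.
COST_PER_MINUTE_USD = 6_000


def downtime_cost(hours, cost_per_minute=COST_PER_MINUTE_USD):
    """Return the estimated cost of `hours` of downtime at a flat per-minute rate."""
    return hours * 60 * cost_per_minute


per_hour = downtime_cost(1)
per_week = downtime_cost(7 * 24)  # assumes the outage lasts a full 168 hours

print(f"One hour: ${per_hour:,.0f}")  # One hour: $360,000
print(f"One week: ${per_week:,.0f}")  # One week: $60,480,000
```

Substituting your own organization's per-minute figure gives a rough first estimate of what a missed batch window could cost.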

There are several steps organizations can take to avoid this type of situation and ensure that their batch jobs are completed on time and with fewer resources. They can start by paying attention to the following:

Plan for Emergencies – It may seem obvious, but ensuring that your organization has a contingency plan for a batch job failure is the easiest and most effective way to prevent an expensive outage. Normally, if a batch job fails in the middle of a process, it first has to ‘undo’ every change it has made; then the problem must be found and fixed, and only then can the job be restarted. All of these steps take time, and within a 6- to 8-hour batch window, time is a luxury you do not have. Consider investigating a checkpoint solution, which essentially lets each job save its work periodically. If a job does fail, it can be restarted from the most recent ‘save point’ instead of from the beginning.
Nightly vs Real-Time Processing – Large companies have hundreds or even thousands of jobs that need to execute every night. These long-running programs are typically allowed to run only during the latest hours of the night, when real-time and online transaction volumes are low. Running these jobs at night allows the company to better manage demands on the CPU. However, if there are many real-time transactions occurring during these hours, organizations may need to shrink or rearrange their batch window, or purchase more powerful (and expensive) hardware.
Relentless Data Growth – DB2 production databases grow steadily over time. The more DB2 data that needs to be processed each evening, the longer the nightly batch jobs take to complete. Companies should plan to archive old and unneeded data so that it no longer has to be processed. Alternatively, software upgrades, CPU upgrades or third-party tools can help you better control your growing database.
Inefficient Batch Applications – Many companies lack the ability to determine why their batch DB2 jobs are running longer. They are unable to find the key SQL statements that are extending their batch window, or to see where they are encountering contention problems. There’s an entire field of study dedicated to tuning DB2 applications. Consider hiring a DB2 tuning consultant. Alternatively, many IBM and third-party tools are designed to help you quickly identify and fix program inefficiencies.
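
To make the checkpoint idea from the first point concrete, here is a minimal Python sketch of a restartable batch loop. It is an illustration only, not how any particular DB2 checkpoint product works: the names `run_batch`, `load_checkpoint` and `save_checkpoint`, the JSON checkpoint file and the `interval` parameter are all hypothetical. In a real DB2 job, each checkpoint would coincide with a COMMIT, so that work done after the last checkpoint is rolled back on failure and redone on restart.

```python
import json
import os


def load_checkpoint(path):
    """Return the index of the next unprocessed record (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Write the checkpoint to a temp file, then swap it in atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)  # atomic rename: never leaves a half-written checkpoint


def run_batch(records, process, checkpoint_path, interval=1000):
    """Process records in order, saving a 'save point' every `interval` records.

    On restart, processing resumes from the last saved index rather than from
    record 0. Records handled after the last checkpoint are processed again,
    so `process` must be safe to repeat (or its effects rolled back first).
    Returns the index at which this run started.
    """
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(records)):
        process(records[i])
        if (i + 1) % interval == 0:
            save_checkpoint(checkpoint_path, i + 1)
    # Job completed: clear the checkpoint so the next run starts fresh.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
    return start
```

If a job working through ten million records fails at record 9,500,000 with an interval of 100,000, the rerun resumes at 9,500,000 instead of zero. The trade-off is in the interval: more frequent checkpoints mean less rework after a failure but more overhead during a normal run.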

The numbers clearly illustrate that a batch window failure can be disastrous for an organization. It’s vitally important that IT is prepared with the necessary tools to avoid a software glitch.

BIO

Steve Woodard is the President and CEO of both SoftBase, a leading provider of application testing and tuning solutions for IBM’s DB2® database utilizing the OS/390® and z/OS® operating systems, and Quadrant Software, a leading provider of Document Output Management (DOM) solutions for the IBM i (iSeries) enterprise. Prior to joining SoftBase, Steve was President of Ultimus Inc., a leading provider of Business Process Management solutions with over 1900 customers and partners worldwide. Prior to Ultimus, he was the Senior Vice President of Global Operations at Venturcom/Ardence, making significant contributions to worldwide revenue growth, the expansion of sales/OEM channels, and the creation of global partner networks that helped lead to acquisition by Citrix Systems Inc. in 2007.
Steve holds a B.S. degree in Electrical Engineering Technology from Keene State College in NH, where he has been a member of the adjunct faculty. He enjoys time with family, church and community service, fitness, all sports, and his Parson Russell Terriers.