Denver Mug Blog
Mark Pepall, CTO
Mind the gap

Synopsis: the consolidation and commoditisation of IT infrastructure within large companies has led to large cost savings over the last decade. But along the way, IT's understanding of the business value of the services it delivers has been lost. A new generation of technology and processes is required to glue disparate IT silos together. Business Service Management (BSM) directly addresses many of these shortcomings.

Consolidation and fragmentation

When I first worked in London, if the Head of Treasury was looking over my shoulder as I tapped on the keyboard, I realised pretty quickly that the issue I was working on was important. Back in those days, many London banks ran big, monolithic applications on a single mainframe. Most departments had their own IT departments, and IT workers often sat side-by-side with the business users they supported. If I needed help from the DBAs, I'd shout over to the other side of the office to get them on the case. The wastage and overlap were staggering, but this was the early nineties: everyone was making money (and empire building), so the costs didn't seem much of an issue.

In the intervening 15 years, large companies have made great strides in consolidating their IT infrastructure, leveraging their vendor relationships and standardising technology, thereby realising huge cost savings. By using commodity items to underpin their IT services, companies have been able to package, off-shore, near-shore and outsource services cost-effectively.

Somewhere along the way, however, most have irretrievably broken the understanding IT workers used to have of the business value of the services they provide. In the old days, there weren't too many layers between the hardware and the business.
These days, since the triumph of distributed computing, Web architecture, SOA and Agile computing, there can be a multitude of support layers between the two. Support for many of the layers is often outsourced to different companies, often on different continents (network hardware support in London, network operations in Mumbai, SAN hardware by a different London provider to the network one, SQL support from Bangalore, Oracle support from Romania, Windows support from Hyderabad, etc).

Outsourcing and management by ticket

Of course, outsourcing only reinforces the siloed approach and heightens the walls between IT silos. A database or Windows support company based in India can concentrate on providing cheap, commodity level one administrators, backed up by very good level three engineers, to deliver a powerful and irresistible support offering. The trouble is that this leads to a very inward, domain-centric view of the world.

One of my colleagues received a call last year from an exasperated trader in Singapore. The bank's main payment gateway server, in London, was down at a critical time in Tokyo's time zone. He'd logged a call, but his ticket was in a queue and "would be processed in due course". The London-based support team had no idea that his downtime was costing the company USD 100,000 per hour.

This "management by ticket" is one symptom of the loss of business awareness: in the absence of any other information, support groups tend to process trouble tickets in chronological order ("ticket priority" is a completely redundant field - everyone uses "high", otherwise they'll wait forever for a resolution!). There is also considerable pressure on many providers to process tickets quickly: it makes their figures look better, helps them keep inside their (often inappropriately framed) SLAs and, worse still, in many cases more tickets processed means more income.
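
The difference between chronological processing and impact-aware processing can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Ticket` fields, the service names and the cost-per-hour estimates are assumptions made for illustration, not taken from any real ticketing product. It simply shows how a queue reorders once each ticket carries an estimated downtime cost.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    opened_minute: int  # minutes since midnight; used only for ordering
    service: str        # business service the ticket relates to


# Hypothetical estimates of downtime cost per hour (USD) for each business service.
COST_PER_HOUR_USD = {
    "payment-gateway": 100_000,
    "intranet-wiki": 50,
}


def chronological(tickets):
    """'Management by ticket': oldest ticket first, regardless of impact."""
    return sorted(tickets, key=lambda t: t.opened_minute)


def by_business_impact(tickets, cost_per_hour=COST_PER_HOUR_USD):
    """Most expensive outage first; the older ticket wins a tie."""
    return sorted(
        tickets,
        key=lambda t: (-cost_per_hour.get(t.service, 0), t.opened_minute),
    )


queue = [
    Ticket("T1", opened_minute=10, service="intranet-wiki"),
    Ticket("T2", opened_minute=20, service="payment-gateway"),
]

first_chronological = chronological(queue)[0].ticket_id
first_by_impact = by_business_impact(queue)[0].ticket_id
```

In this sketch the payment gateway ticket was logged later, so a chronological queue handles it second, while the impact-ordered queue moves it to the front. Even a rough estimate of cost per hour is enough to change the ordering.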
I saw one ticket at a large US investment house do three complete circuits around six different outsourced providers - it really was a case of pass the parcel. All groups were exceeding their individual SLAs, but the end result was an appalling lack of delivery to the business user, who needed to set up some mandatory reporting to the FSA. The emphasis is on quantity, not quality.

Lack of awareness of business impact

Even where the functionality is not outsourced, simple oversights can occur in the heat of battle. At one of our customer sites, the support staff had to rebuild the overnight settlement batch, which processes several hundred thousand trades each night, after an early evening infrastructure issue. At 3 am, the application support person signed off with an email that said "OK, it's all working now except for 5 trades stuck in [system X] - I'll sort those out in the morning". It turned out those 5 trades totalled USD 20 million. Had this bank's monitoring expressed the backlog in dollar value (a business metric) rather than MQ message count (an IT metric), this issue would have been given the focus and urgency it warranted.

This is even more important when low-level infrastructure services break. For example, a SAN fabric failure in most large companies will result in a flood of tickets: a sea of red. Event management software companies (like IBM, HP, CA and BMC) have some good products that help reduce such event storms (Event Correlation Analysis) and help isolate the underlying root cause. Even so, these products generally do not include any business awareness: placing a dollar value next to an issue on a ticket or event management dashboard can help support personnel to prioritise appropriately.

There are other factors that also contribute to this breakdown. The cracks between disjointed data centre infrastructures are made worse by mergers and acquisitions.
The waves of consolidation and take-overs following the 2001 recession and the sub-prime crisis in 2008 have forced disparate IT systems and products from different companies together.

The wave of "right"-sizing (i.e. down-sizing) since the sub-prime crisis has not helped either. After waves of redundancies, many of the IT staff who have left are those who knew the history of the application or business service and had an overall, end-to-end understanding of how it hangs together. For complex services that interface with many other systems, there is simply no substitute for this sort of experience.

The head of back office technology at a large investment bank told me that he is not a fan of component-based architecture like SOA: "The problem is that business processes cut right across and simply do not align with the technology components". Data centre support groups react to such expressions of lack of faith by increasing the depth of their IT monitoring point solutions. For example, they get earlier and more detailed warnings if their database server starts producing errors, but the monitoring is still silo-focused and doesn't tell them if, say, a slower (but still functional) database response time is causing a transaction processing backlog to build downstream.

Enter Business Service Management

So what can be done to improve things, given that nobody is going to bring support from India back in house or put business and IT people on the same floor again? I don't believe the answer is going to come from the data centre (I should know, I've spent my entire working life in the data centre org chart!). Fortunately, vendors are responding to these issues with a wave of second generation application management tools, generally known as Business Service Management, although some vendors and analysts use the term Application Performance Management (APM). Vendors' definitions tend to align closely with their product offerings and product strengths, and therefore vary considerably.
Wikipedia defines BSM thus: "Business service management (BSM) is a methodology for monitoring and measuring information technology (IT) services from a business perspective. BSM consists of both structured process and enabling software. The Information Technology Infrastructure Library (ITIL), a set of IT management frameworks and concepts, has recently identified BSM as a best practice for IT infrastructure management and operations."

BSM provides many benefits, both to IT and to the business, but I'll focus here on how it can help bridge the gaps opened by the fragmentation of IT services. For more background on BSM/APM, see Gartner's "Magic Quadrant for Application Performance Monitoring" [1].

The most important point is to start top down, from the business side, with the business process (for example, from a service catalogue) and work outwards and downwards from there. The first priority should be to give everyone in the chain a clear view of the overall, end-to-end health of the business service.

End-user experience (EUE) monitoring is absolutely essential. Analysts [2] report that between 40% and 70% of all issues are reported by end users rather than by traditional monitoring systems. And users are generally your most expensive monitoring tool! In complex, multi-tier, multi-vendor environments, application or service outages often go unnoticed. Synthetic user monitoring (where robots or probes perform emulated end-user actions) can give early warning of IT issues before users log in, and indicate performance and availability at times when users are not logged on to the system. It is also vital to use a mix of synthetic and real-user monitoring (RUM), although most vendors' RUM offerings only monitor Java and .NET effectively. EUE is also ideal for quantifying response time - degraded response time can be an early warning of impending problems.
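
As a rough illustration of how a synthetic probe can turn response time into an early warning, here is a minimal Python sketch. The function names, the baseline rule (latest sample compared with the median of recent samples) and the default threshold factor are all assumptions chosen for illustration; commercial EUE products use their own, more sophisticated baselining.

```python
import time
from statistics import median


def time_probe(action):
    """Run one emulated end-user action and return its duration in milliseconds.

    `action` is any zero-argument callable, for example a scripted login.
    """
    start = time.perf_counter()
    action()
    return (time.perf_counter() - start) * 1000.0


def is_degraded(baseline_ms, latest_ms, factor=2.0, min_samples=5):
    """Flag the latest probe if it is much slower than the recent baseline.

    Returns False while there are too few baseline samples to judge.
    """
    if len(baseline_ms) < min_samples:
        return False
    return latest_ms > factor * median(baseline_ms)


# Hypothetical response times (ms) collected by earlier probe runs.
recent = [100, 110, 90, 105, 95]

slow_run_flagged = is_degraded(recent, 250)
normal_run_flagged = is_degraded(recent, 150)
```

The service in this sketch is still up in both cases; the slow run is flagged only because it is more than twice the recent median. That is the kind of "slower but still functional" signal that silo-level availability checks tend to miss.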
EUE can give service views often not possible with first generation monitoring tools (given that support people no longer sit near the users, this is the next best thing).

The most visible artefacts of a BSM implementation are the end-user dashboards. At the highest level these should present overall views of the health of business services and include relevant business metrics, like the number of users affected and the dollar value of downtime, even if it is only an estimate, to ensure support personnel have some idea of the business impact of the issue. Most support personnel, in my experience, get no other feedback on the gravity of the issues they are dealing with.

BSM dashboards should be underpinned by a business model that makes sense of and defines the relationship between IT components and business services. Ideally, this model should be automatically generated. Without this underlying logic, dashboards can be difficult to build, become obsolete very quickly and prove impossible to maintain. A good model will allow display of relationships and dependencies up and down the tree: for example, an IT operations manager can see the business services impacted by a failed SQL server that hosts a dozen shared databases or, conversely, quickly drill down from a poorly performing business service to see what part of the IT infrastructure is the root cause (load balancer? Web server? Application server? Back end database? Network provider?). This deep-dive functionality is a key benefit of BSM, particularly if linked to diagnostics systems, and substantially reduces the cost of dealing with tickets by minimising pass-the-parcel reactions [3] - this alone is often enough to justify the cost of a BSM rollout.
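
The "up and down the tree" navigation described above can be sketched as a small dependency graph in Python. This is an illustrative toy, not how any particular BSM product stores its model: the `ServiceModel` class, its method names and the component names are all hypothetical, and a real model would normally be discovered automatically rather than entered by hand.

```python
from collections import defaultdict, deque


class ServiceModel:
    """Toy model linking business services to the IT components beneath them."""

    def __init__(self):
        self._depends_on = defaultdict(set)  # consumer -> its providers
        self._supports = defaultdict(set)    # provider -> its consumers

    def add_dependency(self, consumer, provider):
        """Record that `consumer` needs `provider` in order to work."""
        self._depends_on[consumer].add(provider)
        self._supports[provider].add(consumer)

    @staticmethod
    def _reach(start, edges):
        """Breadth-first walk returning everything reachable from `start`."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)  # ignore the start node if a cycle leads back to it
        return seen

    def impacted_by(self, component):
        """Walk up: everything that ultimately relies on `component`."""
        return self._reach(component, self._supports)

    def depends_on(self, service):
        """Drill down: every component `service` ultimately relies on."""
        return self._reach(service, self._depends_on)


model = ServiceModel()
model.add_dependency("settlement", "app-server-1")
model.add_dependency("app-server-1", "sql-01")
model.add_dependency("payments", "sql-01")

hit_by_sql_failure = model.impacted_by("sql-01")
settlement_stack = model.depends_on("settlement")
```

Here `impacted_by("sql-01")` answers the operations manager's question (which services suffer if this SQL server fails), while `depends_on("settlement")` gives the drill-down list of candidate root causes for a poorly performing service.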
The business model is a good substitute for the business awareness lost through specialisation; it has the added benefit of taking much of the knowledge about application structure and dependencies out of the heads of a few individuals and making it accessible to a wider audience.

Summary

BSM is no silver bullet, but even the very simplest implementations address many of the issues discussed above. By improving awareness of business impact, providing early warning of problems and reducing repair times, BSM becomes the glue that joins disparate silos together and narrows the gaps in between, and it will go a long way towards restoring business confidence in IT delivery. The Gartner report [1] lists vendor strengths and weaknesses. In particular, consider those vendors in the Magic Quadrant: HP, CA, Compuware and Quest Software.

References:
© Denver Technology (Europe) Ltd 2010