Benchmarking Terms of Reference

This is an extract from an original article by Pam Morris - Total Metrics - published in IFPUG Book on Software Measurement 2011.

Terms of Reference

We recommend that, before engaging a benchmark supplier or funding an in-house benchmarking program, the sponsors work with the benchmarker and stakeholders to establish the ‘Terms of Reference’ for the benchmarking activity. These terms should include the agreed position for each of the following:

1. Strategic Intent of the Benchmark

o How will the results be used?

2. Type of Benchmark

o Internal and/or external?

3. Benchmark Performance Metrics

o What are the processes or products required to be assessed to satisfy the goals of the benchmark and how will they be measured?

4. Standards for Measures

o What are the agreed units of measurement, data accuracy and validation requirements?

5. Scope of the Benchmark

o What are the inclusion and exclusion criteria for projects and applications?

6. Frequency of Benchmark

o When and how often should measures be collected and reported?

7. Benchmark Peers

o What are the criteria by which equivalent sample data will be selected for comparison?

8. Benchmarking Report

o Who will be the audience, and what will be the report’s structure, content and level of detail provided, to support the results?

9. Dispute Resolution Process

o What is the process that will be followed should disagreement arise about the validity of the benchmarking results?

1. Strategic Intent of the Benchmark

Sponsors of the benchmark need to work with IT Management to establish:

• The objectives of the benchmarking activity, i.e. what the results are required to demonstrate, within what period and for what purpose, and the criteria by which the benchmark will be judged to be successful. Common reasons for benchmarking include monitoring:

o Process improvement initiatives

o Outsourcing contract performance against targets

o Consistency in performance across organisational units

o Benefits achieved from new investments or decisions compared to benefits claimed

o Performance compared to competitors or industry as a whole

• The stakeholders, i.e. who will be responsible for the benchmark’s design, data collection, analysis, review, approval, sign off and funding.

2. Type of Benchmark

Establish whether the organisation will benchmark:

• Internally to demonstrate improvement trends over time for the organisation’s internal processes, or

• Externally to compare internal results with external independent organisational units, or Industry as a whole.

Organizations that are aware of their own limitations will recognise their need to improve without first being compared externally to demonstrate how much improvement is required. As a first step, it is recommended that organizations start by internally benchmarking, and then when their own measurement and benchmarking processes are established, do some external benchmarking to establish their industry competitiveness. However, prior to determining standards for the collection, analysis and reporting of their benchmark metrics, they should first identify their proposed strategy for externally benchmarking. This enables their internal benchmarking framework to be aligned to that of the External Benchmark Data Set, thereby facilitating the next step of External Benchmarking without any rework to realign the data.

3. Benchmark Performance Metrics

Benchmarking AD/M should ideally monitor performance across all four perspectives identified in the Balanced Scorecard approach - Financial, Customer, Business Processes, Learning and Growth. Whilst this is the ideal approach, in our experience IT organizations focus their initial IT benchmarking activities on areas that directly impact their IT costs. They measure the cost effectiveness and quality of their IT processes and products by optimising the following Key Result Areas (KRAs):

Cost-effectiveness of the process - are they getting ‘value’ for money invested?
Efficiency of the process - how ‘productively’ is their software being developed or maintained?
Speed of Delivery - how ‘quickly’ can they deliver software product or ‘solve a problem’?
Quality of the product - how ‘good’ is the software product or service they deliver?
Quality of the process - how much time and money was wasted in ‘rework’?
Customer Satisfaction - how well does their delivery of software products and related services meet or exceed their customer’s expectations?

Benchmarking is not a ‘one size fits all’ activity. Many ‘Benchmarking Service Providers’ offer turn-key solutions that fail to take into account the individual needs of their clients. By clearly defining the strategic intent of the benchmark before engaging a Benchmark Provider, an organisation ensures that its organisational goals are met and that the solution being offered provides a good ‘fit’. Once this is decided, the organisation can focus on benchmarking Key Performance Indicators (KPIs) that demonstrate achievement of those goals. For example, for many telecommunications and financial sector companies, maintaining competitive advantage is the key to their success, so they need their IT department to constantly deliver new, innovative products to their market. In this case, ‘speed of delivery’ becomes their highest priority to optimise their competitive position. In comparison, recent budget cuts for Government Agencies may focus their improvement needs on maximizing their IT cost-effectiveness. Before starting a benchmarking activity, identify the key organisational goals and their corresponding KRAs, then one or two KPIs within each area that will demonstrate the achievement of the identified goals. When conducting an external benchmark, some compromise may need to be made in the selection of KPIs, as these must align to performance measures for which industry/peer data is available.

4. Standards for Measures

When comparing between projects, business units and/or organisations you need to ensure that the measurement units collected are equivalent. This is not merely a matter of stating that cost will be measured in US dollars, size will be measured in Function Points and effort will be measured in days. Whilst ‘cost’ of software projects is probably the most carefully collected project metric, and the most important for the organization to monitor, it is a very difficult unit of measure to benchmark over time. This is becoming increasingly the case in a world of off-shore multi-country development, where currency conversion rates fluctuate daily and salary rates rise with different rates of inflation across countries and time. Comparing dollars spent per function point this year to previous years requires multiple adjustments, and each adjustment has the potential to introduce errors. Most organizations therefore choose to measure cost effectiveness using effort input rather than cost input. Whilst it may seem straightforward to measure the Project Productivity Rate as the number of function points delivered per person per day, in order to really compare ‘apples to apples’, the benchmarking analysis needs to ensure that for each of the participating organisational units the following characteristics of the size and effort measures are consistent:

Type of Function Points recorded i.e. IFPUG, COSMIC or NESMA function points? Has the size reported been actually measured or is it an approximation derived by converting Lines of Source Code to function points? Which version of the Functional Size methodology (IFPUG 4.0 to 4.3?) has been used, and has all data in the sample set been measured using this same version?
Type of Day recorded i.e. not all organisations work the same number of hours in a day. If days were calculated by dividing time sheet hours by 8, then how was the number of hours collected? Did they include all the hours ‘worked’ including overtime (10 hours = 1.25 days), or only hours ‘paid’, thereby excluding 2 hours of unpaid overtime (8 hours = 1 day)? Did they collect hours from project codes on time sheets and include only productive working hours dedicated to the project, i.e. excluding breaks, non-project meetings, email etc. (6 hours = 0.75 days)?
Accuracy of the Measures i.e. did they accurately measure the function points and extract exact effort hours from time sheets or did they roughly estimate size using approximation techniques or multiply the team size by the months allocated to the projects to get hours and days?
Scope of the Effort Measures i.e. did they include all the effort of all the people that contributed to the project including the steering group, administration staff, business users, operational staff, or did they just include the time of the project manager, analysts, programmers and testing team?
Scope of the Size Measures i.e. when measuring functional size, did they measure all the software delivered to the Users, including package functionality delivered unchanged or did they just measure the functionality built and/or configured by the project team?
Scope of the Project Life Cycle Activities included in the effort data – did the project team work on the whole lifecycle from planning through to implementation or did the business area complete the planning and requirements before handing the project to the development team? Did the project effort figures include or exclude all the activities included in the project budget such as, the extensive research into project technology choices during the planning stage, the data loading activity for all the data files and the extensive worldwide training of thousands of end users?

Every organisation has different ways of measuring and recording their metrics. The resulting productivity rate may vary up to 10 fold depending on which of the various combinations of the above choices are made for measuring effort and size. To avoid basing decisions on invalid comparisons, agreed standards need to be established at the beginning of the benchmarking activity for each of the measures supporting the selected KPIs for each contributing organizational unit. Each measure needs to be clearly defined and communicated to all participants involved in the collection, recording and analysis of the data. If some data is inconsistent with the standards then it should either be excluded from the benchmark or be transformed to be consistent, with appropriate error margins noted and applied to the results.
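The effect of the ‘Type of Day’ choice alone can be illustrated with a small calculation. The following is a minimal sketch in Python, assuming a hypothetical project (400 function points, 500 person-days at work); the figures and names are invented for illustration and are not data from the article. Only the 10, 8 and 6 hour conventions and the divide-by-8 rule come from the discussion above.

```python
# Illustrative figures only: a hypothetical project, not data from the article.
FUNCTION_POINTS = 400          # functional size delivered
STAFF_DAYS_ON_CALENDAR = 500   # person-days the team was at work

# Hours booked per person per calendar day under the three recording
# conventions described in the 'Type of Day' item above.
HOURS_PER_DAY_BY_CONVENTION = {
    "worked (incl. overtime)": 10.0,
    "paid": 8.0,
    "productive project hours": 6.0,
}
STANDARD_DAY_HOURS = 8.0  # divisor used to turn time sheet hours into 'days'


def productivity_fp_per_day(function_points, calendar_days, hours_per_day):
    """Function points per reported person-day for one recording convention."""
    reported_days = calendar_days * hours_per_day / STANDARD_DAY_HOURS
    return function_points / reported_days


rates = {
    name: productivity_fp_per_day(FUNCTION_POINTS, STAFF_DAYS_ON_CALENDAR, hours)
    for name, hours in HOURS_PER_DAY_BY_CONVENTION.items()
}
for name, rate in rates.items():
    print(f"{name:28s} {rate:.3f} FP per person-day")
```

In this sketch the same project appears about 1.7 times more productive when only productive project hours are recorded than when all hours worked are recorded. Differences in the scope of effort, scope of size and life cycle coverage would compound on top of this.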

To simplify this process, and facilitate external industry benchmarking, it is recommended that organisations adopt the de facto data collection standards and definitions for measuring AD/M developed by the International Software Benchmarking Standards Group (ISBSG).

The ISBSG community recognized the need for formal standardisation of AD/M measurement and in 2004 developed the first working draft of a Benchmarking Standard, which became the basis for the new ISO/IEC framework of Benchmarking standards. The first part of a 5 part framework for Benchmarking Information Technology was approved in May 2011 to become an ISO International Standard (ISO/IEC 29155-1, Systems and software engineering -- Information technology project performance benchmarking framework -- Part 1: Concepts and definitions). Seventeen countries participated in the review of the interim drafts and the final approval vote for the standard. This international collaborative process ensures the result is robust and the outcome is accepted across the IT industry. The ISBSG is already a recognised industry leader in setting standards for data collection. A number of software metrics related tools vendors and Benchmarking Providers have adopted the ISBSG data collection and reporting standards and have integrated the ISBSG data set in their tools.

5. Scope of the Benchmark

Not all of the software implementation projects or software applications supported are suitable candidates for inclusion in the Benchmarking activity or can be grouped into a homogeneous set for comparison. All candidate projects and applications should be investigated and categorised on the following types of characteristics in order to make a decision about their acceptability into the benchmarking set, or if they need to be grouped and compared separately:

Different Delivery Options - different types of projects include different development activities as part of their delivery. For example, a package implementation with little customization has significantly reduced effort expended on design and coding activities. Care would need to be taken to determine if it is appropriate to include these types of projects in a benchmarking set of bespoke software projects.
Different Types of Requirements – whilst most projects require delivery of both non-functional and functional requirements, some focus primarily on enhancing the non-functional (technical and/or quality) characteristics of the software or fixing defects. Examples of technical projects are: a platform upgrade, reorganising the database structure to optimise performance, refactoring code to optimise flexibility or upgrading the look and feel of the user interface to enhance usability. Whilst these projects may consume large amounts of development team effort, they deliver few, if any, function points. It is therefore inappropriate to include technical or ‘defect fixing’ projects into productivity comparisons which include projects that primarily deliver user functionality.
Different Resourcing Profiles – projects that only include the planning, requirements specification and acceptance testing effort, with all other life cycle processes being outsourced, should not be grouped into a data set that includes projects where effort has been recorded for all phases of the project life cycle.
Different Technology Profiles – the ISBSG have identified several technology based attributes that significantly impact the rate of delivery of a project including the coding language, the development platform (environment) and the database technology. Be aware that it is difficult to establish and compare trends over time if there is wide variation in the mix of the technology profiles within a single project or of the projects in the benchmarking set.
Different Size Profiles – as a risk mitigation strategy, very large scale projects (>3000 fps) often require more formal administrative governance processes, more rigorous development processes, more complete project documentation and utilisation of speciality resources, compared to average sized projects (300 to 1500 fps), all of which add overhead effort to the project and negatively impact productivity. Interestingly, very small projects (<50 fps) that follow the same formal development process as larger projects also tend to show low productivity rates (up to 5 fold lower than medium sized projects), due to the disproportionate overhead of administration, management and documentation effort. These small projects (<50 fps) also show wide variations (up to 10 fold) in productivity and therefore should be excluded from Benchmarking data sets. When aggregating projects in the benchmarking set ensure that there is an even mix of project sizes, or group projects into benchmarking sets of comparable size bands.
Diverse and/or Large User Base – projects that have a very diverse set of business user stakeholders with significantly different functional and cultural requirements consume more effort to develop, maintain and support than projects with a single set of homogeneous users.
Different Functional Domains – the ISO/IEC framework standards for functional size measurement (ISO/IEC 14143 parts 1 to 6) recognise that software functionality can be classified into different functional domains, e.g. ‘process rich’ real time and process control software compared to ‘data rich’ information management software. Whilst the IFPUG method measures in all domains, the characteristics of the domain will influence the size result. For example, in data rich domains the stored data will contribute more significantly to the final result than in process rich or strongly algorithmic domains. The differing contribution of the data to the final size will impact the measured productivity and quality metrics. Care should be taken to ensure that benchmarking data sets comprise software from similar domains.
Different Project Classifications – different organisations have different definitions for what constitutes an IT project. Some define a ‘project’ as the implementation of a business initiative (e.g. implement a new Government Goods and Services Tax); others regard a Project as a Work Package implemented by a Project Team. A business initiative project may have requirements to modify many applications and will comprise multiple sub-projects, where a sub-project is equivalent to a Change Request or Work Package with discrete requirements for each application. The ‘project’ will have its own overhead activities required to manage and integrate all Work Packages. These overhead activities cannot be attributed to a particular Work Package, but to the project as a whole. The sub-projects will have their own effort, cost and size profiles. Often the sub-projects are implemented in different technologies since they impact different applications, further compounding issues of aggregating metrics and profiling the project. Similar issues arise with definitions of a project when treating a new Release of an Application as a ‘Project’ to be benchmarked. Typically the Release is made up of multiple Change Requests and each Change Request is implemented by its own ‘project’ team. Release overheads are incurred in a number of activities such as Release Management, Planning, System testing, Integration and Acceptance testing. These activities are usually not recorded at the Change Request level.

When benchmarking against industry ‘projects’ you need to establish whether you are comparing against a ‘Project/Release’ or a ‘Sub-Project/Work Package’, since the productivity rates of the Project/Release type ‘project’ will be decreased by the overhead effort and cost.
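The size of this effect can be shown with simple arithmetic. The sketch below is in Python and uses a hypothetical release of three Change Requests with invented sizes, efforts and overhead; none of the figures come from the article. It compares productivity calculated from Work Package effort alone with productivity calculated at the Release level, where the unallocated overhead is included.

```python
# Hypothetical release made up of three Change Requests (Work Packages).
# All figures are invented for illustration.
work_packages = [
    {"name": "CR-1", "function_points": 120, "effort_days": 100},
    {"name": "CR-2", "function_points": 80, "effort_days": 80},
    {"name": "CR-3", "function_points": 200, "effort_days": 170},
]

# Release-level overhead that is not recorded against any single Change
# Request (release management, planning, system/integration/acceptance testing).
release_overhead_days = 90

total_fp = sum(wp["function_points"] for wp in work_packages)
work_package_effort = sum(wp["effort_days"] for wp in work_packages)

# Productivity seen when only Work Package effort is counted.
work_package_rate = total_fp / work_package_effort

# Productivity seen when the Release is treated as the 'project'.
release_rate = total_fp / (work_package_effort + release_overhead_days)

print(f"Work Package level: {work_package_rate:.2f} FP per day")
print(f"Release level:      {release_rate:.2f} FP per day")
```

With these assumed figures the Release-level rate is roughly 20% lower than the Work Package rate for exactly the same delivered functionality, which is why the two types of ‘project’ should not be mixed in one comparison.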

It is recommended that prior to selecting the projects or applications to be benchmarked they are first grouped into like ‘projects’ and then classified using the above categories, to either ensure that each of the benchmarking sets consists of an even mix of all types, or if this is not able to be achieved, that they are grouped into ‘like’ categories for comparison exclusively within those categories.
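As one illustration of grouping into ‘like’ categories, the sketch below classifies projects into size bands before comparison. It is written in Python and is only a sketch under stated assumptions: the thresholds of 50, 300, 1500 and 3000 function points are taken from the Size Profiles discussion above, but the source does not name the ranges 50 to 300 and 1500 to 3000, so the labels ‘small’ and ‘large’ are assumptions, as are the example projects.

```python
from collections import defaultdict


def size_band(function_points):
    """Return a size band label for a project's functional size."""
    if function_points < 50:
        return "very small"   # recommended for exclusion in the text above
    if function_points < 300:
        return "small"        # assumed label; range not named in the source
    if function_points <= 1500:
        return "average"
    if function_points <= 3000:
        return "large"        # assumed label; range not named in the source
    return "very large"


def group_for_benchmark(projects):
    """Split (name, function_points) pairs into size bands and exclusions."""
    groups = defaultdict(list)
    excluded = []
    for name, function_points in projects:
        band = size_band(function_points)
        if band == "very small":
            excluded.append(name)
        else:
            groups[band].append(name)
    return dict(groups), excluded


# Hypothetical candidate projects: (name, size in function points).
projects = [("A", 40), ("B", 350), ("C", 1200), ("D", 3500), ("E", 180)]
groups, excluded = group_for_benchmark(projects)
print("Benchmark groups:", groups)
print("Excluded:", excluded)
```

The same pattern could be extended with further keys (delivery option, technology profile, functional domain) so that comparisons are made only within a combination of categories.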

6. Frequency of Benchmark

The frequency with which data is collected, analysed and reported will be determined by the goals of the Benchmarking activity. However, when determining how often these activities need to be done the following need to be considered:

Project Durations and Demonstrating Trends - if the benchmarking objective is to demonstrate the benefits of implementing new tools or technologies, it may take several cycles before these benefits become evident. The learning curve experienced when adopting new practices often shows a negative effect on productivity for anything up to 18 months after implementation. In addition, if project durations are over 1 to 2 years then it may take several years to demonstrate any benefits. In this case it may be best to baseline the metrics, then benchmark again after two years in order to observe a result. Benchmarking trends in a KPI assumes that ‘everything else’ stays the same and any improvements observed are due to the changes implemented, or any failure to see improvement is due to failure of the change to be effective. Unfortunately the IT world does not ‘stand still’ while you benchmark. IT technology, tools and techniques tend to be in a continual state of evolution. Over successive benchmarking periods, external forces of change will be introduced and will have an impact. The challenge to the benchmarker is to capture these variables and identify their influence on the results. It is therefore imperative that the benchmarker is fully apprised of all the “soft” factors that are likely to impact the “hard” benchmark metrics.
Allocating Projects to Benchmarking Periods - projects with long durations may span several benchmarking periods. Some benchmarkers implement ‘macro’ benchmarking whereby they collect all the effort and costs consumed for a 12 month period from the financial and time sheeting systems and then divide by the function points delivered in that period. Issues arise when projects span several periods so their inputs (effort and cost) are included in all the periods but their outputs (function points delivered) are only included in the final period. This phenomenon skews the productivity to be very low for initial periods and very high for the last period. A work-around can be achieved by proportioning the function points across the periods based on an ‘earned-value’ type approach.
Usefulness of the Result – if the benchmark periods are set too widely apart, by the time the data is analysed and reported the usefulness of the information may have diminished, as often the course of time has changed the relevance of the results to current practices. The late delivery of results may identify an issue that, for maximum effectiveness, should have been identified and addressed at the point it occurred. For example, in the referenced case study the organisation only reported their benchmark results annually. By the time they identified that their new strategy, to implement small projects in response to stakeholder demands, was costing them 5 times as much as aggregating requirements into larger projects, it had already cost them millions of dollars. When benchmarking is used for process improvement and there are long delays in reporting, it is difficult to do a root cause analysis on why a project is an exception, if the project team has since disbanded and the history is lost. However, if benchmarks are reported at intervals that are too short, normal deviations from the median, or ‘noise’ in the results, can be incorrectly interpreted as a trend and responded to inappropriately. Select a benchmarking period that is aligned with the organization’s decision making processes, so the recommendations in the benchmarking report can be actioned promptly. For example, results should be reported prior to decisions on budget allocations, or timed to be presented before steering group strategy meetings.
Statistical Validity of the Result – before deciding on a benchmark period you need to assess how many projects will be implemented in that period that satisfy the inclusion criteria for the benchmark, and you need to have collected sufficient data to support the benchmark. In order for the result to be statistically significant you need a valid sample size and a valid methodology for selecting the sample. The rule of thumb is to sample at least 10% of the total instances, with a sample set of no fewer than 30. Ideally the margin of error for the result is less than 10%, with a confidence level of 95%. However, this is difficult to achieve if you are benchmarking retrospectively and you need to rely on data that has been collected prior to the benchmarking Terms of Reference being established. Prior to starting the benchmarking activity the stakeholders should agree on what is an acceptable margin of error and desired confidence level in the result. This is important since large outsourcing contracts are known to impose year on year performance improvement targets for suppliers of 10%. If the sample set is small and the margin of error is greater than 10% then the benchmarking activity will not be sensitive enough to demonstrate any productivity gains achieved.
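The sample size rule of thumb and the margin of error check can be sketched as follows. This is a minimal Python illustration, not a prescribed method: it assumes the KPI is a mean productivity rate, uses the normal approximation (multiplier 1.96) for a 95% confidence level, and runs on invented sample values. The function names are hypothetical.

```python
import math
import statistics

Z_95 = 1.96  # normal-approximation multiplier for a 95% confidence level


def meets_sample_rule_of_thumb(sample_size, population_size):
    """At least 10% of the total instances and no fewer than 30 in the sample."""
    return sample_size >= 30 and sample_size >= 0.10 * population_size


def relative_margin_of_error(values, z=Z_95):
    """Margin of error of the sample mean, as a fraction of that mean."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return z * standard_error / mean


# Invented productivity rates for 30 projects with a wide spread.
sample_rates = [5, 10, 15] * 10

margin = relative_margin_of_error(sample_rates)
print(f"Sample passes rule of thumb: {meets_sample_rule_of_thumb(30, 200)}")
print(f"Relative margin of error: {margin:.1%}")
print(f"Sensitive enough for a 10% target: {margin < 0.10}")
```

With this invented sample the margin of error is roughly 15%, so even a sample that satisfies the rule of thumb would not be sensitive enough to confirm a 10% year on year improvement when project results vary this widely.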

7. Benchmark Peers

Previous discussions have highlighted the factors to categorise individual projects and applications to ensure that sample sets of data for internal benchmarking are comparable. However, when an external benchmarking data set is derived from industry, or selected from one or more external organizations, then additional factors need to be considered.

Organisational Type of the Benchmarking Partner needs to be comparable, e.g. care should be taken comparing the results from a large IT development shop with those from a small boutique developer, as they will have significantly different development environments. Large government and banking and financial institutions stand out as having productivity rates that are generally lower than other types of organisations. These organisations typically have projects that impact very large, monolithic, multi-layered legacy systems. The productivity of their enhancement projects is negatively impacted by their applications’ inherently complex internal structure; multiple interfaces; out of date systems documentation; and inaccessibility to developers who are familiar with all of the underlying functionality. Compounding the technical issues, any major project decision is required to be approved by multiple levels of bureaucracy, adding further delays and consuming additional effort and costs.
Different User Priorities – the end use of the software (e.g. military, medical, financial etc.) may dictate the rigour applied to the software development process. A requirement for high quality bug-free software will focus development activities and priorities on prevention of defect injection and maximum defect clearance rates rather than project cost effectiveness and efficiency. Different end user priorities need to be considered when selecting appropriate benchmarking partners.
Quality of the External Dataset – due to the reticence of organizations to make their performance data publicly accessible, the most common way for organisations to externally benchmark against industry data is either by engaging an external benchmark provider organisation that has its own data repository, or by purchasing industry data from the ISBSG. Benchmarking clients need to fully investigate the provenance of the dataset they are going to be compared against (based on the criteria outlined in the Terms of Reference above) prior to deciding on their approach. The ISBSG’s Dataset has the advantage of being a very cost effective solution and an ‘open repository’, i.e. the ISBSG provides detailed demographics of its industry sourced benchmark data and its data includes all attributes of the projects, while maintaining the anonymity of submitters. Most benchmark provider organizations have a more ‘black box’ approach and only disclose the aggregated summarised results of their benchmark dataset, making it more difficult for a client to independently assess the relevance and validity of comparing it with their own data. The ISBSG data also discloses the age of its data, which is important in a fast changing IT environment. Over 70% of the Maintenance and Support data and over 30% of the Development and Enhancement data is less than 4 years old. It is also very widely representative in that it is voluntarily submitted by IT organisations from over 20 countries. Each project set is independently validated by ISBSG for its integrity and assigned a quality rating, so the user can decide on whether to include or exclude a particular project or application from the benchmark set. However, the client organisation should also be realistic in their expectations of the external benchmark dataset. The normal process of submission, validation and analysis of external benchmark data means that the data can be up to 18 months old before it is formally “published” as part of the benchmark set. A client who is undertaking leading edge developments may have difficulty finding comparable data sets.
Filtering of Submission Data in an External Dataset – if a dataset has been contributed to voluntarily then the submitters typically select their ‘best’ projects for inclusion. The resultant mean KPIs derived from the data set tend to represent the ‘best in class’ rather than industry norms. In our experience with the ISBSG data the industry norm is closer to the 25th percentile of performance than the mean. In contrast, benchmarking datasets that have been derived from ad hoc sampling methods have median and mean values that align more closely to the median and mean values found in industry. The submission profile of the industry dataset needs to be known and understood when comparing and reporting the data and determining where an organisation is positioned compared to industry.
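To show why the reference point matters, the sketch below positions an organisation’s productivity rate against the quartiles of an external dataset rather than against its mean alone. It is a minimal Python illustration using invented rates where higher means better; the function name and data are hypothetical, and the choice of the lower quartile as a stand-in for the ‘industry norm’ of a voluntarily submitted dataset follows the observation above rather than any published rule.

```python
import statistics


def position_against_dataset(own_rate, dataset_rates):
    """Compare one productivity rate with the quartiles of a dataset.

    Assumes higher rates mean better performance.
    """
    lower_quartile, median, upper_quartile = statistics.quantiles(dataset_rates, n=4)
    share_at_or_below = sum(rate <= own_rate for rate in dataset_rates) / len(dataset_rates)
    return {
        "lower_quartile": lower_quartile,
        "median": median,
        "upper_quartile": upper_quartile,
        "mean": statistics.mean(dataset_rates),
        "share_of_dataset_at_or_below_own_rate": share_at_or_below,
        "at_or_above_lower_quartile": own_rate >= lower_quartile,
        "at_or_above_mean": own_rate >= statistics.mean(dataset_rates),
    }


# Invented external dataset of productivity rates (higher is better).
external_rates = list(range(1, 12))  # 1, 2, ..., 11
result = position_against_dataset(4, external_rates)
for key, value in result.items():
    print(f"{key}: {value}")
```

In this invented case the organisation falls below the dataset mean but above its lower quartile, so it would look like an under-performer against a ‘best in class’ mean while sitting above what may be closer to the industry norm.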

8. Benchmarking Report

Prior to commencing the benchmarking process it is recommended that the sponsors and key stakeholders agree on how the information will be reported. They need to decide on the report’s:

Structure and content – i.e. table of contents and the format of the results.
Level of granularity – i.e. will the data reported be aggregated by project, application or organizational unit?
Presentation technology – i.e. will the data be embedded as graphs in a document or provided online via a business analytics portal allowing interactive drill down capability?
Confidentiality and Audience – who will have access to the report results and how will it be distributed?
Review Process and Acceptance criteria – i.e. who will establish the reasonableness of the data prior to draft publication; who is responsible for approving the final report and actioning its recommendations?
Feedback process – i.e. what is the process for continual improvement of the benchmarking activity?

9. Dispute Resolution Process

If the Terms of Reference are established prior to the benchmarking activity and agreed by all parties, then any areas of contention should be resolved prior to the results being published. However, as mentioned earlier, in some circumstances there are significant financial risks for an organization that believes that it has been unfairly compared. It is recommended that if benchmarking is incorporated into contractual performance requirements then a formal dispute resolution process also be included as part of the contract.

Summary

Whilst the above warnings appear to indicate that comparative benchmarking is difficult to achieve, in our experience this is not the case. In reality, it is surprising how well the results of pooling data into a benchmarking set align with results from external data sets from a similar environment. In our experience the rules of thumb derived from industry data are able to accurately predict the scale of effort or the cost of a project, indicating that the measures from one data set can be used to predict the results for another.

However, as consultants who have worked for over 20 years in the benchmarking industry we are constantly confronted with contracts that require performance targets based on a single number to be derived from a large heterogeneous data set. Such benchmarks are unlikely to deliver useful results and client expectations need to be managed from the outset. The Terms of Reference described above are provided as guidance for consideration when embarking on a benchmarking activity. Only some variables will apply to your unique situation. If they do apply, consider their impact and choose to accommodate or ignore them from an informed position; fail to consider them at your own risk.

Bibliography

Morris, P. January 2004. Levels of Function Point Counting.

/function-point-resources/downloads/Levels-of-Function-Point-Counting.pdf

Select M&S DCQ (Microsoft Word doc)

Morris, P. 2010. Cost of Speed. IFPUG Metrics Views. July 2010. Vol. 4. Issue 1: 14-18.

International Software Benchmarking Standards Group (ISBSG)