Volume 67, Issue 2 p. 158-164
Original Article

Development of workplace-based assessments of non-technical skills in anaesthesia*

G. V. Crossingham

Specialist Registrar in Anaesthesia

P. J. A. Sice

Consultant Anaesthetist, Department of Anaesthesia, Plymouth Hospitals NHS Trust, Plymouth, UK

M. J. Roberts

Research Fellow

W. H. Lam

Consultant Anaesthetist, Department of Anaesthesia, Plymouth Hospitals NHS Trust, Plymouth, UK

T. C. E. Gale

Consultant Anaesthetist, Department of Anaesthesia, Plymouth Hospitals NHS Trust, Plymouth, UK

Honorary Associate Professor, Institute of Clinical Education, Peninsula College of Medicine and Dentistry, Plymouth, UK

First published: 17 January 2012
Correspondence to: Dr G. Crossingham

You can respond to this article at http://www.anaesthesiacorrespondence.com

Presented in part at the Annual Meeting of the Society for Education in Anaesthesia, Exeter, UK, March 2011


Non-technical skills are recognised as crucial to good anaesthetic practice. We designed and evaluated a specialty-specific tool to assess non-technical aspects of trainee performance in theatre, based on a system previously found reliable in a recruitment setting. We compared inter-rater agreement (multi-rater kappa) for live assessments in theatre with that in a selection centre and a video-based rater training exercise. Twenty-seven trainees participated in the first in-theatre assessment round and 40 in the second. Round-1 scores had poor inter-rater agreement (mean kappa = 0.20) and low reliability (generalisability coefficient G = 0.50). A subsequent assessor training exercise showed good inter-rater agreement (mean kappa = 0.79), but did not improve performance of the assessment tool when used in round 2 (mean kappa = 0.14, G = 0.42). Inter-rater agreement in two selection centres (mean kappa = 0.61 and 0.69) exceeded that found in theatre. Assessment tools that perform reliably in controlled settings may not do so in the workplace.

The Tooke report on Modernising Medical Careers [1] highlighted the need for specialty training to focus on the acquisition of excellence, rather than competence alone. Recent editorials have attempted to define how excellence in professionalism and other domains manifests in the workplace and have highlighted the importance of non-technical skills [2, 3]. Workplace-based assessments are an invaluable tool for assessing professional practice in a comprehensive and valid way; however, only the mini Clinical Evaluation Exercise (mini-CEX) and multi-source feedback amongst the assessment tools currently used in the UK attempt to assess non-technical skills. In addition, these tools focus on the achievement of basic clinical competence and employ methods with questionable accuracy, reliability and validity [4]. The mini-CEX has been shown to have wide inter-rater variability that results in poor discrimination between anaesthetic trainees [5], a problem exacerbated by the lack of performance benchmarking and behavioural descriptors on the marking sheet. Variable scoring leniency and the face-to-face nature of the assessment may also contribute to inaccurate scores [5, 6]. Studies have established the value of multi-source feedback in certain settings [7], although concerns have been raised about victimisation by multi-source feedback raters [8]. These current workplace-based assessment tools have been described as stressful, time-consuming, artificial and difficult to organise [8], and rely on immediate access to an electronic portfolio in many specialties. Large numbers of assessors are required for each trainee to achieve a reliable assessment, suggesting that the feasibility of their use in high-stakes assessment is low [9]. In addition, students are able to select individual cases, case difficulty and specific assessors, despite evidence that the relationship between the observer and student may adversely influence the validity of the assessment [6, 10, 11].

None of the tools described above focus on the assessment of anaesthetic non-technical skills, despite evidence that they are crucial to effective performance in the operating theatre and, hence, to patient safety [12, 13]. A recent Delphi analysis by anaesthetists in key educator roles has identified the importance of developing workplace assessment tools to recognise and promote excellence [14]. Tools for the assessment of non-technical skills in anaesthesia have shown low reliability despite assessor training, possibly due to misclassification of elements within scoring systems [15], but recent work has shown good reliability when non-technical skills are assessed in a selection centre setting [16]. Our main aim was to design and evaluate a specialty-specific workplace-based assessment tool that is able to differentiate between non-technical aspects of trainees’ performance in theatre.


Development of the workplace-based assessment tool was part of a larger study into specialty training recruitment methodologies in anaesthesia, which was approved by NHS Cornwall & Plymouth Research Ethics Committee. All trainees appointed to post during 2007 and 2008 consented to undertake additional assessments to those normally encountered in the course of training and were assured that the outcomes would not influence their progress. Completion of the assessment score sheets was taken to indicate consent on the part of the assessors.

The main driver for development of the workplace-based assessment tool was the need to explore relationships between performance at selection and subsequent performance during training. We chose to assess the same five non-technical skills on which applicants were scored at selection. These were: communication; organisation and planning; situational awareness and decision-making; team working; and working under pressure. In 2008, the scope of the assessments was widened with the addition of a sixth attribute: empathy and sensitivity. The selection of non-technical skills for assessment was based on thorough job analyses undertaken by anaesthetists to define desirable attributes in anaesthetic practitioners [17, 18]. A further consideration in devising the assessment tool was the need to standardise it across trainees while keeping it appropriate to the different levels of training. This was achieved by assessing trainees once per year on the same non-technical skills while undertaking a ‘key index’ case, specific to trainee grade (Table 1). Appropriate and feasible key index cases were identified through discussions between anaesthetists involved in the local training programme.

Table 1. Key index cases for each trainee grade.
Grade   Key index case
ST1     Performance of a rapid sequence induction on an ASA-1 or -2 patient
ST2     Anaesthetise a patient for a fractured neck of femur
ST3     Deliver anaesthesia for an elective caesarean section
ST4     Anaesthetise a patient for an elective joint replacement

The scoring system employed at the selection centre, which had shown good reliability in that setting [16], was deemed suitable without modification for scoring trainee performance in theatre. This comprised a four-point standardised anchored rating scale (1 = poor; 2 = borderline; 3 = satisfactory; 4 = good/excellent) for each of the non-technical skills (five in 2007–08, six in 2008–09) and a global rating score of overall candidate performance. For each skill, the rating scale was linked to a set of behavioural indicators matched to the four levels of performance (Table 2). Two assessors rated trainees independently: the consultant anaesthetist in charge of the case and the operating department practitioner (ODP). The overall in-training assessment score was a simple unweighted sum of the skill and global rating scores from both assessors, resulting in a maximum possible score of 2 × 6 × 4 = 48 in the 2007–08 training year, and 2 × 7 × 4 = 56 in 2008–09 following inclusion of the score for empathy and sensitivity.
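The arithmetic behind the overall score can be sketched in a few lines of Python. This is an illustration only: the function names and the list-based data layout are hypothetical and do not come from the study, which describes the calculation in words.

```python
# Illustrative sketch; names and data layout are assumptions, not the study's own code.
SCALE_MIN, SCALE_MAX = 1, 4  # four-point anchored rating scale


def overall_score(consultant_scores, odp_scores):
    """Unweighted sum of every item score (skills plus global rating) from both assessors."""
    if len(consultant_scores) != len(odp_scores):
        raise ValueError("both assessors must score the same items")
    all_scores = list(consultant_scores) + list(odp_scores)
    for score in all_scores:
        if not SCALE_MIN <= score <= SCALE_MAX:
            raise ValueError(f"score {score} is outside the 1-4 scale")
    return sum(all_scores)


def max_score(n_items, n_assessors=2):
    """Highest total available: assessors x items x top of the scale."""
    return n_assessors * n_items * SCALE_MAX


# 2007-08: five skills + global rating = 6 items; 2008-09: six skills + global = 7 items
print(max_score(6))  # 48
print(max_score(7))  # 56

# A made-up trainee scored by a consultant and an ODP on six items
print(overall_score([4, 4, 3, 4, 4, 4], [4, 3, 4, 4, 4, 4]))  # 46
```

Because the sum is unweighted, the global rating counts exactly as much as any single skill, and the consultant and ODP contribute equally to the total.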

Table 2. Descriptors* used in the scoring system for personal attributes observed during in-theatre assessments. Rating scale: 1 – poor; 2 – borderline; 3 – satisfactory; 4 – good/excellent.

Communication (includes verbal and non-verbal communication)
  1 – poor: Poor verbal communication with patient. Little or no explanation of actions during assessment; Failed to engage verbally with colleagues
  2 – borderline: Some communication and explanations to patient – could be improved; Erratic communication with theatre team
  3 – satisfactory: Communicated well with patient most of the time; Predominantly good engagement verbally with colleagues
  4 – good/excellent: Excellent verbal communication and explanations with patient throughout; Excellent verbal communication maintained with team throughout

Empathy and sensitivity
  1 – poor: Created an uncomfortable atmosphere; Showed very little interest or understanding
  2 – borderline: Attempted to reduce the tension or anxiety of patient; Showed some interest or understanding of the patient
  3 – satisfactory: Was able to reduce the tension or anxiety of patient; Responded to concerns with interest and understanding most of the time
  4 – good/excellent: Generated safe and trusting environment; Responded to concerns with interest and understanding all the time

Organisation and planning
  1 – poor: Unable to plan structurally for approaching issues; Unable to prioritise tasks presented
  2 – borderline: Attempted planning, but unable to see the global picture; Attempted prioritisation with mixed results
  3 – satisfactory: Able to formulate a reasonable plan for approaching issues; Able to prioritise tasks appropriately
  4 – good/excellent: Clear methodical plan for approaching issues; Prioritised clearly and efficiently based on needs and urgency

Situation awareness and decision-making
  1 – poor: Failed to pick up on signs that indicated a change in action; Shifted focus to immediate worries or needs
  2 – borderline: Picked up on marked changes in circumstances; Little appreciation of wider requirements of situation
  3 – satisfactory: Picked up subtle changes in clinical condition; Appreciated the wider requirements of the situation
  4 – good/excellent: Alert to symptoms and signs that may destabilise patient; Appreciated the wider requirements of the situation and sought team input

Team working
  1 – poor: Failed to delegate and demonstrate leadership; Confrontational and critical of other team members
  2 – borderline: Attempted to delegate; Evidence of conflict with team members
  3 – satisfactory: Delegated effectively; Non-confrontational approach
  4 – good/excellent: Demonstrated leadership and authority with justification; Participative non-confrontational approach

Working under pressure
  1 – poor: Tense and agitated most of the time; Disregarded/ignored others’ opinions/questions
  2 – borderline: Became tense or agitated under pressure; Uneasy responding to others’ opinions/questions
  3 – satisfactory: Remained calm most of the time; Listened and responded to others’ opinions/questions
  4 – good/excellent: Seemed relaxed and comfortable with demands of situation; Used effective strategies to deal with impact of others’ opinions/questions

*Not all descriptors are included in this table.

Assessment packs were sent out to trainees three months following appointment and annually thereafter. These comprised two sealed envelopes, one for the consultant and one for the ODP assessing the case, each containing scoring instructions, score sheets and a return envelope. Once on a list with a suitable case, the trainee would hand the assessor envelopes to the consultant and ODP before the case started, asking them independently to complete the score sheets immediately after the case and mail them directly to the study team. Trainees were blind to the details of the rating scales and, to reduce possible leniency caused by the face-to-face aspect of the assessment, to the scores awarded. The assessors were, however, asked to give informal feedback to the trainee on their performance during the case.

Following poor inter-rater agreement results in the first round of in-training assessments, 45-min training sessions for consultants and ODPs were conducted as part of local departmental education meetings at each of the six hospitals involved in the study. The session consisted of instruction in the use of the seven in-theatre assessment scoring sheets, followed by a benchmarking exercise in which participants were asked independently to score videos of trainee performance in a simulated operating theatre setting. Three videos of approximately 5-min duration were shown demonstrating poor, medium and good non-technical skills during performance of a rapid sequence induction. The training sessions were conducted between the end of the first round of the in-theatre assessments and the start of the second round.

Dot plots of the distributions of the overall in-training assessment scores were constructed. Inter-rater agreement between both the paired raters in the in-training assessments and between all raters in the benchmarking exercise was measured by quadratically weighted multi-rater kappa [19], an extension of Cohen’s kappa [20] for two or more raters. Kappa values < 0.40 represent poor agreement, values between 0.40 and 0.75 represent fair to good agreement, and values > 0.75 represent excellent agreement. Quadratic weighting allows credit for partial agreement when the scale used is an ordinal one [19]. Generalisability (G) coefficients [21, 22] were calculated to assess the overall reliability of the in-training assessment using an (r:p) × i design (raters within persons crossed with items), where the ‘items’ are the non-technical skill and global rating scores. The G coefficients represent the proportion of total score variance that is due to genuine differences in trainee performance. The formula used was G = Vp/[Vp + Vr/Nr + Vpi/Ni + Ve/(Ni × Nr)], where Vp, Vr, Vpi and Ve are the variances due to persons, raters, person-item interaction and error, and Nr and Ni are the numbers of raters (2) and items (six or seven), respectively. Values > 0.80 for high-stakes assessments and > 0.70 for formative or moderate-stakes assessments are considered desirable [22]. Minitab 15 (Minitab Inc., State College, PA, USA) was used to construct the dot plots, SPSS 17 (SPSS Inc., Chicago, IL, USA) was used to calculate variance components for the G-study and Excel 2007 (Microsoft Corporation, Redmond, WA, USA) was used to calculate kappas and G coefficients.
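To make the idea of partial credit concrete, the following Python sketch computes a quadratically weighted kappa for the simplest case of two raters (Cohen's weighted kappa). It is not the multi-rater extension [19] used in the study, and the function name and example ratings are invented for illustration. Disagreement between categories i and j is weighted by (i − j)² / (k − 1)², so a one-point difference on the four-point scale costs far less than a three-point difference.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4)):
    """Two-rater Cohen's kappa with quadratic disagreement weights (illustrative sketch)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint counts and each rater's marginal counts
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # 0 on the diagonal, 1 at the extremes
            obs_disagreement += weight * observed[(i, j)] / n
            exp_disagreement += weight * (marg_a[i] / n) * (marg_b[j] / n)

    if exp_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - obs_disagreement / exp_disagreement


# Made-up ratings: the second rater differs from the first on one trainee only
first_rater = [1, 2, 3, 4]
near_miss = [1, 2, 3, 3]  # last trainee scored one point lower
far_miss = [1, 2, 3, 1]   # last trainee scored three points lower

print(round(quadratic_weighted_kappa(first_rater, near_miss), 3))  # 0.875
print(round(quadratic_weighted_kappa(first_rater, far_miss), 3))   # 0.1
```

Both comparisons contain a single disagreement, yet the near miss leaves kappa in the 'excellent' band while the far miss drops it to 'poor', which is the behaviour quadratic weighting is intended to produce on an ordinal scale.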


All 68 appointed trainees consented to undertake the in-theatre assessments, but not all did so. The response rate was 73% (27/37) in the 2007–08 training year and 59% (40/68) in 2008–09. Eighteen trainees were assessed in both training years. Twenty-five and 32 consultants and 24 and 38 ODPs acted as assessors in 2007–08 and 2008–09, respectively. Very few assessed more than one trainee per year.

Scores awarded in the in-theatre assessment showed negative skewness and a strong ceiling effect (Fig. 1). Across both years the percentages awarded in each category were: poor 0%; borderline 0.7%; satisfactory 25.5%; and good/excellent 73.8%. Trainees achieved high scores in both in-theatre assessments; 14/27 scored 46 or more (out of the maximum 48) in the first round and 22/40 scored 54 or more (out of the maximum 56) in the second round. The additional item score for empathy and sensitivity introduced in 2008–09 had little discriminatory effect: the range of scores increased by just one point from 12 to 13 points in the second year. As scores were awarded relative to the level of training, they were not necessarily expected to increase in the second year.

Figure 1. Frequency distribution of in-theatre assessment scores, by year. Each symbol is the result for one trainee.

Inter-rater agreement for the first in-theatre assessment was poor, but was much better in the video training exercise in which 96 anaesthetists and 17 ODPs participated. Despite this, levels of agreement did not improve in the second round of assessments. The inter-rater agreement coefficients obtained in both rounds of assessments, the results of the training exercise and those previously found in our selection centres [16] are shown in Table 3. Of the assessments conducted in the second round, at least 56% were performed by consultants who had not attended a video training session; the corresponding proportion for ODPs was 59%. (The exact figures are unknown as some score sheets were returned anonymously.)

Table 3. Quadratically weighted multi-rater kappa inter-rater agreement coefficients by item and assessment setting.
Item                                         SC 2007   ITA 2007–08   Video 2008   SC 2008   ITA 2008–09
Communication                                0.52      0.29          0.78         0.69      0.25
Organisation                                 0.61      0.11          0.79         0.70      0.19
Working under pressure                       0.54      0.41          0.83         0.67      0.05
Situational awareness and decision-making    0.58      0.06          0.75         0.73      0.18
Team working                                 0.65      0.19          0.77         0.68      0.25
Empathy and sensitivity*                     –         –             0.76         0.58      -0.19
Global rating                                0.77      0.11          0.84         0.79      0.28
Mean                                         0.61      0.20          0.79         0.69      0.14
*Not assessed in the 2007 selection centre or in the first in-theatre assessment. SC, selection centre; ITA, in-theatre assessment; Video, video training exercise.

The poor inter-rater agreement found in the assessment is reflected in the results of the generalisability analysis. In both years, the proportion of variance in item scores attributable to the trainees was < 20% whereas that attributable to raters was > 30% (Table 4). These high rater variances contribute to low reliability in the total in-theatre assessment score: the G coefficients were 0.50 in 2007–08 and 0.42 in 2008–09, indicating that no more than half of the variance in total scores was due to genuine differences in trainee performance.

Table 4. G-study variance components by source and training year.
                        ITA 2007–08                                ITA 2008–09
Source of variability   Variance component   % of total variance   Variance component   % of total variance
Persons                 0.041                19                    0.035                16
Raters:persons          0.067                31                    0.083                38
Items*                  0.003                1                     0.002                1
Persons × items         0.000                0                     0.003                1
Error                   0.103                48                    0.096                44
*The items are the five or six non-technical skills and the global score. ITA, in-theatre assessment.
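As a rough check, the G coefficient formula given in the Methods can be applied to the variance components in Table 4. The Python sketch below is an illustration, not the authors' Excel calculation; the function name is hypothetical. It assumes that the 'Raters:persons' row supplies Vr and the 'Persons × items' row supplies Vpi.

```python
def g_coefficient(v_p, v_r, v_pi, v_e, n_r, n_i):
    """G = Vp / [Vp + Vr/Nr + Vpi/Ni + Ve/(Ni x Nr)], as stated in the Methods."""
    denominator = v_p + v_r / n_r + v_pi / n_i + v_e / (n_i * n_r)
    return v_p / denominator


# Variance components as printed in Table 4 (rounded to three decimals)
g_2007 = g_coefficient(v_p=0.041, v_r=0.067, v_pi=0.000, v_e=0.103, n_r=2, n_i=6)
g_2008 = g_coefficient(v_p=0.035, v_r=0.083, v_pi=0.003, v_e=0.096, n_r=2, n_i=7)

print(round(g_2007, 2))  # 0.49
print(round(g_2008, 2))  # 0.42
```

The 2008–09 value matches the reported 0.42. The 2007–08 value comes out at about 0.49 rather than the reported 0.50; a plausible explanation, though not one confirmed by the paper, is that the published coefficient was computed from unrounded variance components.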


The results of the non-technical skills assessments in our study showed lower reliability when implemented in the workplace compared with their use in selection and video benchmarking. This article raises numerous issues regarding the workplace assessment of anaesthesia trainees. We have attempted to create a specialty-specific, structured tool that can reliably assess non-technical skills in the theatre environment. Although our scoring tool demonstrated good inter-rater reliability for selection to specialty training, this reliability has not translated into the assessment of workplace performance. This may be due to differing perspectives between consultants and ODPs when judging performance, inadequate numbers of trainees completing their assessments, the ceiling effect in scores and lack of opportunity for the assessors to become familiar with the scoring tool. A one-off live assessment in theatre contrasts with that in a controlled selection centre environment, where all assessors undergo training and rate 20–30 candidates per day.

The benchmarking exercise demonstrated good agreement amongst consultants and ODPs following only 45 min of training. This is in contrast to a recent study into the effect of rater training on the reliability of the Anaesthetists’ Non-Technical Skills (ANTS) scoring tool in rating video-based performance [15], which found only poor to moderate inter-rater agreement after an 8-h training session. The higher levels of agreement in our training session may be due to using partially scripted videos of simulated anaesthesia and an assessment tool that scores just seven items, rather than the 15 contained in the ANTS system.

Despite the encouraging results of the benchmarking exercise, inter-rater agreement for the subsequent in-theatre assessments remained poor. Although others have faced similar problems in ensuring reliability in workplace assessments [6, 15, 23], our study encountered several difficulties that may be more widely applicable. First, despite well-publicised sessions embedded in hospital departmental meetings, only limited numbers of consultants and ODPs attended our benchmarking exercise. This highlights one of the problems of locally administered assessor training. Although generic UK standards for training the trainers have been published [24], implementation has been locally sporadic resulting in inconsistent standards of training. Second, inter-rater agreement coefficients decrease when the range of performance observed is narrower [25, 26]. The video clips used for benchmarking purposefully exhibited performances covering the full spectrum of behaviours identified on the scoring grids. Consequently, the scores awarded did not exhibit the ceiling effect found in the live in-theatre assessments. Similarly, the range of performance observed at selection is naturally wider than that found amongst appointed trainees and these differences may explain much of the variation in inter-rater agreement coefficients. Possible causes of the ceiling effect include the high calibre of the appointed trainees, heightened performance under scrutiny, lack of discrimination in the scoring method, assessor leniency and lack of challenge in the key index cases. The scoring tool has since been adjusted to use a five-point scale and early results suggest that this has enabled effective discrimination amongst trainees by significantly increasing the range of scores awarded.

One further explanation for the suboptimal inter-rater reliability in the workplace may be that ODPs and consultants are assessing different aspects of performance. The scoring tool design was based on complex job analysis of anaesthetists, by anaesthetists. In our workplace-based assessment, instead of two consultants assessing each trainee (as occurred at selection but which is not feasible in everyday practice), performance was assessed by the consultant and ODP assigned to the list. This raises some interesting issues. For example, ODPs may believe other behaviours that were omitted from our scoring system to be important or not value as highly the ones that we did include.

There are a number of strengths of this assessment method. The key index cases provided partially standardised assessments that were appropriate to the level of trainee experience. Although we could have ensured absolute standardisation of case complexity by using a simulator [13], our in-theatre assessment aimed to assess how trainees performed within the real clinical environment. We opted to use global rating scores of non-technical skills in preference to checklist scores, which are regarded as less appropriate for assessing the overall performance of anaesthetists [27]. The trainees were blinded to the in-theatre assessment scores to reduce bias introduced by the face-to-face nature of the assessment, a criticism of existing mini-CEX and DOPS assessments [5, 28]. Trainees were, however, given structured qualitative feedback to facilitate their professional development. In contrast to existing assessments, the trainees were scored on a whole anaesthetic encounter rather than a ‘snapshot’ or a technical skill, in theory giving a fuller picture of clinical performance. However, despite selecting key index cases that were relatively common, trainees still found it difficult to attend appropriate theatre lists due to competing pressures of service provision and training in specialist clinical modules. This problem, encountered in other workplace-based assessments, is exacerbated by the restrictions of the European Working Time Directive regulations. One solution would be to conduct specialist training lists where multiple trainees could be assessed.

Research that evaluates workplace-based assessments in anaesthesia is scarce, but the need for effective frameworks for structured observation of practice has been acknowledged [29, 30]. We have identified some important lessons regarding the implementation of workplace-based assessments. Successful tools for a standardised assessment are difficult to apply to the workplace, particularly with regard to trainee compliance and assessor training. These difficulties may be alleviated by making the process more accessible, for example by running specific assessment lists, making the scoring tool available electronically and developing on-line training materials. One cannot assume that scoring systems that work well in controlled environments will translate smoothly to the workplace setting. This is an ongoing challenge due to many of the practicalities and pressures described above; however, anaesthesia departments that wish to develop their trainees through participation in quality workplace assessments must invest time and effort in creating reliable and feasible tools.


We thank Dr Peter Davies, Regional Advisor, and Dr John Saddler, Head of School, in the South West School of Anaesthesia for their support and assistance with this project. The authors are also grateful to the trainees, consultants and operating department practitioners who undertook the workplace-based assessments and the College Tutors who helped co-ordinate the assessments. We also thank Drs David Adams, Ian Anderson, Alison Carr and Jeremy Langton from the Anaesthesia Recruitment Validation Group in the South West Peninsula Deanery, for their contributions to this work.

    Competing interests

    This work was part of a larger pilot project funded by the UK Department of Health evaluating recruitment methodologies. No competing interests declared.