9. Evaluation techniques


In groups or pairs, use the cognitive walkthrough example, and what you know about user psychology (see Chapter 1), to discuss the design of a computer application of your choice (for example, a word processor or a drawing package). (Hint: Focus your discussion on one or two specific tasks within the application.)


This exercise is intended to give you a feel for using the technique of cognitive walkthrough (CW). CW is described in detail in Chapter 9 and the same format can be used here. It is important to focus on a task that is not too trivial, for example creating a style in a word processing package. Also assume a user who is familiar with the notion of styles, and with applications on the same platform (e.g. Mac, PC or UNIX), but not with the particular word processing package. Attention should be given to instances where the interface fails to support the user in resolving the goal, and to those where it presents false avenues.



What are the benefits and problems of using video in experimentation? If you have access to a video recorder, attempt to transcribe a piece of action and conversation (it does not have to be an experiment - a soap opera will do!). What problems did you encounter?


The benefits of video include: an accurate, realistic representation of task performance, especially where more than one camera is used; and a permanent record of the observed behaviour.

The disadvantages include: the vast amounts of data, which are difficult to analyse effectively; the time needed for transcription; obtrusiveness; and the special equipment required.

By carrying out this exercise, you will experience some of the difficulties of representing a visual record in a semi-formal written format. If you are working in a group, discuss which parts of the video are most difficult to represent, and how important these parts are to understanding the clip.



In Section 9.4.2 (An example: evaluating icon designs), we saw that the observed results could be the result of interference. Can you think of alternative designs that may make this less likely? Remember that individual variation was very high, so you must retain a within-subjects design, but you may perform more tests on each participant.


Three possible ways of reducing interference are:

  • During the initial training period, swap back and forth between learning the two sets of icons, with the aim of getting the subjects used to swapping between the two sets of remembered icons. However, this design could be argued to suffer the same flaws as the original: if the abstract icons had been taught in isolation, perhaps they would have fared far better.
  • We could invent a third set of 'random' icons (call them R). We could then interpose them in the experiment, that is, present the icons in the orders RARN and RNRA. The intention is to swamp any transfer effect in the 'noise' of the random icons. It could be argued that our experiment then measures the robustness of the icon sets to such 'noise'!
  • We could give the subjects multiple presentations, for example ANAN and NANA presentation orders. This would not remove transfer effects, but it would give us some way to quantify them. Imagine that in the ANAN group the second presentation of the abstract icons was significantly worse than the first, but there was not a similar effect for natural icons in the NANA group. This would give us both positive evidence of a transfer effect, and perhaps some quantitative measure. However, even going from this additional evidence to a strong conclusion will be difficult.

Notice that all the above measures require additional subject time and one has to constantly weigh up the advantages of richer experiments against those of larger subject groups.
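The quantification step in the third suggestion can be sketched in code. This is a minimal illustration with entirely invented accuracy scores: for each group, compare the mean score on the first and second presentations of the same icon set. A large drop for the ANAN group but not the NANA group is exactly the positive evidence of a transfer effect described above.

```python
from statistics import mean

def transfer_effect(scores):
    """Mean drop in accuracy between the first and second presentation
    of the same icon set (positions 0 and 2 in each subject's record).
    A positive value suggests interference from the interposed set.

    `scores` maps each subject to four presentation scores in order,
    e.g. for the ANAN group the order is [A1, N1, A2, N2]."""
    first = mean(s[0] for s in scores.values())
    second = mean(s[2] for s in scores.values())
    return first - second

# Invented data: ANAN group (abstract icons at positions 0 and 2).
anan = {
    "s1": [0.80, 0.70, 0.60, 0.68],
    "s2": [0.75, 0.72, 0.55, 0.70],
}
# Invented data: NANA group (natural icons at positions 0 and 2).
nana = {
    "s3": [0.90, 0.60, 0.88, 0.58],
    "s4": [0.85, 0.65, 0.86, 0.55],
}

drop_abstract = transfer_effect(anan)  # large drop: evidence of transfer
drop_natural = transfer_effect(nana)   # negligible drop
```

Even with such a measure in hand, the caution above stands: the comparison quantifies the effect but does not by itself license a strong conclusion.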



Choose an appropriate evaluation method for each of the following situations. In each case identify:

(i) The participants.
(ii) The technique used.
(iii) Representative tasks to be examined.
(iv) Measurements that would be appropriate.
(v) An outline plan for carrying out the evaluation.

(a) You are at an early stage in the design of a spreadsheet package and you wish to test what type of icons will be easiest to learn.
(b) You have a prototype for a theatre booking system to be used by potential theatre-goers to reduce queues at the box office.
(c) You have designed and implemented a new game system and want to evaluate it before release.
(d) You have developed a group decision support system for a solicitor's office.
(e) You have been asked to develop a system to store and manage student exam results and would like to test two different designs prior to implementation or prototyping.


Note that these answers are illustrative; there are many possible evaluation techniques that could be appropriate to the scenarios described.

Spreadsheet package

(i) Subjects Typical users: secretaries, academics, students, accountants, home users, schoolchildren
(ii) Technique Heuristic evaluation
(iii) Representative tasks Sorting data, printing spreadsheet, formatting cells, adding functions, producing graphs
(iv) Measurements Speed of recognition, accuracy of recognition, user-perceived clarity
(v) Outline plan Test the subjects with examples of each icon in various styles, noting responses.
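As a sketch of how the raw observations from this outline plan might be reduced to the measurements in (iv), the following is illustrative only (the trial figures are invented): each trial is logged as (icon style, response time, correct) and then summarised per style.

```python
from statistics import mean

def summarise(trials):
    """Aggregate recognition trials into the measures listed in (iv):
    mean response time (speed of recognition) and proportion correct
    (accuracy of recognition), grouped by icon style."""
    by_style = {}
    for style, rt, correct in trials:
        by_style.setdefault(style, []).append((rt, correct))
    return {
        style: {
            "mean_rt": mean(rt for rt, _ in data),
            "accuracy": sum(c for _, c in data) / len(data),
        }
        for style, data in by_style.items()
    }

# Invented trial log: (icon style, response time in seconds, correct?)
trials = [
    ("abstract", 1.9, True), ("abstract", 2.3, False),
    ("concrete", 1.1, True), ("concrete", 1.3, True),
]
results = summarise(trials)
```

User-perceived clarity, the remaining measure, is subjective and would be collected separately, for example on a rating scale after the trials.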

Theatre booking system

(i) Subjects Theatre-goers, the general public
(ii) Technique Think aloud
(iii) Representative tasks Finding next available tickets for a show, selecting seats, changing seats, changing date of booking
(iv) Measurements Qualitative measures of users' comfort with system, measures of cognitive complexity, quantitative measures of time taken to perform task, errors made
(v) Outline plan Present users with prototype system and tasks, record their observations whilst carrying out the tasks and refine results into categories identified in (iv).

New game system

(i) Subjects The game's target audience: age, sex, typical profile should be determined for the game in advance and the test users should be selected from this population, plus a few from outside to see if it has wider appeal
(ii) Technique Think aloud
(iii) Representative tasks Whatever gameplay tasks there are - character movement, problem solving, etc.
(iv) Measurements Speed of response, scores achieved, extent of game mastered.
(v) Outline plan Allow subjects to play game and talk as they do so. Collect qualitative and quantitative evidence, follow up with questionnaire to assess satisfaction with gaming experience, etc.

Group decision support system

(i) Subjects Solicitors, legal assistants, possibly clients
(ii) Technique Cognitive walkthrough
(iii) Representative tasks Anything requiring shared decision making: compensation claims, plea bargaining, complex issues with a diverse range of expertise needed.
(iv) Measurements Accuracy of information presented and accessible, veracity of audit trail of discussion, screen clutter and confusion, confusion owing to turn-taking protocols
(v) Outline plan Evaluate by having experts walk through the system performing tasks, commenting as necessary.

Exam result management

(i) Subjects Exams officer, secretaries, academics
(ii) Technique Think aloud, questionnaires
(iii) Representative tasks Storing marks, altering marks, deleting marks, collating information, security protection
(iv) Measurements Ease of use, levels of security and error correction provided, accuracy of use
(v) Outline plan Users perform tasks set, with running verbal commentary on immediate thoughts and considered views gained by questionnaire at end.



9.4 Complete the cognitive walkthrough example for the video remote control design.


Continue to ask the four questions for each Action in the sequence. Work out what the user will do and how the system will respond. If you can analyse B and C, you will find that Actions D to I are similar.

Hint: Remember that there is no universal format for dates.

Action J: Think about the first question. Will the user even know they need to press the transmit button? Isn't it likely that the user will reach closure after Action I?
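The hint about date formats can be made concrete. This is a purely illustrative sketch of the ambiguity a walkthrough should catch: the same keystrokes yield two different dates depending on which convention the user assumes the device follows.

```python
from datetime import datetime

# The same entered string is a different date under the US (month/day)
# and UK (day/month) conventions.
entered = "05/06/04"

us_reading = datetime.strptime(entered, "%m/%d/%y")  # 6 May 2004
uk_reading = datetime.strptime(entered, "%d/%m/%y")  # 5 June 2004
```

If the interface gives no indication of the expected format, the walkthrough should flag the date-entry Action as one where the user cannot tell whether the correct action is available or whether their action had the intended effect.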



9.5 In defining an experimental study, describe
(a) how you as an experimenter would formulate the hypothesis to be supported or refuted by your study
(b) how you would decide between a within-groups or between-groups experimental design with your subjects

answer available for tutors only


(a) Formulating the hypothesis involves:

  • Determining the independent variables, that is, the variables that can be controlled by the experimenter. These determine the number of experimental conditions, based on the number of different levels of each independent variable to be tested.
  • Determining the dependent variables, that is, the phenomena that can be measured for subjects in the various experimental conditions.
  • Phrasing the hypothesis of the experiment in terms of an expected relationship between the independent and dependent variables.

(b) Deciding on the experimental design, in terms of within-groups or between-groups design, depends on the kinds of subjects you will use, how many resources are available for experimentation and the problems associated with learning effects. A within-groups design will require fewer subjects (and therefore be cheaper in terms of cost and time) but may exhibit bad learning effects if the experiment is not carefully designed. Students should demonstrate that they know the difference between within- and between-groups design. The former has each subject tested under all experimental conditions. Between-groups has each subject tested under only one condition.
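The structural difference between the two designs can be sketched in code. This is a minimal illustration (the subject and condition names are invented): a within-groups assignment gives every subject every condition, with presentation order rotated across subjects to counterbalance learning effects, while a between-groups assignment gives each subject exactly one condition.

```python
def within_groups(subjects, conditions):
    """Within-groups: every subject is tested under all conditions,
    with order rotated across subjects to counterbalance learning
    effects (a simple rotation, not a full Latin square)."""
    n = len(conditions)
    return {
        s: [conditions[(i + j) % n] for j in range(n)]
        for i, s in enumerate(subjects)
    }

def between_groups(subjects, conditions):
    """Between-groups: each subject is tested under one condition
    only, assigned round-robin here for simplicity (in practice the
    assignment would normally be randomized)."""
    return {s: conditions[i % len(conditions)]
            for i, s in enumerate(subjects)}

subjects = ["s1", "s2", "s3", "s4"]
conditions = ["colour", "monochrome"]

within = within_groups(subjects, conditions)
between = between_groups(subjects, conditions)
```

Note how the within-groups plan needs half as many subjects for the same number of observations per condition, at the cost of possible transfer between conditions.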



9.6 What are the factors governing the choice of an appropriate evaluation method for different interactive systems? Give brief details.

answer available for tutors only

Any of the following may be included:

The stage in the cycle at which the evaluation is carried out: Evaluation of a design seeks to provide information to feed the development of the physical artefact; it tends to involve design experts only and to be analytic. Evaluation of an implementation evaluates the artefact itself, is more likely to bring in users as subjects, and is experimental.

The style of evaluation: Laboratory studies allow controlled experimentation and observation but lose something of the naturalness of the user's environment. Field studies retain the latter but do not allow control over user activity.

The level of subjectivity or objectivity of the technique: More subjective techniques, e.g. cognitive walkthrough or think aloud, rely largely on the expertise of the evaluator, who must recognize problems and understand what the user is doing. Objective techniques, e.g. controlled experiments, should produce repeatable results that do not depend on the interpretation of the particular evaluator.

The type of measures provided: Quantitative measurement is usually numeric and can be easily analysed using statistical techniques. Qualitative measurement is non-numeric and therefore more difficult to analyse, but can provide important detail that cannot be determined from numbers.

The information provided: Techniques such as controlled experiments are good at providing low-level information, e.g. 'is a particular font readable?'. Higher-level information, e.g. 'is the system usable?', can be gathered using survey techniques, which provide a more general impression of the user's view of the system.

The immediacy of the response: Methods such as think aloud record the user's behaviour at the time of the interaction. Others, e.g. post-task walkthrough, rely on the user's recollection of events.

The level of interference implied: Techniques that are obvious to the user during the interaction run the risk of influencing the way the user behaves.

The resources required: Resources to consider include equipment, time, money, subjects, the expertise of the evaluator and context.


EXERCISE 9.8 [extra - not in book]

This extended exercise is designed to give you practice in writing, testing and administering a questionnaire to a true user population. Although it does not train you in the very fine points of questionnaire design, it does alert you to the basic problems in obtaining valid responses from people.

In addition to practice in valid questionnaire design and questionnaire administration, the exercise asks you to focus on finding information about a user interface to a new computer system, by studying an analogous system. Its intent is to help you develop probing skills (through good question design). These skills can then be used to find out what failures and successes users are having with a system and even the underlying causes of these successes and failures.

The 7 steps of this exercise are:

1. Selection of an analogous interface to study.
2. Preparation of a draft questionnaire (1-2 pages).
3. Piloting of the draft questionnaire.
4. Preparation of a final questionnaire.
5. Administration of the final questionnaire.
6. Analysis of the results.
7. Write up and presentation of the results of the survey.

These steps are given in more detail below. Read through them all before you begin.

1. Decide on a new user interface for which you will collect information from potential users.
One of the methods for collecting this information is to look at existing user interfaces that have things in common with the interface you are designing, i.e. the computer program accomplishes the same or similar tasks, or you believe that the task that the program supports is in many ways similar to the task you will be supporting with your interface design. For example, if you were building a design for an interface that helped users find out which books were available in a university library system, you might look at the existing library system interface for accomplishing this task. If you were choosing to design a computer interface for ordering tickets to plays and concerts automatically, you might study a computer interface for obtaining cash from an automated teller machine (cashpoint).

2. The type of information you are to obtain about the user interface through the careful design of your questionnaire is:

(a) How easy has the system been for them to learn?
(b) What are the particular parts of the system that they are having the most trouble with?
(c) What kinds of recommendations do they have for improving the system?
(d) How useful are the manuals for the system?
(e) How much time did they spend learning the system?

From your reading about questioning people and good questionnaire design, you should know that you cannot ask the above questions directly and obtain very good answers: (a) is too ambiguous; (b) is much too broad to get useful answers; (c) is too difficult for new users; (d) is again ambiguous; and the users may not have the information to answer (e). Also, since the amount of difficulty a person has with the system depends on that person's previous experience (whether they are a computer science major, whether they are highly motivated, whether they have a good friend who is helping them out a lot, and whether they are very intelligent), questions have to be asked about these factors as well.

Design a questionnaire to administer to the users of the system of your choice. Take as much care as you can in the choice of your questions, taking into account the issues discussed.

3. Administer this draft questionnaire to 2 users to find out if they understand the questions in the way you meant them. Give them the questionnaire to fill in, then ask them what their answers mean and what they thought your questions meant. This is called pilot testing the questionnaire.

4. Use the feedback you received from your 2 trial respondents to redesign your questionnaire. If the design changes radically, it is a good idea to test out your questionnaire again on 2 other people.

5. When you think your questionnaire has been tested enough and will work on the targeted set of users, you have to find users outside of computer science who fit the eligibility requirements for your survey (as many as you can; six is a suggested minimum, but note that this low number of respondents would not normally be used in a real-world study). Ask your chosen users to fill out one of your questionnaires.

6. Summarize the data collected from your questionnaires. The structured question answers are usually presented as percentages, e.g. 25 percent responded 'strongly disagree' to the question 'Should the system always have menus available?' Often the percentages are presented across demographic data, e.g. '30 percent of the women and 35 percent of the men would like to have fewer commands to learn.' A clear way to present this information is in tables.
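The kind of percentage summary described here is easy to mechanise. A minimal sketch with invented respondent data: cross-tabulate a structured answer against a demographic field and express each cell as a percentage of its group.

```python
from collections import Counter

def percentages(responses, group_field, answer_field):
    """Percentage of each answer within each demographic group,
    e.g. '50 percent of the women responded agree'."""
    totals = Counter(r[group_field] for r in responses)
    cells = Counter((r[group_field], r[answer_field]) for r in responses)
    return {
        (group, answer): 100 * count / totals[group]
        for (group, answer), count in cells.items()
    }

# Invented questionnaire data for illustration only.
responses = [
    {"sex": "F", "q3": "agree"},
    {"sex": "F", "q3": "disagree"},
    {"sex": "M", "q3": "agree"},
    {"sex": "M", "q3": "agree"},
]
table = percentages(responses, "sex", "q3")
```

Each `(group, answer)` entry of `table` is one cell of the kind of table suggested above for presenting the results.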

Use the data results of your questionnaire to consider changes that might be made to the user interface to make it easier for users to learn and use the system. These can be changes in manuals and training as well as detailed changes to the interface commands and the documentation.

Now translate these ideas into how you would design your new interface to your system to solve the problems highlighted by your survey. Obviously, if the problems are in areas where there are few parallels between the studied system and your own, the information is of much less use than if it is directly applicable - be careful.

7. Write up and present the results of the survey. This should draw out the users' problems with the current interface, with the final portion discussing how your interface design will avoid these problems. You should also include a discussion of the reasons for each question or set of questions in your questionnaire, with an explanation of any changes you made between the draft and the final versions.

answer available for tutors only

extended project


EXERCISE 9.9 [extra - not in book]

Which evaluation methods do you think are most appropriate for group systems? What particular issues do evaluating group systems raise?

answer available for tutors only

This is a question where there are many possible answers, but appropriate methods to discuss include ethnography and field-based longitudinal studies, for assessing how group systems are actually used, and possible extensions to CW and heuristic evaluation, for assessing elements of the interface. The issues to be considered include: the choice of method and how methods must be adapted to suit groups; variation within as well as between groups; the complexity of who benefits from the system; and conflicts of interest. Context is also important. Co-operation depends on the formation of groups, which cannot properly happen out of context. Evaluation must assess support for co-operation and must therefore consider context.

Individual exercises

ex.9.1 (ans), ex.9.2 (ans), ex.9.3 (ans), ex.9.4 (ans), ex.9.5 (ans), ex.9.6 (tut), ex.9.7 (tut), ex.9.8 (open), ex.9.9 (tut)

Worked exercises in book


Design an experiment to test whether adding colour coding to an interface will improve accuracy. [page 339]


You have been asked to compare user performance and preferences with two different learning systems, one using hypermedia (see Chapter 21), the other sequential lessons. Design a questionnaire to find out what the users think of the system. How would you go about comparing user performance with these two systems? [page 351]