Riccardo Poli
Brain Computer Interfaces Laboratory
School of Computer Science and Electronic Engineering
Caterina Cinel
Department of Psychology
University of Essex
Luca Citi
Brain Computer Interfaces Laboratory
School of Computer Science and Electronic Engineering
Francisco Sepulveda
Brain Computer Interfaces Laboratory
School of Computer Science and Electronic Engineering
CONTACT AUTHOR:
Prof Riccardo Poli
Email: rpoli@essex.ac.uk
Phone: +441206872338
Keywords: ERP averaging, ERP signal-to-noise ratio, high-resolution averages, reaction-time distributions, variable-latency ERPs, grand averages.
While the study of single-trial Event-Related Potentials (ERPs) has been considered of great importance since the early days of ERP analysis, in practice the presence of noise and artifacts has forced researchers to make use of averaging as part of their standard investigation methodology (Luck, 2005; Donchin and Lindsley, 1968; Cobb and Dawson, 1960; Handy, 2004).
Averaging is used in two ways in ERP analysis: to derive mean ERP waveforms for each subject taking part in an experiment and to compute averages of such waveforms (grand averages). There are essentially three classes of methods that are commonly used to resolve ERPs via averaging and a further class of methods where ERPs are reconstructed through the use of mathematical models. We review these methods, discussing their strengths and weaknesses, in Sections 1.1–1.4; we then look at grand averaging in Section 1.5 and summarise the main ideas and contributions of this paper in Section 1.6.
Stimulus-locked averaging requires extracting epochs from EEG signals starting at stimulus presentation and averaging the corresponding ERPs. This is probably the oldest ERP analysis technique, dating back to the days of analogue averaging devices (Lindsley, 1968). Yet, it is still an effective means of investigation (e.g., see Nieuwenhuis et al., 2004; Kopp et al., 1996; Handy, 2004).
An important problem with this form of averaging is that any ERPs whose latency is not phase-locked with the presentation of the stimuli may be significantly distorted or may even completely disappear as a result of averaging (Luck, 2005; Spencer, 2004). This is because the average, x̄(t), of randomly shifted versions of a waveform, w(t), is the convolution between the original waveform and the latency distribution, ℓ(t), for that waveform, i.e., x̄ = w ∗ ℓ (e.g., see Zhang, 1998). Given that latency distributions are non-negative and unimodal, this typically means that a stimulus-locked average can only show a smoothed (low-pass filtered) version of each variable-latency ERP. Furthermore, whenever the latency distribution of an ERP is unknown, the degree to which it will appear deformed in the average and in what ways it will be deformed are also unknown, hampering the interpretation of averages.
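This smearing effect can be illustrated numerically. In the sketch below (a synthetic illustration of ours, not the example in Figure 1), averaging many randomly shifted copies of a Gaussian wave reproduces the low-pass effect predicted by convolution with the latency distribution: the average is both wider and lower in amplitude than any single trial.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1000)  # time in samples

# A hypothetical variable-latency ERP: a unit-amplitude Gaussian wave.
def wave(centre):
    return np.exp(-0.5 * ((t - centre) / 50.0) ** 2)

# Latencies drawn from a broad, unimodal (normal) distribution.
latencies = rng.normal(500, 100, size=2000)

# Stimulus-locked average of the randomly shifted waves.
avg = np.mean([wave(c) for c in latencies], axis=0)

# Width at half maximum of a single trial vs. that of the average:
single_width = np.sum(wave(500) > 0.5)
avg_width = np.sum(avg > 0.5 * avg.max())
```

The average wave's half-maximum width exceeds that of any individual trial, and its peak amplitude is markedly attenuated, which is exactly the distortion discussed above.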
The problem is particularly severe when the task is relatively difficult, since the variability in the latencies of endogenous ERPs and in response times increases with the complexity of the task (Luck, 2005; Polich and Comerchero, 2003). In these cases, multiple endogenous variable-latency ERPs may appear as a single large smooth wave in the average; a synthetic example is shown in Figure 1 (left). This makes it difficult to infer true brain-area activity for any response occurring after the early exogenous potentials typically elicited by (and synchronised with) a stimulus.

In experiments in which the task requires participants to provide a clearly identifiable response, response-locked averaging can be used as an alternative to stimulus-locked averaging to help resolve variable-latency ERPs that are synchronised with the response (e.g., see Spencer, 2004; Keus et al., 2005; Töllner et al., 2008; Luck and Hillyard, 1990). If participants are told to respond as soon as they have made a decision as to what action to take, then averaging should be expected to resolve waves which are at an approximately constant temporal position in relation to the response. The use of response-locked averages can be advantageous, for example, when the effects of different experimental conditions on ERPs are caused by processes related to response selection, response preparation or response inhibition, since these are likely to manifest themselves as response-locked ERPs (e.g., see Nieuwenhuis et al., 2003). In this case, however, the early responses associated and phase-locked with the stimulus will end up being blurred and hard to distinguish, since they are represented in the average by the convolution of their true waveform with the response-time distribution (Zhang, 1998). An example illustrating this problem is shown in Figure 1 (right).
Thus, in forced-choice experiments a researcher is presented with two alternative but often radically different or even conflicting representations of the same data: one based on stimulus-locked averaging and one based on response-locked averaging. Inferring whether a wave in the average represents a true effect or is due to averaging biases can then be difficult. In addition, the deformations produced by blurring may lead to the incorrect evaluation of ERP parameters, such as the onset latency (which, in the average, reflects the fastest trials rather than typical ones). While one can qualitatively integrate the information provided by these two averages, and it is even possible to quantitatively morph one into the other, it is unclear how reliable the result will be.
A key problem is that acquiring and averaging more data does not help increase the fidelity of the reconstructed signals, because there is a systematic error (a distortion with a non-zero mean) in the averaging process. The lack of resolution for ERPs that are not phase-locked with external events is particularly problematic for difficult tasks that require several hundred milliseconds or even seconds to complete and that may involve multiple ERPs (e.g., related to stimulus evaluation, response selection and action). This limits the applicability of stimulus-locked and response-locked ERP averaging for investigating processes taking place in the presence of richer and more realistic sets of stimuli and tasks.
A third alternative to resolve variable-latency ERPs is to attempt to identify them in each trial and estimate their latency. Then, shifting trials on the basis of the estimated latencies and averaging may bring out the desired ERP from its noise background.
In some cases, simple techniques can be used to identify the latencies of known waves. For example, P300s can be located by finding the largest positive deflection in a time window between 300ms and, say, 800ms after stimulus presentation (e.g., see Spencer et al., 2000) or the point at which the area under the signal in that time window reaches 50% of its maximum value (Luck, 2005).
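Both estimators can be sketched as follows (a single-trial illustration of ours; the sampling rate, window limits and function name are assumptions, not the cited studies' code):

```python
import numpy as np

def p300_latency(epoch, fs=512, t0=0.3, t1=0.8):
    """Single-trial P300 latency via (a) the largest positive deflection and
    (b) the 50% fractional-area point, both within the t0-t1 window (seconds)."""
    lo, hi = int(t0 * fs), int(t1 * fs)
    win = epoch[lo:hi]
    # (a) sample of the largest positive deflection, converted to seconds
    peak_latency = (lo + int(np.argmax(win))) / fs
    # (b) point where the area under the (positive part of the) signal
    # reaches 50% of its final value
    area = np.cumsum(np.clip(win, 0.0, None))
    frac_latency = (lo + int(np.searchsorted(area, 0.5 * area[-1]))) / fs
    return peak_latency, frac_latency
```

On a clean synthetic wave the two estimates coincide; on noisy single trials the fractional-area measure is usually the more stable of the two.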
An important issue with ERP-locked averaging is that most methods require prior knowledge about the ERP to be located. For example, one might need to tell an algorithm whether the ERP of interest is positive or negative, its approximate duration and in what particular time window after stimulus presentation it is likely to occur. Without this information, automated detection algorithms have very little hope of finding the latency of the waves of interest. While such knowledge is often available, information can be contradictory. For example, the shape of ERPs may depend on whether one uses AC- or DC-coupled amplifiers and the degree to which preprocessing filters affect the frequency spectrum of such ERPs. Furthermore, the polarity and amplitude of an ERP may be reference- and electrode-dependent. Nonetheless, provided one is careful to refer to studies where experimental conditions closely match those of one's own experiment, for relatively simple experimental conditions and ERPs, reasonably reliable knowledge to feed into an ERP latency-measuring algorithm can be found. However, how would we know what variable-latency ERPs will be present at different stages in the processing of a complex stimulus or the carrying out of a taxing cognitive task? Stimulus-locked or response-locked averaging might be unable to help in the identification of such ERPs. Also, if one hypothesises the existence of a particular ERP and then runs a latency detection algorithm for it on a single-trial basis, it is very likely that the hypothesis will appear to be corroborated irrespective of whether or not the ERP really exists. For example, if one locates positive peaks in random fragments of EEG and averages enough of them, the average will show a clear but entirely artefactual peak.
A related problem is that latency detection algorithms assume that the ERP of interest is present in every trial and we just need to find it. What if an ERP is not always elicited by the stimuli? The ERP might depend, for example, on whether a participant attended a stimulus, whether a participant was rested or tired, etc. (e.g., see Wagner et al., 2000; Bonala et al., 2008). If an ERP is frequently absent, running a latency-measuring algorithm on trials where the ERP did not occur will inundate the averaging process with bias and noise.
An approach to averaging ERPs which accounts for trial-to-trial variability without requiring prior knowledge was introduced by Woody (1967). This is an adaptive filtering technique where the standard average is used as a starting template for an iterative process. For each epoch, the lag corresponding to the maximum covariance with this template is found. Then, each epoch is shifted by its lag and the shifted trials are averaged. The hypothesis is that this will result in a new template that is more accurate than the original one because the ERPs have been aligned. The process is iterated until a fixed point is reached, i.e., until no further shifts are required.
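Woody's procedure can be sketched as follows (a minimal illustration of ours; the lag window, iteration cap and the use of circular shifts are simplifying assumptions):

```python
import numpy as np

def woody_filter(epochs, max_iter=10, max_lag=50):
    """Iterative template alignment in the style of Woody (1967)."""
    epochs = np.asarray(epochs)
    n = epochs.shape[1]
    template = epochs.mean(axis=0)            # standard average as first template
    lags = np.zeros(len(epochs), dtype=int)
    for _ in range(max_iter):
        new_lags = np.empty(len(epochs), dtype=int)
        for i, ep in enumerate(epochs):
            # lag maximising the cross-covariance with the current template
            xc = np.correlate(ep - ep.mean(), template - template.mean(), 'full')
            centre = n - 1                    # index of zero lag
            window = xc[centre - max_lag:centre + max_lag + 1]
            new_lags[i] = np.argmax(window) - max_lag
        if np.array_equal(new_lags, lags):    # fixed point: no further shifts
            break
        lags = new_lags
        # realign epochs by their lags and recompute the template
        template = np.mean([np.roll(ep, -l) for ep, l in zip(epochs, lags)], axis=0)
    return template, lags
```

With a strong synthetic wave the aligned template is taller and sharper than the naive average; as discussed next, at realistic SNRs this alignment becomes unreliable.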
Wastell (1977) analysed the accuracy of Woody's method, finding that the part of the signal that most closely matches the template may not be the ERP of interest. More recently, Thornton (2008) studied the behaviour of Woody's filter as a function of the signal-to-noise ratio (SNR) of the data and found that below an SNR of 5dB the method becomes unreliable. Since in real ERP recordings SNR values tend to be much worse than 5dB, the filter may frequently lead to the incorrect shifting of trials, resulting in ERP-locked averages that significantly misrepresent reality. So, Woody's method gives accurate results only when the ERP of interest is large, sufficiently dissimilar from the noise and characterised by a latency distribution with a relatively small standard deviation (Luck, 2005).
Naturally, it is possible to improve these techniques (e.g., see Thornton, 2008). However, all methods that realign trials based on ERP latencies are likely to suffer from a clear-centre/blurred-surround problem. That is, after shifting trials based on an ERP's latencies, all instances of that ERP will be synchronised, effectively becoming fixed-latency elements. However, stimulus-locked ERPs will now become variable-latency ERPs. Also, all (other) ERPs that are phase-locked with some other event (e.g., the response), but not with the ERP of interest, will remain variable-latency. Not surprisingly, then, they will appear blurred and distorted in an ERP-locked average.
In summary, ERP-locked averaging is safe to use to reveal information about variable-latency ERPs that are known to exist, whose main characteristics have been identified using other methods and that are present in every trial being averaged. However, one needs to be very careful when using such averages as tools for identifying newly hypothesised ERPs.
To overcome the problems of the methods reviewed above and better reconstruct ERPs, researchers have explored a variety of tools from statistics, signal processing and related fields. All these methods make strong assumptions about the definition and nature of the ERPs to be reconstructed and about the nature of their interactions. Below we review some key techniques.
Let us assume that the signal recorded in a forced-choice experiment is the sum of two ERPs (a stimulus-locked ERP, s(t), and a response-locked ERP, r(t)) and that the response time does not affect their shape but only their relative position within a trial. Under these assumptions it is possible to recover the two ``true'' ERPs from the response-locked average, x̄_R(t), the stimulus-locked average, x̄_S(t), and the response-time distribution, f(t) (Zhang, 1998; Hansen, 1983). The approach effectively involves jointly solving the two equations x̄_S = s + r ∗ f and x̄_R = s ∗ f⁻ + r, where f⁻(t) = f(−t), for s and r in the frequency domain and then antitransforming the result. The technique has recently been extended (Yin et al., 2009), e.g., to deal with the case of experiments involving cues in addition to stimuli and responses. A potential problem for this technique is that it may be difficult to check the degree to which the assumptions it relies on are valid for a particular experiment. Also, the technique cannot recover variable-latency ERPs that are not phase-locked with an externally observable event: only partial information can be recovered and only under further strong assumptions.
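The frequency-domain solution can be sketched as follows (our own illustration of the idea in Zhang, 1998, and Hansen, 1983, not the original implementation; the regularisation constant is an assumption). Writing S, R and F for the transforms of the stimulus-locked ERP, the response-locked ERP and the response-time density, the two averages transform to X_S = S + R·F and X_R = S·F* + R, a linear system solved per frequency bin:

```python
import numpy as np

def recover_s_r(stim_avg, resp_avg, rt_density, eps=1e-3):
    """Recover the stimulus-locked ERP s and the response-locked ERP r from
    the two averages and the response-time density, in the frequency domain."""
    Xs = np.fft.fft(stim_avg)
    Xr = np.fft.fft(resp_avg)
    F = np.fft.fft(rt_density)
    denom = 1.0 - np.abs(F) ** 2
    denom = np.where(np.abs(denom) < eps, eps, denom)  # crude regularisation
    S = (Xs - F * Xr) / denom          # from Xs = S + R*F, Xr = S*conj(F) + R
    R = (Xr - np.conj(F) * Xs) / denom
    return np.real(np.fft.ifft(S)), np.real(np.fft.ifft(R))
```

Note that at DC (and wherever |F| is close to 1) the system is ill-conditioned, which is why a regularisation step, here a simple clamp, is unavoidable in practice.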
Under a linear model of ERP interaction, when it is reasonable to assume that some ERPs are present with substantially the same amplitude and latency in two experiments while other ERPs are present only in one, it may be possible to isolate the former from the latter. For example, by adopting Kok's (1988) additive model of interaction between motor-related potentials (MRPs) and P300s, Salisbury et al. (1994) were able to compute an average MRP waveform and to subtract MRP contamination from the P300's average waveform. The technique was later refined by Salisbury et al. (2001), who corrected the effects of MRPs on P300s on a trial-by-trial basis by carefully matching pairs of trials in two variants of an experiment according to the associated response time. An advantage of subtraction techniques of this type is that complex mathematical manipulations of the data are not required. Naturally, if the ERPs of interest are of variable latency, or variable-latency ERPs other than MRPs are present, the average of the recovered waveforms will still be affected by the low-pass filtering effects discussed in Sections 1.1–1.2. Also, since the variance of the difference of stochastic variables is the sum of their variances, the process of subtracting ERPs (whether averaged or not) increases the noise affecting the data by a factor of √2, which may need to be compensated for by the acquisition of more data.
Principal Component Analysis (PCA) has been suggested as a powerful statistical tool for the analysis of EEG and ERPs since the mid-sixties (Donchin, 1966; Streeter and Raviv, 1966). PCA is based on the idea that the data are in fact a linear combination of ``principal components'' which need to be identified. PCA components are orthogonal and they maximally account for the variance present in the data. Because of this, it is often possible to accurately represent the original data with a small set of components. Two forms of PCA are used in ERP analysis: one where one wants to find components that represent the covariance in the measurements taken at different electrodes, and one where one is interested in modelling the temporal variations in a signal. The latter, temporal PCA, is relevant in the context of this section.
Temporal PCA has been effective in identifying ERPs and in clarifying how ERPs vary as a result of changes in the stimuli, treatments or subject groups (e.g., see Do and Kirk, 1999; Dien et al., 2003; Donchin, 1966; Kayser and Tenke, 2006; Spencer et al., 2000). However, the technique makes strong assumptions (see Donchin and Heffley, 1978). Firstly, PCA is linear: it assumes that ERPs do not interact. Secondly, it assumes that the major sources of variance (the principal components) are orthogonal; so different, but correlated, ERPs may end up being represented by a single component. Thirdly, the technique implicitly assumes that only the amplitudes of ERPs vary, not their latency; when this is not the case, the PCA component associated with an ERP may totally misrepresent reality (Donchin and Heffley, 1978). Thus, the use of PCA in ERP analysis requires significant care, and there is evidence that results may be misleading (e.g., see Beauducel and Debener, 2003). Variable-latency ERPs cannot be properly resolved with this technique.
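As a concrete (and deliberately favourable) illustration, temporal PCA on a trials-by-time matrix can be sketched via the singular value decomposition; this is a minimal sketch of ours, omitting the covariance-matrix construction and component rotation typically used in ERP practice. It works well only when, as assumed, just the amplitudes vary across trials:

```python
import numpy as np

def temporal_pca(epochs, n_components=3):
    """Temporal PCA of an (n_trials, n_samples) epoch matrix via the SVD."""
    centred = epochs - epochs.mean(axis=0)   # remove the average waveform
    u, sing, vt = np.linalg.svd(centred, full_matrices=False)
    comps = vt[:n_components]                # component time courses
    scores = centred @ comps.T               # per-trial component amplitudes
    var_explained = sing[:n_components] ** 2 / np.sum(sing ** 2)
    return comps, scores, var_explained
```

On synthetic trials built from two fixed-latency waves with random amplitudes, two components capture almost all the variance; introducing latency jitter instead spreads each wave across many components, which is the failure mode noted above.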
Independent Component Analysis (ICA) (e.g., see Hyvärinen et al., 2001) has also seen considerable popularity in the analysis of EEG and ERPs (Makeig et al., 1996, 1997; Jung et al., 2001; Makeig et al., 1999, 2002). If a set of signals is the result of linearly superimposing some statistically independent sources, ICA can decompose the signals into their primitive sources. These are called ``independent components''. When ICA is applied to the signals recorded at different electrodes on the scalp, it can separate important sources of EEG and ERP variability. This can then be exploited, for example, for removing artifacts. The use of ICA for reconstructing ERPs with varying latency has also been trialled (Jung et al., 2001). In the presence of variable-latency ERPs, the method tends to allocate different ICA components to different ERPs if they originate from different areas. ICA-based reconstruction of variable-latency ERPs requires that the different ICA components that capture separate ERPs be appropriately temporally shifted so as to realign the components. Then, at least in principle, antitransforming the shifted ICA components together with any non-shifted ones should reconstruct a signal where all ERPs are fully resolved. However, the ICA component alignment process is manual and to some extent arbitrary, as is the identification of the number of ERPs that need reconstructing. So, the method must be guided by prior knowledge and results may present significant inter- and intra-experimenter variability. Also, it is hard to interpret what exactly the final resulting ``average'' represents, since it is essentially a morph between stimulus-locked and ERP-locked averages.
Finally, the application of ICA to ERP analysis is based on strong assumptions: the linearity of the brain as a conduction medium, the statistical independence of the sources of electrical activity, the fixed position of such sources, the absence of conduction delays in the brain, and the non-Gaussianity of the statistical distributions of the sources (Makeig et al., 1996, 1997; Jung et al., 2001; Makeig et al., 1999, 2002). It may be difficult to verify to what extent these assumptions are tenable for a specific experiment.
In ERP analysis a grand average is simply an average of the average waveforms obtained on a subject-by-subject basis. The purpose of grand averages is the reduction of the noise that may affect single-subject averages and the identification of the commonalities between such averages. A defect of grand averages is that they may not represent the waveforms recorded from individual subjects well (Luck, 2005). One reason for this is that, even if all subjects present the same sequence of ERPs in a given condition, such ERPs (particularly variable-latency ones) are likely to have different latencies in the averages of different subjects. Thus, averaging such averages will produce low-pass filtering effects similar to those affecting ordinary ERP averages.
While grand averages are the most widespread technique for combining evidence across subjects, it is important to also consider the alternative of simply averaging all trials pertaining to a certain condition irrespective of subject. This is because the two strategies address different questions: grand averaging answers the question of what the ERPs of a typical subject in a certain condition look like, while averaging across subjects addresses the question of what the typical waveform for the ERPs recorded in a particular condition is.
Grand averages and averages across subjects are mathematically very similar. Let x_{i,s}(t) be the i-th ERP epoch recorded for subject s. The subject's average is given by x̄_s(t) = (1/N_s) Σ_i x_{i,s}(t), where N_s is the number of trials for that subject. Then, the grand average is given by x̄(t) = (1/S) Σ_s x̄_s(t), where S is the number of subjects, while the average across subjects weighs every trial, rather than every subject, equally: x̃(t) = Σ_s N_s x̄_s(t) / Σ_s N_s.
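The distinction can be made concrete with a small sketch (the function name is ours): grand averaging weighs subjects equally, pooling weighs trials equally, and the two differ whenever trial counts differ across subjects.

```python
import numpy as np

def grand_and_pooled(subject_epochs):
    """subject_epochs: list of (n_trials_s, n_samples) arrays, one per subject."""
    per_subject = [ep.mean(axis=0) for ep in subject_epochs]  # per-subject averages
    grand = np.mean(per_subject, axis=0)                      # subjects weigh equally
    pooled = np.concatenate(subject_epochs).mean(axis=0)      # trials weigh equally
    return grand, pooled
```

For instance, with one subject contributing 10 trials of a unit-amplitude wave and another contributing 30 flat trials, the grand average is half the wave while the pooled average is only a quarter of it.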
Both grand averages and averages across subjects can be computed for all forms of average discussed in Sections 1.1–1.3.
As Sections 1.1–1.4 make clear, a more precise and direct way of identifying variable-latency ERPs, as well as measuring their latency and amplitude, is needed. This need is particularly pressing in the presence of complex and realistic tasks where precise knowledge may not be available about which ERPs are present and how their amplitudes and latencies are affected by particular conditions.
In this paper we propose a simple technique which we believe can achieve this: binning trials based on their recorded response time and then computing bin averages. This has the potential of solving the problems of stimulus-locked, response-locked and ERP-locked averages, effectively reconciling them. In particular, response-time binning can significantly improve the resolution with which variable-latency waves can be recovered via averaging. The reason is simple.
The idea is that if one selects out of a dataset all those epochs where a participant was presented with qualitatively identical stimuli and gave the same response within approximately the same amount of time, it is reasonable to assume that similar internal processes will have taken place (we will call this a cognitive homogeneity assumption). So, within those trials, ERPs that would normally have a widely variable latency might be expected, instead, to present a much narrower latency distribution. Thus, if we bin epochs on the basis of stimuli, responses and response times, we should find that, for the epochs within a bin, the stimulus, the response, as well as fixed- and variable-latency ERPs are much more synchronised than if one did not divide the dataset. Averaging such epochs should, therefore, allow the rejection of noise while also reducing the undesirable distortions and blurring associated with averaging (see Sections 1.1–1.3) and avoiding the complexities, strong assumptions or manual labour involved in the application of the methods described in Section 1.4. Response-time binning and averaging should result in clearer descriptions of brain activity without the need for prior knowledge of the phenomena taking place and the ERPs elicited in response to the stimuli. In this paper we describe our implementation, analysis and evaluation of this technique.
Many studies on the relationship between reaction times and the amplitude and latency of ERPs have been reported in the literature (see, for example, Kutas et al., 1977; McCarthy and Donchin, 1981; Donchin et al., 1978). Typically they rely on the trial-by-trial measurement of the amplitude and/or latency of ERPs and the statistical analysis of their covariance with the corresponding response times (see Childers et al., 1987). In a smaller fraction of the studies, however, trials were divided up into broad groups by reaction time, e.g., fast vs slow responses (as in Woodman and Luck, 1999; Makeig et al., 1999), and were then averaged. We are also aware of one study (Roth et al., 1978) where ERP trials were grouped by response-time quartiles and one (Gratton et al., 1988) where trials were grouped using four predefined response-time intervals. While this prior work presents some similarities with what we propose here, there are also significant methodological and philosophical differences. We discuss them below.
Firstly, the subdivision of trials into groups based on response time in previous work is virtually always motivated by the desire to measure the amplitude or latency of one specific wave (e.g., the P300, as in Roth et al., 1978) in each group and then to relate such measures to the corresponding response times. Here, instead, we propose the use of response-time binning not just to better understand the relationship between specific waves and reaction times, but chiefly as a method to identify and resolve such waves in the first place, thanks to the increased resolution it provides.
Secondly, as we will see, response-time binning increases the effective resolution of bin averages in inverse proportion to the bin size. So, although for testing purposes here we used four bins as in (Roth et al., 1978) and (Gratton et al., 1988) (we discard our fourth bin, as it is largely made up of outliers), we propose to use as many bins as the cardinality of the dataset can support.
Thirdly, we suggest that to truly benefit from the resolution enhancements provided by response-time binning one needs to use it in forced-choice experiments and after trials have already been divided up by stimulus type and response, to ensure that the bins are as homogeneous as possible. Failing to do so may render the technique pointless from the resolving-power viewpoint. In our tests we use a forced-choice setup and we do not simply divide the trials into `Correct' and `Incorrect', but into `True Positives', `True Negatives', `False Positives' and `False Negatives'. By contrast, for instance, the experimental setup in (Roth et al., 1978) was not one of forced choice (the absence of a response within 800ms of the stimulus was taken as a negative response). So, the trials that were averaged in the last quartile were not homogeneous.
Fourthly, while binning is useful also in the case of short response times, we propose that it is really in experiments requiring the processing of complex stimuli or the performance of complex tasks, with correspondingly longer reaction times, that the binning technique can provide the biggest advantages over other ERP averaging methods. However, previous work involving the use of response-time bins has mainly focused on simple tasks. For example, in (Roth et al., 1978) reaction times in the first and fourth quartiles were 366ms and 540ms, respectively, while in (Gratton et al., 1988) four 50ms-wide bins covered the range 150–349ms.
Finally, we should note that, whenever one divides up a noisy dataset into subsets and then studies the subsets separately, each subset contains fewer trials and, thus, inference of true effects and ERPs from the noise is harder. So, there is a trade-off between the desire to gain more precise information by averaging signals that are more homogeneous and the loss of precision due to the reduced noise rejection associated with smaller sets. Here we analyse this trade-off via an evaluation of how dividing up trials by reaction time affects SNR. To the best of our knowledge, an analysis of this kind has never been reported in the literature. Furthermore, for the first time we will formally relate the use of bins to the resolution with which fixed- and variable-latency ERPs can be recovered via averaging, by connecting response-time binning to the theory proposed in (Zhang, 1998; Hansen, 1983). So, the paper also fills significant theoretical gaps.
In ERP experiments EEG signals are partitioned into epochs. In our tests with response-time binning we used epochs starting at the onset of a stimulus and lasting 1200ms.
Naturally, response-time binning requires deciding how many bins to use and how wide each bin should be. Given that binning reduces the number of epochs contributing to ERP statistics (e.g., bin averages), thereby increasing the noise affecting them, it may be best to start with a small number of bins to check if this reveals previously undetected regularities. If it does, and more resolution is desired, one can divide the dataset more finely. If noise becomes a problem, one can test participants for longer or over multiple sessions, with the reasonable expectation that the additional data will further increase our knowledge of a phenomenon (which is not necessarily the case for standard averages when waves with varying latencies are present). Thus, to demonstrate our technique, we divided epochs into three main bins.
In many conditions, response times have highly skewed distributions with long upper tails (e.g., see Figure 3). Therefore, unless one is specifically interested in studying waves corresponding to unusually long response times, to avoid their smearing effect on averages it is important to discard events in the extreme right tails of response-time distributions. In this paper we chose to discard the trials falling in the rightmost 10% of each distribution (i.e., the 10th decile).
Once these anomalous data have been removed, we are faced with an important dilemma. In principle, it would seem desirable to create response-time bins that are equally spaced in time, i.e., all of the same width. This would tend to give the same temporal resolution to all bin averages. However, because response-time distributions are skewed, doing so would create bins with very unequal numbers of epochs in them, resulting in bin averages affected by radically different noise levels. A better alternative from this point of view is to use bins which correspond to equal areas under the distribution. Because of the shape of response-time distributions, this may produce bins of unequal widths and averages with different effective resolutions.
In our work we adopted this approach. Thus, from the 90% of the trials left after removing the 10th decile of the distribution, we created three bins: one gathering the shortest 30% of response times (bin 1), one for the middle 30% (bin 2) and one for the longest 30% (bin 3).
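The tail-trimming and equal-area binning steps can be sketched as follows (the function and its defaults are ours; the defaults reproduce the 10% tail cut and the three equal-count bins described above):

```python
import numpy as np

def rt_bins(response_times, n_bins=3, tail=0.10):
    """Discard the upper `tail` quantile of response times, then split the
    remaining trials into `n_bins` bins with (roughly) equal trial counts."""
    rts = np.asarray(response_times, dtype=float)
    keep = rts <= np.quantile(rts, 1.0 - tail)       # drop the extreme right tail
    edges = np.quantile(rts[keep], np.linspace(0.0, 1.0, n_bins + 1))
    edges[-1] += 1e-9                                # make the last edge inclusive
    labels = np.digitize(rts, edges) - 1             # bin index for each trial
    labels[~keep] = -1                               # mark discarded trials
    return labels, edges
```

Equal-count (equal-area) bins give each bin average a comparable noise level, at the price of bins of unequal temporal width, which is the trade-off discussed above.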
EEG signals were acquired using a BioSemi ActiveTwo system with 64 pre-amplified DC-coupled electrodes spaced evenly over the scalp. Additional electrodes were placed at the earlobes for offline referencing, at the left and right external canthi to record the horizontal electrooculogram (HEOG), and infraorbitally to record the vertical electrooculogram (VEOG). Signals were acquired at 2048 samples per second, were then band-pass filtered between 0.15 and 40 Hz and, finally, were downsampled to 512 samples per second.
Effects of eye blinks and vertical components of saccades were reduced by using time-domain linear regression between each channel and the VEOG. That is, we subtracted from each EEG channel a proportion of the signals recorded by the two VEOG channels; the proportion was obtained by computing the correlation between the EEG signals recorded at each electrode and the VEOG signals and dividing by the VEOG's power (Luck, 2005; Verleger et al., 1982).
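For a single EEG channel and a single VEOG channel, the regression-based correction can be sketched as follows (a simplified one-channel version of ours; the procedure above uses both VEOG channels):

```python
import numpy as np

def remove_veog(eeg, veog):
    """Subtract the ocular contribution from an EEG channel: the propagation
    factor is the EEG-VEOG correlation divided by the VEOG's power."""
    b = np.dot(eeg, veog) / np.dot(veog, veog)
    return eeg - b * veog
```

After the subtraction the corrected channel is, by construction, uncorrelated with the VEOG, so any linearly propagated ocular activity has been removed.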
We then applied to each bin an artifact-rejection procedure which involved computing the first (Q1) and third (Q3) quartiles of the voltages at each time step across all the epochs in a bin. The procedure then removed all epochs where, at any time step, the signal fell outside a range extending a fixed multiple of the interquartile range, Q3 − Q1, below Q1 and above Q3.
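A quartile-based rejection of this kind can be sketched as follows (the fence multiplier k below is our assumption, not necessarily the value used in the study):

```python
import numpy as np

def reject_artifacts(epochs, k=1.5):
    """Drop any epoch that leaves the Tukey-style fence
    [Q1 - k*IQR, Q3 + k*IQR] at any time step."""
    q1, q3 = np.percentile(epochs, [25, 75], axis=0)  # per-time-step quartiles
    lo = q1 - k * (q3 - q1)
    hi = q3 + k * (q3 - q1)
    ok = np.all((epochs >= lo) & (epochs <= hi), axis=1)
    return epochs[ok]
```

Because the fence is derived from quartiles rather than from the mean and standard deviation, a single extreme epoch cannot inflate the rejection threshold and thereby hide itself.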
To further reduce the effect of outliers, instead of simply averaging trials, unless otherwise stated, we used 40%-trimmed averages. These are robust measures of central tendency that are less sensitive to outliers than the ordinary mean (Huber, 1981). They have been shown to provide significant increases in reliability compared to ordinary averages in ERP and event-related desynchronization analysis (Gasser et al., 1986; Burgess and Gruzelier, 1999; Rousselet et al., 2008). Trimmed averages are computed as follows. For each time step, the voltages recorded in the epochs in a bin are sorted, and the upper and lower 40% are discarded. The remaining central (and, in a sense, most representative) 20% of the voltages are then averaged. The process is repeated for each time step.
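The trimmed average can be sketched as follows (a minimal per-time-step implementation of ours):

```python
import numpy as np

def trimmed_average(epochs, trim=0.4):
    """Per-time-step trimmed mean: sort the voltages across epochs, discard
    the lower and upper `trim` fractions, average the remaining centre."""
    srt = np.sort(epochs, axis=0)
    n = epochs.shape[0]
    cut = int(np.floor(n * trim))
    return srt[cut:n - cut].mean(axis=0)
```

With trim=0.4, a single wildly deviant epoch is guaranteed to fall in the discarded tails and so cannot shift the estimate, whereas it would shift the ordinary mean in proportion to its amplitude.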
To evaluate response-time binning we modified a forced-choice experiment designed by Esterman et al. (2004) where the task requires detecting a coloured target letter in the presence of distractor letters of different colours. Participants, stimuli and procedure are described below.
Six students from the University of Essex took part in the experiment (average age: 24 years; five female; one participant was left-handed). All had normal or corrected-to-normal vision and normal colour vision.
On each trial, participants were presented with a four-letter string. The first and last letters were always `S'. Of the two middle letters, one was always an `O', while the other was either an `L' or an `X'. Letters subtended a fixed visual angle vertically; the horizontal gap between letters subtended the same angle. The first and last letters were always white, while the colour of each of the two middle letters could be red, green or blue, but the two were never the same colour. The background was black.
Each letter string was randomly presented in one of four regions of the display. These extended from the centre of the screen to its top-left, top-right, bottom-left and bottom-right corners, respectively. The horizontal displacement of the inner edge of a string with respect to the centre of the screen took one of three values (as described in Section 2.3.3). The vertical displacement of the string was always identical to the horizontal displacement.
In the experiment participants had to decide whether or not, on each display, a target letter was presented. The target was always an `L' of a specific colour.
The experiment was divided into blocks of 40 trials each. The target was present in 20% of the trials. At the beginning of the experiment participants were told the colour of the target letter. Every six blocks the target colour was changed so as to test each participant on each target colour on an equal number of trials. Target colour order was counterbalanced across subjects.
To control the level of difficulty of the experiment, the stimuli to be presented in each block were carefully chosen in relation to the frequency of targets and non-targets as well as the possibility of being deceived by letters having one (but not both) of the features of the target. The colour and letter combinations used for the two middle letters of the stimulus string in the 40 trials of each block were as follows: eight trials with the target (an `L' of a specific colour) and an `O' of a non-target colour; eight trials with an `O' of the target colour and an `L' of a non-target colour; eight trials where an `O' and an `L', both of a non-target colour, were presented; four trials where an `X' of the target colour and an `O' of a non-target colour were presented; four trials where an `X' had a non-target colour and an `O' had the target colour; and eight trials where an `X' and an `O' were presented, both in a non-target colour. In every block trial order was randomised.
A white dot was always visible at the centre of the display. At the beginning of each trial, the dot was replaced by a fixation cross for 500ms, and then the letter string briefly appeared in one of the quadrants. Participants were instructed to gaze at the white dot/cross and to try not to move their eyes when the stimulus string was presented. The string was displayed for a duration which was adjusted at the end of each block of the experiment, according to the percentages of correct responses in the block. The objective was to keep a subject's accuracy between 75% and 90%. This procedure ensured stimulus presentation was fast enough to make target detection relatively difficult, while at the same time discouraging participants from guessing too often.
The duration of the stimulus display varied between 50ms and 150ms (with intermediate steps that were multiples of the inverse of our computer screen's refresh rate, which was 60Hz). All participants started at 150ms. The most frequent presentation times were 83ms and 100ms.
The horizontal and vertical displacements of the letter string were also changed in relation to performance. The first block used the smallest of the three displacements. If a participant's accuracy was too high, the displacement was increased in the following blocks to the intermediate value and then, if necessary, to the largest value.
Participants gave their responses by pressing the left button of a mouse with the index finger for `Yes' responses and the right button with the middle finger for `No' responses. Each response was followed by an interval of 1 second, after which the next trial started.
After a practice session, each participant completed six blocks with each target colour for a total of 18 blocks.
In this section we will empirically evaluate the binning technique using the data collected in the experiment described above. Trials were divided into four categories (true positives, true negatives, false positives and false negatives) according to whether the target was present or absent and whether the response was `Yes' or `No'. Unless otherwise stated, the results for each category are based on cumulating the trials of all subjects. So, most of the ERP averages we show are across subjects (see Section 1.5). In our experiment, these are qualitatively very similar to grand averages as illustrated in Figure 2 for our four conditions. We will also report some singlesubject results to illustrate the applicability of the method to the study of withinsubject ERP variability.

We show the response-time distributions recorded in our experiments for these four conditions in Figure 3 (note that amplitudes have been normalised so that the curves are proper density functions, i.e., the area under each curve is unitary; abscissas are in seconds). For each condition we created three bins, each containing 30% of the distribution (the rightmost 10% of the distribution was discarded). Bin boundaries are shown as vertical lines in Figure 3.
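The quantile-based binning just described (three bins of 30% each, discarding the slowest 10%) can be sketched as follows. The gamma-distributed toy response times and the helper name are ours.

```python
import numpy as np

def response_time_bins(rts, n_bins=3, keep=0.9):
    """Split response times into equal-mass bins: discard the slowest
    10% of the distribution and cut the remaining 90% into three bins of
    30% each.  Returns the bin edges and, for each trial, its bin index
    (-1 for discarded trials)."""
    qs = np.linspace(0.0, keep, n_bins + 1)        # [0, 0.3, 0.6, 0.9]
    edges = np.quantile(rts, qs)
    idx = np.searchsorted(edges, rts, side="right") - 1
    idx[idx >= n_bins] = -1                        # slowest 10% discarded
    return edges, idx

rng = np.random.default_rng(3)
rts = rng.gamma(4.0, 0.2, size=2000)               # skewed, RT-like times
edges, idx = response_time_bins(rts)
counts = np.bincount(idx[idx >= 0], minlength=3)
```

With 2000 simulated trials, each bin receives exactly 30% (600 trials) and 200 slow trials are discarded.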

The medians and standard deviations, the latter estimated using the standard robust estimator provided by 1.4826 times the median absolute deviation from the median (or MAD for short), for the whole distribution as well as for the bins in each condition are shown graphically in Figure 3 and numerically in Table 1. Table 1 also reports the ranges of response times associated with each bin in the different conditions. The numbers of trials in each class are shown in Table 2.
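The robust spread estimate used above is a one-liner; the function name is ours:

```python
import numpy as np

def robust_std(x):
    """Robust standard-deviation estimate: 1.4826 * MAD, where MAD is
    the median absolute deviation from the median.  The constant 1.4826
    makes the estimator consistent with the SD for Gaussian data."""
    med = np.median(x)
    return 1.4826 * np.median(np.abs(x - med))

rng = np.random.default_rng(4)
x = rng.normal(5.0, 2.0, size=100_000)
x[:100] = 1e6                    # 0.1% wild outliers
```

Even with 0.1% extreme outliers, the MAD-based estimate stays at the true value of 2, whereas the ordinary standard deviation explodes.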
Medians and Standard Deviations (in seconds), for all trials (All) and for Bins 1 to 3, in each condition (True Positives, True Negatives, False Negatives, False Positives).
Response-time Ranges (seconds)

         True Positives   True Negatives   False Negatives   False Positives
All      0.00–2.00        0.00–2.00        0.00–2.00         0.00–2.00
Bin 1    0.00–0.67        0.00–0.56        0.00–0.56         0.00–0.70
Bin 2    0.67–0.84        0.56–0.72        0.56–0.81         0.70–0.96
Bin 3    0.84–1.43        0.72–1.29        0.81–1.48         0.96–1.70
         True Positives   True Negatives   False Negatives   False Positives
All      521              2967             521               351
Bins     156              890              156               105
Although the distributions in Figure 3 have some similarities, they are in fact pairwise statistically different, thanks in part to the large size of our samples. This is clearly shown by the results of the Kolmogorov–Smirnov test for distributions presented in Table 3. This suggests that dividing the data into four categories by response and presence/absence of the target makes sense.
                  True Positives   True Negatives   False Negatives   False Positives
True Positives    N/A              0                                  0.00028
True Negatives    0                N/A              0.0013            0
False Negatives                    0.0013           N/A
False Positives   0.00028          0                                  N/A
We used 40%-trimmed averages to represent the typical behaviour of the ERPs for the different stimulus-response pairs (i.e., true/false positives/negatives), both for the bins and for the whole sample.
Let us start by looking at what happens if we compute stimulus-locked and response-locked averages of the four classes without binning. Figure 4 shows the results of this process for electrodes Cz and Pz, for our four conditions. In the plots on the left, the stimulus onset is at 0ms. For easier comparison, the response-locked-average plots on the right were independently shifted so that the stimulus is also at 0ms. Let us analyse these plots.

We can see from Figure 4 that, as expected, the stimulus-locked averages on the left resolve the early ERPs clearly. Conversely, there is very little detail in the patterns of activity after the first 400ms, i.e., preceding and immediately following the response. The situation is completely reversed when we consider the response-locked averages on the right. Here the early potentials are effectively impossible to discern, while the waves of activity in the proximity of the response, and their individual differences, appear to be well resolved.
The differences between the stimulus-locked and the response-locked representations of brain activity are generally very large. As an illustration, in Figure 6 (top) we have superimposed the stimulus-locked and response-locked averages recorded at Cz for the true negatives. It would be difficult to integrate the two averages into a unified interpretation. In general, it is hard to tell how far away from the synchronising event (whether stimulus or response) we can go before the waves shown in the averages in Figure 4 are just artifacts.
Let us now look at the averages obtained using response-time binning for the different conditions.
Let us start with the true-negative trials. The first row of Figure 5 shows the stimulus-locked averages for channels Cz and Pz. It is immediately apparent how much crisper the different ERPs are when using bins than in Figure 4 (left). It is also clear that different response times are in fact associated with different amplitudes and latencies of the ERPs, particularly for the late ERPs following the exogenous responses. We should note, however, that bin 3 has a much wider response-time distribution than the other two bins (see Figure 3). The relative lack of late activity in bin 3 can, therefore, be partly attributed to residual latency jitter.
The improvement in the effective resolution is also confirmed by the averages for the false negatives shown in Figure 5 (second row). Using binning we can now see how qualitatively different the ERPs produced in the presence of an incorrect response can be. For example, it is clear how bin 3 deviates from the others: in Cz we can see an early potential which is not present in the averages for the other bins, while the large positive wave occurring in those bins between approximately 500ms and 700ms is totally absent from the third bin. Why is this happening? We can easily hypothesise explanations, but we will not attempt to interpret these findings in this work. We should stress, however, that it is really the binning technique that has made it possible to ask these questions in the first place, by identifying otherwise undetectable differences.

As shown in Figure 5, the false positives present some similarities with the true negatives. These include, for example, the effective absence of a positive wave between approximately 500ms and 700ms in the average for bin 3.
If we look at the averages for the true positives (bottom of Figure 5), we find that between-bin differences are much reduced, with only bin 1 showing an amplitude elevation between 300ms and 800ms. This suggests that the events taking place in the correct recognition of a target vary much less with the reaction time than in the other categories. So, in a sense, the binning technique is telling us that in this case ordinary stimulus-locked averages can be trusted more than in the other cases. This points to a general use of the binning technique: whenever the plots for the averages of different bins coincide in a particular range of times, we should expect ordinary averages to be reasonably reliable for that range of times, and vice versa.
In principle one can average the trials in a bin by aligning them either on the stimulus or on the response. As we mentioned in Sections 1.1 and 1.2, stimulus-locked and response-locked averages can accurately resolve only waves which are phase-locked with the corresponding reference event, and it is, therefore, difficult to integrate the information they provide into a unified picture. So, it seems reasonable to ask whether binning has any beneficial effects in this respect.

Unsurprisingly, when bins are narrow, aligning the epochs in a bin on the stimulus onset or on the response produces very similar averages, as illustrated in Figure 6 (rows 2 to 4, left) for the true negatives for channel Cz. Note how similar the response-locked and stimulus-locked averages are for bins 1 and 2. This is common to all conditions. Only in bin 3 can we see discrepancies between the two averages. The reason is that, despite our removing the 10% of the distribution corresponding to the longest response times, bin 3 still has a much larger response-time variance than the other two bins. The averaging biases discussed in Sections 1.1 and 1.2 will, therefore, manifest themselves in bin 3 as well, albeit to a lesser degree than in the absence of binning. It is then not surprising to see that, for that bin, the early ERPs are only well captured by the stimulus-locked average, and vice versa. Note, however, that by showing differences between the two plots, the binning technique reveals that if one wants to study more precisely what happens in unusually long trials, the response-time distribution needs to be divided more finely. Of course, since this reduces the number of trials present in each bin, one needs to test enough subjects, and each subject for long enough (e.g., in multiple sessions), to ensure noise levels are sufficiently low.
Overall, we can see that binning brings the two main ways of studying ERPs (stimulus-locked and response-locked averaging) closer together, effectively unifying them for narrow bins.
On the right of Figure 6 we show plots of the difference between the bin averages and the corresponding average obtained using all trials. For all bins, absolute differences of 2µV or more are present over periods of several hundred milliseconds, particularly in the central region of the epochs. In that region, relative errors of 30% or more are common across all bins, with bin 1 showing differences of over 6µV, which correspond to relative errors of nearly 70%. This suggests that a large proportion of the variance in ERP averages is actually accounted for by latency jitter.
The results presented in the previous section are of a qualitative nature. The question of the degree to which averages constructed via responsetime binning are effective at resolving and properly representing ERPs needs to be addressed more formally. Let us start by checking whether observed differences are statistically significant. This can be done as follows.
If we focus on one particular time step, we can treat each bin as a univariate sample of the amplitudes recorded at that time step in the epochs in the bin. We can then use the Kolmogorov–Smirnov test to check whether the samples in pairs of bins might be drawn from the same distribution. Since single-trial amplitudes are very noisy, ERP amplitudes are rarely estimated by looking at a single sample. So, instead of passing to the test the amplitudes of a specific sample, we can use amplitude averages taken over small intervals centred around the time of interest in each trial. The p-values obtained via the test when comparing bin amplitudes at a specific time will then reveal whether differences at that time are significant.
For example, if we look at the false positives and channel Cz across all subjects and compare bins 1 and 2, bins 1 and 3 and bins 2 and 3, we find that amplitude differences are all highly statistically significant between 550ms and 600ms, but are not in the interval 850ms–900ms.
A comprehensive representation of the intervals where amplitude differences between bins are significant is provided by what we could call a Kolmogorov–Smirnov-gram (or KS-gram for short), i.e., a plot of the p-values obtained when sliding a time window over the trials and running the Kolmogorov–Smirnov test on the average amplitudes recorded in the window in a pair of bins. A diagram showing the KS-grams obtained for our three bins for channel Cz and for the true-negative trials across all subjects is shown in Figure 7 (top left). The 5% significance level is represented by the horizontal dashed line in the figure. The KS-grams in the figure were computed using a 30ms-wide sliding window. For reference, Figure 7 (top right) shows the averages for the bins.
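The KS-gram computation can be sketched as follows (the function name, sampling rate and synthetic data are ours; only the 30ms window follows the text):

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_gram(bin_a, bin_b, fs, win_ms=30):
    """KS-gram: slide a window along the epoch; at each position, average
    each trial's amplitude inside the window and run a two-sample
    Kolmogorov-Smirnov test between the two bins' windowed amplitudes.

    bin_a, bin_b: arrays (n_trials, n_samples); fs: sampling rate (Hz).
    Returns the p-value at each window position."""
    w = max(1, int(round(win_ms * fs / 1000)))
    n = bin_a.shape[1] - w + 1
    pvals = np.empty(n)
    for t in range(n):
        a = bin_a[:, t:t + w].mean(axis=1)   # windowed amplitude, bin A
        b = bin_b[:, t:t + w].mean(axis=1)   # windowed amplitude, bin B
        pvals[t] = ks_2samp(a, b).pvalue
    return pvals

# toy data: two "bins" that differ only in the second half of the epoch
rng = np.random.default_rng(5)
a = rng.normal(0, 1, size=(150, 200))
b = rng.normal(0, 1, size=(150, 200))
b[:, 100:] += 1.0
p = ks_gram(a, b, fs=200.0)
```

In the first half of the epoch the p-values hover around chance levels, while in the second half (where the bins genuinely differ) they collapse to essentially zero.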
As one can easily see, all plots in Figure 7 (top left) are on the `statistically significant' side of the diagram for a large proportion of the epoch. More precisely, the KS-gram for bins 1 and 3 shows that there are statistically significant amplitude differences between the two bins for 36.3% of the epoch; the KS-gram for bins 1 and 2 indicates significant differences for 59.0% of the epoch; and ERP amplitudes in bins 2 and 3 are statistically significantly different for 57.5% of the epoch. This cannot be explained by chance: binning by response time must be capturing significant regularities in the ERPs evoked during the experiment. Furthermore, the major deviations between the bin averages observable in Figure 7 (top right) are all highly statistically significant. Additionally, bin-to-bin comparisons show that the early potentials, which one would not expect to be heavily modulated by condition and response time, are indeed mostly below the significance threshold.
The results we reported above were based on cumulating the trials of all subjects. However, the binning technique can also be applied on a subject-by-subject basis. The plots in rows 2 to 4 on the left of Figure 7 show KS-grams for three typical subjects (S2, S3 and S5). The corresponding plots on the right show their bin averages. The KS-grams and bin averages obtained in the case of S3 and S5 (as well as S4 and S6, not reported) are qualitatively very similar to those obtained across all subjects. That is, we see that ERP amplitude distributions are significantly different across bins for a large proportion of the epochs, despite the fact that single-subject bins contain only 1/6 of the total dataset. In the case of S2 (and S1, not reported), instead, we find that differences are much smaller and only occasionally statistically significant. Upon inspection of the error rates for these subjects, we found that they had adjusted to the low target frequency in the experiment (20%) and tended to respond `No' significantly more often than average. So, it is likely that they used a different strategy from the other subjects, which may explain the presence of the large positive ERP phase-locked with the stimulus presentation and immediately following the exogenous ERPs. These subjects were also characterised by the effective absence of variable-latency ERPs. This, in turn, led to the lack of significant differences between the bins highlighted by their KS-grams.

An important question concerns the effects that dividing a dataset into bins based on response time has on noise. In Section 1.6 we suggested that there is a trade-off between the desire to gain more precise information by averaging more homogeneous signals and the loss of precision due to the reduced noise rejection associated with smaller sets. Here we want to check this hypothesis.
To address this issue we measured the Signal-to-Noise Ratio (SNR) of the averages for different sets of trials using the technique developed by Schimmel (1967). That is, for each dataset and at each time step, we averaged the even-index and odd-index epochs in the dataset separately, obtaining the signals $\bar{x}_e(t)$ and $\bar{x}_o(t)$, respectively. The half-sum of these estimates the signal plus the residual noise of the overall mean, while their half-difference estimates the residual noise alone. We then estimated the SNR of the mean as follows:
$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t \big(\bar{x}_e(t) + \bar{x}_o(t)\big)^2 - \sum_t \big(\bar{x}_e(t) - \bar{x}_o(t)\big)^2}{\sum_t \big(\bar{x}_e(t) - \bar{x}_o(t)\big)^2}.$$
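A sketch of this even/odd (plus-minus) SNR estimate is below. The exact normalisation is an assumption consistent with the half-sum/half-difference reading of Schimmel's method; the function name and toy ERP are ours.

```python
import numpy as np

def schimmel_snr_db(epochs):
    """Estimate the SNR of the average of `epochs` (n_trials, n_samples)
    by averaging even- and odd-index trials separately: the half-sum
    estimates signal plus residual noise, the half-difference estimates
    the residual noise alone (after Schimmel, 1967)."""
    e = epochs[0::2].mean(axis=0)
    o = epochs[1::2].mean(axis=0)
    noise_power = np.mean(((e - o) / 2) ** 2)
    total_power = np.mean(((e + o) / 2) ** 2)
    signal_power = max(total_power - noise_power, np.finfo(float).tiny)
    return 10 * np.log10(signal_power / noise_power)

# toy check: a known ERP buried in heavy noise
rng = np.random.default_rng(6)
t = np.linspace(0, 1, 250)
erp = 5 * np.exp(-((t - 0.4) ** 2) / 0.005)      # a P300-like bump
trials = erp + rng.normal(0, 10, size=(400, 250))
snr_all = schimmel_snr_db(trials)
snr_sub = schimmel_snr_db(trials[:120])          # a 30% subset
```

For this simulation the theoretical SNR of the 400-trial mean is about 9.5dB, and taking a random 30% subset loses roughly 10·log10(0.3) ≈ 5.2dB, which the estimator reproduces.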
Let us consider these results in detail. Firstly, we should note that the SNR for the true negatives is substantially better than for the other classes. This is not surprising since this is by far our largest category, as we showed in Table 2. Also, it is not surprising to see a drop in SNR as we move from the `All' column to the `Bin' columns. However, what is really important to verify is whether the drop is any different from what we would expect if we randomly split a dataset into subsets.
Because the bins include 30% of the total sample, the theoretical SNR drop we would expect to see in the presence of random bins is $10 \log_{10} 0.3$, i.e., approximately 5.2dB, as we move from the first column to the second, third and fourth in Table 4. However, the average drop in SNR when going from the whole group to the bins varies from 1.8dB to 4.1dB, with an average of 3.1dB. The smallest SNR loss is associated with bin 1, where on average the SNR drops by only about 1.5dB. All this is possible because some of the variance present in the averages in the absence of binning is in fact due to variable-latency ERPs that effectively act as noise on the mean. With binning, instead, there is much less variability associated with variable-latency waves.
So, not only does the reduction in the biases of averaging brought about by response-time binning result in an improvement in resolution, but it is also responsible for the SNR of the mean remaining significantly higher than expected despite the large sample-size reduction due to binning.
                           All       Bin 1     Bin 2     Bin 3     Average SNR drop by class
True Positives             18.4 dB   16.8 dB   14.3 dB   11.8 dB   4.1 dB
True Negatives             22.5 dB   22.4 dB   19.7 dB   20.1 dB   1.8 dB
False Negatives            15.6 dB   12.8 dB   12.9 dB   12.5 dB   2.9 dB
False Positives            16.5 dB   15.0 dB   11.7 dB   11.5 dB   3.8 dB
Average SNR drop by bin              1.5 dB    3.6 dB    4.3 dB
In this section we will relate response-time binning to the work of Zhang (1998) and Hansen (1983) (see Section 1.4), thereby formally clarifying the reasons why this averaging technique increases the resolution with which ERPs can be recovered.
At first, let us make the same assumptions as in (Zhang, 1998; Hansen, 1983). Let us assume that there are two additive ERPs in the signals recorded in a forced-choice experiment, a stimulus-locked ERP, $s(t)$, and a response-locked ERP, $r(t)$, and that $\rho(t)$ is the response-time density function. Under the further assumption that response times do not affect the shape of these ERPs, the stimulus-locked average can be expressed as $\bar{x}_s(t) = s(t) + (r * \rho)(t)$, where $*$ is the convolution operation. A similar equation can be written for the response-locked average $\bar{x}_r(t)$.
Let us consider in what ways binning by response time would affect this result. Let us define a function $I[\cdot]$ that returns 1 if its argument is true, and 0 otherwise. Then $I[t_1 \le t < t_2]$ can be seen as a membership function for the trials belonging to a bin characterised by response times within the interval $[t_1, t_2)$. Thus, the product $\rho(t)\, I[t_1 \le t < t_2]$ represents the distribution of response times within the bin. This can be turned into a probability density function, $\rho_B(t)$, by dividing it by $P_B = \int_{t_1}^{t_2} \rho(\tau)\, d\tau$.
It is then clear that the stimulus-locked bin average, which we denote as $\bar{x}_s^B(t)$, is given by
$$\bar{x}_s^B(t) = s(t) + (r * \rho_B)(t).$$
Apart from a scaling factor, the key difference between $\rho_B$ and $\rho$ is that $\rho_B$ is the product of $\rho$ and a rectangular windowing function, $I[t_1 \le t < t_2]$. In the frequency domain, therefore, the spectrum of $\rho_B$, which we denote with $\hat{\rho}_B(f)$, is the convolution between the spectrum of $\rho$, denoted as $\hat{\rho}(f)$, and the spectrum of a translated rectangle, $\hat{w}(f)$. This is a scaled and rotated (in the complex plane) version of the sinc function (i.e., it behaves like $\sin(x)/x$). The function $\hat{w}$ has a large central lobe whose width is inversely proportional to the bin width, $t_2 - t_1$. Thus, when convolved with $\hat{\rho}$, $\hat{w}$ behaves as a smoothing kernel. Therefore, $\hat{\rho}_B$ is a smoothed and enlarged version of $\hat{\rho}$. In other words, while $\rho_B$ is still a low-pass filter, it has a higher cutoff frequency than $\rho$.
We illustrate this effect in Figure 8 for the class of true negatives. The figure shows the amplitude of the frequency response, $|\hat{\rho}(f)|$, of the response-time distribution as well as the frequency responses for bins 1, 2 and 3, $|\hat{\rho}_{B_1}(f)|$, $|\hat{\rho}_{B_2}(f)|$ and $|\hat{\rho}_{B_3}(f)|$. These were computed via the discrete Fourier transform of the distributions obtained from the raw data. To improve the accuracy of the result, we derived high-resolution representations of these distributions by using the Parzen window method (Parzen, 1962).
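A Parzen-window (Gaussian-kernel) density estimate of this kind can be sketched as follows. The bandwidth `h` and the gamma-distributed toy response times are our assumptions, not values from the text.

```python
import numpy as np

def parzen_density(samples, grid, h=0.05):
    """Parzen-window (Gaussian-kernel) estimate of a density on `grid`,
    usable to obtain a high-resolution version of an RT distribution
    before taking its Fourier transform.  Bandwidth h (seconds) is an
    assumed value."""
    k = np.exp(-((grid[None, :] - samples[:, None]) ** 2) / (2 * h * h))
    return k.sum(axis=0) / (samples.size * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(7)
rts = rng.gamma(4.0, 0.2, size=1000)     # skewed, RT-like sample
grid = np.arange(0, 4, 0.01)
rho_hat = parzen_density(rts, grid)
```

The estimate integrates to (approximately) one and peaks near the true mode of the sampled distribution (0.6s for this gamma).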
We should note how all the frequency responses shown in Figure 8 are characteristic of low-pass filters. The narrowest of all is the one associated with the full response-time distribution (i.e., when no binning is performed). This has a cutoff frequency of 0.42Hz. This implies that, without binning, a stimulus-locked average can reproduce without significant distortion only response-locked ERPs representing extremely slow potentials occurring over a period of perhaps 1 second or more.
The second lowest cutoff frequency, 0.79Hz, is associated with bin 3. It is not surprising that bin 3 still has a very low cutoff frequency, because this bin is the widest of the three, covering a large portion of the long upper tail of the response-time distribution. Despite this, however, binning has still effectively doubled the resolution of averaging for this bin.
For bins 1 and 2, which have cutoff frequencies of 2.38Hz and 2.84Hz, respectively, the representation of $r$ in the bin averages is even less deformed. For these bins, therefore, stimulus-locked averaging has improved the resolving power on response-locked ERPs by six or seven times. Thus, bin averages can reliably resolve features of response-locked ERPs down to durations of the order of 100–200ms.
The narrower the bin, the smaller the deformations. So, if a higher resolution is required, one just needs to use narrower bins and to acquire correspondingly more trials. In fact, for sufficiently small bins, the bin average is an unbiased estimator of the true ERP. To illustrate this, let us imagine that we pick a bin which is so narrow that we can consider $\rho$ constant on it. So, $\rho_B(t) \approx I[t_1 \le t < t_2]/(t_2 - t_1)$. Then, if we take the limit for the bin size $t_2 - t_1 \to 0$, we get that $\rho_B$ approaches more and more a Dirac delta function centred on the bin's response time, $t^*$. So, from the properties of the convolution operator we obtain:
$$\lim_{t_2 - t_1 \to 0} \bar{x}_s^B(t) = s(t) + r(t - t^*).$$
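The windowing argument of this section can be checked numerically: restricting an RT density to a bin multiplies it by a rectangle, and the resulting density has a wider spectrum (a higher cutoff) the narrower the bin. The gamma-shaped density, the bin edges and the crude half-power cutoff below are our illustrative assumptions.

```python
import numpy as np

dt = 0.002                                  # 2 ms grid
t = np.arange(0, 4, dt)
rho = t ** 3 * np.exp(-t / 0.2)             # a skewed, RT-like density
rho /= rho.sum() * dt

def cutoff_hz(density, thresh=0.5):
    """Frequency at which |spectrum| first falls below `thresh` of its
    DC value (a crude half-power cutoff)."""
    spec = np.abs(np.fft.rfft(density))
    spec /= spec[0]
    f = np.fft.rfftfreq(len(density), dt)
    return f[np.argmax(spec < thresh)]

def bin_density(rho, t, t1, t2):
    w = (t >= t1) & (t < t2)                # rectangular window I[t1<=t<t2]
    rb = rho * w
    return rb / (rb.sum() * dt)             # renormalise to a density

full = cutoff_hz(rho)                       # no binning
wide = cutoff_hz(bin_density(rho, t, 0.4, 1.2))
narrow = cutoff_hz(bin_density(rho, t, 0.6, 0.8))
```

The cutoffs come out ordered as the theory predicts: the unbinned density is the narrowest low-pass filter, and the narrower the bin, the higher the cutoff of the bin density.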
All of the properties discussed in this section also hold for binned response-locked averages.
Let us consider a more general case. Let us assume that there are three additive ERPs in the signal recorded in a forced-choice experiment: the $s(t)$ and $r(t)$ waves mentioned above, and a variable-latency ERP, $v(t)$. Let $T$ be a stochastic variable representing the response time in a trial; $\rho(t)$ is its density function. Similarly, let $L$ be a stochastic variable representing the latency of the ERP $v$ and let $\lambda(\ell)$ be the corresponding density function (or latency distribution). As above, let us further assume that response time and latency do not affect the shape of these ERPs. Under these assumptions we obtain the following equation for the stimulus-locked average $\bar{x}_s(t)$:
$$\bar{x}_s(t) = s(t) + (r * \rho)(t) + (v * \lambda)(t). \qquad (3)$$
Zhang (1998) considered a special version of this equation and showed that if the stimulus-locked and response-locked ERPs, $s$ and $r$, are absent and if the latency, $L$, and the lag, $T - L$, between the variable-latency ERP and the response are statistically independent, then some information about $v$ can be recovered from the knowledge of the averages and of $\rho$. In particular, one can find the amplitude of the Fourier transform of $v$. Because the phase information cannot be recovered, however, the reconstruction of $v$ is not possible (see Zhang, 1998, for more details).
An interesting question is whether different assumptions on the relation between $T$ and $L$ might in fact allow one to derive more information about $v$, while, perhaps, being more psychophysiologically tenable. Our objective in this section, however, is more modest. Starting from Equation (3) we want to see how binning affects the resolution with which $v$ is represented in a stimulus-locked average.
Let us start by considering the most general conditions possible. Let $T$ and $L$ be described by an unspecified joint density function $\gamma(t, \ell)$. So, the latency and response-time distributions are marginals of this joint distribution, i.e.,
$$\lambda(\ell) = \int \gamma(t, \ell)\, dt \qquad \text{and} \qquad \rho(t) = \int \gamma(t, \ell)\, d\ell.$$
Focusing our attention on the subset of the trials falling within the response-time bin $B = [t_1, t_2)$, i.e., such that $t_1 \le T < t_2$, changes the joint distribution of $T$ and $L$ into
$$\gamma_B(t, \ell) = \frac{\gamma(t, \ell)\, I[t_1 \le t < t_2]}{P_B},$$
where, as before, $P_B = \int_{t_1}^{t_2} \rho(\tau)\, d\tau$.
The marginal of this distribution with respect to $t$ gives us the latency distribution for the response-time bin $B$:
$$\lambda_B(\ell) = \int \gamma_B(t, \ell)\, dt = \frac{\lambda(\ell)\, w(\ell)}{P_B},$$
where $w(\ell) = \Pr[\, t_1 \le T < t_2 \mid L = \ell\,]$.
The key difference between $\lambda_B$ and $\lambda$, apart from a scaling factor, is that $\lambda_B$ is the product of $\lambda$ and a windowing function, $w$. In the frequency domain, therefore, the spectrum of $\lambda_B$, which we denote with $\hat{\lambda}_B(f)$, is the convolution between the spectrum of $\lambda$, denoted as $\hat{\lambda}(f)$, and the spectrum of the window, $\hat{w}(f)$. All we know about the function $w$ is that it can never be negative, being in fact a probability. Therefore, its spectrum must have a nonzero component at $f = 0$. However, this does not necessarily imply that $w$ is narrow in the time domain. So, in general we cannot say for sure whether $\hat{\lambda}_B$ is wider than $\hat{\lambda}$, which would imply that binning increases the resolution of $v$ in the average. As we will see below, however, under mild assumptions on the relationship between $T$ and $L$ this is actually the case. Let us consider two cases.
In Section 1.6 we put forward the following cognitive homogeneity assumption: if one considers those epochs where a participant was presented with qualitatively similar stimuli and gave the same response within approximately the same amount of time, it is reasonable to assume that similar internal processes will have taken place. Under this assumption, fixed- and variable-latency ERPs will appear much more synchronised than if one looked at an undivided dataset. The cognitive homogeneity assumption effectively implies that, within a stimulus/response class, when $T$ takes a particular value the value of $L$ is also approximately determined, and vice versa.
To ease our mathematical analysis, let us idealise this assumption by imagining that $L = g(T)$, where $g$ is some unknown deterministic function. Note that this assumption is the exact opposite of the independence assumption of Zhang (1998) since in our model $L$ is fully dependent on $T$ via the relation $L = g(T)$.
Because of the deterministic dependency between $T$ and $L$, we have that $w(\ell)$ is 1 if $\ell \in g([t_1, t_2))$ and 0 otherwise. So,
$$\lambda_B(\ell) = \frac{\lambda(\ell)\, I[\,\ell \in g([t_1, t_2))\,]}{P_B}.$$
Binning increases the resolution of response-locked ERPs because it considers a narrower range of response times. In the model studied in the previous section this benefit was also available for variable-latency ERPs, since discarding trials whose response time was outside a particular range corresponded to rejecting variable-latency ERPs whose latency was outside some interval. So, it is reasonable to expect that something like this would still happen even if $L$ wasn't a deterministic function of $T$, as long as there was a sufficiently strong correlation between $T$ and $L$. Indeed, in such a case, in a scatterplot of the $(t, \ell)$ pairs associated with all the trials in a dataset, we would find that the data cloud tends to align (to a degree that depends on how strong the correlation between $T$ and $L$ is) along a line, such as the straight line obtained via linear regression. Picking a subset of the trials corresponding to $t$ values within an interval $[t_1, t_2)$ is equivalent to taking a vertical slice of the cloud. An illustrative example is shown in Figure 9(a). As shown in the figure, if the correlation is strong, the data in the vertical slice of the plot are essentially also the data belonging to a horizontal slice corresponding to a latency interval. So, binning by response time should be expected to also produce a corresponding binning by latency. Stimulus-locked averaging of these data should then present an improved resolution not only for response-locked ERPs but also for variable-latency ERPs. Note that this would happen irrespective of whether the correlation between $T$ and $L$ is positive or negative. If the correlation is not very strong, however, we need to take an analytic approach to understand the effects of binning.
In the previous section we assumed that a deterministic functional relationship between response time and latency existed, $L = g(T)$, and that it was invertible. $L$ was fully dependent on $T$, but $g$ was otherwise totally arbitrary. Here, instead, we want to look at an orthogonal set of assumptions. We will assume that the joint distribution of $T$ and $L$ is arbitrary but that there exists a relationship between $T$ and $L$ of the form $L = m(T) + E$, where $m$ is an arbitrary function (model) and $E$ is a stochastic variable with density function $\epsilon(\cdot)$, which is statistically independent from $T$.
Under these assumptions we find that
$$\lambda_B(\ell) = \frac{1}{P_B} \int_{t_1}^{t_2} \rho(t)\, \epsilon\big(\ell - m(t)\big)\, dt.$$
In general, response-time binning improves the resolution of averaging if $\lambda_B$ is narrower in the time domain (and, correspondingly, wider in the frequency domain) than $\lambda$. This, in turn, happens if the windowing function $w$ is narrower than $\lambda$ in the time domain. Naturally, while this is a likely scenario, we cannot be absolutely certain that this will happen because we don't know the latency distribution $\lambda$. However, we know that the narrower the error distribution $\epsilon$ and the stronger the correlation between $T$ and $L$, the narrower $w$ relative to $\lambda$. So, we should expect that in the presence of strong enough correlations between $T$ and $L$, response-time binning will increase the resolution of averaging.
To be more precise, one needs to specialise the analysis to specific forms of $m$. So, let us consider a specific case where the regression function of $L$ on $T$ is linear, i.e., $m(t) = a + b\,t$ and, so, $L = a + b\,T + E$, where $a$ and $b$ are the regression coefficients and $E$ is a Gaussian stochastic variable with zero mean and variance $\sigma^2$. Under these assumptions we find that
$$\lambda_B(\ell) = \frac{1}{P_B} \int_{t_1}^{t_2} \rho(t)\, \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(\ell - a - b\,t)^2}{2\sigma^2}\right) dt.$$
To give an idea of the width and shape of this function in realistic situations, we consider the scatterplot of response times vs latencies of P300 for the `accurate condition' reported in (Kutas et al., 1977, page 794, Figure 2, left). We digitised the figure and redrew it in Figure 9(b). The regression line of $r$ on $\ell$ provided with the original figure shows a significant correlation between $r$ and $\ell$. Solving the regression equation for $\ell$, and estimating $\sigma_\varepsilon$ from the data in Figure 9(b) (as standard, via the RMS of the residuals), we then selected the response-time bin indicated by the vertical dashed lines in the figure. With these data in hand, we computed the windowing function $w$ corresponding to this bin. The function is almost a perfect Gaussian with a mean of 570 ms (indicated by the horizontal dashed line in the figure) and a standard deviation of 160 ms. The function is shown in Figure 9(b) rotated by 90 degrees so that its abscissas correspond to the latency of P300s. The figure also reports a P300 latency histogram (a discretisation of the estimated $f_\ell$, again rotated by 90 degrees). As one can easily see, a significant fraction of the latency histogram does not overlap with the windowing function. Thus, $w$ is narrower than $f_\ell$, resulting in the average over this bin having a significantly higher resolving power in relation to the P300 than the ordinary average.

The theory developed above makes no assumptions as to whether the trials being averaged relate to a single subject or multiple subjects. Although it is perhaps most naturally applicable in a single-subject setting, in this section we show that the theory is also valid for averages across subjects and grand averages. What changes in the different settings are the response-time and latency distributions that determine the degree of blurring characterising each averaging technique.
Let $f_{r,s}$ and $f_{\ell,s}$ be the response-time and latency distributions of the $s$th subject, respectively. Under the same assumptions as for Equation (3), a subject's stimulus-locked average is the convolution of the underlying ERP waveform with that subject's latency distribution $f_{\ell,s}$.
Because of natural between-subject variability, we should expect to find that the response-time distributions, $f_{r,s}$, and the latency distributions, $f_{\ell,s}$, will vary in shape and location across subjects. Therefore, the group-level response-time and latency distributions, being weighted sums of such distributions, will generally be wider than their single-subject counterparts. Thus, averages across subjects and grand averages will be affected by a ``between-subjects'' low-pass filtering effect in addition to the ``within-subject'' blurring characterising single-subject ERP averages.
Since Equations (5) and (6) have exactly the same form as Equation (3), and their convolution kernels are low-pass filtering ones like those in that equation, the theory developed in Section 4.2 is as applicable to grand averages and averages across subjects as it is to single-subject data. Applying response-time binning will, therefore, increase the resolving power of group averages too.
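The between-subjects widening can be illustrated with a small sketch (the subject means and spreads below are assumptions chosen for illustration): when each subject's latency distribution is Gaussian but their means differ, the group kernel, a weighted sum of the individual $f_{\ell,s}$, is wider than any single subject's kernel, and therefore blurs more.

```python
import numpy as np

subject_means = np.array([450.0, 500.0, 560.0, 610.0])   # ms, assumed
subject_std = 50.0                                       # within-subject spread, assumed

l = np.linspace(200.0, 900.0, 1401)                      # latency grid (ms)
f_s = np.exp(-0.5 * ((l[None, :] - subject_means[:, None]) / subject_std) ** 2)
f_s /= f_s.sum(axis=1, keepdims=True)                    # per-subject latency densities
f_group = f_s.mean(axis=0)                               # equal-weight group kernel

def std(p):
    # Standard deviation of a discretised density p over the grid l.
    m = (l * p).sum()
    return np.sqrt(((l - m) ** 2 * p).sum())

print(std(f_s[0]), std(f_group))   # group kernel is wider than any subject's
```

The group kernel's variance is (approximately) the within-subject variance plus the variance of the subject means, which is exactly the extra "between-subjects" low-pass filtering described above.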
Stimulus-locked, response-locked and ERP-locked averaging are the standard methods for reducing artifacts as well as precisely evaluating the shape, amplitude and latency of specific waves in ERP analysis. All have been exceptionally effective in building up our knowledge on how the brain reacts to stimuli and on the processes that may take place in different tasks.
However, they all suffer from what we could call a keyhole or a magnifying-glass effect. That is, while these techniques are able to increase the resolution of specific ERPs, they do so at the cost of putting everything else out of focus. To build a clearer picture of the ERPs evoked in the brain during a task, an experimenter needs to carefully analyse and qualitatively integrate the averages produced by multiple techniques. This is particularly difficult to do for variable-latency ERPs which are not locked with externally measurable synchronising events, such as the onset of a stimulus or the response of a participant, because they may effectively fall in the blind spot for both stimulus-locked and response-locked averaging.
Some variable-latency ERPs could be resolved by a suitable ERP-locked averaging process. However, even large and conspicuous waves such as the P300 are difficult to detect on a trial-by-trial basis. So, averaging based on the latency of waves identified by a detection algorithm may, in fact, lead to mixing the ERPs of interest with other, totally unrelated, elements, thereby biasing and distorting the result. In addition, ERP-locked averaging requires prior knowledge about the presence and shape of the target wave. In practice, this prevents the use of the method to reveal novel or unsuspected waves.
In this paper we have proposed an extremely simple technique, binning trials based on response times and then averaging, that can alleviate the problems mentioned above. The technique is based on a simple cognitive homogeneity assumption: that roughly the same cognitive processes and ERPs occur in trials where stimulus condition, participant response and response time are approximately the same. For this reason, in such trials, the distribution of latencies of all variable-latency ERPs (including those phase-locked with the response) should be narrower than if one considered an undivided dataset. As a result, averaging the trials in a response-time bin should provide a clearer picture of the patterns of brain activity taking place in the conditions associated with those trials.
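Because the technique is just "split by response time, then average", it can be sketched in a few lines. The data layout below (an `epochs` array of stimulus-locked trials and a vector of response times) is an assumption for illustration, not the paper's actual pipeline:

```python
import numpy as np

def binned_averages(epochs, rts, edges):
    """Average stimulus-locked epochs separately within each response-time bin.

    epochs: (trials, samples) array; rts: response time of each trial (ms);
    edges: bin boundaries in ms. Returns {(lo, hi): average waveform}.
    """
    averages = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (rts >= lo) & (rts < hi)
        if sel.any():
            averages[(lo, hi)] = epochs[sel].mean(axis=0)
    return averages

# Usage with synthetic trials (noise only, for shape-checking purposes)
rng = np.random.default_rng(1)
epochs = rng.normal(size=(200, 512))
rts = rng.uniform(400.0, 2000.0, size=200)
avgs = binned_averages(epochs, rts, edges=[400, 800, 1200, 1600, 2000])
print(sorted(avgs))
```

In practice one would choose bin edges so that each bin retains enough trials to keep the averaging-based noise reduction effective.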
We assessed the binning technique both empirically and theoretically. For the empirical validation we used an experiment in which the task is relatively difficult, requiring identifying and conjoining multiple features, and where response times varied from around 400 ms to over 2 seconds. We evaluated the results in a number of ways, including: a comparison between stimulus-locked and response-locked averages, which showed that these are essentially identical under response-time binning; an analysis of the statistical significance of inter-bin amplitude differences using Kolmogorov-Smirnov-grams; and an analysis of the signal-to-noise ratios with and without binning. From the theoretical point of view, we provided a comprehensive analysis of the resolution of single-subject averages, grand averages and averages across subjects, which showed that there are resolution benefits in applying response-time binning even when substantial variability in the latency of variable-latency ERPs remains after response-time binning. An improvement in resolving power can be expected whenever there is some correlation (positive or negative) between response time and the latency of an ERP.
This body of evidence suggests that averaging after response-time binning produces clearer representations of brain activity, revealing ERPs and helping in the evaluation of the amplitude and latency of ERP waves. Additionally, the method is extremely simple to use (even retrospectively) and requires no prior knowledge of the ERPs to be enhanced or revealed by the averaging process.
Naturally, the binning method also has limitations. For example, a variety of factors determine whether and how a subject perceives the stimuli and the strategy adopted to decide which answer to produce in response to them. These include: the stimuli presented in previous trials; whether the subject's attention or gaze shifted as a result of earlier stimuli; the subject inadvertently zoning out at the time of stimulus presentation for the present trial and the corresponding need to resort to guessing; etc. As a result, different processes may be taking place in a subject's brain even within trials characterised by the same stimuli, response and response time, thereby violating our cognitive homogeneity assumption. In these cases bin averages will represent a blend of the ERPs produced by such processes, as ordinary averages would.
Also, it is reasonable to assume that, as task complexity increases, the correlation between the latency of ERPs that are not phase-locked with the response (e.g., those associated with intermediate psychological operations) and the response itself will be reduced. Therefore, binning might be unable to reduce the temporal variance of such ERPs and, so, would provide no resolution improvement for them.
As to future research, in a sense we can think of response-time binning as a spot in the middle ground between single-trial analysis and ordinary averages. In the future we would like to explore this middle ground more thoroughly. For example, we would like to see whether binning using gradual membership functions can provide even better reconstruction fidelity (particularly in relation to the Gibbs phenomenon), whether setting bin sizes on the basis of the noise in the data may help make the best use of the available trials, whether response-locked and stimulus-locked averages can be used jointly (e.g., in the frequency domain) to further refine the reconstruction of ERPs, whether it is possible to integrate the information obtained from different bins into a unified representation of ERPs, whether the theory can be extended to cases where the cognitive homogeneity assumption is violated, etc.
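One way to picture the "gradual membership" idea is to replace the hard bin with a soft weighting of trials. The sketch below is our own illustrative assumption of what such a scheme could look like (a Gaussian membership function centred on a target response time), not a method from the paper:

```python
import numpy as np

def soft_bin_average(epochs, rts, centre, width):
    # Weight each trial by a Gaussian membership in response time,
    # then take the weighted average of the epochs.
    w = np.exp(-0.5 * ((rts - centre) / width) ** 2)
    return (w[:, None] * epochs).sum(axis=0) / w.sum()

rng = np.random.default_rng(2)
epochs = rng.normal(size=(100, 256))          # synthetic stimulus-locked trials
rts = rng.uniform(400.0, 2000.0, size=100)    # synthetic response times (ms)
avg = soft_bin_average(epochs, rts, centre=800.0, width=100.0)
print(avg.shape)
```

Compared with hard bin edges, smooth weights avoid abrupt inclusion/exclusion of borderline trials, which is why one might hope for fewer Gibbs-like artifacts in the reconstructed waveforms.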
A second line of future research relates to the use of averaging in Brain Computer Interfaces (BCIs) (Birbaumer et al., 1999; Wolpaw et al., 1991; Pfurtscheller et al., 1993; Wolpaw et al., 2000; Farwell and Donchin, 1988). Indeed, ERP averaging is also a key element in many BCIs, and it is precisely from trying to understand its effects in BCIs that this work originally emerged. Many BCI systems (e.g., Citi et al., 2008; Rakotomamonjy and Guigue, 2008; Bostanov, 2004) make decisions by repeatedly presenting all the stimuli in a set and averaging the corresponding outputs produced by a classifier. If all the steps in the calculation of a classifier's output are linear (and in many cases they are), averaging the outputs of the classifier is equivalent to computing the output produced by the classifier in the presence of an average ERP waveform. In other words, many BCI systems effectively rely on ERP averaging, so our analysis of the effects of averaging is directly applicable to them. We hope that the response-time binning technique will provide us with a deeper understanding of how users of BCI systems respond to stimuli and of which stimuli are best for BCI control.
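The linearity argument can be verified directly: for any linear score $f(x) = w \cdot x + b$, the mean of the per-presentation scores equals the score of the mean epoch. The weights and epochs below are random placeholders, not a real BCI classifier:

```python
import numpy as np

rng = np.random.default_rng(3)
epochs = rng.normal(size=(10, 64))   # 10 repeated presentations of one stimulus
w = rng.normal(size=64)              # hypothetical linear classifier weights
b = 0.1                              # hypothetical bias term

mean_of_scores = np.mean(epochs @ w + b)      # average the classifier outputs
score_of_mean = epochs.mean(axis=0) @ w + b   # classify the average ERP
print(np.isclose(mean_of_scores, score_of_mean))  # True
```

This is why the resolution analysis of averaged ERPs carries over unchanged to BCIs that average linear classifier outputs: the two computations are mathematically identical.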
The authors would like to thank the Associate Editor (Dr Dean Salisbury) and the anonymous reviewers for their extremely useful comments and suggestions for improving this manuscript. This work was supported by the Engineering and Physical Sciences Research Council under [grant ``Analogue Evolutionary Brain Computer Interfaces'', EP/F033818/1]; and the Experimental Psychological Society (UK) [grant ``Binding Across the Senses''].